Re: What could be the cause of an execution freeze on Hadoop for small datasets?

2023-03-11 Thread sam smith
" In this case your program may work because effectively you are not using
the spark in yarn on the hadoop cluster  " I am actually using Yarn as
mentioned (client mode)
I already know that, but it is not just about collectAsList, the execution
freezes also for example when using save() on the dataset (after the
transformations, before them it is ok to perform save() on the dataset).

I hope the question is clearer (for anybody who's reading) now.
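
For reference, the contrast Mich describes below (collecting everything into
the driver versus keeping the work distributed) looks roughly like this in
Java; this is a minimal sketch only, and the dataset and path names are
placeholders, not taken from the actual job:

// Pulls every row of the (already transformed) dataset into the driver JVM;
// with a constrained driver this is where a job can appear to hang or fail.
List<Row> allRows = transformedDataset.collectAsList();

// Keeps the work on the executors; nothing is funnelled through the driver.
transformedDataset.write().mode("overwrite").parquet("hdfs:///tmp/output");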

On Sat, 11 Mar 2023 at 20:15, Mich Talebzadeh  wrote:

> collectAsList brings all the data into the driver which is a single JVM
> on a single node. In this case your program may work because effectively
> you are not using the spark in yarn on the hadoop cluster. The benefit of
> Spark is that you can process a large amount of data using the memory and
> processors across multiple executors on multiple nodes.
>
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 11 Mar 2023 at 19:01, sam smith 
> wrote:
>
>> not sure what you mean by your question, but it is not helping in any case
>>
>>
>> On Sat, 11 Mar 2023 at 19:54, Mich Talebzadeh  wrote:
>>
>>>
>>>
>>> ... To note that if I execute collectAsList on the dataset at the
>>> beginning of the program
>>>
>>> What do you think  collectAsList does?
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 11 Mar 2023 at 18:29, sam smith 
>>> wrote:
>>>
>>>> Hello guys,
>>>>
>>>> I am launching a Spark program programmatically (client mode) to run on
>>>> Hadoop. If I call methods like show(), count() or collectAsList() on the
>>>> dataset (these show up in the Spark UI) after performing heavy
>>>> transformations on the columns, those calls cause the execution to freeze
>>>> on Hadoop, regardless of the dataset size (an intriguing issue for small
>>>> datasets!).
>>>> Any idea what could be causing this kind of issue?
>>>> Note that if I call collectAsList on the dataset at the beginning of the
>>>> program (before performing the transformations on the columns), the
>>>> method returns results correctly.
>>>>
>>>> Thanks.
>>>> Regards
>>>>
>>>>


Re: What could be the cause of an execution freeze on Hadoop for small datasets?

2023-03-11 Thread sam smith
I'm not sure what you mean by your question, but it's not helping in any case


On Sat, 11 Mar 2023 at 19:54, Mich Talebzadeh  wrote:

>
>
> ... To note that if I execute collectAsList on the dataset at the
> beginning of the program
>
> What do you think  collectAsList does?
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 11 Mar 2023 at 18:29, sam smith 
> wrote:
>
>> Hello guys,
>>
>> I am launching a Spark program programmatically (client mode) to run on
>> Hadoop. If I call methods like show(), count() or collectAsList() on the
>> dataset (these show up in the Spark UI) after performing heavy
>> transformations on the columns, those calls cause the execution to freeze
>> on Hadoop, regardless of the dataset size (an intriguing issue for small
>> datasets!).
>> Any idea what could be causing this kind of issue?
>> Note that if I call collectAsList on the dataset at the beginning of the
>> program (before performing the transformations on the columns), the method
>> returns results correctly.
>>
>> Thanks.
>> Regards
>>
>>


What could be the cause of an execution freeze on Hadoop for small datasets?

2023-03-11 Thread sam smith
Hello guys,

I am launching a Spark program programmatically (client mode) to run on
Hadoop. If I call methods like show(), count() or collectAsList() on the
dataset (these show up in the Spark UI) after performing heavy
transformations on the columns, those calls cause the execution to freeze on
Hadoop, regardless of the dataset size (an intriguing issue for small
datasets!).
Any idea what could be causing this kind of issue?
Note that if I call collectAsList on the dataset at the beginning of the
program (before performing the transformations on the columns), the method
returns results correctly.

Thanks.
Regards


How to allocate vcores to driver (client mode)

2023-03-10 Thread sam smith
Hi,

I am launching a Spark program programmatically (client mode) to run on
Hadoop. Whenever I check the Executors tab of the Spark UI, I always get 0
as the number of vcores for the driver. I tried to change that using
*spark.driver.cores*, and also *spark.yarn.am.cores*, in the SparkSession
configuration, but in vain. I also tried to set those parameters in
spark-defaults, but again with no success.
Note that the Environment tab does display the right configuration.

Could this be the reason why *collectAsList* freezes the execution (not
having enough CPU)?
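
One thing worth checking (a sketch rather than a confirmed diagnosis): in
client mode the driver JVM is the JVM that creates the SparkSession, so
driver-side resource settings are generally only honored when they are
supplied before that JVM starts, e.g. on spark-submit or in
spark-defaults.conf, rather than set in code afterwards:

// Supplied at launch time (client mode); the values are illustrative:
//   spark-submit --master yarn --deploy-mode client --driver-cores 2 --driver-memory 4g ...

// Settings applied programmatically only affect what can still be changed
// after the driver JVM is already running (executor-side options, for example):
SparkSession spark = SparkSession.builder()
    .appName("myApp")                      // hypothetical application name
    .config("spark.executor.cores", "2")
    .getOrCreate();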


How to share a dataset file across nodes

2023-03-09 Thread sam smith
Hello,

I use YARN client mode to submit my driver program to Hadoop. The dataset I
load comes from the local file system, and when I invoke load("file://path")
Spark complains that the CSV file cannot be found, which I totally
understand, since the dataset is not present on any of the workers or on the
ApplicationMaster, but only where the driver program resides.
I tried to share the file using the configurations:

> *spark.yarn.dist.files* OR *spark.files*

but neither is working.
My question is: how do I make the CSV dataset available on all the nodes at
the specified path?

Thanks.
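
A commonly used workaround for this situation (a sketch, not from the thread;
the HDFS path is hypothetical) is to put the file on storage that every node
can reach, such as HDFS, and load it from there:

// Run once outside Spark:  hdfs dfs -put /local/path/data.csv /user/me/data.csv
Dataset<Row> df = spark.read()
    .format("csv")
    .option("header", "true")
    .load("hdfs:///user/me/data.csv");   // resolvable from the driver and every executor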


Re: How to explode array columns of a dataframe having the same length

2023-02-16 Thread sam smith
@Enrico Minack  I used arrays_zip to merge values
into one row, and then used toJSON() to export the data.
@Bjørn explode_outer didn't yield the expected results.

Thanks anyway.
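
For readers of the archive, the arrays_zip route mentioned above could look
roughly like this in Java (a sketch; the column names come from the example
quoted below, and the select/rename step is an assumption, not sam's actual
code):

import static org.apache.spark.sql.functions.*;

// arrays_zip pairs up the i-th elements of the three array columns into one
// struct per position; explode then yields one row per position.
Dataset<Row> zipped = df.withColumn("zipped",
        explode(arrays_zip(col("col1"), col("col2"), col("col3"))));
Dataset<Row> exploded = zipped.select(
        col("zipped.col1").alias("col1"),
        col("zipped.col2").alias("col2"),
        col("zipped.col3").alias("col3"));
// toJSON() then gives one JSON string per exploded row.
List<String> json = exploded.toJSON().collectAsList();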

On Thu, 16 Feb 2023 at 09:06, Enrico Minack  wrote:

> You have to take each row and zip the lists, each element of the result
> becomes one new row.
>
> So write a method that turns
>   Row(List("A","B","null"), List("C","D","null"), List("E","null","null"))
> into
>   List(List("A","C","E"), List("B","D","null"), List("null","null","null"))
> and use flatMap with that method.
>
> In Scala, this would read:
>
> df.flatMap { row => (row.getSeq[String](0), row.getSeq[String](1),
> row.getSeq[String](2)).zipped.toIterable }.show()
>
> Enrico
>
>
> Am 14.02.23 um 22:54 schrieb sam smith:
>
> Hello guys,
>
> I have the following dataframe:
>
> *col1*             *col2*             *col3*
> ["A","B","null"]   ["C","D","null"]   ["E","null","null"]
>
> I want to explode it to the following dataframe:
>
> *col1*   *col2*   *col3*
> "A"      "C"      "E"
> "B"      "D"      "null"
> "null"   "null"   "null"
>
> How to do that (preferably in Java) using the explode() method? Note that
> something like the following won't yield the correct output:
>
> for (String colName: dataset.columns())
> dataset=dataset.withColumn(colName,explode(dataset.col(colName)));
>
>
>
>


How to explode array columns of a dataframe having the same length

2023-02-14 Thread sam smith
Hello guys,

I have the following dataframe:

*col1*             *col2*             *col3*
["A","B","null"]   ["C","D","null"]   ["E","null","null"]

I want to explode it to the following dataframe:

*col1*   *col2*   *col3*
"A"      "C"      "E"
"B"      "D"      "null"
"null"   "null"   "null"

How to do that (preferably in Java) using the explode() method? Note that
something like the following won't yield the correct output:

for (String colName: dataset.columns())
dataset=dataset.withColumn(colName,explode(dataset.col(colName)));


Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-13 Thread sam smith
Alright, this is the working Java version of it:

List<Column> listCols = new ArrayList<>();
> Arrays.asList(dataset.columns()).forEach(column -> {
> listCols.add(org.apache.spark.sql.functions.collect_set(column)); });
> Column[] arrCols = listCols.toArray(new Column[listCols.size()]);
> dataset = dataset.select(arrCols);


But when I then tried to explode the sets of values into rows using
explode(), the column values repeat themselves to fill the size of the
largest column.

How can I set the repeated values to null instead (thus keeping only one
exploded set of column values in each column)?

Thanks.

On Sun, 12 Feb 2023 at 22:43, Enrico Minack  wrote:

> @Sean: This aggregate function does work without an explicit groupBy():
>
> ./spark-3.3.1-bin-hadoop2/bin/spark-shell
> Spark context Web UI available at http://*:4040
> Spark context available as 'sc' (master = local[*], app id =
> local-1676237726079).
> Spark session available as 'spark'.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.3.1
>       /_/
>
> Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.17)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> val df = Seq((1, 10, "one"), (2, 20, "two"), (3, 20, "one"), (4,
> 10, "one")).toDF("a", "b", "c")
> scala> df.select(df.columns.map(column =>
> collect_set(col(column)).as(column)): _*).show()
> +------------+--------+----------+
> |           a|       b|         c|
> +------------+--------+----------+
> |[1, 2, 3, 4]|[20, 10]|[one, two]|
> +------------+--------+----------+
>
> @Sam: I haven't tested the Java code, sorry. I presume you can work it out
> from the working Scala code.
>
> Enrico
>
>
> Am 12.02.23 um 21:32 schrieb Sean Owen:
>
> It doesn't work because it's an aggregate function. You have to groupBy()
> (group by nothing) to make that work, but, you can't assign that as a
> column. Folks those approaches don't make sense semantically in SQL or
> Spark or anything.
> They just mean use threads to collect() distinct values for each col in
> parallel using threads in your program. You don't have to but you could.
> What else are we looking for here, the answer has been given a number of
> times I think.
>
>
> On Sun, Feb 12, 2023 at 2:28 PM sam smith 
> wrote:
>
>> OK, what do you mean by " do your outer for loop in parallel "?
>> btw this didn't work:
>> for (String columnName : df.columns()) {
>> df= df.withColumn(columnName,
>> collect_set(col(columnName)).as(columnName));
>> }
>>
>>
>> On Sun, 12 Feb 2023 at 20:36, Enrico Minack  wrote:
>>
>>> That is unfortunate, but 3.4.0 is around the corner, really!
>>>
>>> Well, then based on your code, I'd suggest two improvements:
>>> - cache your dataframe after reading, this way, you don't read the
>>> entire file for each column
>>> - do your outer for loop in parallel, then you have N parallel Spark
>>> jobs (only helps if your Spark cluster is not fully occupied by a single
>>> column)
>>>
>>> Your withColumn-approach does not work because withColumn expects a
>>> column as the second argument, but df.select(columnName).distinct() is a
>>> DataFrame and .col is a column in *that* DataFrame, it is not a column
>>> of the dataframe that you call withColumn on.
>>>
>>> It should read:
>>>
>>> Scala:
>>> df.select(df.columns.map(column => collect_set(col(column)).as(column)):
>>> _*).show()
>>>
>>> Java:
>>> for (String columnName : df.columns()) {
>>> df= df.withColumn(columnName,
>>> collect_set(col(columnName)).as(columnName));
>>> }
>>>
>>> Then you have a single DataFrame that computes all columns in a single
>>> Spark job.
>>>
>>> But this reads all distinct values into a single partition, which has
>>> the same downside as collect, so this is as bad as using collect.
>>>
>>> Cheers,
>>> Enrico
>>>
>>>
>>> Am 12.02.23 um 18:05 schrieb sam smith:
>>>
>>> @Enrico Minack  Thanks for "unpivot" but I am
>>> using version 3.3.0 (you are taking it way too far as usual :) )
>>> @Sean Owen  Pls then show me how it can be improved
>>> by code.
>>>
>>> Also, why such an approach (using withColumn() ) doesn't work:

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread sam smith
OK, what do you mean by "do your outer for loop in parallel"?
By the way, this didn't work:

for (String columnName : df.columns()) {
    df = df.withColumn(columnName,
        collect_set(col(columnName)).as(columnName));
}
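
For anyone reading the archive, "do your outer for loop in parallel" (as
suggested in the reply quoted below) could look roughly like this in Java; a
sketch only, with a hypothetical thread-pool size, not code from the thread:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

static List<List<Row>> distinctPerColumnInParallel(Dataset<Row> df) throws Exception {
    df.cache();                                  // read the input only once
    ExecutorService pool = Executors.newFixedThreadPool(4);
    List<Future<List<Row>>> futures = new ArrayList<>();
    for (String c : df.columns()) {
        // each submit() launches an independent Spark job on the shared session
        futures.add(pool.submit(() -> df.select(c).distinct().collectAsList()));
    }
    List<List<Row>> results = new ArrayList<>();
    for (Future<List<Row>> f : futures) {
        results.add(f.get());
    }
    pool.shutdown();
    return results;
}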


On Sun, 12 Feb 2023 at 20:36, Enrico Minack  wrote:

> That is unfortunate, but 3.4.0 is around the corner, really!
>
> Well, then based on your code, I'd suggest two improvements:
> - cache your dataframe after reading, this way, you don't read the entire
> file for each column
> - do your outer for loop in parallel, then you have N parallel Spark jobs
> (only helps if your Spark cluster is not fully occupied by a single column)
>
> Your withColumn-approach does not work because withColumn expects a column
> as the second argument, but df.select(columnName).distinct() is a DataFrame
> and .col is a column in *that* DataFrame, it is not a column of the
> dataframe that you call withColumn on.
>
> It should read:
>
> Scala:
> df.select(df.columns.map(column => collect_set(col(column)).as(column)):
> _*).show()
>
> Java:
> for (String columnName : df.columns()) {
> df= df.withColumn(columnName,
> collect_set(col(columnName)).as(columnName));
> }
>
> Then you have a single DataFrame that computes all columns in a single
> Spark job.
>
> But this reads all distinct values into a single partition, which has the
> same downside as collect, so this is as bad as using collect.
>
> Cheers,
> Enrico
>
>
> Am 12.02.23 um 18:05 schrieb sam smith:
>
> @Enrico Minack  Thanks for "unpivot" but I am using
> version 3.3.0 (you are taking it way too far as usual :) )
> @Sean Owen  Pls then show me how it can be improved by
> code.
>
> Also, why such an approach (using withColumn() ) doesn't work:
>
> for (String columnName : df.columns()) {
> df= df.withColumn(columnName,
> df.select(columnName).distinct().col(columnName));
> }
>
> On Sat, 11 Feb 2023 at 13:11, Enrico Minack  wrote:
>
>> You could do the entire thing in DataFrame world and write the result to
>> disk. All you need is unpivot (to be released in Spark 3.4.0, soon).
>>
>> Note this is Scala but should be straightforward to translate into Java:
>>
>> import org.apache.spark.sql.functions.collect_set
>>
>> val df = Seq((1, 10, 123), (2, 20, 124), (3, 20, 123), (4, 10,
>> 123)).toDF("a", "b", "c")
>>
>> df.unpivot(Array.empty, "column", "value")
>>   .groupBy("column")
>>   .agg(collect_set("value").as("distinct_values"))
>>
>> The unpivot operation turns
>> +---+---+---+
>> |  a|  b|  c|
>> +---+---+---+
>> |  1| 10|123|
>> |  2| 20|124|
>> |  3| 20|123|
>> |  4| 10|123|
>> +---+---+---+
>>
>> into
>>
>> +------+-----+
>> |column|value|
>> +------+-----+
>> |     a|    1|
>> |     b|   10|
>> |     c|  123|
>> |     a|    2|
>> |     b|   20|
>> |     c|  124|
>> |     a|    3|
>> |     b|   20|
>> |     c|  123|
>> |     a|    4|
>> |     b|   10|
>> |     c|  123|
>> +------+-----+
>>
>> The groupBy("column").agg(collect_set("value").as("distinct_values"))
>> collects distinct values per column:
>> +------+---------------+
>> |column|distinct_values|
>> +------+---------------+
>> |     c|     [123, 124]|
>> |     b|       [20, 10]|
>> |     a|   [1, 2, 3, 4]|
>> +------+---------------+
>>
>> Note that unpivot only works if all columns have a "common" type. Then
>> all columns are cast to that common type. If you have incompatible types
>> like Integer and String, you would have to cast them all to String first:
>>
>> import org.apache.spark.sql.types.StringType
>>
>> df.select(df.columns.map(col(_).cast(StringType)): _*).unpivot(...)
>>
>> If you want to preserve the type of the values and have multiple value
>> types, you cannot put everything into a DataFrame with one
>> distinct_values column. You could still have multiple DataFrames, one
>> per data type, and write those, or collect the DataFrame's values into Maps:
>>
>> import scala.collection.immutable
>>
>> import org.apache.spark.sql.DataFrame
>> import org.apache.spark.sql.functions.collect_set
>>
>> // if all you columns have the same type
>> def distinctValuesPerColumnOneType(df: DataFrame): immutable.Map[String,
>> immutable.Seq[Any]] = {
>>   df.unpivot(Array.empty, "column", "value&quo

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread sam smith
@Sean Correct. But I was hoping to improve my solution even more.

On Sun, 12 Feb 2023 at 18:03, Sean Owen  wrote:

> That's the answer, except, you can never select a result set into a column
> right? you just collect() each of those results. Or, what do you want? I'm
> not clear.
>
> On Sun, Feb 12, 2023 at 10:59 AM sam smith 
> wrote:
>
>> @Enrico Minack  Thanks for "unpivot" but I am
>> using version 3.3.0 (you are taking it way too far as usual :) )
>> @Sean Owen  Pls then show me how it can be improved by
>> code.
>>
>> Also, why such an approach (using withColumn() ) doesn't work:
>>
>> for (String columnName : df.columns()) {
>> df= df.withColumn(columnName,
>> df.select(columnName).distinct().col(columnName));
>> }
>>
>> On Sat, 11 Feb 2023 at 13:11, Enrico Minack  wrote:
>>
>>> You could do the entire thing in DataFrame world and write the result to
>>> disk. All you need is unpivot (to be released in Spark 3.4.0, soon).
>>>
>>> Note this is Scala but should be straightforward to translate into Java:
>>>
>>> import org.apache.spark.sql.functions.collect_set
>>>
>>> val df = Seq((1, 10, 123), (2, 20, 124), (3, 20, 123), (4, 10,
>>> 123)).toDF("a", "b", "c")
>>>
>>> df.unpivot(Array.empty, "column", "value")
>>>   .groupBy("column")
>>>   .agg(collect_set("value").as("distinct_values"))
>>>
>>> The unpivot operation turns
>>> +---+---+---+
>>> |  a|  b|  c|
>>> +---+---+---+
>>> |  1| 10|123|
>>> |  2| 20|124|
>>> |  3| 20|123|
>>> |  4| 10|123|
>>> +---+---+---+
>>>
>>> into
>>>
>>> +------+-----+
>>> |column|value|
>>> +------+-----+
>>> |     a|    1|
>>> |     b|   10|
>>> |     c|  123|
>>> |     a|    2|
>>> |     b|   20|
>>> |     c|  124|
>>> |     a|    3|
>>> |     b|   20|
>>> |     c|  123|
>>> |     a|    4|
>>> |     b|   10|
>>> |     c|  123|
>>> +------+-----+
>>>
>>> The groupBy("column").agg(collect_set("value").as("distinct_values"))
>>> collects distinct values per column:
>>> +------+---------------+
>>> |column|distinct_values|
>>> +------+---------------+
>>> |     c|     [123, 124]|
>>> |     b|       [20, 10]|
>>> |     a|   [1, 2, 3, 4]|
>>> +------+---------------+
>>>
>>> Note that unpivot only works if all columns have a "common" type. Then
>>> all columns are cast to that common type. If you have incompatible types
>>> like Integer and String, you would have to cast them all to String first:
>>>
>>> import org.apache.spark.sql.types.StringType
>>>
>>> df.select(df.columns.map(col(_).cast(StringType)): _*).unpivot(...)
>>>
>>> If you want to preserve the type of the values and have multiple value
>>> types, you cannot put everything into a DataFrame with one
>>> distinct_values column. You could still have multiple DataFrames, one
>>> per data type, and write those, or collect the DataFrame's values into Maps:
>>>
>>> import scala.collection.immutable
>>>
>>> import org.apache.spark.sql.DataFrame
>>> import org.apache.spark.sql.functions.collect_set
>>>
>>> // if all you columns have the same type
>>> def distinctValuesPerColumnOneType(df: DataFrame): immutable.Map[String,
>>> immutable.Seq[Any]] = {
>>>   df.unpivot(Array.empty, "column", "value")
>>> .groupBy("column")
>>> .agg(collect_set("value").as("distinct_values"))
>>> .collect()
>>> .map(row => row.getString(0) -> row.getSeq[Any](1).toList)
>>> .toMap
>>> }
>>>
>>>
>>> // if your columns have different types
>>> def distinctValuesPerColumn(df: DataFrame): immutable.Map[String,
>>> immutable.Seq[Any]] = {
>>>   df.schema.fields
>>> .groupBy(_.dataType)
>>> .mapValues(_.map(_.name))
>>> .par
>>> .map { case (dataType, columns) => df.select(columns.map(col): _*) }
>>> .map(distinctValuesPerColumnOneType)
>>> .flatten
>>> .toList
>>> .toMap
>>> }
>>>

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread sam smith
@Enrico Minack  Thanks for "unpivot", but I am using
version 3.3.0 (you are taking it way too far, as usual :) )
@Sean Owen  Please then show me in code how it can be improved.

Also, why doesn't such an approach (using withColumn()) work:

for (String columnName : df.columns()) {
    df = df.withColumn(columnName,
        df.select(columnName).distinct().col(columnName));
}

On Sat, 11 Feb 2023 at 13:11, Enrico Minack  wrote:

> You could do the entire thing in DataFrame world and write the result to
> disk. All you need is unpivot (to be released in Spark 3.4.0, soon).
>
> Note this is Scala but should be straightforward to translate into Java:
>
> import org.apache.spark.sql.functions.collect_set
>
> val df = Seq((1, 10, 123), (2, 20, 124), (3, 20, 123), (4, 10,
> 123)).toDF("a", "b", "c")
>
> df.unpivot(Array.empty, "column", "value")
>   .groupBy("column")
>   .agg(collect_set("value").as("distinct_values"))
>
> The unpivot operation turns
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1| 10|123|
> |  2| 20|124|
> |  3| 20|123|
> |  4| 10|123|
> +---+---+---+
>
> into
>
> +------+-----+
> |column|value|
> +------+-----+
> |     a|    1|
> |     b|   10|
> |     c|  123|
> |     a|    2|
> |     b|   20|
> |     c|  124|
> |     a|    3|
> |     b|   20|
> |     c|  123|
> |     a|    4|
> |     b|   10|
> |     c|  123|
> +------+-----+
>
> The groupBy("column").agg(collect_set("value").as("distinct_values"))
> collects distinct values per column:
> +------+---------------+
> |column|distinct_values|
> +------+---------------+
> |     c|     [123, 124]|
> |     b|       [20, 10]|
> |     a|   [1, 2, 3, 4]|
> +------+---------------+
>
> Note that unpivot only works if all columns have a "common" type. Then all
> columns are cast to that common type. If you have incompatible types like
> Integer and String, you would have to cast them all to String first:
>
> import org.apache.spark.sql.types.StringType
>
> df.select(df.columns.map(col(_).cast(StringType)): _*).unpivot(...)
>
> If you want to preserve the type of the values and have multiple value
> types, you cannot put everything into a DataFrame with one distinct_values
> column. You could still have multiple DataFrames, one per data type, and
> write those, or collect the DataFrame's values into Maps:
>
> import scala.collection.immutable
>
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.collect_set
>
> // if all you columns have the same type
> def distinctValuesPerColumnOneType(df: DataFrame): immutable.Map[String,
> immutable.Seq[Any]] = {
>   df.unpivot(Array.empty, "column", "value")
> .groupBy("column")
> .agg(collect_set("value").as("distinct_values"))
> .collect()
> .map(row => row.getString(0) -> row.getSeq[Any](1).toList)
> .toMap
> }
>
>
> // if your columns have different types
> def distinctValuesPerColumn(df: DataFrame): immutable.Map[String,
> immutable.Seq[Any]] = {
>   df.schema.fields
> .groupBy(_.dataType)
> .mapValues(_.map(_.name))
> .par
> .map { case (dataType, columns) => df.select(columns.map(col): _*) }
> .map(distinctValuesPerColumnOneType)
> .flatten
> .toList
> .toMap
> }
>
> val df = Seq((1, 10, "one"), (2, 20, "two"), (3, 20, "one"), (4, 10,
> "one")).toDF("a", "b", "c")
> distinctValuesPerColumn(df)
>
> The result is: (list values are of original type)
> Map(b -> List(20, 10), a -> List(1, 2, 3, 4), c -> List(one, two))
>
> Hope this helps,
> Enrico
>
>
> Am 10.02.23 um 22:56 schrieb sam smith:
>
> Hi Apotolos,
> Can you suggest a better approach while keeping values within a dataframe?
>
> On Fri, 10 Feb 2023 at 22:47, Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote:
>
>> Dear Sam,
>>
>> you are assuming that the data fits in the memory of your local machine.
>> You are using as a basis a dataframe, which potentially can be very large,
>> and then you are storing the data in local lists. Keep in mind that
>> the number of distinct elements in a column may be very large (depending on
>> the app). I suggest to work on a solution that assumes that the number of
>> distinct values is also large. Thus, you should keep your data in
>> dataframes or RDDs, and store them as csv files, parquet, etc.
>>
>> a.p.
>>
>>

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
I am not sure I fully understand "Just need to do the cols one at a time".
Also, I think Apostolos is right: this needs a dataframe approach, not a
list approach.

On Fri, 10 Feb 2023 at 22:47, Sean Owen  wrote:

> For each column, select only that col and get distinct values. Similar to
> what you do here. Just need to do the cols one at a time. Your current code
> doesn't do what you want.
>
> On Fri, Feb 10, 2023, 3:46 PM sam smith 
> wrote:
>
>> Hi Sean,
>>
>> "You need to select the distinct values of each col one at a time", how ?
>>
>> On Fri, 10 Feb 2023 at 22:40, Sean Owen  wrote:
>>
>>> That gives you all distinct tuples of those col values. You need to
>>> select the distinct values of each col one at a time. Sure just collect()
>>> the result as you do here.
>>>
>>> On Fri, Feb 10, 2023, 3:34 PM sam smith 
>>> wrote:
>>>
>>>> I want to get the distinct values of each column in a List (is it good
>>>> practice to use a List here?) that contains the column name as its first
>>>> element and its distinct values as the remaining elements, so that for a
>>>> dataset we get a list of lists. I do it this way (in my opinion not so
>>>> fast):
>>>>
>>>> List<List<String>> finalList = new ArrayList<List<String>>();
>>>> Dataset<Row> df = spark.read().format("csv").option("header",
>>>> "true").load("/pathToCSV");
>>>> String[] columnNames = df.columns();
>>>> for (int i = 0; i < columnNames.length; i++) {
>>>>     List<String> columnList = new ArrayList<String>();
>>>>     columnList.add(columnNames[i]);
>>>>     List<Row> columnValues =
>>>> df.filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull()).select(columnNames[i]).distinct().collectAsList();
>>>>     for (int j = 0; j < columnValues.size(); j++)
>>>>         columnList.add(columnValues.get(j).apply(0).toString());
>>>>     finalList.add(columnList);
>>>> }
>>>>
>>>>
>>>> How to improve this?
>>>>
>>>> Also, can I get the results in JSON format?
>>>>
>>>


Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
Hi Apotolos,
Can you suggest a better approach while keeping values within a dataframe?

On Fri, 10 Feb 2023 at 22:47, Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote:

> Dear Sam,
>
> you are assuming that the data fits in the memory of your local machine.
> You are using as a basis a dataframe, which potentially can be very large,
> and then you are storing the data in local lists. Keep in mind that
> the number of distinct elements in a column may be very large (depending on
> the app). I suggest to work on a solution that assumes that the number of
> distinct values is also large. Thus, you should keep your data in
> dataframes or RDDs, and store them as csv files, parquet, etc.
>
> a.p.
>
>
> On 10/2/23 23:40, sam smith wrote:
>
> I want to get the distinct values of each column in a List (is it good
> practice to use a List here?) that contains the column name as its first
> element and its distinct values as the remaining elements, so that for a
> dataset we get a list of lists. I do it this way (in my opinion not so
> fast):
>
> List<List<String>> finalList = new ArrayList<List<String>>();
> Dataset<Row> df = spark.read().format("csv").option("header",
> "true").load("/pathToCSV");
> String[] columnNames = df.columns();
> for (int i = 0; i < columnNames.length; i++) {
>     List<String> columnList = new ArrayList<String>();
>     columnList.add(columnNames[i]);
>     List<Row> columnValues =
> df.filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull()).select(columnNames[i]).distinct().collectAsList();
>     for (int j = 0; j < columnValues.size(); j++)
>         columnList.add(columnValues.get(j).apply(0).toString());
>     finalList.add(columnList);
> }
>
>
> How to improve this?
>
> Also, can I get the results in JSON format?
>
> --
> Apostolos N. Papadopoulos, Associate Professor
> Department of Informatics
> Aristotle University of Thessaloniki
> Thessaloniki, GREECE
> tel: ++0030312310991918
> email: papad...@csd.auth.gr
> twitter: @papadopoulos_ap
> web: http://datalab.csd.auth.gr/~apostol
>
>


Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
Hi Sean,

"You need to select the distinct values of each col one at a time", how ?

On Fri, 10 Feb 2023 at 22:40, Sean Owen  wrote:

> That gives you all distinct tuples of those col values. You need to select
> the distinct values of each col one at a time. Sure just collect() the
> result as you do here.
>
> On Fri, Feb 10, 2023, 3:34 PM sam smith 
> wrote:
>
>> I want to get the distinct values of each column in a List (is it good
>> practice to use a List here?) that contains the column name as its first
>> element and its distinct values as the remaining elements, so that for a
>> dataset we get a list of lists. I do it this way (in my opinion not so
>> fast):
>>
>> List<List<String>> finalList = new ArrayList<List<String>>();
>> Dataset<Row> df = spark.read().format("csv").option("header",
>> "true").load("/pathToCSV");
>> String[] columnNames = df.columns();
>> for (int i = 0; i < columnNames.length; i++) {
>>     List<String> columnList = new ArrayList<String>();
>>     columnList.add(columnNames[i]);
>>     List<Row> columnValues =
>> df.filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull()).select(columnNames[i]).distinct().collectAsList();
>>     for (int j = 0; j < columnValues.size(); j++)
>>         columnList.add(columnValues.get(j).apply(0).toString());
>>     finalList.add(columnList);
>> }
>>
>>
>> How to improve this?
>>
>> Also, can I get the results in JSON format?
>>
>


How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
I want to get the distinct values of each column in a List (is it good
practice to use a List here?) that contains the column name as its first
element and its distinct values as the remaining elements, so that for a
dataset we get a list of lists. I do it this way (in my opinion not so
fast):

List<List<String>> finalList = new ArrayList<List<String>>();
Dataset<Row> df = spark.read().format("csv").option("header",
"true").load("/pathToCSV");
String[] columnNames = df.columns();
for (int i = 0; i < columnNames.length; i++) {
    List<String> columnList = new ArrayList<String>();
    columnList.add(columnNames[i]);
    List<Row> columnValues =
df.filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull()).select(columnNames[i]).distinct().collectAsList();
    for (int j = 0; j < columnValues.size(); j++)
        columnList.add(columnValues.get(j).apply(0).toString());
    finalList.add(columnList);
}

How to improve this?

Also, can I get the results in JSON format?
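
On the JSON part of the question, one possible route (a sketch, not an answer
given in the thread; the output path is hypothetical) is to aggregate every
column with collect_set in a single select and write the one-row result as
JSON:

import static org.apache.spark.sql.functions.*;
import java.util.Arrays;
import org.apache.spark.sql.Column;

// One collect_set per column, computed in a single Spark job; the single
// resulting row is then written out as JSON.
Column[] aggCols = Arrays.stream(df.columns())
        .map(c -> collect_set(col(c)).alias(c))
        .toArray(Column[]::new);
df.select(aggCols).write().mode("overwrite").json("/tmp/distinct-per-column");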

Can we upload a csv dataset into Hive using SparkSQL?

2022-12-10 Thread sam smith
Hello,

I want to create a table in Hive and then load a CSV file content into it
all by means of Spark SQL.
I saw the example with the .txt file in the docs, but can we instead do
something like the following to accomplish what I want?

String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate();
spark.sql("CREATE TABLE IF NOT EXISTS csvFile USING hive");
spark.sql("LOAD DATA LOCAL INPATH 'C:/Users/Me/Documents/examples/src/main/resources/data.csv' INTO TABLE csvFile");


Re: Aggregate over a column: the proper way to do

2022-04-10 Thread sam smith
Exactly: one row, and two columns.

On Sat, 9 Apr 2022 at 17:44, Sean Owen  wrote:

> But it only has one row, right?
>
> On Sat, Apr 9, 2022, 10:06 AM sam smith 
> wrote:
>
>> Yes. Returns the number of rows in the Dataset as *long*. but in my case
>> the aggregation returns a table of two columns.
>>
>> On Fri, 8 Apr 2022 at 14:12, Sean Owen  wrote:
>>
>>> Dataset.count() returns one value directly?
>>>
>>> On Thu, Apr 7, 2022 at 11:25 PM sam smith 
>>> wrote:
>>>
>>>> My bad, yes of course that! still i don't like the ..
>>>> select("count(myCol)") .. part in my line is there any replacement to that 
>>>> ?
>>>>
>>>> On Fri, 8 Apr 2022 at 06:13, Sean Owen  wrote:
>>>>
>>>>> Just do an average then? Most of my point is that filtering to one
>>>>> group and then grouping is pointless.
>>>>>
>>>>> On Thu, Apr 7, 2022, 11:10 PM sam smith 
>>>>> wrote:
>>>>>
>>>>>> What if i do avg instead of count?
>>>>>>
>>>>>> On Fri, 8 Apr 2022 at 05:32, Sean Owen  wrote:
>>>>>>
>>>>>>> Wait, why groupBy at all? After the filter only rows with myCol
>>>>>>> equal to your target are left. There is only one group. Don't group just
>>>>>>> count after the filter?
>>>>>>>
>>>>>>> On Thu, Apr 7, 2022, 10:27 PM sam smith 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I want to aggregate a column by counting the number of rows having
>>>>>>>> the value "myTargetValue" and return the result
>>>>>>>> I am doing it like the following:in JAVA
>>>>>>>>
>>>>>>>>> long result =
>>>>>>>>> dataset.filter(dataset.col("myCol").equalTo("myTargetVal")).groupBy(col("myCol")).agg(count(dataset.col("myCol"))).select("count(myCol)").first().getLong(0);
>>>>>>>>
>>>>>>>>
>>>>>>>> Is that the right way? if no, what if a more optimized way to do
>>>>>>>> that (always in JAVA)?
>>>>>>>> Thanks for the help.
>>>>>>>>
>>>>>>>


Re: Aggregate over a column: the proper way to do

2022-04-09 Thread sam smith
Yes, it returns the number of rows in the Dataset as a *long*. But in my
case the aggregation returns a table of two columns.

On Fri, 8 Apr 2022 at 14:12, Sean Owen  wrote:

> Dataset.count() returns one value directly?
>
> On Thu, Apr 7, 2022 at 11:25 PM sam smith 
> wrote:
>
>> My bad, yes of course that! still i don't like the ..
>> select("count(myCol)") .. part in my line is there any replacement to that ?
>>
>> On Fri, 8 Apr 2022 at 06:13, Sean Owen  wrote:
>>
>>> Just do an average then? Most of my point is that filtering to one group
>>> and then grouping is pointless.
>>>
>>> On Thu, Apr 7, 2022, 11:10 PM sam smith 
>>> wrote:
>>>
>>>> What if i do avg instead of count?
>>>>
>>>> On Fri, 8 Apr 2022 at 05:32, Sean Owen  wrote:
>>>>
>>>>> Wait, why groupBy at all? After the filter only rows with myCol equal
>>>>> to your target are left. There is only one group. Don't group just count
>>>>> after the filter?
>>>>>
>>>>> On Thu, Apr 7, 2022, 10:27 PM sam smith 
>>>>> wrote:
>>>>>
>>>>>> I want to aggregate a column by counting the number of rows having
>>>>>> the value "myTargetValue" and return the result
>>>>>> I am doing it like the following:in JAVA
>>>>>>
>>>>>>> long result =
>>>>>>> dataset.filter(dataset.col("myCol").equalTo("myTargetVal")).groupBy(col("myCol")).agg(count(dataset.col("myCol"))).select("count(myCol)").first().getLong(0);
>>>>>>
>>>>>>
>>>>>> Is that the right way? if no, what if a more optimized way to do that
>>>>>> (always in JAVA)?
>>>>>> Thanks for the help.
>>>>>>
>>>>>


Re: Aggregate over a column: the proper way to do

2022-04-07 Thread sam smith
My bad, yes, of course! Still, I don't like the .. select("count(myCol)") ..
part of my line; is there any replacement for it?

On Fri, 8 Apr 2022 at 06:13, Sean Owen  wrote:

> Just do an average then? Most of my point is that filtering to one group
> and then grouping is pointless.
>
> On Thu, Apr 7, 2022, 11:10 PM sam smith 
> wrote:
>
>> What if i do avg instead of count?
>>
>> On Fri, 8 Apr 2022 at 05:32, Sean Owen  wrote:
>>
>>> Wait, why groupBy at all? After the filter only rows with myCol equal to
>>> your target are left. There is only one group. Don't group just count after
>>> the filter?
>>>
>>> On Thu, Apr 7, 2022, 10:27 PM sam smith 
>>> wrote:
>>>
>>>> I want to aggregate a column by counting the number of rows having the
>>>> value "myTargetValue" and return the result
>>>> I am doing it like the following:in JAVA
>>>>
>>>>> long result =
>>>>> dataset.filter(dataset.col("myCol").equalTo("myTargetVal")).groupBy(col("myCol")).agg(count(dataset.col("myCol"))).select("count(myCol)").first().getLong(0);
>>>>
>>>>
>>>> Is that the right way? if no, what if a more optimized way to do that
>>>> (always in JAVA)?
>>>> Thanks for the help.
>>>>
>>>


Re: Aggregate over a column: the proper way to do

2022-04-07 Thread sam smith
What if I do avg instead of count?

On Fri, 8 Apr 2022 at 05:32, Sean Owen  wrote:

> Wait, why groupBy at all? After the filter only rows with myCol equal to
> your target are left. There is only one group. Don't group just count after
> the filter?
>
> On Thu, Apr 7, 2022, 10:27 PM sam smith 
> wrote:
>
>> I want to aggregate a column by counting the number of rows having the
>> value "myTargetValue" and return the result
>> I am doing it like the following:in JAVA
>>
>>> long result =
>>> dataset.filter(dataset.col("myCol").equalTo("myTargetVal")).groupBy(col("myCol")).agg(count(dataset.col("myCol"))).select("count(myCol)").first().getLong(0);
>>
>>
>> Is that the right way? if no, what if a more optimized way to do that
>> (always in JAVA)?
>> Thanks for the help.
>>
>


Aggregate over a column: the proper way to do

2022-04-07 Thread sam smith
I want to aggregate a column by counting the number of rows having the
value "myTargetValue" and return the result.
I am doing it like the following, in Java:

> long result =
> dataset.filter(dataset.col("myCol").equalTo("myTargetVal")).groupBy(col("myCol")).agg(count(dataset.col("myCol"))).select("count(myCol)").first().getLong(0);


Is that the right way? If not, what is a more optimized way to do it
(again in Java)?
Thanks for the help.
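
Following the point made in the replies above (filter first, then aggregate
globally, no groupBy needed), a minimal Java sketch; the numeric column used
for the average is hypothetical:

import static org.apache.spark.sql.functions.*;

// Count of rows matching the filter:
long count = dataset.filter(col("myCol").equalTo("myTargetVal")).count();

// Average of another (hypothetical) numeric column over the same filtered rows:
double average = dataset.filter(col("myCol").equalTo("myTargetVal"))
        .agg(avg("someNumericCol"))
        .first()
        .getDouble(0);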


Re: Spark execution on Hadoop cluster (many nodes)

2022-01-24 Thread sam smith
I spark-submit a Spark application to Hadoop (cluster mode); that's what I
mean by "executing on Hadoop".

On Mon, 24 Jan 2022 at 18:00, Sean Owen  wrote:

> I am still not understanding what you mean by "executing on Hadoop". Spark
> does not use Hadoop for execution. Probably can't answer until this is
> cleared up.
>
> On Mon, Jan 24, 2022 at 10:57 AM sam smith 
> wrote:
>
>> I mean the DAG order is somehow altered when executing on Hadoop
>>
>> On Mon, 24 Jan 2022 at 17:17, Sean Owen  wrote:
>>
>>> Code is not executed by Hadoop, nor passed through Hadoop somehow. Do
>>> you mean data? data is read as-is. There is typically no guarantee about
>>> ordering of data in files but you can order data. Still not sure what
>>> specifically you are worried about here, but I don't think the kind of
>>> thing you're contemplating can happen, no
>>>
>>> On Mon, Jan 24, 2022 at 9:28 AM sam smith 
>>> wrote:
>>>
>>>> I am aware of that, but whenever the chunks of code are returned to
>>>> Spark from Hadoop (after processing) could they be done not in the ordered
>>>> way ? could this ever happen ?
>>>>
>>>> On Mon, 24 Jan 2022 at 16:14, Sean Owen  wrote:
>>>>
>>>>> Hadoop does not run Spark programs, Spark does. How or why would
>>>>> something, what, modify the byte code? No
>>>>>
>>>>> On Mon, Jan 24, 2022, 9:07 AM sam smith 
>>>>> wrote:
>>>>>
>>>>>> My point is could Hadoop go wrong about one Spark execution ? meaning
>>>>>> that it gets confused (given the concurrent distributed tasks) and then
>>>>>> adds wrong instruction to the program, or maybe does execute an 
>>>>>> instruction
>>>>>> not at its right order (shuffling the order of execution by executing
>>>>>> previous ones, while it shouldn't) ? Before finishing and returning the
>>>>>> results from one node it returns the results of the other in a wrong way
>>>>>> for example.
>>>>>>
>>>>>> On Mon, 24 Jan 2022 at 15:31, Sean Owen  wrote:
>>>>>>
>>>>>>> Not clear what you mean here. A Spark program is a program, so what
>>>>>>> are the alternatives here? program execution order is still program
>>>>>>> execution order. You are not guaranteed anything about order of 
>>>>>>> concurrent
>>>>>>> tasks. Failed tasks can be reexecuted so should be idempotent. I think 
>>>>>>> the
>>>>>>> answer is 'no' but not sure what you are thinking of here.
>>>>>>>
>>>>>>> On Mon, Jan 24, 2022 at 7:10 AM sam smith <
>>>>>>> qustacksm2123...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello guys,
>>>>>>>>
>>>>>>>> I hope my question does not sound weird, but could a Spark
>>>>>>>> execution on Hadoop cluster give different output than the program 
>>>>>>>> actually
>>>>>>>> does ? I mean by that, the execution order is messed by hadoop, or an
>>>>>>>> instruction executed twice..; ?
>>>>>>>>
>>>>>>>> Thanks for your enlightenment
>>>>>>>>
>>>>>>>


Re: Spark execution on Hadoop cluster (many nodes)

2022-01-24 Thread sam smith
I mean the DAG order is somehow altered when executing on Hadoop

On Mon, 24 Jan 2022 at 17:17, Sean Owen  wrote:

> Code is not executed by Hadoop, nor passed through Hadoop somehow. Do you
> mean data? data is read as-is. There is typically no guarantee about
> ordering of data in files but you can order data. Still not sure what
> specifically you are worried about here, but I don't think the kind of
> thing you're contemplating can happen, no
>
> On Mon, Jan 24, 2022 at 9:28 AM sam smith 
> wrote:
>
>> I am aware of that, but whenever the chunks of code are returned to Spark
>> from Hadoop (after processing) could they be done not in the ordered way ?
>> could this ever happen ?
>>
>> On Mon, 24 Jan 2022 at 16:14, Sean Owen  wrote:
>>
>>> Hadoop does not run Spark programs, Spark does. How or why would
>>> something, what, modify the byte code? No
>>>
>>> On Mon, Jan 24, 2022, 9:07 AM sam smith 
>>> wrote:
>>>
>>>> My point is could Hadoop go wrong about one Spark execution ? meaning
>>>> that it gets confused (given the concurrent distributed tasks) and then
>>>> adds wrong instruction to the program, or maybe does execute an instruction
>>>> not at its right order (shuffling the order of execution by executing
>>>> previous ones, while it shouldn't) ? Before finishing and returning the
>>>> results from one node it returns the results of the other in a wrong way
>>>> for example.
>>>>
>>>> On Mon, 24 Jan 2022 at 15:31, Sean Owen  wrote:
>>>>
>>>>> Not clear what you mean here. A Spark program is a program, so what
>>>>> are the alternatives here? program execution order is still program
>>>>> execution order. You are not guaranteed anything about order of concurrent
>>>>> tasks. Failed tasks can be reexecuted so should be idempotent. I think the
>>>>> answer is 'no' but not sure what you are thinking of here.
>>>>>
>>>>> On Mon, Jan 24, 2022 at 7:10 AM sam smith 
>>>>> wrote:
>>>>>
>>>>>> Hello guys,
>>>>>>
>>>>>> I hope my question does not sound weird, but could a Spark execution
>>>>>> on Hadoop cluster give different output than the program actually does ? 
>>>>>> I
>>>>>> mean by that, the execution order is messed by hadoop, or an instruction
>>>>>> executed twice..; ?
>>>>>>
>>>>>> Thanks for your enlightenment
>>>>>>
>>>>>


Re: Spark execution on Hadoop cluster (many nodes)

2022-01-24 Thread sam smith
I am aware of that, but when the chunks of work are returned to Spark from
Hadoop (after processing), could they come back out of order? Could this
ever happen?

On Mon, 24 Jan 2022 at 16:14, Sean Owen  wrote:

> Hadoop does not run Spark programs, Spark does. How or why would
> something, what, modify the byte code? No
>
> On Mon, Jan 24, 2022, 9:07 AM sam smith 
> wrote:
>
>> My point is could Hadoop go wrong about one Spark execution ? meaning
>> that it gets confused (given the concurrent distributed tasks) and then
>> adds wrong instruction to the program, or maybe does execute an instruction
>> not at its right order (shuffling the order of execution by executing
>> previous ones, while it shouldn't) ? Before finishing and returning the
>> results from one node it returns the results of the other in a wrong way
>> for example.
>>
>> On Mon, 24 Jan 2022 at 15:31, Sean Owen  wrote:
>>
>>> Not clear what you mean here. A Spark program is a program, so what are
>>> the alternatives here? program execution order is still program execution
>>> order. You are not guaranteed anything about order of concurrent tasks.
>>> Failed tasks can be reexecuted so should be idempotent. I think the answer
>>> is 'no' but not sure what you are thinking of here.
>>>
>>> On Mon, Jan 24, 2022 at 7:10 AM sam smith 
>>> wrote:
>>>
>>>> Hello guys,
>>>>
>>>> I hope my question does not sound weird, but could a Spark execution on
>>>> Hadoop cluster give different output than the program actually does ? I
>>>> mean by that, the execution order is messed by hadoop, or an instruction
>>>> executed twice..; ?
>>>>
>>>> Thanks for your enlightenment
>>>>
>>>


Re: Spark execution on Hadoop cluster (many nodes)

2022-01-24 Thread sam smith
My point is: could Hadoop get one Spark execution wrong? Meaning that it
gets confused (given the concurrent distributed tasks) and adds a wrong
instruction to the program, or executes an instruction out of order
(shuffling the order of execution by running earlier ones when it
shouldn't)? For example, before finishing and returning the results from one
node, it returns the results of another node in the wrong way.

On Mon, 24 Jan 2022 at 15:31, Sean Owen  wrote:

> Not clear what you mean here. A Spark program is a program, so what are
> the alternatives here? program execution order is still program execution
> order. You are not guaranteed anything about order of concurrent tasks.
> Failed tasks can be reexecuted so should be idempotent. I think the answer
> is 'no' but not sure what you are thinking of here.
>
> On Mon, Jan 24, 2022 at 7:10 AM sam smith 
> wrote:
>
>> Hello guys,
>>
>> I hope my question does not sound weird, but could a Spark execution on
>> Hadoop cluster give different output than the program actually does ? I
>> mean by that, the execution order is messed by hadoop, or an instruction
>> executed twice..; ?
>>
>> Thanks for your enlightenment
>>
>


Spark execution on Hadoop cluster (many nodes)

2022-01-24 Thread sam smith
Hello guys,

I hope my question does not sound weird, but could a Spark execution on a
Hadoop cluster give different output than the program actually specifies? I
mean by that: could the execution order be messed up by Hadoop, or an
instruction be executed twice?

Thanks for your enlightenment


Re: About some Spark technical help

2021-12-24 Thread sam smith
Thanks for the feedback Andrew.

On Sat, 25 Dec 2021 at 03:17, Andrew Davidson  wrote:

> Hi Sam
>
> It is kind of hard to review straight code. Adding some some sample data,
> a unit test and expected results. Would be a good place to start. Ie.
> Determine the fidelity of your implementation compared to the original.
>
> Also a verbal description of the algo would be helpful
>
> Happy Holidays
>
> Andy
>
> On Fri, Dec 24, 2021 at 3:17 AM sam smith 
> wrote:
>
>> Hi Gourav,
>>
>> Good question! that's the programming language i am most proficient at.
>> You are always welcome to suggest corrective remarks about my (Spark)
>> code.
>>
>> Kind regards.
>>
>> On Fri, 24 Dec 2021 at 11:58, Gourav Sengupta  wrote:
>>
>>> Hi,
>>>
>>> out of sheer and utter curiosity, why JAVA?
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Thu, Dec 23, 2021 at 5:10 PM sam smith 
>>> wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>> Thanks, here's the Github repo to the code and the publication :
>>>> https://github.com/SamSmithDevs10/paperReplicationForReview
>>>>
>>>> Kind regards
>>>>
>>>> On Thu, 23 Dec 2021 at 17:58, Andrew Davidson  wrote:
>>>>
>>>>> Hi Sam
>>>>>
>>>>>
>>>>>
>>>>> Can you tell us more? What is the algorithm? Can you send us the URL
>>>>> the publication
>>>>>
>>>>>
>>>>>
>>>>> Kind regards
>>>>>
>>>>>
>>>>>
>>>>> Andy
>>>>>
>>>>>
>>>>>
>>>>> *From: *sam smith 
>>>>> *Date: *Wednesday, December 22, 2021 at 10:59 AM
>>>>> *To: *"user@spark.apache.org" 
>>>>> *Subject: *About some Spark technical help
>>>>>
>>>>>
>>>>>
>>>>> Hello guys,
>>>>>
>>>>>
>>>>>
>>>>> I am replicating a paper's algorithm in Spark / Java, and want to ask
>>>>> you guys for some assistance to validate / review about 150 lines of code.
>>>>> My github repo contains both my java class and the related paper,
>>>>>
>>>>>
>>>>>
>>>>> Any interested reviewer here ?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>


Re: About some Spark technical help

2021-12-24 Thread sam smith
Hi Gourav,

Good question! That's the programming language I am most proficient in.
You are always welcome to suggest corrections to my (Spark) code.

Kind regards.

On Fri, 24 Dec 2021 at 11:58, Gourav Sengupta  wrote:

> Hi,
>
> out of sheer and utter curiosity, why JAVA?
>
> Regards,
> Gourav Sengupta
>
> On Thu, Dec 23, 2021 at 5:10 PM sam smith 
> wrote:
>
>> Hi Andrew,
>>
>> Thanks, here's the Github repo to the code and the publication :
>> https://github.com/SamSmithDevs10/paperReplicationForReview
>>
>> Kind regards
>>
>> On Thu, 23 Dec 2021 at 17:58, Andrew Davidson  wrote:
>>
>>> Hi Sam
>>>
>>>
>>>
>>> Can you tell us more? What is the algorithm? Can you send us the URL the
>>> publication
>>>
>>>
>>>
>>> Kind regards
>>>
>>>
>>>
>>> Andy
>>>
>>>
>>>
>>> *From: *sam smith 
>>> *Date: *Wednesday, December 22, 2021 at 10:59 AM
>>> *To: *"user@spark.apache.org" 
>>> *Subject: *About some Spark technical help
>>>
>>>
>>>
>>> Hello guys,
>>>
>>>
>>>
>>> I am replicating a paper's algorithm in Spark / Java, and want to ask
>>> you guys for some assistance to validate / review about 150 lines of code.
>>> My github repo contains both my java class and the related paper,
>>>
>>>
>>>
>>> Any interested reviewer here ?
>>>
>>>
>>>
>>>
>>>
>>> Thanks.
>>>
>>


Re: About some Spark technical help

2021-12-23 Thread sam smith
Hi Andrew,

Thanks, here's the Github repo to the code and the publication :
https://github.com/SamSmithDevs10/paperReplicationForReview

Kind regards

On Thu, 23 Dec 2021 at 17:58, Andrew Davidson  wrote:

> Hi Sam
>
>
>
> Can you tell us more? What is the algorithm? Can you send us the URL the
> publication
>
>
>
> Kind regards
>
>
>
> Andy
>
>
>
> *From: *sam smith 
> *Date: *Wednesday, December 22, 2021 at 10:59 AM
> *To: *"user@spark.apache.org" 
> *Subject: *About some Spark technical help
>
>
>
> Hello guys,
>
>
>
> I am replicating a paper's algorithm in Spark / Java, and want to ask you
> guys for some assistance to validate / review about 150 lines of code. My
> github repo contains both my java class and the related paper,
>
>
>
> Any interested reviewer here ?
>
>
>
>
>
> Thanks.
>


dataset partitioning algorithm implementation help

2021-12-23 Thread sam smith
Hello All,

I am replicating a paper's algorithm about a partitioning approach to
anonymize datasets with Spark / Java, and want to ask you for some help to
review my 150 lines of code. My github repo, attached below, contains both
my java class and the related paper:

https://github.com/SamSmithDevs10/paperReplicationForReview

Thanks in advance.

Thanks.


About some Spark technical help

2021-12-22 Thread sam smith
Hello guys,

I am replicating a paper's algorithm in Spark / Java, and want to ask you
guys for some assistance to validate / review about 150 lines of code. My
GitHub repo contains both my Java class and the related paper.

Any interested reviewer here?


Thanks.



Re: About some Spark technical assistance

2021-12-13 Thread sam smith
You were added to the repo as a contributor, thanks. I included the Java
class and the paper I am replicating.

On Mon, 13 Dec 2021 at 04:27,  wrote:

> github url please.
>
> On 2021-12-13 01:06, sam smith wrote:
> > Hello guys,
> >
> > I am replicating a paper's algorithm (graph coloring algorithm) in
> > Spark under Java, and thought about asking you guys for some
> > assistance to validate / review my 600 lines of code. Any volunteers
> > to share the code with ?
> > Thanks
>


About some Spark technical assistance

2021-12-12 Thread sam smith
Hello guys,

I am replicating a paper's algorithm (a graph coloring algorithm) in Spark
under Java, and thought about asking you guys for some assistance to
validate / review my 600 lines of code. Any volunteers to share the code
with?
Thanks