Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread Enrico Minack
That is unfortunate, but 3.4.0 is around the corner, really! Well, then, based on your code, I'd suggest two improvements:
- cache your dataframe after reading; this way, you don't read the entire file for each column
- do your outer for loop in parallel; then you have N parallel Spark jobs
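A minimal Java sketch of both suggestions (the file name, header option, and app name are illustrative, not from the thread):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class DistinctPerColumn {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("distinct-per-column").getOrCreate();

            // 1) read once and cache, so each per-column job reuses the same data
            Dataset<Row> df = spark.read().option("header", "true")
                    .csv("data.csv").cache();

            // 2) run the outer loop in parallel: each column's distinct() becomes
            //    its own Spark job, submitted concurrently from the driver
            Map<String, List<Row>> distinct = Arrays.stream(df.columns())
                    .parallel()
                    .collect(Collectors.toMap(
                            c -> c,
                            c -> df.select(c).distinct().collectAsList()));

            distinct.forEach((c, vals) -> System.out.println(c + " -> " + vals));
            spark.stop();
        }
    }

A parallel stream uses the JVM's common fork-join pool; an ExecutorService sized to the number of jobs you want in flight works just as well.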

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread sam smith
@Sean Correct. But I was hoping to improve my solution even more.

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread sam smith
@Enrico Minack Thanks for "unpivot", but I am using version 3.3.0 (you are taking it way too far, as usual :) ). @Sean Owen Please then show me how it can be improved in code. Also, why doesn't such an approach (using withColumn()) work:

    for (String columnName : df.columns()) {
        df = df.withColumn(columnName, collect_set(col(columnName)).as(columnName));
    }

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread Sean Owen
That's the answer, except you can never select a result set into a column, right? You just collect() each of those results. Or, what do you want? I'm not clear.
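A minimal sketch of the collect()-per-column pattern Sean describes, assuming a Dataset<Row> named df:

    import org.apache.spark.sql.Row;
    import java.util.List;

    // each distinct result set is pulled back to the driver with collect();
    // it lives in driver memory, not in a DataFrame column
    for (String c : df.columns()) {
        List<Row> values = df.select(c).distinct().collectAsList();
        System.out.println(c + ": " + values);
    }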

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread Mich Talebzadeh
Hi Sam, I am curious to know the business use case for this solution, if any? HTH

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread Sean Owen
It doesn't work because it's an aggregate function. You have to groupBy() (group by nothing) to make that work, but you can't assign that as a column. Folks, those approaches don't make sense semantically in SQL or Spark or anything. They just mean: use threads to collect() distinct values for each column.
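To make the contrast concrete, a short sketch (the column name "x" is illustrative; assumes a Dataset<Row> df and a static import of org.apache.spark.sql.functions.*):

    import org.apache.spark.sql.Row;

    // fails at analysis: collect_set is an aggregate function,
    // and withColumn cannot evaluate it as a per-row expression
    // df = df.withColumn("x", collect_set(col("x")));

    // works: "group by nothing" yields one row holding the distinct set
    Row sets = df.groupBy().agg(collect_set(col("x")).as("x")).first();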

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread Enrico Minack
@Sean: This aggregate function does work without an explicit groupBy():

    ./spark-3.3.1-bin-hadoop2/bin/spark-shell
    Spark context Web UI available at http://*:4040
    Spark context available as 'sc' (master = local[*], app id = local-1676237726079).
    Spark session available as 'spark'.
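The session itself is truncated in the archive. A minimal equivalent of the behavior Enrico points at, written in Java rather than his spark-shell Scala (column name "x" illustrative): a select() containing only aggregate expressions is analyzed as a global aggregation, so no explicit groupBy() is needed.

    import static org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.Row;

    // analyzed as a global aggregate: one row, one set of distinct values
    Row r = df.select(collect_set(col("x"))).first();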

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread sam smith
OK, what do you mean by "do your outer for loop in parallel"? Btw, this didn't work:

    for (String columnName : df.columns()) {
        df = df.withColumn(columnName, collect_set(col(columnName)).as(columnName));
    }
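For what it's worth, Sean's and Enrico's points can be combined without withColumn(). A hedged sketch that collects every column's distinct set in one global aggregation, i.e. a single Spark job (assumes a Dataset<Row> df and a static import of org.apache.spark.sql.functions.*):

    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.Row;
    import java.util.Arrays;

    // build one collect_set aggregate per column
    Column[] aggs = Arrays.stream(df.columns())
            .map(c -> collect_set(col(c)).as(c))
            .toArray(Column[]::new);

    // one job: a single Row whose fields are the per-column distinct sets;
    // beware driver memory if a column has very high cardinality
    Row sets = df.agg(aggs[0], Arrays.copyOfRange(aggs, 1, aggs.length)).first();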