@Sean Correct. But I was hoping to improve my solution even more.
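One way to improve it: the per-column distinct()-plus-collectAsList() loop can be collapsed into a single aggregation job with one collect_set per column. A minimal Java sketch (untested; df is the dataset from the original post; it assumes each column's distinct set is small enough to fit into a single row, which is exactly Apostolos's caveat below):

import static org.apache.spark.sql.functions.collect_set;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Row;

// One collect_set per column: a single Spark job computes all distinct
// sets at once, instead of one distinct()+collectAsList() job per column.
// collect_set also drops nulls, so the isNotNull filter becomes unnecessary.
Column[] aggs = Arrays.stream(df.columns())
        .map(c -> collect_set(c).alias(c))
        .toArray(Column[]::new);

Row row = df.agg(aggs[0], Arrays.copyOfRange(aggs, 1, aggs.length)).first();

// Rebuild the List<List<String>> shape from the original post:
// [columnName, value1, value2, ...] for each column.
List<List<String>> finalList = new ArrayList<>();
String[] columnNames = df.columns();
for (int i = 0; i < columnNames.length; i++) {
    List<String> columnList = new ArrayList<>();
    columnList.add(columnNames[i]);
    for (Object value : row.getList(i)) {
        columnList.add(value.toString());
    }
    finalList.add(columnList);
}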
On Sun, Feb 12, 2023 at 18:03, Sean Owen <sro...@gmail.com> wrote:

That's the answer, except you can never select a result set into a column, right? You just collect() each of those results. Or, what do you want? I'm not clear.

On Sun, Feb 12, 2023 at 10:59 AM, sam smith <qustacksm2123...@gmail.com> wrote:

@Enrico Minack <enrico-min...@gmx.de> Thanks for "unpivot", but I am using version 3.3.0 (you are taking it way too far, as usual :) )
@Sean Owen <sro...@gmail.com> Please show me then how it can be improved by code.

Also, why doesn't such an approach (using withColumn()) work:

for (String columnName : df.columns()) {
    df = df.withColumn(columnName,
            df.select(columnName).distinct().col(columnName));
}

On Sat, Feb 11, 2023 at 13:11, Enrico Minack <i...@enrico.minack.dev> wrote:

You could do the entire thing in DataFrame world and write the result to disk. All you need is unpivot (to be released in Spark 3.4.0, soon).

Note this is Scala but should be straightforward to translate into Java:

import org.apache.spark.sql.functions.collect_set

val df = Seq((1, 10, 123), (2, 20, 124), (3, 20, 123), (4, 10, 123)).toDF("a", "b", "c")

df.unpivot(Array.empty, "column", "value")
  .groupBy("column")
  .agg(collect_set("value").as("distinct_values"))

The unpivot operation turns

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1| 10|123|
|  2| 20|124|
|  3| 20|123|
|  4| 10|123|
+---+---+---+

into

+------+-----+
|column|value|
+------+-----+
|     a|    1|
|     b|   10|
|     c|  123|
|     a|    2|
|     b|   20|
|     c|  124|
|     a|    3|
|     b|   20|
|     c|  123|
|     a|    4|
|     b|   10|
|     c|  123|
+------+-----+

The groupBy("column").agg(collect_set("value").as("distinct_values")) collects the distinct values per column:

+------+---------------+
|column|distinct_values|
+------+---------------+
|     c|     [123, 124]|
|     b|       [20, 10]|
|     a|   [1, 2, 3, 4]|
+------+---------------+
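Enrico's note that the Scala "should be straightforward to translate into Java" invites a concrete rendering; here is one possible Java sketch of the same pipeline (untested, and it assumes Spark 3.4.0+, where Dataset.unpivot becomes available):

import static org.apache.spark.sql.functions.collect_set;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Melt all columns into (column, value) rows, then collect the distinct
// values per column -- the same plan as the Scala version above.
Dataset<Row> distinctValues = df
        .unpivot(new Column[0], "column", "value")   // requires Spark 3.4.0+
        .groupBy("column")
        .agg(collect_set("value").alias("distinct_values"));

distinctValues.show();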
Note that unpivot only works if all columns have a "common" type; all columns are then cast to that common type. If you have incompatible types like Integer and String, you would have to cast them all to String first:

import org.apache.spark.sql.types.StringType

df.select(df.columns.map(col(_).cast(StringType)): _*).unpivot(...)

If you want to preserve the type of the values and have multiple value types, you cannot put everything into a DataFrame with one distinct_values column. You could still have multiple DataFrames, one per data type, and write those, or collect the DataFrame's values into Maps:

import scala.collection.immutable

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_set}

// if all your columns have the same type
def distinctValuesPerColumnOneType(df: DataFrame): immutable.Map[String, immutable.Seq[Any]] = {
  df.unpivot(Array.empty, "column", "value")
    .groupBy("column")
    .agg(collect_set("value").as("distinct_values"))
    .collect()
    .map(row => row.getString(0) -> row.getSeq[Any](1).toList)
    .toMap
}

// if your columns have different types
def distinctValuesPerColumn(df: DataFrame): immutable.Map[String, immutable.Seq[Any]] = {
  df.schema.fields
    .groupBy(_.dataType)
    .mapValues(_.map(_.name))
    .par
    .map { case (dataType, columns) => df.select(columns.map(col): _*) }
    .map(distinctValuesPerColumnOneType)
    .flatten
    .toList
    .toMap
}

val df = Seq((1, 10, "one"), (2, 20, "two"), (3, 20, "one"), (4, 10, "one")).toDF("a", "b", "c")
distinctValuesPerColumn(df)

The result is (list values are of the original type):

Map(b -> List(20, 10), a -> List(1, 2, 3, 4), c -> List(one, two))

Hope this helps,
Enrico

On 10.02.23 at 22:56, sam smith wrote:

Hi Apostolos,
Can you suggest a better approach while keeping values within a dataframe?

On Fri, Feb 10, 2023 at 22:47, Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote:

Dear Sam,

You are assuming that the data fits in the memory of your local machine. You are using a dataframe as a basis, which can potentially be very large, and then you are storing the data in local lists. Keep in mind that the number of distinct elements in a column may be very large (depending on the app). I suggest working towards a solution that assumes that the number of distinct values is also large. Thus, you should keep your data in dataframes or RDDs, and store them as CSV files, Parquet, etc.

a.p.

On 10/2/23 23:40, sam smith wrote:

I want to get the distinct values of each column in a List (is it good practice to use List here?) that contains the column name as its first element and the column's distinct values as the remaining elements, so that for a dataset we get a list of lists. I do it this way (in my opinion not so fast):

List<List<String>> finalList = new ArrayList<List<String>>();
Dataset<Row> df = spark.read().format("csv").option("header", "true").load("/pathToCSV");
String[] columnNames = df.columns();
for (int i = 0; i < columnNames.length; i++) {
    List<String> columnList = new ArrayList<String>();
    columnList.add(columnNames[i]);
    List<Row> columnValues = df.filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull())
            .select(columnNames[i]).distinct().collectAsList();
    for (int j = 0; j < columnValues.size(); j++)
        columnList.add(columnValues.get(j).apply(0).toString());
    finalList.add(columnList);
}

How to improve this?

Also, can I get the results in JSON format?
--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papad...@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol
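On sam's closing question about JSON: once the distinct values live in a DataFrame, as in the unpivot pipeline above, Spark's built-in JSON writer applies directly. A sketch reusing the hypothetical distinctValues dataset from the Java rendering earlier (the output path is a placeholder):

import java.util.List;

// Written as JSON Lines, one object per row, e.g.
// {"column":"a","distinct_values":["1","2","3","4"]}
distinctValues.write().json("/path/to/distinct_values_json");

// Or, for a small result, fetch the JSON strings to the driver instead:
List<String> json = distinctValues.toJSON().collectAsList();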