Re: How to improve efficiency of this piece of code (returning distinct column values)

sam smith Fri, 10 Feb 2023 13:47:31 -0800

Hi Sean,

You wrote: "You need to select the distinct values of each col one at a time." How would I do that?
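
To make the quoted distinction concrete, here is a minimal plain-Java sketch (no Spark involved) contrasting distinct tuples with per-column distinct values. The class name `DistinctPerColumn`, its method names, and the sample rows are hypothetical and exist only for this illustration; the sketch shows the intended result shape, not how Spark computes it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Spark.
final class DistinctPerColumn {

    // Distinct whole rows (tuples), keeping first-seen order.
    static List<List<String>> distinctTuples(List<List<String>> rows) {
        return new ArrayList<>(new LinkedHashSet<>(rows));
    }

    // Column name followed by that column's distinct non-null values.
    static List<String> distinctColumn(String name, int index, List<List<String>> rows) {
        Set<String> seen = new LinkedHashSet<>();
        for (List<String> row : rows) {
            String value = row.get(index);
            if (value != null) {
                seen.add(value);
            }
        }
        List<String> result = new ArrayList<>();
        result.add(name);
        result.addAll(seen);
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data with two columns, c1 and c2.
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("a", "x"),
            Arrays.asList("a", "y"),
            Arrays.asList("b", "x"),
            Arrays.asList("a", "x"));

        System.out.println(distinctTuples(rows));
        System.out.println(distinctColumn("c1", 0, rows));
        System.out.println(distinctColumn("c2", 1, rows));
    }
}
```

The tuple version keeps every distinct combination of values, while the per-column version keeps each column's values independently, which is the list-of-lists shape asked for in the original question.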


On Fri, Feb 10, 2023 at 22:40, Sean Owen <sro...@gmail.com> wrote:

> That gives you all distinct tuples of those col values. You need to select
> the distinct values of each col one at a time. Sure, just collect() the
> result as you do here.
>
> On Fri, Feb 10, 2023, 3:34 PM sam smith <qustacksm2123...@gmail.com>
> wrote:
>
>> I want to get the distinct values of each column in a List (is it good
>> practice to use a List here?) whose first element is the column name and
>> whose remaining elements are that column's distinct values, so that for a
>> dataset we get a list of lists. I do it this way (in my opinion, not so fast):
>>
>> List<List<String>> finalList = new ArrayList<List<String>>();
>> Dataset<Row> df = spark.read().format("csv").option("header", "true").load("/pathToCSV");
>> String[] columnNames = df.columns();
>> for (int i = 0; i < columnNames.length; i++) {
>>     // First element is the column name; the distinct values follow.
>>     List<String> columnList = new ArrayList<String>();
>>     columnList.add(columnNames[i]);
>>
>>     // One Spark job per column: drop nulls, deduplicate, collect to the driver.
>>     List<Row> columnValues = df
>>         .filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull())
>>         .select(columnNames[i])
>>         .distinct()
>>         .collectAsList();
>>     for (int j = 0; j < columnValues.size(); j++)
>>         columnList.add(columnValues.get(j).apply(0).toString());
>>
>>     finalList.add(columnList);
>> }
>>
>> How to improve this?
>>
>> Also, can I get the results in JSON format?
>>
>
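
On the JSON question, one possible approach is to serialize the collected list of lists on the driver. The following is a minimal sketch in plain Java, assuming each inner list holds the column name first and its distinct non-null values after it, as in the quoted code. The class `ColumnsToJson` and its methods are hypothetical names for this illustration; in a real project an established JSON library such as Jackson or Gson would likely be a safer choice than hand-written escaping.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Spark.
final class ColumnsToJson {

    // Escape a value for use inside a JSON string literal.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \\uXXXX form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Render {"columnName":["value1","value2",...],...} from the list of lists.
    static String toJson(List<List<String>> columns) {
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < columns.size(); i++) {
            List<String> column = columns.get(i);
            if (i > 0) {
                sb.append(',');
            }
            sb.append('"').append(escape(column.get(0))).append("\":[");
            for (int j = 1; j < column.size(); j++) {
                if (j > 1) {
                    sb.append(',');
                }
                sb.append('"').append(escape(column.get(j))).append('"');
            }
            sb.append(']');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the collected finalList.
        List<List<String>> finalList = Arrays.asList(
            Arrays.asList("c1", "a", "b"),
            Arrays.asList("c2", "x"));
        System.out.println(toJson(finalList));
    }
}
```

This sketch assumes column names are unique and values are non-null, which matches a CSV read with a header row and the isNotNull() filter in the quoted code.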
