Why would CSV or a temp table change anything here? You don't need windowing for distinct values, either.
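Sean's point is that per-column distinct values take nothing more than one distinct pass per column, with no window functions involved. As a minimal sketch of that idea (a plain-Python stand-in rather than the actual Spark job, since no cluster is assumed here; the sample rows are made up for illustration):

```python
# Stand-in for the Spark job: distinct values per column need no
# windowing, just one distinct pass per column. Done here over plain
# Python rows so it runs anywhere; the sample data is hypothetical.
rows = [
    {"city": "Paris", "color": "red"},
    {"city": "Tokyo", "color": "red"},
    {"city": "Paris", "color": None},
]

def distinct_per_column(rows):
    """Map each column name to its sorted distinct non-null values."""
    columns = rows[0].keys()
    return {
        col: sorted({r[col] for r in rows if r[col] is not None})
        for col in columns
    }

print(distinct_per_column(rows))
# {'city': ['Paris', 'Tokyo'], 'color': ['red']}
```

In PySpark the same per-column loop would select one column, call `distinct()`, and collect, without any `Window` specification.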
On Fri, Feb 10, 2023, 6:01 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Off the top of my head: create a dataframe by reading the CSV file.
>
> This is Python:
>
>     listing_df = spark.read.format("com.databricks.spark.csv") \
>         .option("inferSchema", "true") \
>         .option("header", "true") \
>         .load(csv_file)
>     listing_df.printSchema()
>     listing_df.createOrReplaceTempView("temp")
>
>     ## get your distinct columns using windowing functions on the temp table with SQL
>
> HTH
>
> view my LinkedIn profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> On Fri, 10 Feb 2023 at 21:59, sam smith <qustacksm2123...@gmail.com> wrote:
>
>> I am not sure I understand "Just need to do the cols one at a time" well. Plus, I think Apostolos is right: this needs a dataframe approach, not a list approach.
>>
>> On Fri, 10 Feb 2023 at 22:47, Sean Owen <sro...@gmail.com> wrote:
>>
>>> For each column, select only that column and get its distinct values, similar to what you do here. Just do the columns one at a time. Your current code doesn't do what you want.
>>>
>>> On Fri, Feb 10, 2023, 3:46 PM sam smith <qustacksm2123...@gmail.com> wrote:
>>>
>>>> Hi Sean,
>>>>
>>>> "You need to select the distinct values of each col one at a time", how?
>>>>
>>>> On Fri, 10 Feb 2023 at 22:40, Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> That gives you all distinct tuples of those column values. You need to select the distinct values of each column one at a time. Sure, just collect() the result as you do here.
>>>>>
>>>>> On Fri, Feb 10, 2023, 3:34 PM sam smith <qustacksm2123...@gmail.com> wrote:
>>>>>
>>>>>> I want to get the distinct values of each column in a List (is it good practice to use List here?) that contains the column name as its first element and the column's distinct values as the remaining elements, so that for a dataset we get a list of lists. I do it this way (in my opinion not so fast):
>>>>>>
>>>>>>     List<List<String>> finalList = new ArrayList<List<String>>();
>>>>>>     Dataset<Row> df = spark.read().format("csv").option("header", "true").load("/pathToCSV");
>>>>>>     String[] columnNames = df.columns();
>>>>>>     for (int i = 0; i < columnNames.length; i++) {
>>>>>>         List<String> columnList = new ArrayList<String>();
>>>>>>         columnList.add(columnNames[i]);
>>>>>>         List<Row> columnValues = df
>>>>>>             .filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull())
>>>>>>             .select(columnNames[i])
>>>>>>             .distinct()
>>>>>>             .collectAsList();
>>>>>>         for (int j = 0; j < columnValues.size(); j++)
>>>>>>             columnList.add(columnValues.get(j).apply(0).toString());
>>>>>>         finalList.add(columnList);
>>>>>>     }
>>>>>>
>>>>>> How can I improve this?
>>>>>>
>>>>>> Also, can I get the results in JSON format?
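On the JSON question: a map of column name to distinct values serializes to JSON directly, unlike the list-of-lists layout where the first element is the column name. A minimal sketch, assuming the distinct values have already been collected (the values shown are hypothetical placeholders; in Java one would use a JSON library such as Jackson or Gson for the same step):

```python
import json

# Hypothetical per-column distinct-values result, shaped as a map of
# column name -> distinct values. This structure maps one-to-one onto
# a JSON object, so serialization is a single call.
distinct_values = {
    "city": ["Paris", "Tokyo"],
    "color": ["blue", "red"],
}

as_json = json.dumps(distinct_values, indent=2)
print(as_json)
```

Round-tripping with `json.loads(as_json)` recovers the original map, which is why the map shape is a better fit than nested lists for this output.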