Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread Sean Owen
Why would CSV or a temp table change anything here? You don't need windowing for distinct values either. On Fri, Feb 10, 2023, 6:01 PM Mich Talebzadeh wrote: > Off the top of my head, create a dataframe reading a CSV file. > > This is Python > > listing_df = >

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread Mich Talebzadeh
Off the top of my head, create a dataframe reading a CSV file. This is Python: listing_df = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load(csv_file) listing_df.printSchema() listing_df.createOrReplaceTempView("temp") ## do your distinct

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
I am not sure I understand "Just need to do the cols one at a time". Also, I think Apostolos is right: this needs a dataframe approach, not a list approach. On Fri, Feb 10, 2023 at 22:47, Sean Owen wrote: > For each column, select only that col and get distinct values. Similar to > what

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
Hi Apostolos, can you suggest a better approach while keeping values within a dataframe? On Fri, Feb 10, 2023 at 22:47, Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote: > Dear Sam, > > you are assuming that the data fits in the memory of your local machine. > You are using as a basis a

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
Hi Sean, "You need to select the distinct values of each col one at a time": how? On Fri, Feb 10, 2023 at 22:40, Sean Owen wrote: > That gives you all distinct tuples of those col values. You need to select > the distinct values of each col one at a time. Sure just collect() the > result

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread Apostolos N. Papadopoulos
Dear Sam, you are assuming that the data fits in the memory of your local machine. You are using as a basis a dataframe, which can potentially be very large, and then you are storing the data in local lists. Keep in mind that the number of distinct elements in a column may be very large
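
One way to keep the concern above in check is to cap how many distinct values are pulled to the driver. The sketch below is an illustration under assumptions, not something proposed in the thread: `collect_distinct_capped` and `max_values` are hypothetical names, and `df` is assumed to be a Spark DataFrame (only its `select`, `distinct`, `limit`, and `collect` methods are used).

```python
def collect_distinct_capped(df, name, max_values=1000):
    """Collect distinct values of one column, refusing if there are too many.

    Hypothetical helper: df is assumed to be a pyspark.sql.DataFrame.
    """
    # Ask for one row more than the cap so an overflow can be detected
    # without bringing the whole distinct set to the driver.
    rows = df.select(name).distinct().limit(max_values + 1).collect()
    if len(rows) > max_values:
        raise ValueError(
            f"column {name!r} has more than {max_values} distinct values"
        )
    # Each Row has a single field; row[0] is that column's value.
    return [row[0] for row in rows]
```

When a column exceeds the cap, its distinct values can stay in a dataframe (`df.select(name).distinct()`) and be written out or joined instead of collected.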

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread Sean Owen
That gives you all distinct tuples of those col values. You need to select the distinct values of each col one at a time. Sure, just collect() the result as you do here. On Fri, Feb 10, 2023, 3:34 PM sam smith wrote: > I want to get the distinct values of each column in a List (is it good >
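
A minimal sketch of the per-column approach described above, written in PySpark rather than the Java of the original question. The function name `distinct_values_per_column` is hypothetical, and `df` is assumed to be a Spark DataFrame (only `columns`, `select`, `distinct`, and `collect` are used). Every column's distinct values are collected to the driver, so this is only reasonable when each column has few distinct values.

```python
def distinct_values_per_column(df):
    """Return [[col_name, v1, v2, ...], ...] for every column of df.

    Hypothetical helper: df is assumed to be a pyspark.sql.DataFrame.
    """
    result = []
    for name in df.columns:
        # One job per column: project the single column, then deduplicate.
        rows = df.select(name).distinct().collect()
        # Each Row has a single field; row[0] is that column's value.
        result.append([name] + [row[0] for row in rows])
    return result
```

By contrast, calling `distinct()` on several columns at once returns the distinct tuples of those columns, which is not the same as the distinct values of each column.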

How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
I want to get the distinct values of each column in a List (is it good practice to use List here?) that contains the column name as its first element and the column's distinct values as the remaining elements, so that for a dataset we get a list of lists. I do it this way (in my opinion not so fast): List> finalList

Re:

2023-02-10 Thread Sunil Prabhakara
unsubscribe On Tue, Feb 7, 2023 at 5:19 AM Tang Jinxin wrote: > unsubscribe >