Re: Count distinct and driver memory

2020-10-19 Thread Raghavendra Ganesh
Spark provides multiple options for caching (including disk). Have you tried caching to disk?

--
Raghavendra

On Mon, Oct 19, 2020 at 11:41 PM Lalwani, Jayesh wrote:
> I was caching it because I didn't want to re-execute the DAG when I ran
> the count query. If you have a Spark application
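A minimal sketch of the disk-caching suggestion, assuming df is the ~6-billion-row DataFrame from the original post:

```scala
import org.apache.spark.storage.StorageLevel

// DISK_ONLY keeps the materialized partitions out of memory entirely;
// MEMORY_AND_DISK (the default for DataFrame.cache()) spills to disk
// only the partitions that do not fit in memory.
df.persist(StorageLevel.DISK_ONLY)
```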

Re: Count distinct and driver memory

2020-10-19 Thread ayan guha
Do not do collect. This brings the results back to the driver. Instead, do the count distinct and write it out.

On Tue, 20 Oct 2020 at 6:43 am, Nicolas Paris wrote:
> > I was caching it because I didn't want to re-execute the DAG when I
> > ran the count query. If you have a Spark application with multiple
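A hedged sketch of this approach; the output path and sink format are assumptions:

```scala
import org.apache.spark.sql.functions.countDistinct

// Compute the distinct count on the executors and write the one-row
// result out, so nothing sizeable is ever pulled to the driver.
df.select(countDistinct("a", "b", "c").as("distinct_abc"))
  .write
  .mode("overwrite")
  .csv("/tmp/distinct_abc")   // hypothetical path; any sink would do
```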

Re: Count distinct and driver memory

2020-10-19 Thread Nicolas Paris
> I was caching it because I didn't want to re-execute the DAG when I
> ran the count query. If you have a Spark application with multiple
> actions, Spark re-executes the entire DAG for each action unless there
> is a cache in between. I was trying to avoid reloading half a terabyte
> of data.

Re: Count distinct and driver memory

2020-10-19 Thread Mich Talebzadeh
Best to check this in the Spark GUI under Storage and see what is causing the issue. HTH

Re: Count distinct and driver memory

2020-10-19 Thread Lalwani, Jayesh
I was caching it because I didn't want to re-execute the DAG when I ran the count query. If you have a Spark application with multiple actions, Spark re-executes the entire DAG for each action unless there is a cache in between. I was trying to avoid reloading half a terabyte of data. Also,
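A sketch of the pattern described, with hypothetical paths; the cache is marked lazily and populated by whichever action runs first:

```scala
import org.apache.spark.sql.functions.countDistinct

val df = spark.read.parquet("/data/input")   // hypothetical ~0.5 TB source

df.cache()                                   // lazy: only marks df for caching
df.write.parquet("/data/output")             // action 1: scans the source, fills the cache
val n = df.select(countDistinct("a", "b", "c"))
  .first()                                   // action 2: served from the cache
  .getLong(0)
```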

Re: Count distinct and driver memory

2020-10-19 Thread Nicolas Paris
> Before I write the data frame to parquet, I do df.cache. After writing
> the file out, I do df.countDistinct("a", "b", "c").collect()

If you write the df to parquet, why would you also cache it? Caching by default loads into memory. This might affect later use, such as collect. The resulting GC
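The alternative implied here is to skip the cache and re-read the freshly written parquet; since parquet is columnar, the second pass only scans the three columns it needs. A sketch under the same hypothetical paths:

```scala
import org.apache.spark.sql.functions.countDistinct

df.write.parquet("/data/output")             // hypothetical path
val n = spark.read.parquet("/data/output")   // re-read instead of caching
  .select(countDistinct("a", "b", "c"))
  .first()
  .getLong(0)
```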

Re: Count distinct and driver memory

2020-10-18 Thread Gourav Sengupta
Hi,

6 billion rows is quite small; I can do it on my laptop with around 4 GB of RAM. What version of Spark are you using, and what is the effective memory that you have per executor?

Regards,
Gourav Sengupta

On Mon, Oct 19, 2020 at 4:24 AM Lalwani, Jayesh wrote:
> I have a Dataframe with

Count distinct and driver memory

2020-10-18 Thread Lalwani, Jayesh
I have a Dataframe with around 6 billion rows, and about 20 columns. First of all, I want to write this dataframe out to parquet. Then, out of the 20 columns, I have 3 columns of interest, and I want to find how many distinct values of those columns are there in the file. I don't need the actual
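For reference, the steps described above look roughly like the sketch below; note that countDistinct lives in org.apache.spark.sql.functions rather than being a DataFrame method, and the column names a, b, c follow the later messages in the thread:

```scala
import org.apache.spark.sql.functions.countDistinct

df.write.parquet("/data/out")                // hypothetical path: write everything first
val rows = df.select(countDistinct("a", "b", "c"))
  .collect()                                 // a single Row, so the collect itself is small
```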