Re: Count distinct and driver memory

2020-10-19 Thread Raghavendra Ganesh
Spark provides multiple options for caching (including disk). Have you tried caching to disk? -- Raghavendra On Mon, Oct 19, 2020 at 11:41 PM Lalwani, Jayesh wrote: > I was caching it because I didn't want to re-execute the DAG when I ran > the count query. If you have a Spark application
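For what it's worth, a minimal PySpark sketch of the disk-backed caching being suggested; the input path and DataFrame name are placeholders, not from the thread:

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/path/to/input")  # hypothetical input path

    # DISK_ONLY keeps cached partitions on local disk instead of executor memory,
    # trading re-read speed for freedom from heap pressure.
    df.persist(StorageLevel.DISK_ONLY)
    df.count()  # any action materializes the persisted data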

Re: Count distinct and driver memory

2020-10-19 Thread ayan guha
Do not do collect. This brings results back to the driver. Instead, do count distinct and write it out. On Tue, 20 Oct 2020 at 6:43 am, Nicolas Paris wrote: > > I was caching it because I didn't want to re-execute the DAG when I > > ran the count query. If you have a spark application with multiple
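A hedged sketch of that suggestion, assuming the same df and columns ("a", "b", "c") discussed elsewhere in the thread; the output path is a placeholder:

    from pyspark.sql import functions as F

    # Aggregate on the executors and write the one-row result out;
    # nothing larger than the aggregate ever reaches the driver.
    result = df.select(F.countDistinct("a", "b", "c").alias("n_distinct"))
    result.write.mode("overwrite").parquet("/path/to/distinct_count")  # hypothetical path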

Re: Count distinct and driver memory

2020-10-19 Thread Nicolas Paris
> I was caching it because I didn't want to re-execute the DAG when I > ran the count query. If you have a Spark application with multiple > actions, Spark re-executes the entire DAG for each action unless there > is a cache in between. I was trying to avoid reloading half a terabyte > of data.

Re: Count distinct and driver memory

2020-10-19 Thread Mich Talebzadeh
Best to check this in the Spark GUI under the Storage tab and see what is causing the issue. HTH

Re: Count distinct and driver memory

2020-10-19 Thread Lalwani, Jayesh
I was caching it because I didn't want to re-execute the DAG when I ran the count query. If you have a Spark application with multiple actions, Spark re-executes the entire DAG for each action unless there is a cache in between. I was trying to avoid reloading half a terabyte of data. Also,
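As a sketch of the pattern described here (paths and column names are illustrative), caching once lets both actions reuse the same materialized data:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/path/to/half-terabyte-input")  # hypothetical path
    df.cache()

    # Action 1: the write computes every partition and populates the cache.
    df.write.mode("overwrite").parquet("/path/to/output")

    # Action 2: the count scans cached partitions instead of re-running the whole DAG.
    n = df.select(F.countDistinct("a", "b", "c")).first()[0]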

Re: Count distinct and driver memory

2020-10-19 Thread Nicolas Paris
> Before I write the data frame to parquet, I do df.cache. After writing > the file out, I do df.countDistinct("a", "b", "c").collect() If you write the df to parquet, why would you also cache it? Caching by default loads into memory. This might affect later use, such as collect. The resulting GC
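To make the storage-level point concrete, a small sketch using standard Spark storage levels; whether DISK_ONLY actually helps depends on the workload:

    from pyspark import StorageLevel

    # For DataFrames, df.cache() is shorthand for persist(MEMORY_AND_DISK), so
    # cached partitions compete with the running job for executor heap.
    df.cache()

    # To cache without holding the heap, unpersist first (Spark rejects changing
    # the storage level of an already-persisted DataFrame), then persist to disk.
    df.unpersist()
    df.persist(StorageLevel.DISK_ONLY)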

Re: mission statement : unified

2020-10-19 Thread Sonal Goyal
My thought is that Spark supports analytics for structured and unstructured data, batch as well as real time. This was pretty revolutionary when Spark first came out. That's where the "unified" term came from, I think. Even after all these years, Spark remains the trusted framework for enterprise