Re: Count distinct and driver memory

Nicolas Paris Mon, 19 Oct 2020 12:38:00 -0700

> I was caching it because I didn't want to re-execute the DAG when I
> ran the count query. If you have a spark application with multiple
> actions, Spark reexecutes the entire DAG for each action unless there
> is a cache in between. I was trying to avoid reloading 1/2 a terabyte
> of data.  Also, cache should use up executor memory, not driver
> memory.
why not counting the parquet file instead? writing/reading a parquet
files is more efficients than caching in my experience.
if you really need caching you could choose a better strategy such
DISK.


Lalwani, Jayesh <jlalw...@amazon.com.INVALID> writes:

> I was caching it because I didn't want to re-execute the DAG when I ran the 
> count query. If you have a spark application with multiple actions, Spark 
> reexecutes the entire DAG for each action unless there is a cache in between. 
> I was trying to avoid reloading 1/2 a terabyte of data.  Also, cache should 
> use up executor memory, not driver memory.
>
> As it turns out cache was the problem. I didn't expect cache to take Executor 
> memory and spill over to disk. I don't know why it's taking driver memory. 
> The input data has millions of partitions which results in millions of tasks. 
> Perhaps the high memory usage is a side effect of caching the results of lots 
> of tasks. 
>
> On 10/19/20, 1:27 PM, "Nicolas Paris" <nicolas.pa...@riseup.net> wrote:
>
>     CAUTION: This email originated from outside of the organization. Do not 
> click links or open attachments unless you can confirm the sender and know 
> the content is safe.
>
>
>
>     > Before I write the data frame to parquet, I do df.cache. After writing
>     > the file out, I do df.countDistinct(“a”, “b”, “c”).collect()
>     if you write the df to parquet, why would you also cache it ? caching by
>     default loads the memory. this might affect  later use, such
>     collect. the resulting GC can be explained by both caching and collect
>
>
>     Lalwani, Jayesh <jlalw...@amazon.com.INVALID> writes:
>
>     > I have a Dataframe with around 6 billion rows, and about 20 columns. 
> First of all, I want to write this dataframe out to parquet. The, Out of the 
> 20 columns, I have 3 columns of interest, and I want to find how many 
> distinct values of the columns are there in the file. I don’t need the actual 
> distinct values. I just need the count. I knoe that there are around 
> 10-16million distinct values
>     >
>     > Before I write the data frame to parquet, I do df.cache. After writing 
> the file out, I do df.countDistinct(“a”, “b”, “c”).collect()
>     >
>     > When I run this, I see that the memory usage on my driver steadily 
> increases until it starts getting future time outs. I guess it’s spending 
> time in GC. Does countDistinct cause this behavior? Does Spark try to get all 
> 10 million distinct values into the driver? Is countDistinct not recommended 
> for data frames with large number of distinct values?
>     >
>     > What’s the solution? Should I use approx._count_distinct?
>
>
>     --
>     nicolas paris
>
>     ---------------------------------------------------------------------
>     To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-- 
nicolas paris

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Count distinct and driver memory

Reply via email to