> Before I write the data frame to parquet, I do df.cache. After writing
> the file out, I do df.countDistinct(“a”, “b”, “c”).collect()
If you write the df to parquet, why would you also cache it? Caching by
default loads the data into memory, which might affect later operations
such as collect. The resulting GC can be explained by both the caching and
the collect.
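
For the count itself, note that countDistinct is an aggregate function, not
a DataFrame method, so it goes inside agg(). A minimal Scala sketch of the
no-cache approach (paths are placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.countDistinct

    val spark = SparkSession.builder().appName("distinct-count").getOrCreate()

    // stands in for the 6-billion-row dataframe from the original mail
    val df = spark.read.parquet("/path/to/input")

    // write out first, without cache()
    df.write.parquet("/path/to/output")

    // re-read the parquet you just wrote instead of caching the original
    // dataframe in executor memory; parquet column pruning means only
    // columns a, b, c are scanned for the aggregation
    val exact = spark.read.parquet("/path/to/output")
      .agg(countDistinct("a", "b", "c"))
      .first()
      .getLong(0)

    println(s"exact distinct count of (a, b, c): $exact")

In this form the only thing returned to the driver is the single aggregated
row.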


Lalwani, Jayesh <jlalw...@amazon.com.INVALID> writes:

> I have a Dataframe with around 6 billion rows and about 20 columns. First of 
> all, I want to write this dataframe out to parquet. Then, out of the 20 
> columns, I have 3 columns of interest, and I want to find how many distinct 
> values of those columns are there in the file. I don’t need the actual distinct 
> values; I just need the count. I know that there are around 10-16 million 
> distinct values.
>
> Before I write the data frame to parquet, I do df.cache. After writing the 
> file out, I do df.countDistinct(“a”, “b”, “c”).collect()
>
> When I run this, I see that the memory usage on my driver steadily increases 
> until it starts getting future time outs. I guess it’s spending time in GC. 
> Does countDistinct cause this behavior? Does Spark try to get all 10 million 
> distinct values into the driver? Is countDistinct not recommended for data 
> frames with a large number of distinct values?
>
> What’s the solution? Should I use approx_count_distinct?
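
Regarding approx_count_distinct: it takes a single column, so one option (a
sketch only, assuming wrapping the three columns of interest in a struct
works for your column types; df is the dataframe from your mail) would be:

    import org.apache.spark.sql.functions.{approx_count_distinct, struct}

    // HyperLogLog++ keeps only a small sketch per partition instead of the
    // full set of distinct values; 0.01 is the maximum relative standard
    // deviation of the estimate (the default is 0.05)
    val approx = df
      .agg(approx_count_distinct(struct("a", "b", "c"), 0.01))
      .first()
      .getLong(0)

With 10-16 million distinct values the estimate should stay within roughly
that relative error, and again only the final count comes back to the driver.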


-- 
nicolas paris
