Hi,
6 billion rows is quite small; I can do it on my laptop with around 4 GB of
RAM. What version of Spark are you using, and what is the effective memory
you have per executor?

Regards,
Gourav Sengupta

On Mon, Oct 19, 2020 at 4:24 AM Lalwani, Jayesh <jlalw...@amazon.com.invalid>
wrote:

> I have a Dataframe with around 6 billion rows, and about 20 columns. First
> of all, I want to write this dataframe out to parquet. Then, out of the 20
> columns, I have 3 columns of interest, and I want to find how many distinct
> values of those columns are there in the file. I don’t need the actual
> distinct values. I just need the count. I know that there are around
> 10-16 million distinct values.
>
>
>
> Before I write the dataframe to parquet, I do df.cache. After writing the
> file out, I do df.countDistinct("a", "b", "c").collect()
>
>
>
> When I run this, I see that the memory usage on my driver steadily
> increases until it starts getting future timeouts. I guess it’s spending
> time in GC. Does countDistinct cause this behavior? Does Spark try to get
> all 10 million distinct values into the driver? Is countDistinct not
> recommended for data frames with a large number of distinct values?
>
>
>
> What’s the solution? Should I use approx_count_distinct?
>
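
For reference, here is a minimal PySpark sketch of the two approaches being
discussed. The DataFrame name df and the 1% rsd value are assumptions, and
note that countDistinct and approx_count_distinct are aggregate functions from
pyspark.sql.functions rather than DataFrame methods; combining the three
columns with concat_ws is just one way to feed a multi-column combination to
approx_count_distinct:

    from pyspark.sql import functions as F

    # Exact distinct count of the (a, b, c) combination. Only a single
    # aggregate row comes back to the driver, not the 10-16 million
    # distinct values themselves.
    exact = df.select(F.countDistinct("a", "b", "c").alias("n")).first()["n"]

    # Approximate count using a bounded-memory HyperLogLog++ sketch.
    # rsd is the maximum relative standard deviation (1% here, an assumption).
    # approx_count_distinct takes a single column, so the three columns are
    # combined first; concat_ws skips nulls, which can merge some combinations.
    approx = df.select(
        F.approx_count_distinct(F.concat_ws("|", "a", "b", "c"), rsd=0.01).alias("n")
    ).first()["n"]
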
