one small correction: lots of files lead to pressure on the spark driver
program when reading this data back in spark.

On Wed, Aug 8, 2018 at 3:39 PM, Koert Kuipers <ko...@tresata.com> wrote:

> hi,
>
> i am reading data from files into a dataframe, then doing a groupBy for a
> given column with a count, and finally i coalesce to a smaller number of
> partitions before writing out to disk. so roughly:
>
> spark.read.format(...).load(...)
>   .groupBy(column).count()
>   .coalesce(100)
>   .write.format(...).save(...)
>
> i have this setting: spark.sql.shuffle.partitions=2048
>
> i expect to see 2048 partitions in the shuffle. what i am seeing instead is
> a shuffle with only 100 partitions; it's as if the coalesce has taken over
> the partitioning of the groupBy.
>
> any idea why?
>
> i am doing the coalesce because it is not helpful to write out 2048 files:
> lots of files lead to pressure down the line on executors reading this data
> (i am writing to just one partition of a larger dataset), and since i have
> fewer than 100 executors i expect it to be efficient. so it sounds like a
> good idea, no?
>
> but i do need 2048 partitions in my shuffle due to the operation i am
> doing in the groupBy (in my real problem i am not just doing a count...).
>
> thanks!
> koert
>
>
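
for reference, a minimal sketch of what i believe is going on, assuming the
scala api (the parquet format, paths, and column name below are placeholders,
not the values elided in the quoted snippet): coalesce results in a narrow
dependency, so it shrinks the task count of the stage it ends up in, while
repartition adds its own shuffle and leaves the groupBy's 2048 partitions
alone.

import org.apache.spark.sql.SparkSession

object CoalesceVsRepartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("coalesce-vs-repartition")
      .config("spark.sql.shuffle.partitions", "2048")
      .getOrCreate()

    // placeholder input, format and column, standing in for the elided ones
    val df = spark.read.format("parquet").load("/path/to/input")
    val counts = df.groupBy("someColumn").count()

    // coalesce does not add a shuffle: the stage doing the final aggregation
    // runs with only 100 tasks instead of the configured 2048
    counts.coalesce(100)
      .write.format("parquet").save("/tmp/out-coalesce")

    // repartition adds a separate exchange: the aggregation keeps its 2048
    // shuffle partitions, and the extra shuffle brings the result down to
    // 100 partitions (and hence roughly 100 output files)
    counts.repartition(100)
      .write.format("parquet").save("/tmp/out-repartition")

    spark.stop()
  }
}

the price of the repartition variant is one extra shuffle of the
already-aggregated result, which is usually small compared to losing
parallelism in the aggregation itself.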
