hi,

i am reading data from files into a dataframe, then doing a groupBy on a
given column with a count, and finally coalescing to a smaller number of
partitions before writing out to disk. so roughly:

spark.read.format(...).load(...).groupBy(column).count().coalesce(100).write.format(...).save(...)

i have this setting: spark.sql.shuffle.partitions=2048
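
to be concrete, here is roughly what the job looks like spelled out (the
format, paths and column name below are just placeholders, not my real ones):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().getOrCreate()

  // same as setting spark.sql.shuffle.partitions=2048 in the config
  spark.conf.set("spark.sql.shuffle.partitions", "2048")

  // placeholder format, path and column name
  val counts = spark.read
    .format("parquet")
    .load("/data/input")
    .groupBy("someColumn")
    .count()
    .coalesce(100)

  // placeholder format and path
  counts.write
    .format("parquet")
    .save("/data/output")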

i expect to see 2048 partitions in the shuffle. what i am seeing instead is
a shuffle with only 100 partitions. it's like the coalesce has taken over
the partitioning of the groupBy.
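
fwiw, here is a sketch of how the partitioning can be inspected on the
driver (reusing the counts dataframe from the sketch above; these inspection
calls are just for illustration):

  // prints the physical plan, which shows the Exchange for the groupBy
  // and the Coalesce sitting on top of it
  counts.explain()

  // partition count of the final result: 100, because of the coalesce
  println(counts.rdd.getNumPartitions)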

any idea why?

i am doing the coalesce because it is not helpful to write out 2048 files:
lots of files put pressure down the line on the executors reading this data
(i am writing to just one partition of a larger dataset), and since i have
fewer than 100 executors i expect coalescing to 100 to be efficient. so it
sounds like a good idea, no?

but i do need 2048 partitions in my shuffle due to the operation i am doing
in the groupBy (in my real problem i am not just doing a count...).

thanks!
koert
