one small correction: lots of files lead to pressure on the spark driver program when reading this data back in spark.
On Wed, Aug 8, 2018 at 3:39 PM, Koert Kuipers <ko...@tresata.com> wrote:
> hi,
>
> i am reading data from files into a dataframe, then doing a groupBy for a
> given column with a count, and finally i coalesce to a smaller number of
> partitions before writing out to disk. so roughly:
>
> spark.read.format(...).load(...).groupBy(column).count().
>   coalesce(100).write.format(...).save(...)
>
> i have this setting: spark.sql.shuffle.partitions=2048
>
> i expect to see 2048 partitions in shuffle. what i am seeing instead is a
> shuffle with only 100 partitions. it's like the coalesce has taken over the
> partitioning of the groupBy.
>
> any idea why?
>
> i am doing coalesce because it is not helpful to write out 2048 files,
> lots of files leads to pressure down the line on executors reading this
> data (i am writing to just one partition of a larger dataset), and since i
> have less than 100 executors i expect it to be efficient. so sounds like a
> good idea, no?
>
> but i do need 2048 partitions in my shuffle due to the operation i am
> doing in the groupBy (in my real problem i am not just doing a count...).
>
> thanks!
> koert
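
for reference, the behavior in the question comes from coalesce being a narrow transformation: Spark pushes the reduced partition count back into the preceding stage, so the aggregation itself runs with only 100 tasks. the usual workaround is repartition, which inserts an explicit shuffle boundary. a minimal sketch (not from the thread; the format and paths are placeholders, and "column" stands for the real grouping column):

```scala
// coalesce(100) narrows the plan: the groupBy stage itself is scheduled
// with only 100 tasks, ignoring spark.sql.shuffle.partitions=2048.
// repartition(100) instead adds an extra shuffle after the aggregation,
// so the groupBy keeps its 2048 shuffle partitions and only the write
// happens with 100 partitions (~100 output files):

spark.read.format("parquet").load("/path/in")   // placeholder format/paths
  .groupBy("column").count()                    // runs with 2048 shuffle partitions
  .repartition(100)                             // explicit shuffle down to 100
  .write.format("parquet").save("/path/out")
```

the extra shuffle moves only the already-aggregated (hence much smaller) result, which is usually far cheaper than running the whole aggregation at reduced parallelism.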