Hi, I am reading data from files into a DataFrame, then doing a groupBy on a given column with a count, and finally coalescing to a smaller number of partitions before writing out to disk. So roughly:
  spark.read.format(...).load(...).groupBy(column).count().coalesce(100).write.format(...).save(...)

I have this setting:

  spark.sql.shuffle.partitions=2048

so I expect to see 2048 partitions in the shuffle. What I am seeing instead is a shuffle with only 100 partitions; it's as if the coalesce has taken over the partitioning of the groupBy. Any idea why?

I am doing the coalesce because it is not helpful to write out 2048 files: lots of small files puts pressure down the line on the executors that read this data (I am writing to just one partition of a larger dataset), and since I have fewer than 100 executors I expect 100 output partitions to be efficient. So it sounds like a good idea, no? But I do need 2048 partitions in my shuffle because of the operation I am doing in the groupBy (in my real problem I am not just doing a count...).

Thanks!
Koert