Hi all,

Sending this first before creating a Jira issue, in an effort to start a discussion :)
Problem:

We have a situation where we end up with a very large number (O(10K)) of partitions, with very little data in most partitions but a lot of data in some of them. This not only causes slow execution, but we also run into issues like SPARK-12837 <https://issues.apache.org/jira/browse/SPARK-12837>. Naturally the solution here is to coalesce these partitions; however, it's impossible for us to know upfront the size and number of partitions we're going to get, and this can change on each run. If we naively try coalescing to, say, 50 partitions, we're likely to end up with very poorly distributed data and potentially executor OOMs. This is because coalesce tries to balance the number of partitions rather than partition size.

Proposed Solution:

I was therefore thinking about contributing a feature whereby a user could specify a target partition size and Spark would aim to get as close as possible to it. Something along these lines was partially attempted in SPARK-14042 <https://issues.apache.org/jira/browse/SPARK-14042>, but it falls short because there is no generic way (AFAIK) of getting access to the partition sizes before query execution. I would therefore suggest we try to do something similar to AQE's "Post Shuffle Partition Coalescing", whereby we allow the user to trigger an adaptive coalesce even without a shuffle. The optimal number of partitions would be calculated dynamically at runtime. This way we don't pay the price of a shuffle but can still get reasonable partition sizes.

What do people think? Happy to clarify things upon request. :)

Best,
Matt
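To make the idea concrete, here is a rough pure-Python sketch of the size-targeted grouping step, similar in spirit to what AQE's post-shuffle coalescing does. Everything here is hypothetical: `plan_coalesce` is not a Spark API, and it assumes per-partition sizes are already known (e.g. measured or estimated at runtime); it only decides which adjacent partitions to merge, which is the shuffle-free part of the proposal.

```python
def plan_coalesce(partition_sizes, target_size):
    """Group adjacent partitions so each merged partition is as close
    as possible to target_size without overshooting (a single partition
    already larger than the target stays alone, since splitting it
    would require a shuffle). Returns lists of partition indices."""
    groups, current, current_size = [], [], 0
    for idx, size in enumerate(partition_sizes):
        # Close the current group if adding this partition would
        # push a non-empty group past the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Example: many tiny partitions plus a few big ones, target 100 (MB).
sizes = [1, 1, 1, 1, 200, 1, 1, 150, 1]
print(plan_coalesce(sizes, 100))
# → [[0, 1, 2, 3], [4], [5, 6], [7], [8]]
```

Note how the tiny partitions get merged while the two oversized ones are left untouched; a naive `coalesce(5)` on the same data could instead pair a 200 MB partition with its neighbours and make the skew worse.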