Hi all, I'm sending this here before creating a JIRA issue, in an effort to
start a discussion :)

Problem:

We have a situation where we end up with a very large number (O(10K)) of
partitions, with very little data in most of them but a lot of data in a few.
This not only causes slow execution, it also runs us into issues like
SPARK-12837 <https://issues.apache.org/jira/browse/SPARK-12837>. The natural
solution is to coalesce these partitions; however, it's impossible for us to
know upfront how many partitions we'll get or how big they'll be, and this can
change on each run. If we naively coalesce to, say, 50 partitions, we're
likely to end up with very poorly distributed data and potentially executor
OOMs, because coalesce balances the number of partitions rather than the
partition sizes.
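
To illustrate, here's a minimal sketch of the naive approach (the paths,
schema and numbers are all made up):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  val spark = SparkSession.builder().appName("coalesce-illustration").getOrCreate()

  // Hypothetical input that ends up with O(10K) partitions after filtering,
  // most of them tiny and a few of them huge.
  val df = spark.read.parquet("/data/events")
  val filtered = df.filter(col("event_type") === "purchase")

  // Naive fix: collapse to a fixed partition count. coalesce() only merges
  // partitions by count, not by size, so a few of the 50 resulting partitions
  // can end up holding most of the data and OOM the executors.
  filtered.coalesce(50).write.parquet("/data/out")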

Proposed Solution:

Therefore I was thinking about the possibility of contributing a feature
whereby a user could specify a target partition size and Spark would aim to
get as close as possible to it. Something along these lines was partially
attempted in SPARK-14042
<https://issues.apache.org/jira/browse/SPARK-14042>, but it falls short
because there is (AFAIK) no generic way of getting access to the partition
sizes before query execution. Thus I would suggest we try something similar
to AQE's "Post Shuffle Partition Coalescing", whereby the user can trigger an
adaptive coalesce even without a shuffle. The optimal number of partitions
would be calculated dynamically at runtime. This way we don't pay the price
of a shuffle but can still get reasonable partition sizes.
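
To make the idea concrete, here's a rough sketch of what this could look like
from the user's side. The AQE configs shown are the existing ones this would
mirror; coalesceBySize is a purely hypothetical name for the proposed API:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("size-based-coalesce").getOrCreate()

  // What AQE does today (Spark 3.x): coalesce small partitions, but only
  // after a shuffle, guided by the advisory target size.
  spark.conf.set("spark.sql.adaptive.enabled", "true")
  spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
  spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")

  // Hypothetical proposal: let the user request size-based coalescing
  // directly, with the number of output partitions computed at runtime from
  // the observed partition sizes, and without introducing a shuffle.
  // NOTE: coalesceBySize() does not exist today; the name and signature are
  // purely illustrative.
  val df = spark.read.parquet("/data/events")    // path is made up
  val resized = df.coalesceBySize("128m")
  resized.write.parquet("/data/out")

Under the hood this would presumably need the same kind of runtime size
statistics that AQE gathers from shuffle map outputs, just collected without
the exchange.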

What do people think? Happy to clarify things upon request. :) 

Best,
Matt


