GitHub user ala opened a pull request:

    https://github.com/apache/spark/pull/20664

    [SPARK-23496][CORE] Locality of coalesced partitions can be severely skewed 
by the order of input partitions

    ## What changes were proposed in this pull request?
    
    The algorithm in `DefaultPartitionCoalescer.setupGroups` is responsible for 
picking preferred locations for coalesced partitions. It analyzes the preferred 
locations of input partitions. It starts by trying to create one partition for 
each unique location in the input. However, if the the requested number of 
coalesced partitions is higher that the number of unique locations, it has to 
pick duplicate locations.
    
    Previously, the duplicate locations would be picked by iterating over the 
input partitions in order, and copying their preferred locations to coalesced 
partitions. If the input partitions were clustered by location, this could 
result in severe skew.
    
    With the fix, instead of iterating over the list of input partitions in 
order, we pick them at random. It's not perfectly balanced, but it's much 
better.
    
    ## How was this patch tested?
    
    Unit test reproducing the behavior was added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ala/spark SPARK-23496

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20664.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20664
    
----
commit 6d67dfc1d4c012492b97873beaac5a7cbfd6f55a
Author: Ala Luszczak <ala@...>
Date:   2018-02-23T14:37:19Z

    Fix SPARK-23496.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to