GitHub user ala opened a pull request: https://github.com/apache/spark/pull/20664
[SPARK-23496][CORE] Locality of coalesced partitions can be severely skewed by the order of input partitions ## What changes were proposed in this pull request? The algorithm in `DefaultPartitionCoalescer.setupGroups` is responsible for picking preferred locations for coalesced partitions. It analyzes the preferred locations of input partitions. It starts by trying to create one partition for each unique location in the input. However, if the the requested number of coalesced partitions is higher that the number of unique locations, it has to pick duplicate locations. Previously, the duplicate locations would be picked by iterating over the input partitions in order, and copying their preferred locations to coalesced partitions. If the input partitions were clustered by location, this could result in severe skew. With the fix, instead of iterating over the list of input partitions in order, we pick them at random. It's not perfectly balanced, but it's much better. ## How was this patch tested? Unit test reproducing the behavior was added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ala/spark SPARK-23496 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20664.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20664 ---- commit 6d67dfc1d4c012492b97873beaac5a7cbfd6f55a Author: Ala Luszczak <ala@...> Date: 2018-02-23T14:37:19Z Fix SPARK-23496. ---- --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org