GitHub user ericvandenbergfb opened a pull request: https://github.com/apache/spark/pull/19955
[SPARK-21867][CORE] Support async spilling in UnsafeShuffleWriter

## What changes were proposed in this pull request?

Add a multi-shuffle sorter that supports asynchronous spilling during a shuffle external sort. The benefit is that inserting and sorting/spilling can proceed in parallel, reducing overall latency for shuffle-heavy jobs (as many ads jobs are).

The multi-shuffle sorter sits between the UnsafeShuffleWriter and the ShuffleExternalSorter, so few changes are needed outside this component and there is clearly little room for regressing other code paths.

The multi-shuffle sorter is enabled via a configuration flag, spark.shuffle.async.num.sorter (default 1). If the value is 1, the multi-shuffle sorter is not used; it must be configured with multiple sorters (>= 2) to take effect. There is a design spec attached to the JIRA.

## How was this patch tested?

Added unit tests specifically for the MultiShuffleSorter to exercise it under various spill and insert conditions. Extended the UnsafeShuffleWriterSuite to run against either a single ShuffleExternalSorter or multiple sorters via the MultiShuffleSorter to ensure no regressions. Ran against production workloads, observed gains, and validated them based on logging.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ericvandenbergfb/spark async.multi.shuffle.sorter.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19955.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19955

----

commit 7f751ec23ba6d8c53c009edb7d62a460e9166d7f
Author: Eric Vandenberg <ericvandenb...@fb.com>
Date: 2017-12-12T02:23:26Z

    [SPARK-21867][CORE] Support async spilling in UnsafeShuffleWriter

    Add a multi-shuffle sorter which supports asynchronous spilling during a shuffle external sort.
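The core idea described above — inserts continue into other sorters while a full sorter is sorted and spilled on a background thread — can be sketched conceptually. This is not the PR's code: the class `MultiSorterSketch`, the `runDemo` helper, the spill threshold of 4, and the round-robin insert policy are all illustrative assumptions, standing in for the real MultiShuffleSorter/ShuffleExternalSorter machinery.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Conceptual sketch (NOT the PR's implementation): several in-memory sorters
// share the insert stream; when one fills up, it is sorted and "spilled" on a
// background thread while inserts continue into its siblings.
public class MultiSorterSketch {

    /** One sorter: buffers records, then sorts and drains them on spill. */
    static class Sorter {
        private final List<Integer> buffer = new ArrayList<>();
        private final int threshold;
        private final AtomicInteger spilledRecords;

        Sorter(int threshold, AtomicInteger spilledRecords) {
            this.threshold = threshold;
            this.spilledRecords = spilledRecords;
        }

        /** Returns true when the buffer is full and a spill should be scheduled. */
        synchronized boolean insert(int record) {
            buffer.add(record);
            return buffer.size() >= threshold;
        }

        /** Sort the buffer, stand in for writing a sorted run to disk, drain it. */
        synchronized void spill() {
            buffer.sort(null);                     // natural ordering
            spilledRecords.addAndGet(buffer.size());
            buffer.clear();
        }
    }

    /** Inserts numRecords across numSorters with async spills; returns records spilled. */
    static int runDemo(int numSorters, int numRecords) throws Exception {
        AtomicInteger spilledRecords = new AtomicInteger();
        List<Sorter> sorters = new ArrayList<>();
        for (int i = 0; i < numSorters; i++) {
            sorters.add(new Sorter(4, spilledRecords));
        }
        ExecutorService spillPool = Executors.newFixedThreadPool(numSorters);
        List<Future<?>> pendingSpills = new ArrayList<>();
        try {
            for (int r = numRecords; r >= 1; r--) {
                // Round-robin: a sorter that is busy spilling does not block
                // inserts into the other sorters.
                Sorter s = sorters.get(r % numSorters);
                if (s.insert(r)) {
                    pendingSpills.add(spillPool.submit(s::spill));
                }
            }
            for (Future<?> f : pendingSpills) {
                f.get();                           // wait for outstanding async spills
            }
            for (Sorter s : sorters) {
                s.spill();                         // final spill of any leftovers
            }
        } finally {
            spillPool.shutdown();
        }
        return spilledRecords.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("records spilled: " + runDemo(2, 100));
    }
}
```

Because `insert` and `spill` synchronize on the same sorter, each record is spilled exactly once, while spills of one sorter overlap in time with inserts into the others — the latency win the PR description claims for shuffle-heavy jobs.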
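Enabling the feature would look something like the following. The flag name `spark.shuffle.async.num.sorter` and its default of 1 come from the PR description above; the job class, jar name, and the choice of 4 sorters are placeholder assumptions.

```shell
# Enable the multi-shuffle sorter with 4 parallel sorters.
# Default is 1, which disables it; values >= 2 turn it on.
# com.example.MyJob / my-job.jar are placeholders for your own job.
spark-submit \
  --conf spark.shuffle.async.num.sorter=4 \
  --class com.example.MyJob \
  my-job.jar
```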