GitHub user ericvandenbergfb opened a pull request:

    https://github.com/apache/spark/pull/19955

    [SPARK-21867][CORE] Support async spilling in UnsafeShuffleWriter

    ## What changes were proposed in this pull request?
    
    Add a multi-shuffle sorter which supports asynchronous spilling during a 
shuffle external sort.  The benefit is that we can insert and sort/spill in 
parallel, reducing the overall latency for jobs that are heavy on shuffling (as 
are many ads jobs). The multi-shuffle sorter is added between the 
UnsafeShuffleWriter and ShuffleExternalSorter such that few changes are needed 
outside of this component, and as such, we can see clearly there is little room 
for regressing other code paths.
    
    The multi-shuffle sorter is enabled via a configuration flag, 
spark.shuffle.async.num.sorter (default 1)  If the value is 1, then the 
multi-shuffle sorter is not used, it must be configured to have multiple
    sorters (>=2)
    
    There is a design spec here attached to the jira.
    
    ## How was this patch tested?
    
    Added unit tests specifically for the MultiShuffleSorter to exercise under 
various spill and insert conditions.
    Extended the UnsafeShuffleWriterSuite to run against a single 
ShuffleExternalSorter or multiple via the MultiShuffleSorter to ensure no 
regressions.
    Ran against production work loads and observed gains and validated based on 
logging.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ericvandenbergfb/spark 
async.multi.shuffle.sorter.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19955.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19955
    
----
commit 7f751ec23ba6d8c53c009edb7d62a460e9166d7f
Author: Eric Vandenberg <ericvandenb...@fb.com>
Date:   2017-12-12T02:23:26Z

    [SPARK-21867][CORE] Support async spilling in UnsafeShuffleWriter
    
    Add a multi-shuffle sorter which supports asynchronous spilling during a 
shuffle external sort.
    The benefit is that we can insert and sort/spill in parallel, reducing the 
overall latency for jobs that
    are heavy on shuffling (as are many ads jobs). The multi-shuffle sorter is 
added between the UnsafeShuffleWriter
    and ShuffleExternalSorter such that few changes are needed outside of this 
component, and as such, we can
    see clearly there is little room for regressing other code paths.
    
    The multi-shuffle sorter is enabled via a configuration flag, 
spark.shuffle.async.num.sorter (default 1)
    If the value is 1, then the multi-shuffle sorter is not used, it must be 
configured to have multiple
    sorters (>=2)
    
    There is a design spec here attached to the jira.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to