GitHub user xuanyuanking opened a pull request:

    https://github.com/apache/spark/pull/19745

    [SPARK-2926][Core][Follow Up] Sort shuffle reader for Spark 2.x

    ## What changes were proposed in this pull request?
    
    As commented in 
[SPARK-2926](https://issues.apache.org/jira/browse/SPARK-2926), this is the 
follow-up work that ports the old patch to Spark 2.x. This is a preview PR; I 
will add more unit tests if the community thinks it is still worth following 
up. A detailed benchmark is attached in the JIRA. This patch mainly does the 
following:
    1. To support Spark Streaming, class `ShuffleBlockFetcherIterator` added 
some wrapping work around `ManagedBuffer`, so here I change 
`ShuffleBlockFetcherIterator` to return the `ManagedBuffer` and do the wrapping 
work outside of `ShuffleBlockFetcherIterator`.
    2. Class `ShuffleMemoryManager` has been replaced by `TaskMemoryManager`, 
so I wrote a new class named `ExternalMerger` that inherits from 
`Spillable[ArrayBuffer[MemoryShuffleBlock]]`. This class manages all files and 
in-memory blocks during `SortShuffleReader.read()`.
    3. Add a flag named `canUseSortShuffleWriter` in `SortShuffleManager` to 
fix a Spark UT failure that occurs when `UnsafeShuffleWriter` is used in the 
shuffle write stage but `SortShuffleReader` is used in the shuffle read stage.
    4. Add the shuffle metric `peakMemoryUsedBytes`.
    5. A bug fix for data inconsistency in the old patch. [Code 
Link](https://github.com/xuanyuanking/spark/blob/f07c939a25839a5b0f69c504afb9aa008b1b3c5d/core/src/main/scala/org/apache/spark/util/collection/ExternalMerger.scala#L97)
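
For readers unfamiliar with the idea behind item 2, the sketch below shows a 
generic k-way merge of already-sorted inputs, which is the general technique a 
sort-based shuffle reader relies on when it combines in-memory blocks with 
spilled files. It is an illustration only and is not the code in this patch: 
`MergeSketch` and `mergeSorted` are hypothetical names, and plain Scala 
iterators stand in for `MemoryShuffleBlock`s and spill files. Memory 
accounting and spilling through `Spillable` are not modeled.

```scala
import scala.collection.mutable

object MergeSketch {

  /** Merge already-sorted iterators into one sorted iterator (k-way merge). */
  def mergeSorted[K, V](sources: Seq[Iterator[(K, V)]])
                       (implicit ord: Ordering[K]): Iterator[(K, V)] = {
    // PriorityQueue is a max-heap, so the ordering is reversed to make the
    // iterator with the smallest head key come out first.
    val byHeadKey: Ordering[BufferedIterator[(K, V)]] =
      Ordering.by[BufferedIterator[(K, V)], K](_.head._1)(ord).reverse
    val heap = new mutable.PriorityQueue[BufferedIterator[(K, V)]]()(byHeadKey)

    // Only non-empty sources take part in the merge.
    sources.map(_.buffered).filter(_.hasNext).foreach(it => heap.enqueue(it))

    new Iterator[(K, V)] {
      def hasNext: Boolean = heap.nonEmpty

      def next(): (K, V) = {
        // Remove the source from the heap before advancing it, so the heap
        // never holds an element whose sort key has changed underneath it.
        val source = heap.dequeue()
        val record = source.next()
        if (source.hasNext) heap.enqueue(source)
        record
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Stand-ins for one in-memory block and two spilled files, each sorted by key.
    val inMemory = Iterator(1 -> "a", 4 -> "d")
    val spilled1 = Iterator(2 -> "b", 5 -> "e")
    val spilled2 = Iterator(3 -> "c")

    val merged = mergeSorted(Seq(inMemory, spilled1, spilled2)).toList
    println(merged) // List((1,a), (2,b), (3,c), (4,d), (5,e))
  }
}
```

Only one record per source is held at a time, so a merge of this shape can 
combine many sorted inputs without loading all of them into memory at once.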
    
    ## How was this patch tested?
    
    As described in the doc, I ran a benchmark test against the current Spark 
master and found no diff in the data output. I will add more UTs and complete 
this PR following the community's advice.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuanyuanking/spark sort-shuffle-read-master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19745.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19745
    
----
commit b0f1f247cfee8fc7419c6fd3a831f54d1c9d4d63
Author: Yuanjian Li <xyliyuanj...@gmail.com>
Date:   2017-10-19T06:39:53Z

    Reimplementation for SPARK-2045 over branch 2.1

commit 1c07650d82f5c85189d4a5758722c3178caa0a3c
Author: Yuanjian Li <xyliyuanj...@gmail.com>
Date:   2017-10-26T05:12:42Z

    Code clean, include BlockManager and EmternalSorter reuse

commit dac1bf9662f1945df0efb5740df84980baa03d8e
Author: Yuanjian Li <xyliyuanj...@gmail.com>
Date:   2017-10-26T05:35:22Z

    Move ExternalMerger outside

commit 33418cae4c4eb80d12ff3bb7b0b4ee3f0a85575e
Author: Yuanjian Li <xyliyuanj...@gmail.com>
Date:   2017-10-26T07:57:01Z

    fix code style

commit f07c939a25839a5b0f69c504afb9aa008b1b3c5d
Author: Yuanjian Li <xyliyuanj...@gmail.com>
Date:   2017-11-10T12:55:27Z

    Bug fix for data inconsistency and more comments

commit ca43f1b44a41b68c2a9a83ced269c3ed644fef69
Author: Yuanjian Li <xyliyuanj...@gmail.com>
Date:   2017-11-14T14:11:35Z

    Fix unreasoning var name

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
