GitHub user xuanyuanking opened a pull request: https://github.com/apache/spark/pull/19745
[SPARK-2926][Core][Follow Up] Sort shuffle reader for Spark 2.x

## What changes were proposed in this pull request?

As commented in [SPARK-2926](https://issues.apache.org/jira/browse/SPARK-2926), this is follow-up work that ports the old patch to Spark 2.x. This is a preview PR; I will add more unit tests if the community thinks it is still worth following up. A detailed benchmark is attached in the JIRA. This patch mainly does the following:

1. To support Spark Streaming, `ShuffleBlockFetcherIterator` had gained some wrapping work around `ManagedBuffer`. I changed `ShuffleBlockFetcherIterator` to return the `ManagedBuffer` itself and moved the wrapping work outside of `ShuffleBlockFetcherIterator`.
2. `ShuffleMemoryManager` has been replaced by `TaskMemoryManager`, so I wrote a new class named `ExternalMerger` that inherits from `Spillable[ArrayBuffer[MemoryShuffleBlock]]`. This class manages all files and in-memory blocks during `SortShuffleReader.read()`.
3. Added a flag named `canUseSortShuffleWriter` in `SortShuffleManager`. It fixes a Spark unit test failure in the scenario where `UnsafeShuffleWriter` is used in the shuffle write stage but `SortShuffleReader` is used in the shuffle read stage.
4. Added the shuffle metric `peakMemoryUsedBytes`.
5. Fixed a data inconsistency bug in the old patch. [Code Link](https://github.com/xuanyuanking/spark/blob/f07c939a25839a5b0f69c504afb9aa008b1b3c5d/core/src/main/scala/org/apache/spark/util/collection/ExternalMerger.scala#L97)

## How was this patch tested?

As described in the doc, I ran a benchmark test against the current Spark master, and there is no diff in the data output. I will add more unit tests and complete this PR following the community's advice.
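To make the idea behind item 2 concrete, here is a minimal, language-neutral sketch of a merger that keeps sorted blocks in memory, spills them to disk as one sorted run once a threshold is exceeded, and finally merges the remaining in-memory blocks with the spilled runs. It is not the code from this PR: the class name `SpillingMerger`, its methods, and the record-count threshold are all hypothetical, and it counts records where the actual patch is described as relying on `Spillable` and `TaskMemoryManager` for memory accounting.

```python
import heapq
import os
import pickle
import tempfile


class SpillingMerger:
    """Hypothetical illustration of a spill-and-merge reader; not Spark code."""

    def __init__(self, max_in_memory_records):
        # Assumption: the threshold is a record count, chosen for simplicity.
        self.max_in_memory_records = max_in_memory_records
        self.in_memory_blocks = []      # each block is a list of (key, value) sorted by key
        self.in_memory_records = 0
        self.peak_in_memory_records = 0  # loosely analogous to a peak-memory metric
        self.spill_paths = []

    def insert_block(self, block):
        """Accept one key-sorted block and spill if the threshold is exceeded."""
        self.in_memory_blocks.append(block)
        self.in_memory_records += len(block)
        self.peak_in_memory_records = max(
            self.peak_in_memory_records, self.in_memory_records
        )
        if self.in_memory_records > self.max_in_memory_records:
            self._spill()

    def _spill(self):
        """Merge all in-memory blocks into one sorted run on disk."""
        merged = heapq.merge(*self.in_memory_blocks, key=lambda kv: kv[0])
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "wb") as out:
            for record in merged:
                pickle.dump(record, out)
        self.spill_paths.append(path)
        self.in_memory_blocks = []
        self.in_memory_records = 0

    @staticmethod
    def _read_spill(path):
        """Stream records back from one spilled run."""
        with open(path, "rb") as src:
            while True:
                try:
                    yield pickle.load(src)
                except EOFError:
                    return

    def merged_iterator(self):
        """Yield every record in key order, then delete the spill files."""
        sources = [iter(block) for block in self.in_memory_blocks]
        sources += [self._read_spill(path) for path in self.spill_paths]
        try:
            yield from heapq.merge(*sources, key=lambda kv: kv[0])
        finally:
            for path in self.spill_paths:
                os.remove(path)
            self.spill_paths = []
```

Feeding the merger three key-sorted blocks with a threshold of three records triggers one spill after the second block, and `merged_iterator()` then returns all records in key order from the spilled run and the block still in memory.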
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuanyuanking/spark sort-shuffle-read-master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19745.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19745

----
commit b0f1f247cfee8fc7419c6fd3a831f54d1c9d4d63
Author: Yuanjian Li <xyliyuanj...@gmail.com>
Date:   2017-10-19T06:39:53Z

    Reimplementation for SPARK-2045 over branch 2.1

commit 1c07650d82f5c85189d4a5758722c3178caa0a3c
Author: Yuanjian Li <xyliyuanj...@gmail.com>
Date:   2017-10-26T05:12:42Z

    Code clean, include BlockManager and EmternalSorter reuse

commit dac1bf9662f1945df0efb5740df84980baa03d8e
Author: Yuanjian Li <xyliyuanj...@gmail.com>
Date:   2017-10-26T05:35:22Z

    Move ExternalMerger outside

commit 33418cae4c4eb80d12ff3bb7b0b4ee3f0a85575e
Author: Yuanjian Li <xyliyuanj...@gmail.com>
Date:   2017-10-26T07:57:01Z

    fix code style

commit f07c939a25839a5b0f69c504afb9aa008b1b3c5d
Author: Yuanjian Li <xyliyuanj...@gmail.com>
Date:   2017-11-10T12:55:27Z

    Bug fix for data inconsistency and more comments

commit ca43f1b44a41b68c2a9a83ced269c3ed644fef69
Author: Yuanjian Li <xyliyuanj...@gmail.com>
Date:   2017-11-14T14:11:35Z

    Fix unreasoning var name

----