[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251398#comment-16251398
 ] 

Li Yuanjian commented on SPARK-2926:
------------------------------------

During our work of migrating some old Hadoop job to Spark, I noticed this JIRA 
and the code based on spark 1.x.

I re-implemented the old PR based on Spark 2.1 and current master branch. After 
produced some scenario and ran some benchmark tests, I found that this shuffle 
mode can bring {color:red}12x~30x boosting in task duration and reduce peak 
execution memory to 1/12 ~ 1/50{color} vs current master version, especially 
the memory reducing, in this shuffle mode Spark can support more data size in 
less memory usage. The detail doc attached in this jira named 
"SortShuffleReader on Spark 2.x".

I know that DataSet API will have better optimization and performance, but RDD 
API may still useful for flexible control and old Spark/Hadoop jobs. For the 
better performance in ordering cases and more cost-effective memory usage, 
maybe this PR is still worth to merge in to master.

I'll sort out current code base and give a PR soon. Any comments and trying out 
would be greatly appreciated.

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> ------------------------------------------------------------------
>
>                 Key: SPARK-2926
>                 URL: https://issues.apache.org/jira/browse/SPARK-2926
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 1.1.0
>            Reporter: Saisai Shao
>            Assignee: Saisai Shao
>         Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test 
> Report(contd).pdf, Spark Shuffle Test Report.pdf
>
>
> Currently Spark has already integrated sort-based shuffle write, which 
> greatly improve the IO performance and reduce the memory consumption when 
> reducer number is very large. But for the reducer side, it still adopts the 
> implementation of hash-based shuffle reader, which neglects the ordering 
> attributes of map output data in some situations.
> Here we propose a MR style sort-merge like shuffle reader for sort-based 
> shuffle to better improve the performance of sort-based shuffle.
> Working in progress code and performance test report will be posted later 
> when some unit test bugs are fixed.
> Any comments would be greatly appreciated. 
> Thanks a lot.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to