[ https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251398#comment-16251398 ]
Li Yuanjian edited comment on SPARK-2926 at 11/14/17 1:54 PM: -------------------------------------------------------------- During our work of migrating some old Hadoop job to Spark, I noticed this JIRA and the code based on spark 1.x. I re-implemented the old PR based on Spark 2.1 and current master branch. After produced some scenario and ran some benchmark tests, I found that this shuffle mode can bring {color:red}12x~30x boosting in task duration and reduce peak execution memory to 1/12 ~ 1/50{color} vs current master version(see detail screenshot and test data in attatched pdf), especially the memory reducing, in this shuffle mode Spark can support more data size in less memory usage. The detail doc attached in this jira named "SortShuffleReader on Spark 2.x.pdf". I know that DataSet API will have better optimization and performance, but RDD API may still useful for flexible control and old Spark/Hadoop jobs. For the better performance in ordering cases and more cost-effective memory usage, maybe this PR is still worth to merge in to master. I'll sort out current code base and give a PR soon. Any comments and trying out would be greatly appreciated. was (Author: xuanyuan): During our work of migrating some old Hadoop job to Spark, I noticed this JIRA and the code based on spark 1.x. I re-implemented the old PR based on Spark 2.1 and current master branch. After produced some scenario and ran some benchmark tests, I found that this shuffle mode can bring {color:red}12x~30x boosting in task duration and reduce peak execution memory to 1/12 ~ 1/50{color}(see detail screenshot and test data in attatched pdf) vs current master version, especially the memory reducing, in this shuffle mode Spark can support more data size in less memory usage. The detail doc attached in this jira named "SortShuffleReader on Spark 2.x.pdf". I know that DataSet API will have better optimization and performance, but RDD API may still useful for flexible control and old Spark/Hadoop jobs. For the better performance in ordering cases and more cost-effective memory usage, maybe this PR is still worth to merge in to master. I'll sort out current code base and give a PR soon. Any comments and trying out would be greatly appreciated. > Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle > ------------------------------------------------------------------ > > Key: SPARK-2926 > URL: https://issues.apache.org/jira/browse/SPARK-2926 > Project: Spark > Issue Type: Improvement > Components: Shuffle > Affects Versions: 1.1.0 > Reporter: Saisai Shao > Assignee: Saisai Shao > Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test > Report(contd).pdf, Spark Shuffle Test Report.pdf > > > Currently Spark has already integrated sort-based shuffle write, which > greatly improve the IO performance and reduce the memory consumption when > reducer number is very large. But for the reducer side, it still adopts the > implementation of hash-based shuffle reader, which neglects the ordering > attributes of map output data in some situations. > Here we propose a MR style sort-merge like shuffle reader for sort-based > shuffle to better improve the performance of sort-based shuffle. > Working in progress code and performance test report will be posted later > when some unit test bugs are fixed. > Any comments would be greatly appreciated. > Thanks a lot. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org