[ https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096608#comment-14096608 ]
Saisai Shao edited comment on SPARK-2926 at 8/14/14 7:12 AM:
-------------------------------------------------------------

Hi Matei, I just uploaded a Spark shuffle performance test report. In this report, I chose three workloads from SparkPerf (sort-by-key, aggregate-by-key and group-by-key) to test the performance of the three current shuffle implementations: hash-based shuffle; sort-based shuffle with HashShuffleReader; and sort-based shuffle with a sort-merge shuffle reader (our prototype). In general, for sort-by-key our prototype performs better than the other two implementations, while for the other two workloads the performance is almost the same. Would you mind taking a look at it? Any comments would be greatly appreciated. Thanks a lot.

was (Author: jerryshao):
Hi Matei, I just uploaded a Spark shuffle performance test report. In this report, I choose 3 different workload (sort-by-key, aggregate-by-key and group-by-key) in SparkPerf to test the performance of current 3 shuffle implementations: hash-based shuffle; sort-based shuffle with HashShuffleReader; sort-based shuffle with sort-merge shuffle reader (our prototype). Would you mind taking a look at it, any comment would be greatly appreciated, thanks a lot.

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> ------------------------------------------------------------------
>
>                 Key: SPARK-2926
>                 URL: https://issues.apache.org/jira/browse/SPARK-2926
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 1.1.0
>            Reporter: Saisai Shao
>         Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test Report.pdf
>
>
> Spark has already integrated sort-based shuffle write, which greatly
> improves IO performance and reduces memory consumption when the number
> of reducers is very large. On the reducer side, however, it still uses
> the hash-based shuffle reader, which in some situations ignores the
> ordering of the map output data.
> Here we propose an MR-style sort-merge shuffle reader for sort-based
> shuffle, to further improve its performance.
> Work-in-progress code and a performance test report will be posted later,
> once some unit test bugs are fixed.
> Any comments would be greatly appreciated.
> Thanks a lot.
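
For readers unfamiliar with the idea, the following is a minimal sketch of what a sort-merge reader does on the reduce side. It is not taken from the SPARK-2926 patch or from Spark's code base; it is plain Python, and the names merge_sorted_runs, group_by_key and map_outputs are hypothetical. It assumes each map task has already written its records for this reducer sorted by key, so the reducer can merge the runs lazily instead of rebuilding the ordering in a hash map.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_sorted_runs(runs):
    """Lazily merge key-sorted runs of (key, value) pairs into one key-sorted stream.

    Each run stands in for one map task's output for this reducer
    (an assumption of this sketch, not Spark's actual data layout).
    """
    return heapq.merge(*runs, key=itemgetter(0))


def group_by_key(sorted_pairs):
    """Group consecutive pairs that share a key; the input must be key-sorted."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, [value for _, value in group]


# Three hypothetical map outputs, each already sorted by key.
map_outputs = [
    [("a", 1), ("c", 3)],
    [("a", 2), ("b", 5)],
    [("b", 4), ("d", 6)],
]

merged = list(merge_sorted_runs(map_outputs))
grouped = list(group_by_key(merged))

print(merged)
print(grouped)
```

Because the merge only keeps one head element per run in its heap, memory use depends on the number of runs rather than on the number of records, which is the property an MR-style reader relies on for sort-by-key.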