[ https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265030#comment-16265030 ]

Li Yuanjian commented on SPARK-2926:
------------------------------------

[~jerryshao], thanks a lot for your advice and reply.
{quote}
would you please use spark-perf's micro benchmark 
(https://github.com/databricks/spark-perf) to verify again with same workload 
as mentioned in original test report?
{quote}
Sure, I'll verify this again ASAP.
{quote}
Theoretically this solution cannot get 12x-30x boosting according to my test
{quote}
At first I also had questions about this; I attached all the screenshots in the 
pdf. The 12x boost happened in both scenarios, with the reducer task number at 
1 and at 100. The duration of this stage dropped from 2min to 9s (13x) when the 
reducer task number was 1, and from 20min to 1.4min when the number was 100. 
The 30x boost happened after I added more data pressure to the reducer task.
{quote}
Can you please explain the key difference and the reason of such boosting?
{quote}
I think the key difference mainly comes from these 2 points:
1. As saisai said, BlockStoreShuffleReader uses `ExternalSorter` to do the 
reduce-side work, so every record goes through comparison, while 
SortShuffleReader is more CPU friendly: it collects all shuffle map results 
(both data in memory and data spilled to disk) and combines them with a merge 
sort (each partition has already been sorted on the map side).
2. The obvious cut in peak memory used by the reduce task, which saves GC 
time during sorting.
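The cost difference in point 1 can be sketched with a small, hypothetical example (Python here purely for illustration; Spark's actual reader is Scala, and the spill data below is made up). Since each map-side run is already sorted, a merge-sort reader only ever compares the current head of each run via a heap of size k, roughly O(n log k) comparisons, and streams results instead of buffering everything for a full re-sort:

```python
import heapq

# Hypothetical spills: each models one map output that is already
# sorted by key, as sort-based shuffle write guarantees.
spill_a = [(1, "a"), (4, "d"), (7, "g")]
spill_b = [(2, "b"), (5, "e")]
spill_c = [(3, "c"), (6, "f")]

# A merge-sort style reader keeps a heap of the k run heads and pops
# the smallest each time, so no record is compared against all others.
merged = list(heapq.merge(spill_a, spill_b, spill_c, key=lambda kv: kv[0]))

print(merged)
```

Because the merge is streaming, it also avoids holding a full sort buffer for all records at once, which is the peak-memory and GC saving mentioned in point 2.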

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> ------------------------------------------------------------------
>
>                 Key: SPARK-2926
>                 URL: https://issues.apache.org/jira/browse/SPARK-2926
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 1.1.0
>            Reporter: Saisai Shao
>            Assignee: Saisai Shao
>         Attachments: SortBasedShuffleRead.pdf, SortBasedShuffleReader on 
> Spark 2.x.pdf, Spark Shuffle Test Report(contd).pdf, Spark Shuffle Test 
> Report.pdf
>
>
> Currently Spark has already integrated sort-based shuffle write, which 
> greatly improves IO performance and reduces memory consumption when the 
> reducer number is very large. But the reducer side still adopts the 
> hash-based shuffle reader implementation, which neglects the ordering 
> attributes of map output data in some situations.
> Here we propose an MR-style sort-merge shuffle reader for sort-based 
> shuffle to further improve the performance of sort-based shuffle.
> Work-in-progress code and a performance test report will be posted later, 
> once some unit test bugs are fixed.
> Any comments would be greatly appreciated. 
> Thanks a lot.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
