[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091753#comment-14091753
 ] 

Saisai Shao commented on SPARK-2926:
------------------------------------

Hi Matei, thanks a lot for your comments.

The original point of this proposal is to directly merge the data without 
re-sorting (when spilling using EAOM) if data is by-key sorted (with 
keyOrdering or hashcode ordering) in map outputs. If there is no Ordering 
available or needed like groupByKey, current design thinking is still using 
EAOM to do aggregation.

I use SparkPerf sort-by-key workload to test the current shuffle 
implementations:

1. sort shuffle write with hash shuffle read (current sort-based shuffle 
implementation).
2. sort shuffle write and sort merge shuffle read (my prototype).

Test data type is String, key and value length is 10, and record number is 2G, 
data is stored in HDFS. My rough test shows that my prototype may be slower in 
shuffle write (1.18x slower) because of another key comparison, but 2.6x faster 
than HashShuffleReader in reduce side. 

I have to admit that only sort-by-key cannot well illustrate the necessity of 
this proposal, also the method is better for sortByKey scenario. I will 
continue to do some other workload tests to see if this method is really 
necessary or not. I will post my test result later.

Thanks again for your comments.

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> ------------------------------------------------------------------
>
>                 Key: SPARK-2926
>                 URL: https://issues.apache.org/jira/browse/SPARK-2926
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 1.1.0
>            Reporter: Saisai Shao
>         Attachments: SortBasedShuffleRead.pdf
>
>
> Currently Spark has already integrated sort-based shuffle write, which 
> greatly improve the IO performance and reduce the memory consumption when 
> reducer number is very large. But for the reducer side, it still adopts the 
> implementation of hash-based shuffle reader, which neglects the ordering 
> attributes of map output data in some situations.
> Here we propose a MR style sort-merge like shuffle reader for sort-based 
> shuffle to better improve the performance of sort-based shuffle.
> Working in progress code and performance test report will be posted later 
> when some unit test bugs are fixed.
> Any comments would be greatly appreciated. 
> Thanks a lot.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to