[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation

Josh Rosen (JIRA) Thu, 28 Aug 2014 11:16:40 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114092#comment-14114092
 ]


Josh Rosen commented on SPARK-3280:
-----------------------------------

Here are some numbers from August 10.  If I recall, this was running on 8 
m3.8xlarge nodes.  This test linearly scales a bunch of parameters (data set 
size, numbers of mappers and reducers, etc).  You can see that hash-based 
shuffle's performance degrades severely in cases where we have many mappers and 
reducers, while sort scales much more gracefully:

!http://i.imgur.com/rODzaG1.png!

!http://i.imgur.com/72kCkH5.png!

This was run with spark-perf; here's a sample config for one of the bars:

{code}
Java options: -Dspark.storage.memoryFraction=0.66 
-Dspark.serializer=org.apache.spark.serializer.JavaSerializer 
-Dspark.locality.wait=60000000 
-Dspark.shuffle.manager=org.apache.spark.shuffle.hash.HashShuffleManager
Options: aggregate-by-key-naive --num-trials=10 --inter-trial-wait=3 
--num-partitions=400 --reduce-tasks=400 --random-seed=5 
--persistent-type=memory  --num-records=200000000 --unique-keys=20000 
--key-length=10 --unique-values=1000000 --value-length=10  
--storage-location=hdfs://:9000/spark-perf-kv-data
{code}

I'll try to run a better set of tests today.  I plan to look at a few cases 
that these tests didn't address, including the performance impact when running 
on spinning disks, as well as jobs where we have a large dataset with few 
mappers and reducers (I think this is the case that we'd expect to be most 
favorable to hash-based shuffle).

> Made sort-based shuffle the default implementation
> --------------------------------------------------
>
>                 Key: SPARK-3280
>                 URL: https://issues.apache.org/jira/browse/SPARK-3280
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>
> sort-based shuffle has lower memory usage and seems to outperform hash-based 
> in almost all of our testing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation

Reply via email to