[ https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114092#comment-14114092 ]
Josh Rosen commented on SPARK-3280:
-----------------------------------

Here are some numbers from August 10. If I recall correctly, this was running on 8 m3.8xlarge nodes. This test linearly scales a number of parameters (data set size, number of mappers and reducers, etc.). You can see that hash-based shuffle's performance degrades severely in cases where we have many mappers and reducers, while sort-based shuffle scales much more gracefully:

!http://i.imgur.com/rODzaG1.png!
!http://i.imgur.com/72kCkH5.png!

This was run with spark-perf; here's a sample config for one of the bars:

{code}
Java options: -Dspark.storage.memoryFraction=0.66 -Dspark.serializer=org.apache.spark.serializer.JavaSerializer -Dspark.locality.wait=60000000 -Dspark.shuffle.manager=org.apache.spark.shuffle.hash.HashShuffleManager
Options: aggregate-by-key-naive --num-trials=10 --inter-trial-wait=3 --num-partitions=400 --reduce-tasks=400 --random-seed=5 --persistent-type=memory --num-records=200000000 --unique-keys=20000 --key-length=10 --unique-values=1000000 --value-length=10 --storage-location=hdfs://:9000/spark-perf-kv-data
{code}

I'll try to run a better set of tests today. I plan to look at a few cases that these tests didn't address, including the performance impact when running on spinning disks, as well as jobs where we have a large dataset with few mappers and reducers (I think this is the case that we'd expect to be most favorable to hash-based shuffle).

> Made sort-based shuffle the default implementation
> --------------------------------------------------
>
>                 Key: SPARK-3280
>                 URL: https://issues.apache.org/jira/browse/SPARK-3280
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>
> Sort-based shuffle has lower memory usage and seems to outperform hash-based
> shuffle in almost all of our testing.
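As a minimal sketch of how a benchmark like the one above toggles between the two implementations: the spark-perf config sets `spark.shuffle.manager` via Java system properties, but the same property can be set programmatically through `SparkConf`. The app name and the specific property values here are illustrative, not taken from the benchmark; in Spark 1.1 the property accepts either the short names `hash`/`sort` or a fully-qualified class name as in the config above.

```scala
import org.apache.spark.SparkConf

object ShuffleConfigSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shuffle-benchmark") // hypothetical app name
      // Baseline under test in the config above (fully-qualified form):
      //   .set("spark.shuffle.manager", "org.apache.spark.shuffle.hash.HashShuffleManager")
      // Proposed default, using the short name:
      .set("spark.shuffle.manager", "sort")
      .set("spark.storage.memoryFraction", "0.66")
      .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
    println(conf.get("spark.shuffle.manager"))
  }
}
```

The equivalent on the command line would be `spark-submit --conf spark.shuffle.manager=sort …`, which is what the `-D` Java options in the spark-perf config amount to.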
--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org