[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation
[ https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123390#comment-14123390 ]

Andrew Ash commented on SPARK-3280:
-----------------------------------

[~joshrosen] do you have a theory for the cause of the dropoff between 2800 and 3200 partitions in your chart? My interpretation is that both shuffle implementations behave similarly in this scenario up to ~1600 partitions, after which the hash-based one starts falling behind; then there's another step difference at 3200, where it hits a severe dropoff.

I'm interested in the right third of the chart. A couple of theories:
- more partitions = more stuff in memory concurrently = GC pressure. Sort-based can stream and do a merge sort, but hash-based needs to build the hash all at once and then spill it
- more partitions = more concurrent spills = disk thrashing while writing to lots of files concurrently, exacerbated if the test was on spinnies instead of SSDs. Maybe sort-based merges spills while writing to disk, so it ends up writing fewer spill files concurrently.

Also, the chart is a little unclear: is the y-axis time in seconds?

> Made sort-based shuffle the default implementation
> --------------------------------------------------
>
>                 Key: SPARK-3280
>                 URL: https://issues.apache.org/jira/browse/SPARK-3280
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>         Attachments: hash-sort-comp.png
>
>
> sort-based shuffle has lower memory usage and seems to outperform hash-based
> in almost all of our testing.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
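The "lots of files" theory in the comment above can be put into rough numbers. The sketch below is only a back-of-the-envelope model, not a measurement. It assumes that hash-based shuffle (without file consolidation) writes one file per mapper/reducer pair, and that sort-based shuffle writes one data file plus one index file per map task. Neither assumption is stated in this thread, and the function names are hypothetical. Mappers and reducers are scaled together, as in the benchmark described in this ticket.

```python
def hash_shuffle_files(num_mappers: int, num_reducers: int) -> int:
    """Assumed model: each map task keeps one output file per reducer."""
    return num_mappers * num_reducers


def sort_shuffle_files(num_mappers: int) -> int:
    """Assumed model: each map task writes one data file and one index file."""
    return 2 * num_mappers


# Partition counts mentioned in the discussion; mappers == reducers here.
for partitions in (400, 1600, 3200):
    hash_files = hash_shuffle_files(partitions, partitions)
    sort_files = sort_shuffle_files(partitions)
    print(f"{partitions} partitions: hash={hash_files} files, sort={sort_files} files")
```

Under these assumptions the hash-based file count grows quadratically with the partition count while the sort-based count grows linearly, which would be consistent with hash-based shuffle falling behind only on the right side of the chart. It does not by itself explain the sharp step at 3200.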
[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation
[ https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114873#comment-14114873 ]

Burak Yavuz commented on SPARK-3280:
------------------------------------

I don't have as detailed a comparison as Josh has, but for MLlib algorithms, sort-based shuffle didn't show the performance boosts Josh has shown. 16 m3.2xlarge instances were used for these experiments. The difference here is that I used 128 partitions, far fewer than the number of partitions Josh has shown.

!hash-sort-comp.png!
[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation
[ https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114092#comment-14114092 ]

Josh Rosen commented on SPARK-3280:
-----------------------------------

Here are some numbers from August 10. If I recall, this was running on 8 m3.8xlarge nodes. This test linearly scales a bunch of parameters (data set size, numbers of mappers and reducers, etc). You can see that hash-based shuffle's performance degrades severely in cases where we have many mappers and reducers, while sort scales much more gracefully:

!http://i.imgur.com/rODzaG1.png!

!http://i.imgur.com/72kCkH5.png!

This was run with spark-perf; here's a sample config for one of the bars:

{code}
Java options: -Dspark.storage.memoryFraction=0.66 -Dspark.serializer=org.apache.spark.serializer.JavaSerializer -Dspark.locality.wait=6000 -Dspark.shuffle.manager=org.apache.spark.shuffle.hash.HashShuffleManager
Options: aggregate-by-key-naive --num-trials=10 --inter-trial-wait=3 --num-partitions=400 --reduce-tasks=400 --random-seed=5 --persistent-type=memory --num-records=2 --unique-keys=2 --key-length=10 --unique-values=100 --value-length=10 --storage-location=hdfs://:9000/spark-perf-kv-data
{code}

I'll try to run a better set of tests today. I plan to look at a few cases that these tests didn't address, including the performance impact when running on spinning disks, as well as jobs where we have a large dataset with few mappers and reducers (I think this is the case that we'd expect to be most favorable to hash-based shuffle).
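To compare the two implementations under otherwise identical settings, the Java options from the sample config above can be generated for each shuffle manager. This is a minimal sketch rather than part of spark-perf: the helper name `java_opts` is hypothetical, and the sort-based class name `org.apache.spark.shuffle.sort.SortShuffleManager` is an assumption, since only the hash-based class appears in the thread.

```python
# Options shared by both runs, copied from the sample config in the comment.
BASE_OPTS = {
    "spark.storage.memoryFraction": "0.66",
    "spark.serializer": "org.apache.spark.serializer.JavaSerializer",
    "spark.locality.wait": "6000",
}

SHUFFLE_MANAGERS = {
    # Class name taken from the sample config.
    "hash": "org.apache.spark.shuffle.hash.HashShuffleManager",
    # Assumed class name for the sort-based manager; not quoted in the thread.
    "sort": "org.apache.spark.shuffle.sort.SortShuffleManager",
}


def java_opts(shuffle: str) -> str:
    """Build the -D option string for one run; only the shuffle manager varies."""
    opts = dict(BASE_OPTS)
    opts["spark.shuffle.manager"] = SHUFFLE_MANAGERS[shuffle]
    return " ".join(f"-D{key}={value}" for key, value in opts.items())


for name in ("hash", "sort"):
    print(f"{name}: {java_opts(name)}")
```

Keeping every option except `spark.shuffle.manager` fixed means any difference between the two runs can be attributed to the shuffle implementation rather than to memory or serializer settings.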
[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation
[ https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114055#comment-14114055 ]

Reynold Xin commented on SPARK-3280:
------------------------------------

[~joshrosen] [~brkyvz] can you guys post the performance comparisons between sort and hash shuffle in this ticket?
[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation
[ https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113519#comment-14113519 ]

Apache Spark commented on SPARK-3280:
-------------------------------------

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2178