[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation

2014-09-05 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123390#comment-14123390
 ] 

Andrew Ash commented on SPARK-3280:
---

[~joshrosen] do you have a theory for the cause of the dropoff between 2800 and 
3200 partitions in your chart?  My interpretation is that both shuffle 
implementations behave similarly in this scenario up to ~1600 after which the 
hash based starts falling behind, then there's another step difference at 3200 
where it hits a severe dropoff.  I'm interested in the right third of the chart.

A couple theories:
- more partitions = more stuff in memory concurrently = GC pressure.  
Sort-based can stream and do merge sort, but hash-based needs to build the hash 
all at once then spill it
- more partitions = more concurrent spills = disk thrashing while writing to 
lots of files concurrently, exacerbated if the test was on spinnies instead of 
SSDs.  Maybe the sort-based merges spills while writing to disk so ends up 
writing fewer spill files concurrently.

Also the chart is a little unclear, is the y-axis time in seconds?

> Made sort-based shuffle the default implementation
> --
>
> Key: SPARK-3280
> URL: https://issues.apache.org/jira/browse/SPARK-3280
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Attachments: hash-sort-comp.png
>
>
> sort-based shuffle has lower memory usage and seems to outperform hash-based 
> in almost all of our testing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation

2014-08-28 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114873#comment-14114873
 ] 

Burak Yavuz commented on SPARK-3280:


I don't have as detailed a comparison like Josh has, but for MLlib algorithms, 
sort based shuffle didn't show the performance boosts Josh has shown. 16 
m3.2xlarge instances were used for these experiments. The difference here is 
that the number of partitions I used were 128. Much less than the number of 
partitions Josh has shown.

!hash-sort-comp.png!

> Made sort-based shuffle the default implementation
> --
>
> Key: SPARK-3280
> URL: https://issues.apache.org/jira/browse/SPARK-3280
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Attachments: hash-sort-comp.png
>
>
> sort-based shuffle has lower memory usage and seems to outperform hash-based 
> in almost all of our testing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation

2014-08-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114092#comment-14114092
 ] 

Josh Rosen commented on SPARK-3280:
---

Here are some numbers from August 10.  If I recall, this was running on 8 
m3.8xlarge nodes.  This test linearly scales a bunch of parameters (data set 
size, numbers of mappers and reducers, etc).  You can see that hash-based 
shuffle's performance degrades severely in cases where we have many mappers and 
reducers, while sort scales much more gracefully:

!http://i.imgur.com/rODzaG1.png!

!http://i.imgur.com/72kCkH5.png!

This was run with spark-perf; here's a sample config for one of the bars:

{code}
Java options: -Dspark.storage.memoryFraction=0.66 
-Dspark.serializer=org.apache.spark.serializer.JavaSerializer 
-Dspark.locality.wait=6000 
-Dspark.shuffle.manager=org.apache.spark.shuffle.hash.HashShuffleManager
Options: aggregate-by-key-naive --num-trials=10 --inter-trial-wait=3 
--num-partitions=400 --reduce-tasks=400 --random-seed=5 
--persistent-type=memory  --num-records=2 --unique-keys=2 
--key-length=10 --unique-values=100 --value-length=10  
--storage-location=hdfs://:9000/spark-perf-kv-data
{code}

I'll try to run a better set of tests today.  I plan to look at a few cases 
that these tests didn't address, including the performance impact when running 
on spinning disks, as well as jobs where we have a large dataset with few 
mappers and reducers (I think this is the case that we'd expect to be most 
favorable to hash-based shuffle).

> Made sort-based shuffle the default implementation
> --
>
> Key: SPARK-3280
> URL: https://issues.apache.org/jira/browse/SPARK-3280
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> sort-based shuffle has lower memory usage and seems to outperform hash-based 
> in almost all of our testing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation

2014-08-28 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114055#comment-14114055
 ] 

Reynold Xin commented on SPARK-3280:


[~joshrosen] [~brkyvz] can you guys post the performance comparisons between 
sort vs hash shuffle in this ticket?

> Made sort-based shuffle the default implementation
> --
>
> Key: SPARK-3280
> URL: https://issues.apache.org/jira/browse/SPARK-3280
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> sort-based shuffle has lower memory usage and seems to outperform hash-based 
> in almost all of our testing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation

2014-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113519#comment-14113519
 ] 

Apache Spark commented on SPARK-3280:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2178

> Made sort-based shuffle the default implementation
> --
>
> Key: SPARK-3280
> URL: https://issues.apache.org/jira/browse/SPARK-3280
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> sort-based shuffle has lower memory usage and seems to outperform hash-based 
> in almost all of our testing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org