[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-12-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037932#comment-15037932
 ] 

Sean Owen commented on SPARK-5081:
--

Is this still an issue? I'm trying to figure out whether we believe there is 
still something to do here. Note there have been some shuffle and snappy 
changes in between.

> Shuffle write increases
> ---
>
> Key: SPARK-5081
> URL: https://issues.apache.org/jira/browse/SPARK-5081
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.2.0
>Reporter: Kevin Jung
>Priority: Critical
> Attachments: Spark_Debug.pdf, diff.txt
>
>
> The size of the shuffle write shown in the Spark web UI is very different when I 
> execute the same Spark job with the same input data on Spark 1.1 and Spark 1.2. 
> At the sortBy stage, the shuffle write is 98.1MB in Spark 1.1 but 146.9MB 
> in Spark 1.2. 
> I set the spark.shuffle.manager option to hash because its default value 
> changed, but Spark 1.2 still writes more shuffle output than Spark 1.1. 
> This can sharply increase disk I/O overhead as the input file gets bigger, 
> and it causes jobs to take more time to complete. 
> In the case of about 100GB of input, for example, the shuffle write is 
> 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
> spark 1.1
> ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
> |9|saveAsTextFile| |1169.4KB| |
> |12|combineByKey| |1265.4KB|1275.0KB|
> |6|sortByKey| |1276.5KB| |
> |8|mapPartitions| |91.0MB|1383.1KB|
> |4|apply| |89.4MB| |
> |5|sortBy|155.6MB| |98.1MB|
> |3|sortBy|155.6MB| | |
> |1|collect| |2.1MB| |
> |2|mapValues|155.6MB| |2.2MB|
> |0|first|184.4KB| | |
> spark 1.2
> ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
> |12|saveAsTextFile| |1170.2KB| |
> |11|combineByKey| |1264.5KB|1275.0KB|
> |8|sortByKey| |1273.6KB| |
> |7|mapPartitions| |134.5MB|1383.1KB|
> |5|zipWithIndex| |132.5MB| |
> |4|sortBy|155.6MB| |146.9MB|
> |3|sortBy|155.6MB| | |
> |2|collect| |2.0MB| |
> |1|mapValues|155.6MB| |2.2MB|
> |0|first|184.4KB| | |



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-12-03 Thread Dr. Christian Betz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15038535#comment-15038535
 ] 

Dr. Christian Betz commented on SPARK-5081:
---

I'm not sure whether there's still an issue in the old version, but in Spark 
1.5.1 everything is back to normal, so from my side it seems to be OK. Given 
that it wasn't possible to reduce it to a minimal reproducible test case 
showing which of Spark, Snappy, or some third-party library caused it with a 
(smaller) transformation DAG, I'd vote for closing this issue. But thanks for 
asking again, I really appreciate that :)




[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-06-15 Thread Roi Reshef (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585641#comment-14585641
 ] 

Roi Reshef commented on SPARK-5081:
---

Hi guys,

Was this issue ever solved, by any chance? I'm using Spark 1.3.1 for training 
in an iterative fashion. Since implementing a ranking measure (which ultimately 
uses sortBy), I'm experiencing similar problems. It seems that my cache explodes 
after ~100 iterations and crashes the server with a "There is insufficient 
memory for the Java Runtime Environment to continue" message. Note that it 
isn't supposed to persist the sorted vectors, nor to use them in the following 
iterations, so I wonder why memory consumption keeps growing with each 
iteration.




[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-05-13 Thread Dr. Christian Betz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542152#comment-14542152
 ] 

Dr. Christian Betz commented on SPARK-5081:
---

Yes, I think so too. Just tell me if you need further info.








[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-05-13 Thread Dr. Christian Betz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541576#comment-14541576
 ] 

Dr. Christian Betz commented on SPARK-5081:
---

Hi,

I can now say that this was not a fix!

Here's a single task from that count stage:

||Index ▾||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch 
Time||Duration||GC Time||Input Size / Records||Shuffle Read Size / 
Records||Shuffle Spill (Memory)||Shuffle Spill (Disk)||Errors||
|11|128|0|SUCCESS|PROCESS_LOCAL|3 / marvin.pmd.local|2015/05/13 09:55:08|3.9 
min|15 s|352.3 KB (memory) / 1171|10.0 MB / 854977|2.1 GB|113.5 MB|

As you can see, for an input size plus shuffle read size of approximately 
10.0 MB, it produces a shuffle spill of 2.1 GB to memory and about 114 MB to disk.

Here's the debugString of the RDD that I'm counting in this task:

{noformat}
(20) MapPartitionsRDD[64] at mapValues at NativeMethodAccessorImpl.java:-2 
[Disk Memory Deserialized 1x Replicated]
 |CachedPartitions: 20; MemorySize: 181.7 MB; TachyonSize: 0.0 B; 
DiskSize: 360.7 MB
 |   MapPartitionsRDD[63] at mapPartitionsToPair at 
NativeMethodAccessorImpl.java:-2 [Disk Memory Deserialized 1x Replicated]
 |   MapPartitionsRDD[62] at mapValues at NativeMethodAccessorImpl.java:-2 
[Disk Memory Deserialized 1x Replicated]
 |   MapPartitionsRDD[61] at leftOuterJoin at NativeMethodAccessorImpl.java:-2 
[Disk Memory Deserialized 1x Replicated]
 |   MapPartitionsRDD[60] at leftOuterJoin at NativeMethodAccessorImpl.java:-2 
[Disk Memory Deserialized 1x Replicated]
 |   MapPartitionsRDD[59] at leftOuterJoin at NativeMethodAccessorImpl.java:-2 
[Disk Memory Deserialized 1x Replicated]
 |   CoGroupedRDD[58] at leftOuterJoin at NativeMethodAccessorImpl.java:-2 
[Disk Memory Deserialized 1x Replicated]
 |   MapPartitionsRDD[57] at mapPartitionsToPair at 
NativeMethodAccessorImpl.java:-2 [Disk Memory Deserialized 1x Replicated]
 |   MapPartitionsRDD[56] at mapPartitionsToPair at 
NativeMethodAccessorImpl.java:-2 [Disk Memory Deserialized 1x Replicated]
 |   MapPartitionsRDD[55] at mapValues at NativeMethodAccessorImpl.java:-2 
[Disk Memory Deserialized 1x Replicated]
 |   MapPartitionsRDD[54] at filter at NativeMethodAccessorImpl.java:-2 [Disk 
Memory Deserialized 1x Replicated]
 |   MapPartitionsRDD[53] at join at NativeMethodAccessorImpl.java:-2 [Disk 
Memory Deserialized 1x Replicated]
 |   MapPartitionsRDD[52] at join at NativeMethodAccessorImpl.java:-2 [Disk 
Memory Deserialized 1x Replicated]
 |   CoGroupedRDD[51] at join at NativeMethodAccessorImpl.java:-2 [Disk Memory 
Deserialized 1x Replicated]
 |   MapPartitionsRDD[50] at mapPartitionsToPair at 
NativeMethodAccessorImpl.java:-2 [Disk Memory Deserialized 1x Replicated]
 |   MapPartitionsRDD[49] at mapValues at NativeMethodAccessorImpl.java:-2 
[Disk Memory Deserialized 1x Replicated]
 |   MapPartitionsRDD[48] at cogroup at core.clj:310 [Disk Memory Deserialized 
1x Replicated]
 |   MapPartitionsRDD[47] at cogroup at core.clj:310 [Disk Memory Deserialized 
1x Replicated]
 |   CoGroupedRDD[46] at cogroup at core.clj:310 [Disk Memory Deserialized 1x 
Replicated]
 |   PartitionerAwareUnionRDD[45] at PartitionerAwareUnionRDD at core.clj:326 
[Disk Memory Deserialized 1x Replicated]
 |   MapPartitionsRDD[42] at filter at NativeMethodAccessorImpl.java:-2 [Disk 
Memory Deserialized 1x Replicated]
 |   MapPartitionsRDD[41] at mapValues at NativeMethodAccessorImpl.java:-2 
[Disk Memory Deserialized 1x Replicated]
 |   ShuffledRDD[31] at partitionBy at core.clj:489 [Disk Memory Deserialized 
1x Replicated]
 +-(31) MapPartitionsRDD[30] at mapToPair at NativeMethodAccessorImpl.java:-2 
[Disk Memory Deserialized 1x Replicated]
|   CoalescedRDD[29] at coalesce at NativeMethodAccessorImpl.java:-2 [Disk 
Memory Deserialized 1x Replicated]
|   MapPartitionsRDD[28] at filter at NativeMethodAccessorImpl.java:-2 
[Disk Memory Deserialized 1x Replicated]
|   MapPartitionsRDD[27] at map at NativeMethodAccessorImpl.java:-2 [Disk 
Memory Deserialized 1x Replicated]
|   hdfs://host/path.avro NewHadoopRDD[26] at newAPIHadoopFile at 
hadoopAvro.clj:24 [Disk Memory Deserialized 1x Replicated]
 |   MapPartitionsRDD[44] at filter at NativeMethodAccessorImpl.java:-2 [Disk 
Memory Deserialized 1x Replicated]
 |   MapPartitionsRDD[43] at mapValues at NativeMethodAccessorImpl.java:-2 
[Disk Memory Deserialized 1x Replicated]
 |   ShuffledRDD[40] at partitionBy at core.clj:489 [Disk Memory Deserialized 
1x Replicated]
 +-(10) MapPartitionsRDD[39] at mapToPair at NativeMethodAccessorImpl.java:-2 
[Disk Memory Deserialized 1x Replicated]
|   JdbcRDD[38] at JdbcRDD at jdbc.clj:64 [Disk Memory Deserialized 1x 
Replicated]
 |   ShuffledRDD[34] at partitionBy at core.clj:489 [Disk Memory Deserialized 
1x Replicated]
 +-(10) MapPartitionsRDD[33] at mapToPair at NativeMethodAccessorImpl.java:-2 
[Disk Memory Deserialized 1x Replicated]
{noformat}
  

[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-05-13 Thread Dr. Christian Betz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541580#comment-14541580
 ] 

Dr. Christian Betz commented on SPARK-5081:
---

@[~pwendell]: Please reopen. :(




[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-04-21 Thread Dr. Christian Betz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504592#comment-14504592
 ] 

Dr. Christian Betz commented on SPARK-5081:
---

Hi [~pwendell],

I hope this is true - however, I'm not so sure, as I found this has nothing to 
do with Snappy. I switched to lz4 (as stated in the comment 
https://issues.apache.org/jira/browse/SPARK-5081?focusedCommentId=14324089&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14324089).

Do you use Snappy in a place that isn't affected by the configuration 
spark.io.compression.codec?

Fingers crossed,

Chris
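
For reference, the codec and shuffle-manager switches discussed in this thread 
are ordinary Spark configuration settings. A sketch of the relevant 
spark-defaults.conf lines (the values shown are the ones mentioned in this 
thread, not a recommendation):

{noformat}
# Use lz4 instead of the default Snappy for shuffle/IO compression
spark.io.compression.codec   lz4
# Force the pre-1.2 hash-based shuffle manager (the reporter's workaround)
spark.shuffle.manager        hash
{noformat}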




[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-04-14 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494478#comment-14494478
 ] 

Josh Rosen commented on SPARK-5081:
---

The snappy-java issue has been fixed upstream (see 
https://github.com/xerial/snappy-java/pull/102) and a new release has been 
published to Maven, so I've opened SPARK-6905 / 
https://github.com/apache/spark/pull/5512 to upgrade to that new release.  
Hopefully that will fix this shuffle size increase, but let's wait and see.




[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-04-08 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486601#comment-14486601
 ] 

Josh Rosen commented on SPARK-5081:
---

I wrote a microbenchmark for snappy-java which shows an increase in compressed 
data sizes between the pre- and post-1.2 Snappy versions. I ran a bisect across 
published snappy-java releases and think that I've narrowed the problem down to 
a single patch.

I've opened https://github.com/xerial/snappy-java/issues/100 to investigate 
this upstream.

Note that this may not end up fully explaining this issue, since there could be 
multiple contributors to the shuffle file write size increase, but it seems 
suspicious.
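
The microbenchmark itself isn't attached here. As a rough illustration of the 
approach (using Python's stdlib zlib as a stand-in codec, since snappy-java is 
a JVM library; the payload and configurations are made up), the idea is to 
compress identical data under two codec configurations and compare output sizes:

```python
import zlib

# Hypothetical stand-in for the snappy-java size microbenchmark described
# above: compress the same payload under two codec configurations and
# compare the compressed sizes. zlib's compression levels play the role
# of the two Snappy versions here.
payload = bytes(range(256)) * 4096  # ~1 MB of mildly repetitive data

size_a = len(zlib.compress(payload, level=9))  # "old" configuration
size_b = len(zlib.compress(payload, level=1))  # "new" configuration

print(f"config A: {size_a} bytes, config B: {size_b} bytes")
# A regression like the one reported in this thread would show one
# configuration producing markedly larger output for the same input.
```

Bisecting then means re-running this comparison against each published codec 
release until the size jump is pinned to a single version.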




[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-03-20 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372266#comment-14372266
 ] 

Patrick Wendell commented on SPARK-5081:


Hey @cbbetz - the last movement on this is that I reached out to the snappy 
author and asked whether our upgrading of snappy could have resulted in 
different sizes of compressed intermediate data. However, he was fairly adamant 
that this is not the case.

Unfortunately, the reports here are somewhat inconsistent and there is no 
simple reproduction of this issue. In fact, I think it's likely there are 
multiple different things being discussed in this thread.

The way this can move forward is if someone is able to create a small 
reproduction that can be run by a Spark developer, then we can dig in and see 
what's going on. A reproduction would ideally demonstrate a verifiable 
regression between two versions of the upstream release, for instance showing 
much larger shuffle files, given the same input.




[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-03-17 Thread Dr. Christian Betz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364813#comment-14364813
 ] 

Dr. Christian Betz commented on SPARK-5081:
---

Hi [~pwendell], obviously nobody is looking into this issue. Could you 
please clarify, assign, or do whatever it takes to get it handled in a 
future version of Spark?

That'd be great! Thanks!!!

:)




[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-27 Thread Dr. Christian Betz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340089#comment-14340089
 ] 

Dr. Christian Betz commented on SPARK-5081:
---

OK, I can now narrow the thread-spilling (in-memory map) issue down to a 
difference between 1.1.0-cdh5.2.0 and 1.1.0. With 1.1.0-cdh5.2.0 everything 
is fine; with 1.1.0 I get thread spilling and longer runtimes.


Remember: this is the symptom:
2015-02-27 13:33:41.221 [Executor task launch worker-6   ] INFO  
org.apache.spark.util.collection.ExternalAppendOnlyMap  : Thread 109 
spilling in-memory map of 0 MB to disk (9 times so far)
2015-02-27 13:33:41.501 [Executor task launch worker-6   ] INFO  
org.apache.spark.util.collection.ExternalAppendOnlyMap  : Thread 109 
spilling in-memory map of 0 MB to disk (10 times so far)
2015-02-27 13:33:41.742 [Executor task launch worker-2   ] INFO  
org.apache.spark.util.collection.ExternalAppendOnlyMap  : Thread 77 
spilling in-memory map of 27 MB to disk (1 time so far)
2015-02-27 13:33:41.811 [Executor task launch worker-6   ] INFO  
org.apache.spark.util.collection.ExternalAppendOnlyMap  : Thread 109 
spilling in-memory map of 0 MB to disk (11 times so far)
2015-02-27 13:33:42.110 [Executor task launch worker-6   ] INFO  
org.apache.spark.util.collection.ExternalAppendOnlyMap  : Thread 109 
spilling in-memory map of 0 MB to disk (12 times so far)
2015-02-27 13:33:42.398 [Executor task launch worker-6   ] INFO  
org.apache.spark.util.collection.ExternalAppendOnlyMap  : Thread 109 
spilling in-memory map of 0 MB to disk (13 times so far)
2015-02-27 13:33:42.663 [Executor task launch worker-6   ] INFO  
org.apache.spark.util.collection.ExternalAppendOnlyMap  : Thread 109 
spilling in-memory map of 0 MB to disk (14 times so far)
2015-02-27 13:33:42.704 [Executor task launch worker-2   ] INFO  
org.apache.spark.storage.BlockManager   : Found block 
rdd_3_33 locally
2015-02-27 13:33:43.045 [Executor task launch worker-6   ] INFO  
org.apache.spark.util.collection.ExternalAppendOnlyMap  : Thread 109 
spilling in-memory map of 0 MB to disk (15 times so far)
2015-02-27 13:33:43.367 [Executor task launch worker-6   ] INFO  
org.apache.spark.util.collection.ExternalAppendOnlyMap  : Thread 109 
spilling in-memory map of 0 MB to disk (16 times so far)
2015-02-27 13:33:43.637 [Executor task launch worker-6   ] INFO  
org.apache.spark.util.collection.ExternalAppendOnlyMap  : Thread 109 
spilling in-memory map of 0 MB to disk (17 times so far)
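A quick way to quantify this symptom is to tally the spill lines per thread. A minimal standalone sketch (assuming the single-line form of the log messages shown above; no Spark required):

```python
import re

# Tally ExternalAppendOnlyMap spill events per thread, to spot threads that
# keep spilling near-empty (0 MB) maps.
SPILL_RE = re.compile(
    r"Thread (\d+) spilling in-memory map of (\d+) MB to disk \((\d+) times? so far\)"
)

def tally_spills(log: str) -> dict:
    """Return {thread_id: (spill_count, total_mb_spilled)}."""
    stats = {}
    for tid, mb, _ in SPILL_RE.findall(log):
        count, total = stats.get(tid, (0, 0))
        stats[tid] = (count + 1, total + int(mb))
    return stats

log = """\
Thread 109 spilling in-memory map of 0 MB to disk (9 times so far)
Thread 77 spilling in-memory map of 27 MB to disk (1 time so far)
Thread 109 spilling in-memory map of 0 MB to disk (10 times so far)
"""
print(tally_spills(log))  # {'109': (2, 0), '77': (1, 27)}
```

Run against the full executor log, a thread with a high count but near-zero total MB (like Thread 109 above) is the pathological case described here.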


 Shuffle write increases
 ---

 Key: SPARK-5081
 URL: https://issues.apache.org/jira/browse/SPARK-5081
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.2.0
Reporter: Kevin Jung
Priority: Critical
 Attachments: Spark_Debug.pdf


 The size of the shuffle write shown in the Spark web UI differs greatly when I 
 execute the same Spark job with the same input data on Spark 1.1 and Spark 1.2. 
 At the sortBy stage, the shuffle write is 98.1MB in Spark 1.1 but 146.9MB in 
 Spark 1.2. 
 I set the spark.shuffle.manager option to hash because its default value 
 changed, but Spark 1.2 still writes more shuffle output than Spark 1.1.
 This sharply increases disk I/O overhead as the input file gets bigger and 
 makes the jobs take longer to complete. 
 For example, with about 100GB of input, the shuffle write is 39.7GB in Spark 
 1.1 but 91.0GB in Spark 1.2.
 spark 1.1
 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
 |9|saveAsTextFile| |1169.4KB| |
 |12|combineByKey| |1265.4KB|1275.0KB|
 |6|sortByKey| |1276.5KB| |
 |8|mapPartitions| |91.0MB|1383.1KB|
 |4|apply| |89.4MB| |
 |5|sortBy|155.6MB| |98.1MB|
 |3|sortBy|155.6MB| | |
 |1|collect| |2.1MB| |
 |2|mapValues|155.6MB| |2.2MB|
 |0|first|184.4KB| | |
 spark 1.2
 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
 |12|saveAsTextFile| |1170.2KB| |
 |11|combineByKey| |1264.5KB|1275.0KB|
 |8|sortByKey| |1273.6KB| |
 |7|mapPartitions| |134.5MB|1383.1KB|
 |5|zipWithIndex| |132.5MB| |
 |4|sortBy|155.6MB| |146.9MB|
 |3|sortBy|155.6MB| | |
 |2|collect| |2.0MB| |
 |1|mapValues|155.6MB| |2.2MB|
 |0|first|184.4KB| | |
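For scale, the growth factors implied by the numbers in the tables above (plain arithmetic on the reported sizes, no Spark required):

```python
# Shuffle-write sizes reported in the issue: the sortBy stage (MB) and the
# ~100GB-input run (GB), Spark 1.1 vs Spark 1.2.
sortby_11, sortby_12 = 98.1, 146.9
large_11, large_12 = 39.7, 91.0

print(f"sortBy stage: {sortby_12 / sortby_11:.2f}x")  # 1.50x
print(f"~100GB input: {large_12 / large_11:.2f}x")    # 2.29x
```

So the overhead is not a constant factor: the small run grows ~1.5x while the large run grows ~2.3x, which is why the reporter flags worsening behavior as input size increases.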






[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-25 Thread Dr. Christian Betz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336366#comment-14336366
 ] 

Dr. Christian Betz commented on SPARK-5081:
---

It's not the old asm reference. Running 1.2.1 with asm excluded shows the same 
behavior.




[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-17 Thread Dr. Christian Betz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324089#comment-14324089
 ] 

Dr. Christian Betz commented on SPARK-5081:
---

Hi [~pwendell],

thanks for having a look into this. I did some tests and collected everything 
in the document attached ([^Spark_Debug.pdf]). I commented some things using 
PDF comments, so make sure you see those. Chrome PDF Viewer doesn't support 
them :(

That's the thing:
* Basically, hash and nio do not make a difference.
* Same is true for snappy/lz4.

Just send me a note when you need some more info.

Sincerely,

Chris





[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-17 Thread Dr. Christian Betz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324130#comment-14324130
 ] 

Dr. Christian Betz commented on SPARK-5081:
---

I was checking my assumption that the CDH version and the stock Spark 1.1.0 
version show the same behavior. That holds for the shuffle-spill metrics: Spark 
1.1.0 also reports

||Shuffle Spill (Memory)||Shuffle Spill (Disk)||
|0.0 B|0.0 B|

However, I see lots of small spills like this:

org.apache.spark.util.collection.ExternalAppendOnlyMap : Thread 78 spilling in-memory map of 1 MB to disk (322 times so far)

These tasks take roughly the same time as under 1.2.x (several minutes instead 
of tens of seconds).

Several hypotheses follow from here:
* Spark 1.1.0 does not report the same shuffle spills to memory and disk as 
Spark 1.2.1 does, which misleads us.
* It's not a Spark code issue but one caused by changed dependencies. I'm 
running Spark 1.1.0 against hadoop-client/hdfs 2.6.0, so it might come from 
there.

Here's the diff between the two runs:

Classpath entries only in CDH-Version:
 * /com/jamesmurty/utils/java-xmlbuilder/0.4/java-xmlbuilder-0.4.jar

Classpath entries only in 1.1.0/Hadoop-2.6.0-version:
 * /asm/asm/3.1/asm-3.1.jar
 * /com/sun/jersey/jersey-server/1.9/jersey-server-1.9.jar
 * /commons-daemon/commons-daemon/1.0.13/commons-daemon-1.0.13.jar
 * /commons-el/commons-el/1.0/commons-el-1.0.jar
 * /javax/servlet/jsp/jsp-api/2.1/jsp-api-2.1.jar
 * /org/htrace/htrace-core/3.0.4/htrace-core-3.0.4.jar
 * /tomcat/jasper-runtime/5.5.23/jasper-runtime-5.5.23.jar
 * /xerces/xercesImpl/2.9.1/xercesImpl-2.9.1.jar
 * /xml-apis/xml-apis/1.3.04/xml-apis-1.3.04.jar


Classpath entries with changes:
 * /org/apache/spark/spark-core_2.10/1.1.0/spark-core_2.10-1.1.0.jar - 
/org/apache/spark/spark-core_2.10/1.1.0-cdh5.2.0/spark-core_2.10-1.1.0-cdh5.2.0.jar
 * /org/codehaus/jackson/jackson-core-asl/1.9.13/jackson-core-asl-1.9.13.jar - 
1.8.8 (same with other jackson libs)
 * /org/codehaus/jackson/jackson-jaxrs/1.9.13/jackson-jaxrs-1.9.13.jar
 * /org/codehaus/jackson/jackson-mapper-asl/1.9.13/jackson-mapper-asl-1.9.13.jar
 * /org/codehaus/jackson/jackson-xc/1.9.13/jackson-xc-1.9.13.jar


 * /net/java/dev/jets3t/jets3t/0.7.1/jets3t-0.7.1.jar - 
/net/java/dev/jets3t/jets3t/0.9.0/jets3t-0.9.0.jar
 * /org/apache/httpcomponents/httpclient/4.2.5/httpclient-4.2.5.jar - 
/org/apache/httpcomponents/httpclient/4.1.2/httpclient-4.1.2.jar
 * /org/apache/httpcomponents/httpcore/4.2.4/httpcore-4.2.4.jar - 
/org/apache/httpcomponents/httpcore/4.1.2/httpcore-4.1.2.jar


 * /org/apache/hadoop/hadoop-annotations/2.6.0/hadoop-annotations-2.6.0.jar - 
...-2.5.0-cdh5.2.0.jar (same below)
 * /org/apache/hadoop/hadoop-auth/2.6.0/hadoop-auth-2.6.0.jar
 * /org/apache/hadoop/hadoop-client/2.6.0/hadoop-client-2.6.0.jar
 * /org/apache/hadoop/hadoop-common/2.6.0/hadoop-common-2.6.0.jar
 * /org/apache/hadoop/hadoop-hdfs/2.6.0/hadoop-hdfs-2.6.0.jar
 * 
/org/apache/hadoop/hadoop-mapreduce-client-app/2.6.0/hadoop-mapreduce-client-app-2.6.0.jar
 * 
/org/apache/hadoop/hadoop-mapreduce-client-common/2.6.0/hadoop-mapreduce-client-common-2.6.0.jar
 * 
/org/apache/hadoop/hadoop-mapreduce-client-core/2.6.0/hadoop-mapreduce-client-core-2.6.0.jar
 * 
/org/apache/hadoop/hadoop-mapreduce-client-jobclient/2.6.0/hadoop-mapreduce-client-jobclient-2.6.0.jar
 * 
/org/apache/hadoop/hadoop-mapreduce-client-shuffle/2.6.0/hadoop-mapreduce-client-shuffle-2.6.0.jar
 * /org/apache/hadoop/hadoop-yarn-api/2.6.0/hadoop-yarn-api-2.6.0.jar
 * /org/apache/hadoop/hadoop-yarn-client/2.6.0/hadoop-yarn-client-2.6.0.jar
 * /org/apache/hadoop/hadoop-yarn-common/2.6.0/hadoop-yarn-common-2.6.0.jar
 * 
/org/apache/hadoop/hadoop-yarn-server-common/2.6.0/hadoop-yarn-server-common-2.6.0.jar
Here's my dependency list (from WebUI) for Spark 1.1.0 with hadoop 2.6.0:

/asm/asm/3.1/asm-3.1.jar
/cheshire/cheshire/5.3.1/cheshire-5.3.1.jar
/cider/cider-nrepl/0.8.2/cider-nrepl-0.8.2.jar
/clj-logging-config/clj-logging-config/1.9.12/clj-logging-config-1.9.12.jar
/clj-time/clj-time/0.8.0/clj-time-0.8.0.jar
/cljs-tooling/cljs-tooling/0.1.3/cljs-tooling-0.1.3.jar
/colt/colt/1.2.0/colt-1.2.0.jar
/com/clearspring/analytics/stream/2.7.0/stream-2.7.0.jar
/com/codahale/metrics/metrics-core/3.0.0/metrics-core-3.0.0.jar
/com/codahale/metrics/metrics-graphite/3.0.0/metrics-graphite-3.0.0.jar
/com/codahale/metrics/metrics-json/3.0.0/metrics-json-3.0.0.jar
/com/codahale/metrics/metrics-jvm/3.0.0/metrics-jvm-3.0.0.jar
/com/damballa/abracad/0.4.11/abracad-0.4.11.jar
/com/damballa/parkour/0.6.1/parkour-0.6.1.jar
/com/esotericsoftware/kryo/kryo/2.21/kryo-2.21.jar
/com/esotericsoftware/minlog/minlog/1.2/minlog-1.2.jar
/com/esotericsoftware/reflectasm/reflectasm/1.07/reflectasm-1.07-shaded.jar
/com/fasterxml/jackson/core/jackson-annotations/2.3.0/jackson-annotations-2.3.0.jar
/com/fasterxml/jackson/core/jackson-core/2.3.1/jackson-core-2.3.1.jar

[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-17 Thread Dr. Christian Betz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324137#comment-14324137
 ] 

Dr. Christian Betz commented on SPARK-5081:
---

And that's the diff from Spark-1.1.0-CDH to Spark 1.1.0 with Hadoop-2.5.0:

diff Spark-1.1.0-Hadoop-2.5.0.txt Spark-1.1.0-CDH5.2.0.txt 
1d0
< /asm/asm/3.1/asm-3.1.jar
22a22
> /com/google/code/gson/gson/2.2.4/gson-2.2.4.jar
24a25
> /com/jamesmurty/utils/java-xmlbuilder/0.4/java-xmlbuilder-0.4.jar
28d28
< /com/sun/jersey/jersey-server/1.9/jersey-server-1.9.jar
40d39
< /commons-daemon/commons-daemon/1.0.13/commons-daemon-1.0.13.jar
42d40
< /commons-el/commons-el/1.0/commons-el-1.0.jar
55d52
< /javax/servlet/jsp/jsp-api/2.1/jsp-api-2.1.jar
62c59
< /net/java/dev/jets3t/jets3t/0.7.1/jets3t-0.7.1.jar
---
> /net/java/dev/jets3t/jets3t/0.9.0/jets3t-0.9.0.jar
72c69
< /org/apache/curator/curator-client/2.4.0/curator-client-2.4.0.jar
---
> /org/apache/curator/curator-client/2.6.0/curator-client-2.6.0.jar
79,94c76,91
< /org/apache/hadoop/hadoop-annotations/2.5.0/hadoop-annotations-2.5.0.jar
< /org/apache/hadoop/hadoop-auth/2.5.0/hadoop-auth-2.5.0.jar
< /org/apache/hadoop/hadoop-client/2.5.0/hadoop-client-2.5.0.jar
< /org/apache/hadoop/hadoop-common/2.5.0/hadoop-common-2.5.0.jar
< /org/apache/hadoop/hadoop-hdfs/2.5.0/hadoop-hdfs-2.5.0.jar
< /org/apache/hadoop/hadoop-mapreduce-client-app/2.5.0/hadoop-mapreduce-client-app-2.5.0.jar
< /org/apache/hadoop/hadoop-mapreduce-client-common/2.5.0/hadoop-mapreduce-client-common-2.5.0.jar
< /org/apache/hadoop/hadoop-mapreduce-client-core/2.5.0/hadoop-mapreduce-client-core-2.5.0.jar
< /org/apache/hadoop/hadoop-mapreduce-client-jobclient/2.5.0/hadoop-mapreduce-client-jobclient-2.5.0.jar
< /org/apache/hadoop/hadoop-mapreduce-client-shuffle/2.5.0/hadoop-mapreduce-client-shuffle-2.5.0.jar
< /org/apache/hadoop/hadoop-yarn-api/2.5.0/hadoop-yarn-api-2.5.0.jar
< /org/apache/hadoop/hadoop-yarn-client/2.5.0/hadoop-yarn-client-2.5.0.jar
< /org/apache/hadoop/hadoop-yarn-common/2.5.0/hadoop-yarn-common-2.5.0.jar
< /org/apache/hadoop/hadoop-yarn-server-common/2.5.0/hadoop-yarn-server-common-2.5.0.jar
< /org/apache/httpcomponents/httpclient/4.2.5/httpclient-4.2.5.jar
< /org/apache/httpcomponents/httpcore/4.2.4/httpcore-4.2.4.jar
---
> /org/apache/hadoop/hadoop-annotations/2.5.0-cdh5.2.0/hadoop-annotations-2.5.0-cdh5.2.0.jar
> /org/apache/hadoop/hadoop-auth/2.5.0-cdh5.2.0/hadoop-auth-2.5.0-cdh5.2.0.jar
> /org/apache/hadoop/hadoop-client/2.5.0-cdh5.2.0/hadoop-client-2.5.0-cdh5.2.0.jar
> /org/apache/hadoop/hadoop-common/2.5.0-cdh5.2.0/hadoop-common-2.5.0-cdh5.2.0.jar
> /org/apache/hadoop/hadoop-hdfs/2.5.0-cdh5.2.0/hadoop-hdfs-2.5.0-cdh5.2.0.jar
> /org/apache/hadoop/hadoop-mapreduce-client-app/2.5.0-cdh5.2.0/hadoop-mapreduce-client-app-2.5.0-cdh5.2.0.jar
> /org/apache/hadoop/hadoop-mapreduce-client-common/2.5.0-cdh5.2.0/hadoop-mapreduce-client-common-2.5.0-cdh5.2.0.jar
> /org/apache/hadoop/hadoop-mapreduce-client-core/2.5.0-cdh5.2.0/hadoop-mapreduce-client-core-2.5.0-cdh5.2.0.jar
> /org/apache/hadoop/hadoop-mapreduce-client-jobclient/2.5.0-cdh5.2.0/hadoop-mapreduce-client-jobclient-2.5.0-cdh5.2.0.jar
> /org/apache/hadoop/hadoop-mapreduce-client-shuffle/2.5.0-cdh5.2.0/hadoop-mapreduce-client-shuffle-2.5.0-cdh5.2.0.jar
> /org/apache/hadoop/hadoop-yarn-api/2.5.0-cdh5.2.0/hadoop-yarn-api-2.5.0-cdh5.2.0.jar
> /org/apache/hadoop/hadoop-yarn-client/2.5.0-cdh5.2.0/hadoop-yarn-client-2.5.0-cdh5.2.0.jar
> /org/apache/hadoop/hadoop-yarn-common/2.5.0-cdh5.2.0/hadoop-yarn-common-2.5.0-cdh5.2.0.jar
> /org/apache/hadoop/hadoop-yarn-server-common/2.5.0-cdh5.2.0/hadoop-yarn-server-common-2.5.0-cdh5.2.0.jar
> /org/apache/httpcomponents/httpclient/4.1.2/httpclient-4.1.2.jar
> /org/apache/httpcomponents/httpcore/4.1.2/httpcore-4.1.2.jar
96c93
< /org/apache/spark/spark-core_2.10/1.1.0/spark-core_2.10-1.1.0.jar
---
> /org/apache/spark/spark-core_2.10/1.1.0-cdh5.2.0/spark-core_2.10-1.1.0-cdh5.2.0.jar
111,114c108,111
< /org/codehaus/jackson/jackson-core-asl/1.9.13/jackson-core-asl-1.9.13.jar
< /org/codehaus/jackson/jackson-jaxrs/1.9.13/jackson-jaxrs-1.9.13.jar
< /org/codehaus/jackson/jackson-mapper-asl/1.9.13/jackson-mapper-asl-1.9.13.jar
< /org/codehaus/jackson/jackson-xc/1.9.13/jackson-xc-1.9.13.jar
---
> /org/codehaus/jackson/jackson-core-asl/1.8.8/jackson-core-asl-1.8.8.jar
> /org/codehaus/jackson/jackson-jaxrs/1.8.8/jackson-jaxrs-1.8.8.jar
> /org/codehaus/jackson/jackson-mapper-asl/1.8.8/jackson-mapper-asl-1.8.8.jar
> /org/codehaus/jackson/jackson-xc/1.8.8/jackson-xc-1.8.8.jar
157d153
< /tomcat/jasper-runtime/5.5.23/jasper-runtime-5.5.23.jar


[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-17 Thread Dr. Christian Betz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324154#comment-14324154
 ] 

Dr. Christian Betz commented on SPARK-5081:
---

That's Spark 1.1.0 with Hadoop 2.5.0, supplementing the attached document 
[^Spark_Debug.pdf]:

It logs a lot of spilling from 
org.apache.spark.util.collection.ExternalAppendOnlyMap.
Performance of tasks in Stage 10 is low (minutes, not tens of seconds).
No shuffle spill according to the WebUI:

*Details for Stage 10*
Total task time across all tasks: 0 ms

*Summary Metrics for 3 Completed Tasks*

||Metric||Min||25th percentile||Median||75th percentile||Max||
|Result serialization time|0 ms|0 ms|0 ms|0 ms|0 ms|
|Duration|4,8 min|4,8 min|5,0 min|5,0 min|5,0 min|
|Time spent fetching task results|0 ms|0 ms|0 ms|0 ms|0 ms|
|Scheduler delay|33 ms|33 ms|34 ms|45 ms|45 ms|

*Aggregated Metrics by Executor*

||Executor ID||Address||Task Time||Total Tasks||Failed Tasks||Succeeded Tasks||Input||Shuffle Read||Shuffle Write||Shuffle Spill (Memory)||Shuffle Spill (Disk)||
|localhost|CANNOT FIND ADDRESS|15 min|3|0|3|0.0 B|0.0 B|0.0 B|0.0 B|0.0 B|

*Tasks*

||Index||ID||Attempt||Status||Locality Level||Executor||Launch Time||Duration||GC Time||Accumulators||Errors||
|0|291|0|SUCCESS|ANY|localhost|2015/02/17 13:48:39|4,8 min|35 s| | |
|1|292|0|SUCCESS|ANY|localhost|2015/02/17 13:48:39|5,0 min|35 s| | |
|2|293|0|SUCCESS|ANY|localhost|2015/02/17 13:48:39|5,0 min|35 s| | |







[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-17 Thread Dr. Christian Betz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324159#comment-14324159
 ] 

Dr. Christian Betz commented on SPARK-5081:
---

I'm out for the moment, going back to a working set of dependencies/settings.




[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-16 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323629#comment-14323629
 ] 

Patrick Wendell commented on SPARK-5081:


Hey [~cb_betz], can you verify a few things? It would be good to make sure you 
revert all configuration changes from 1.2.0. Specifically, set 
spark.shuffle.blockTransferService to nio and set spark.shuffle.manager to 
hash. Also, verify in the UI that they are set correctly.

If all this is set, can you give us the change in the size of the aggregate 
shuffle output between the two releases?
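The two settings Patrick asks to verify can be pinned explicitly at submit time. A hedged sketch of building the flags (the two configuration keys are real Spark 1.2 properties; the surrounding helper is illustrative only):

```python
# Spark 1.2 changed two shuffle defaults relative to 1.1: the block transfer
# service (nio -> netty) and the shuffle manager (hash -> sort). To compare
# the releases apples-to-apples, pin both back to the 1.1 behavior.
legacy_shuffle_conf = {
    "spark.shuffle.blockTransferService": "nio",
    "spark.shuffle.manager": "hash",
}

flags = " ".join(
    f"--conf {key}={value}" for key, value in sorted(legacy_shuffle_conf.items())
)
print(flags)
# --conf spark.shuffle.blockTransferService=nio --conf spark.shuffle.manager=hash
```

Both values can then be confirmed on the Environment tab of the UI, as Patrick suggests.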




[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-13 Thread Dr. Christian Betz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319948#comment-14319948
 ] 

Dr. Christian Betz commented on SPARK-5081:
---

From SPARK-5715: I see a *factor-four performance loss* in my Spark jobs when 
migrating from Spark 1.1.0 to Spark 1.2.0 or 1.2.1.

Also, I see an *increase in the size of shuffle writes* (also reported by 
Kevin Jung on the mailing list: 
http://apache-spark-user-list.1001560.n3.nabble.com/Shuffle-write-increases-in-spark-1-2-tt20894.html).

Together with this I experience a *huge number of disk spills*.



I'm experiencing these with my job under the following circumstances: 

* Spark 1.2.0 with Sort-based Shuffle 
* Spark 1.2.0 with Hash-based Shuffle 
* Spark 1.2.1 with Sort-based Shuffle 

All three combinations show the same behavior, which contrasts with Spark 
1.1.0. 

In Spark 1.1.0, my job runs for about an hour, in Spark 1.2.x it runs for 
almost four hours. Configuration is identical otherwise - I only added 
org.apache.spark.scheduler.CompressedMapStatus to the Kryo registrator for 
Spark 1.2.0 to cope with https://issues.apache.org/jira/browse/SPARK-5102. 


As a consequence (I think, but causality might be different) I see lots and 
lots of disk spills. 

I cannot provide a small test case, but maybe the log entries for a single 
worker thread can help someone investigate on this. (See below.) 


I will also open up an issue, if nobody stops me by providing an answer ;) 

Any help will be greatly appreciated, because otherwise I'm stuck with Spark 
1.1.0, as quadrupling runtime is not an option. 

Sincerely, 

Chris 



2015-02-09T14:06:06.328+01:00 INFO org.apache.spark.executor.Executor Running 
task 9.0 in stage 18.0 (TID 300) Executor task launch worker-18 
2015-02-09T14:06:06.351+01:00 INFO org.apache.spark.CacheManager Partition 
rdd_35_9 not found, computing it Executor task launch worker-18 
2015-02-09T14:06:06.351+01:00 INFO 
org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 10 non-empty 
blocks out of 10 blocks Executor task launch worker-18 
2015-02-09T14:06:06.351+01:00 INFO 
org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches 
in 0 ms Executor task launch worker-18 
2015-02-09T14:06:07.396+01:00 INFO org.apache.spark.storage.MemoryStore 
ensureFreeSpace(2582904) called with curMem=300174944, maxMe... Executor task 
launch worker-18 
2015-02-09T14:06:07.397+01:00 INFO org.apache.spark.storage.MemoryStore Block 
rdd_35_9 stored as bytes in memory (estimated size 2.5... Executor task launch 
worker-18 
2015-02-09T14:06:07.398+01:00 INFO org.apache.spark.storage.BlockManagerMaster 
Updated info of block rdd_35_9 Executor task launch worker-18 
2015-02-09T14:06:07.399+01:00 INFO org.apache.spark.CacheManager Partition 
rdd_38_9 not found, computing it Executor task launch worker-18 
2015-02-09T14:06:07.399+01:00 INFO 
org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 10 non-empty 
blocks out of 10 blocks Executor task launch worker-18 
2015-02-09T14:06:07.400+01:00 INFO 
org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches 
in 0 ms Executor task launch worker-18 
2015-02-09T14:06:07.567+01:00 INFO org.apache.spark.storage.MemoryStore 
ensureFreeSpace(944848) called with curMem=302757848, maxMem... Executor task 
launch worker-18 
2015-02-09T14:06:07.568+01:00 INFO org.apache.spark.storage.MemoryStore Block 
rdd_38_9 stored as values in memory (estimated size 92... Executor task launch 
worker-18 
2015-02-09T14:06:07.569+01:00 INFO org.apache.spark.storage.BlockManagerMaster 
Updated info of block rdd_38_9 Executor task launch worker-18 
2015-02-09T14:06:07.573+01:00 INFO 
org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 34 non-empty 
blocks out of 50 blocks Executor task launch worker-18 
2015-02-09T14:06:07.573+01:00 INFO 
org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches 
in 1 ms Executor task launch worker-18 
2015-02-09T14:06:38.931+01:00 INFO org.apache.spark.CacheManager Partition 
rdd_41_9 not found, computing it Executor task launch worker-18 
2015-02-09T14:06:38.931+01:00 INFO 
org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 3 non-empty blocks 
out of 10 blocks Executor task launch worker-18 
2015-02-09T14:06:38.931+01:00 INFO 
org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches 
in 0 ms Executor task launch worker-18 
2015-02-09T14:06:38.945+01:00 INFO org.apache.spark.storage.MemoryStore 
ensureFreeSpace(0) called with curMem=307529127, maxMem=9261... Executor task 
launch worker-18 
2015-02-09T14:06:38.945+01:00 INFO org.apache.spark.storage.MemoryStore Block 
rdd_41_9 stored as bytes in memory (estimated size 0.0... Executor task launch 
worker-18 
2015-02-09T14:06:38.946+01:00 INFO org.apache.spark.storage.BlockManagerMaster 
Updated info of block 

[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-10 Thread Kevin Jung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315341#comment-14315341
 ] 

Kevin Jung commented on SPARK-5081:
---

Xuefeng Wu mentioned one difference: the snappy-java version.

<dependency>
  <groupId>org.xerial.snappy</groupId>
  <artifactId>snappy-java</artifactId>
  <version>1.0.5.3</version>
</dependency>

It changed to 1.1.1.6 in Spark 1.2. We need to consider these two.






[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-05 Thread Shekhar Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308596#comment-14308596
 ] 

Shekhar Bansal commented on SPARK-5081:
---

I faced the same problem; switching to lz4 compression did the trick for me.
Try spark.io.compression.codec=lz4.
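For reference, one way to try this is at submit time (a sketch; the JAR and class names are placeholders, while spark.io.compression.codec and spark.shuffle.manager are standard Spark settings):

```shell
# Run the same job with lz4 shuffle/IO compression instead of snappy.
# com.example.MyJob and myjob.jar are placeholders for your own job.
spark-submit \
  --conf spark.io.compression.codec=lz4 \
  --conf spark.shuffle.manager=hash \
  --class com.example.MyJob myjob.jar
```

The same keys can also be set in spark-defaults.conf or on the SparkConf before the context is created.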



[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-05 Thread Kevin Jung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308608#comment-14308608
 ] 

Kevin Jung commented on SPARK-5081:
---

Sorry, I will try to write new code to reproduce this problem, because I no 
longer have the old code.




[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-05 Thread Kevin Jung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308620#comment-14308620
 ] 

Kevin Jung commented on SPARK-5081:
---

To test under the same conditions, I set the codec to snappy for every Spark 
version, but the problem still occurs. As far as I know, lz4 needs more CPU 
time than snappy but has a better compression ratio.




[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-05 Thread Kostas Sakellis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307584#comment-14307584
 ] 

Kostas Sakellis commented on SPARK-5081:


Can you add a sample of the code too?
