[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340089#comment-14340089 ]
Dr. Christian Betz commented on SPARK-5081:
-------------------------------------------

OK, I can narrow the "Thread spilling in-memory map" issue down to a difference between 1.1.0-cdh5.2.0 and 1.1.0. With 1.1.0-cdh5.2.0 everything is fine; with 1.1.0 I get thread spilling and longer runtimes.

Remember, this is the symptom:

2015-02-27 13:33:41.221 [Executor task launch worker-6 ] INFO org.apache.spark.util.collection.ExternalAppendOnlyMap : Thread 109 spilling in-memory map of 0 MB to disk (9 times so far)
2015-02-27 13:33:41.501 [Executor task launch worker-6 ] INFO org.apache.spark.util.collection.ExternalAppendOnlyMap : Thread 109 spilling in-memory map of 0 MB to disk (10 times so far)
2015-02-27 13:33:41.742 [Executor task launch worker-2 ] INFO org.apache.spark.util.collection.ExternalAppendOnlyMap : Thread 77 spilling in-memory map of 27 MB to disk (1 time so far)
2015-02-27 13:33:41.811 [Executor task launch worker-6 ] INFO org.apache.spark.util.collection.ExternalAppendOnlyMap : Thread 109 spilling in-memory map of 0 MB to disk (11 times so far)
2015-02-27 13:33:42.110 [Executor task launch worker-6 ] INFO org.apache.spark.util.collection.ExternalAppendOnlyMap : Thread 109 spilling in-memory map of 0 MB to disk (12 times so far)
2015-02-27 13:33:42.398 [Executor task launch worker-6 ] INFO org.apache.spark.util.collection.ExternalAppendOnlyMap : Thread 109 spilling in-memory map of 0 MB to disk (13 times so far)
2015-02-27 13:33:42.663 [Executor task launch worker-6 ] INFO org.apache.spark.util.collection.ExternalAppendOnlyMap : Thread 109 spilling in-memory map of 0 MB to disk (14 times so far)
2015-02-27 13:33:42.704 [Executor task launch worker-2 ] INFO org.apache.spark.storage.BlockManager : Found block rdd_3_33 locally
2015-02-27 13:33:43.045 [Executor task launch worker-6 ] INFO org.apache.spark.util.collection.ExternalAppendOnlyMap : Thread 109 spilling in-memory map of 0 MB to disk (15 times so far)
2015-02-27 13:33:43.367 [Executor task launch worker-6 ] INFO org.apache.spark.util.collection.ExternalAppendOnlyMap : Thread 109 spilling in-memory map of 0 MB to disk (16 times so far)
2015-02-27 13:33:43.637 [Executor task launch worker-6 ] INFO org.apache.spark.util.collection.ExternalAppendOnlyMap : Thread 109 spilling in-memory map of 0 MB to disk (17 times so far)

> Shuffle write increases
> -----------------------
>
>                 Key: SPARK-5081
>                 URL: https://issues.apache.org/jira/browse/SPARK-5081
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.2.0
>            Reporter: Kevin Jung
>            Priority: Critical
>         Attachments: Spark_Debug.pdf
>
>
> The shuffle write size shown in the Spark web UI differs considerably when I
> execute the same Spark job with the same input data on Spark 1.1 and Spark 1.2.
> At the sortBy stage, the shuffle write size is 98.1MB in Spark 1.1 but 146.9MB
> in Spark 1.2.
> I set the spark.shuffle.manager option to hash because its default value has
> changed, but Spark 1.2 still writes more shuffle output than Spark 1.1.
> It can increase disk I/O overhead exponentially as the input file gets bigger,
> and it causes the jobs to take more time to complete.
> With about 100GB of input, for example, the shuffle write size is
> 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
> Spark 1.1
> ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
> |9|saveAsTextFile| |1169.4KB| |
> |12|combineByKey| |1265.4KB|1275.0KB|
> |6|sortByKey| |1276.5KB| |
> |8|mapPartitions| |91.0MB|1383.1KB|
> |4|apply| |89.4MB| |
> |5|sortBy|155.6MB| |98.1MB|
> |3|sortBy|155.6MB| | |
> |1|collect| |2.1MB| |
> |2|mapValues|155.6MB| |2.2MB|
> |0|first|184.4KB| | |
>
> Spark 1.2
> ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
> |12|saveAsTextFile| |1170.2KB| |
> |11|combineByKey| |1264.5KB|1275.0KB|
> |8|sortByKey| |1273.6KB| |
> |7|mapPartitions| |134.5MB|1383.1KB|
> |5|zipWithIndex| |132.5MB| |
> |4|sortBy|155.6MB| |146.9MB|
> |3|sortBy|155.6MB| | |
> |2|collect| |2.0MB| |
> |1|mapValues|155.6MB| |2.2MB|
> |0|first|184.4KB| | |
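
To compare runs on the two builds, it may help to tally the spill messages from the comment's log excerpt per thread. The following is only a minimal sketch: it assumes executor log lines in exactly the format quoted in the comment, and the helper name `summarize_spills` is hypothetical (it is not part of Spark). A thread that reports many spills with a total of 0 MB, like Thread 109 in the excerpt, is the pattern described as the symptom.

```python
import re
from collections import defaultdict

# Matches the ExternalAppendOnlyMap spill message as it appears in the excerpt.
# "times?" covers both "1 time so far" and "N times so far".
SPILL_RE = re.compile(
    r"Thread (\d+) spilling in-memory map of (\d+) MB to disk \((\d+) times? so far\)"
)


def summarize_spills(lines):
    """Hypothetical helper: per-thread spill statistics from executor log lines.

    Returns {thread_id: {"spills": messages seen,
                         "latest_count": highest "N times so far" value,
                         "total_mb": sum of reported map sizes in MB}}.
    """
    summary = defaultdict(lambda: {"spills": 0, "latest_count": 0, "total_mb": 0})
    for line in lines:
        match = SPILL_RE.search(line)
        if match is None:
            continue  # not a spill message (e.g. BlockManager output)
        thread, size_mb, count = (int(g) for g in match.groups())
        entry = summary[thread]
        entry["spills"] += 1
        entry["latest_count"] = max(entry["latest_count"], count)
        entry["total_mb"] += size_mb
    return dict(summary)


# A few lines copied from the excerpt in the comment.
SAMPLE = [
    "2015-02-27 13:33:41.221 [Executor task launch worker-6 ] INFO org.apache.spark.util.collection.ExternalAppendOnlyMap : Thread 109 spilling in-memory map of 0 MB to disk (9 times so far)",
    "2015-02-27 13:33:41.501 [Executor task launch worker-6 ] INFO org.apache.spark.util.collection.ExternalAppendOnlyMap : Thread 109 spilling in-memory map of 0 MB to disk (10 times so far)",
    "2015-02-27 13:33:41.742 [Executor task launch worker-2 ] INFO org.apache.spark.util.collection.ExternalAppendOnlyMap : Thread 77 spilling in-memory map of 27 MB to disk (1 time so far)",
    "2015-02-27 13:33:42.704 [Executor task launch worker-2 ] INFO org.apache.spark.storage.BlockManager : Found block rdd_3_33 locally",
]

if __name__ == "__main__":
    for thread, stats in sorted(summarize_spills(SAMPLE).items()):
        print(thread, stats)
```

Running the same tally over full executor logs from 1.1.0 and 1.1.0-cdh5.2.0 would give a like-for-like count of spill messages for each build.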