[ https://issues.apache.org/jira/browse/SPARK-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Piotr Kołaczkowski updated SPARK-1712:
--------------------------------------

Description:

{noformat}
scala> val collection = (1 to 1000000).map(i => ("foo" + i, i)).toVector
collection: Vector[(String, Int)] = Vector((foo1,1), (foo2,2), (foo3,3), (foo4,4), (foo5,5), (foo6,6), (foo7,7), (foo8,8), (foo9,9), (foo10,10), (foo11,11), (foo12,12), (foo13,13), (foo14,14), (foo15,15), (foo16,16), (foo17,17), (foo18,18), (foo19,19), (foo20,20), (foo21,21), (foo22,22), (foo23,23), (foo24,24), (foo25,25), (foo26,26), (foo27,27), (foo28,28), (foo29,29), (foo30,30), (foo31,31), (foo32,32), (foo33,33), (foo34,34), (foo35,35), (foo36,36), (foo37,37), (foo38,38), (foo39,39), (foo40,40), (foo41,41), (foo42,42), (foo43,43), (foo44,44), (foo45,45), (foo46,46), (foo47,47), (foo48,48), (foo49,49), (foo50,50), (foo51,51), (foo52,52), (foo53,53), (foo54,54), (foo55,55), (foo56,56), (foo57,57), (foo58,58), (foo59,59), (foo60,60), (foo61,61), (foo62,62), (foo63,63), (foo64,64), (foo...

scala> val rdd = sc.parallelize(collection)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.first
res4: (String, Int) = (foo1,1)

scala> rdd.map(_._2).sum
// nothing happens
{noformat}

CPU and I/O are idle. Memory usage reported by the JVM, after a manually triggered GC:

repl: 216 MB / 2 GB
executor: 67 MB / 2 GB
worker: 6 MB / 128 MB
master: 6 MB / 128 MB

No errors were found in the worker's stderr/stdout.

With 700,000 elements it works fine, taking about 1 second to process the request and calculate the sum, and the Spark executor memory doesn't even exceed 300 MB of the 2 GB available. It fails with 800,000 items. Multiple parallelized collections of 700,000 items each, used at the same time in the same session, work fine.
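Given the symptoms above ({{first}} succeeds, {{sum}} hangs, and the failure threshold sits between 700,000 and 800,000 elements), one way to probe from the same spark-shell session whether the hang tracks the serialized size of the data each task carries is sketched below. This is a diagnostic sketch, not part of the original report: {{serializedSize}} is an ad-hoc helper defined here, {{sc}} is the SparkContext provided by spark-shell, and {{sc.parallelize(seq, numSlices)}} is the standard RDD API.

{code:scala}
// Diagnostic sketch (not from the report). Run inside spark-shell, where
// `sc` is the provided SparkContext. `serializedSize` is an ad-hoc helper
// that measures the Java-serialized size of an object.
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

def serializedSize(obj: AnyRef): Long = {
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(obj)   // Vector[(String, Int)] is java.io.Serializable
  out.close()
  buf.size.toLong
}

val collection = (1 to 1000000).map(i => ("foo" + i, i)).toVector
println(s"serialized size: ${serializedSize(collection)} bytes")

// Spreading the same data over more partitions shrinks each task's payload.
// If the sum completes with more slices, a per-task size limit is implicated.
val rdd = sc.parallelize(collection, 100)
println(rdd.map(_._2).sum)
{code}

If smaller per-task payloads make the job complete, a message-size limit such as {{spark.akka.frameSize}} (an assumption here, not a cause confirmed by this report) would be the next thing to check.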
> ParallelCollectionRDD operations hanging forever without any error messages
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-1712
>                 URL: https://issues.apache.org/jira/browse/SPARK-1712
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.9.0
>         Environment: Linux Ubuntu 14.04, a single Spark node; standalone mode.
>            Reporter: Piotr Kołaczkowski
>         Attachments: executor.jstack.txt, master.jstack.txt, repl.jstack.txt, spark-hang.png, worker.jstack.txt