repartition combined with zipWithIndex gets stuck
Hi,

I'm having trouble using zipWithIndex together with repartition: when I combine the two, the following action gets stuck and never returns. I'm using Spark 1.1.0.

These two statements work as expected:

  scala> sc.parallelize(1 to 10).repartition(10).count()
  res0: Long = 10

  scala> sc.parallelize(1 to 10).zipWithIndex.count()
  res1: Long = 10

But this statement gets stuck and doesn't return:

  scala> sc.parallelize(1 to 10).zipWithIndex.repartition(10).count()
  14/11/15 03:18:55 INFO spark.SparkContext: Starting job: apply at Option.scala:120
  14/11/15 03:18:55 INFO scheduler.DAGScheduler: Got job 3 (apply at Option.scala:120) with 3 output partitions (allowLocal=false)
  14/11/15 03:18:55 INFO scheduler.DAGScheduler: Final stage: Stage 4(apply at Option.scala:120)
  14/11/15 03:18:55 INFO scheduler.DAGScheduler: Parents of final stage: List()
  14/11/15 03:18:55 INFO scheduler.DAGScheduler: Missing parents: List()
  14/11/15 03:18:55 INFO scheduler.DAGScheduler: Submitting Stage 4 (ParallelCollectionRDD[7] at parallelize at <console>:13), which has no missing parents
  14/11/15 03:18:55 INFO storage.MemoryStore: ensureFreeSpace(1096) called with curMem=7616, maxMem=138938941
  14/11/15 03:18:55 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 1096.0 B, free 132.5 MB)

Am I doing something wrong here, or is it a bug? Is there a workaround?

Thanks,
Lev.
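Aside on what zipWithIndex is doing under the hood: to assign global indices, it has to know how many elements precede each partition, so for an RDD with more than one partition it triggers a separate Spark job just to count the elements in each partition; that is presumably the extra "Starting job" you see in the log above. Below is a minimal sketch of the same two-pass idea, written against the public RDD API rather than copied from Spark's actual implementation; manualZipWithIndex is an illustrative name.

  import org.apache.spark.rdd.RDD

  // Two-pass global indexing: job 1 counts elements per partition,
  // pass 2 offsets each partition's local indices by a prefix sum.
  def manualZipWithIndex[T](rdd: RDD[T]): RDD[(T, Long)] = {
    // Job 1: one element count per partition, collected to the driver.
    val counts: Array[Long] =
      rdd.mapPartitions(it => Iterator.single(it.size.toLong)).collect()
    // Prefix sums: the global index at which each partition starts.
    val startIndices: Array[Long] = counts.scanLeft(0L)(_ + _)
    // Pass 2: shift each partition's local indices by its start index.
    rdd.mapPartitionsWithIndex { (pid, it) =>
      it.zipWithIndex.map { case (x, i) => (x, startIndices(pid) + i) }
    }
  }

Spark's own zipWithIndex performs this counting internally, which is why it can run a job before you call any action yourself.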
Re: repartition combined with zipWithIndex gets stuck
This is a bug. Could you make a JIRA?

-Xiangrui
Re: repartition combined with zipWithIndex gets stuck
I think I understand where the bug is now. I created a JIRA (https://issues.apache.org/jira/browse/SPARK-4433) and will make a PR soon.

-Xiangrui
Re: repartition combined with zipWithIndex gets stuck
PR: https://github.com/apache/spark/pull/3291

For now, here is a workaround:

  val a = sc.parallelize(1 to 10).zipWithIndex()
  a.partitions // call .partitions explicitly
  a.repartition(10).count()

Thanks for reporting the bug!

-Xiangrui
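Why the workaround helps, as far as I can tell from the JIRA and PR above: zipWithIndex runs its per-partition counting job lazily, when the indexed RDD's partitions are first computed, and in 1.1 that computation can happen from inside the scheduler while it is submitting the repartition job, which deadlocks on the nested job. Calling a.partitions on the driver runs the counting job eagerly, before any other job is submitted. A small helper that packages the same workaround; forceIndexed is an illustrative name, not a Spark API:

  import org.apache.spark.rdd.RDD

  // Workaround sketch: materialize the indexed RDD's partitions (and with
  // them the internal counting job) eagerly, before any downstream job.
  def forceIndexed[T](rdd: RDD[T]): RDD[(T, Long)] = {
    val indexed = rdd.zipWithIndex()
    indexed.partitions // triggers the per-partition counting job now
    indexed
  }

  // usage: forceIndexed(sc.parallelize(1 to 10)).repartition(10).count()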