repartition combined with zipWithIndex gets stuck

2014-11-15 Thread lev
Hi,

I'm having trouble using zipWithIndex and repartition together. When I use them
both, the following action gets stuck and never returns.
I'm using Spark 1.1.0.


These two statements work as expected:

scala> sc.parallelize(1 to 10).repartition(10).count()
res0: Long = 10

scala> sc.parallelize(1 to 10).zipWithIndex.count()
res1: Long = 10


But this statement gets stuck and doesn't return:

scala> sc.parallelize(1 to 10).zipWithIndex.repartition(10).count()
14/11/15 03:18:55 INFO spark.SparkContext: Starting job: apply at
Option.scala:120
14/11/15 03:18:55 INFO scheduler.DAGScheduler: Got job 3 (apply at
Option.scala:120) with 3 output partitions (allowLocal=false)
14/11/15 03:18:55 INFO scheduler.DAGScheduler: Final stage: Stage 4(apply at
Option.scala:120)
14/11/15 03:18:55 INFO scheduler.DAGScheduler: Parents of final stage:
List()
14/11/15 03:18:55 INFO scheduler.DAGScheduler: Missing parents: List()
14/11/15 03:18:55 INFO scheduler.DAGScheduler: Submitting Stage 4
(ParallelCollectionRDD[7] at parallelize at <console>:13), which has no
missing parents
14/11/15 03:18:55 INFO storage.MemoryStore: ensureFreeSpace(1096) called
with curMem=7616, maxMem=138938941
14/11/15 03:18:55 INFO storage.MemoryStore: Block broadcast_4 stored as
values in memory (estimated size 1096.0 B, free 132.5 MB)


Am I doing something wrong here, or is this a bug?
Is there a workaround?

Thanks,
Lev.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/repartition-combined-with-zipWithIndex-get-stuck-tp18999.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: repartition combined with zipWithIndex gets stuck

2014-11-15 Thread Xiangrui Meng
This is a bug. Could you make a JIRA? -Xiangrui




Re: repartition combined with zipWithIndex gets stuck

2014-11-15 Thread Xiangrui Meng
I think I understand where the bug is now. I created a JIRA
(https://issues.apache.org/jira/browse/SPARK-4433) and will make a PR
soon. -Xiangrui




Re: repartition combined with zipWithIndex gets stuck

2014-11-15 Thread Xiangrui Meng
PR: https://github.com/apache/spark/pull/3291 . For now, here is a workaround:

val a = sc.parallelize(1 to 10).zipWithIndex()
a.partitions // call .partitions explicitly
a.repartition(10).count()

Thanks for reporting the bug! -Xiangrui
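
For reference, here is the workaround above written out as a minimal,
self-contained program. This is only a sketch against Spark 1.1.x: the object
name ZipRepartitionWorkaround, the helper materializePartitions, and the
local[2] master are illustrative, not part of Spark. The essential part is
touching .partitions on the zipped RDD before repartition runs.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ZipRepartitionWorkaround {
  // Force the RDD's partition metadata to be computed on the driver.
  // For the RDD returned by zipWithIndex this seems to be enough to avoid
  // the hang, because the per-partition start indices are then computed
  // before the repartition/shuffle job is submitted.
  def materializePartitions[T](rdd: RDD[T]): RDD[T] = {
    rdd.partitions // evaluated for its side effect only
    rdd
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("zip-repartition-workaround")
      .setMaster("local[2]") // local mode, just for trying it out
    val sc = new SparkContext(conf)

    val zipped = materializePartitions(sc.parallelize(1 to 10).zipWithIndex())
    println(zipped.repartition(10).count()) // prints 10 instead of hanging

    sc.stop()
  }
}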


