Re: Is there a way to look at RDD's lineage? Or debug a fault-tolerance error?

2014-10-09 Thread Pierre B
To add to this: if you look at RDD.scala in the Spark source, you'll see that both the parent and firstParent methods are protected[spark]. I guess that, for good reasons I must admit I don't completely understand, you are not supposed to explore an RDD's lineage programmatically... I had a

Re: Is there a way to look at RDD's lineage? Or debug a fault-tolerance error?

2014-10-08 Thread Akshat Aranya
Using a var for RDDs in this way is not going to work. In this example, tx1.zip(tx2) would create an RDD that depends on tx2, but then soon after that, you change what tx2 means, so you would end up having a circular dependency. On Wed, Oct 8, 2014 at 12:01 PM, Sung Hwan Chung

Re: Is there a way to look at RDD's lineage? Or debug a fault-tolerance error?

2014-10-08 Thread Liquan Pei
There is a toDebugString method in rdd that will print a description of this RDD and its recursive dependencies for debugging. Thanks, Liquan On Wed, Oct 8, 2014 at 12:01 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: My job is not being fault-tolerant (e.g., when there's a fetch failure

Re: Is there a way to look at RDD's lineage? Or debug a fault-tolerance error?

2014-10-08 Thread Sung Hwan Chung
There is no circular dependency. It's simply dropping references to previous RDDs because there is no need for them. I wonder, though, whether that messes things up internally for Spark, since references to intermediate RDDs are lost. On Oct 8, 2014, at 12:13 PM, Akshat Aranya aara...@gmail.com wrote:
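
The disagreement above comes down to whether rebinding a variable to a derived RDD can make the dependency graph circular. Here is a minimal toy model of that question in plain Python, not Spark code. The `Node` class, `has_cycle`, and `lineage_depth` are hypothetical names made up for illustration, and the loop is only a guess at the shape of the code being discussed. In this sketch, each derived node keeps a reference to the nodes it was built from, so rebinding the name leaves a chain rather than a cycle.

```python
class Node:
    """Toy stand-in for an RDD (hypothetical): remembers the nodes it was derived from."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = tuple(parents)

    def zip(self, other):
        # Deriving a new node never modifies its inputs; it only points back at them.
        return Node(f"zip({self.name},{other.name})", parents=(self, other))


def has_cycle(node, path=()):
    """Return True if following parent links from `node` ever revisits a node on the current path."""
    if any(node is seen for seen in path):
        return True
    return any(has_cycle(parent, path + (node,)) for parent in node.parents)


def lineage_depth(node):
    """Length of the longest ancestor chain, counting the node itself."""
    return 1 + max((lineage_depth(parent) for parent in node.parents), default=0)


tx1 = Node("tx1")
tx2 = Node("tx2")
original_tx2 = tx2  # kept only so the original node can be inspected afterwards

for _ in range(3):
    # Rebinds the name tx2; the previous node survives as a parent of the new one.
    tx2 = tx1.zip(tx2)

print(tx2.name)
print("cycle:", has_cycle(tx2), "depth:", lineage_depth(tx2))
```

In this model the variable that used to name an intermediate node is gone, but the node itself is still reachable through `parents`, which is the property the reply above relies on. Whether Spark's internals behave the same way under a fetch failure is a separate question that this sketch does not answer.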

Re: Is there a way to look at RDD's lineage? Or debug a fault-tolerance error?

2014-10-08 Thread Sung Hwan Chung
One thing I didn't mention is that we actually do data.repartition beforehand, with a shuffle. I found that this can actually introduce randomness into the lineage steps, because the data get shuffled to different partitions, which leads to inconsistent behavior if your algorithm is dependent on the order at
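
A toy illustration of the ordering concern, again in plain Python rather than Spark. The function `tag_with_position` and the two arrival orders are hypothetical and chosen by hand; they only stand in for the same partition being rebuilt after a shuffle with its records in a different order. Sorting before the order-dependent step is shown as one possible way to make the result repeatable, which is an assumption and not necessarily what the poster ended up doing.

```python
def tag_with_position(records):
    """Order-dependent step: map each record to its position within the partition."""
    return {value: position for position, value in enumerate(records)}


# Same records, two hand-picked arrival orders: the first computation
# and a hypothetical recomputation after a failure.
first_run = ["a", "b", "c", "d"]
recomputed = ["c", "a", "d", "b"]

# The order-dependent step disagrees between the two runs.
print(tag_with_position(first_run) == tag_with_position(recomputed))

# Imposing a deterministic order first makes both runs agree.
print(tag_with_position(sorted(first_run)) == tag_with_position(sorted(recomputed)))
```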