To add a bit on this one: if you look at RDD.scala in the Spark code, you'll see
that both the parent and firstParent methods are protected[spark].
I guess, for good reasons that I must admit I don't completely understand,
you are not supposed to explore an RDD's lineage programmatically...
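For reference, the declarations look roughly like this (paraphrased from
RDD.scala; exact bodies may differ between Spark versions):

    // Paraphrased from RDD.scala. The [spark] qualifier restricts these
    // accessors to code inside the org.apache.spark package.
    protected[spark] def parent[U: ClassTag](j: Int): RDD[U] =
      dependencies(j).rdd.asInstanceOf[RDD[U]]

    protected[spark] def firstParent[U: ClassTag]: RDD[U] =
      dependencies.head.rdd.asInstanceOf[RDD[U]]

Note that the dependencies method they delegate to is public, so the
immediate parents are still reachable, just not through these convenience
accessors.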
Using a var for RDDs in this way is not going to work. In this example,
tx1.zip(tx2) would create an RDD that depends on tx2, but then soon after
that you change what tx2 means, so you would end up with a circular
dependency.
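A minimal sketch of the pattern being discussed (tx1, tx2, and the update
step are hypothetical, assuming an existing SparkContext sc):

    // tx1.zip(tx2) captures the *current* tx2 object in its lineage;
    // reassigning the var rebinds the name but does not change that lineage.
    val tx1 = sc.parallelize(1 to 100)
    var tx2 = sc.parallelize(1 to 100)
    for (_ <- 1 to 10) {
      tx2 = tx1.zip(tx2).map { case (a, b) => a + b }
    }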
On Wed, Oct 8, 2014 at 12:01 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote:
There is a toDebugString method on RDD that returns a description of the
RDD and its recursive dependencies, for debugging.
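For example (assuming an existing SparkContext sc):

    val rdd = sc.parallelize(1 to 100).map(_ * 2).filter(_ > 10)
    // Prints the RDD and each of its ancestors, one per line:
    println(rdd.toDebugString)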
Thanks,
Liquan
On Wed, Oct 8, 2014 at 12:01 PM, Sung Hwan Chung coded...@cs.stanford.edu
wrote:
My job is not being fault-tolerant (e.g., when there's a fetch failure).
There is no circular dependency. It's simply dropping references to previous
RDDs because there is no need for them.
I wonder if that messes things up internally for Spark, though, due to losing
references to intermediate RDDs.
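One way to keep the lineage from growing behind the dropped references would
be to checkpoint periodically. A sketch, assuming an existing SparkContext
sc (the loop shape, interval, and directory are made up):

    sc.setCheckpointDir("/tmp/spark-checkpoints")  // example path
    var data = sc.parallelize(1 to 100)
    for (i <- 1 to 50) {
      data = data.map(_ + 1)
      if (i % 10 == 0) {
        data.checkpoint()  // mark for checkpointing; truncates the lineage
        data.count()       // an action forces the checkpoint to materialize
      }
    }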
On Oct 8, 2014, at 12:13 PM, Akshat Aranya aara...@gmail.com wrote:
One thing I didn't mention is that we actually do data.repartition beforehand,
with a shuffle.
I found that this can actually introduce randomness into the lineage steps,
because data gets shuffled to different partitions, which can lead to
inconsistent behavior if your algorithm depends on the order in which records
appear.
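A sketch of the problem (assuming an existing SparkContext sc):

    // repartition shuffles records; the order within the resulting
    // partitions is not guaranteed to be stable across recomputations.
    val data = sc.parallelize(1 to 1000, 4)
    val shuffled = data.repartition(8)

    // If order matters, impose a deterministic one explicitly, e.g.:
    val ordered = shuffled.sortBy(identity)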