What confused me is the statement "The final result is that rdd1 is
calculated twice." Is that the expected behavior?

Thanks.

Zhan Zhang

On Feb 26, 2015, at 3:03 PM, Sean Owen <so...@cloudera.com> wrote:

To distill this a bit further, I don't think you actually want rdd2 to
wait on rdd1 in this case. What you want is for a request for
partition X to wait if partition X is already being calculated in a
persisted RDD. Otherwise the first partition of rdd2 waits on the
final partition of rdd1 even when the rest is ready.

That is probably a good idea in almost all cases. How hard it is to
implement, I don't know. But I speculate that it's easier to deal with
it at that level than as a function of the dependency graph.
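
A minimal sketch of the per-partition idea, in plain Python rather than Spark: a request for partition X either starts the computation or waits on the one already in flight. This is not Spark's implementation; `PartitionCache`, `slow_partition`, and `job` are hypothetical names made up for illustration.

```python
import threading
import time
from concurrent.futures import Future


class PartitionCache:
    """Toy stand-in for a persisted RDD's cache (hypothetical, not Spark code)."""

    def __init__(self, compute):
        self._compute = compute      # function: partition index -> data
        self._lock = threading.Lock()
        self._futures = {}           # partition index -> Future holding its data
        self.compute_calls = 0       # how many computations were actually started

    def get(self, index):
        # Decide under the lock whether this caller owns the computation.
        with self._lock:
            fut = self._futures.get(index)
            owner = fut is None
            if owner:
                fut = Future()
                self._futures[index] = fut
                self.compute_calls += 1
        if owner:
            # Compute outside the lock so other partitions are not blocked.
            try:
                fut.set_result(self._compute(index))
            except Exception as exc:
                # Simplification: a failed partition stays failed in this sketch.
                fut.set_exception(exc)
        # Non-owners block here until the owner finishes this partition only.
        return fut.result()


def slow_partition(index):
    time.sleep(0.05)                 # pretend the partition is expensive
    return [index * 10 + i for i in range(3)]


cache = PartitionCache(slow_partition)
results = []


def job():
    # Each "job" reads all four partitions, like two actions over the same RDD.
    results.append([cache.get(i) for i in range(4)])


threads = [threading.Thread(target=job) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Both jobs see the same data, and each partition is computed once even though the two jobs overlap. The wait is per partition, so a reader of partition 0 never waits on partition 3.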

On Thu, Feb 26, 2015 at 10:49 PM, Corey Nolet <cjno...@gmail.com> wrote:
I'm trying to do the scheduling myself now: determining that rdd2 depends
on rdd1 and that rdd1 is a persistent RDD (storage level != None), so that
I can run a no-op on rdd1 before I run rdd2. I would much rather the DAG
figure this out so I don't need to think about all this.
