There is no need to do that if all of the following hold:

1) the stage you are concerned with either consumed or produced MapOutputs/shuffle files;
2) reusing those shuffle files (which may well still be sitting in the OS buffer cache of the worker nodes) is sufficient for your needs;
3) the relevant Stage objects haven't gone out of scope (once they do, the shuffle files become eligible for removal);
4) you reuse the exact same Stage objects that were used previously.

If all of that is true, then Spark will reuse the prior stage, with performance very similar to what you'd get from explicitly caching an equivalent RDD.
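For contrast with that implicit reuse, here is a minimal spark-shell sketch of the explicit caching route ayan suggested; the input path and the transformations are made up for illustration:

val raw = sc.textFile("hdfs:///some/input")   // hypothetical input path

val counts = raw
  .map(line => (line.length % 10, 1))
  .reduceByKey(_ + _)    // shuffle boundary: this stage produces shuffle files
  .cache()               // or .persist(...) after importing org.apache.spark.storage.StorageLevel

counts.count()   // first action: runs all stages and materializes the cache
counts.take(5)   // later actions reuse the cached partitions, skipping the earlier stages

Note that without cache()/persist(), a second action can still skip earlier stages when the conditions above hold, but rebuilding the lineage (e.g. re-pasting the same code in the shell) creates new Stage objects and forces a full recompute.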
On Mon, Oct 17, 2016 at 4:53 PM, ayan guha <guha.a...@gmail.com> wrote:
> You can use cache or persist.
>
> On Tue, Oct 18, 2016 at 10:11 AM, Yang <teddyyyy...@gmail.com> wrote:
>
>> I'm trying out 2.0, and ran a long job with 10 stages in spark-shell.
>>
>> It seems that after all 10 finished successfully, if I run the last, or
>> the 9th, again, Spark reruns all the previous stages from scratch instead
>> of utilizing the partial results.
>>
>> This is quite serious, since I can't experiment while making small
>> changes to the code.
>>
>> Any idea what part of the Spark framework might have caused this?
>>
>> thanks
>> Yang
>>
>
> --
> Best Regards,
> Ayan Guha