Also check the `RDD.checkpoint()` method https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1533-L1550
On Wed, Aug 2, 2017 at 8:46 PM, Vadim Semenov <vadim.seme...@datadoghq.com> wrote: > I'm not sure that "checkpointed" means the same thing in that sentence. > > You can run a simple test using `spark-shell`: > > sc.setCheckpointDir("/tmp/checkpoint") > val rdd = sc.parallelize(1 to 10).map(x => { > Thread.sleep(1000) > x > }) > rdd.checkpoint() > rdd.foreach(println) // Will take 10 seconds > rdd.foreach(println) // Will be instant, because the RDD is checkpointed > > On Wed, Aug 2, 2017 at 7:05 PM, jeff saremi <jeffsar...@hotmail.com> > wrote: > >> Vadim: >> >> This is from the Mastering Spark book: >> >> *"It is strongly recommended that a checkpointed RDD is persisted in >> memory, otherwise saving it on a file will require recomputation."* >> >> >> To me that means checkpoint will not prevent the recomputation that i was >> hoping for >> ------------------------------ >> *From:* Vadim Semenov <vadim.seme...@datadoghq.com> >> *Sent:* Tuesday, August 1, 2017 12:05:17 PM >> *To:* jeff saremi >> *Cc:* user@spark.apache.org >> *Subject:* Re: How can i remove the need for calling cache >> >> You can use `.checkpoint()`: >> ``` >> val sc: SparkContext >> sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory") >> myrdd.checkpoint() >> val result1 = myrdd.map(op1(_)) >> result1.count() // Will save `myrdd` to HDFS and do map(op1… >> val result2 = myrdd.map(op2(_)) >> result2.count() // Will load `myrdd` from HDFS and do map(op2… >> ``` >> >> On Tue, Aug 1, 2017 at 2:05 PM, jeff saremi <jeffsar...@hotmail.com> >> wrote: >> >>> Calling cache/persist fails all our jobs (i have posted 2 threads on >>> this). >>> >>> And we're giving up hope in finding a solution. >>> So I'd like to find a workaround for that: >>> >>> If I save an RDD to hdfs and read it back, can I use it in more than one >>> operation? >>> >>> Example: (using cache) >>> // do a whole bunch of transformations on an RDD >>> >>> myrdd.cache() >>> >>> val result1 = myrdd.map(op1(_)) >>> >>> val result2 = myrdd.map(op2(_)) >>> >>> // in the above I am assuming that a call to cache will prevent all >>> previous transformation from being calculated twice >>> >>> I'd like to somehow get result1 and result2 without duplicating work. >>> How can I do that? >>> >>> thanks >>> >>> Jeff >>> >> >> >