You can use `.checkpoint()`:

```
// Assumes `sc` is your existing SparkContext and `myrdd` is the RDD you want to reuse.
sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory")
myrdd.checkpoint() // Only marks the RDD; nothing is written until the first action

val result1 = myrdd.map(op1(_))
result1.count() // Will save `myrdd` to HDFS and do map(op1…

val result2 = myrdd.map(op2(_))
result2.count() // Will load `myrdd` from HDFS and do map(op2…
```
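
To answer the literal question as well: yes, an RDD written to HDFS and read back can feed more than one operation. Below is a minimal sketch of that explicit save-and-reload route, not a tested drop-in for your job. The object name, the `Int` element type, the toy `op1`/`op2`, the `local[*]` master, and passing the output path as the first program argument are all placeholders I made up for illustration; `saveAsObjectFile` and `objectFile` are the standard RDD and SparkContext calls.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SaveAndReload {
  // Hypothetical stand-ins for op1 and op2 from the question.
  def op1(x: Int): Int = x + 1
  def op2(x: Int): Int = x * 2

  def main(args: Array[String]): Unit = {
    // Hypothetical: output directory passed as the first argument,
    // e.g. an hdfs:// path on a cluster or a local path for a quick test.
    // The directory must not exist yet.
    val path = args(0)

    val conf = new SparkConf().setAppName("save-and-reload").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the long chain of transformations.
      val myrdd: RDD[Int] = sc.parallelize(1 to 100).map(_ * 3)

      // Runs the transformations once and writes the result out.
      myrdd.saveAsObjectFile(path)

      // The reloaded RDD's lineage starts at the saved files,
      // so the transformations above are not recomputed.
      val reloaded: RDD[Int] = sc.objectFile[Int](path)

      val result1 = reloaded.map(op1)
      val result2 = reloaded.map(op2)
      println(result1.count())
      println(result2.count())
    } finally {
      sc.stop()
    }
  }
}
```

Each action on `reloaded` still reads the files from storage again, so this trades recomputation for I/O, and you have to delete the saved directory yourself when you are done.
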
On Tue, Aug 1, 2017 at 2:05 PM, jeff saremi <jeffsar...@hotmail.com> wrote:

> Calling cache/persist fails all our jobs (i have posted 2 threads on
> this).
>
> And we're giving up hope in finding a solution.
> So I'd like to find a workaround for that:
>
> If I save an RDD to hdfs and read it back, can I use it in more than one
> operation?
>
> Example: (using cache)
> // do a whole bunch of transformations on an RDD
>
> myrdd.cache()
>
> val result1 = myrdd.map(op1(_))
>
> val result2 = myrdd.map(op2(_))
>
> // in the above I am assuming that a call to cache will prevent all
> previous transformation from being calculated twice
>
> I'd like to somehow get result1 and result2 without duplicating work. How
> can I do that?
>
> thanks
>
> Jeff