So if you just save an RDD to HDFS via 'saveAsSequenceFile', you then have to create a new RDD that reads that data back. This way you avoid recomputing the RDD, but you may lose time on saving/loading.
Exactly the same thing happens with 'checkpoint': it is basically just a convenience method that gives you the same RDD back. However, if your job fails, there is no way for a new job to reuse data that was already 'checkpoint'ed by a previous failed run. That's where having a custom checkpointer helps.

Another note: you cannot delete 'checkpoint'ed data within the same job; you have to delete it some other way.

BTW, have you tried '.persist(StorageLevel.DISK_ONLY)'? It caches data to local disk, freeing up space in the JVM and letting you avoid HDFS.

On Wednesday, August 2, 2017, Vadim Semenov <vadim.seme...@datadoghq.com> wrote:

> `saveAsObjectFile` doesn't save the DAG; it acts as a typical action, so
> it just saves data to some destination.
>
> `cache`/`persist` allow you to cache data and keep the DAG, so in case an
> executor that holds data goes down, Spark is still able to recalculate
> the missing partitions.
>
> `localCheckpoint` allows you to sacrifice fault tolerance and truncate
> the DAG, so if some executor goes down, the job will fail, because it has
> already forgotten the DAG.
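A minimal sketch of the "custom checkpoint" pattern described above, i.e. save the RDD yourself and read it back as a fresh RDD with no lineage, so a restarted job can pick up the saved data. The path, app name, and data are illustrative assumptions, not from the thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ManualCheckpoint {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("manual-checkpoint").setMaster("local[*]"))

    // Hypothetical path; a real job would point at shared storage (HDFS/S3)
    // so that a *new* job after a failure can read the same location.
    val savePath = "/tmp/manual-checkpoint/stage1"

    val expensive = sc.parallelize(1 to 1000).map(x => (x % 10, x.toLong))

    // "Manual checkpoint": save, then re-read. The reloaded RDD starts a
    // fresh DAG, so the expensive lineage is truncated, and unlike
    // RDD.checkpoint the data survives for later runs to reuse.
    expensive.saveAsSequenceFile(savePath)
    val reloaded = sc.sequenceFile[Int, Long](savePath)

    // Alternative from the thread: keep the DAG but spill blocks to local
    // disk instead of the JVM heap, avoiding HDFS entirely.
    val spilled = expensive.persist(StorageLevel.DISK_ONLY)

    println(reloaded.count() + " " + spilled.count())
    sc.stop()
  }
}
```

The trade-off is the one stated above: the save/re-read variant costs I/O but survives job failure; 'persist(DISK_ONLY)' is cheaper but the cached blocks die with the executors.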
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1551-L1610
>
> And `checkpoint` allows you to save data to some shared storage and
> truncate the DAG, so if an executor goes down, the job will be able to
> take the missing partitions from the place where it saved the RDD:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1533-L1549
>
> On Wed, Aug 2, 2017 at 7:20 PM, Suzen, Mehmet <su...@acm.org> wrote:
>
>> On 3 August 2017 at 01:05, jeff saremi <jeffsar...@hotmail.com> wrote:
>> > Vadim:
>> >
>> > This is from the Mastering Spark book:
>> >
>> > "It is strongly recommended that a checkpointed RDD is persisted in
>> > memory, otherwise saving it on a file will require recomputation."
>>
>> Is this really true? I had the impression that the DAG will not be
>> carried over once the RDD is serialized to an external file, so does
>> 'saveAsObjectFile' save the DAG as well?