Neither of you is making any sense to me. If you just have an RDD for which you have specified a series of transformations but you haven't run any actions, then neither checkpointing nor saving makes sense -- you haven't computed anything yet, you've only written out the recipe for how the computation should be done when it is needed. Neither does the "called before any job" comment pose any restriction in this case since no jobs have yet been executed on the RDD.
On Wed, Mar 23, 2016 at 7:18 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> See the doc for checkpoint:
>
>  * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
>  * directory set with `SparkContext#setCheckpointDir` and all references to its parent
>  * RDDs will be removed. This function must be called before any job has been
>  * executed on this RDD. It is strongly recommended that this RDD is persisted in
>  * memory, otherwise saving it on a file will require recomputation.
>
> From the above description, you should not call it at the end of transformations.
>
> Cheers
>
> On Wed, Mar 23, 2016 at 7:14 PM, Todd <bit1...@163.com> wrote:
>
>> Hi,
>>
>> I have a long computing chain, and I get the last RDD after a series of
>> transformations. I have two choices for what to do with this last RDD:
>>
>> 1. Call checkpoint on the RDD to materialize it to disk
>> 2. Call RDD.saveXXX to save it to HDFS, and read it back for further
>> processing
>>
>> I would ask which choice is better? It looks to me that there is not much
>> difference between the two choices.
>> Thanks!
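To make the distinction concrete, here is a minimal sketch of the two options Todd describes. This is not a definitive recipe: it assumes an already-constructed `SparkContext` named `sc`, and all HDFS paths and the `toUpperCase` transformation are hypothetical placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical context setup; in a real job you would already have `sc`.
val sc = new SparkContext(new SparkConf().setAppName("checkpoint-vs-save"))

// A long transformation chain -- nothing is computed yet, this is only the recipe.
val rdd = sc.textFile("hdfs:///data/input")   // hypothetical input path
  .map(_.toUpperCase)                          // ...stand-in for many transformations

// Option 1: checkpoint. Note it only MARKS the RDD; per the doc quoted above,
// it must be called before any action has run a job on this RDD, and the
// materialization happens on the first action. Persisting first is recommended
// so the chain is not recomputed when the checkpoint file is written.
sc.setCheckpointDir("hdfs:///tmp/checkpoints") // hypothetical directory
rdd.persist()
rdd.checkpoint()
rdd.count()                                    // first action: runs the job AND writes the checkpoint

// Option 2: explicit save and re-read. This works at any point and gives back
// a fresh RDD whose lineage starts at the saved files rather than the original chain.
rdd.saveAsObjectFile("hdfs:///tmp/intermediate") // hypothetical output path
val reloaded = sc.objectFile[String]("hdfs:///tmp/intermediate")
```

Either way, the key point from the reply above stands: until an action runs, neither call has computed anything, so "materialize to disk" only happens once a job is actually executed.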