Neither of you is making any sense to me.  If you just have an RDD for
which you have specified a series of transformations but haven't run any
actions, then neither checkpointing nor saving makes sense yet -- you
haven't computed anything; you've only written out the recipe for how the
computation should be done when it is needed.  Nor does the "called
before any job" comment pose any restriction in this case, since no jobs
have yet been executed on the RDD.
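To make this concrete, here is a minimal sketch of both options from the
thread (the master setting, checkpoint directory, save path, and data are
made up for illustration).  The key point is that checkpoint() is itself
lazy: you mark the RDD before the first action, and the files are only
written when a job actually runs on it -- which is also why persisting
first avoids computing the chain twice.

import org.apache.spark.{SparkConf, SparkContext}

object CheckpointVsSave {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical directory

    // Only a recipe so far: no job has run, nothing has been computed.
    val rdd = sc.parallelize(1 to 1000000).map(_ * 2).filter(_ % 3 == 0)

    // Option 1: mark for checkpointing BEFORE any action. Persist first so
    // the checkpoint write reuses the cached data instead of recomputing
    // the whole lineage.
    rdd.cache()
    rdd.checkpoint()

    // The first action computes the RDD and then triggers a second job
    // that writes the checkpoint files and truncates the lineage.
    println(rdd.count())

    // Option 2: save explicitly and read it back for further processing.
    rdd.saveAsObjectFile("/tmp/rdd-snapshot") // hypothetical path
    val reloaded = sc.objectFile[Int]("/tmp/rdd-snapshot")
    println(reloaded.count())

    sc.stop()
  }
}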

On Wed, Mar 23, 2016 at 7:18 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> See the doc for checkpoint:
>
>    * Mark this RDD for checkpointing. It will be saved to a file inside
>    * the checkpoint directory set with `SparkContext#setCheckpointDir`
>    * and all references to its parent RDDs will be removed. This function
>    * must be called before any job has been executed on this RDD. It is
>    * strongly recommended that this RDD is persisted in memory, otherwise
>    * saving it on a file will require recomputation.
>
> From the above description, you should not call it at the end of
> transformations.
>
> Cheers
>
> On Wed, Mar 23, 2016 at 7:14 PM, Todd <bit1...@163.com> wrote:
>
>> Hi,
>>
>> I have a long computation chain and end up with a final RDD after a
>> series of transformations. I have two choices for what to do with this
>> last RDD:
>>
>> 1. Call checkpoint on the RDD to materialize it to disk
>> 2. Call RDD.saveXXX to save it to HDFS, and read it back for further
>> processing
>>
>> Which choice is better? It looks to me like there is not much
>> difference between the two.
>> Thanks!
>>
>
