As Mark said, checkpoint() can be called before any action has been run on the RDD.

The choice between checkpoint and saveXXX depends on your use case. If you
just want to cut the long RDD lineage, and the data won’t be re-used later,
then use checkpoint, because it is simple and the checkpoint data will be
cleaned up automatically. Note that a reliable checkpoint carries a small
performance penalty, because the RDD is computed a second time by the job
that writes the checkpoint data. To avoid the recomputation, you can either
call RDD.cache before checkpoint or choose localCheckpoint, as in the sketch
below.
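
For reference, a minimal spark-shell sketch of the cache-before-checkpoint
pattern (the paths are placeholders; sc is the SparkContext that spark-shell
predefines):

    // Reliable checkpoints require a checkpoint directory (typically on HDFS).
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    val rdd = sc.textFile("hdfs:///data/input")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    rdd.cache()       // keep the computed partitions so the checkpoint job
                      // does not have to recompute the whole lineage
    rdd.checkpoint()  // only marks the RDD; data is written on the first action

    // Alternative: localCheckpoint() persists to executor storage instead of
    // the reliable checkpoint directory, trading fault tolerance for speed.
    // rdd.localCheckpoint()

    rdd.count()       // first action: computes the RDD and writes the checkpoint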

If you want to reuse the data in another application, use saveXXX(), because
you can re-create an RDD from the saved data. By contrast, there is no way to
create an RDD from checkpoint data (it may be possible in Spark Streaming,
but I'm not sure). See the sketch below.
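
Continuing the sketch above (the path is again a placeholder, and I'm assuming
the element type is (String, Int) as in that sketch), saving and re-creating
the RDD could look like this:

    // Save in one application ...
    rdd.saveAsObjectFile("hdfs:///data/wordcounts")

    // ... and rebuild the RDD later, possibly in a different application.
    val restored = sc.objectFile[(String, Int)]("hdfs:///data/wordcounts")
    restored.take(5).foreach(println)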

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Friday, March 25, 2016 5:34 AM
To: Mark Hamstra <m...@clearstorydata.com>
Cc: Todd <bit1...@163.com>; user@spark.apache.org
Subject: Re: What's the benefit of RDD checkpoint against RDD save

Thanks, Mark.

Since the checkpoint data may get cleaned up later on, it seems option #2
(saveXXX) is viable.

On Wed, Mar 23, 2016 at 8:01 PM, Mark Hamstra 
<m...@clearstorydata.com<mailto:m...@clearstorydata.com>> wrote:
Yes, the terminology is being used sloppily/non-standardly in this thread --
"the last RDD" after a series of transformations is the RDD at the beginning
of the chain, just now with an attached chain of "to be done" transformations
that will execute when an action is eventually run.  If the saveXXX action is
the only action being performed on the RDD, the rest of the chain being purely
transformations, then checkpointing instead of saving still wouldn't execute
any action on the RDD -- it would just mark the point at which checkpointing
should be done when an action is eventually run.
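
To illustrate the point, a hedged spark-shell sketch (paths and the trivial
transformation are made up):

    val rdd = sc.textFile("hdfs:///data/input").map(_.trim)  // transformations only
    rdd.checkpoint()                   // no job runs here; the RDD is merely marked
    rdd.saveAsTextFile("hdfs:///out")  // the first action computes the RDD, saves it,
                                       // and then a follow-up job writes the checkpoint
                                       // (recomputing the RDD unless it was cached)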

On Wed, Mar 23, 2016 at 7:38 PM, Ted Yu 
<yuzhih...@gmail.com<mailto:yuzhih...@gmail.com>> wrote:
bq. when I get the last RDD
If I read Todd's first email correctly, the computation has been done.
I could be wrong.

On Wed, Mar 23, 2016 at 7:34 PM, Mark Hamstra 
<m...@clearstorydata.com<mailto:m...@clearstorydata.com>> wrote:
Neither of you is making any sense to me.  If you just have an RDD for which
you have specified a series of transformations but you haven't run any
actions, then neither checkpointing nor saving makes sense -- you haven't
computed anything yet; you've only written out the recipe for how the
computation should be done when it is needed.  Neither does the "called before
any job" comment pose any restriction in this case, since no jobs have yet
been executed on the RDD.

On Wed, Mar 23, 2016 at 7:18 PM, Ted Yu 
<yuzhih...@gmail.com<mailto:yuzhih...@gmail.com>> wrote:
See the doc for checkpoint:

   * Mark this RDD for checkpointing. It will be saved to a file inside the
   * checkpoint directory set with `SparkContext#setCheckpointDir` and all
   * references to its parent RDDs will be removed. This function must be
   * called before any job has been executed on this RDD. It is strongly
   * recommended that this RDD is persisted in memory, otherwise saving it
   * on a file will require recomputation.

From the above description, you should not call it at the end of 
transformations.

Cheers

On Wed, Mar 23, 2016 at 7:14 PM, Todd <bit1...@163.com<mailto:bit1...@163.com>> 
wrote:
Hi,

I have a long computing chain; when I get the last RDD after a series of
transformations, I have two choices for what to do with it:

1. Call checkpoint on the RDD to materialize it to disk.
2. Call RDD.saveXXX to save it to HDFS, and read it back for further processing.

Which choice is better? It looks to me like there is not much difference
between the two.
Thanks!