Yes, the terminology is being used loosely and non-standardly in this
thread. "The last RDD" after a series of transformations is still the RDD
at the beginning of the chain, just now with an attached chain of "to be
done" transformations that will execute when an action is eventually run.
If the saveXXX action is the only action being performed on the RDD, the
rest of the chain being purely transformations, then checkpointing instead
of saving still wouldn't execute any action on the RDD. It would just mark
the point at which checkpointing should be done when an action is
eventually run.
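
A minimal sketch of that behaviour, assuming a local Spark setup. The
object name CheckpointSketch, the toy data, and the checkpoint directory
are made up for illustration; setCheckpointDir, cache, checkpoint,
isCheckpointed and count are the standard SparkContext/RDD calls.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  // Returns (isCheckpointed before any action, element count,
  // isCheckpointed after the first action).
  def run(sc: SparkContext, checkpointDir: String): (Boolean, Long, Boolean) = {
    sc.setCheckpointDir(checkpointDir)

    // Transformations only: this builds the recipe, nothing is computed yet.
    val last = sc.parallelize(1 to 1000).map(_ * 2).filter(_ % 3 == 0)

    last.cache()      // avoids recomputing the chain when the checkpoint is written
    last.checkpoint() // only marks the RDD for checkpointing; no job runs here

    val before = last.isCheckpointed
    val n = last.count() // first action: runs the chain, then the checkpoint is written
    val after = last.isCheckpointed
    (before, n, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // "/tmp/spark-checkpoints" is a hypothetical directory chosen for this sketch.
      println(run(sc, "/tmp/spark-checkpoints"))
    } finally {
      sc.stop()
    }
  }
}
```

I would expect the flag to be false right after checkpoint() and true only
once count() has run, since the checkpoint is written as part of the first
job on the RDD rather than by the checkpoint() call itself.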

On Wed, Mar 23, 2016 at 7:38 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> bq. when I get the last RDD
> If I read Todd's first email correctly, the computation has been done.
> I could be wrong.
>
> On Wed, Mar 23, 2016 at 7:34 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> Neither of you is making any sense to me.  If you just have an RDD for
>> which you have specified a series of transformations but you haven't run
>> any actions, then neither checkpointing nor saving makes sense -- you
>> haven't computed anything yet, you've only written out the recipe for how
>> the computation should be done when it is needed.  Neither does the "called
>> before any job" comment pose any restriction in this case since no jobs
>> have yet been executed on the RDD.
>>
>> On Wed, Mar 23, 2016 at 7:18 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> See the doc for checkpoint:
>>>
>>>    * Mark this RDD for checkpointing. It will be saved to a file inside
>>>    * the checkpoint directory set with `SparkContext#setCheckpointDir`
>>>    * and all references to its parent RDDs will be removed. *This
>>>    * function must be called before any job has been executed on this
>>>    * RDD*. It is strongly recommended that this RDD is persisted in
>>>    * memory, otherwise saving it on a file will require recomputation.
>>>
>>> From the above description, you should not call it at the end of
>>> transformations.
>>>
>>> Cheers
>>>
>>> On Wed, Mar 23, 2016 at 7:14 PM, Todd <bit1...@163.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a long computation chain. Once I get the last RDD after a
>>>> series of transformations, I have two choices for this last RDD:
>>>>
>>>> 1. Call checkpoint on RDD to materialize it to disk
>>>> 2. Call RDD.saveXXX to save it to HDFS, and read it back for further
>>>> processing
>>>>
>>>> Which choice is better? It looks to me like there is not much
>>>> difference between the two.
>>>> Thanks!
>>>>
>>>>
>>>>
>>>
>>
>
