Re: Dataset API Question

Wenchen Fan Wed, 25 Oct 2017 13:50:24 -0700

It's because of different API design.

*RDD.checkpoint* returns void, which means it mutates the RDD state so you
need a *RDD**.isCheckpointed* method to check if this RDD is checkpointed.


*Dataset.checkpoint* returns a new Dataset, which means there is no
isCheckpointed state in Dataset, and thus we don't need a
*Dataset.isCheckpointed* method.


On Wed, Oct 25, 2017 at 6:39 PM, Bernard Jesop <bernard.je...@gmail.com>
wrote:

> Actually, I realized keeping the info would not be enough as I need to
> find back the checkpoint files to delete them :/
>
> 2017-10-25 19:07 GMT+02:00 Bernard Jesop <bernard.je...@gmail.com>:
>
>> As far as I understand, Dataset.rdd is not the same as InternalRDD.
>> It is just another RDD representation of the same Dataset and is created
>> on demand (lazy val) when Dataset.rdd is called.
>> This totally explains the observed behavior.
>>
>> But how would would it be possible to know that a Dataset have been
>> checkpointed?
>> Should I manually keep track of that info?
>>
>> 2017-10-25 15:51 GMT+02:00 Bernard Jesop <bernard.je...@gmail.com>:
>>
>>> Hello everyone,
>>>
>>> I have a question about checkpointing on dataset.
>>>
>>> It seems in 2.1.0 that there is a Dataset.checkpoint(), however unlike
>>> RDD there is no Dataset.isCheckpointed().
>>>
>>> I wonder if Dataset.checkpoint is a syntactic sugar for
>>> Dataset.rdd.checkpoint.
>>> When I do :
>>>
>>> Dataset.checkpoint; Dataset.count
>>> Dataset.rdd.isCheckpointed // result: false
>>>
>>> However, when I explicitly do:
>>> Dataset.rdd.checkpoint; Dataset.rdd.count
>>> Dataset.rdd.isCheckpointed // result: true
>>>
>>> Could someone explain this behavior to me, or provide some references?
>>>
>>> Best regards,
>>> Bernard
>>>
>>
>>
>

Re: Dataset API Question

Reply via email to