Re: Why must the dstream.foreachRDD(...) parameter be serializable?

Tathagata Das Mon, 23 Feb 2015 11:40:20 -0800

There are different kinds of checkpointing going on. updateStateByKey
requires RDD checkpointing which can be enabled only by called
sparkContext.setCheckpointDirectory. But that does not enable Spark
Streaming driver checkpoints, which is necessary for recovering from driver
failures. That is enabled only by streamingContext.checkpoint(...) which
internally calls sparkContext.setCheckpointDirectory and also enables other
stuff.


TD

On Mon, Feb 23, 2015 at 1:28 AM, Tobias Pfeiffer <t...@preferred.jp> wrote:

> Sean,
>
> thanks for your message!
>
> On Mon, Feb 23, 2015 at 6:03 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> What I haven't investigated is whether you can enable checkpointing
>> for the state in updateStateByKey separately from this mechanism,
>> which is exactly your question. What happens if you set a checkpoint
>> dir, but do *not* use StreamingContext.getOrCreate, but *do* call
>> DStream.checkpoint?
>>
>
> I didn't even use StreamingContext.getOrCreate(), just calling
> streamingContext.checkpoint(...) blew everything up. Well, "blew up" in the
> sense that actor.OneForOneStrategy will print the stack trace of
> the java.io.NotSerializableException every couple of seconds and
> "something" is not going right with execution (I think).
>
> BUT, indeed, just calling sparkContext.setCheckpointDir seems to be
> sufficient for updateStateByKey! Looking at what
> streamingContext.checkpoint() does, I don't get why ;-) and I am not sure
> that this is a robust solution, but in fact that seems to work!
>
> Thanks a lot,
> Tobias
>
>

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

Reply via email to