There are different kinds of checkpointing going on. updateStateByKey requires RDD checkpointing, which can be enabled just by calling sparkContext.setCheckpointDir. But that does not enable Spark Streaming driver checkpoints, which are necessary for recovering from driver failures. Those are enabled only by streamingContext.checkpoint(...), which internally calls sparkContext.setCheckpointDir and also turns on checkpointing of the streaming metadata (the DStream graph), so everything referenced by your DStream operations then has to be serializable.
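
Roughly, the two setups look like the sketch below. This is only an illustration, not code from the thread: the object name, checkpoint path, socket source and update function are all made up, and it assumes Spark 1.3 or later (on 1.2 and earlier you would also need `import org.apache.spark.streaming.StreamingContext._` to get updateStateByKey on pair DStreams).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointSketch {
  // Hypothetical checkpoint location; any fault-tolerant directory would do.
  val checkpointDir = "hdfs:///tmp/checkpoint-sketch"

  // State update for a running count per key.
  def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(state.getOrElse(0) + newValues.sum)

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // RDD checkpointing only (reported in this thread to be enough for
    // updateStateByKey, but gives no driver recovery):
    //   ssc.sparkContext.setCheckpointDir(checkpointDir)

    // RDD checkpointing plus streaming metadata checkpointing
    // (needed for recovering from driver failures):
    ssc.checkpoint(checkpointDir)

    // Hypothetical input: lines of text from a local socket.
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val counts = words.map(w => (w, 1)).updateStateByKey[Int](updateCount _)
    counts.print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Rebuilds the context from checkpoint data if present, else creates it.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

With the second form, the whole DStream setup lives inside createContext so that getOrCreate can rebuild it from the checkpoint after a driver restart.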

TD

On Mon, Feb 23, 2015 at 1:28 AM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Sean,
>
> thanks for your message!
>
> On Mon, Feb 23, 2015 at 6:03 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> What I haven't investigated is whether you can enable checkpointing
>> for the state in updateStateByKey separately from this mechanism,
>> which is exactly your question. What happens if you set a checkpoint
>> dir, but do *not* use StreamingContext.getOrCreate, but *do* call
>> DStream.checkpoint?
>>
>
> I didn't even use StreamingContext.getOrCreate(), just calling
> streamingContext.checkpoint(...) blew everything up. Well, "blew up" in the
> sense that actor.OneForOneStrategy will print the stack trace of
> the java.io.NotSerializableException every couple of seconds and
> "something" is not going right with execution (I think).
>
> BUT, indeed, just calling sparkContext.setCheckpointDir seems to be
> sufficient for updateStateByKey! Looking at what
> streamingContext.checkpoint() does, I don't get why ;-) and I am not sure
> that this is a robust solution, but in fact that seems to work!
>
> Thanks a lot,
> Tobias