Individual DStream Checkpointing in Spark Streaming

Akash Mishra Thu, 05 May 2016 08:41:35 -0700

Hi *,

I am a little confused about the difference between checkpointing the Spark
Streaming context and checkpointing an individual DStream.


E.g:

JavaStreamingContext jssc = new JavaStreamingContext(conf,
    Durations.seconds(1));

jssc.checkpoint("hdfs://...");


This will start checkpointing the DStream operations, the configuration, and
information about incomplete batches to HDFS. As I understand it, this will
not checkpoint any of the DStream RDDs to HDFS.

We also have the option of individually checkpointing any DStream in the
streaming context. When we start checkpointing an individual DStream, all the
RDDs associated with that DStream will be checkpointed to HDFS.

JavaDStream<SmartProjectionWrapper> transformedWindow =
    udsEventStream.window(windowDuration, aggDuration)
        .transform(transformer);

transformedWindow.checkpoint(aggDuration);



Questions:

1. What are the benefits of individually checkpointing each stream?
2. When the stream's input source is HDFS, does checkpointing an individual
stream provide any benefit?
3. How does Spark use an individual stream's checkpoint to become fault
tolerant?
4. According to the Spark documentation,
"For stateful transformations that require RDD checkpointing, the default
interval is a multiple of the batch interval that is at least 10 seconds.
It can be set by using dstream.checkpoint(checkpointInterval). Typically, a
checkpoint interval of 5 - 10 sliding intervals of a DStream is a good
setting to try."

Are all stateful transformations already checkpointed?
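
To check that I am reading the quoted rule correctly, here is the arithmetic
as I understand it. This is only a sketch and not Spark's own code: the class
and method names are hypothetical, and I am assuming the default is the
smallest multiple of the batch interval that reaches 10 seconds.

```java
import java.time.Duration;

// Hypothetical helper, not part of Spark: illustrates the quoted rule only.
final class CheckpointIntervalSketch {
    // The "at least 10 seconds" floor taken from the quoted documentation.
    static final Duration MIN_INTERVAL = Duration.ofSeconds(10);

    // Assumption: pick the smallest multiple of the batch interval
    // that is greater than or equal to MIN_INTERVAL.
    static Duration defaultInterval(Duration batchInterval) {
        long batchMs = batchInterval.toMillis();
        if (batchMs <= 0) {
            throw new IllegalArgumentException("batch interval must be positive");
        }
        long minMs = MIN_INTERVAL.toMillis();
        long multiples = (minMs + batchMs - 1) / batchMs; // ceiling division
        return Duration.ofMillis(multiples * batchMs);
    }

    public static void main(String[] args) {
        // Print the derived interval for a few sample batch intervals.
        for (long seconds : new long[] {1, 3, 15}) {
            Duration interval = defaultInterval(Duration.ofSeconds(seconds));
            System.out.println(seconds + "s batch -> "
                + interval.getSeconds() + "s checkpoint interval");
        }
    }
}
```

Under that reading, my 1 second batch interval would give a default
checkpoint interval of 10 seconds, while a 3 second batch would give 12
seconds.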



Thanks,

-- 

Regards,
Akash Mishra.


"It's not our abilities that make us, but our decisions."--Albus Dumbledore
