Hi *,

I am a little confused about checkpointing the Spark StreamingContext versus checkpointing an individual DStream.

E.g.:

    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
    jssc.checkpoint("hdfs://...");

This will start checkpointing the DStream operations, the configuration, and incomplete-batch information to HDFS. As I understand it, this will not checkpoint any DStream RDDs to HDFS.

We also have the option of individually checkpointing any DStream in the streaming context. When we start checkpointing an individual DStream, all the RDDs associated with that DStream will be checkpointed to HDFS:

    JavaDStream<SmartProjectionWrapper> transformedWindow =
        udsEventStream.window(windowDuration, aggDuration)
                      .transform(transformer);
    transformedWindow.checkpoint(aggDuration);

Questions:

1. What are the benefits of individually checkpointing each stream?
2. When the stream's input source is HDFS, does checkpointing an individual stream provide any benefit?
3. How does Spark use an individual stream's checkpoint to become fault tolerant?
4. According to the Spark documentation, "For stateful transformations that require RDD checkpointing, the default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try." Are all stateful transformations already checkpointed by default?

Thanks,

--
Regards,
Akash Mishra.

"It's not our abilities that make us, but our decisions." --Albus Dumbledore
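
To make the arithmetic in the documentation sentence quoted in question 4 concrete, here is a minimal sketch in plain Java (no Spark dependency). It assumes that "a multiple of the batch interval that is at least 10 seconds" means the smallest whole multiple of the interval that reaches 10 seconds; that reading, and the names CheckpointIntervalSketch, defaultCheckpointInterval and suggestedRange, are illustrative assumptions and not part of the Spark API.

```java
import java.time.Duration;

public class CheckpointIntervalSketch {
    // Floor taken from the quoted documentation: "at least 10 seconds".
    static final Duration MIN_INTERVAL = Duration.ofSeconds(10);

    // Assumption: the default is the smallest whole multiple of the
    // batch/slide interval that is >= 10 seconds.
    static Duration defaultCheckpointInterval(Duration slideInterval) {
        long slideMs = slideInterval.toMillis();
        if (slideMs <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        // Ceiling division: how many intervals are needed to reach the floor.
        long multiples = (MIN_INTERVAL.toMillis() + slideMs - 1) / slideMs;
        return Duration.ofMillis(multiples * slideMs);
    }

    // The "5 - 10 sliding intervals" rule of thumb from the same quote,
    // returned as {lower bound, upper bound}.
    static Duration[] suggestedRange(Duration slideInterval) {
        return new Duration[] {
            slideInterval.multipliedBy(5),
            slideInterval.multipliedBy(10)
        };
    }

    public static void main(String[] args) {
        // 1-second batch interval, as in Durations.seconds(1) above.
        Duration batch = Duration.ofSeconds(1);
        System.out.println("default:   " + defaultCheckpointInterval(batch));
        Duration[] range = suggestedRange(batch);
        System.out.println("suggested: " + range[0] + " to " + range[1]);
    }
}
```

Under this reading, a 1-second interval gives a 10-second default and a 4-second interval gives 12 seconds, so an explicit dstream.checkpoint(...) call would only matter when a different interval is wanted.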