You are right. "checkpointInterval" is only for data checkpointing; metadata checkpointing is done for every batch. Feel free to send a PR to add the missing doc.
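For illustration, here is a minimal sketch of how the two settings differ (the directory path and interval values are just placeholders, and the stream itself is elided):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("CheckpointExample")
val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batch interval

// Metadata checkpointing: enabled by setting a checkpoint directory.
// It is written for every batch, regardless of dstream.checkpoint(...).
ssc.checkpoint("hdfs:///checkpoints/my-app")

// Data checkpointing: dstream.checkpoint(interval) only controls how often
// the RDDs of that particular DStream are persisted to the checkpoint dir.
// val stream = ...                  // e.g. a stateful or windowed DStream
// stream.checkpoint(Seconds(50))    // e.g. ~10x the batch interval
```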
Best Regards,
Shixiong Zhu

2015-12-18 8:26 GMT-08:00 Lan Jiang <ljia...@gmail.com>:

> Need some clarification about the documentation. According to the Spark doc:
>
> *"the default interval is a multiple of the batch interval that is at least
> 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval).
> Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is
> a good setting to try."*
>
> My question is: does the *checkpointInterval* apply only to *data
> checkpointing*, or does it also apply to *metadata checkpointing*? The API
> says dstream.checkpoint() "enables periodic checkpointing of RDDs of this
> DStream", implying it is only for data checkpointing. My understanding is
> that metadata checkpointing is for driver failure. For example, with the
> Kafka direct API, the driver keeps track of the offset range of each
> partition. So if metadata checkpointing is NOT done for each batch, then on
> driver failure some messages from Kafka are going to be replayed.
>
> I cannot find anything in the documentation saying *whether metadata
> checkpointing is done for each batch* or whether the checkpointInterval
> setting applies to both types of checkpointing. Maybe I missed it. If anyone
> can point me to the right documentation, I would highly appreciate it.
>
> Best Regards,
>
> Lan