I need some clarification about the Spark Streaming checkpointing documentation. According to the Spark docs: "the default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try."
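For concreteness, here is a minimal sketch (in plain Python, not Spark code) of what the quoted default rule would compute, assuming "a multiple of the batch interval that is at least 10 seconds" means the smallest such multiple; the function name is mine, not a Spark API:

```python
import math

def default_checkpoint_interval(batch_interval_s):
    """Illustrative only: smallest multiple of the batch interval that is
    at least 10 seconds, per the rule quoted from the Spark docs. The
    docs further suggest trying 5-10 sliding intervals when tuning."""
    return batch_interval_s * math.ceil(10.0 / batch_interval_s)

# With a 3-second batch interval, the default would be 12 seconds
# (4 batches); with a 15-second batch interval, 15 seconds (1 batch).
print(default_checkpoint_interval(3))   # -> 12
print(default_checkpoint_interval(15))  # -> 15
```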
My question: does the checkpointInterval apply only to data checkpointing, or does it also apply to metadata checkpointing? The API describes dstream.checkpoint() as "Enable periodic checkpointing of RDDs of this DStream", implying it covers only data checkpointing.

My understanding is that metadata checkpointing is for recovering from driver failure. For example, with the Kafka direct API, the driver keeps track of the offset range of each partition. So if a metadata checkpoint is NOT taken for each batch, then on driver failure some messages from Kafka are going to be replayed.

I cannot find anything in the documentation saying whether metadata checkpointing is done for each batch, or whether the checkpointInterval setting applies to both types of checkpointing. Maybe I missed it. If anyone can point me to the right documentation, I would highly appreciate it.

Best Regards,
Lan