I need some clarification about the documentation. According to the Spark docs:

"the default interval is a multiple of the batch interval that is at least 10 
seconds. It can be set by using dstream.checkpoint(checkpointInterval). 
Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a 
good setting to try.”
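
For concreteness, this is the kind of per-DStream setting I am asking about 
(a minimal sketch of my understanding; the directory, host, port, and 
intervals are placeholders I made up, not my real job):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Batch interval of 2 seconds, with a per-DStream checkpoint interval
  // of 10 seconds (5 batches), following the quoted guidance.
  val conf = new SparkConf().setAppName("CheckpointIntervalExample")
  val ssc = new StreamingContext(conf, Seconds(2))
  ssc.checkpoint("hdfs:///checkpoints/example")  // placeholder directory

  val lines = ssc.socketTextStream("localhost", 9999)
  val counts = lines
    .flatMap(_.split(" "))
    .map((_, 1L))
    .updateStateByKey((vals: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + vals.sum))

  // Data (RDD) checkpointing for this DStream every 10 seconds.
  counts.checkpoint(Seconds(10))

  ssc.start()
  ssc.awaitTermination()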

My question is: does the checkpoint interval apply only to data 
checkpointing, or does it also apply to metadata checkpointing? The API says 
dstream.checkpoint() will "Enable periodic checkpointing of RDDs of this 
DStream", implying it is only for data checkpointing. My understanding is 
that metadata checkpointing is what protects against driver failure. For 
example, with the Kafka direct API, the driver keeps track of the offset 
range of each partition. So if a metadata checkpoint is NOT taken for each 
batch, then on driver failure some messages from Kafka are going to be 
replayed. 
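
To be concrete about the driver-failure case, my streaming context is set up 
roughly like the documented StreamingContext.getOrCreate recovery pattern 
(again just a sketch; the path and app name are placeholders):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Metadata checkpointing is enabled by ssc.checkpoint(dir); on restart
  // after a driver failure, getOrCreate rebuilds the context (and, for the
  // Kafka direct API, the tracked offset ranges) from that directory.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("DriverRecoveryExample")
    val ssc = new StreamingContext(conf, Seconds(2))
    ssc.checkpoint("hdfs:///checkpoints/recovery")  // placeholder directory
    // ... define the Kafka direct stream and output operations here ...
    ssc
  }

  val ssc = StreamingContext.getOrCreate(
    "hdfs:///checkpoints/recovery", createContext _)
  ssc.start()
  ssc.awaitTermination()

My worry is about how much of the offset metadata in that directory can be 
stale at the moment of failure.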

I cannot find anything in the documentation saying whether metadata 
checkpointing is done for each batch, or whether the checkpoint interval 
setting applies to both types of checkpointing. Maybe I missed it. If anyone 
can point me to the right documentation, I would greatly appreciate it.

Best Regards,

Lan
