You are right. "checkpointInterval" is only for data checkpointing; metadata checkpointing is done for every batch. Feel free to send a PR to add the missing doc.
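For illustration, here is a minimal sketch of how the two settings differ (the directory path and interval values are just placeholders, and the stream itself is elided):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("CheckpointExample")
val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batch interval

// Metadata checkpointing: enabled by setting a checkpoint directory.
// It is written for every batch, regardless of dstream.checkpoint(...).
ssc.checkpoint("hdfs:///checkpoints/my-app")

// Data checkpointing: dstream.checkpoint(interval) only controls how often
// the RDDs of that particular DStream are persisted to the checkpoint dir.
// val stream = ...                  // e.g. a stateful or windowed DStream
// stream.checkpoint(Seconds(50))    // e.g. ~10x the batch interval
```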
Best Regards,
Shixiong Zhu

2015-12-18 8:26 GMT-08:00 Lan Jiang <ljia...@gmail.com>:

> Need some clarification about the documentation. According to the Spark doc:
>
> *"the default interval is a multiple of the batch interval that is at least
> 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval).
> Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is
> a good setting to try."*
>
> My question is: does the *checkpointInterval* apply only to *data
> checkpointing*, or does it also apply to *metadata checkpointing*? The API
> says dstream.checkpoint() "enables periodic checkpointing of RDDs of this
> DStream", implying it is only for data checkpointing. My understanding is
> that metadata checkpointing is for driver failure. For example, with the
> Kafka direct API, the driver keeps track of the offset range of each
> partition. So if metadata checkpointing is NOT done for each batch, then on
> driver failure some messages from Kafka are going to be replayed.
>
> I cannot find anything in the documentation saying *whether metadata
> checkpointing is done for each batch* or whether the checkpointInterval
> setting applies to both types of checkpointing. Maybe I missed it. If anyone
> can point me to the right documentation, I would highly appreciate it.
>
> Best Regards,
>
> Lan