Hello,

I am trying to understand the *content* of a checkpoint and the
corresponding recovery. Understanding the process of checkpointing is
obviously the natural way to go about it, so I went over the following list:

   - medium post
   <https://medium.com/@adrianchang/apache-spark-checkpointing-ebd2ec065371>
   - SO
   
<https://stackoverflow.com/questions/35127720/what-is-the-difference-between-spark-checkpoint-and-persist-to-a-disk>
   - Spark docs
   
<https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing>
   - the well-known GitHub document by JerryLead
   
<https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/6-CacheAndCheckpoint.md>

I am still struggling to understand what actually ends up on disk at the
end of a checkpoint.

*My understanding of Spark Checkpointing:*

If you have really long DAGs and your Spark cluster fails, checkpointing
helps by persisting intermediate state, e.g. to HDFS. So a DAG of 50
transformations can be cut down to 4-5 transformations with the help of
checkpointing. It does break the lineage/DAG, though.
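
For plain (batch) RDDs, my mental model is roughly the sketch below (the
checkpoint path and names are just placeholders on my side); please correct
me if this picture is wrong:

import org.apache.spark.sql.SparkSession

object RddCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-checkpoint-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Reliable checkpoints are written under this directory (placeholder path).
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    // Pretend this RDD sits at the end of a long chain of transformations.
    val longLineage = sc.parallelize(1 to 1000000).map(_ * 2).filter(_ % 3 == 0)

    // Mark it for checkpointing; the data is materialized into the checkpoint
    // directory on the next action, and the lineage above it is truncated.
    longLineage.checkpoint()
    longLineage.count()

    // The debug string now shows a checkpointed RDD instead of the full chain.
    println(longLineage.toDebugString)

    spark.stop()
  }
}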

*Checkpointing in Streaming*

My Spark Streaming job has a microbatch interval of 5 seconds. As I
understand it, every 5 secs the *JobGenerator* builds the *RDD* DAG for the
new microbatch from the *DStreamGraph* and hands the resulting *job* to the
*JobScheduler* for execution, while the *receiver* in the meantime keeps
collecting data for the *next* microbatch of the next cycle. If I enable
checkpointing, as I understand it, Spark will periodically checkpoint the
"current state".

*Question:*

   1. What is this "state"? Is this the combination of *the base RDD and the
   state of the operators/transformations of the DAG for the present
   microbatch only*? So I have the following:

   ubatch 0 at T=0 ----> SUCCESS
   ubatch 1 at T=5 ----> SUCCESS
   ubatch 2 at T=10 ---> SUCCESS
   --------------------> Checkpointing kicks in now at T=12
   ubatch 3 at T=15 ---> SUCCESS
   ubatch 4 at T=20
   --------------------> Spark Cluster DOWN at T=23 => ubatch 4 FAILS!!!
   ...
   --------------------> Spark Cluster is restarted at *T=100*

   What specifically goes and sits on the disk as a result of checkpointing
   at *T=12*? Will it just store the *present state of operators of the DAG
   for ubatch 2*?

   a. If yes, then during recovery at *T=100*, the last checkpoint
   available is the one at *T=12*. What happens to *ubatch 3* at *T=15*,
   which was already processed successfully? Does the application reprocess
   *ubatch 3* and handle idempotency here? If yes, do we go back to the
   streaming source, e.g. Kafka, and rewind the offsets so we can replay the
   contents starting from *ubatch 3*? (Taking Kafka as the example source,
   my assumed setup is sketched below, after question b.)

   b. If no, then what exactly goes into the checkpoint directory at T=12?
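
For concreteness in 1a, taking Kafka as the example source, I am assuming a
typical direct-stream setup roughly like the one below (broker, group and
topic names are placeholders), where Spark itself tracks the consumed
offsets:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object KafkaSourceSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-source-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")   // placeholder path

    // Placeholder broker, group and topic names.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-streaming-app",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Direct stream: Spark tracks the consumed offsets per partition itself,
    // so (as I understand it) they end up in the checkpointed metadata and
    // could in principle be replayed from on recovery.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams)
    )

    stream.map(_.value).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}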

(I have also posted this question on Stack Overflow:
https://stackoverflow.com/questions/56362347/spark-checkpointing-content-recovery-and-idempotency)


Regards
