Hello,

I am trying to understand the *content* of a checkpoint and the
corresponding recovery. Understanding the process of checkpointing is
obviously the natural way to go about it, so I went over the following list:

   - medium post
   <https://medium.com/@adrianchang/apache-spark-checkpointing-ebd2ec065371>
   - SO
   
<https://stackoverflow.com/questions/35127720/what-is-the-difference-between-spark-checkpoint-and-persist-to-a-disk>
   - Spark docs
   
<https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing>
   - the well-known GitHub document by JerryLead
   
<https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/6-CacheAndCheckpoint.md>

I am still struggling to understand what actually ends up on disk at the
end of a checkpoint.

*My understanding of Spark Checkpointing:*

If you have really long DAGs and your Spark cluster fails, checkpointing
helps by persisting intermediate state, e.g. to HDFS. So a DAG of 50
transformations can be cut down to 4-5 transformations with the help of
checkpointing. It does break the lineage/DAG, though.
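
For plain (batch) RDDs, my mental model is roughly the sketch below (the
checkpoint path and names are just placeholders on my side); please correct
me if this picture is wrong:

import org.apache.spark.sql.SparkSession

object RddCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-checkpoint-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Reliable checkpoints are written under this directory (placeholder path).
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    // Pretend this RDD sits at the end of a long chain of transformations.
    val longLineage = sc.parallelize(1 to 1000000).map(_ * 2).filter(_ % 3 == 0)

    // Mark it for checkpointing; the data is materialized into the checkpoint
    // directory on the next action, and the lineage above it is truncated.
    longLineage.checkpoint()
    longLineage.count()

    // The debug string now shows a checkpointed RDD instead of the full chain.
    println(longLineage.toDebugString)

    spark.stop()
  }
}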

*Checkpointing in Streaming*

My Spark Streaming job has a microbatch interval of 5 seconds. As I
understand it, every 5 secs the *JobGenerator* builds the *RDD* DAG for the
new microbatch from the *DStreamGraph* and hands the resulting *job* to the
*JobScheduler* for execution, while the *receiver* in the meantime keeps
collecting data for the *next* microbatch of the next cycle. If I enable
checkpointing, as I understand it, Spark will periodically checkpoint the
"current state".

*Question:*

   1. What is this "state"? Is this the combination of *the base RDD and the
   state of the operators/transformations of the DAG for the present
   microbatch only*? So I have the following:

   ubatch 0 at T=0 ----> SUCCESS
   ubatch 1 at T=5 ----> SUCCESS
   ubatch 2 at T=10 ---> SUCCESS
   --------------------> Checkpointing kicks in now at T=12
   ubatch 3 at T=15 ---> SUCCESS
   ubatch 4 at T=20
   --------------------> Spark Cluster DOWN at T=23 => ubatch 4 FAILS!!!
   ...
   --------------------> Spark Cluster is restarted at *T=100*

   What specifically goes and sits on the disk as a result of checkpointing
   at *T=12*? Will it just store the *present state of operators of the DAG
   for ubatch 2*?

   a. If yes, then during recovery at *T=100*, the last checkpoint
   available is the one at *T=12*. What happens to *ubatch 3* at *T=15*,
   which was already processed successfully? Does the application reprocess
   *ubatch 3* and handle idempotency here? If yes, do we go back to the
   streaming source, e.g. Kafka, and rewind the offsets so we can replay the
   contents starting from *ubatch 3*? (Taking Kafka as the example source,
   my assumed setup is sketched below, after question b.)

   b. If no, then what exactly goes into the checkpoint directory at T=12?
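
For concreteness in 1a, taking Kafka as the example source, I am assuming a
typical direct-stream setup roughly like the one below (broker, group and
topic names are placeholders), where Spark itself tracks the consumed
offsets:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object KafkaSourceSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-source-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")   // placeholder path

    // Placeholder broker, group and topic names.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-streaming-app",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Direct stream: Spark tracks the consumed offsets per partition itself,
    // so (as I understand it) they end up in the checkpointed metadata and
    // could in principle be replayed from on recovery.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams)
    )

    stream.map(_.value).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}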

(I have also posted this question on Stack Overflow:
https://stackoverflow.com/questions/56362347/spark-checkpointing-content-recovery-and-idempotency)


Regards
