Hello, I am trying to understand the *content* of a checkpoint and the corresponding recovery. Understanding the checkpointing process is obviously the natural way to go about it, so I went over the following:
- medium post <https://medium.com/@adrianchang/apache-spark-checkpointing-ebd2ec065371>
- SO <https://stackoverflow.com/questions/35127720/what-is-the-difference-between-spark-checkpoint-and-persist-to-a-disk>
- Spark docs <https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing>
- the well-known GitHub document by JerryLead <https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/6-CacheAndCheckpoint.md>

I am still struggling to understand what actually goes and sits on disk at the end of a checkpoint.

*My understanding of Spark checkpointing:*

If you have a really long DAG and your Spark cluster fails, checkpointing helps by persisting intermediate state, e.g. to HDFS. So a DAG of 50 transformations can effectively be reduced to 4-5 transformations with the help of checkpointing. It breaks (truncates) the DAG, though.

*Checkpointing in Streaming*

My Spark Streaming job uses a 5-second microbatch interval. As I understand it, every 5 seconds the *JobGenerator* (driven by the *JobScheduler*) builds the *RDD* DAG for the new microbatch from the *DStreamGraph* and submits it as a *job*, while the *receiver* in the meantime keeps collecting the data for the *next* microbatch of the next cycle. If I enable checkpointing, as I understand, it will periodically checkpoint the "current state" (my driver setup is sketched at the bottom of this post).

*Question:*

1. What is this "state"? Is it the combination of *the base RDD and the state of the operators/transformations of the DAG for the present microbatch only*? So, say I have the following timeline:

```
ubatch 0 at T=0  ----> SUCCESS
ubatch 1 at T=5  ----> SUCCESS
ubatch 2 at T=10 ----> SUCCESS
--------------------> Checkpointing kicks in now at T=12
ubatch 3 at T=15 ----> SUCCESS
ubatch 4 at T=20
--------------------> Spark Cluster DOWN at T=23 => ubatch 4 FAILS!!!
...
--------------------> Spark Cluster is restarted at T=100
```

What specifically goes and sits on disk as a result of the checkpoint at *T=12*? Will it store just the *present state of the operators of the DAG for ubatch 2*?

a. If yes, then during recovery at *T=100* the last available checkpoint is the one at *T=12*. What happens to *ubatch 3* at *T=15*, which was already processed successfully? Does the application reprocess *ubatch 3* and leave idempotency to be handled by me? If yes, do we go back to the streaming source, e.g. Kafka, and rewind the offsets to replay the contents starting from *ubatch 3*?

b. If no, then what exactly goes into the checkpoint directory at *T=12*?

Regards
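
P.S. Not essential to the question, but for concreteness, here is a minimal sketch of how I understand checkpointing and recovery are wired up in the driver with the classic DStream API. The checkpoint directory, app name and socket source below are placeholders for illustration, not my actual job:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointDemo {

  // Placeholder checkpoint location; a real job would point at a reliable store such as HDFS.
  val checkpointDir = "hdfs:///tmp/spark-checkpoints"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpoint-demo")
    val ssc  = new StreamingContext(conf, Seconds(5)) // 5-second microbatch interval
    ssc.checkpoint(checkpointDir)                     // enable checkpointing into checkpointDir

    // Placeholder source for this sketch; a real job would typically read from Kafka
    // (e.g. KafkaUtils.createDirectStream) instead of a socket.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.map(_.length).print()

    ssc
  }

  def main(args: Array[String]): Unit = {
    // Cold start: createContext() builds a fresh StreamingContext.
    // Restart after a failure: the StreamingContext (DStreamGraph, batch interval,
    // incomplete batches) is instead rebuilt from the data under checkpointDir.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```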