Hi all,
Incremental checkpointing can help a lot in improving the efficiency of fault tolerance and recovery in Flink. I wrote an initial design of incremental checkpointing in Flink, and am looking forwards for your comments. https://docs.google.com/document/d/1VvvPp09gGdVb9D2wHx6NX99yQK0jSUMWHQPrQ_mn520/edit?usp=sharing Some more issues, I think, are needed to be discussed in the introduction of incremental checkpointing. One is the implementation of savepoints. Savepoints are supposed to be full and independent of backend implementation. Currently, the implementation of Savepoints and Checkpoints are identical in backends. With the introduction of incremental checkpointing, I think backends should take different snapshots for them. Regards, Xiaogang