Hi all,
The checkpoint failure handling logic is currently somewhat confusing (unfocused), which leaves some functions in the existing code passive. I have therefore written a design document to improve it. The document primarily describes how to improve the checkpoint failure handling logic and make it clearer. Based on this, we introduce a CheckpointFailureManager, which makes checkpoint failure processing more flexible.

The proposal is mainly motivated by the following issues:

- FLINK-4810[1]: Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
- FLINK-10724[2]: Refactor failure handling in checkpoint coordinator
- FLINK-10074[3]: Allowable number of checkpoint failure

The design document is here:

https://docs.google.com/document/d/1ce7RtecuTxcVUJlnU44hzcO2Dwq9g4Oyd8_biy94hJc/edit?usp=sharing

*Thanks to @Andrey Zagrebin for helping me review the document and suggesting a lot of improvements.*

Feedback and comments are very welcome!

Best,
Vino

[1]: https://issues.apache.org/jira/browse/FLINK-4810
[2]: https://issues.apache.org/jira/browse/FLINK-10724
[3]: https://issues.apache.org/jira/browse/FLINK-10074
