Here is some brief context about the new feature.

1. Actively rejecting checkpoints from the operator. On top of the current checkpoint mechanism, one preliminary step is added that lets the operator decide whether it is able to take a snapshot. The preliminary step is a new API exposed to users/developers. It will be implemented in the Source API (the new one based on FLIP-27) for the CDC implementation, and it can also be implemented in other operators if necessary.

2. Handling the failure returned from the operator. If the operator rejects a checkpoint, it also needs to return an appropriate failure reason. The current design lists two failure reasons: soft failure and hard failure. The former would be ignored by Flink, and the latter would be counted as a continuous checkpoint failure by the current checkpoint failure manager mechanism.

3. To prevent the operator from reporting soft failures indefinitely, so that no checkpoint completes for a long time, we introduce a new configuration option for the tolerable checkpoint failure timeout. It is a timer that starts together with the checkpoint scheduler. The timer is reset if and only if a checkpoint completes; otherwise it keeps running until the tolerable timeout is hit. When the timer fires, it triggers the current checkpoint failover.

Questions:

a. With the current design, checkpoints might keep failing for a long time, for example when the checkpoint interval is large. Is there a better way to make a checkpoint more likely to succeed? One option is to trigger a new checkpoint immediately after the last one is rejected, but that seems inappropriate because it would increase the overhead.

b. Is there a better way to handle the soft failure?
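
To make points 1 to 3 concrete, a rough sketch in plain Java follows. It is only an illustration under assumptions, not the proposed Flink code: the names Decision, SnapshotReadiness, FailureTracker and the two constructor parameters are all hypothetical, nothing here is an existing Flink API, and the timeout is polled when a rejection arrives instead of being driven by a real timer.

```java
import java.util.function.LongSupplier;

final class CheckpointRejectionSketch {

    /** Hypothetical outcome of the preliminary step (point 1 and 2). */
    enum Decision { ACCEPT, SOFT_FAILURE, HARD_FAILURE }

    /** Hypothetical operator-side hook: can a snapshot be taken right now? */
    interface SnapshotReadiness {
        Decision checkBeforeSnapshot(long checkpointId);
    }

    /** Hypothetical coordinator-side bookkeeping for points 2 and 3. */
    static final class FailureTracker {
        private final long tolerableTimeoutMs;     // assumed new config value
        private final int tolerableHardFailures;   // assumed failure threshold
        private final LongSupplier clock;          // injectable for testing
        private long timerStartMs;
        private int continuousHardFailures;

        FailureTracker(long tolerableTimeoutMs, int tolerableHardFailures, LongSupplier clock) {
            this.tolerableTimeoutMs = tolerableTimeoutMs;
            this.tolerableHardFailures = tolerableHardFailures;
            this.clock = clock;
            // The timer starts together with the checkpoint scheduler.
            this.timerStartMs = clock.getAsLong();
        }

        /** Only a completed checkpoint resets the timer (and the counter). */
        void onCheckpointCompleted() {
            timerStartMs = clock.getAsLong();
            continuousHardFailures = 0;
        }

        /** Returns true if failover should be triggered after this rejection. */
        boolean onCheckpointRejected(Decision reason) {
            if (reason == Decision.HARD_FAILURE) {
                continuousHardFailures++;          // counted as a continuous failure
            }
            // Soft failures are not counted; only the timeout bounds them.
            return continuousHardFailures > tolerableHardFailures || timeoutExceeded();
        }

        boolean timeoutExceeded() {
            return clock.getAsLong() - timerStartMs >= tolerableTimeoutMs;
        }
    }

    public static void main(String[] args) {
        long[] now = {0L};                         // fake clock in milliseconds
        FailureTracker tracker = new FailureTracker(10_000L, 1, () -> now[0]);

        // A source that is in the middle of something it cannot snapshot yet.
        SnapshotReadiness busySource = checkpointId -> Decision.SOFT_FAILURE;

        for (long id = 1; id <= 4; id++) {
            now[0] += 3_000L;                      // one checkpoint interval passes
            Decision d = busySource.checkBeforeSnapshot(id);
            boolean failover = tracker.onCheckpointRejected(d);
            System.out.println("checkpoint " + id + ": " + d + ", failover=" + failover);
        }
    }
}
```

In this sketch a job that only ever reports soft failures still fails over once the timeout elapses, which is the behaviour point 3 asks for. It also shows the concern in question a: with a 3 s interval and a 10 s timeout only three attempts fit before the fourth rejection triggers failover.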