Hi Senhong,

Thanks for the proposal. I have a couple of questions.
Have you seen the `org.apache.flink.streaming.api.checkpoint.ExternallyInducedSource` (for the legacy SourceFunction) and `org.apache.flink.api.connector.source.ExternallyInducedSourceReader` (for FLIP-27) interfaces? They work the other way around, by letting the source trigger/initiate a checkpoint instead of declining it (first sketch at the end of this mail). Could that be made to work for your use case? If not, can you explain why?

Regarding declining/failing the checkpoint (without blocking the barrier waiting for snapshot availability), couldn't you achieve the same thing with a combination of throwing an exception in, for example, the `org.apache.flink.api.connector.source.SourceReader#snapshotState` call and setting the tolerable checkpoint failure number [1]? (second sketch at the end of this mail)

Best,
Piotrek

[1] https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/environment/CheckpointConfig.html#setTolerableCheckpointFailureNumber-int-

Wed, Jun 9, 2021 at 09:11 Senhong Liu <senhong...@gmail.com> wrote:

> Here is some brief context about the new feature.
>
> 1. Active checkpoint rejection by the operator. Following the current
> checkpoint mechanism, one more preliminary step is added to help the
> operator determine whether it is able to take a snapshot. The preliminary
> step is a new API provided to users/developers. The new API will be
> implemented in the Source API (the new one, based on FLIP-27) for the CDC
> implementation. The new API can also be implemented in other operators if
> necessary.
>
> 2. Handling the failure returned from the operator. If the checkpoint is
> rejected by the operator, an appropriate failure reason needs to be
> returned from the operator as well. In the current design, two failure
> reasons are listed: soft failure and hard failure. The former would be
> ignored by Flink, and the latter would be counted as a continuous
> checkpoint failure, according to the current checkpoint failure manager
> mechanism.
>
> 3. To prevent the operator from repeatedly reporting soft failures, with
> no checkpoint completing for a long time as a result, we introduce a new
> configuration, a tolerable checkpoint failure timeout, which is a timer
> that starts with the checkpoint scheduler. The timer is reset if and only
> if a checkpoint completes; otherwise it does nothing until the tolerable
> timeout is hit. If the timer fires, it triggers the current checkpoint
> failover.
>
> Questions:
> a. According to the current design, checkpoints might keep failing for a
> possibly long time, for example with a large checkpoint interval. Is
> there a better way to make checkpoints more likely to succeed? For
> example, we could trigger a checkpoint immediately after the last one is
> rejected, but that seems inappropriate because it would increase the
> overhead.
> b. Is there a better idea for handling the soft failure?
>
> --
> Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
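PS. To make the first suggestion concrete, here is a minimal sketch of a FLIP-27 reader implementing `ExternallyInducedSourceReader`. Everything named `My*`, the `String` record type, and the way the checkpoint id reaches the reader are my assumptions for illustration; the only addition over a plain `SourceReader` is `shouldTriggerCheckpoint()`.

import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.api.connector.source.ExternallyInducedSourceReader;
import org.apache.flink.api.connector.source.ReaderOutput;
import org.apache.flink.api.connector.source.SourceSplit;
import org.apache.flink.core.io.InputStatus;

// MyCdcReader/MySplit are made-up names for this sketch.
public class MyCdcReader implements ExternallyInducedSourceReader<String, MyCdcReader.MySplit> {

    public static class MySplit implements SourceSplit {
        @Override
        public String splitId() {
            return "split-0";
        }
    }

    // Set from pollNext() when the external system signals (in-band) that
    // checkpoint N should be taken now; how the id gets here is up to the
    // connector and is elided in this sketch.
    private Long pendingCheckpointId;

    @Override
    public InputStatus pollNext(ReaderOutput<String> output) {
        // ... emit records; at a consistent point (e.g. a complete CDC
        // transaction), remember the checkpoint id to trigger:
        // pendingCheckpointId = idFromExternalSystem;
        return InputStatus.NOTHING_AVAILABLE;
    }

    @Override
    public Optional<Long> shouldTriggerCheckpoint() {
        // Polled by the runtime; returning a non-empty id here is what
        // actually starts the checkpoint, so the reader controls the timing
        // instead of having to decline checkpoints it cannot take.
        Optional<Long> result = Optional.ofNullable(pendingCheckpointId);
        pendingCheckpointId = null;
        return result;
    }

    @Override
    public List<MySplit> snapshotState(long checkpointId) {
        return Collections.emptyList();
    }

    @Override
    public CompletableFuture<Void> isAvailable() {
        return CompletableFuture.completedFuture(null);
    }

    @Override
    public void start() {}

    @Override
    public void addSplits(List<MySplit> splits) {}

    @Override
    public void notifyNoMoreSplits() {}

    @Override
    public void notifyCheckpointComplete(long checkpointId) {}

    @Override
    public void close() {}
}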
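And a sketch of the second suggestion. The readiness flag and the split type are placeholders; the point is just that `SourceReader#snapshotState` declares no checked exception, so declining means throwing an unchecked one, combined with a tolerable failure count on the `CheckpointConfig`:

import java.util.Collections;
import java.util.List;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.FlinkRuntimeException;

public class DeclineViaSnapshotFailure {

    // Inside the reader: "safeToSnapshot" is a hypothetical readiness check.
    private boolean safeToSnapshot = false;

    public List<String> snapshotState(long checkpointId) {
        if (!safeToSnapshot) {
            // Fails only this checkpoint attempt; the job keeps running as
            // long as the tolerable failure counter below is not exhausted.
            throw new FlinkRuntimeException(
                    "Source not in a consistent state, declining checkpoint " + checkpointId);
        }
        return Collections.emptyList(); // a real reader returns its splits
    }

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000L);
        // Tolerate up to 10 checkpoint failures in a row before the job
        // fails over; a successful checkpoint resets the counter. [1]
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(10);
    }
}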
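PPS. To make sure I understand points 1 and 2 of your proposal, I read them as roughly the following (all names are entirely made up, nothing like this exists in Flink today):

// Hypothetical sketch only restating my reading of the proposal;
// not an existing or agreed-upon Flink interface.
public interface CheckpointAvailabilityProvider {

    enum SnapshotAvailability {
        AVAILABLE,
        SOFT_FAILURE, // ignored by the checkpoint failure manager
        HARD_FAILURE  // counted against the tolerable failure limit
    }

    // Consulted as a preliminary step, before the operator takes a snapshot.
    SnapshotAvailability isAvailableForSnapshot(long checkpointId);
}

Is that roughly the shape you have in mind? If so, my two questions above still apply.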