Here is some brief context about the new feature.

1. Active checkpoint rejection by the operator. On top of the current
checkpoint mechanism, one preliminary step is added that lets the operator
decide whether it is able to take a snapshot. The preliminary step is a new
API exposed to users/developers. It will be implemented in the Source API
(the new one based on FLIP-27) for the CDC implementation, and it can also
be implemented in other operators if necessary.
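
To make the preliminary step concrete, here is a minimal sketch of what
such a hook could look like. The names CheckpointPrecheck, canSnapshot and
ToyCdcReader are made up for illustration and are not part of the Flink API
or of the proposal itself; the reader is a toy that refuses to snapshot
while it is still in an initial table scan.

```java
/** Hypothetical hook (not a Flink interface): asked before a snapshot is taken. */
interface CheckpointPrecheck {
    /** Returns true if the operator is currently able to take a snapshot. */
    boolean canSnapshot(long checkpointId);
}

/** Toy CDC-style reader that cannot snapshot during its initial table scan. */
class ToyCdcReader implements CheckpointPrecheck {
    // Assumption for this sketch: the reader starts in the scan phase.
    private boolean scanningTable = true;

    /** Marks the initial table scan as finished. */
    void finishTableScan() {
        scanningTable = false;
    }

    @Override
    public boolean canSnapshot(long checkpointId) {
        // Reject checkpoints until the scan phase is over.
        return !scanningTable;
    }
}
```

In this sketch the coordinator would call canSnapshot first and only go on
with the normal snapshot path if it returns true.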

2. Handling the failure returned from the operator. If the checkpoint is
rejected by the operator, the operator must also return an appropriate
failure reason. The current design lists two failure reasons: soft failure
and hard failure. The former is ignored by Flink, while the latter is
counted as a continuous checkpoint failure according to the current
checkpoint failure manager mechanism.
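
The soft/hard distinction can be sketched as follows. This is a toy counter,
not Flink's checkpoint failure manager; RejectionKind and ToyFailureCounter
are hypothetical names, and the rule "fail over once the count exceeds the
tolerable number" is an assumption made for the example.

```java
/** Hypothetical failure reasons an operator could return when rejecting. */
enum RejectionKind { SOFT, HARD }

/** Toy counter of continuous checkpoint failures (not Flink's implementation). */
class ToyFailureCounter {
    private final int tolerableFailures;
    private int continuousFailures = 0;

    ToyFailureCounter(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    /** Records a rejection; returns true if a failover should be triggered. */
    boolean onCheckpointRejected(RejectionKind kind) {
        if (kind == RejectionKind.SOFT) {
            return false; // soft failures are ignored and never counted
        }
        continuousFailures++;
        // Assumption: failover happens once the count exceeds the limit.
        return continuousFailures > tolerableFailures;
    }

    /** A completed checkpoint resets the continuous failure count. */
    void onCheckpointCompleted() {
        continuousFailures = 0;
    }

    int continuousFailures() {
        return continuousFailures;
    }
}
```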

3. To prevent the operator from reporting soft failures indefinitely, so
that no checkpoint completes for a long time, we introduce a new
configuration for the tolerable checkpoint failure timeout. It is a timer
that starts together with the checkpoint scheduler. The timer is reset only
when a checkpoint completes; otherwise it keeps running until the tolerable
timeout is hit. When the timer fires, it triggers the existing checkpoint
failover.
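
The timer behaviour described in point 3 could look roughly like this. The
class name ToyCheckpointTimeoutTracker is hypothetical, and time is passed in
as plain millisecond values so the sketch stays deterministic; a real
implementation would presumably use a scheduled timer instead.

```java
/** Toy tracker for the tolerable checkpoint failure timeout (illustrative only). */
class ToyCheckpointTimeoutTracker {
    private final long tolerableTimeoutMs;
    private long lastResetMs;

    /** The timer starts when the checkpoint scheduler starts. */
    ToyCheckpointTimeoutTracker(long tolerableTimeoutMs, long schedulerStartMs) {
        this.tolerableTimeoutMs = tolerableTimeoutMs;
        this.lastResetMs = schedulerStartMs;
    }

    /** Only a completed checkpoint resets the timer. */
    void onCheckpointCompleted(long nowMs) {
        lastResetMs = nowMs;
    }

    /** Rejections and failures deliberately leave the timer untouched. */
    void onCheckpointRejected(long nowMs) {
        // no-op by design
    }

    /** True once the tolerable timeout has elapsed without a completed checkpoint. */
    boolean shouldFailOver(long nowMs) {
        return nowMs - lastResetMs >= tolerableTimeoutMs;
    }
}
```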

Questions:
a. With the current design, checkpoints might fail for a long time, for
example when the checkpoint interval is large. Is there a better way to
make checkpoints more likely to succeed? One option is to trigger a
checkpoint immediately after the last one is rejected, but that seems
inappropriate because it would increase the overhead.
b. Is there a better way to handle soft failures?




