Hello,

When restarting a job with "large" state (e.g. after an upgrade), a task can
take some time to initialize, depending on the state size. During this
time I noticed that Flink still attempts to checkpoint. In many cases
checkpointing fails repeatedly, which causes the job to hit the
tolerable-failed-checkpoints limit and restart. The only way I have found to
overcome the issue is to increase the checkpoint interval, but this is
suboptimal.
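
For illustration, these are the two settings involved, as they would appear
in the Flink configuration file. The values are made-up examples, not my
actual settings:

```
# Illustrative values only (assumptions, not taken from a real deployment).
# A longer interval gives tasks more time to finish initializing
# before the next checkpoint is triggered.
execution.checkpointing.interval: 10min
# Number of consecutive checkpoint failures tolerated before the job fails over.
execution.checkpointing.tolerable-failed-checkpoints: 3
```

Raising either value only hides the problem: checkpoints become less frequent
or more failures are tolerated for the whole lifetime of the job, not just
during startup.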

Could Flink hold off on triggering checkpoints while one or more tasks are
still initializing?

Lars
