Hello,

When restarting jobs with large state (e.g. after an upgrade), a task can take some time to initialize, depending on the state size. During this time I noticed that Flink still attempts to checkpoint. In many cases checkpointing fails repeatedly, which causes the job to hit the tolerable-failed-checkpoints limit and restart. The only workaround seems to be increasing the checkpoint interval, but this is suboptimal.
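
To illustrate why a longer interval works around the problem, here is a rough back-of-the-envelope sketch in Python. It is not Flink code: the function names and the simplified model are assumptions made for illustration (the first checkpoint triggers one interval after the restart, every checkpoint triggered during initialization fails, and the job restarts once the number of consecutive failures exceeds the tolerable limit).

```python
import math


def failed_checkpoints_during_init(init_seconds: float, interval_seconds: float) -> int:
    """Count checkpoint triggers that fall inside the initialization window.

    Assumption: triggers fire at interval, 2 * interval, ... after the restart,
    and a trigger at exactly the end of initialization is counted as failing.
    """
    return math.floor(init_seconds / interval_seconds)


def job_restarts(init_seconds: float, interval_seconds: float, tolerable_failed: int) -> bool:
    """Return True if the failures during initialization exceed the tolerable limit.

    Assumption: the job fails once consecutive failures exceed the limit.
    """
    failures = failed_checkpoints_during_init(init_seconds, interval_seconds)
    return failures > tolerable_failed


if __name__ == "__main__":
    # Hypothetical numbers: 10 minutes of initialization, limit of 3 failures.
    print(job_restarts(600, 60, 3))   # 1-minute interval
    print(job_restarts(600, 300, 3))  # 5-minute interval
```

Under these assumptions, a 1-minute interval produces 10 failed checkpoints during a 10-minute initialization and exceeds a limit of 3, while a 5-minute interval produces only 2 and stays under it. The cost is that checkpoints stay infrequent after initialization has finished, which is why the workaround is suboptimal.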
Could Flink wait to trigger checkpointing while one or more tasks are still initializing?

Lars