Hi Till, I think the way you proposed is a solution. But I think we also can provide a solution to prevent Checkpoint from failing indefinitely, in case the Job does not fail.
Instead, a threshold is given to allow the checkpoint to fail a few times. When this threshold is reached, we decide to let the job fail. Thanks, vino. 2018-08-06 15:14 GMT+08:00 Till Rohrmann <trohrm...@apache.org>: > Hi Lakshmi, > > you could somewhat achieve the described behaviour by setting > setFailOnCheckpointintErrors(true) and using the > FailureRateRestartStrategy > as the restart strategy. That way checkpoint failures will trigger a job > restart (this is the downside) which is handled by the restart strategy. > The FailureRateRestartStrategy allows for x failures to happen within in a > given time interval. If this number is exceeded, then the job will > terminally fail. > > Cheers, > Till > > On Sat, Aug 4, 2018 at 4:58 AM vino yang <yanghua1...@gmail.com> wrote: > > > Hi Lakshmi, > > > > Your understanding of " > > *CheckpointConfig#setFailOnCheckpointingErrors(false)*" is correct, If > this > > is set to false, the task will only decline a the checkpoint and continue > > running. > > > > I think it is also a good choice to allow a number of failures to be set. > > Flink currently only supports whether the Task fails if the checkpoint > > fails. It is not supported to configure a threshold. > > > > You can create an issue in JIRA to feedback this requirement. > > > > Thanks, vino. > > > > 2018-08-04 4:28 GMT+08:00 Lakshmi Gururaja Rao <l...@lyft.com>: > > > > > Hi, > > > > > > We are running into intermittent checkpoint failures while > checkpointing > > to > > > S3. > > > > > > As described in this thread - > > > http://apache-flink-user-mailing-list-archive.2336050. > > > n4.nabble.com/1-5-some-thing-weird-td21309.html > > > <http://apache-flink-user-mailing-list-archive.2336050. > > > n4.nabble.com/1-5-some-thing-weird-td21309.html>, > > > we see that the job restarts when it encounters such a failure. > > > > > > As mentioned in the thread, I see that there is an option to not fail > > tasks > > > on checkpoint errors - > > > *CheckpointConfig#setFailOnCheckpointingErrors(false)**. *However, > this > > > would mean that the job would continue running even in the case of > > > persistent checkpoint failures. Is my understanding here correct? > > > > > > If above is true, then is there a way to configure an allowable number > of > > > checkpoint failures? i.e. something along the lines of "Don't fail the > > job > > > if there are <=X number of checkpoint failures", so that *only > *transient > > > failures can be ignored. > > > > > > Thanks, > > > Lakshmi > > > > > >