Re: Setting an allowable number of checkpoint failures

Till Rohrmann Mon, 06 Aug 2018 13:28:13 -0700

Thanks Thomas for opening the issue. This is indeed a useful feature to
make the checkpointing more controllable.


On Mon, Aug 6, 2018 at 6:36 PM Thomas Weise <[email protected]> wrote:

> Hi vino,
>
> Yes, I believe we are on the same page. I created
> https://issues.apache.org/jira/browse/FLINK-10074 to track it.
>
> Thanks,
> Thomas
>
> On Mon, Aug 6, 2018 at 8:42 AM vino yang <[email protected]> wrote:
>
> > Hi Thomas,
> >
> > What I am saying is what you mean, maybe I am not very accurate.
> >
> > Thanks, vino.
> >
> > 2018-08-06 21:22 GMT+08:00 Thomas Weise <[email protected]>:
> >
> > > Hi,
> > >
> > > What we are looking for is that the job does *not* restart on transient
> > > checkpoint failures and we would like to cap the number of allowable
> > > subsequent failures until a restart occurs.
> > >
> > > The reason is that every restart is a service interruption that is
> > > potentially very expensive.
> > >
> > > Thanks,
> > > Thomas
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Mon, Aug 6, 2018 at 5:09 AM vino yang <[email protected]>
> wrote:
> > >
> > > > Hi Till,
> > > >
> > > > I think the way you proposed is a solution. But I think we also can
> > > provide
> > > > a solution to prevent Checkpoint from failing indefinitely, in case
> the
> > > Job
> > > > does not fail.
> > > >
> > > > Instead, a threshold is given to allow the checkpoint to fail a few
> > > times.
> > > > When this threshold is reached, we decide to let the job fail.
> > > >
> > > > Thanks, vino.
> > > >
> > > > 2018-08-06 15:14 GMT+08:00 Till Rohrmann <[email protected]>:
> > > >
> > > > > Hi Lakshmi,
> > > > >
> > > > > you could somewhat achieve the described behaviour by setting
> > > > > setFailOnCheckpointintErrors(true) and using the
> > > > > FailureRateRestartStrategy
> > > > > as the restart strategy. That way checkpoint failures will trigger
> a
> > > job
> > > > > restart (this is the downside) which is handled by the restart
> > > strategy.
> > > > > The FailureRateRestartStrategy allows for x failures to happen
> within
> > > in
> > > > a
> > > > > given time interval. If this number is exceeded, then the job will
> > > > > terminally fail.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Sat, Aug 4, 2018 at 4:58 AM vino yang <[email protected]>
> > > wrote:
> > > > >
> > > > > > Hi Lakshmi,
> > > > > >
> > > > > > Your understanding of "
> > > > > > *CheckpointConfig#setFailOnCheckpointingErrors(false)*" is
> correct,
> > > If
> > > > > this
> > > > > > is set to false, the task will only decline a the checkpoint and
> > > > continue
> > > > > > running.
> > > > > >
> > > > > > I think it is also a good choice to allow a number of failures to
> > be
> > > > set.
> > > > > > Flink currently only supports whether the Task fails if the
> > > checkpoint
> > > > > > fails. It is not supported to configure a threshold.
> > > > > >
> > > > > > You can create an issue in JIRA to feedback this requirement.
> > > > > >
> > > > > > Thanks, vino.
> > > > > >
> > > > > > 2018-08-04 4:28 GMT+08:00 Lakshmi Gururaja Rao <[email protected]>:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > We are running into intermittent checkpoint failures while
> > > > > checkpointing
> > > > > > to
> > > > > > > S3.
> > > > > > >
> > > > > > > As described in this thread -
> > > > > > >  http://apache-flink-user-mailing-list-archive.2336050.
> > > > > > > n4.nabble.com/1-5-some-thing-weird-td21309.html
> > > > > > > <http://apache-flink-user-mailing-list-archive.2336050.
> > > > > > > n4.nabble.com/1-5-some-thing-weird-td21309.html>,
> > > > > > > we see that the job restarts when it encounters such a failure.
> > > > > > >
> > > > > > > As mentioned in the thread, I see that there is an option to
> not
> > > fail
> > > > > > tasks
> > > > > > > on checkpoint errors -
> > > > > > > *CheckpointConfig#setFailOnCheckpointingErrors(false)**.
> > *However,
> > > > > this
> > > > > > > would mean that the job would continue running even in the case
> > of
> > > > > > > persistent checkpoint failures. Is my understanding here
> correct?
> > > > > > >
> > > > > > > If above is true, then is there a way to configure an allowable
> > > > number
> > > > > of
> > > > > > > checkpoint failures? i.e. something along the lines of "Don't
> > fail
> > > > the
> > > > > > job
> > > > > > > if there are <=X number of checkpoint failures", so that *only
> > > > > *transient
> > > > > > > failures can be ignored.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Lakshmi
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Setting an allowable number of checkpoint failures

Reply via email to