Thanks Thomas for opening the issue. This is indeed a useful feature to
make the checkpointing more controllable.
On Mon, Aug 6, 2018 at 6:36 PM Thomas Weise wrote:
Hi vino,
Yes, I believe we are on the same page. I created
https://issues.apache.org/jira/browse/FLINK-10074 to track it.
Thanks,
Thomas
On Mon, Aug 6, 2018 at 8:42 AM vino yang wrote:
Hi Thomas,
I meant the same thing you describe; perhaps I did not express it accurately.
Thanks, vino.
2018-08-06 21:22 GMT+08:00 Thomas Weise :
Hi,
What we are looking for is that the job does *not* restart on transient
checkpoint failures and we would like to cap the number of allowable
subsequent failures until a restart occurs.
The reason is that every restart is a service interruption that is
potentially very expensive.
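
To make the requested behaviour concrete, here is a minimal sketch of the counting logic, assuming "subsequent" means consecutive failures. CheckpointFailureTracker and its method names are hypothetical and are not part of Flink; the sketch only illustrates tolerating a capped number of back-to-back checkpoint failures before asking for a restart.

```java
/**
 * Hypothetical helper, NOT a Flink class: tracks consecutive checkpoint
 * failures and reports when the tolerated cap has been exceeded.
 */
final class CheckpointFailureTracker {

    // Number of consecutive failures tolerated without a restart (assumed semantics).
    private final int tolerableConsecutiveFailures;

    // Failures observed since the last successful checkpoint.
    private int consecutiveFailures;

    CheckpointFailureTracker(int tolerableConsecutiveFailures) {
        if (tolerableConsecutiveFailures < 0) {
            throw new IllegalArgumentException("threshold must be >= 0");
        }
        this.tolerableConsecutiveFailures = tolerableConsecutiveFailures;
    }

    /** Records a failed checkpoint; returns true once the cap is exceeded. */
    boolean onCheckpointFailure() {
        consecutiveFailures++;
        return consecutiveFailures > tolerableConsecutiveFailures;
    }

    /** A completed checkpoint resets the count, so only back-to-back failures add up. */
    void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    int getConsecutiveFailures() {
        return consecutiveFailures;
    }
}
```

With a threshold of 2, two transient failures in a row are tolerated, a successful checkpoint resets the count, and only a third consecutive failure would signal a restart.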
Thanks,
Hi Till,
I think the approach you proposed is one solution. But I think we could also
provide a way to prevent checkpoints from failing indefinitely while the job
itself keeps running.
Instead, a threshold would allow the checkpoint to fail a few times.
When this threshold is reached, we decide to
Hi Lakshmi,
you could somewhat achieve the described behaviour by setting
setFailOnCheckpointingErrors(true) and using the FailureRateRestartStrategy
as the restart strategy. That way checkpoint failures will trigger a job
restart (this is the downside) which is handled by the restart strategy.
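
As a rough sketch of that configuration, assuming a Flink 1.5/1.6-era DataStream job: the class name, checkpoint interval, failure rate, and delays below are illustrative values and do not come from this thread.

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Illustrative only: the class name and all numeric values are assumptions.
final class CheckpointRestartConfigSketch {

    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 seconds (assumed interval).
        env.enableCheckpointing(60_000L);

        // A failed checkpoint fails the task, which hands control to the restart strategy.
        env.getCheckpointConfig().setFailOnCheckpointingErrors(true);

        // Allow at most 3 failures per 5-minute window, waiting 10 seconds between restarts.
        env.setRestartStrategy(RestartStrategies.failureRateRestart(
                3,
                Time.of(5, TimeUnit.MINUTES),
                Time.of(10, TimeUnit.SECONDS)));

        // Sources, transformations, sinks, and env.execute(...) are omitted here.
    }
}
```

Note that every tolerated failure still costs a job restart under this setup; the failure rate only bounds how many restarts are allowed before the job fails permanently.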
Hi Lakshmi,
Your understanding of
"CheckpointConfig#setFailOnCheckpointingErrors(false)" is correct. If this
is set to false, the task will only decline the checkpoint and continue
running.
I think it is also a good choice to allow a number of failures to be set.
Flink currently only
Hi,
We are running into intermittent checkpoint failures while checkpointing to
S3.
As described in this thread -
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/1-5-some-thing-weird-td21309.html