Re: Setting an allowable number of checkpoint failures

2018-08-06 Thread Till Rohrmann
Thanks Thomas for opening the issue. This is indeed a useful feature to make the checkpointing more controllable.

On Mon, Aug 6, 2018 at 6:36 PM Thomas Weise wrote:
> Hi vino,
>
> Yes, I believe we are on the same page. I created
> https://issues.apache.org/jira/browse/FLINK-10074 to track it.

Re: Setting an allowable number of checkpoint failures

2018-08-06 Thread Thomas Weise
Hi vino,

Yes, I believe we are on the same page. I created https://issues.apache.org/jira/browse/FLINK-10074 to track it.

Thanks,
Thomas

On Mon, Aug 6, 2018 at 8:42 AM vino yang wrote:
> Hi Thomas,
>
> What I am saying is what you mean, maybe I am not very accurate.
>
> Thanks, vino.

Re: Setting an allowable number of checkpoint failures

2018-08-06 Thread vino yang
Hi Thomas,

What I am saying is what you mean, maybe I am not very accurate.

Thanks, vino.

2018-08-06 21:22 GMT+08:00 Thomas Weise :
> Hi,
>
> What we are looking for is that the job does *not* restart on transient
> checkpoint failures and we would like to cap the number of allowable
>

Re: Setting an allowable number of checkpoint failures

2018-08-06 Thread Thomas Weise
Hi, What we are looking for is that the job does *not* restart on transient checkpoint failures and we would like to cap the number of allowable subsequent failures until a restart occurs. The reason is that every restart is a service interruption that is potentially very expensive. Thanks,
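
The behaviour requested above (tolerate transient checkpoint failures, restart only after a cap on consecutive failures is exceeded) can be sketched in plain Java. This is only an illustration of the counting logic: the class `CheckpointFailureTracker` and its methods are hypothetical names, not part of Flink, and the choice that a successful checkpoint resets the count is an assumption.

```java
// Hypothetical sketch, NOT Flink code: counts consecutive checkpoint failures
// and signals a restart only once a configured threshold is exceeded.
final class CheckpointFailureTracker {
    private final int tolerableConsecutiveFailures;
    private int consecutiveFailures = 0;

    CheckpointFailureTracker(int tolerableConsecutiveFailures) {
        if (tolerableConsecutiveFailures < 0) {
            throw new IllegalArgumentException("threshold must be >= 0");
        }
        this.tolerableConsecutiveFailures = tolerableConsecutiveFailures;
    }

    /** A completed checkpoint resets the count (assumed semantics). */
    void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    /** Records one failure; returns true when the job should be restarted. */
    boolean onCheckpointFailure() {
        consecutiveFailures++;
        return consecutiveFailures > tolerableConsecutiveFailures;
    }

    int consecutiveFailures() {
        return consecutiveFailures;
    }
}
```

With a threshold of 2, the first two failures in a row are tolerated and the third one signals a restart; any successful checkpoint in between starts the count over.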

Re: Setting an allowable number of checkpoint failures

2018-08-06 Thread vino yang
Hi Till, I think the approach you proposed is one solution. But I think we can also provide a way to keep checkpoints from failing indefinitely while the job itself does not fail: a threshold allows the checkpoint to fail a few times. When this threshold is reached, we decide to

Re: Setting an allowable number of checkpoint failures

2018-08-06 Thread Till Rohrmann
Hi Lakshmi, you could somewhat achieve the described behaviour by setting setFailOnCheckpointingErrors(true) and using the FailureRateRestartStrategy as the restart strategy. That way checkpoint failures will trigger a job restart (this is the downside) which is handled by the restart strategy.
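
To illustrate the idea behind a failure-rate based restart decision, here is a small plain-Java model. It is a sketch under assumptions, not Flink's FailureRateRestartStrategy: the class `FailureRateModel`, its method name, and the exact window semantics (restarts stay allowed while at most `maxFailuresPerInterval` failures fall inside the interval) are chosen for illustration only.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model, NOT Flink's implementation: every failure is recorded
// with a timestamp, and restarts stay allowed only while the number of
// failures inside the sliding interval does not exceed the configured maximum.
final class FailureRateModel {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateModel(int maxFailuresPerInterval, long intervalMillis) {
        if (maxFailuresPerInterval < 1 || intervalMillis < 1) {
            throw new IllegalArgumentException("arguments must be positive");
        }
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
    }

    /** Records a failure at nowMillis; returns true if a restart is still allowed. */
    boolean recordFailureAndCheckRestart(long nowMillis) {
        // Forget failures that have dropped out of the sliding interval.
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() >= intervalMillis) {
            failureTimestamps.pollFirst();
        }
        failureTimestamps.addLast(nowMillis);
        return failureTimestamps.size() <= maxFailuresPerInterval;
    }
}
```

In this model, with at most 2 failures per 1000 ms, a third failure inside the same window means no further restart is attempted, while failures spread further apart are each followed by a restart.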

Re: Setting an allowable number of checkpoint failures

2018-08-03 Thread vino yang
Hi Lakshmi, Your understanding of "CheckpointConfig#setFailOnCheckpointingErrors(false)" is correct. If this is set to false, the task will only decline the checkpoint and continue running. I think it is also a good choice to allow a number of failures to be set. Flink currently only

Setting an allowable number of checkpoint failures

2018-08-03 Thread Lakshmi Gururaja Rao
Hi, We are running into intermittent checkpoint failures while checkpointing to S3. As described in this thread - http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/1-5-some-thing-weird-td21309.html