Re: Question on Job Restart strategy

Gary Yao Tue, 26 May 2020 04:25:53 -0700

Hi Bhaskar,

> Why is the restart counter not reset to zero after a streaming job restart succeeds?

The short answer is that the fixed delay restart strategy is simply
not implemented that way: it counts restart attempts and never resets
the counter (see [1] if you are using Flink 1.10 or above). Other
systems behave similarly, e.g., Apache Hadoop YARN (see
yarn.resourcemanager.am.max-attempts).
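
To make the counting behaviour concrete, here is a minimal sketch of a
counter that is only ever incremented. The class and method names are
made up for illustration; this is a simplified model, not the code
behind [1].

```java
/** Hypothetical, simplified model of a fixed-delay restart counter; not Flink's actual class. */
final class FixedDelayCounterSketch {
    private final int maxRestartAttempts;
    private int restartAttempts; // only ever incremented, never reset

    FixedDelayCounterSketch(int maxRestartAttempts) {
        this.maxRestartAttempts = maxRestartAttempts;
    }

    /** Records a failure and reports whether another restart is still allowed. */
    boolean onFailure() {
        restartAttempts++;
        return restartAttempts <= maxRestartAttempts;
    }

    int restartAttempts() {
        return restartAttempts;
    }

    public static void main(String[] args) {
        FixedDelayCounterSketch counter = new FixedDelayCounterSketch(5);
        counter.onFailure(); // first failure, restart allowed
        counter.onFailure(); // second failure, restart allowed, job then runs fine
        // Nothing tells the counter that the job recovered, so it stays at 2.
        System.out.println("attempts used: " + counter.restartAttempts());
    }
}
```

With a maximum of 5, the job in this model has 3 restarts left no
matter how long it runs without failing after the second restart.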

If you have such a requirement, you can try to approximate it using
the failure rate restart strategy [2]. Resetting the attempt counter
to zero after a successful restart cannot be easily implemented with
the current RestartBackoffTimeStrategy interface [3]; for this to be
possible, the strategy would need to be informed whether a restart
was successful. However, it is not clear what constitutes a
successful restart. For example, is it sufficient that enough
TaskManagers (TMs) and slots could be
acquired to run the job? The job could still fail afterwards due to a
bug in user code. Could it be sufficient to require all tasks to
produce at least one record? I do not think so because the job could
still fail deterministically afterwards due to a particular record.
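
As a rough illustration of why a failure rate approximates a counter
that resets, here is a sliding-window sketch: failures older than the
interval are forgotten, so a job that runs cleanly for long enough gets
its full restart budget back. The class, its parameters, and the
explicit timestamps are assumptions for illustration only; this is not
the Flink implementation behind [2].

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window model of a failure-rate restart policy; not Flink's actual class. */
final class FailureRateSketch {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateSketch(int maxFailuresPerInterval, long intervalMillis) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
    }

    /** Records a failure at the given time and reports whether a restart is still allowed. */
    boolean onFailure(long nowMillis) {
        // Forget failures that have aged out of the window.
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() >= intervalMillis) {
            failureTimestamps.pollFirst();
        }
        failureTimestamps.addLast(nowMillis);
        return failureTimestamps.size() <= maxFailuresPerInterval;
    }

    public static void main(String[] args) {
        FailureRateSketch policy = new FailureRateSketch(2, 60_000L);
        System.out.println(policy.onFailure(0L));       // true
        System.out.println(policy.onFailure(1_000L));   // true
        System.out.println(policy.onFailure(2_000L));   // false: 3 failures within one minute
        System.out.println(policy.onFailure(120_000L)); // true: earlier failures have aged out
    }
}
```

Note that the window never needs to know whether a restart was
"successful"; it only needs failure timestamps, which sidesteps the
definitional problem described above.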

Best,
Gary

[1] https://github.com/apache/flink/blob/d1292b5f30508e155d0f733527532d7c671ad263/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/flip1/FixedDelayRestartBackoffTimeStrategy.java#L29
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/task_failure_recovery.html#failure-rate-restart-strategy
[3] https://github.com/apache/flink/blob/d1292b5f30508e155d0f733527532d7c671ad263/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/flip1/RestartBackoffTimeStrategy.java#L23

On Tue, May 26, 2020 at 9:28 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>
> Hi,
> We are using the fixed delay restart strategy.
> I have a fundamental question:
> Why is the restart counter not reset to zero after a streaming job restart succeeds?
> Let's say the maximum number of restarts is 5.
> My streaming job failed twice and the 3rd attempt was successful. Why is the
> counter still 2 and not zero?
> Traditionally, in the networking world, clients retry for some time and, once
> they succeed, reset the counter back to zero.
>
> Why is this the case in Flink?
>
> Regards
> Bhaskar
