Hi Bhaskar,

> Why the reset counter is not zero after streaming job restart is successful?

The short answer is that the fixed delay restart strategy is not implemented like that (see [1] if you are using Flink 1.10 or above). There are also other systems that behave similarly, e.g., Apache Hadoop YARN (see yarn.resourcemanager.am.max-attempts). If you have such a requirement, you can try to approximate it using the failure rate restart strategy [2].

Resetting the attempt counter to zero after a successful restart cannot be easily implemented with the current RestartBackoffTimeStrategy interface [3]; for this to be possible, the strategy would need to be informed if a restart was successful. However, it is not clear what constitutes a successful restart. For example, is it sufficient that enough TMs/slots could be acquired to run the job? The job could still fail afterwards due to a bug in user code. Could it be sufficient to require all tasks to produce at least one record? I do not think so because the job could still fail deterministically afterwards due to a particular record.

Best,
Gary

[1] https://github.com/apache/flink/blob/d1292b5f30508e155d0f733527532d7c671ad263/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/flip1/FixedDelayRestartBackoffTimeStrategy.java#L29
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/task_failure_recovery.html#failure-rate-restart-strategy
[3] https://github.com/apache/flink/blob/d1292b5f30508e155d0f733527532d7c671ad263/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/flip1/RestartBackoffTimeStrategy.java#L23

On Tue, May 26, 2020 at 9:28 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>
> Hi
> We are using restart strategy of fixed delay.
> I have fundamental question:
> Why the reset counter is not zero after streaming job restart is successful?
> Let's say I have number of restarts max are: 5
> My streaming job tried 2 times and 3'rd attempt its successful, why counter
> is still 2 but not zero?
> Traditionally in network world, clients will retry for some time and once
> they are successful, they will reset the counter back to zero.
>
> Why this is the case in flink?
>
> Regards
> Bhaskar
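
P.S. To make the difference between the two strategies in the reply concrete, here is a minimal, self-contained sketch in plain Java. It is a simplified model and not Flink's code: the class names RestartCounterDemo, FixedDelayCounter and FailureRateCounter are hypothetical, and the exact boundary conditions in Flink's own strategies may differ. It only shows that a counter that never resets eventually gives up, while a sliding time window forgets old failures and so behaves roughly like a counter that resets after a long enough healthy period.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model for illustration only; this is NOT Flink's implementation.
class RestartCounterDemo {

    /** Fixed delay semantics: the attempt counter only grows and is never reset. */
    static final class FixedDelayCounter {
        private final int maxAttempts;
        private int attempts = 0;

        FixedDelayCounter(int maxAttempts) {
            this.maxAttempts = maxAttempts;
        }

        /** Records one failure and returns true if another restart is allowed. */
        boolean onFailure() {
            attempts++;
            return attempts <= maxAttempts;
        }
    }

    /** Failure rate semantics: only failures inside the sliding interval count. */
    static final class FailureRateCounter {
        private final int maxFailuresPerInterval;
        private final long intervalMillis;
        private final Deque<Long> failureTimestamps = new ArrayDeque<>();

        FailureRateCounter(int maxFailuresPerInterval, long intervalMillis) {
            this.maxFailuresPerInterval = maxFailuresPerInterval;
            this.intervalMillis = intervalMillis;
        }

        /** Records one failure at nowMillis and returns true if a restart is allowed. */
        boolean onFailure(long nowMillis) {
            // Forget failures that are older than the interval.
            while (!failureTimestamps.isEmpty()
                    && nowMillis - failureTimestamps.peekFirst() > intervalMillis) {
                failureTimestamps.pollFirst();
            }
            failureTimestamps.addLast(nowMillis);
            return failureTimestamps.size() <= maxFailuresPerInterval;
        }
    }

    public static void main(String[] args) {
        final long hour = 3_600_000L;
        final long tenMinutes = 600_000L;

        FixedDelayCounter fixed = new FixedDelayCounter(5);
        FailureRateCounter rate = new FailureRateCounter(5, tenMinutes);

        // One failure per hour; the job runs fine in between.
        for (int i = 1; i <= 6; i++) {
            boolean fixedOk = fixed.onFailure();
            boolean rateOk = rate.onFailure(i * hour);
            System.out.println("failure " + i
                    + ": fixed delay restarts=" + fixedOk
                    + ", failure rate restarts=" + rateOk);
        }
    }
}
```

With a maximum of 5 and one failure per hour, the fixed delay model refuses to restart on the sixth failure even though every earlier restart ran fine for an hour, whereas the failure rate model keeps restarting because each old failure has already left the ten-minute window. A burst of six failures within the window would still stop the failure rate model.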