Stephan Ewen created FLINK-6666:
-----------------------------------
Summary: RestartStrategy should differentiate between types of
recovery (global / local / resource missing)
Key: FLINK-6666
URL: https://issues.apache.org/jira/browse/FLINK-6666
Project: Flink
Issue Type: Sub-task
Components: Distributed Coordination
Affects Versions: 1.3.0
Reporter: Stephan Ewen
Currently, the {{RestartStrategy}} has a single method that is called when a
failure requires an ExecutionGraph restart.
With the addition of incremental recovery, it is desirable to distinguish
between the types of failover that can happen.
I suggest extending the {{RestartStrategy}} to support three
cases/methods:
- {{restartGlobal()}} for a full restart recovery
- {{restartLocal()}} for a recovery coordinated by the {{FailoverStrategy}}
- {{restartOnMissingResources()}} if the failure cause was missing slots
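As a rough illustration of the three cases above, the extended interface could look something like the sketch below. Only the three method names come from this issue. The {{Restarter}} callback, the {{ProposedRestartStrategy}} name, the parameter lists, and the recording implementation are assumptions made for the example and are not Flink's actual API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback that performs the actual restart (assumption, not a Flink type). */
interface Restarter {
    void triggerRestart();
}

/** Hypothetical shape of the extended strategy; only the method names come from the issue. */
interface ProposedRestartStrategy {
    /** Full restart of the whole job. */
    void restartGlobal(Restarter restarter);

    /** Partial restart coordinated by the FailoverStrategy. */
    void restartLocal(Restarter restarter);

    /** Restart after a failure caused by missing slots. */
    void restartOnMissingResources(Restarter restarter);
}

/** Toy implementation: restarts immediately and records which case was invoked. */
final class RecordingRestartStrategy implements ProposedRestartStrategy {
    final List<String> calls = new ArrayList<>();

    @Override
    public void restartGlobal(Restarter restarter) {
        calls.add("global");
        restarter.triggerRestart();
    }

    @Override
    public void restartLocal(Restarter restarter) {
        calls.add("local");
        restarter.triggerRestart();
    }

    @Override
    public void restartOnMissingResources(Restarter restarter) {
        calls.add("missing-resources");
        restarter.triggerRestart();
    }
}
```

Keeping the three cases as separate methods lets an implementation choose a different policy per case without inspecting the failure cause itself.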
The last case is interesting, in my opinion, because it is commonly desirable
that regular failover has no delay, while failover on missing resources has a
short delay (1s or so) to avoid very fast cycles of restart attempts. In
standalone mode, there can easily be 100,000 restarts within a second when no
resources are available and restarts happen without delay.
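The delay policy described above could be sketched roughly as follows: no delay for regular (global or local) failover, and a short configurable delay when the cause was missing resources. The {{RecoveryType}} enum and the {{DelayPolicy}} class are hypothetical names introduced only for this example.

```java
import java.time.Duration;

/** Hypothetical classification of the three recovery cases from this issue. */
enum RecoveryType { GLOBAL, LOCAL, MISSING_RESOURCES }

/** Hypothetical policy: only restarts caused by missing resources are delayed. */
final class DelayPolicy {
    private final Duration missingResourcesDelay;

    DelayPolicy(Duration missingResourcesDelay) {
        this.missingResourcesDelay = missingResourcesDelay;
    }

    /** Returns how long to wait before attempting the restart. */
    Duration delayFor(RecoveryType type) {
        return type == RecoveryType.MISSING_RESOURCES
                ? missingResourcesDelay
                : Duration.ZERO;
    }
}
```

With a delay of one second, a job that cannot obtain slots attempts roughly one restart per second rather than looping as fast as the scheduler allows.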