[ https://issues.apache.org/jira/browse/FLINK-16357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048870#comment-17048870 ]
Stephan Ewen commented on FLINK-16357:
--------------------------------------

Yes, {{OperatorCoordinator#resetToCheckpoint(...)}} is expected to be invoked in {{CheckpointCoordinator#restoreLatestCheckpointedState(...)}} if and only if the failure/recovery came from {{ExecutionGraph.failGlobal()}} or {{SchedulerNG.handleGlobalFailure()}}.

Currently, if we called {{OperatorCoordinator#resetToCheckpoint(...)}} within {{CheckpointCoordinator#restoreLatestCheckpointedState(...)}}, we would reset the coordinator on every regional failover as well, if I read the code correctly.

The {{OperatorCoordinator}} exists once per {{ExecutionJobVertex}}, not once per {{ExecutionVertex}}.

> Extend Checkpoint Coordinator to differentiate between "regional restore" and "full restore".
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-16357
>                 URL: https://issues.apache.org/jira/browse/FLINK-16357
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing
>            Reporter: Stephan Ewen
>            Priority: Major
>             Fix For: 1.11.0
>
>
> The {{ExecutionGraph}} has the notion of "global failure" (failing the entire
> execution graph) and "regional failure" (recover a region with transient
> pipelined data exchanges).
> The latter is the common failover path; the former is a safety net to
> handle unexpected failures or inconsistencies (a full reset of the
> ExecutionGraph recovers from most inconsistencies).
> The OperatorCoordinators should only be reset to a checkpoint in the "global
> failover" case. In the "regional failover" case, they are only notified of
> the tasks that are reset; they keep their internal state and adjust it for
> the failed tasks.
> To implement that, the ExecutionGraph needs to forward the information about
> whether we are restoring from a "regional failure" or from a "global failure".

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
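
The distinction described in the issue can be illustrated with a minimal, self-contained sketch. It is not Flink code: {{FailoverScope}}, {{SketchCoordinator}}, {{restore(...)}}, and the {{subtaskFailed(int)}} notification are hypothetical stand-ins chosen for illustration, and only the name {{resetToCheckpoint}} is taken from the discussion. The sketch assumes the restore path is told which kind of failure triggered it and branches on that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RestoreModeSketch {

    // Hypothetical flag forwarded to the restore path: which kind of failure triggered it.
    enum FailoverScope { REGIONAL, GLOBAL }

    // Hypothetical stand-in for a coordinator; it records which callbacks were invoked.
    static class SketchCoordinator {
        final List<String> calls = new ArrayList<>();

        // Global failover: discard internal state and restore it from the checkpoint.
        void resetToCheckpoint(byte[] checkpointData) {
            calls.add("resetToCheckpoint");
        }

        // Regional failover: keep internal state, only adjust it for the failed subtask.
        void subtaskFailed(int subtaskIndex) {
            calls.add("subtaskFailed:" + subtaskIndex);
        }
    }

    // Hypothetical restore entry point that branches on the forwarded scope.
    static void restore(FailoverScope scope,
                        SketchCoordinator coordinator,
                        byte[] checkpointData,
                        List<Integer> resetSubtasks) {
        if (scope == FailoverScope.GLOBAL) {
            coordinator.resetToCheckpoint(checkpointData);
        } else {
            for (int subtask : resetSubtasks) {
                coordinator.subtaskFailed(subtask);
            }
        }
    }

    public static void main(String[] args) {
        SketchCoordinator regional = new SketchCoordinator();
        restore(FailoverScope.REGIONAL, regional, new byte[0], Arrays.asList(1, 3));
        System.out.println("regional: " + regional.calls);

        SketchCoordinator global = new SketchCoordinator();
        restore(FailoverScope.GLOBAL, global, new byte[0], Arrays.asList(0, 1, 2, 3));
        System.out.println("global: " + global.calls);
    }
}
```

In this sketch the coordinator is reset exactly once on a global failover, regardless of how many subtasks restart, which matches the point that a coordinator exists once per job vertex rather than once per subtask.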