[ https://issues.apache.org/jira/browse/FLINK-16357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048882#comment-17048882 ]
Zhu Zhu commented on FLINK-16357:
---------------------------------

In the case of a global failure, what's the difference between
* invoking {{OperatorCoordinator#subtaskFailed(...)}} for all execution vertices of an {{ExecutionJobVertex}}, and
* invoking {{OperatorCoordinator#resetToCheckpoint(...)}}?

Is {{resetToCheckpoint(...)}} another safety net?

> Extend Checkpoint Coordinator to differentiate between "regional restore" and "full restore".
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-16357
>                 URL: https://issues.apache.org/jira/browse/FLINK-16357
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing
>            Reporter: Stephan Ewen
>            Priority: Major
>             Fix For: 1.11.0
>
>
> The {{ExecutionGraph}} has the notion of "global failure" (failing the entire
> execution graph) and "regional failure" (recover a region with transient
> pipelined data exchanges).
> The latter one is for common failover, the former one is a safety net to
> handle unexpected failures or inconsistencies (full reset of ExecutionGraph
> recovers most inconsistencies).
> The OperatorCoordinators should only be reset to a checkpoint in the "global
> failover" case. In the "regional failover" case, they are only notified of
> the tasks that are reset and keep their internal state and adjust it for the
> failed tasks.
> To implement that, the ExecutionGraph needs to forward the information about
> whether we are restoring from a "regional failure" or from a "global failure".
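
For illustration, the distinction drawn in the issue description might be sketched roughly as below. This is only a sketch under assumptions: the {{Coordinator}} interface is a simplified stand-in and not Flink's actual {{OperatorCoordinator}} signature, and {{FailoverSketch}}, {{RecordingCoordinator}}, {{restoreRegional}} and {{restoreGlobal}} are hypothetical names. It leaves open whether a global failure should additionally notify the coordinator per subtask.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FailoverSketch {

    /** Simplified stand-in for a coordinator (assumption, not Flink's real interface). */
    interface Coordinator {
        void subtaskFailed(int subtask);
        void resetToCheckpoint(byte[] checkpointData);
    }

    /** Hypothetical coordinator that records which callbacks it received. */
    static class RecordingCoordinator implements Coordinator {
        final List<String> calls = new ArrayList<>();

        @Override
        public void subtaskFailed(int subtask) {
            // Regional path: internal state is kept and only adjusted for this subtask.
            calls.add("subtaskFailed(" + subtask + ")");
        }

        @Override
        public void resetToCheckpoint(byte[] checkpointData) {
            // Global path: internal state is replaced by the checkpointed state.
            calls.add("resetToCheckpoint(" + checkpointData.length + " bytes)");
        }
    }

    /** Regional failover: notify the coordinator of each reset subtask only. */
    static void restoreRegional(Coordinator coordinator, List<Integer> failedSubtasks) {
        for (int subtask : failedSubtasks) {
            coordinator.subtaskFailed(subtask);
        }
    }

    /** Global failover: reset the coordinator to the checkpointed state. */
    static void restoreGlobal(Coordinator coordinator, byte[] checkpointData) {
        coordinator.resetToCheckpoint(checkpointData);
    }

    public static void main(String[] args) {
        RecordingCoordinator regional = new RecordingCoordinator();
        restoreRegional(regional, Arrays.asList(0, 2));
        System.out.println("regional: " + regional.calls);

        RecordingCoordinator global = new RecordingCoordinator();
        restoreGlobal(global, new byte[] {1, 2, 3});
        System.out.println("global: " + global.calls);
    }
}
```

In this sketch the caller decides which path to take, which mirrors the point in the description that the {{ExecutionGraph}} needs to forward whether the restore is regional or global.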