[ 
https://issues.apache.org/jira/browse/FLINK-16357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048882#comment-17048882
 ] 

Zhu Zhu commented on FLINK-16357:
---------------------------------

In the case of a global failure, what's the difference between
 * invoking {{OperatorCoordinator #subtaskFailed(...)}} for all execution 
vertices of an {{ExecutionJobVertex}}, and
 * invoking {{OperatorCoordinator#resetToCheckpoint(...)}}

Is {{resetToCheckpoint(...)}} another safety net?

> Extend Checkpoint Coordinator to differentiate between "regional restore" and 
> "full restore".
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-16357
>                 URL: https://issues.apache.org/jira/browse/FLINK-16357
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing
>            Reporter: Stephan Ewen
>            Priority: Major
>             Fix For: 1.11.0
>
>
> The {{ExecutionGraph}} has the notion of "global failure" (failing the entire 
> execution graph) and "regional failure" (recover a region with transient 
> pipelined data exchanges).
> The latter one is for common failover, the former one is a safety net to 
> handle unexpected failures or inconsistencies (full reset of ExecutionGraph 
> recovers most inconsistencies).
> The OperatorCoordinators should only be reset to a checkpoint in the "global 
> failover" case. In the "regional failover" case, they are only notified of 
> the tasks that are reset and keep their internal state and adjust it for the 
> failed tasks.
> To implement that, the ExecutionGraph needs to forward the information about 
> whether we are restoring from a "regional failure" or from a "global failure".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to