[ 
https://issues.apache.org/jira/browse/FLINK-16357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051965#comment-17051965
 ] 

Stephan Ewen commented on FLINK-16357:
--------------------------------------

Invoking {{OperatorCoordinator #subtaskFailed(...)}} is the "common case" for 
failover handling. This for example tells the OperatorCoordinator that any 
splits assigned to that task since the last checkpoint need to be added back to 
the pool of unassigned splits.

{{OperatorCoordinator#resetToCheckpoint(...)}} is primarily needed to restore 
from a savepoint and when the JobManager fails and recovers.
I wanted to call it additionally in the case of a Global Failover as a safety 
net, exactly like you say. The execution graph has the strategy to to a global 
failover when any inconsistency is suspected, and in that case, I think it 
would be good to have a global failover on the OperatorCoordinator as well.


> Extend Checkpoint Coordinator to differentiate between "regional restore" and 
> "full restore".
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-16357
>                 URL: https://issues.apache.org/jira/browse/FLINK-16357
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing
>            Reporter: Stephan Ewen
>            Priority: Major
>             Fix For: 1.11.0
>
>
> The {{ExecutionGraph}} has the notion of "global failure" (failing the entire 
> execution graph) and "regional failure" (recover a region with transient 
> pipelined data exchanges).
> The latter one is for common failover, the former one is a safety net to 
> handle unexpected failures or inconsistencies (full reset of ExecutionGraph 
> recovers most inconsistencies).
> The OperatorCoordinators should only be reset to a checkpoint in the "global 
> failover" case. In the "regional failover" case, they are only notified of 
> the tasks that are reset and keep their internal state and adjust it for the 
> failed tasks.
> To implement that, the ExecutionGraph needs to forward the information about 
> whether we are restoring from a "regional failure" or from a "global failure".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to