[jira] [Commented] (FLINK-16357) Extend Checkpoint Coordinator to differentiate between "regional restore" and "full restore".

Stephan Ewen (Jira) Mon, 02 Mar 2020 00:34:21 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-16357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048870#comment-17048870
 ]


Stephan Ewen commented on FLINK-16357:
--------------------------------------

Yes, {{OperatorCoordinator#resetToCheckpoint(...)}} is expected to be invoked 
in {{CheckpointCoordinator#restoreLatestCheckpointedState(...)}} if and only 
if the recovery was triggered by {{ExecutionGraph.failGlobal()}} or 
{{SchedulerNG.handleGlobalFailure()}}.

Currently, if we called {{OperatorCoordinator#resetToCheckpoint(...)}} 
within {{CheckpointCoordinator#restoreLatestCheckpointedState(...)}}, we would 
also reset the coordinator on every regional failover, if I read the code 
correctly.

The {{OperatorCoordinator}} exists once per {{ExecutionJobVertex}}, not once 
per {{ExecutionVertex}}.
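
To illustrate the distinction, here is a minimal, self-contained sketch of how 
a restore path could branch on the failover scope. It is not Flink's actual 
code: {{SketchCoordinator}}, {{RecordingCoordinator}}, {{RestoreSketch}}, 
{{RestoreScope}} and the method signatures are hypothetical names made up for 
this example. Only the behavior (reset on global failover, notify on regional 
failover) comes from the issue description.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an operator coordinator; NOT Flink's real interface. */
interface SketchCoordinator {
    void resetToCheckpoint(byte[] checkpointData);
    void subtaskFailed(int subtaskIndex);
}

/** Test double that records which callbacks were invoked, in order. */
final class RecordingCoordinator implements SketchCoordinator {
    final List<String> events = new ArrayList<>();

    @Override
    public void resetToCheckpoint(byte[] checkpointData) {
        events.add("reset");
    }

    @Override
    public void subtaskFailed(int subtaskIndex) {
        events.add("failed:" + subtaskIndex);
    }
}

final class RestoreSketch {
    /** Assumed flag that the scheduler would forward to the restore call. */
    enum RestoreScope { REGIONAL, GLOBAL }

    /**
     * Sketch of a restore entry point. Restoring task state is omitted;
     * only the coordinator handling is shown.
     */
    static void restoreLatestCheckpointedState(
            RestoreScope scope,
            SketchCoordinator coordinator,
            byte[] coordinatorState,
            List<Integer> restartedSubtasks) {
        if (scope == RestoreScope.GLOBAL) {
            // Global failover: the coordinator is reset to the checkpoint.
            coordinator.resetToCheckpoint(coordinatorState);
        } else {
            // Regional failover: the coordinator keeps its state and is only
            // told which subtasks were reset.
            for (int subtask : restartedSubtasks) {
                coordinator.subtaskFailed(subtask);
            }
        }
    }
}
```

Since there is one coordinator per {{ExecutionJobVertex}}, the regional branch 
passes the affected subtask indices to that single instance rather than 
looking up a coordinator per {{ExecutionVertex}}.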

> Extend Checkpoint Coordinator to differentiate between "regional restore" and 
> "full restore".
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-16357
>                 URL: https://issues.apache.org/jira/browse/FLINK-16357
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing
>            Reporter: Stephan Ewen
>            Priority: Major
>             Fix For: 1.11.0
>
>
> The {{ExecutionGraph}} has the notion of "global failure" (failing the entire 
> execution graph) and "regional failure" (recover a region with transient 
> pipelined data exchanges).
> The latter one is for common failover, the former one is a safety net to 
> handle unexpected failures or inconsistencies (full reset of ExecutionGraph 
> recovers most inconsistencies).
> The OperatorCoordinators should only be reset to a checkpoint in the "global 
> failover" case. In the "regional failover" case, they are only notified of 
> the tasks that are reset and keep their internal state and adjust it for the 
> failed tasks.
> To implement that, the ExecutionGraph needs to forward the information about 
> whether we are restoring from a "regional failure" or from a "global failure".



