[jira] [Commented] (FLINK-16357) Extend Checkpoint Coordinator to differentiate between "regional restore" and "full restore".

Zhu Zhu (Jira) Sun, 01 Mar 2020 22:43:20 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-16357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048828#comment-17048828
 ]


Zhu Zhu commented on FLINK-16357:
---------------------------------

Is OperatorCoordinator#resetToCheckpoint(...) expected to be invoked in 
CheckpointCoordinator#restoreLatestCheckpointedState(...)? If so, it seems there 
is no need to tell the CheckpointCoordinator whether it is a global failure or a 
regional failure. Instead, it can just be given the set of execution vertices 
affected by the failure, namely by changing the param {{tasks}} of 
CheckpointCoordinator#restoreLatestCheckpointedState(...) from 
Set<ExecutionJobVertex> to Set<ExecutionVertex>.

In the new scheduler (DefaultScheduler), the logic for global failure recovery 
and regional failure recovery is almost the same, except for how the set of 
ExecutionVertex to restart is calculated. So it does not differentiate between 
global and regional failure at the stage where it restores task states and 
reschedules the tasks. There is always a set of ExecutionVertex to restart, 
which can be passed to CheckpointCoordinator#restoreLatestCheckpointedState(...).
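
To illustrate the idea, here is a minimal sketch in Java. It is not Flink code: SketchScheduler and SketchCheckpointCoordinator are hypothetical stand-ins, vertices are plain String ids rather than ExecutionVertex objects, and a regional failure restarts only the region containing the failed vertex (a simplification). The only point it tries to show is that both failure paths differ in how the restart set is computed and then share one restore call.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; these classes are hypothetical, not Flink's API. */
class RestoreSketch {

    /** Stand-in for the checkpoint coordinator; it just records each restore request. */
    static class SketchCheckpointCoordinator {
        final List<Set<String>> restoreCalls = new ArrayList<>();

        /** Assumed signature: takes the affected vertices, with no global/regional flag. */
        void restoreLatestCheckpointedState(Set<String> verticesToRestart) {
            restoreCalls.add(new HashSet<>(verticesToRestart));
        }
    }

    /** Stand-in for the scheduler; each region is a set of vertex ids. */
    static class SketchScheduler {
        private final List<Set<String>> regions;
        private final Set<String> allVertices = new HashSet<>();
        private final SketchCheckpointCoordinator coordinator;

        SketchScheduler(List<Set<String>> regions, SketchCheckpointCoordinator coordinator) {
            this.regions = regions;
            this.coordinator = coordinator;
            for (Set<String> region : regions) {
                allVertices.addAll(region);
            }
        }

        /** Global failure: the restart set is every vertex in the graph. */
        void handleGlobalFailure() {
            restart(new HashSet<>(allVertices));
        }

        /** Regional failure: the restart set is the region containing the failed vertex. */
        void handleTaskFailure(String failedVertex) {
            for (Set<String> region : regions) {
                if (region.contains(failedVertex)) {
                    restart(region);
                    return;
                }
            }
            throw new IllegalArgumentException("Unknown vertex: " + failedVertex);
        }

        /** Shared path: restore state for the given vertices (rescheduling would follow). */
        private void restart(Set<String> verticesToRestart) {
            coordinator.restoreLatestCheckpointedState(verticesToRestart);
        }
    }
}
```

In this sketch the coordinator never learns which kind of failure occurred. If it needed to know, it could in principle compare the received set against all vertices, but whether that is sufficient for the OperatorCoordinator reset is exactly the open question above.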

> Extend Checkpoint Coordinator to differentiate between "regional restore" and 
> "full restore".
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-16357
>                 URL: https://issues.apache.org/jira/browse/FLINK-16357
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing
>            Reporter: Stephan Ewen
>            Priority: Major
>             Fix For: 1.11.0
>
>
> The {{ExecutionGraph}} has the notion of "global failure" (failing the entire 
> execution graph) and "regional failure" (recover a region with transient 
> pipelined data exchanges).
> The latter one is for common failover, the former one is a safety net to 
> handle unexpected failures or inconsistencies (full reset of ExecutionGraph 
> recovers most inconsistencies).
> The OperatorCoordinators should only be reset to a checkpoint in the "global 
> failover" case. In the "regional failover" case, they are only notified of 
> the tasks that are reset and keep their internal state and adjust it for the 
> failed tasks.
> To implement that, the ExecutionGraph needs to forward the information about 
> whether we are restoring from a "regional failure" or from a "global failure".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
