[jira] [Updated] (FLINK-22088) CheckpointCoordinator might not be able to abort triggering checkpoint if failover happens during triggering

Flink Jira Bot (Jira) Sat, 30 Oct 2021 15:39:08 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-22088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Flink Jira Bot updated FLINK-22088:
-----------------------------------
    Labels: auto-unassigned stale-assigned  (was: auto-unassigned)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help 
the community manage its development. I see this issue is assigned but has not 
received an update in 30 days, so it has been labeled "stale-assigned".
If you are still working on the issue, please remove the label and add a 
comment updating the community on your progress.  If this issue is waiting on 
feedback, please consider this a reminder to the committer/reviewer. Flink is a 
very active project, and so we appreciate your patience.
If you are no longer working on the issue, please unassign yourself so someone 
else may work on it.


> CheckpointCoordinator might not be able to abort triggering checkpoint if 
> failover happens during triggering
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-22088
>                 URL: https://issues.apache.org/jira/browse/FLINK-22088
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.12.2, 1.13.0
>            Reporter: Yun Gao
>            Assignee: Yun Gao
>            Priority: Minor
>              Labels: auto-unassigned, stale-assigned
>
> Currently, when a job fails over, 
> CheckpointCoordinatorDeActivator#jobStatusChanges -> stopCheckpointScheduler 
> cancels all the pending checkpoints and also sets periodicScheduling to 
> false. 
> If a checkpoint has just started triggering at that moment, it might have 
> acquired all the executions to trigger before the failover and then go on 
> to trigger them. Ideally it should be aborted in createPendingCheckpoint -> 
> preCheckGlobalState. However, since the check and the creation of the 
> pending checkpoint are in two different lock scopes, 
> CheckpointCoordinator#stopCheckpointScheduler might run between the two 
> lock scopes. 
> We may optimize this check. However, since the executions would eventually 
> fail to trigger the checkpoint, it should not affect the correctness of the 
> job. Besides, even if we optimize it, there might still be cases where 
> triggering on an execution fails due to a concurrent failover. 
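
The window described above can be illustrated with a minimal Java sketch. It is not Flink's code: CoordinatorSketch, triggerWithTwoScopes, triggerWithRecheck, pendingCount and the betweenScopes hook are hypothetical names, and only stopCheckpointScheduler and periodicScheduling echo names from the description. The re-check variant is just one assumed shape the optimization could take. The hook runs on the calling thread so that the interleaving is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, heavily simplified stand-in for a checkpoint coordinator; not Flink's code. */
class CoordinatorSketch {
    private final Object lock = new Object();
    private final Map<Long, String> pendingCheckpoints = new HashMap<>();
    private boolean periodicScheduling = true;
    private long nextId = 1;

    /** Mimics the described effect: stop periodic scheduling and drop all pending checkpoints. */
    void stopCheckpointScheduler() {
        synchronized (lock) {
            periodicScheduling = false;
            pendingCheckpoints.clear();
        }
    }

    /** Racy variant: the pre-check and the registration sit in two separate lock scopes. */
    boolean triggerWithTwoScopes(Runnable betweenScopes) {
        synchronized (lock) {
            if (!periodicScheduling) {
                return false; // pre-check: scheduler already stopped
            }
        }
        betweenScopes.run(); // window in which a failover may stop the scheduler
        synchronized (lock) {
            pendingCheckpoints.put(nextId++, "pending"); // registered despite the stop
            return true;
        }
    }

    /** Assumed fix: repeat the check inside the lock scope that registers the checkpoint. */
    boolean triggerWithRecheck(Runnable betweenScopes) {
        synchronized (lock) {
            if (!periodicScheduling) {
                return false;
            }
        }
        betweenScopes.run();
        synchronized (lock) {
            if (!periodicScheduling) {
                return false; // re-check closes the window between the two scopes
            }
            pendingCheckpoints.put(nextId++, "pending");
            return true;
        }
    }

    /** Number of checkpoints currently registered as pending. */
    int pendingCount() {
        synchronized (lock) {
            return pendingCheckpoints.size();
        }
    }
}
```

Passing stopCheckpointScheduler as the hook to triggerWithTwoScopes leaves one pending checkpoint behind even though the scheduler was stopped, while triggerWithRecheck returns false and registers nothing. As the description notes, even with such a re-check a trigger already sent to an execution can still fail because of a concurrent failover.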



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
