[
https://issues.apache.org/jira/browse/FLINK-22088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-22088:
-----------------------------------
Labels: auto-unassigned stale-assigned (was: auto-unassigned)
I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help
the community manage its development. I see this issue is assigned but has not
received an update in 30 days, so it has been labeled "stale-assigned".
If you are still working on the issue, please remove the label and add a
comment updating the community on your progress. If this issue is waiting on
feedback, please consider this a reminder to the committer/reviewer. Flink is a
very active project, and so we appreciate your patience.
If you are no longer working on the issue, please unassign yourself so someone
else may work on it.
> CheckpointCoordinator might not be able to abort triggering checkpoint if
> failover happens during triggering
> ------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-22088
> URL: https://issues.apache.org/jira/browse/FLINK-22088
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.12.2, 1.13.0
> Reporter: Yun Gao
> Assignee: Yun Gao
> Priority: Minor
> Labels: auto-unassigned, stale-assigned
>
> Currently, when a job fails over, it tries to cancel all pending checkpoints
> via CheckpointCoordinatorDeActivator#jobStatusChanges ->
> stopCheckpointScheduler, which cancels all the pending checkpoints and also
> sets periodicScheduling to false.
> If a checkpoint has just started triggering at this time, it might have
> acquired all the executions to trigger before the failover and then go on to
> trigger them. Ideally it should be aborted in createPendingCheckpoint ->
> preCheckGlobalState. However, since the check and the creation of the pending
> checkpoint happen in two different lock scopes, there might be cases where
> CheckpointCoordinator#stopCheckpointScheduler runs between the two lock
> scopes.
> We may optimize this check. However, since the executions would eventually
> fail to trigger the checkpoint, it should not affect the correctness of the
> job. Besides, even if we optimize it, there might still be cases where
> triggering on the executions fails due to a concurrent failover.
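
The interleaving described in the issue can be illustrated with a minimal Java
sketch. It is a hypothetical, simplified model and not Flink's actual
CheckpointCoordinator: the classes CheckpointRaceSketch and Coordinator, the
recheck flag, and the replay helper are assumptions made for illustration.
Only the names preCheckGlobalState, createPendingCheckpoint,
stopCheckpointScheduler, and periodicScheduling are taken from the issue text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of the race; NOT Flink's real coordinator. */
final class CheckpointRaceSketch {

    static final class Coordinator {
        private final Object lock = new Object();
        private boolean periodicScheduling = true;
        private final List<Long> pendingCheckpoints = new ArrayList<>();

        /** First lock scope: the global-state check done before triggering. */
        boolean preCheckGlobalState() {
            synchronized (lock) {
                return periodicScheduling;
            }
        }

        /** Models the failover path: cancel pending checkpoints, stop scheduling. */
        void stopCheckpointScheduler() {
            synchronized (lock) {
                periodicScheduling = false;
                pendingCheckpoints.clear();
            }
        }

        /**
         * Second lock scope: registers the pending checkpoint. With recheck=true
         * the state is validated again inside this scope (assumed mitigation).
         */
        boolean createPendingCheckpoint(long checkpointId, boolean recheck) {
            synchronized (lock) {
                if (recheck && !periodicScheduling) {
                    return false; // aborted: scheduler was stopped in between
                }
                pendingCheckpoints.add(checkpointId);
                return true;
            }
        }

        int pendingCount() {
            synchronized (lock) {
                return pendingCheckpoints.size();
            }
        }
    }

    /**
     * Replays the problematic interleaving deterministically on one thread and
     * returns how many pending checkpoints survive the failover.
     */
    static int replay(boolean recheck) {
        Coordinator coordinator = new Coordinator();
        if (coordinator.preCheckGlobalState()) {
            // Failover lands exactly between the two lock scopes.
            coordinator.stopCheckpointScheduler();
            coordinator.createPendingCheckpoint(1L, recheck);
        }
        return coordinator.pendingCount();
    }

    public static void main(String[] args) {
        System.out.println("pending without re-check: " + replay(false));
        System.out.println("pending with re-check:    " + replay(true));
    }
}
```

In this model, a checkpoint created without the re-check outlives the call to
stopCheckpointScheduler, while the re-check inside the second lock scope aborts
it. As the description notes, such a re-check would only narrow the window;
triggering on the executions could still fail because of a concurrent failover.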
--
This message was sent by Atlassian Jira
(v8.3.4#803005)