Yun Gao created FLINK-22088:
-------------------------------

             Summary: CheckpointCoordinator might not be able to abort 
triggering checkpoint if failover happens during triggering
                 Key: FLINK-22088
                 URL: https://issues.apache.org/jira/browse/FLINK-22088
             Project: Flink
          Issue Type: Bug
            Reporter: Yun Gao


Currently, when a job fails over, 
CheckpointCoordinatorDeActivator#jobStatusChanges -> stopCheckpointScheduler 
cancels all the pending checkpoints and also sets periodicScheduling to false. 

If a checkpoint has just started triggering at this time, it might already have 
acquired all the executions to trigger before the failover, and it then goes on 
triggering. Ideally it should be aborted in createPendingCheckpoint -> 
preCheckGlobalState. However, since the check and the creation of the pending 
checkpoint happen in two different lock scopes, there might be cases where 
CheckpointCoordinator#stopCheckpointScheduler runs between the two lock scopes. 
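
The gap between the two lock scopes can be illustrated with a toy model. This 
is only a sketch and not the Flink code: the class TwoLockScopeRace and its 
methods preCheck, registerPending, checkAndRegister and stopScheduler are 
hypothetical names made up for illustration, and the interleaving is replayed 
sequentially on one thread so that the outcome is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model (NOT Flink code): a check-then-act gap between two lock scopes. */
class TwoLockScopeRace {
    private final Object lock = new Object();
    private boolean periodicScheduling = true;
    private final Map<Long, String> pendingCheckpoints = new HashMap<>();

    /** Lock scope 1: the global-state check. */
    boolean preCheck() {
        synchronized (lock) {
            return periodicScheduling;
        }
    }

    /** Lock scope 2: registers the pending checkpoint without re-checking. */
    void registerPending(long id) {
        synchronized (lock) {
            pendingCheckpoints.put(id, "PENDING");
        }
    }

    /** Variant: check and registration share a single lock scope. */
    boolean checkAndRegister(long id) {
        synchronized (lock) {
            if (!periodicScheduling) {
                return false;
            }
            pendingCheckpoints.put(id, "PENDING");
            return true;
        }
    }

    /** Failover path: stop scheduling and drop every pending checkpoint known so far. */
    void stopScheduler() {
        synchronized (lock) {
            periodicScheduling = false;
            pendingCheckpoints.clear();
        }
    }

    int pendingCount() {
        synchronized (lock) {
            return pendingCheckpoints.size();
        }
    }

    /** Failover lands between the two scopes: the new checkpoint is never aborted. */
    static int twoScopes() {
        TwoLockScopeRace c = new TwoLockScopeRace();
        boolean passed = c.preCheck(); // trigger path: check passes
        c.stopScheduler();             // failover path runs in the gap
        if (passed) {
            c.registerPending(1L);     // registered after the abort already ran
        }
        return c.pendingCount();
    }

    /** Same ordering of events, but the check is made where the registration happens. */
    static int oneScope() {
        TwoLockScopeRace c = new TwoLockScopeRace();
        c.stopScheduler();
        c.checkAndRegister(1L);        // rejected: scheduling is already stopped
        return c.pendingCount();
    }

    public static void main(String[] args) {
        System.out.println("two scopes, leftover pending: " + twoScopes());
        System.out.println("one scope, leftover pending: " + oneScope());
    }
}
```

In this model the two-scope variant leaves one pending checkpoint behind after 
the scheduler was stopped, while the single-scope variant leaves none. Whether 
merging the scopes is practical in the real coordinator is a separate question.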

We may optimize this check. However, since the executions would eventually fail 
to trigger the checkpoint, it should not affect the correctness of the job. 
Besides, even if we optimize it, there might still be cases where triggering on 
an execution fails due to a concurrent failover. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)