Gen Luo created FLINK-26394:
-------------------------------

             Summary: CheckpointCoordinator.isTriggering can not be reset if a 
checkpoint expires while the checkpointCoordinator task is queuing in the 
SourceCoordinator executor.
                 Key: FLINK-26394
                 URL: https://issues.apache.org/jira/browse/FLINK-26394
             Project: Flink
          Issue Type: Bug
            Reporter: Gen Luo


We found that a job can no longer trigger checkpoints or savepoints after 
recovering from a checkpoint timeout failure. After investigation, we found 
that the `isTriggering` flag in CheckpointCoordinator is true while no 
checkpoint is actually in progress. The root cause is as follows:

 
 # The job uses a source whose coordinator needs to scan a table while 
handling split requests, which may take more than 10 minutes. The source 
coordinator executor thread is occupied by `handleSplitRequest`, so the 
`checkpointCoordinator` task of the first checkpoint is queued behind it.
 # 10 minutes later, the checkpoint expires, which removes the pending 
checkpoint from the coordinator and triggers a global failover. But 
`isTriggering` is not reset here. It can only be reset after the checkpoint 
completable future is done, and that future is now held only by the 
`checkpointCoordinator` task in the queue, along with the PendingCheckpoint.
 # Then the job fails over, and the RecreateOnResetOperatorCoordinator 
creates a new SourceCoordinator and closes the previous coordinator 
asynchronously. The timeout for closing is fixed at 60s. SourceCoordinator 
tries to `shutdown` the coordinator executor and then `awaitTermination`. If 
the tasks finish within 60s, nothing goes wrong.
 # But if the closing method is stuck for more than 60s (in this case it is 
actually stuck behind `handleSplitRequest`), the async closing thread is 
interrupted and SourceCoordinator calls `shutdownNow` on the executor. All 
queued tasks are discarded, including the `checkpointCoordinator` task.
 # Then the checkpoint completable future will never complete and the 
`isTriggering` flag will never be reset.
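
The last two steps can be reproduced outside Flink with plain 
java.util.concurrent classes. The following is only a minimal sketch, not 
Flink code: the class name, the `Outcome` holder, the `AtomicBoolean` standing 
in for `isTriggering`, and the sleeping task standing in for 
`handleSplitRequest` are all made up for illustration. It shows that a task 
still waiting in a single-thread executor's queue is returned by `shutdownNow` 
without ever running, so a future that only that task could complete stays 
incomplete.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-alone sketch; none of these names exist in Flink.
class DroppedCheckpointTaskSketch {

    /** What we observe after the executor has been shut down. */
    static final class Outcome {
        final int discardedTasks;
        final boolean futureDone;
        final boolean isTriggering;

        Outcome(int discardedTasks, boolean futureDone, boolean isTriggering) {
            this.discardedTasks = discardedTasks;
            this.futureDone = futureDone;
            this.isTriggering = isTriggering;
        }
    }

    static Outcome run() throws InterruptedException {
        // Stand-in for the isTriggering flag: set when triggering starts.
        AtomicBoolean isTriggering = new AtomicBoolean(true);

        // The flag is reset only when this future completes, normally or exceptionally.
        CompletableFuture<Void> checkpointFuture = new CompletableFuture<>();
        checkpointFuture.whenComplete((ignored, error) -> isTriggering.set(false));

        // Stand-in for the single-threaded source coordinator executor.
        ExecutorService coordinatorExecutor = Executors.newSingleThreadExecutor();
        CountDownLatch splitRequestStarted = new CountDownLatch(1);

        // Stand-in for a long-running handleSplitRequest occupying the thread.
        coordinatorExecutor.execute(() -> {
            splitRequestStarted.countDown();
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stand-in for the queued checkpoint task: the only holder of the future.
        coordinatorExecutor.execute(() -> checkpointFuture.complete(null));

        // Make sure the first task is running, so the second one is still queued.
        splitRequestStarted.await();

        // shutdownNow interrupts the running task and returns the queued ones unrun.
        List<Runnable> discarded = coordinatorExecutor.shutdownNow();
        coordinatorExecutor.awaitTermination(5, TimeUnit.SECONDS);

        return new Outcome(discarded.size(), checkpointFuture.isDone(), isTriggering.get());
    }

    public static void main(String[] args) throws InterruptedException {
        Outcome o = run();
        System.out.println("discarded tasks: " + o.discardedTasks);
        System.out.println("future done:     " + o.futureDone);
        System.out.println("isTriggering:    " + o.isTriggering);
    }
}
```

In this sketch the queued task is the only code path that completes the 
future, which mirrors the situation described above. Nothing inspects the 
list returned by `shutdownNow`, so the flag stays set after the executor has 
terminated.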

 

I see that the closing part of SourceCoordinator was recently refactored, but 
the new implementation also has this issue. And since it calls `shutdownNow` 
directly, the issue should be even easier to hit.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
