[ https://issues.apache.org/jira/browse/FLINK-26394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499474#comment-17499474 ]
Gen Luo edited comment on FLINK-26394 at 3/1/22, 11:29 AM:
-----------------------------------------------------------

[~yunta] The problem can be reproduced by:
# adding a 10s sleep (or 120s for the older version) in the RequestSplitEvent processing branch of SourceCoordinator.handleEventFromOperator. This imitates enumerator.handleSplitRequest taking too long. (Sketches of this injection and of the timeout configuration for the next step are attached at the end of this message.)
# setting the checkpoint timeout to 2s for FileSourceTextLinesITCase.testContinuousTextFileSource
# running the test FileSourceTextLinesITCase.testContinuousTextFileSource (with FailoverType=NONE)


was (Author: pltbkd):
[~yunta] The problem can be reproduced by:
# add a 10s sleeping (or 120s for the elder version) in the RequestSplitEvent processing branch in SourceCoordinator.handleEventFromOperator. This is imitating the behavior that enumerator.handleSplitRequest takes too long.
# set the checkpoint timeout to 2s for FileSourceTextLinesITCase.testContinuousTextFileSource
# run the test FileSourceTextLinesITCase.testContinuousTextFileSource (with FailoverType=NONE)

> CheckpointCoordinator.isTriggering cannot be reset if a checkpoint expires
> while the checkpointCoordinator task is queued in the SourceCoordinator
> executor.
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-26394
>                 URL: https://issues.apache.org/jira/browse/FLINK-26394
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Gen Luo
>            Priority: Major
>
> We found that a job could no longer trigger checkpoints or savepoints after
> recovering from a checkpoint timeout failure. After investigation, we found
> that the `isTriggering` flag in CheckpointCoordinator is true while no
> checkpoint is actually in progress. The root cause is as follows:
>
> # The job uses a source whose coordinator needs to scan a table while
> requesting splits, which may take more than 10 minutes. The source coordinator
> executor thread is occupied by `handleSplitRequest`, and the
> `checkpointCoordinator` task of the first checkpoint is queued behind it.
> # 10 minutes later, the checkpoint expires, removing the pending checkpoint
> from the coordinator and triggering a global failover. But `isTriggering` is
> not reset here. It can only be reset once the checkpoint completable future
> completes, and that future is now held only by the queued
> `checkpointCoordinator` task, along with the PendingCheckpoint.
> # The job then fails over. RecreateOnResetOperatorCoordinator recreates a new
> SourceCoordinator and closes the previous coordinator asynchronously, with a
> closing timeout fixed at 60s. SourceCoordinator will `shutdown` the
> coordinator executor and then `awaitTermination`. If the queued tasks finish
> within 60s, nothing goes wrong.
> # But if closing is stuck for more than 60s (which in this case is actually
> stuck in `handleSplitRequest`), the async closing thread is interrupted and
> SourceCoordinator calls `shutdownNow` on the executor. All queued tasks are
> discarded, including the `checkpointCoordinator` task.
> # The checkpoint completable future therefore never completes, and the
> `isTriggering` flag is never reset.
>
> I see that the closing part of SourceCoordinator was recently refactored, but
> the new implementation also has this issue. And since it calls `shutdownNow`
> directly, the issue should be even easier to encounter.
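For step 1 of the reproduction, the injected delay looks roughly like the sketch below. The structure of SourceCoordinator.handleEventFromOperator is paraphrased here and may differ between Flink versions; the only intended change is the Thread.sleep call, which stands in for a slow enumerator.handleSplitRequest.

{code:java}
// Sketch only: the surrounding method body is paraphrased, not exact Flink
// source. The injected sleep imitates enumerator.handleSplitRequest taking
// longer than the checkpoint timeout.
@Override
public void handleEventFromOperator(int subtask, OperatorEvent event) {
    runInEventLoop(
            () -> {
                if (event instanceof RequestSplitEvent) {
                    try {
                        // Injected delay: 10s (120s for the older version).
                        Thread.sleep(10_000L);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    enumerator.handleSplitRequest(
                            subtask, ((RequestSplitEvent) event).hostName());
                }
                // ... handling of other operator event types ...
            },
            "handling operator event %s from subtask %d",
            event,
            subtask);
}
{code}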
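Step 2 just tightens the checkpoint timeout so the first checkpoint expires while the event handler is still sleeping. setCheckpointTimeout and enableCheckpointing are the standard public API; where exactly the environment is configured inside FileSourceTextLinesITCase depends on the Flink version, so the class below (with a hypothetical name) only illustrates the calls.

{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Illustration of step 2: make the checkpoint timeout (2s) much shorter than
// the injected 10s sleep, so the first checkpoint is guaranteed to expire
// while the coordinator executor thread is still blocked.
class CheckpointTimeoutExample {
    static StreamExecutionEnvironment createEnv() {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(100L); // trigger checkpoints frequently
        env.getCheckpointConfig().setCheckpointTimeout(2_000L); // expire after 2s
        return env;
    }
}
{code}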
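The mechanism described in steps 3-5 of the report can be demonstrated with nothing but the JDK. The sketch below is not Flink code and all names are illustrative: a single-threaded executor is blocked by a long task (standing in for handleSplitRequest), the only task that would ever complete a future (standing in for the checkpoint future that guards isTriggering) is queued behind it, and an orderly shutdown followed by shutdownNow, mirroring the 60s close timeout, silently discards the queued task.

{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DiscardedTaskDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService coordinatorExecutor = Executors.newSingleThreadExecutor();
        CompletableFuture<Void> checkpointFuture = new CompletableFuture<>();

        // Occupies the only thread, like a slow enumerator.handleSplitRequest.
        coordinatorExecutor.execute(() -> {
            try {
                Thread.sleep(120_000L);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Queued behind it, like the checkpointCoordinator task; this is the
        // only code that ever completes the future.
        coordinatorExecutor.execute(() -> checkpointFuture.complete(null));

        // Mimics the close path: orderly shutdown, a bounded wait, then
        // shutdownNow(), which interrupts the running task and drops all
        // queued tasks without running them.
        coordinatorExecutor.shutdown();
        if (!coordinatorExecutor.awaitTermination(1, TimeUnit.SECONDS)) {
            coordinatorExecutor.shutdownNow();
        }

        // The completing task was discarded, so the future is done: false,
        // forever -- the same state that keeps isTriggering stuck in Flink.
        Thread.sleep(2_000L);
        System.out.println("checkpoint future done? " + checkpointFuture.isDone());
    }
}
{code}

Running this prints "checkpoint future done? false": the future leaks exactly the way the checkpoint completable future does in step 5 of the report.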