[ https://issues.apache.org/jira/browse/FLINK-26394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499474#comment-17499474 ]
Gen Luo edited comment on FLINK-26394 at 3/1/22, 11:29 AM:
-----------------------------------------------------------

[~yunta] The problem can be reproduced by:
# adding a 10s sleep (or 120s for the older version) in the RequestSplitEvent processing branch of SourceCoordinator.handleEventFromOperator. This imitates enumerator.handleSplitRequest taking too long. (Sketches of this injection and of the timeout configuration for the next step are attached at the end of this message.)
# setting the checkpoint timeout to 2s for FileSourceTextLinesITCase.testContinuousTextFileSource
# running the test FileSourceTextLinesITCase.testContinuousTextFileSource (with FailoverType=NONE)


was (Author: pltbkd):
[~yunta] The problem can be reproduced by:
# add a 10s sleeping (or 120s for the elder version) in the RequestSplitEvent processing branch in SourceCoordinator.handleEventFromOperator. This is imitating the behavior that enumerator.handleSplitRequest takes too long.
# set the checkpoint timeout to 2s for FileSourceTextLinesITCase.testContinuousTextFileSource
# run the test FileSourceTextLinesITCase.testContinuousTextFileSource (with FailoverType=NONE)

> CheckpointCoordinator.isTriggering cannot be reset if a checkpoint expires
> while the checkpointCoordinator task is queued in the SourceCoordinator
> executor.
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-26394
>                 URL: https://issues.apache.org/jira/browse/FLINK-26394
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Gen Luo
>            Priority: Major
>
> We found that a job could no longer trigger checkpoints or savepoints after
> recovering from a checkpoint timeout failure. After investigation, we found
> that the `isTriggering` flag in CheckpointCoordinator is true while no
> checkpoint is actually in progress. The root cause is as follows:
>
> # The job uses a source whose coordinator needs to scan a table while
> requesting splits, which may take more than 10 minutes. The source coordinator
> executor thread is occupied by `handleSplitRequest`, and the
> `checkpointCoordinator` task of the first checkpoint is queued behind it.
> # 10 minutes later, the checkpoint expires, removing the pending checkpoint
> from the coordinator and triggering a global failover. But `isTriggering` is
> not reset here. It can only be reset once the checkpoint completable future
> completes, and that future is now held only by the queued
> `checkpointCoordinator` task, along with the PendingCheckpoint.
> # The job then fails over. RecreateOnResetOperatorCoordinator recreates a new
> SourceCoordinator and closes the previous coordinator asynchronously, with a
> closing timeout fixed at 60s. SourceCoordinator will `shutdown` the
> coordinator executor and then `awaitTermination`. If the queued tasks finish
> within 60s, nothing goes wrong.
> # But if closing is stuck for more than 60s (which in this case is actually
> stuck in `handleSplitRequest`), the async closing thread is interrupted and
> SourceCoordinator calls `shutdownNow` on the executor. All queued tasks are
> discarded, including the `checkpointCoordinator` task.
> # The checkpoint completable future therefore never completes, and the
> `isTriggering` flag is never reset.
>
> I see that the closing part of SourceCoordinator was recently refactored, but
> the new implementation also has this issue. And since it calls `shutdownNow`
> directly, the issue should be even easier to encounter.
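For step 1 of the reproduction, the injected delay looks roughly like the sketch below. The structure of SourceCoordinator.handleEventFromOperator is paraphrased here and may differ between Flink versions; the only intended change is the Thread.sleep call, which stands in for a slow enumerator.handleSplitRequest.

{code:java}
// Sketch only: the surrounding method body is paraphrased, not exact Flink
// source. The injected sleep imitates enumerator.handleSplitRequest taking
// longer than the checkpoint timeout.
@Override
public void handleEventFromOperator(int subtask, OperatorEvent event) {
    runInEventLoop(
            () -> {
                if (event instanceof RequestSplitEvent) {
                    try {
                        // Injected delay: 10s (120s for the older version).
                        Thread.sleep(10_000L);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    enumerator.handleSplitRequest(
                            subtask, ((RequestSplitEvent) event).hostName());
                }
                // ... handling of other operator event types ...
            },
            "handling operator event %s from subtask %d",
            event,
            subtask);
}
{code}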
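Step 2 just tightens the checkpoint timeout so the first checkpoint expires while the event handler is still sleeping. setCheckpointTimeout and enableCheckpointing are the standard public API; where exactly the environment is configured inside FileSourceTextLinesITCase depends on the Flink version, so the class below (with a hypothetical name) only illustrates the calls.

{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Illustration of step 2: make the checkpoint timeout (2s) much shorter than
// the injected 10s sleep, so the first checkpoint is guaranteed to expire
// while the coordinator executor thread is still blocked.
class CheckpointTimeoutExample {
    static StreamExecutionEnvironment createEnv() {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(100L); // trigger checkpoints frequently
        env.getCheckpointConfig().setCheckpointTimeout(2_000L); // expire after 2s
        return env;
    }
}
{code}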
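The mechanism described in steps 3-5 of the report can be demonstrated with nothing but the JDK. The sketch below is not Flink code and all names are illustrative: a single-threaded executor is blocked by a long task (standing in for handleSplitRequest), the only task that would ever complete a future (standing in for the checkpoint future that guards isTriggering) is queued behind it, and an orderly shutdown followed by shutdownNow, mirroring the 60s close timeout, silently discards the queued task.

{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DiscardedTaskDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService coordinatorExecutor = Executors.newSingleThreadExecutor();
        CompletableFuture<Void> checkpointFuture = new CompletableFuture<>();

        // Occupies the only thread, like a slow enumerator.handleSplitRequest.
        coordinatorExecutor.execute(() -> {
            try {
                Thread.sleep(120_000L);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Queued behind it, like the checkpointCoordinator task; this is the
        // only code that ever completes the future.
        coordinatorExecutor.execute(() -> checkpointFuture.complete(null));

        // Mimics the close path: orderly shutdown, a bounded wait, then
        // shutdownNow(), which interrupts the running task and drops all
        // queued tasks without running them.
        coordinatorExecutor.shutdown();
        if (!coordinatorExecutor.awaitTermination(1, TimeUnit.SECONDS)) {
            coordinatorExecutor.shutdownNow();
        }

        // The completing task was discarded, so the future is done: false,
        // forever -- the same state that keeps isTriggering stuck in Flink.
        Thread.sleep(2_000L);
        System.out.println("checkpoint future done? " + checkpointFuture.isDone());
    }
}
{code}

Running this prints "checkpoint future done? false": the future leaks exactly the way the checkpoint completable future does in step 5 of the report.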