[ 
https://issues.apache.org/jira/browse/FLINK-28474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huang Xingbo updated FLINK-28474:
---------------------------------
    Fix Version/s: 1.17.0
                   1.16.1
                       (was: 1.16.0)

> ChannelStateWriteResult may not fail after checkpoint abort
> -----------------------------------------------------------
>
>                 Key: FLINK-28474
>                 URL: https://issues.apache.org/jira/browse/FLINK-28474
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.5, 1.15.1
>            Reporter: fanrui
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.17.0, 1.15.3, 1.14.7, 1.16.1
>
>         Attachments: image-2022-07-09-22-21-24-417.png
>
>
> After a checkpoint is aborted, its ChannelStateWriteResult should fail.
> However, if _channelStateWriter.start(id, checkpointOptions);_ is executed 
> after the abort, the ChannelStateWriteResult never fails.
>  
> h2. Cause Analysis:
> When a checkpoint is aborted, channelStateWriter.start(id, 
> checkpointOptions); may not have been executed yet. The IDs of such 
> checkpoints are stored in abortedCheckpointIds of 
> SubtaskCheckpointCoordinatorImpl, and when checkpointState is called, it 
> checks whether the checkpointId should be aborted.
> _ChannelStateWriter.abort(checkpointId, exception, true) should also be 
> executed here._
> A unit test can reproduce this bug.
> !image-2022-07-09-22-21-24-417.png|width=803,height=307!
>  
> Note: channelStateWriter.abort is only called in notifyCheckpointAborted, so 
> it does not cover the case where channelStateWriter.start runs after 
> notifyCheckpointAborted.
> JIRA: FLINK-17869
> commit: 
> https://github.com/apache/flink/pull/12478/commits/22c99845ef4f863f1753d17b109fd2faecc8201e
>  
> This bug affects the new feature FLINK-26803, because the shared channel 
> state file can be closed only after the checkpoints of all tasks sharing the 
> file have completed or been aborted. If the checkpoint of some task fails 
> and abort is not called, the file cannot be closed, none of the tasks 
> sharing the file can execute 
> inputChannelStateHandles.completeExceptionally(e); and 
> resultSubpartitionStateHandles.completeExceptionally(e);, and 
> AsyncCheckpointRunnable waits forever.
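
The ordering problem described in the cause analysis can be illustrated with a small, self-contained model. This is only a sketch under simplifying assumptions: the classes WriteResult, Writer and Coordinator are hypothetical stand-ins for ChannelStateWriteResult, ChannelStateWriter and SubtaskCheckpointCoordinatorImpl, the flag abortWriterOnLateStart is invented to switch the proposed behavior on and off, and the method signatures are reduced to the checkpoint id. It is not Flink's actual code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

/** Simplified, hypothetical model of the abort-before-start race; not Flink's classes. */
class AbortBeforeStartSketch {

    /** Stand-in for ChannelStateWriteResult: a future that must complete or fail. */
    static final class WriteResult {
        final CompletableFuture<Void> handles = new CompletableFuture<>();
    }

    /** Stand-in for ChannelStateWriter. */
    static final class Writer {
        private final Map<Long, WriteResult> results = new HashMap<>();

        WriteResult start(long checkpointId) {
            WriteResult result = new WriteResult();
            results.put(checkpointId, result);
            return result;
        }

        /** Fails the result if one exists; a no-op when start() has not run yet. */
        void abort(long checkpointId, Throwable cause) {
            WriteResult result = results.remove(checkpointId);
            if (result != null) {
                result.handles.completeExceptionally(cause);
            }
        }
    }

    /** Stand-in for SubtaskCheckpointCoordinatorImpl. */
    static final class Coordinator {
        final Writer writer = new Writer();
        private final Set<Long> abortedCheckpointIds = new HashSet<>();
        private final boolean abortWriterOnLateStart; // hypothetical switch for the fix

        Coordinator(boolean abortWriterOnLateStart) {
            this.abortWriterOnLateStart = abortWriterOnLateStart;
        }

        void notifyCheckpointAborted(long checkpointId) {
            abortedCheckpointIds.add(checkpointId);
            // Has no effect if writer.start(checkpointId) has not been called yet.
            writer.abort(checkpointId, new RuntimeException("checkpoint aborted"));
        }

        /** Returns false if the checkpoint was aborted before it got here. */
        boolean checkpointState(long checkpointId) {
            if (abortedCheckpointIds.remove(checkpointId)) {
                if (abortWriterOnLateStart) {
                    // Proposed behavior: also abort the writer at this point.
                    writer.abort(checkpointId, new RuntimeException("checkpoint aborted"));
                }
                return false;
            }
            return true;
        }
    }

    /** Runs abort -> start -> checkpointState and reports whether the result failed. */
    static boolean resultFailed(boolean applyFix) {
        Coordinator coordinator = new Coordinator(applyFix);
        coordinator.notifyCheckpointAborted(1L);                // abort arrives first
        WriteResult result = coordinator.writer.start(1L);      // start runs afterwards
        coordinator.checkpointState(1L);                        // sees the aborted id
        return result.handles.isCompletedExceptionally();
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + resultFailed(false)); // prints "without fix: false"
        System.out.println("with fix: " + resultFailed(true));     // prints "with fix: true"
    }
}
```

In this model, a result that never fails is exactly what a caller blocked on the future would wait on indefinitely, which mirrors the AsyncCheckpointRunnable hang described above.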



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
