[ https://issues.apache.org/jira/browse/FLINK-28474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xingbo Huang updated FLINK-28474:
---------------------------------
    Fix Version/s: 1.14.7
                       (was: 1.14.6)

> ChannelStateWriteResult may not fail after checkpoint abort
> -----------------------------------------------------------
>
>                 Key: FLINK-28474
>                 URL: https://issues.apache.org/jira/browse/FLINK-28474
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.5, 1.15.1
>            Reporter: fanrui
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.16.0, 1.15.3, 1.14.7
>
>         Attachments: image-2022-07-09-22-21-24-417.png
>
>
> After a checkpoint is aborted, ChannelStateWriteResult should fail.
> But if _channelStateWriter.start(id, checkpointOptions);_ is executed after
> the checkpoint abort, ChannelStateWriteResult will not fail.
>
> h2. Cause Analysis:
> When a checkpoint is aborted, channelStateWriter.start(id, checkpointOptions);
> may not have been executed yet. Such checkpointIds are stored in the
> abortedCheckpointIds of SubtaskCheckpointCoordinatorImpl, and when
> checkpointState is called, it checks whether the checkpointId should be
> aborted.
> _ChannelStateWriter.abort(checkpointId, exception, true) should also be
> executed here._
> The unit test can reproduce this bug.
> !image-2022-07-09-22-21-24-417.png|width=803,height=307!
>
> Note: channelStateWriter.abort is only called in notifyCheckpointAborted; it
> does not account for channelStateWriter.start being called after
> notifyCheckpointAborted.
> JIRA: FLINK-17869
> commit:
> https://github.com/apache/flink/pull/12478/commits/22c99845ef4f863f1753d17b109fd2faecc8201e
>
> The bug will affect the new feature FLINK-26803, because the channel state
> file can be closed only after the checkpoints of all tasks sharing the file
> have completed or aborted. So when the checkpoint of some tasks fails and
> abort is not called, the file cannot be closed, and none of the tasks sharing
> the file can execute inputChannelStateHandles.completeExceptionally(e); and
> resultSubpartitionStateHandles.completeExceptionally(e);, so
> AsyncCheckpointRunnable will wait forever.
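
The ordering problem described in the cause analysis can be illustrated with a small toy model. This is only a sketch, not Flink's code: ToyChannelStateWriter, ToyCoordinator, startChannelStateWriter and the abortWriterOnLateStart flag are hypothetical names invented for illustration. It also assumes that aborting a checkpoint the writer has not started yet is a no-op. Only the call order (abort notification, then start, then checkpointState) is taken from the issue text.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

/** Toy model of the abort-before-start ordering; none of these classes are Flink's. */
class AbortBeforeStartSketch {

    /** Hypothetical stand-in for the channel state writer: one result future per checkpoint. */
    static class ToyChannelStateWriter {
        private final Map<Long, CompletableFuture<String>> results = new HashMap<>();

        void start(long checkpointId) {
            results.putIfAbsent(checkpointId, new CompletableFuture<>());
        }

        /** Fails the result if the checkpoint was started; assumed to be a no-op otherwise. */
        void abort(long checkpointId, Throwable cause) {
            CompletableFuture<String> result = results.get(checkpointId);
            if (result != null) {
                result.completeExceptionally(cause);
            }
        }

        CompletableFuture<String> getWriteResult(long checkpointId) {
            return results.get(checkpointId);
        }
    }

    /** Hypothetical stand-in for the subtask checkpoint coordinator. */
    static class ToyCoordinator {
        private final ToyChannelStateWriter writer = new ToyChannelStateWriter();
        private final Set<Long> abortedCheckpointIds = new HashSet<>();
        private final boolean abortWriterOnLateStart;

        ToyCoordinator(boolean abortWriterOnLateStart) {
            this.abortWriterOnLateStart = abortWriterOnLateStart;
        }

        /** Remembers the aborted id and aborts the writer, which has nothing to fail yet. */
        void notifyCheckpointAborted(long checkpointId) {
            abortedCheckpointIds.add(checkpointId);
            writer.abort(checkpointId, new RuntimeException("checkpoint aborted"));
        }

        /** Models channelStateWriter.start(...) running after the abort notification. */
        void startChannelStateWriter(long checkpointId) {
            writer.start(checkpointId);
        }

        /** Checks the aborted set; the proposed change is to abort the writer here as well. */
        CompletableFuture<String> checkpointState(long checkpointId) {
            if (abortedCheckpointIds.remove(checkpointId) && abortWriterOnLateStart) {
                writer.abort(checkpointId, new RuntimeException("checkpoint aborted"));
            }
            return writer.getWriteResult(checkpointId);
        }
    }

    /** Runs abort -> start -> checkpointState and reports whether the result failed. */
    static boolean resultFailsAfterLateStart(boolean abortWriterOnLateStart) {
        ToyCoordinator coordinator = new ToyCoordinator(abortWriterOnLateStart);
        coordinator.notifyCheckpointAborted(1L);
        coordinator.startChannelStateWriter(1L);
        return coordinator.checkpointState(1L).isCompletedExceptionally();
    }

    public static void main(String[] args) {
        System.out.println("without second abort, result failed: " + resultFailsAfterLateStart(false));
        System.out.println("with second abort, result failed: " + resultFailsAfterLateStart(true));
    }
}
```

In this model, the result future stays pending when the second abort is skipped, which corresponds to the situation where anything waiting on the write result never returns. With the extra abort in checkpointState, the future completes exceptionally.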