[ https://issues.apache.org/jira/browse/FLINK-17869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132873#comment-17132873 ]
Zhijiang commented on FLINK-17869:
----------------------------------

Merged in release-1.11: 7bb3ffa91a9916348d2f0a6a2e6cba4b109be56e, d8069249703bbe7858e0c6a044deb54ce75e3989

Merged in master: 64ff6765036dd00761f79d9e206f6128c5bad671, 91df1a5fd0f4937a852a82a6139a1a0ed28165e0

> Fix the race condition of aborting unaligned checkpoint
> -------------------------------------------------------
>
>                 Key: FLINK-17869
>                 URL: https://issues.apache.org/jira/browse/FLINK-17869
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>            Reporter: Zhijiang
>            Assignee: Roman Khachatryan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.11.0, 1.12.0
>
>
> On the ChannelStateWriter side, the lifecycle of a checkpoint should be as follows:
> start -> in progress/abort -> stop.
> The ChannelStateWriteResult is created during #start and removed by #abort
> or #stop. There are some potential race conditions here:
> * #start is called by the netty thread when it receives the first barrier,
> and it schedules the checkpoint for execution
> * The task thread might process the checkpoint cancellation and call #abort
> before performing that scheduled checkpoint
> * The checkpoint can still be executed by the task thread afterwards, even
> though the abort happened earlier, because we cannot remove the
> checkpoint action from the mailbox while aborting.
> * While the checkpoint is executing, it calls
> `ChannelStateWriter#getWriteResult`, which throws
> `IllegalStateException` because the respective result was already removed
> when #abort was handled.
> * This therefore causes an unnecessary task failure while performing the
> checkpoint
> I guess we do not want to fail the task when one checkpoint is aborted by
> design. And the illegal state check in ChannelStateWriter#getWriteResult
> was mainly proposed for validating the normal process, I guess.
> If we do not remove the ChannelStateWriteResult while handling #abort and
> rely on #stop to remove it, then there might be another scenario in which
> the checkpoint is never performed after #start (we have another
> mechanism to exit the triggering checkpoint in advance if the abort is sent
> by CheckpointCoordinator), and the stale ChannelStateWriteResult would be
> retained inside ChannelStateWriter for a long time.
> Maybe a potential option to fix this issue is to let
> SubtaskCheckpointCoordinatorImpl handle the exception from
> ChannelStateWriter#getWriteResult properly so that the task does not fail
> in the aborted case.
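
The sequence described in the issue (start, then abort, then a late checkpoint execution) and the option suggested in its last paragraph can be sketched in Java as below. This is only an illustration under stated assumptions, not the merged fix: SketchChannelStateWriter and SketchSubtaskCheckpointCoordinator are hypothetical, heavily simplified stand-ins for ChannelStateWriter and SubtaskCheckpointCoordinatorImpl, the String value is a placeholder for ChannelStateWriteResult, and the abortedCheckpointIds set is an assumed way of recognizing the aborted case.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-in for ChannelStateWriter (not Flink's real class).
class SketchChannelStateWriter {
    // checkpointId -> placeholder for ChannelStateWriteResult
    private final Map<Long, String> results = new ConcurrentHashMap<>();

    // #start: registers the write result when the first barrier arrives.
    void start(long checkpointId) {
        results.put(checkpointId, "write-result-" + checkpointId);
    }

    // #abort: removes the result so it is not retained if the checkpoint never runs.
    void abort(long checkpointId) {
        results.remove(checkpointId);
    }

    // #stop: normal end of the lifecycle, also removes the result.
    void stop(long checkpointId) {
        results.remove(checkpointId);
    }

    // Strict lookup: fails if the result is gone, as described in the issue.
    String getWriteResult(long checkpointId) {
        String result = results.get(checkpointId);
        if (result == null) {
            throw new IllegalStateException("No write result for checkpoint " + checkpointId);
        }
        return result;
    }
}

// Hypothetical stand-in for SubtaskCheckpointCoordinatorImpl.
class SketchSubtaskCheckpointCoordinator {
    private final SketchChannelStateWriter writer;
    // Assumption: the coordinator remembers which checkpoints were aborted.
    private final Set<Long> abortedCheckpointIds = ConcurrentHashMap.newKeySet();

    SketchSubtaskCheckpointCoordinator(SketchChannelStateWriter writer) {
        this.writer = writer;
    }

    // In the issue this is triggered by the netty thread.
    void onFirstBarrier(long checkpointId) {
        writer.start(checkpointId);
    }

    // In the issue this is triggered by the task thread handling a cancel.
    void abortCheckpoint(long checkpointId) {
        abortedCheckpointIds.add(checkpointId);
        writer.abort(checkpointId);
    }

    // The mailbox action that may still run after the abort.
    // Returns Optional.empty() when the checkpoint was aborted instead of failing the task.
    Optional<String> performCheckpoint(long checkpointId) {
        try {
            String result = writer.getWriteResult(checkpointId);
            writer.stop(checkpointId);
            return Optional.of(result);
        } catch (IllegalStateException e) {
            if (abortedCheckpointIds.remove(checkpointId)) {
                return Optional.empty(); // aborted by design: skip quietly
            }
            throw e; // not aborted: keep the validation failure
        }
    }
}

class AbortRaceSketch {
    public static void main(String[] args) {
        SketchChannelStateWriter writer = new SketchChannelStateWriter();
        SketchSubtaskCheckpointCoordinator coordinator =
                new SketchSubtaskCheckpointCoordinator(writer);
        coordinator.onFirstBarrier(1L);   // #start
        coordinator.abortCheckpoint(1L);  // #abort before the checkpoint runs
        boolean skipped = !coordinator.performCheckpoint(1L).isPresent();
        System.out.println("checkpoint 1 skipped: " + skipped);
    }
}
```

In this sketch #abort still removes the result, so nothing is retained when the checkpoint never runs, and the IllegalStateException check still fires for checkpoints that were never aborted.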