[ https://issues.apache.org/jira/browse/FLINK-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhijiang updated FLINK-18063:
-----------------------------
    Description:
There are three abort scenarios that can hit a race condition:
1. CheckpointBarrierUnaligner#processCancellationBarrier
2. CheckpointBarrierUnaligner#processEndOfPartition
3. AlternatingCheckpointBarrierHandler#processBarrier

The abort logic currently only considers a pending checkpoint that was triggered by #processBarrier from the task thread. However, the checkpoint may also have been triggered by #notifyBarrierReceived from the netty thread, so that case must be handled as well in order to abort the checkpoint properly.

  was:
In CheckpointBarrierUnaligner#processEndOfPartition, the current checkpoint is aborted only if the task thread's processing sees a pending checkpoint, so it misses the scenario in which the checkpoint was triggered by notifyBarrierReceived from the netty thread. The proper fix should also check for the pending checkpoint inside ThreadSafeUnaligner in order to abort it and reset the internal variables in that case.

> Fix the race condition for aborting current checkpoint in CheckpointBarrierUnaligner#processEndOfPartition
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-18063
>                 URL: https://issues.apache.org/jira/browse/FLINK-18063
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.0
>            Reporter: Zhijiang
>            Assignee: Zhijiang
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.11.0, 1.12.0
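The race described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Flink's actual code: the class `SharedCheckpointState` and its methods `tryTrigger`, `abortIfPending`, and `hasPending` are hypothetical names standing in for the state that the issue says should be checked inside ThreadSafeUnaligner. The idea is that both the task thread and the netty thread register a triggered checkpoint in one lock-protected place, and every abort path consults that same place instead of task-thread-local state.

```java
/**
 * Hypothetical stand-in for the shared, thread-safe checkpoint state.
 * All names here are assumptions made for illustration only.
 */
final class SharedCheckpointState {
    private static final long NONE = -1L;

    // Guarded by "this": written by both the task thread and the netty thread.
    private long pendingCheckpointId = NONE;
    private long lastAbortedId = NONE;

    /**
     * Called by whichever thread sees the first barrier of a checkpoint
     * (task thread or netty thread). Returns true only for the call that
     * actually starts the checkpoint.
     */
    synchronized boolean tryTrigger(long checkpointId) {
        if (checkpointId <= lastAbortedId || checkpointId <= pendingCheckpointId) {
            return false; // already aborted, or already triggered by the other thread
        }
        pendingCheckpointId = checkpointId;
        return true;
    }

    /**
     * Called from any abort path (cancellation barrier, end of partition,
     * handler switch). Checks the shared state, so a checkpoint triggered
     * by the other thread is aborted too, and resets the internal variable.
     */
    synchronized boolean abortIfPending(long checkpointId) {
        lastAbortedId = Math.max(lastAbortedId, checkpointId);
        if (pendingCheckpointId != NONE && pendingCheckpointId <= checkpointId) {
            pendingCheckpointId = NONE;
            return true;
        }
        return false;
    }

    synchronized boolean hasPending() {
        return pendingCheckpointId != NONE;
    }
}
```

In this sketch, an abort issued from the task thread still finds a checkpoint that was registered through the netty-thread path, because both paths go through the same synchronized object. Recording the aborted ID also prevents a late barrier from re-triggering a checkpoint that was already aborted.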