[ https://issues.apache.org/jira/browse/FLINK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhijiang updated FLINK-17994:
-----------------------------
    Fix Version/s: 1.12.0

> Fix the race condition between CheckpointBarrierUnaligner#processBarrier and #notifyBarrierReceived
> ---------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-17994
>                 URL: https://issues.apache.org/jira/browse/FLINK-17994
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>            Reporter: Zhijiang
>            Assignee: Zhijiang
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.11.0, 1.12.0
>
>
> The race condition happens as follows:
>  * ch1 is received from the network for one input channel by the netty thread, which schedules ch1 into the mailbox via #notifyBarrierReceived.
>  * ch2 is received from the network for another input channel by the netty thread, but the barrier is inserted into the channel's data queue before #notifyBarrierReceived is called. The task thread can therefore process ch2 before the netty thread calls #notifyBarrierReceived.
>  * The task thread executes the checkpoint for ch2 immediately because ch2 > ch1.
>  * After that, the previously scheduled ch1 is run from the mailbox by the task thread. This causes an IllegalArgumentException inside SubtaskCheckpointCoordinatorImpl#checkpointState, because it breaks the initial assumption that checkpoints are executed in increasing id order in aligned mode.
> The key problem is that we currently cannot remove a checkpoint action from the mailbox queue before the next checkpoint executes. One possible solution is to record the previously aborted checkpoint id in SubtaskCheckpointCoordinatorImpl#abortedCheckpointIds; then, when the queued checkpoint in the mailbox executes, it exits directly if it finds that its checkpoint id was already aborted.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
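
For illustration only, the interleaving and the proposed guard can be replayed on a single thread. This is a minimal sketch, not Flink's code: the class CheckpointGuardSketch, its fields, and the runMailbox and demo helpers are hypothetical, and only the names notifyBarrierReceived, processBarrier, checkpointState and abortedCheckpointIds are borrowed from the issue text. It assumes that a checkpoint which is still queued when a newer barrier is processed counts as aborted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Hypothetical single-threaded model of the task thread; not Flink's real API.
class CheckpointGuardSketch {
    private final Queue<Runnable> mailbox = new ArrayDeque<>();
    private final Set<Long> scheduledIds = new HashSet<>();
    private final Set<Long> abortedCheckpointIds = new HashSet<>();
    private final List<Long> executed = new ArrayList<>();
    private long lastCheckpointId = -1;

    // Netty-thread path from the issue: queue the checkpoint as a mailbox action.
    void notifyBarrierReceived(long checkpointId) {
        scheduledIds.add(checkpointId);
        mailbox.add(() -> checkpointState(checkpointId));
    }

    // Task-thread path: the barrier was read directly from the channel's data queue.
    void processBarrier(long checkpointId) {
        // Assumption: every still-queued older checkpoint is recorded as aborted.
        for (long pending : scheduledIds) {
            if (pending < checkpointId) {
                abortedCheckpointIds.add(pending);
            }
        }
        checkpointState(checkpointId);
    }

    private void checkpointState(long checkpointId) {
        scheduledIds.remove(checkpointId);
        // The guard: a queued action for an aborted checkpoint exits directly.
        if (abortedCheckpointIds.remove(checkpointId)) {
            return;
        }
        // Without the guard, the stale ch1 action would fail here.
        if (checkpointId <= lastCheckpointId) {
            throw new IllegalArgumentException(
                "Checkpoint " + checkpointId + " is not newer than " + lastCheckpointId);
        }
        lastCheckpointId = checkpointId;
        executed.add(checkpointId);
    }

    // Drain the mailbox, as the task thread eventually would.
    void runMailbox() {
        Runnable action;
        while ((action = mailbox.poll()) != null) {
            action.run();
        }
    }

    // Replays the order of events described in the issue.
    static List<Long> demo() {
        CheckpointGuardSketch task = new CheckpointGuardSketch();
        task.notifyBarrierReceived(1); // ch1 is queued in the mailbox
        task.processBarrier(2);        // ch2 overtakes ch1 via the data queue
        task.runMailbox();             // the stale ch1 action is skipped
        return task.executed;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this model only checkpoint 2 is executed. Removing the abortedCheckpointIds check makes the queued ch1 action hit the IllegalArgumentException, which mirrors the failure described above.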