[ 
https://issues.apache.org/jira/browse/FLINK-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18063:
-----------------------------
    Description: 
There are three abort scenarios that might encounter a race condition:

    1. CheckpointBarrierUnaligner#processCancellationBarrier

    2. CheckpointBarrierUnaligner#processEndOfPartition

    3. AlternatingCheckpointBarrierHandler#processBarrier

Currently, we only abort the pending checkpoint if it was triggered by 
#processBarrier from the task thread. However, the checkpoint might also be 
triggered concurrently by #notifyBarrierReceived from the netty thread, so we 
should handle that case and abort it properly as well.

  was:
When handling CheckpointBarrierUnaligner#processEndOfPartition, the current 
checkpoint is aborted only if the task thread itself sees a pending 
checkpoint, so it misses the scenario in which the checkpoint was triggered 
by notifyBarrierReceived from the netty thread.

The proper fix should also check for a pending checkpoint inside 
ThreadSafeUnaligner in order to abort it and reset the internal variables.


> Fix the race condition for aborting current checkpoint in 
> CheckpointBarrierUnaligner#processEndOfPartition
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-18063
>                 URL: https://issues.apache.org/jira/browse/FLINK-18063
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.0
>            Reporter: Zhijiang
>            Assignee: Zhijiang
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.11.0, 1.12.0
>
>
> There are three abort scenarios that might encounter a race condition:
>     1. CheckpointBarrierUnaligner#processCancellationBarrier
>     2. CheckpointBarrierUnaligner#processEndOfPartition
>     3. AlternatingCheckpointBarrierHandler#processBarrier
> Currently, we only abort the pending checkpoint if it was triggered by 
> #processBarrier from the task thread. However, the checkpoint might also be 
> triggered concurrently by #notifyBarrierReceived from the netty thread, so we 
> should handle that case and abort it properly as well.
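
To illustrate the race described above, here is a minimal sketch in Java. It 
is a simplified model and not Flink's actual code: the classes 
UnalignerAbortSketch, ThreadSafeState and Handler, and the methods 
resetPending, processEndOfPartitionRacy and processEndOfPartitionFixed, are 
hypothetical names made up for this example. It assumes the pending 
checkpoint can be represented by a single id, and it joins the stand-in netty 
thread so that the interleaving is deterministic.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model; none of these classes are Flink's own.
class UnalignerAbortSketch {

    /** Checkpoint state that the netty thread may update. */
    static final class ThreadSafeState {
        private long pendingCheckpointId = -1L; // -1 means nothing is pending

        /** Called from the netty thread when a barrier arrives early. */
        synchronized void notifyBarrierReceived(long checkpointId) {
            if (checkpointId > pendingCheckpointId) {
                pendingCheckpointId = checkpointId;
            }
        }

        /** Atomically clears the pending checkpoint and returns its id, or -1. */
        synchronized long resetPending() {
            long id = pendingCheckpointId;
            pendingCheckpointId = -1L;
            return id;
        }
    }

    /** Task-thread side of the barrier handling. */
    static final class Handler {
        final ThreadSafeState shared = new ThreadSafeState();
        final List<Long> abortedCheckpoints = new ArrayList<>();
        private long taskThreadPendingId = -1L;

        /** Task thread: a barrier was taken from the input queue. */
        void processBarrier(long checkpointId) {
            taskThreadPendingId = Math.max(taskThreadPendingId, checkpointId);
            shared.notifyBarrierReceived(checkpointId);
        }

        /** Racy variant: only consults what the task thread has seen. */
        void processEndOfPartitionRacy() {
            if (taskThreadPendingId != -1L) {
                abortedCheckpoints.add(taskThreadPendingId);
                taskThreadPendingId = -1L;
            }
        }

        /** Fixed variant: also checks and resets the shared state. */
        void processEndOfPartitionFixed() {
            long id = Math.max(taskThreadPendingId, shared.resetPending());
            if (id != -1L) {
                abortedCheckpoints.add(id);
                taskThreadPendingId = -1L;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Handler racy = new Handler();
        Handler fixed = new Handler();
        // A separate thread stands in for the netty thread triggering checkpoint 7.
        Thread netty = new Thread(() -> {
            racy.shared.notifyBarrierReceived(7L);
            fixed.shared.notifyBarrierReceived(7L);
        });
        netty.start();
        netty.join(); // make the interleaving deterministic for the demo

        racy.processEndOfPartitionRacy();
        fixed.processEndOfPartitionFixed();
        System.out.println("racy aborted:  " + racy.abortedCheckpoints);
        System.out.println("fixed aborted: " + fixed.abortedCheckpoints);
    }
}
```

In this model the racy variant aborts nothing, because the task thread never 
saw the barrier itself, while the fixed variant aborts checkpoint 7 and 
resets the shared state. The same check would presumably apply to all three 
scenarios listed in the description.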



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
