[ https://issues.apache.org/jira/browse/FLINK-17869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132873#comment-17132873 ]

Zhijiang commented on FLINK-17869:
----------------------------------

Merged in release-1.11: 7bb3ffa91a9916348d2f0a6a2e6cba4b109be56e, d8069249703bbe7858e0c6a044deb54ce75e3989

Merged in master: 64ff6765036dd00761f79d9e206f6128c5bad671, 91df1a5fd0f4937a852a82a6139a1a0ed28165e0

> Fix the race condition of aborting unaligned checkpoint
> -------------------------------------------------------
>
>                 Key: FLINK-17869
>                 URL: https://issues.apache.org/jira/browse/FLINK-17869
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>            Reporter: Zhijiang
>            Assignee: Roman Khachatryan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.11.0, 1.12.0
>
>
> On the ChannelStateWriter side, the lifecycle of a checkpoint should be as
> follows: start -> in progress/abort -> stop.
> The ChannelStateWriteResult is created during #start and removed by #abort
> or #stop. There are some potential race conditions here:
>  * #start is called by the netty thread when it receives the first barrier,
> and it schedules the checkpoint for execution.
>  * The task thread might process a checkpoint cancellation and call #abort
> before performing the respective checkpoint.
>  * The checkpoint can still be executed by the task thread afterwards, even
> though the abort happened earlier, because we cannot remove the checkpoint
> action from the mailbox while aborting.
>  * While the checkpoint is executing, it calls
> `ChannelStateWriter#getWriteResult`, which throws `IllegalStateException`
> because the respective result was already removed while handling #abort.
>  * As a result, the task fails unnecessarily while performing the
> checkpoint.
> I guess we do not want to fail the task when a checkpoint is aborted by
> design, and the illegal state check in ChannelStateWriter#getWriteResult
> was mainly intended to validate the normal process.
> If we do not remove the ChannelStateWriteResult while handling #abort and
> rely on #stop to remove it, there may be another scenario in which the
> checkpoint is never performed after #start (we have another mechanism to
> exit the triggering checkpoint in advance if the abort is sent by
> CheckpointCoordinator). The stale ChannelStateWriteResult would then be
> retained inside ChannelStateWriter for a long time.
> Maybe one option to fix this issue is to let
> SubtaskCheckpointCoordinatorImpl handle the exception from
> ChannelStateWriter#getWriteResult properly, so that the task does not fail
> in the aborted case.
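
For readers following the issue description above, here is a minimal sketch of the start/abort ordering and of the option it proposes. It is a toy model, not Flink's code and not necessarily what the merged commits do: `AbortRaceSketch`, `ToyWriter` and `performCheckpoint` are hypothetical names, the two threads are replaced by a fixed sequential call order, and treating a missing result as "aborted" is an assumption.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of the lifecycle in the issue description; not Flink's actual classes. */
public class AbortRaceSketch {

    /** Hypothetical stand-in for ChannelStateWriter: one result per checkpoint id. */
    static final class ToyWriter {
        private final Map<Long, String> results = new ConcurrentHashMap<>();

        /** Creates the result, as #start does in the description. */
        void start(long checkpointId) {
            results.put(checkpointId, "result-" + checkpointId);
        }

        /** Removes the result, as #abort does in the description. */
        void abort(long checkpointId) {
            results.remove(checkpointId);
        }

        /** Removes the result at the end of the normal lifecycle. */
        void stop(long checkpointId) {
            results.remove(checkpointId);
        }

        /** Throws if the result is gone, mirroring the illegal state check. */
        String getWriteResult(long checkpointId) {
            String result = results.get(checkpointId);
            if (result == null) {
                throw new IllegalStateException("no result for checkpoint " + checkpointId);
            }
            return result;
        }
    }

    /**
     * Hypothetical stand-in for the checkpoint action that is already queued in
     * the mailbox and therefore still runs after the abort.
     */
    static Optional<String> performCheckpoint(ToyWriter writer, long checkpointId) {
        try {
            return Optional.of(writer.getWriteResult(checkpointId));
        } catch (IllegalStateException e) {
            // Assumption: a missing result means the checkpoint was aborted,
            // so the action is skipped instead of failing the task.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter();

        writer.start(1L); // "netty thread": first barrier received
        writer.abort(1L); // "task thread": cancellation handled before the checkpoint
        // The queued checkpoint action runs anyway, but no longer fails the task.
        System.out.println(performCheckpoint(writer, 1L).isPresent()); // prints "false"

        writer.start(2L); // normal path: no abort in between
        System.out.println(performCheckpoint(writer, 2L).isPresent()); // prints "true"
        writer.stop(2L);
    }
}
```

Catching `IllegalStateException` this broadly would also hide real lifecycle bugs. A stricter variant could record aborted checkpoint ids explicitly and skip only those.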



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
