[
https://issues.apache.org/jira/browse/FLINK-38408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18023144#comment-18023144
]
Rui Fan commented on FLINK-38408:
---------------------------------
h2. Root Cause Analysis
h3. Problem Location
Log analysis revealed that the checkpoint had actually completed successfully:
{{}}
{code:java}
{code}
{{07:19:37,522 [jobmanager-io-thread-1] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
checkpoint 1 for job b809cf46d67c23697786fd514565c737 (4464 bytes,
checkpointDuration=45 ms, finalizationTime=4 ms). }}
However, the test code could not find the completed checkpoint when calling
{{{}CommonTestUtils.getLatestCompletedCheckpointPath(){}}}.
h3. Root Cause
The problem occurs in the execution order of the
{{CheckpointCoordinator.completePendingCheckpoint()}} method:
[https://github.com/apache/flink/blob/39a46288c7e74d7c5c799b48ef5a42f0c47dcaad/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1389]
{code:java}
pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint);
reportCompletedCheckpoint(completedCheckpoint); {code}
*Checkpoint Coordinator mechanism:*
# {*}A{*}:
{{pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint)}}
completes the completion future first{{{}{}}}
# {*}B{*}: {{reportCompletedCheckpoint(completedCheckpoint)}} updates
checkpoint statistics.
Test code timeline:
# *C:* Detect future completion
# *D:* Call {{getLatestCompletedCheckpointPath() immediately}}
{{Usually, the execution sequence is A -> B -> C -> D, it works well. }}
{{The bug happens if }}{{execution sequence is A }}{{-> C -> D }}{{-> B.}}
h2. Reproduction Method
In the {{completePendingCheckpoint()}} method, inserting {{Thread.sleep(100)}}
between {{complete()}} and {{reportCompletedCheckpoint()}} can reproduce this
issue 100%.
h1. Solution: Adjust the execution order in CheckpointCoordinator
*Changes:*
{{}}
{code:java}
{code}
{{// Update statistics first reportCompletedCheckpoint(completedCheckpoint); }}
{{// Complete the future later
pendingCheckpoint.getCompletionFuture().complete(completedCheckpoint); }}
*Benefits:*
* Fundamentally eliminates race conditions
* Ensures semantic correctness: Waiting parties are notified only when the
checkpoint is fully processed
> MapStateNullValueCheckpointingITCase failed in test_cron_azure tests
> --------------------------------------------------------------------
>
> Key: FLINK-38408
> URL: https://issues.apache.org/jira/browse/FLINK-38408
> Project: Flink
> Issue Type: Bug
> Components: Tests
> Affects Versions: 2.2.0
> Reporter: Ruan Hang
> Assignee: Rui Fan
> Priority: Major
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=69803&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=115e5c38-6efb-5006-4921-5e2851da71ef
--
This message was sent by Atlassian Jira
(v8.20.10#820010)