[
https://issues.apache.org/jira/browse/FLINK-38534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18065646#comment-18065646
]
Mukul Gupta edited comment on FLINK-38534 at 3/13/26 1:14 PM:
--------------------------------------------------------------
Please assign this Jira to me.
Suspected Root Cause: The test triggers a checkpoint immediately after
setAllExecutionsToRunning(). While the state updates are synchronous, they may
trigger async callbacks (checkpoint coordinator, etc.) that need time to
process. In slower CI environments, the checkpoint coordinator might not have
finished processing the state updates, causing the checkpoint trigger to be
rejected and waitForCheckpointInProgress() to time out.
Proposed Fix: Added waitForAllTasksRunning(executionGraph) to ensure all tasks
have fully transitioned to RUNNING state before triggering the checkpoint.
Similar approach to FLINK-39182 (PR #27740) for ExecutionGraphRestartTest.
Verified locally with 300+ iterations without failure. Submitting PR for CI
validation.
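For illustration only, here is a minimal, self-contained sketch of the "wait until every task is RUNNING, then trigger" pattern. It does not use Flink classes: WaitForRunningSketch, State, waitUntil and allRunning are hypothetical names made up for this example, and the real waitForAllTasksRunning(executionGraph) helper may be implemented differently.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BooleanSupplier;

public class WaitForRunningSketch {

    // Hypothetical stand-in for a task's execution state; not Flink's ExecutionState.
    enum State { DEPLOYING, RUNNING }

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition held, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            try {
                Thread.sleep(5); // short back-off between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    /** True only if every tracked task has reached RUNNING. */
    static boolean allRunning(List<State> states) {
        return states.stream().allMatch(s -> s == State.RUNNING);
    }

    public static void main(String[] args) throws Exception {
        List<State> states =
                new CopyOnWriteArrayList<>(List.of(State.DEPLOYING, State.DEPLOYING));

        // Simulates an async callback that finishes the transition slightly later,
        // as might happen on a slow CI machine.
        Thread updater = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            states.replaceAll(s -> State.RUNNING);
        });
        updater.start();

        // Block until all tasks are RUNNING; only then would the test trigger the checkpoint.
        boolean ready = waitUntil(() -> allRunning(states), Duration.ofSeconds(5));
        updater.join();
        System.out.println("all running before checkpoint trigger: " + ready);
    }
}
```

The point of the sketch is that the checkpoint trigger is gated on an observed condition with a bounded timeout rather than on a fixed sleep, which should behave the same on fast and slow machines.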
was (Author: JIRAUSER312410):
I'm working on a fix for this. Please assign this Jira to me.
Suspected Root Cause: The test triggers a checkpoint immediately after
setAllExecutionsToRunning(). While the state updates are synchronous, they may
trigger async callbacks (checkpoint coordinator, etc.) that need time to
process. In slower CI environments, the checkpoint coordinator might not have
finished processing the state updates, causing the checkpoint trigger to be
rejected and waitForCheckpointInProgress() to time out.
Proposed Fix: Added waitForAllTasksRunning(executionGraph) to ensure all tasks
have fully transitioned to RUNNING state before triggering the checkpoint.
Similar approach to FLINK-39182 (PR #27740) for ExecutionGraphRestartTest.
Verified locally with 300+ iterations without failure. Submitting PR for CI
validation.
> LocalRecoveryTest failed in test_cron_azure core
> ------------------------------------------------
>
> Key: FLINK-38534
> URL: https://issues.apache.org/jira/browse/FLINK-38534
> Project: Flink
> Issue Type: Bug
> Components: Tests
> Affects Versions: 2.2.0
> Reporter: Ruan Hang
> Priority: Major
>
> {code:java}
> Feb 27 04:21:50 04:21:50.067 [INFO] Results:
> Feb 27 04:21:50 04:21:50.068 [INFO]
> Feb 27 04:21:50 04:21:50.069 [ERROR] Errors:
> Feb 27 04:21:50 04:21:50.070 [ERROR]
> LocalRecoveryTest.testStateSizeIsConsideredForLocalRecoveryOnRestart:113 »
> Flink Exhausted retry attempts.
> Feb 27 04:21:50 04:21:50.071 [INFO]
> Feb 27 04:21:50 04:21:50.071 [ERROR] Tests run: 109715, Failures: 0, Errors:
> 1, Skipped: 354
> Feb 27 04:21:50 04:21:50.071 [INFO]
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=70334&view=logs&j=77a9d8e1-d610-59b3-fc2a-4766541e0e33&t=25baecb7-cea0-597a-6b01-188b1478210d