[jira] [Comment Edited] (FLINK-38534) LocalRecoveryTest failed in test_cron_azure core

Mukul Gupta (Jira) Fri, 13 Mar 2026 06:15:09 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-38534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18065646#comment-18065646
 ]


Mukul Gupta edited comment on FLINK-38534 at 3/13/26 1:14 PM:
--------------------------------------------------------------

Please assign this Jira issue to me.

Suspected Root Cause: The test triggers a checkpoint immediately after
setAllExecutionsToRunning(). Although the state updates themselves are
synchronous, they may trigger asynchronous callbacks (checkpoint coordinator,
etc.) that need time to complete. In slower CI environments, the checkpoint
coordinator might not have finished processing the state updates, so the
checkpoint trigger is rejected and waitForCheckpointInProgress() times out.

Proposed Fix: Add a waitForAllTasksRunning(executionGraph) call so that all
tasks have fully transitioned to the RUNNING state before the checkpoint is
triggered.
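
For readers unfamiliar with the pattern, here is a minimal, self-contained
sketch of the "poll until every task reports RUNNING, then proceed" idea. It is
not the Flink helper and does not use Flink classes: the State enum, the
waitForAllRunning method, and its timeout and poll-interval parameters are all
hypothetical names chosen for illustration.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class WaitForRunningSketch {

    /** Hypothetical stand-in for a task's state; not Flink's own state enum. */
    enum State { CREATED, DEPLOYING, RUNNING }

    /**
     * Polls the given state suppliers until all of them report RUNNING.
     * Throws TimeoutException if that does not happen within the timeout.
     */
    static void waitForAllRunning(
            List<Supplier<State>> tasks, Duration timeout, Duration pollInterval)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            boolean allRunning =
                    tasks.stream().allMatch(task -> task.get() == State.RUNNING);
            if (allRunning) {
                return;
            }
            // Overflow-safe comparison against the nanoTime-based deadline.
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException("Not all tasks reached RUNNING in " + timeout);
            }
            Thread.sleep(pollInterval.toMillis());
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicReference<State> taskState = new AtomicReference<>(State.DEPLOYING);

        // Simulate an asynchronous callback that completes the transition later.
        Thread asyncTransition = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            taskState.set(State.RUNNING);
        });
        asyncTransition.start();

        Supplier<State> task = taskState::get;
        waitForAllRunning(List.of(task), Duration.ofSeconds(5), Duration.ofMillis(10));

        // Only at this point would the test go on to trigger the checkpoint.
        System.out.println("All tasks RUNNING: " + (taskState.get() == State.RUNNING));
        asyncTransition.join();
    }
}
```

The point of the sketch is only that the wait is bounded and condition-based
rather than a fixed sleep, which is what makes this kind of fix robust on slow
CI machines.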

This follows a similar approach to FLINK-39182 (PR #27740) for
ExecutionGraphRestartTest.

Verified locally with 300+ iterations without failure. Submitting PR for CI 
validation.


was (Author: JIRAUSER312410):
I'm working on a fix for this. Please assign this jira to me.

Suspected Root Cause: The test triggers a checkpoint immediately after
setAllExecutionsToRunning(). While the state updates are synchronous, they may
trigger async callbacks (checkpoint coordinator, etc.) that need time to
process. In slower CI environments, the checkpoint coordinator might not have
finished processing the state updates, causing the checkpoint trigger to be
rejected and waitForCheckpointInProgress() to timeout.

Proposed Fix: Added waitForAllTasksRunning(executionGraph) to ensure all tasks 
have fully transitioned to RUNNING state before triggering the checkpoint.

Similar approach to FLINK-39182 (PR #27740) for ExecutionGraphRestartTest.

Verified locally with 300+ iterations without failure. Submitting PR for CI 
validation.

> LocalRecoveryTest failed in test_cron_azure core
> ------------------------------------------------
>
>                 Key: FLINK-38534
>                 URL: https://issues.apache.org/jira/browse/FLINK-38534
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 2.2.0
>            Reporter: Ruan Hang
>            Priority: Major
>
> {code:java}
> Feb 27 04:21:50 04:21:50.067 [INFO] Results:
> Feb 27 04:21:50 04:21:50.068 [INFO] 
> Feb 27 04:21:50 04:21:50.069 [ERROR] Errors: 
> Feb 27 04:21:50 04:21:50.070 [ERROR]   LocalRecoveryTest.testStateSizeIsConsideredForLocalRecoveryOnRestart:113 » Flink Exhausted retry attempts.
> Feb 27 04:21:50 04:21:50.071 [INFO] 
> Feb 27 04:21:50 04:21:50.071 [ERROR] Tests run: 109715, Failures: 0, Errors: 1, Skipped: 354
> Feb 27 04:21:50 04:21:50.071 [INFO] 
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=70334&view=logs&j=77a9d8e1-d610-59b3-fc2a-4766541e0e33&t=25baecb7-cea0-597a-6b01-188b1478210d



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
