[jira] [Updated] (FLINK-40068) KeyedComplexChainTest.testMigrationAndRestore fails with NoResourceAvailableException when the window apply-assertion crashes the MiniCluster TaskManager

Martijn Visser (Jira) Fri, 03 Jul 2026 13:06:16 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-40068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Martijn Visser updated FLINK-40068:
-----------------------------------
    Labels: test-stability  (was: )

> KeyedComplexChainTest.testMigrationAndRestore fails with 
> NoResourceAvailableException when the window apply-assertion crashes the 
> MiniCluster TaskManager
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-40068
>                 URL: https://issues.apache.org/jira/browse/FLINK-40068
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Tests
>    Affects Versions: 2.4.0
>            Reporter: Martijn Visser
>            Assignee: Martijn Visser
>            Priority: Major
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76627&view=results
>  (leg: test_ci tests, parameterizations 1.17, 1.18 and 1.19); previously 
> builds 76400 and 76448 (test_cron_jdk11_tests).
> {noformat}
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Job 
> ... reached terminal state FAILED while waiting for FINISHED.
>       at 
> org.apache.flink.test.state.operator.restore.AbstractOperatorRestoreTestBase.waitForJobStatus(AbstractOperatorRestoreTestBase.java:271)
>       at 
> org.apache.flink.test.state.operator.restore.AbstractOperatorRestoreTestBase.restoreJob(AbstractOperatorRestoreTestBase.java:236)
> Caused by: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Could not acquire the minimum required resources.
> {noformat}
> The failure is not a slot leak. {{KeyedJob.StatefulWindowFunction.close()}} 
> asserts {{applyCalled}} on every close, including the cancellation path. The 
> GENERATE/MIGRATE jobs are always stopped via cancel-with-savepoint, which 
> does not drain the pipeline, so under CI load a window subtask can be closed 
> before its element has been processed, that is, before {{apply()}} has ever 
> been called. The assertion then throws from {{close()}}, which takes down 
> the only TaskManager of the shared static {{MiniClusterExtension}}:
> {noformat}
> [GlobalWindows -> Map -> Map (2/4)#0] ERROR 
> org.apache.flink.runtime.minicluster.MiniCluster - TaskManager #0 failed.
> java.lang.Exception: org.opentest4j.AssertionFailedError: [Apply was never 
> called.]
>       at 
> org.apache.flink.test.state.operator.restore.keyed.KeyedJob$StatefulWindowFunction.close(KeyedJob.java:217)
> {noformat}
> With NUM_TMS = 1 and no TaskManager restart, the subsequent restore step and 
> every later parameterization in the class see "Registered TMs: 0" and fail 
> with {{NoResourceAvailableException}} after the ~5 min slot-request timeout. 
> This is the "why does the job legitimately reach FAILED" investigation 
> deferred in FLINK-39918.
> There is no data loss: the window's keyed state is restored from the 
> savepoint independently of {{apply()}} running, and the RESTORE job (which 
> runs to completion) re-validates the migrated state end-to-end via the 
> untouched {{apply()}} assertions.
> Proposed fix: restrict the {{applyCalled}} assertion to 
> {{ExecutionMode.RESTORE}}, the only mode whose job runs to completion.
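
The proposed guard could look roughly like the sketch below. It is a simplified, self-contained model and not the actual Flink test code: the class names ApplyAssertionSketch and StatefulWindowFunctionSketch, the constructor, the shouldFailOnClose() helper, and the use of a plain AssertionError are all assumptions made for illustration. Only the three mode names and the assertion message are taken from the report above.

```java
/** Simplified model of the proposed fix; hypothetical names, not the actual Flink test code. */
final class ApplyAssertionSketch {

    /** The three modes named in the report. */
    enum ExecutionMode { GENERATE, MIGRATE, RESTORE }

    /** Hypothetical stand-in for KeyedJob.StatefulWindowFunction. */
    static final class StatefulWindowFunctionSketch {
        private final ExecutionMode mode;
        private boolean applyCalled;

        StatefulWindowFunctionSketch(ExecutionMode mode) {
            this.mode = mode;
        }

        /** Records that the window function processed at least one element. */
        void apply() {
            applyCalled = true;
        }

        /** True only when closing in the current state should fail the test. */
        boolean shouldFailOnClose() {
            // GENERATE and MIGRATE are cancelled with a savepoint and may never
            // reach apply(), so the assertion is checked in RESTORE only.
            return mode == ExecutionMode.RESTORE && !applyCalled;
        }

        void close() {
            if (shouldFailOnClose()) {
                throw new AssertionError("Apply was never called.");
            }
        }
    }

    public static void main(String[] args) {
        // Close each mode's function without ever calling apply().
        for (ExecutionMode mode : ExecutionMode.values()) {
            StatefulWindowFunctionSketch fn = new StatefulWindowFunctionSketch(mode);
            try {
                fn.close();
                System.out.println(mode + ": close() without apply() passes");
            } catch (AssertionError e) {
                System.out.println(mode + ": close() without apply() fails: " + e.getMessage());
            }
        }
    }
}
```

With this guard, a subtask cancelled before its first element closes cleanly in GENERATE and MIGRATE, so the TaskManager survives, while a RESTORE job that never reached apply() still fails the test.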



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
