[
https://issues.apache.org/jira/browse/FLINK-40068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Martijn Visser updated FLINK-40068:
-----------------------------------
Labels: test-stability (was: )
> KeyedComplexChainTest.testMigrationAndRestore fails with
> NoResourceAvailableException when the window apply-assertion crashes the
> MiniCluster TaskManager
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-40068
> URL: https://issues.apache.org/jira/browse/FLINK-40068
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 2.4.0
> Reporter: Martijn Visser
> Assignee: Martijn Visser
> Priority: Major
> Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76627&view=results
> (leg: test_ci tests, parameterizations 1.17, 1.18 and 1.19); previously
> builds 76400 and 76448 (test_cron_jdk11_tests).
> {noformat}
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Job
> ... reached terminal state FAILED while waiting for FINISHED.
> at
> org.apache.flink.test.state.operator.restore.AbstractOperatorRestoreTestBase.waitForJobStatus(AbstractOperatorRestoreTestBase.java:271)
> at
> org.apache.flink.test.state.operator.restore.AbstractOperatorRestoreTestBase.restoreJob(AbstractOperatorRestoreTestBase.java:236)
> Caused by:
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not acquire the minimum required resources.
> {noformat}
> The failure is not a slot leak. {{KeyedJob.StatefulWindowFunction.close()}}
> asserts {{applyCalled}} on every close, including the cancellation path. The
> GENERATE/MIGRATE jobs are always stopped via cancel-with-savepoint, which
> does not drain, so under CI load a window subtask can be closed before its
> element is processed, i.e. before {{apply()}} is ever called. The assertion then throws
> from {{close()}}, which fails the shared static {{MiniClusterExtension}}'s
> only TaskManager:
> {noformat}
> [GlobalWindows -> Map -> Map (2/4)#0] ERROR
> org.apache.flink.runtime.minicluster.MiniCluster - TaskManager #0 failed.
> java.lang.Exception: org.opentest4j.AssertionFailedError: [Apply was never
> called.]
> at
> org.apache.flink.test.state.operator.restore.keyed.KeyedJob$StatefulWindowFunction.close(KeyedJob.java:217)
> {noformat}
> With NUM_TMS = 1 and no TaskManager restart, the subsequent restore step and
> every later parameterization in the class see "Registered TMs: 0" and fail
> with {{NoResourceAvailableException}} after the ~5 min slot-request timeout.
> This is the "why does the job legitimately reach FAILED" investigation
> deferred in FLINK-39918.
> There is no data loss: the window's keyed state is restored from the
> savepoint independently of {{apply()}} running, and the RESTORE job (which
> runs to completion) re-validates the migrated state end-to-end via the
> untouched {{apply()}} assertions.
> Proposed fix: restrict the {{applyCalled}} assertion to
> {{ExecutionMode.RESTORE}}, the only mode whose job runs to completion.
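
The proposed fix can be illustrated with a minimal, Flink-free sketch. It is not the actual KeyedJob code: the class ApplyAssertionSketch, the nested StatefulWindowFunctionSketch, and the use of IllegalStateException in place of the real test assertion are all hypothetical stand-ins. Only the mode names GENERATE, MIGRATE and RESTORE and the "Apply was never called." message come from the report above.

```java
public class ApplyAssertionSketch {

    // The three modes named in the ticket; this enum is a stand-in, not Flink's.
    enum ExecutionMode { GENERATE, MIGRATE, RESTORE }

    // Hypothetical stand-in for KeyedJob.StatefulWindowFunction, no Flink dependencies.
    static class StatefulWindowFunctionSketch {
        private final ExecutionMode mode;
        private boolean applyCalled = false;

        StatefulWindowFunctionSketch(ExecutionMode mode) {
            this.mode = mode;
        }

        // Records that the window function processed at least one element.
        void apply() {
            applyCalled = true;
        }

        // Only a RESTORE job runs to completion, so only there is a missing
        // apply() call a real failure. GENERATE/MIGRATE jobs are cancelled
        // with a savepoint and may legitimately close before apply() runs.
        void close() {
            if (mode == ExecutionMode.RESTORE && !applyCalled) {
                throw new IllegalStateException("Apply was never called.");
            }
        }
    }

    public static void main(String[] args) {
        // Cancelled MIGRATE subtask that never saw its element: close() is a no-op.
        new StatefulWindowFunctionSketch(ExecutionMode.MIGRATE).close();

        // RESTORE subtask that processed its element: close() passes.
        StatefulWindowFunctionSketch restored =
                new StatefulWindowFunctionSketch(ExecutionMode.RESTORE);
        restored.apply();
        restored.close();

        // RESTORE subtask without apply(): the check still fires.
        try {
            new StatefulWindowFunctionSketch(ExecutionMode.RESTORE).close();
        } catch (IllegalStateException e) {
            System.out.println("RESTORE check fired: " + e.getMessage());
        }
    }
}
```

With the check gated on the mode, a close() reached on the cancellation path no longer throws, so under this sketch's assumptions it could not take down the shared TaskManager, while the RESTORE run keeps the full check.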