Martijn Visser created FLINK-40068:
--------------------------------------
Summary: KeyedComplexChainTest.testMigrationAndRestore fails with
NoResourceAvailableException when the window apply-assertion crashes the
MiniCluster TaskManager
Key: FLINK-40068
URL: https://issues.apache.org/jira/browse/FLINK-40068
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination, Tests
Affects Versions: 2.4.0
Reporter: Martijn Visser
Assignee: Martijn Visser
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76627&view=results
(leg: test_ci tests, parameterizations 1.17, 1.18 and 1.19); previously builds
76400 and 76448 (test_cron_jdk11_tests).
{noformat}
java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Job
... reached terminal state FAILED while waiting for FINISHED.
at
org.apache.flink.test.state.operator.restore.AbstractOperatorRestoreTestBase.waitForJobStatus(AbstractOperatorRestoreTestBase.java:271)
at
org.apache.flink.test.state.operator.restore.AbstractOperatorRestoreTestBase.restoreJob(AbstractOperatorRestoreTestBase.java:236)
Caused by:
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Could not acquire the minimum required resources.
{noformat}
The failure is not a slot leak. {{KeyedJob.StatefulWindowFunction.close()}}
asserts {{applyCalled}} on every close, including the cancellation path. The
GENERATE/MIGRATE jobs are always stopped via cancel-with-savepoint, which does
not drain, so under CI load a window subtask can be closed before its element
is processed, that is, before {{apply()}} has ever been called. The assertion
then throws from {{close()}}, which fails the only TaskManager of the shared
static {{MiniClusterExtension}}:
{noformat}
[GlobalWindows -> Map -> Map (2/4)#0] ERROR
org.apache.flink.runtime.minicluster.MiniCluster - TaskManager #0 failed.
java.lang.Exception: org.opentest4j.AssertionFailedError: [Apply was never
called.]
at
org.apache.flink.test.state.operator.restore.keyed.KeyedJob$StatefulWindowFunction.close(KeyedJob.java:217)
{noformat}
With NUM_TMS = 1 and no TaskManager restart, the subsequent restore step and
every later parameterization in the class see "Registered TMs: 0" and fail with
{{NoResourceAvailableException}} after the ~5 min slot-request timeout. This
answers the "why does the job legitimately reach FAILED" question whose
investigation was deferred in FLINK-39918.
There is no data loss: the window's keyed state is restored from the savepoint
independently of {{apply()}} running, and the RESTORE job (which runs to
completion) re-validates the migrated state end-to-end via the untouched
{{apply()}} assertions.
Proposed fix: restrict the {{applyCalled}} assertion to
{{ExecutionMode.RESTORE}}, the only mode whose job runs to completion.
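A minimal, self-contained sketch of the proposed guard follows. It is not the actual {{KeyedJob}} code: {{ApplyAssertionSketch}} and its nested {{WindowFunction}} are hypothetical stand-ins, the enum only reuses the mode names mentioned above, and a plain {{AssertionError}} replaces the test's real assertion library.

```java
/** Self-contained sketch of the proposed guard; not the actual KeyedJob code. */
final class ApplyAssertionSketch {

    /** Mode names come from the issue text; the enum itself is an assumed stand-in. */
    enum ExecutionMode { GENERATE, MIGRATE, RESTORE }

    /** Hypothetical stand-in for KeyedJob.StatefulWindowFunction. */
    static final class WindowFunction {
        private final ExecutionMode mode;
        private boolean applyCalled;

        WindowFunction(ExecutionMode mode) {
            this.mode = mode;
        }

        /** Records that the window function processed at least one element. */
        void apply() {
            applyCalled = true;
        }

        /**
         * Checks applyCalled only in RESTORE mode. GENERATE and MIGRATE jobs are
         * cancelled, so closing them before apply() ran is not treated as an error.
         */
        void close() {
            if (mode == ExecutionMode.RESTORE && !applyCalled) {
                throw new AssertionError("Apply was never called.");
            }
        }
    }

    public static void main(String[] args) {
        // Subtask cancelled before its element arrived: close() must not throw.
        new WindowFunction(ExecutionMode.MIGRATE).close();

        // RESTORE job that processed its element: close() passes.
        WindowFunction restored = new WindowFunction(ExecutionMode.RESTORE);
        restored.apply();
        restored.close();

        System.out.println("close() completed in both scenarios");
    }
}
```

With this shape, a GENERATE or MIGRATE subtask that is closed early no longer throws from {{close()}}, while a RESTORE job that never reaches {{apply()}} still fails loudly.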
--
This message was sent by Atlassian Jira
(v8.20.10#820010)