Martijn Visser created FLINK-40068:
--------------------------------------

             Summary: KeyedComplexChainTest.testMigrationAndRestore fails with 
NoResourceAvailableException when the window apply-assertion crashes the 
MiniCluster TaskManager
                 Key: FLINK-40068
                 URL: https://issues.apache.org/jira/browse/FLINK-40068
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination, Tests
    Affects Versions: 2.4.0
            Reporter: Martijn Visser
            Assignee: Martijn Visser


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76627&view=results
 (leg: test_ci tests, parameterizations 1.17, 1.18 and 1.19); previously seen in 
builds 76400 and 76448 (test_cron_jdk11_tests).

{noformat}
java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Job 
... reached terminal state FAILED while waiting for FINISHED.
      at 
org.apache.flink.test.state.operator.restore.AbstractOperatorRestoreTestBase.waitForJobStatus(AbstractOperatorRestoreTestBase.java:271)
      at 
org.apache.flink.test.state.operator.restore.AbstractOperatorRestoreTestBase.restoreJob(AbstractOperatorRestoreTestBase.java:236)
Caused by: 
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Could not acquire the minimum required resources.
{noformat}

The failure is not a slot leak. {{KeyedJob.StatefulWindowFunction.close()}} 
asserts {{applyCalled}} on every close, including the cancellation path. The 
GENERATE/MIGRATE jobs are always stopped via cancel-with-savepoint, which does 
not drain, so under CI load a window subtask can be closed before its element 
is processed, i.e. before {{apply()}} has ever been called. The assertion then 
throws from {{close()}}, which fails the only TaskManager of the shared static 
{{MiniClusterExtension}}:

{noformat}
[GlobalWindows -> Map -> Map (2/4)#0] ERROR 
org.apache.flink.runtime.minicluster.MiniCluster - TaskManager #0 failed.
java.lang.Exception: org.opentest4j.AssertionFailedError: [Apply was never 
called.]
      at 
org.apache.flink.test.state.operator.restore.keyed.KeyedJob$StatefulWindowFunction.close(KeyedJob.java:217)
{noformat}

With NUM_TMS = 1 and no TaskManager restart, the subsequent restore step and 
every later parameterization in the class see "Registered TMs: 0" and fail with 
{{NoResourceAvailableException}} after the ~5 min slot-request timeout. This is 
the "why does the job legitimately reach FAILED" investigation deferred in 
FLINK-39918.

There is no data loss: the window's keyed state is restored from the savepoint 
independently of {{apply()}} running, and the RESTORE job (which runs to 
completion) re-validates the migrated state end-to-end via the untouched 
{{apply()}} assertions.

Proposed fix: restrict the {{applyCalled}} assertion to 
{{ExecutionMode.RESTORE}}, the only mode whose job runs to completion.
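
A minimal sketch of the proposed gating, not the actual patch. It uses no Flink classes: {{WindowFunctionStub}} is a hypothetical stand-in for {{KeyedJob.StatefulWindowFunction}}, the {{ExecutionMode}} enum is redeclared locally with the three modes named in this report, and {{closeThrowsWithoutApply}} is a helper invented for illustration.

```java
final class CloseAssertionSketch {

    /** Local stand-in for the test's execution modes named in this report. */
    enum ExecutionMode { GENERATE, MIGRATE, RESTORE }

    /** Hypothetical stand-in for KeyedJob.StatefulWindowFunction; not the Flink class. */
    static final class WindowFunctionStub {
        private final ExecutionMode mode;
        private boolean applyCalled;

        WindowFunctionStub(ExecutionMode mode) {
            this.mode = mode;
        }

        void apply() {
            applyCalled = true;
        }

        void close() {
            // GENERATE and MIGRATE are cancelled with a savepoint and may close
            // before apply() runs, so the check is limited to RESTORE, the only
            // mode whose job runs to completion.
            if (mode == ExecutionMode.RESTORE && !applyCalled) {
                throw new AssertionError("Apply was never called.");
            }
        }
    }

    /** Illustrative helper: true if close() throws when apply() was never called. */
    static boolean closeThrowsWithoutApply(ExecutionMode mode) {
        try {
            new WindowFunctionStub(mode).close();
            return false;
        } catch (AssertionError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        for (ExecutionMode mode : ExecutionMode.values()) {
            System.out.println(mode + ": close() throws without apply() = "
                    + closeThrowsWithoutApply(mode));
        }
    }
}
```

Under these assumptions, a cancelled GENERATE or MIGRATE subtask closes quietly even if {{apply()}} never ran, while a RESTORE subtask that never saw {{apply()}} still fails the test.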



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
