Martijn Visser created FLINK-39918:
--------------------------------------

             Summary: KeyedComplexChainTest hangs until the CI watchdog kills 
the fork: AbstractOperatorRestoreTestBase waits ~2.7h for a job status that can 
never arrive
                 Key: FLINK-39918
                 URL: https://issues.apache.org/jira/browse/FLINK-39918
             Project: Flink
          Issue Type: Bug
          Components: Tests
    Affects Versions: 2.4.0
            Reporter: Martijn Visser
            Assignee: Martijn Visser


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75865&view=results
 (leg: test_ci tests)

{code}
  04:23:42 Process produced no output for 900 seconds.
{code}

{{org.apache.flink.test.state.operator.restore.keyed.KeyedComplexChainTest}} 
started but never completed; the watchdog killed the surefire fork (exit code 
143) after 900 s of silence. The surefire dump shows the test thread blocked at 
{{AbstractOperatorRestoreTestBase.restoreJob:257}} on the 1.10 savepoint, with 
no task threads alive.

Root cause: {{migrateJob}} and {{restoreJob}} each wait for one specific 
{{JobStatus}} (RUNNING then CANCELED, and FINISHED, respectively) via 
{{retrySuccessfulWithDelay}} against {{TEST_TIMEOUT = 
Duration.ofSeconds(10000L)}} (~2.7 hours). If the job reaches a *different* 
globally terminal state (e.g. FAILED), the predicate never matches and the wait 
keeps spinning far beyond the 900 s CI watchdog limit. The watchdog then kills 
the entire fork, which hides both the offending test and the actual job failure.
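
The failure mode can be reproduced in isolation. The sketch below is not the 
Flink test code: {{Status}}, {{ExactMatchWait}} and {{waitForExact}} are 
hypothetical names, and the timeout is shortened to 300 ms. It only illustrates 
that an exact-match wait uses up its whole timeout when the observed state is 
already a different terminal state.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical stand-in for Flink's JobStatus; only the states needed here.
enum Status { RUNNING, FINISHED, CANCELED, FAILED }

final class ExactMatchWait {

    /** Polls until the status equals the target or the timeout elapses; true on a match. */
    static boolean waitForExact(Supplier<Status> current, Status target, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() < deadline) {
            if (current.get() == target) {
                return true;
            }
            // A FAILED job lands here on every iteration: nothing checks whether
            // the observed state is already terminal, so the loop keeps waiting.
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // The job is stuck in FAILED, but the caller only accepts FINISHED.
        boolean matched =
                waitForExact(() -> Status.FAILED, Status.FINISHED, Duration.ofMillis(300));
        long elapsedMs = Duration.ofNanos(System.nanoTime() - start).toMillis();
        System.out.println("matched=" + matched + ", elapsedMs=" + elapsedMs);
    }
}
```

With the real 10000 s timeout, the same loop would outlive the 900 s watchdog 
by a wide margin.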

Historic hang tickets for this test (FLINK-18138, FLINK-12916) are long closed 
and unrelated.

Proposed fix (following the pattern of FLINK-39879): add a {{waitForJobStatus}} 
helper that fails fast when the job reaches a globally terminal state other 
than the target (surfacing the unexpected state), reduce {{TEST_TIMEOUT}} to 
5 minutes, and put {{@Timeout(10, MINUTES)}} on the test template as a hard 
anti-hang guard. This converts the fork-killing hang into a localized, 
diagnosable failure. Whether the job legitimately reaches FAILED in these 
restore scenarios may warrant a separate runtime investigation once such a 
failure is captured.
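
A minimal sketch of the fail-fast idea, under stated assumptions: {{JobState}} 
and {{JobStatusWaiter}} are hypothetical stand-ins so the example is 
self-contained, and the signature shown for {{waitForJobStatus}} is a guess, not 
the one from the actual patch. A real helper would presumably query the job 
status from the cluster and use Flink's own {{JobStatus}}.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical stand-in for Flink's JobStatus, reduced to what the sketch needs.
enum JobState {
    RUNNING(false), FINISHED(true), CANCELED(true), FAILED(true);

    private final boolean globallyTerminal;

    JobState(boolean globallyTerminal) {
        this.globallyTerminal = globallyTerminal;
    }

    boolean isGloballyTerminal() {
        return globallyTerminal;
    }
}

final class JobStatusWaiter {

    /**
     * Returns once the job is in the target state. Throws as soon as the job is in any
     * other globally terminal state, or when the timeout elapses (assumed behaviour).
     */
    static void waitForJobStatus(Supplier<JobState> current, JobState target, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            JobState state = current.get();
            if (state == target) {
                return;
            }
            if (state.isGloballyTerminal()) {
                // Fail fast and name the unexpected state instead of waiting out the timeout.
                throw new IllegalStateException(
                        "Job reached " + state + " while waiting for " + target);
            }
            if (System.nanoTime() >= deadline) {
                throw new IllegalStateException(
                        "Timed out after " + timeout + " waiting for " + target
                                + "; last state was " + state);
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for " + target, e);
            }
        }
    }
}
```

Called as {{waitForJobStatus(() -> JobState.FAILED, JobState.FINISHED, 
Duration.ofMinutes(5))}}, this sketch throws on the first poll with a message 
naming FAILED, so the unexpected state shows up in the test report rather than 
in a watchdog kill.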



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
