Martijn Visser created FLINK-39918:
--------------------------------------
Summary: KeyedComplexChainTest hangs until the CI watchdog kills
the fork: AbstractOperatorRestoreTestBase waits ~2.7h for a job status that can
never arrive
Key: FLINK-39918
URL: https://issues.apache.org/jira/browse/FLINK-39918
Project: Flink
Issue Type: Bug
Components: Tests
Affects Versions: 2.4.0
Reporter: Martijn Visser
Assignee: Martijn Visser
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75865&view=results
(leg: test_ci tests)
{code}
04:23:42 Process produced no output for 900 seconds.
{code}
{{org.apache.flink.test.state.operator.restore.keyed.KeyedComplexChainTest}}
started but never completed; the watchdog killed the surefire fork (exit code
143) after 900 s of silence. The surefire dump shows the test thread blocked at
{{AbstractOperatorRestoreTestBase.restoreJob:257}} on the 1.10 savepoint, with
no task threads alive.
Root cause: {{migrateJob}} and {{restoreJob}} each wait for one specific
{{JobStatus}} at a time (RUNNING, then CANCELED, in {{migrateJob}}; FINISHED in
{{restoreJob}}) via {{retrySuccessfulWithDelay}} against {{TEST_TIMEOUT =
Duration.ofSeconds(10000L)}} (~2.7 hours). If the job reaches a *different*
globally terminal state (e.g. FAILED), the predicate never matches and the wait
keeps polling far past the 900 s CI watchdog limit. The watchdog then kills the
entire fork, which hides both the offending test and the actual job failure.
Historic hang tickets for this test (FLINK-18138, FLINK-12916) are long closed
and unrelated.
Proposed fix (pattern of FLINK-39879): a {{waitForJobStatus}} helper that fails
fast when the job reaches a globally terminal state other than the target
(surfacing the unexpected state), {{TEST_TIMEOUT}} reduced to 5 minutes, and
{{@Timeout(value = 10, unit = MINUTES)}} on the test template as a hard
anti-hang guard. This converts the fork-killing hang into a localized,
diagnosable failure. Whether the job legitimately reaches FAILED in these
restore scenarios may warrant a separate runtime investigation once such a
failure has been captured.
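
A minimal sketch of the fail-fast wait described above, not the actual patch.
It assumes a polling loop over a status supplier. The class name
{{WaitForJobStatusSketch}}, the method signature, and the local {{JobStatus}}
enum (a reduced stand-in for Flink's {{JobStatus}}, keeping only
{{isGloballyTerminalState()}}) are hypothetical and exist only to keep the
example self-contained.

```java
import java.time.Duration;
import java.util.function.Supplier;

final class WaitForJobStatusSketch {

    /** Stand-in for Flink's JobStatus, reduced to the states named in this ticket. */
    enum JobStatus {
        RUNNING(false),
        FINISHED(true),
        CANCELED(true),
        FAILED(true);

        private final boolean globallyTerminal;

        JobStatus(boolean globallyTerminal) {
            this.globallyTerminal = globallyTerminal;
        }

        boolean isGloballyTerminalState() {
            return globallyTerminal;
        }
    }

    /**
     * Polls the supplier until the target status is seen. Throws as soon as the job
     * sits in a globally terminal state other than the target, or when the timeout
     * elapses, instead of polling for a status that can never arrive.
     */
    static void waitForJobStatus(
            Supplier<JobStatus> statusSupplier,
            JobStatus target,
            Duration timeout,
            Duration pollInterval) {
        final long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (true) {
            JobStatus current = statusSupplier.get();
            if (current == target) {
                return;
            }
            if (current.isGloballyTerminalState()) {
                // A terminal state cannot change any more, so further waiting is pointless.
                throw new IllegalStateException(
                        "Job reached terminal state " + current
                                + " while waiting for " + target);
            }
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new IllegalStateException(
                        "Timed out after " + timeout + " waiting for " + target
                                + "; last status was " + current);
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(
                        "Interrupted while waiting for " + target, e);
            }
        }
    }
}
```

In the real test base the supplier would presumably wrap the cluster client's
job status query. With a check like this, a job that ends in FAILED while the
test waits for FINISHED produces an exception naming the unexpected state
within one poll interval, rather than a silent wait until {{TEST_TIMEOUT}}.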