ConfX created FLINK-38872:
-----------------------------
Summary: waitForAllTaskRunning Hang Indefinitely
Key: FLINK-38872
URL: https://issues.apache.org/jira/browse/FLINK-38872
Project: Flink
Issue Type: Bug
Components: Test Infrastructure, Tests
Affects Versions: 2.2.0
Reporter: ConfX
Attachments: timeout.patch
`CommonTestUtils.waitForAllTaskRunning()` does not have a timeout mechanism and
may hang indefinitely if the job being waited on is lost, cancelled, or enters
an unexpected terminal state.
{code:java}
public static void waitForAllTaskRunning(
        MiniCluster miniCluster, JobID jobId, boolean allowFinished) throws Exception {
    waitForAllTaskRunning(() -> getGraph(miniCluster, jobId), allowFinished);
}
{code}
This method calls `waitUntilCondition()` which polls indefinitely with no upper
bound on wait time:
{code:java}
public static void waitUntilCondition(
        SupplierWithException<Boolean, Exception> condition) throws Exception {
    waitUntilCondition(condition, Duration.ofMillis(1));
}
{code}
The method continues polling forever if the job identified by `jobId` is:
- Lost due to cluster failure or restart
- Cancelled unexpectedly
- Failed and entered a terminal state
- Never properly scheduled
In each of these cases the test hangs indefinitely rather than failing with a
clear error message.
h2. Scenario Example
{code:java}
JobClient jobClient = env.executeAsync();
// If something causes the job to be lost here (e.g., a cluster issue),
// this call hangs forever because the job no longer exists.
CommonTestUtils.waitForAllTaskRunning(
        miniCluster, jobClient.getJobID(), false);
{code}
Add a timeout parameter to `waitForAllTaskRunning()` to ensure tests fail fast
with a clear error message rather than hanging indefinitely.
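To illustrate the idea, here is a minimal, JDK-only sketch of a bounded polling loop. It is not the contents of timeout.patch and not Flink's actual API: the class name `BoundedWait`, the parameter list, and the error message format are all hypothetical, chosen only to show how a deadline turns an endless wait into a fast, descriptive failure.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; not part of Flink's CommonTestUtils.
public class BoundedWait {

    /**
     * Polls the condition until it returns true. Throws TimeoutException
     * once the timeout has elapsed without the condition holding.
     */
    public static void waitUntilCondition(
            Callable<Boolean> condition,
            Duration timeout,
            Duration retryInterval,
            String errorMessage) throws Exception {
        // nanoTime is monotonic, so the deadline is unaffected by wall-clock changes.
        final long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (!condition.call()) {
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new TimeoutException(
                        errorMessage + " (timed out after " + timeout.toMillis() + " ms)");
            }
            Thread.sleep(retryInterval.toMillis());
        }
    }

    public static void main(String[] args) throws Exception {
        // A condition that never holds stands in for a lost or cancelled job.
        try {
            waitUntilCondition(
                    () -> false,
                    Duration.ofMillis(200),
                    Duration.ofMillis(10),
                    "Tasks did not reach RUNNING");
        } catch (TimeoutException e) {
            System.out.println("Failed fast: " + e.getMessage());
        }
    }
}
```

A timeout-aware `waitForAllTaskRunning()` overload could delegate to a loop of this shape, so a test waiting on a job that no longer exists would fail with the supplied message instead of blocking the build.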
I have attached a proposed patch for this issue and am happy to send a PR if
you think this is reasonable.