[ 
https://issues.apache.org/jira/browse/FLINK-38872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ConfX updated FLINK-38872:
--------------------------
    Description: 
`CommonTestUtils.waitForAllTaskRunning()` does not have a timeout mechanism and 
may hang indefinitely if the job being waited on is lost, cancelled, or enters 
an unexpected terminal state.
 
{code:java}
public static void waitForAllTaskRunning(
        MiniCluster miniCluster, JobID jobId, boolean allowFinished) throws 
Exception {
    waitForAllTaskRunning(() -> getGraph(miniCluster, jobId), allowFinished);
} {code}
This method calls `waitUntilCondition()` which polls indefinitely with no upper 
bound on wait time:
{code:java}
public static void waitUntilCondition(
        SupplierWithException<Boolean, Exception> condition) throws Exception {
    waitUntilCondition(condition, Duration.ofMillis(1));
} {code}
If the job identified by `jobId` is:
 - Lost due to cluster failure or restart
 - Cancelled unexpectedly
 - Failed and entered a terminal state

Never properly scheduled The method will continue polling forever, causing the 
test to hang indefinitely rather than failing with a clear error message.
 
h2. Scenario Example
{code:java}
JobClient jobClient = env.executeAsync();

// If something causes the job to be lost here (e.g., cluster issue)
// This will hang forever because the job no longer exists
CommonTestUtils.waitForAllTaskRunning(
    miniCluster, jobClient.getJobID(), false); {code}
 
Add a timeout parameter to `waitForAllTaskRunning()` to ensure tests fail fast 
with a clear error message rather than hanging indefinitely.

I attached a proposed patch for this issue and happy to send a PR for the issue 
if you think this is reasonable.

  was:
`CommonTestUtils.waitForAllTaskRunning()` does not have a timeout mechanism and 
may hang indefinitely if the job being waited on is lost, cancelled, or enters 
an unexpected terminal state.
 
{code:java}
public static void waitForAllTaskRunning(
        MiniCluster miniCluster, JobID jobId, boolean allowFinished) throws 
Exception {
    waitForAllTaskRunning(() -> getGraph(miniCluster, jobId), allowFinished);
} {code}
This method calls `waitUntilCondition()` which polls indefinitely with no upper 
bound on wait time:
{code:java}
public static void waitUntilCondition(
        SupplierWithException<Boolean, Exception> condition) throws Exception {
    waitUntilCondition(condition, Duration.ofMillis(1));
} {code}
If the job identified by `jobId` is:
- Lost due to cluster failure or restart
- Cancelled unexpectedly
- Failed and entered a terminal state
- Never properly scheduled The method will continue polling forever, causing 
the test to hang indefinitely rather than failing with a clear error message.
 
h2. Scenario Example
{code:java}
JobClient jobClient = env.executeAsync();

// If something causes the job to be lost here (e.g., cluster issue)
// This will hang forever because the job no longer exists
CommonTestUtils.waitForAllTaskRunning(
    miniCluster, jobClient.getJobID(), false); {code}
 
Add a timeout parameter to `waitForAllTaskRunning()` to ensure tests fail fast 
with a clear error message rather than hanging indefinitely.
 I attached a proposed patch for this issue and happy to send a PR for the 
issue if you think this is reasonable.


> waitForAllTaskRunning Hang Indefinitely
> ---------------------------------------
>
>                 Key: FLINK-38872
>                 URL: https://issues.apache.org/jira/browse/FLINK-38872
>             Project: Flink
>          Issue Type: Bug
>          Components: Test Infrastructure, Tests
>    Affects Versions: 2.2.0
>            Reporter: ConfX
>            Priority: Major
>         Attachments: timeout.patch
>
>
> `CommonTestUtils.waitForAllTaskRunning()` does not have a timeout mechanism 
> and may hang indefinitely if the job being waited on is lost, cancelled, or 
> enters an unexpected terminal state.
>  
> {code:java}
> public static void waitForAllTaskRunning(
>         MiniCluster miniCluster, JobID jobId, boolean allowFinished) throws 
> Exception {
>     waitForAllTaskRunning(() -> getGraph(miniCluster, jobId), allowFinished);
> } {code}
> This method calls `waitUntilCondition()` which polls indefinitely with no 
> upper bound on wait time:
> {code:java}
> public static void waitUntilCondition(
>         SupplierWithException<Boolean, Exception> condition) throws Exception 
> {
>     waitUntilCondition(condition, Duration.ofMillis(1));
> } {code}
> If the job identified by `jobId` is:
>  - Lost due to cluster failure or restart
>  - Cancelled unexpectedly
>  - Failed and entered a terminal state
> Never properly scheduled The method will continue polling forever, causing 
> the test to hang indefinitely rather than failing with a clear error message.
>  
> h2. Scenario Example
> {code:java}
> JobClient jobClient = env.executeAsync();
> // If something causes the job to be lost here (e.g., cluster issue)
> // This will hang forever because the job no longer exists
> CommonTestUtils.waitForAllTaskRunning(
>     miniCluster, jobClient.getJobID(), false); {code}
>  
> Add a timeout parameter to `waitForAllTaskRunning()` to ensure tests fail 
> fast with a clear error message rather than hanging indefinitely.
> I attached a proposed patch for this issue and happy to send a PR for the 
> issue if you think this is reasonable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to