Martijn Visser created FLINK-39858:
--------------------------------------

             Summary: RestClient.close() can leave in-flight request futures 
uncompleted, hanging the caller
                 Key: FLINK-39858
                 URL: https://issues.apache.org/jira/browse/FLINK-39858
             Project: Flink
          Issue Type: Bug
          Components: Runtime / REST, Tests
            Reporter: Martijn Visser
            Assignee: Martijn Visser


{{RestClientTest.testRestClientClosedHandling}} hung intermittently in the 
{{test_cron_hadoop313}} leg on master, where the surefire JVM produced no 
output for 900s and was watchdog-killed.
The hang is not deterministic and reproduces only under load: the preceding 
{{ForwardEdgesAdapterTest}} (100k invocations, ~531s) saturated the agent and 
widened the race window.
The thread dump taken at the watchdog kill shows the test worker parked forever 
on the request future:

{code:java}
"ForkJoinPool-295-worker-1"
   java.lang.Thread.State: WAITING
        at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2072)
        at org.apache.flink.core.testutils.FlinkCompletableFutureAssert.assertEventuallyFails(FlinkCompletableFutureAssert.java:161)
        at org.apache.flink.core.testutils.FlinkCompletableFutureAssert.eventuallyFailsWith(FlinkCompletableFutureAssert.java:135)
        at org.apache.flink.runtime.rest.RestClientTest.testRestClientClosedHandling(RestClientTest.java:257)
{code}

Root cause: {{RestClient}} tracks in-flight requests only via 
{{responseChannelFutures}}, which holds each request's connect-phase 
{{CompletableFuture}}. The connect listener removes that future the moment the 
TCP connection is established, before the request enters its in-flight 
(response) phase, so from then on the request is tracked by nothing. On 
{{close()}}, {{notifyResponseFuturesOfShutdown()}} only fails the futures still 
in {{responseChannelFutures}}. When {{close()}} races with a request that has 
just passed the connect phase, the terminal response future is never completed 
(the channel's {{channelInactive}} callback may not be dispatched once the 
event-loop group is being torn down), so a caller blocking on it hangs 
indefinitely.
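
The race described above can be illustrated with a minimal sketch. It is not 
Flink's actual {{RestClient}} code: the class {{ConnectOnlyTrackingClient}}, 
the field {{connectFutures}} (standing in for {{responseChannelFutures}}) and 
the {{String}} payloads are hypothetical, and the connect phase is simulated by 
completing a future by hand.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the tracking logic; NOT Flink's real RestClient. */
class ConnectOnlyTrackingClient {
    // Plays the role of responseChannelFutures: holds connect-phase futures only.
    private final Set<CompletableFuture<String>> connectFutures =
            ConcurrentHashMap.newKeySet();

    /** Starts a request and returns its terminal response future. */
    CompletableFuture<String> sendRequest(CompletableFuture<String> connectFuture) {
        connectFutures.add(connectFuture);
        CompletableFuture<String> responseFuture = new CompletableFuture<>();
        // Connect listener: stop tracking as soon as the connect attempt finishes.
        connectFuture.whenComplete((channel, error) -> {
            connectFutures.remove(connectFuture);
            if (error != null) {
                responseFuture.completeExceptionally(error);
            }
            // On success the request is in flight, but nothing tracks responseFuture.
        });
        return responseFuture;
    }

    /** Fails only the futures that are still in the connect phase. */
    void close() {
        for (CompletableFuture<String> f : connectFutures) {
            f.completeExceptionally(new IllegalStateException("client is closed"));
        }
    }

    public static void main(String[] args) {
        ConnectOnlyTrackingClient client = new ConnectOnlyTrackingClient();
        CompletableFuture<String> connect = new CompletableFuture<>();
        CompletableFuture<String> response = client.sendRequest(connect);
        connect.complete("channel"); // connection established, tracking dropped
        client.close();              // finds nothing left to fail
        System.out.println(response.isDone()); // prints false: get() would block
    }
}
```

In this sketch a request that is still connecting when {{close()}} runs is 
failed correctly; only a request that has already connected is left with an 
uncompleted response future.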

FLINK-39180 previously treated the same test failure as a benign assertion-type 
mismatch and assumed the future is always completed on close; that holds only 
for the connect phase, not the in-flight phase, so the underlying defect 
remained.

Solution: track each request's terminal response future for its whole lifetime 
in a dedicated set, fail those futures on close, and re-check {{isRunning}} 
after registration (failing the future only if it can still be atomically 
removed from the set) to close the check-then-act race.
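
A minimal sketch of this register-then-re-check pattern follows. It is an 
illustration under assumptions, not the actual patch: the class 
{{LifetimeTrackingClient}}, the set {{inFlightResponses}}, the helper 
{{closed()}} and the {{String}} payload are hypothetical names, and the real 
connect and response handling is omitted.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of lifetime tracking; NOT Flink's real RestClient. */
class LifetimeTrackingClient {
    private final AtomicBoolean isRunning = new AtomicBoolean(true);
    // Dedicated set holding every response future until it completes.
    private final Set<CompletableFuture<String>> inFlightResponses =
            ConcurrentHashMap.newKeySet();

    CompletableFuture<String> sendRequest() {
        CompletableFuture<String> responseFuture = new CompletableFuture<>();
        inFlightResponses.add(responseFuture);
        // Untrack on any completion so the set does not grow without bound.
        responseFuture.whenComplete((r, e) -> inFlightResponses.remove(responseFuture));
        // Re-check after registration: close() may have finished its sweep before
        // the add above. remove() succeeds for exactly one of the two paths.
        if (!isRunning.get() && inFlightResponses.remove(responseFuture)) {
            responseFuture.completeExceptionally(closed());
        }
        return responseFuture;
    }

    void close() {
        // Flip the flag BEFORE sweeping, so a late registration sees it.
        if (isRunning.compareAndSet(true, false)) {
            for (CompletableFuture<String> f : inFlightResponses) {
                if (inFlightResponses.remove(f)) {
                    f.completeExceptionally(closed());
                }
            }
        }
    }

    private static IllegalStateException closed() {
        return new IllegalStateException("client is closed");
    }

    public static void main(String[] args) {
        LifetimeTrackingClient client = new LifetimeTrackingClient();
        CompletableFuture<String> inFlight = client.sendRequest();
        client.close();
        System.out.println(inFlight.isCompletedExceptionally()); // prints true
        System.out.println(client.sendRequest().isCompletedExceptionally()); // prints true
    }
}
```

The ordering is what matters in this sketch: {{close()}} flips the flag before 
sweeping the set, and {{sendRequest()}} registers before re-checking the flag, 
so a future is either seen by the sweep or caught by the re-check.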

Failed CI build (Azure DevOps {{flink-ci.flink-master-mirror}}, 20260604.1): 
https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=75618



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
