[ https://issues.apache.org/jira/browse/FLINK-39858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-39858:
-----------------------------------
Labels: pull-request-available test-stability (was: test-stability)
> RestClient.close() can leave in-flight request futures uncompleted, hanging the caller
> --------------------------------------------------------------------------------------
>
> Key: FLINK-39858
> URL: https://issues.apache.org/jira/browse/FLINK-39858
> Project: Flink
> Issue Type: Bug
> Components: Runtime / REST, Tests
> Reporter: Martijn Visser
> Assignee: Martijn Visser
> Priority: Major
> Labels: pull-request-available, test-stability
>
> {{RestClientTest.testRestClientClosedHandling}} hung intermittently in the
> {{test_cron_hadoop313}} leg on master, where the surefire JVM produced no
> output for 900s and was watchdog-killed.
> Unlike a deterministic failure, it reproduces only under load: the preceding
> {{ForwardEdgesAdapterTest}} (100k invocations, ~531s) saturated the agent and
> widened the race window.
> The thread dump taken at the watchdog kill shows the test worker parked
> forever on the request future:
> {code:java}
> "ForkJoinPool-295-worker-1"
> java.lang.Thread.State: WAITING
>     at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2072)
>     at org.apache.flink.core.testutils.FlinkCompletableFutureAssert.assertEventuallyFails(FlinkCompletableFutureAssert.java:161)
>     at org.apache.flink.core.testutils.FlinkCompletableFutureAssert.eventuallyFailsWith(FlinkCompletableFutureAssert.java:135)
>     at org.apache.flink.runtime.rest.RestClientTest.testRestClientClosedHandling(RestClientTest.java:257)
> {code}
> Root cause: {{RestClient}} tracks in-flight requests only via
> {{responseChannelFutures}}, which holds each request's connect-phase
> {{CompletableFuture}}. The connect listener removes that future the moment
> the TCP connection is established, before the request enters its in-flight
> (response) phase, so from then on the request is tracked by nothing. On
> {{close()}}, {{notifyResponseFuturesOfShutdown()}} only fails the futures
> still in {{responseChannelFutures}}. When {{close()}} races with a request
> that has just passed the connect phase, the terminal response future is never
> completed, because the channel's {{channelInactive}} callback may not be
> dispatched once the event-loop group is being torn down. A caller blocking on
> that future then hangs indefinitely.
> FLINK-39180 previously treated the same test failure as a benign
> assertion-type mismatch and assumed the future is always completed on close;
> that holds only for the connect phase, not the in-flight phase, so the
> underlying defect remained.
> Solution: track the terminal per-request response future for its whole
> lifetime in a dedicated set, fail those futures on close, and re-check
> {{isRunning}} after registration to close the check-then-act race. The
> re-check fails a future only if that future is still registered, as
> determined by an atomic removal from the set.
> Failed CI build (Azure DevOps {{flink-ci.flink-master-mirror}}, 20260604.1):
> https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=75618
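
For readers following the thread, the approach in the "Solution" paragraph above can be illustrated with a minimal, self-contained sketch. This is not the code from the pull request and not Flink's {{RestClient}}: the class {{TrackingClientSketch}}, its methods, the {{String}} response type, and the exception message are all hypothetical stand-ins, and no networking is involved. It only shows the pattern of registering the terminal future for its whole lifetime, failing every registered future on close, and re-checking the running flag after registration.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a client; NOT Flink's RestClient. */
final class TrackingClientSketch {
    private final AtomicBoolean isRunning = new AtomicBoolean(true);

    // Terminal per-request futures, tracked from submission until completion.
    private final Set<CompletableFuture<String>> inFlight = ConcurrentHashMap.newKeySet();

    /** Registers a terminal response future; a real client would also start I/O here. */
    CompletableFuture<String> sendRequest() {
        CompletableFuture<String> response = new CompletableFuture<>();
        if (!isRunning.get()) {
            response.completeExceptionally(closed());
            return response;
        }
        inFlight.add(response);
        // Deregister on completion, whatever the outcome.
        response.whenComplete((result, error) -> inFlight.remove(response));
        // Re-check after registration: close() may have run between the first
        // check and add(). remove() returns true for exactly one caller, so the
        // future is failed here only if it is still registered.
        if (!isRunning.get() && inFlight.remove(response)) {
            response.completeExceptionally(closed());
        }
        return response;
    }

    /** Fails every future that is still registered; later calls are no-ops. */
    void close() {
        if (isRunning.compareAndSet(true, false)) {
            // The key-set iterator is weakly consistent, so removals triggered
            // by the whenComplete callback during iteration are safe.
            for (CompletableFuture<String> future : inFlight) {
                future.completeExceptionally(closed());
            }
        }
    }

    int inFlightCount() {
        return inFlight.size();
    }

    private static IllegalStateException closed() {
        // Assumed message, for illustration only.
        return new IllegalStateException("client is already closed");
    }

    public static void main(String[] args) {
        TrackingClientSketch client = new TrackingClientSketch();
        CompletableFuture<String> pending = client.sendRequest();
        client.close();
        System.out.println("failed on close: " + pending.isCompletedExceptionally());
        System.out.println("still tracked: " + client.inFlightCount());
    }
}
```

In this sketch a request that registers concurrently with {{close()}} is failed either by the loop in {{close()}} or by the re-check in {{sendRequest()}}, so a caller blocking on the returned future does not wait forever. Whether the actual fix uses the same data structures is not stated in the issue.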
--
This message was sent by Atlassian Jira
(v8.20.10#820010)