[
https://issues.apache.org/jira/browse/HTTPCLIENT-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18023264#comment-18023264
]
Oleg Kalnichevski commented on HTTPCLIENT-2398:
-----------------------------------------------
[~abernal] The proposed change-set still looks hacky to me, even without the
use of ThreadLocal. I am not even sure it actually fixes the original problem
in the first place; given there is no reproducer, it cannot be properly verified.
If you want, you could look into two options:
1. Provide a safe-guard and a configuration option to limit the total
number of queued requests in the execution pipeline and to cancel those that
exceed the limit defined by the configuration.
2. Create a HttpAsyncClient facade that orchestrates and throttles request
submission to ensure the execution pipeline does not get overloaded (a rough
sketch of this idea follows below).
Both options would require some work and careful API design but would be
valuable improvements.
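As a very rough, hypothetical sketch of option 2 (the facade name, the
maxInFlight parameter and the use of a plain Semaphore are illustrative
assumptions, not existing HttpClient API), the throttling could look something
like this:
{code:java}
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

import org.apache.hc.client5.http.async.methods.SimpleHttpRequest;
import org.apache.hc.client5.http.async.methods.SimpleHttpResponse;
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.core5.concurrent.FutureCallback;

// Hypothetical facade that caps the number of in-flight requests so the
// execution pipeline never holds more than maxInFlight exchanges at once.
public class ThrottlingAsyncClientFacade {

    private final CloseableHttpAsyncClient client;
    private final Semaphore permits;

    public ThrottlingAsyncClientFacade(final CloseableHttpAsyncClient client, final int maxInFlight) {
        this.client = client;
        this.permits = new Semaphore(maxInFlight);
    }

    public Future<SimpleHttpResponse> execute(
            final SimpleHttpRequest request,
            final FutureCallback<SimpleHttpResponse> callback) throws InterruptedException {
        permits.acquire(); // block the submitter once the pipeline is "full"
        return client.execute(request, new FutureCallback<SimpleHttpResponse>() {

            @Override
            public void completed(final SimpleHttpResponse result) {
                permits.release();
                if (callback != null) {
                    callback.completed(result);
                }
            }

            @Override
            public void failed(final Exception ex) {
                permits.release();
                if (callback != null) {
                    callback.failed(ex);
                }
            }

            @Override
            public void cancelled() {
                permits.release();
                if (callback != null) {
                    callback.cancelled();
                }
            }
        });
    }
}
{code}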
Oleg
> StackOverflowError in pool release sync chain
> ---------------------------------------------
>
> Key: HTTPCLIENT-2398
> URL: https://issues.apache.org/jira/browse/HTTPCLIENT-2398
> Project: HttpComponents HttpClient
> Issue Type: Bug
> Components: HttpClient (async)
> Affects Versions: 5.5
> Reporter: Stephan
> Priority: Major
> Attachments: stacktrace, stacktrace-2.rtf
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Hello,
> I have investigated a tricky error that we have experienced multiple times in
> our benchmarks (which use the async HttpClient to send thousands of requests
> to our clusters).
> You can find a deeper analysis
> [here|https://github.com/camunda/camunda/issues/34597#issuecomment-3301797932]
> in our issue tracker.
>
> *TL;DR*
> The stacktrace shows a *tight synchronous callback cycle inside
> HttpComponents' async path* that repeatedly alternates between
> {{completed → release/discard/fail → connect/proceedToNextHop → completed}},
> causing unbounded recursion until the JVM stack overflows.
>
> Concretely the cycle is:
> * *{{AsyncConnectExec$1.completed}}* →
> {{InternalHttpAsyncExecRuntime$1.completed}} → {{BasicFuture.completed}}
> * {{PoolingAsyncClientConnectionManager}} lease/completed →
> {{StrictConnPool.fireCallbacks}} → {{StrictConnPool.release}} →
> {{PoolingAsyncClientConnectionManager.release}}
> * {{InternalHttpAsyncExecRuntime.discardEndpoint}} →
> {{InternalAbstractHttpAsyncClient$2.failed}} →
> {{AsyncRedirectExec/AsyncHttpRequestRetryExec/AsyncProtocolExec/AsyncConnectExec.failed}}
> → {{BasicFuture.failed}} / {{ComplexFuture.failed}} →
> {{PoolingAsyncClientConnectionManager$4.failed}} →
> {{DefaultAsyncClientConnectionOperator$1.failed}} →
> {{MultihomeIOSessionRequester.connect}} →
> {{DefaultAsyncClientConnectionOperator.connect}} →
> {{PoolingAsyncClientConnectionManager.connect}} →
> {{InternalHttpAsyncExecRuntime.connectEndpoint}} →
> {{AsyncConnectExec.proceedToNextHop}} → back to
> *{{AsyncConnectExec$1.completed}}*
>
>
> Possible concrete root causes:
> # *Synchronous BasicFuture callbacks*
> BasicFuture.completed() and .failed() invoke their callbacks immediately on the
> thread that completes the future. If a callback in turn calls pool release(),
> which calls fireCallbacks() (synchronously), the chain can re-enter callback
> code without unwinding; the re-entrancy depth grows with each attempted
> connect/release cycle (see the toy example after this list).
> # *Multihome connect tries multiple addresses in the same stack*
> MultihomeIOSessionRequester.connect will attempt alternate addresses (A/AAAA
> records). If an address fails quickly and the code immediately tries the next
> address by invoking connection manager code and its callbacks synchronously,
> you build deeper recursion for each try.
> # *Retries/redirects executed synchronously*
> The exec chain (redirect → retry → protocol → connect) will call failed()
> listeners which in turn call connect again. If those calls are synchronous,
> you get direct recursive invocation.
> # *Potential omission of an async boundary*
> A simple but dangerous pattern is: complete future → call listener → listener
> calls code that completes other futures → repeat. If there is no executor
> handoff, the recursion remains on the same thread.
>
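> To make root cause 1 concrete, here is a deliberately simplified toy model (it
> is not HttpClient code, just an illustration of how synchronous future
> callbacks that re-trigger each other can exhaust the stack on a single thread):
> {code:java}
> import java.util.function.Consumer;
> 
> // Toy model: a "future" that runs its callback synchronously on the
> // completing thread. If the callback starts the next connect attempt, which
> // again completes/fails in-line, every cycle adds frames instead of
> // unwinding, until the JVM throws StackOverflowError.
> public class SyncCallbackRecursionDemo {
> 
>     static class SyncFuture {
>         void complete(final Consumer<String> callback, final String result) {
>             callback.accept(result); // no executor handoff, stays on the caller's stack
>         }
>     }
> 
>     static int depth = 0;
> 
>     static void connect(final SyncFuture future) {
>         depth++;
>         // connect -> completed -> connect -> ... all on one thread
>         future.complete(result -> connect(future), "failed, try next address");
>     }
> 
>     public static void main(final String[] args) {
>         try {
>             connect(new SyncFuture());
>         } catch (final StackOverflowError e) {
>             System.out.println("StackOverflowError at depth " + depth);
>         }
>     }
> }
> {code}
> 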
> I haven't been able to create a unit test that reproduces the issue locally,
> even though I tried multiple approaches (a flaky synthetic HTTP server, a
> randomly failing custom DNS resolver, thousands of scheduled requests, etc.).
> Currently I am running a few more benchmark tests to see whether the following
> override yields an improvement:
> {code:java}
> @Override
> public void release(
>         final AsyncConnectionEndpoint endpoint, final Object state,
>         final TimeValue keepAlive) {
>     // hand the release off to another thread instead of running it
>     // synchronously on the stack of the completing callback
>     CompletableFuture.runAsync(() -> super.release(endpoint, state, keepAlive));
> }
> {code}
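> (The intent of this override is that {{CompletableFuture.runAsync}} hands the
> {{release()}} call off to {{ForkJoinPool.commonPool()}}, so the
> {{completed → release → fireCallbacks}} chain is broken by a thread handoff
> instead of growing on the same stack, at the cost of an extra handoff per
> released connection.)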
> Does anyone have an idea of what we are doing wrong? Is this a bug or a
> misconfiguration on our side? We have switched to the {{LAX}} concurrency
> policy, which seems to mitigate the issue, but it does not fix the root cause
> and we still occasionally get the StackOverflowError. (I can see that the lax
> pool also uses the synchronous release/fireCallbacks approach, etc.)
> I have attached two stacktraces (one with StrictConnPool and one with
> LaxConnPool).