[ 
https://issues.apache.org/jira/browse/HTTPCLIENT-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18021808#comment-18021808
 ] 

Oleg Kalnichevski edited comment on HTTPCLIENT-2398 at 9/22/25 9:38 AM:
------------------------------------------------------------------------

> I pulled the stack traces from our GCP logs as is, they are just cut-off at 
> some point. But you can already clearly see the repeating pattern. What do 
> you mean with garbled? how should I prep the Stacktrace?

[~eppdot] A properly formatted stack trace is just plain text that one can feed 
into IntelliJ's {{Analyze Stack Trace}} dialog. Generally this is what one gets 
when printing the stack trace with {{Throwable#printStackTrace(PrintStream)}}. 
Most of the logging toolkits I have worked with (Log4j2 and Logback) preserve 
the conventional stack trace formatting.
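
For example, something along these lines produces exactly that plain-text form 
(just a trivial helper for illustration, nothing HttpClient-specific):
{code:java}
import java.io.PrintWriter;
import java.io.StringWriter;

public final class StackTraceText {

    // Render a throwable in the conventional plain-text form that the
    // IntelliJ "Analyze Stack Trace" dialog can consume.
    public static String of(final Throwable t) {
        final StringWriter buffer = new StringWriter();
        t.printStackTrace(new PrintWriter(buffer, true));
        return buffer.toString();
    }
}
{code}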

I am not sure about the fix proposed by [~abernal]. It likely eliminates the 
recursion, but I doubt it fixes the root cause.

Oleg


> StackOverflowError in pool release sync chain
> ---------------------------------------------
>
>                 Key: HTTPCLIENT-2398
>                 URL: https://issues.apache.org/jira/browse/HTTPCLIENT-2398
>             Project: HttpComponents HttpClient
>          Issue Type: Bug
>          Components: HttpClient (async)
>    Affects Versions: 5.5
>            Reporter: Stephan
>            Priority: Major
>         Attachments: stacktrace, stacktrace-2.rtf
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hello,
> I have investigated a tricky error we have experienced multiple times in our 
> benchmarks (which use the async HttpClient to send thousands of requests to 
> our clusters).
> You can find a deeper analysis 
> [here|https://github.com/camunda/camunda/issues/34597#issuecomment-3301797932]
>  in our issue tracker.
>  
> *TL;DR*
> The stack trace shows a *tight synchronous callback cycle inside 
> HttpComponents' async path* that repeatedly alternates between
> {{completed → release/discard/fail → connect/proceedToNextHop → completed}},
> causing unbounded recursion until the JVM stack overflows.
>  
> Concretely, the cycle is:
>  * *{{AsyncConnectExec$1.completed}}* → 
> {{InternalHttpAsyncExecRuntime$1.completed}} → {{BasicFuture.completed}}
>  * {{PoolingAsyncClientConnectionManager}} lease/completed → 
> {{StrictConnPool.fireCallbacks}} → {{StrictConnPool.release}} → 
> {{PoolingAsyncClientConnectionManager.release}}
>  * {{InternalHttpAsyncExecRuntime.discardEndpoint}} → 
> {{InternalAbstractHttpAsyncClient$2.failed}} → 
> {{AsyncRedirectExec/AsyncHttpRequestRetryExec/AsyncProtocolExec/AsyncConnectExec.failed}}
>  → {{BasicFuture.failed}} / {{ComplexFuture.failed}} → 
> {{PoolingAsyncClientConnectionManager$4.failed}} → 
> {{DefaultAsyncClientConnectionOperator$1.failed}} → 
> {{MultihomeIOSessionRequester.connect}} → 
> {{DefaultAsyncClientConnectionOperator.connect}} → 
> {{PoolingAsyncClientConnectionManager.connect}} → 
> {{InternalHttpAsyncExecRuntime.connectEndpoint}} → 
> {{AsyncConnectExec.proceedToNextHop}} → back to 
> *{{AsyncConnectExec$1.completed}}*
>  
>  
> Possible concrete root causes:
>  # *Synchronous BasicFuture callbacks*
> BasicFuture.completed() and .failed() call callbacks immediately on the 
> thread that completes the future. If a callback in turn calls pool release() 
> which calls fireCallbacks() (synchronously), the chain can re-enter callback 
> code without unwinding. Re-entrancy depth grows with each attempted 
> connect/release cycle.
>  # *Multihome connect tries multiple addresses in the same stack*
> MultihomeIOSessionRequester.connect will attempt alternate addresses (A/AAAA 
> records). If an address fails quickly and the code immediately tries the next 
> address by invoking connection manager code and its callbacks synchronously, 
> you build deeper recursion for each try.
>  # *Retries/redirects executed synchronously*
> The exec chain (redirect → retry → protocol → connect) will call failed() 
> listeners which in turn call connect again. If those calls are synchronous, 
> you get direct recursive invocation.
>  # *Potential omission of an async boundary*
> A simple but dangerous pattern is: complete future → call listener → listener 
> calls code that completes other futures → repeat. If there is no executor 
> handoff, the recursion remains on the same thread (a minimal sketch of this 
> pattern follows below).
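>  
> To make the re-entrancy concrete, here is a minimal sketch in plain Java 
> (deliberately simplified, not the actual HttpClient classes) of the difference 
> between completing a future synchronously and handing its callback off to an 
> executor:
> {code:java}
> import java.util.concurrent.Executor;
> import java.util.function.Consumer;
> 
> // Sketch of causes 1 and 4: the callback runs on the completing thread, so if
> // it in turn completes another such future, the calls nest on the same stack
> // and the depth grows with every connect/release cycle.
> final class SyncCompletingFuture<T> {
>     private Consumer<T> callback;
> 
>     void onComplete(final Consumer<T> cb) { this.callback = cb; }
> 
>     void complete(final T value) {
>         callback.accept(value); // synchronous: re-enters caller code
>     }
> }
> 
> // With an executor handoff the completing thread unwinds first, so the next
> // stage starts on a fresh stack and the depth stays bounded.
> final class AsyncCompletingFuture<T> {
>     private final Executor executor;
>     private Consumer<T> callback;
> 
>     AsyncCompletingFuture(final Executor executor) { this.executor = executor; }
> 
>     void onComplete(final Consumer<T> cb) { this.callback = cb; }
> 
>     void complete(final T value) {
>         executor.execute(() -> callback.accept(value)); // stack unwinds here
>     }
> }
> {code}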
>  
> I haven’t been able to create a unit test that reproduces the issue locally, 
> even though I tried multiple approaches (a flaky synthetic HTTP server, a 
> randomly failing custom DNS resolver, thousands of scheduled requests, etc.).
> Currently I am running a few more benchmark tests to see whether the 
> following override yields an improvement:
> {code:java}
>   @Override
>   public void release(
>       final AsyncConnectionEndpoint endpoint, final Object state, final TimeValue keepAlive) {
>     // Hand the release off to another thread so it does not run synchronously
>     // inside the completing callback chain.
>     CompletableFuture.runAsync(
>         () -> {
>           super.release(endpoint, state, keepAlive);
>         });
>   }
> }
> {code}
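> One caveat worth noting about that override: the single-argument 
> {{CompletableFuture.runAsync(Runnable)}} schedules the release on the common 
> {{ForkJoinPool}}. A hypothetical variant (same idea, shown only as a sketch) 
> passes a dedicated executor via the two-argument overload so pool releases do 
> not compete with other common-pool work:
> {code:java}
>   // Sketch only: a dedicated single-threaded executor for deferred releases.
>   private final java.util.concurrent.ExecutorService releaseExecutor =
>       java.util.concurrent.Executors.newSingleThreadExecutor();
> 
>   @Override
>   public void release(
>       final AsyncConnectionEndpoint endpoint, final Object state, final TimeValue keepAlive) {
>     CompletableFuture.runAsync(
>         () -> super.release(endpoint, state, keepAlive),
>         releaseExecutor);
>   }
> {code}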
> Does someone have an idea of what we are doing wrong? Is this a bug or a 
> misconfiguration on our side? We have now switched to the {{LAX}} concurrency 
> policy, which seems to mitigate the issue, but it does not fix the root cause 
> and we still occasionally get the StackOverflowError. (I can see that the lax 
> pool also has the synchronous release/fireCallbacks approach.)
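> For reference, this is roughly how the LAX pool is enabled (standard 
> HttpClient 5.x builder calls; the {{LaxPoolClientFactory}} name is just for 
> the example):
> {code:java}
> import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
> import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
> import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManagerBuilder;
> import org.apache.hc.core5.pool.PoolConcurrencyPolicy;
> 
> public final class LaxPoolClientFactory {
> 
>     // Build an async client whose connection pool uses the LAX concurrency
>     // policy; this only mitigates the recursion for us, it does not remove
>     // the synchronous release/fireCallbacks chain.
>     public static CloseableHttpAsyncClient create() {
>         return HttpAsyncClients.custom()
>                 .setConnectionManager(
>                         PoolingAsyncClientConnectionManagerBuilder.create()
>                                 .setPoolConcurrencyPolicy(PoolConcurrencyPolicy.LAX)
>                                 .build())
>                 .build();
>     }
> }
> {code}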
> I have attached two stack traces (one with {{StrictConnPool}} and one with 
> {{LaxConnPool}}).


