[ 
https://issues.apache.org/jira/browse/HTTPCLIENT-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan updated HTTPCLIENT-2398:
--------------------------------
    Description: 
Hello,

I have investigated a tricky error we experienced multiple times in our 
benchmarks (that use the async httpclient to send thousands of requests to our 
clusters).

You can find a deeper analysis 
[here|https://github.com/camunda/camunda/issues/34597#issuecomment-3301797932] 
in our issue tracker.

 

*TL;DR*

The stacktrace shows a *tight synchronous callback cycle inside HttpComponents' 
async path* that repeatedly runs through

{{completed → release/discard/fail → connect/proceedToNextHop → completed}}

causing unbounded recursion until the JVM stack overflows.

 

Concretely the cycle is:
 * *{{AsyncConnectExec$1.completed}}* → 
{{InternalHttpAsyncExecRuntime$1.completed}} → {{BasicFuture.completed}}
 * {{PoolingAsyncClientConnectionManager}} lease/completed → 
{{StrictConnPool.fireCallbacks}} → {{StrictConnPool.release}} → 
{{PoolingAsyncClientConnectionManager.release}}
 * {{InternalHttpAsyncExecRuntime.discardEndpoint}} → 
{{InternalAbstractHttpAsyncClient$2.failed}} → 
{{AsyncRedirectExec/AsyncHttpRequestRetryExec/AsyncProtocolExec/AsyncConnectExec.failed}}
 → {{BasicFuture.failed}} / {{ComplexFuture.failed}} → 
{{PoolingAsyncClientConnectionManager$4.failed}} → 
{{DefaultAsyncClientConnectionOperator$1.failed}} → 
{{MultihomeIOSessionRequester.connect}} → 
{{DefaultAsyncClientConnectionOperator.connect}} → 
{{PoolingAsyncClientConnectionManager.connect}} → 
{{InternalHttpAsyncExecRuntime.connectEndpoint}} → 
{{AsyncConnectExec.proceedToNextHop}} → back to 
*{{AsyncConnectExec$1.completed}}* (a simplified model of this re-entrancy is sketched right below).
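To make the re-entrancy concrete, here is a minimal self-contained model of the pattern (plain Java, not HttpClient code; all names are illustrative): a connect that fails synchronously and a failure callback that immediately retries on the same thread, so every attempt adds stack frames.
{code:java}
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class SyncCallbackRecursionDemo {

    // Stands in for connect/proceedToNextHop: the failure is reported on the
    // caller's thread, with no executor hand-off in between.
    static void connect(final Consumer<Exception> onFailure) {
        onFailure.accept(new Exception("connection refused"));
    }

    // Stands in for the failed() -> connect() part of the cycle.
    static void retryOnFailure(final AtomicInteger depth) {
        depth.incrementAndGet();
        connect(ex -> retryOnFailure(depth));
    }

    public static void main(final String[] args) {
        final AtomicInteger depth = new AtomicInteger();
        try {
            retryOnFailure(depth);
        } catch (final StackOverflowError e) {
            System.out.println("StackOverflowError after " + depth.get() + " synchronous attempts");
        }
    }
}
{code}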

 

 

Possible concrete root causes
 # *Synchronous BasicFuture callbacks*
BasicFuture.completed() and .failed() call callbacks immediately on the thread 
that completes the future. If a callback in turn calls pool release() which 
calls fireCallbacks() (synchronously), the chain can re-enter callback code 
without unwinding. Re-entrancy depth grows with each attempted connect/release 
cycle.
 # *Multihome connect tries multiple addresses in the same stack*
MultihomeIOSessionRequester.connect will attempt alternate addresses (A/AAAA 
records). If an address fails quickly and the code immediately tries the next 
address by invoking connection manager code and its callbacks synchronously, 
you build deeper recursion for each try.
 # *Retries/redirects executed synchronously*
The exec chain (redirect → retry → protocol → connect) will call failed() 
listeners which in turn call connect again. If those calls are synchronous, you 
get direct recursive invocation.
 # *Potential omission of an async boundary*
A simple but dangerous pattern is: complete future → call listener → listener 
calls code that completes other futures → repeat. If there is no executor 
handoff, the recursion remains on the same thread (see the sketch right after 
this list for how a single hand-off bounds the stack depth).
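As a counterpoint to the last cause above (the missing async boundary), the same toy model stops overflowing as soon as a single executor hand-off is introduced, because each retry then starts on a fresh stack (again plain Java, not HttpClient code; the attempt cap is only there to end the demo):
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class AsyncBoundaryDemo {

    // The failure callback is dispatched to the common pool instead of being
    // invoked inline, so the caller's stack unwinds before the retry runs.
    static void connect(final Consumer<Exception> onFailure) {
        CompletableFuture.runAsync(() -> onFailure.accept(new Exception("connection refused")));
    }

    static void retryOnFailure(final AtomicInteger attempts) {
        if (attempts.incrementAndGet() > 100_000) {
            return; // arbitrary cap to end the demo
        }
        connect(ex -> retryOnFailure(attempts));
    }

    public static void main(final String[] args) throws InterruptedException {
        final AtomicInteger attempts = new AtomicInteger();
        retryOnFailure(attempts);
        Thread.sleep(2_000);
        System.out.println("no StackOverflowError after " + attempts.get() + " attempts");
    }
}
{code}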

 

I haven’t been able to create a unit test that reproduces the issue locally, 
even though I tried multiple approaches (a flaky synthetic HTTP server, a 
custom DNS resolver that fails randomly, thousands of scheduled requests, etc.).
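For reference, the randomly failing DNS resolver from those attempts looked roughly like this (sketch; class name and failure rate are illustrative), plugged in via {{PoolingAsyncClientConnectionManagerBuilder.setDnsResolver}}:
{code:java}
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ThreadLocalRandom;

import org.apache.hc.client5.http.DnsResolver;
import org.apache.hc.client5.http.SystemDefaultDnsResolver;

// Sketch of the flaky resolver used in the reproduction attempts: it delegates
// to the system resolver but fails a fraction of lookups.
class RandomlyFailingDnsResolver implements DnsResolver {

    private final DnsResolver delegate = SystemDefaultDnsResolver.INSTANCE;

    @Override
    public InetAddress[] resolve(final String host) throws UnknownHostException {
        if (ThreadLocalRandom.current().nextInt(100) < 30) { // ~30% injected failures
            throw new UnknownHostException("injected failure for " + host);
        }
        return delegate.resolve(host);
    }

    @Override
    public String resolveCanonicalHostname(final String host) throws UnknownHostException {
        return delegate.resolveCanonicalHostname(host);
    }
}
{code}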

Currently I am running a few more benchmark tests to see if the following 
change yields an improvement (the override lives in a subclass of 
{{PoolingAsyncClientConnectionManager}}; the class name in the snippet is illustrative):
{code:java}
import java.util.concurrent.CompletableFuture;

import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager;
import org.apache.hc.client5.http.nio.AsyncConnectionEndpoint;
import org.apache.hc.core5.util.TimeValue;

// Experimental workaround (class name illustrative): defer the pool release to
// another thread so that super.release() and the callbacks it fires do not run
// on the current call stack.
class DeferredReleaseConnectionManager extends PoolingAsyncClientConnectionManager {

  @Override
  public void release(
      final AsyncConnectionEndpoint endpoint, final Object state, final TimeValue keepAlive) {
    CompletableFuture.runAsync(() -> super.release(endpoint, state, keepAlive));
  }
}
{code}
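For completeness, this is roughly how the overriding manager is wired into the client (sketch; the class name is the illustrative one from the snippet above, other builder options omitted):
{code:java}
// Supply the custom connection manager to the async client builder.
CloseableHttpAsyncClient client = HttpAsyncClients.custom()
    .setConnectionManager(new DeferredReleaseConnectionManager())
    .build();
client.start();
{code}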
Does someone have an idea what we are doing wrong? Is this a bug or a 
misconfiguration on our side? We have now switched to the {{LAX}} pool 
concurrency policy, which seems to mitigate the issue, but it does not fix the 
root cause and we still occasionally get the {{StackOverflowError}}. (I can see 
that the lax pool also uses the synchronous release/fireCallbacks approach, etc.)
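For reference, switching the policy is just a builder option on our side; a minimal sketch (other builder options omitted):
{code:java}
// Select the LAX pool concurrency policy instead of the default STRICT one.
PoolingAsyncClientConnectionManager cm = PoolingAsyncClientConnectionManagerBuilder.create()
    .setPoolConcurrencyPolicy(PoolConcurrencyPolicy.LAX)
    .build();
{code}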

I have attached two stacktraces (one with {{StrictConnPool}} and one with 
{{LaxConnPool}}).



> StackOverflowError in pool release sync chain
> ---------------------------------------------
>
>                 Key: HTTPCLIENT-2398
>                 URL: https://issues.apache.org/jira/browse/HTTPCLIENT-2398
>             Project: HttpComponents HttpClient
>          Issue Type: Bug
>          Components: HttpClient (async)
>    Affects Versions: 5.5
>            Reporter: Stephan
>            Priority: Major
>         Attachments: stacktrace, stacktrace-2.rtf
>


