[
https://issues.apache.org/jira/browse/HTTPCLIENT-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stephan updated HTTPCLIENT-2398:
--------------------------------
Description:
Hello,
I have investigated a tricky error we experienced multiple times in our
benchmarks (that use the async httpclient to send thousands of requests to our
clusters).
You can find a deeper analysis
[here|https://github.com/camunda/camunda/issues/34597#issuecomment-3301797932]
in our issue tracker.
*TL;DR:*
The stack trace shows a *tight synchronous callback cycle inside HttpComponents' async path* that repeatedly alternates between
{{completed → release/discard/fail → connect/proceedToNextHop → completed}},
causing unbounded recursion until the JVM stack overflows.
Concretely the cycle is:
* *{{AsyncConnectExec$1.completed}}* →
{{InternalHttpAsyncExecRuntime$1.completed}} → {{BasicFuture.completed}}
* {{PoolingAsyncClientConnectionManager}} lease/completed →
{{StrictConnPool.fireCallbacks}} → {{StrictConnPool.release}} →
{{PoolingAsyncClientConnectionManager.release}}
* {{InternalHttpAsyncExecRuntime.discardEndpoint}} →
{{InternalAbstractHttpAsyncClient$2.failed}} →
{{AsyncRedirectExec/AsyncHttpRequestRetryExec/AsyncProtocolExec/AsyncConnectExec.failed}}
→ {{BasicFuture.failed}} / {{ComplexFuture.failed}} →
{{PoolingAsyncClientConnectionManager$4.failed}} →
{{DefaultAsyncClientConnectionOperator$1.failed}} →
{{MultihomeIOSessionRequester.connect}} →
{{DefaultAsyncClientConnectionOperator.connect}} →
{{PoolingAsyncClientConnectionManager.connect}} →
{{InternalHttpAsyncExecRuntime.connectEndpoint}} →
{{AsyncConnectExec.proceedToNextHop}} → back to *{{AsyncConnectExec$1.completed}}*
Possible concrete root causes
# *Synchronous BasicFuture callbacks*
BasicFuture.completed() and .failed() call callbacks immediately on the thread
that completes the future. If a callback in turn calls pool release() which
calls fireCallbacks() (synchronously), the chain can re-enter callback code
without unwinding. Re-entrancy depth grows with each attempted connect/release
cycle.
# *Multihome connect tries multiple addresses in the same stack*
MultihomeIOSessionRequester.connect will attempt alternate addresses (A/AAAA
records). If an address fails quickly and the code immediately tries the next
address by invoking connection manager code and its callbacks synchronously,
you build deeper recursion for each try.
# *Retries/redirects executed synchronously*
The exec chain (redirect → retry → protocol → connect) will call failed()
listeners which in turn call connect again. If those calls are synchronous, you
get direct recursive invocation.
# *Potential omission of an async boundary*
A simple but dangerous pattern is: complete future → call listener → listener calls code that completes other futures → repeat. If there is no executor handoff, the recursion stays on the same thread (a minimal illustration of this pattern follows after this list).
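To make (1) and (4) concrete, here is a minimal, self-contained illustration of the pattern (plain Java, *not* HttpClient code; all names are invented for the sketch): a future-like object fires its callback on the completing thread, and the callback immediately triggers the next completion.
{code:java}
import java.util.function.Consumer;

// Toy stand-in for a future that fires its callback on the completing thread.
final class SyncFuture {
    private final Consumer<Integer> callback;

    SyncFuture(final Consumer<Integer> callback) {
        this.callback = callback;
    }

    void completed(final int value) {
        // No executor handoff: the listener runs right here, on the same stack.
        callback.accept(value);
    }
}

public class CallbackRecursionDemo {
    public static void main(final String[] args) {
        // The listener reacts to every completion by kicking off the "next
        // attempt", which again completes synchronously, so the stack only grows.
        final SyncFuture[] holder = new SyncFuture[1];
        holder[0] = new SyncFuture(attempt -> holder[0].completed(attempt + 1));
        holder[0].completed(0); // eventually throws StackOverflowError
    }
}
{code}
Each "attempt" adds stack frames and nothing ever unwinds, which is exactly the shape of the attached stack traces.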
I haven't been able to create a unit test that reproduces the issue locally, even though I tried multiple approaches (a flaky synthetic HTTP server, a randomly failing custom DNS resolver, thousands of scheduled requests, etc.). Currently I am running a few more benchmark tests to see whether the following change yields an improvement:
{code:java}
@Override
public void release(
        final AsyncConnectionEndpoint endpoint, final Object state, final TimeValue keepAlive) {
    // Hand the release off to another thread instead of invoking it
    // synchronously on the thread that is completing the future.
    CompletableFuture.runAsync(() -> super.release(endpoint, state, keepAlive));
}
{code}
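The override above lives in a subclass of the pooling connection manager on our side. For completeness, the sketch below shows roughly how such an override would be wired in with an explicit executor handoff; everything outside the overridden {{release}} method (class name, the dedicated executor, the builder wiring) is illustrative rather than our exact code.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager;
import org.apache.hc.client5.http.nio.AsyncConnectionEndpoint;
import org.apache.hc.core5.util.TimeValue;

public class ReleaseHandoffSketch {

    public static CloseableHttpAsyncClient create() {
        // Dedicated executor so that pool release (and the callbacks it fires)
        // never runs on the thread that has just completed a future.
        final ExecutorService releaseExecutor = Executors.newSingleThreadExecutor();

        final PoolingAsyncClientConnectionManager cm = new PoolingAsyncClientConnectionManager() {
            @Override
            public void release(final AsyncConnectionEndpoint endpoint, final Object state,
                                final TimeValue keepAlive) {
                releaseExecutor.execute(() -> super.release(endpoint, state, keepAlive));
            }
        };

        return HttpAsyncClients.custom()
                .setConnectionManager(cm)
                .build();
    }
}
{code}
Whether such a handoff should go to the common pool (which is what {{CompletableFuture.runAsync}} uses) or to a dedicated executor is part of what we are unsure about.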
Does someone have an idea what we are doing wrong? Is this a bug or a misconfiguration on our side? We have now switched to the {{LAX}} concurrency policy, which seems to mitigate the issue, but it does not fix the root cause and we still occasionally get the StackOverflowError. (I can see the lax pool also uses the synchronous release/fireCallbacks approach, etc.)
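For reference, this is roughly how we select the pool concurrency policy (builder and method names written from memory, so please treat them as approximate rather than our exact configuration):
{code:java}
import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager;
import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManagerBuilder;
import org.apache.hc.core5.pool.PoolConcurrencyPolicy;

public class PoolPolicySketch {

    static PoolingAsyncClientConnectionManager laxPool() {
        // STRICT (the default) was in use when the overflow first appeared;
        // LAX is what we switched to as a mitigation.
        return PoolingAsyncClientConnectionManagerBuilder.create()
                .setConnPoolPolicy(PoolConcurrencyPolicy.LAX)
                .build();
    }
}
{code}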
I have attached two stack traces (one with StrictConnPool and one with LaxConnPool).
> StackOverflowError in pool release sync chain
> ---------------------------------------------
>
> Key: HTTPCLIENT-2398
> URL: https://issues.apache.org/jira/browse/HTTPCLIENT-2398
> Project: HttpComponents HttpClient
> Issue Type: Bug
> Components: HttpClient (async)
> Affects Versions: 5.5
> Reporter: Stephan
> Priority: Major
> Attachments: stacktrace, stacktrace-2.rtf