[
https://issues.apache.org/jira/browse/HTTPCLIENT-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18023202#comment-18023202
]
Oleg Kalnichevski commented on HTTPCLIENT-2398:
-----------------------------------------------
[~eppdot] I think I have an idea of what is happening. Basically, your code
shoves thousands (many thousands?) of requests into the execution pipeline
without making any attempt to check whether those requests actually succeed.
HttpClient dutifully tries to process them all. At some point all those
requests start failing one after another in rapid succession, for a reason I
cannot see because the stack trace is incomplete or garbled, and eventually
the call stack overflows.
I will see if I can make HttpClient recover from this kind of condition by
itself, but in my opinion your code is clearly misusing the library. There are
better ways of executing massive numbers of requests without needlessly
overloading the execution pipeline; one such approach is sketched below.
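For example, something along these lines caps the number of in-flight requests
with a plain {{Semaphore}} and actually inspects failures. This is only a rough
sketch: the endpoint URL, the permit count, and the request count are
placeholders rather than values from this ticket.
{code:java}
import java.util.concurrent.Semaphore;

import org.apache.hc.client5.http.async.methods.SimpleHttpRequest;
import org.apache.hc.client5.http.async.methods.SimpleHttpResponse;
import org.apache.hc.client5.http.async.methods.SimpleRequestBuilder;
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
import org.apache.hc.core5.concurrent.FutureCallback;

public class BoundedAsyncRequests {

    public static void main(final String[] args) throws Exception {
        try (CloseableHttpAsyncClient client = HttpAsyncClients.createDefault()) {
            client.start();

            final int maxInFlight = 100;      // placeholder
            final int totalRequests = 10_000; // placeholder
            final Semaphore permits = new Semaphore(maxInFlight);

            for (int i = 0; i < totalRequests; i++) {
                permits.acquire(); // back-pressure: wait for a free slot
                final SimpleHttpRequest request = SimpleRequestBuilder
                        .get("http://localhost:8080/test") // placeholder endpoint
                        .build();
                client.execute(request, new FutureCallback<SimpleHttpResponse>() {

                    @Override
                    public void completed(final SimpleHttpResponse response) {
                        permits.release();
                    }

                    @Override
                    public void failed(final Exception ex) {
                        // Do not ignore failures; they explain cascades like this one.
                        System.err.println("Request failed: " + ex);
                        permits.release();
                    }

                    @Override
                    public void cancelled() {
                        permits.release();
                    }
                });
            }
            // Drain: wait until all outstanding requests have completed.
            permits.acquire(maxInFlight);
        }
    }
}
{code}
The acquire before each execute means the pipeline never holds more than
maxInFlight pending exchanges, no matter how many requests the benchmark
submits in total.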
This is the benchmark I use to load-test HttpClient [1], in case you are
wondering whether HttpClient has been tested under load.
Oleg
[1]
https://github.com/ok2c/httpclient-benchmark/blob/master/src/main/java/com/ok2c/http/client/benchmark/ApacheHttpAsyncClientV5.java
> StackOverflowError in pool release sync chain
> ---------------------------------------------
>
> Key: HTTPCLIENT-2398
> URL: https://issues.apache.org/jira/browse/HTTPCLIENT-2398
> Project: HttpComponents HttpClient
> Issue Type: Bug
> Components: HttpClient (async)
> Affects Versions: 5.5
> Reporter: Stephan
> Priority: Major
> Attachments: stacktrace, stacktrace-2.rtf
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Hello,
> I have investigated a tricky error we have experienced multiple times in our
> benchmarks (which use the async HttpClient to send thousands of requests to
> our clusters).
> You can find a deeper analysis
> [here|https://github.com/camunda/camunda/issues/34597#issuecomment-3301797932]
> in our issue tracker.
>
> *TL;DR:*
> The stack trace shows a *tight synchronous callback cycle inside
> HttpComponents' async path* that repeatedly alternates between
> {{completed → release/discard/fail → connect/proceedToNextHop → completed}},
> causing unbounded recursion until the JVM stack overflows.
>
> Concretely, the cycle is:
> * *{{AsyncConnectExec$1.completed}}* →
> {{InternalHttpAsyncExecRuntime$1.completed}} → {{BasicFuture.completed}}
> * {{PoolingAsyncClientConnectionManager}} lease/completed →
> {{StrictConnPool.fireCallbacks}} → {{StrictConnPool.release}} →
> {{PoolingAsyncClientConnectionManager.release}}
> * {{InternalHttpAsyncExecRuntime.discardEndpoint}} →
> {{InternalAbstractHttpAsyncClient$2.failed}} →
> {{AsyncRedirectExec/AsyncHttpRequestRetryExec/AsyncProtocolExec/AsyncConnectExec.failed}}
> → {{BasicFuture.failed}} / {{ComplexFuture.failed}} →
> {{PoolingAsyncClientConnectionManager$4.failed}} →
> {{DefaultAsyncClientConnectionOperator$1.failed}} →
> {{MultihomeIOSessionRequester.connect}} →
> {{DefaultAsyncClientConnectionOperator.connect}} →
> {{PoolingAsyncClientConnectionManager.connect}} →
> {{InternalHttpAsyncExecRuntime.connectEndpoint}} →
> {{AsyncConnectExec.proceedToNextHop}} → back to
> *{{AsyncConnectExec$1.completed}}*
>
>
> Possible concrete root causes:
> # *Synchronous BasicFuture callbacks*
> BasicFuture.completed() and .failed() invoke callbacks immediately on the
> thread that completes the future. If a callback in turn calls the pool's
> release(), which synchronously calls fireCallbacks(), the chain re-enters
> callback code without unwinding. The re-entrancy depth grows with each
> attempted connect/release cycle.
> # *Multihome connect tries multiple addresses on the same stack*
> MultihomeIOSessionRequester.connect will attempt alternate addresses (A/AAAA
> records). If an address fails quickly and the code immediately tries the next
> address by invoking connection-manager code and its callbacks synchronously,
> each attempt adds another layer of recursion.
> # *Retries/redirects executed synchronously*
> The exec chain (redirect → retry → protocol → connect) calls failed()
> listeners, which in turn call connect again. If those calls are synchronous,
> you get direct recursive invocation.
> # *Potential omission of an async boundary*
> A simple but dangerous pattern is: complete future → call listener → listener
> calls code that completes other futures → repeat. If there is no executor
> handoff, the recursion stays on the same thread (see the sketch after this
> list).
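> To make points 1 and 4 concrete, the following toy example (written for this
> report; the names are made up and this is not HttpClient code) shows how a
> future that fires its callback synchronously on the completing thread recurses
> without bound once the callback starts the next attempt:
> {code:java}
> public class SyncCallbackRecursion {
> 
>     interface Callback {
>         void completed(int attempt);
>     }
> 
>     // Toy "future": like BasicFuture, it invokes its callback synchronously
>     // on whatever thread completes it -- there is no executor handoff.
>     static final class SyncFuture {
>         private final Callback callback;
> 
>         SyncFuture(final Callback callback) {
>             this.callback = callback;
>         }
> 
>         void complete(final int attempt) {
>             callback.completed(attempt);
>         }
>     }
> 
>     public static void main(final String[] args) {
>         // The callback starts the "next attempt", which completes another
>         // future, which re-enters the callback: the stack never unwinds,
>         // mirroring completed -> connect -> completed from the stack trace.
>         final Callback retryLoop = new Callback() {
>             @Override
>             public void completed(final int attempt) {
>                 new SyncFuture(this).complete(attempt + 1);
>             }
>         };
>         new SyncFuture(retryLoop).complete(0); // eventually: StackOverflowError
>     }
> }
> {code}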
>
> I haven’t been able to create a unit test that reproduces the issue locally,
> even though I tried multiple approaches (a synthetic flaky HTTP server, a
> randomly failing custom DNS resolver, thousands of scheduled requests, etc.).
> Currently I am running a few more benchmark tests to see whether the
> following change yields an improvement:
> {code:java}
> @Override
> public void release(
>         final AsyncConnectionEndpoint endpoint,
>         final Object state,
>         final TimeValue keepAlive) {
>     // Hand the release off to another thread so that the pool's synchronous
>     // fireCallbacks() chain does not grow on the completing thread's stack.
>     CompletableFuture.runAsync(() -> super.release(endpoint, state, keepAlive));
> }
> {code}
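> (Without an explicit {{Executor}}, {{CompletableFuture.runAsync}} runs the
> task on the common {{ForkJoinPool}}; a dedicated executor could be passed as a
> second argument if that shared pool ever becomes a concern.)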
> Does someone have an idea of what we are doing wrong? Is this a bug or a
> misconfiguration on our side? We have now switched to the {{LAX}} concurrency
> policy (set up roughly as sketched below), which seems to mitigate the issue,
> but it does not fix the root cause and we still occasionally get the
> StackOverflowError. (I can see that the lax pool also uses the same
> synchronous release/fireCallbacks approach.)
> I have attached two stack traces (one with StrictConnPool and one with
> LaxConnPool).
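> For reference, this is roughly how the {{LAX}} policy is selected (a sketch;
> every builder setting other than the concurrency policy is left at its
> default):
> {code:java}
> import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
> import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
> import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager;
> import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManagerBuilder;
> import org.apache.hc.core5.pool.PoolConcurrencyPolicy;
> 
> // LAX gives each route its own pool with less lock contention; it mitigates
> // but does not remove the synchronous release/fireCallbacks chain.
> final PoolingAsyncClientConnectionManager cm =
>         PoolingAsyncClientConnectionManagerBuilder.create()
>                 .setPoolConcurrencyPolicy(PoolConcurrencyPolicy.LAX)
>                 .build();
> final CloseableHttpAsyncClient client = HttpAsyncClients.custom()
>         .setConnectionManager(cm)
>         .build();
> {code}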