I've spent two days trying to reproduce this issue, with no luck so far. We believe there is a race condition, or perhaps an exception path, that causes the connection pool to leak resources and/or miscount leased and available connections. I'll keep trying to find a repro, and I'll keep working with the handful of service owners who can reproduce the problem.
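For reference, this is roughly the shape of stress harness I've been using to hunt for the race: all workers block on a latch and then lease at the same instant, mimicking the priming burst of ~500 simultaneous requests described below. Note this is a self-contained sketch against a *toy* pool (plain JDK, not the real HttpClient internals), so it only illustrates the test pattern; the `ToyPool` class and its lease/release accounting are stand-ins, not the actual implementation.

```java
import java.util.ArrayDeque;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical toy pool standing in for the real connection pool.
// The lease/release accounting here is exactly what we suspect the
// real bug corrupts under concurrent load.
class ToyPool {
    private final ArrayDeque<Object> available = new ArrayDeque<>();
    private final int max;
    private int leased = 0;

    ToyPool(int max) { this.max = max; }

    synchronized Object lease() throws InterruptedException {
        while (leased >= max) wait();      // block until a lease slot frees up
        leased++;
        return available.isEmpty() ? new Object() : available.poll();
    }

    synchronized void release(Object conn) {
        leased--;
        available.push(conn);
        notifyAll();
    }

    synchronized int leasedCount() { return leased; }
}

public class PoolStress {
    public static void main(String[] args) throws Exception {
        final int threads = 500;           // mimic ~500 simultaneous priming requests
        ToyPool pool = new ToyPool(50);
        ExecutorService exec = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(threads);
        AtomicInteger errors = new AtomicInteger();

        for (int i = 0; i < threads; i++) {
            exec.submit(() -> {
                try {
                    start.await();         // every worker leases in the same instant
                    Object c = pool.lease();
                    try { /* simulated request */ } finally { pool.release(c); }
                } catch (Exception e) {
                    errors.incrementAndGet();
                } finally {
                    done.countDown();
                }
            });
        }
        start.countDown();                 // fire the burst
        done.await();
        exec.shutdown();

        // With correct accounting nothing stays leased; a buggy pool would
        // show leaked > 0 here. Prints: leaked=0 errors=0
        System.out.println("leaked=" + pool.leasedCount() + " errors=" + errors.get());
    }
}
```

The idea would be to swap the toy pool for a real `CloseableHttpClient` backed by a pooling connection manager and watch the leased count after the burst drains.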
On Sat, Mar 27, 2021 at 9:50 AM Ryan Schmitt <[email protected]> wrote:

> I'll work on it. In the meantime, a coworker made the following
> observation:
>
> "The 100% failures on host seem (I could be wrong) to have been reported
> only after the latest patch [to enable idle connection validation], not
> earlier. So I wonder if that latest commit specifically triggers this
> issue."
>
> On Sat, Mar 27, 2021 at 8:28 AM Oleg Kalnichevski <[email protected]> wrote:
>
>> On 3/26/2021 6:01 PM, Ryan Schmitt wrote:
>> > We've reverted to Apache 4 for now since we're continuing to see
>> > reports of this issue trickle in. Here's some more information I
>> > received:
>> >
>> > "I can tell you though what was triggering it. We had priming that
>> > sent several requests to the service at the exact same second, so we
>> > had up to 500 requests to that dependency at the same time (our hosts
>> > are pretty big - 48 physical cores), which I believe led to some race
>> > condition when freeing resources. Once we changed our priming so it's
>> > evenly distributed over time we stopped seeing the issue."
>>
>> This does look like a race condition of some sort, but I cannot really
>> say much without a reproducer.
>>
>> Is there any chance you could tweak my benchmark [1] to trigger the
>> same condition or put together a test case of your own?
>>
>> [1] https://github.com/ok2c/httpclient-benchmark
>>
>> Oleg
>>
>> > On Wed, Mar 24, 2021 at 11:44 AM Ryan Schmitt <[email protected]> wrote:
>> >
>> >> OK, I'll try to get more diagnostics next week when I get back from
>> >> vacation.
>> >>
>> >> On Wed, Mar 24, 2021 at 11:42 AM Oleg Kalnichevski <[email protected]> wrote:
>> >>
>> >>> On Wed, 2021-03-24 at 11:38 -0700, Ryan Schmitt wrote:
>> >>>> But how does that explain the client/connection pool locking up and
>> >>>> failing all requests until the service is restarted?
>> >>>
>> >>> It does not. Just an exception stack trace is not enough to make any
>> >>> semi-educated guesses here. A thread dump would likely help more.
>> >>>
>> >>> Oleg
>> >>>
>> >>>> On Wed, Mar 24, 2021 at 10:34 AM Oleg Kalnichevski <[email protected]> wrote:
>> >>>>>
>> >>>>> On 3/23/2021 11:13 PM, Ryan Schmitt wrote:
>> >>>>>> Oh, I actually *do* have more:
>> >>>>>>
>> >>>>>> Internal Failure java.lang.IllegalStateException: Endpoint not
>> >>>>>> acquired / already released
>> >>>>>>     at com.amazon.coral.apache.hc.client5.http.impl.classic.InternalExecRuntime.ensureValid(InternalExecRuntime.java:142)
>> >>>>>>     at com.amazon.coral.apache.hc.client5.http.impl.classic.InternalExecRuntime.connectEndpoint(InternalExecRuntime.java:172)
>> >>>>>
>> >>>>> Hi Ryan
>> >>>>>
>> >>>>> It looks like this is most likely to be happening when a request
>> >>>>> being executed gets aborted from another execution thread.
>> >>>>>
>> >>>>> Oleg
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: [email protected]
>> >>> For additional commands, e-mail: [email protected]
