I'll work on it. In the meantime, a coworker made the following observation:

"The 100% failures on host seem (I could be wrong) to have been reported only after the latest patch [to enable idle connection validation], not earlier. So I wonder if that latest commit is specifically triggering this issue"

On Sat, Mar 27, 2021 at 8:28 AM Oleg Kalnichevski <[email protected]> wrote:

>
> On 3/26/2021 6:01 PM, Ryan Schmitt wrote:
> > We've reverted to Apache 4 for now since we're continuing to see reports of
> > this issue trickle in. Here's some more information I received:
> >
> >> I can tell you though what was triggering it. We had priming that sent
> >> several requests to the service at the exact same second, so we had up to
> >> 500 requests to that dependency at same time (our hosts are pretty big - 48
> >> physical cores) which I believe led to some race condition when freeing
> >> resources. Once we changed our priming so it’s evenly distributed over time
> >> we stopped seeing the issue.
> >
>
> This does look like a race condition of some sort but I cannot really say
> much without a reproducer.
>
> Is there any chance you could tweak my benchmark [1] to trigger the
> same condition or put together a test case of your own?
>
> [1] https://github.com/ok2c/httpclient-benchmark
>
> Oleg
>
> >
> > On Wed, Mar 24, 2021 at 11:44 AM Ryan Schmitt <[email protected]> wrote:
> >
> >> OK, I'll try to get more diagnostics next week when I get back from
> >> vacation.
> >>
> >> On Wed, Mar 24, 2021 at 11:42 AM Oleg Kalnichevski <[email protected]>
> >> wrote:
> >>
> >>> On Wed, 2021-03-24 at 11:38 -0700, Ryan Schmitt wrote:
> >>>> But how does that explain the client/connection pool locking up and
> >>>> failing all requests until the service is restarted?
> >>>>
> >>>
> >>> It does not. Just an exception stack trace is not enough to make any
> >>> semi-educated guesses here. Thread dump would likely help more.
> >>>
> >>> Oleg
> >>>
> >>>
> >>>> On Wed, Mar 24, 2021 at 10:34 AM Oleg Kalnichevski <[email protected]>
> >>>> wrote:
> >>>>>
> >>>>> On 3/23/2021 11:13 PM, Ryan Schmitt wrote:
> >>>>>> Oh, I actually *do* have more:
> >>>>>>
> >>>>>> Internal Failure java.lang.IllegalStateException: Endpoint not acquired / already released
> >>>>>> at com.amazon.coral.apache.hc.client5.http.impl.classic.InternalExecRuntime.ensureValid(InternalExecRuntime.java:142)
> >>>>>> at com.amazon.coral.apache.hc.client5.http.impl.classic.InternalExecRuntime.connectEndpoint(InternalExecRuntime.java:172)
> >>>>>>
> >>>>>
> >>>>> Hi Ryan
> >>>>>
> >>>>> It looks like this is most likely to be happening when a request being
> >>>>> executed gets aborted from another execution thread.
> >>>>>
> >>>>> Oleg
> >>>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: [email protected]
> >>> For additional commands, e-mail: [email protected]
> >>>
