I'll work on it. In the meantime, a coworker made the following observation:

"The 100% failures on host seem (I could be wrong) to have been reported
only after the latest patch [to enable idle connection validation], not
earlier. So wonder if that latest commit is specifically triggering this issue"
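
As a possible starting point for a reproducer, the burst pattern from the priming report quoted below (hundreds of requests released at the same instant) could be sketched roughly like this. It is only a sketch under assumptions: `BurstPriming` and `runBurst` are hypothetical names, the `request` Runnable is a placeholder for a call made through the shared client, and nothing here is confirmed to trigger the failure.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class BurstPriming {

    /**
     * Starts `concurrency` threads, holds them at a gate until all are ready,
     * releases them together, and returns how many runs of `request` threw.
     */
    static int runBurst(int concurrency, Runnable request) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        CountDownLatch ready = new CountDownLatch(concurrency); // every worker has started
        CountDownLatch start = new CountDownLatch(1);           // gate opened once for all workers
        CountDownLatch done = new CountDownLatch(concurrency);  // every worker has finished
        AtomicInteger failures = new AtomicInteger();
        try {
            for (int i = 0; i < concurrency; i++) {
                pool.execute(() -> {
                    ready.countDown();
                    try {
                        start.await();
                        request.run();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        failures.incrementAndGet();
                    } catch (RuntimeException e) {
                        // e.g. IllegalStateException: Endpoint not acquired / already released
                        failures.incrementAndGet();
                    } finally {
                        done.countDown();
                    }
                });
            }
            ready.await();     // wait until all workers are parked at the gate
            start.countDown(); // release them at the same moment
            done.await();
        } finally {
            pool.shutdownNow();
        }
        return failures.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder request (assumption): replace the empty body with a real
        // request executed through the one shared client instance under test.
        int failures = runBurst(500, () -> { });
        System.out.println("failures: " + failures);
    }
}
```

The gate matters more than the thread count: without it the requests get staggered by thread start-up, which is close to the "evenly distributed" priming that made the problem go away.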

On Sat, Mar 27, 2021 at 8:28 AM Oleg Kalnichevski <[email protected]> wrote:

>
>
> On 3/26/2021 6:01 PM, Ryan Schmitt wrote:
> > We've reverted to Apache 4 for now since we're continuing to see reports of
> > this issue trickle in. Here's some more information I received:
> >
> >> I can tell you though what was triggering it. We had priming that sent
> > several requests to the service at the exact same second, so we had up to
> > 500 requests to that dependency at same time (our hosts are pretty big - 48
> > physical cores) which I believe led to some race condition when freeing
> > resources. Once we changed our priming so it’s evenly distributed over time
> > we stopped seeing the issue.
> >
>
> This does look like a race condition of some sort but I cannot really
> say much without a reproducer.
>
> Is there any chance you could tweak my benchmark [1] to trigger the
> same condition or put together a test case of your own?
>
> [1] https://github.com/ok2c/httpclient-benchmark
>
> Oleg
>
> >
> > On Wed, Mar 24, 2021 at 11:44 AM Ryan Schmitt <[email protected]> wrote:
> >
> >> OK, I'll try to get more diagnostics next week when I get back from
> >> vacation.
> >>
> >> On Wed, Mar 24, 2021 at 11:42 AM Oleg Kalnichevski <[email protected]>
> >> wrote:
> >>
> >>> On Wed, 2021-03-24 at 11:38 -0700, Ryan Schmitt wrote:
> >>>> But how does that explain the client/connection pool locking up and
> >>>> failing all requests until the service is restarted?
> >>>>
> >>>
> >>> It does not. Just an exception stack trace is not enough to make any
> >>> semi-educated guesses here. Thread dump would likely help more.
> >>>
> >>> Oleg
> >>>
> >>>
> >>>> On Wed, Mar 24, 2021 at 10:34 AM Oleg Kalnichevski <[email protected]>
> >>>> wrote:
> >>>>>
> >>>>> On 3/23/2021 11:13 PM, Ryan Schmitt wrote:
> >>>>>> Oh, I actually *do* have more:
> >>>>>>
> >>>>>> Internal Failure java.lang.IllegalStateException: Endpoint not acquired / already released
> >>>>>>    at com.amazon.coral.apache.hc.client5.http.impl.classic.InternalExecRuntime.ensureValid(InternalExecRuntime.java:142)
> >>>>>>    at com.amazon.coral.apache.hc.client5.http.impl.classic.InternalExecRuntime.connectEndpoint(InternalExecRuntime.java:172)
> >>>>>>
> >>>>>
> >>>>> Hi Ryan
> >>>>>
> >>>>> It looks like this is most likely to be happening when a request being
> >>>>> executed gets aborted from another execution thread.
> >>>>>
> >>>>> Oleg
> >
>
