I've spent two days trying to reproduce this issue, and I've had no luck so
far. We believe that there is a race condition, or maybe an exception path,
that causes the connection pool to leak resources and/or account
incorrectly for leased and available connections. I'll continue to try to
find a repro, and to work with the handful of service owners who can
reproduce the problem.
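
To make the suspected failure mode concrete, here's a toy sketch I put
together (my own simplification for illustration, not the actual
PoolingHttpClientConnectionManager internals): the pool has to keep
leased + available equal to capacity, and every lease/release must update
both counters atomically. If those updates race, or an exception path skips
the release, the counters drift in exactly the way described above, until
the pool reports no available connections and every request fails.

```java
// Toy pool sketch (illustration only; not the real httpclient5 pool).
// All accounting happens under one lock, so the invariant
// leased + available == capacity always holds. The race we suspect
// would correspond to updating these counters without that atomicity,
// or to an exception path that leaks a lease without a release.
final class TinyPool {
    private final int capacity;
    private int leased;
    private int available;

    TinyPool(int capacity) {
        this.capacity = capacity;
        this.available = capacity;
    }

    synchronized boolean lease() {
        if (available == 0) {
            return false; // pool exhausted; caller must wait or fail
        }
        available--;
        leased++;
        return true;
    }

    synchronized void release() {
        if (leased == 0) {
            // Guard against double-release, which would otherwise
            // push 'available' past capacity.
            throw new IllegalStateException("release without lease");
        }
        leased--;
        available++;
    }

    synchronized boolean invariantHolds() {
        return leased + available == capacity;
    }

    synchronized int leased()    { return leased; }
    synchronized int available() { return available; }
}
```

A repro, if we find one, would essentially be a workload that drives the
real pool into a state where this invariant no longer holds.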

On Sat, Mar 27, 2021 at 9:50 AM Ryan Schmitt <[email protected]> wrote:

> I'll work on it. In the meantime, a coworker made the following
> observation:
>
> "The 100% failures on host seem (I could be wrong) to have been reported
> only after the latest patch [to enable idle connection validation], not
> earlier. So I wonder if that latest commit specifically triggers this issue."
>
> On Sat, Mar 27, 2021 at 8:28 AM Oleg Kalnichevski <[email protected]>
> wrote:
>
>>
>>
>> On 3/26/2021 6:01 PM, Ryan Schmitt wrote:
>> > We've reverted to Apache 4 for now since we're continuing to see
>> > reports of this issue trickle in. Here's some more information I
>> > received:
>> >
>> >> I can tell you though what was triggering it. We had priming that
>> >> sent several requests to the service at the exact same second, so we
>> >> had up to 500 requests to that dependency at the same time (our hosts
>> >> are pretty big - 48 physical cores), which I believe led to some race
>> >> condition when freeing resources. Once we changed our priming so it’s
>> >> evenly distributed over time we stopped seeing the issue.
>> >
>>
>> This does look like a race condition of some sort but I cannot really
>> say much without a reproducer.
>>
>> Is there any chance you could tweak my benchmark [1] to trigger the
>> same condition or put together a test case of your own?
>>
>> [1] https://github.com/ok2c/httpclient-benchmark
>>
>> Oleg
>>
>> >
>> > On Wed, Mar 24, 2021 at 11:44 AM Ryan Schmitt <[email protected]>
>> wrote:
>> >
>> >> OK, I'll try to get more diagnostics next week when I get back from
>> >> vacation.
>> >>
>> >> On Wed, Mar 24, 2021 at 11:42 AM Oleg Kalnichevski <[email protected]>
>> >> wrote:
>> >>
>> >>> On Wed, 2021-03-24 at 11:38 -0700, Ryan Schmitt wrote:
>> >>>> But how does that explain the client/connection pool locking up and
>> >>>> failing all requests until the service is restarted?
>> >>>>
>> >>>
>> >>> It does not. An exception stack trace alone is not enough to make
>> >>> any semi-educated guesses here. A thread dump would likely help more.
>> >>>
>> >>> Oleg
>> >>>
>> >>>
>> >>>> On Wed, Mar 24, 2021 at 10:34 AM Oleg Kalnichevski <[email protected]>
>> >>>> wrote:
>> >>>>>
>> >>>>> On 3/23/2021 11:13 PM, Ryan Schmitt wrote:
>> >>>>>> Oh, I actually *do* have more:
>> >>>>>>
>> >>>>>> Internal Failure java.lang.IllegalStateException: Endpoint not
>> >>>>>> acquired / already released
>> >>>>>>    at com.amazon.coral.apache.hc.client5.http.impl.classic.InternalExecRuntime.ensureValid(InternalExecRuntime.java:142)
>> >>>>>>    at com.amazon.coral.apache.hc.client5.http.impl.classic.InternalExecRuntime.connectEndpoint(InternalExecRuntime.java:172)
>> >>>>>>
>> >>>>>
>> >>>>> Hi Ryan
>> >>>>>
>> >>>>> It looks like this is most likely to be happening when a request
>> >>>>> being
>> >>>>> executed gets aborted from another execution thread.
>> >>>>>
>> >>>>> Oleg
>> >>>
>> >
>>
>
