On Fri, Jul 31, 2020 at 12:45 PM Oleg Kalnichevski <ol...@apache.org> wrote:
> On Thu, 2020-07-30 at 13:28 -0400, Carter Kozak wrote:
> > Hello Friends,
> >
> > I recently had a service crash due to an out of memory error as the
> > result of a response leak. While this is caused by code that uses
> > closeable resources incorrectly, and that code should be fixed, I'd
> > like to discuss some ideas which could avoid catastrophic failure in
> > this scenario.
> >
> > A little background on my environment: I maintain an RPC library
> > which wraps an hc5 client. There's no limit set on the hc5 connection
> > pool (Integer.MAX_VALUE), and resource use is bounded using a
> > variation of additive-increase/multiplicative-decrease limiters to
> > scale based on load. Some endpoints are capable of returning a raw
> > binary stream, which introduces some risk of resource leaks -- ideally
> > we would take an InputStream consumer to fully own the lifecycle, but
> > unfortunately that's not an API change we can make at this point. We
> > return permits before passing the resulting InputStream to custom
> > code to prevent leaks from blocking the entire service.
> >
> > Unfortunately, leased connections are anchored directly to both
> > StrictConnPool and LaxConnPool in order to validate that connections
> > are returned to the pool where they originated, which prevents a
> > relatively large swath of resources from being garbage collected in
> > the event of a leak. I realize my case is a bit different due to the
> > effectively unlimited pool size, where memory becomes the limiting
> > factor rather than requests blocking against the pool limit, but I
> > believe there are a few options that can help us to both detect and
> > endure leaks.
> >
> > A couple of potential solutions:
> >
> > Use weak references to track leased connections, with a periodic sweep
> > to detect leaks and reclaim per-route permits.
> > This approach would be complex, but it allows the default
> > configuration to detect and handle rare response leaks without
> > eventually blocking connection requests. Using this approach, a
> > leaked connection wouldn't necessarily be closed when the pool is
> > shut down, but may survive until a full GC. In most cases I expect
> > clients to be long-lived, and I'm not sure what guarantees are made
> > about leaked connections.
> >
> > Use a counter to track leased connections instead of tracking leased
> > entries directly. This loses the safety guarantees that connections
> > aren't already leased, and that they're being returned to the pool
> > where they originated; however, it eliminates the need for the
> > connection pool's hash table and its references to checked-out
> > values. This solution would not be helpful when connections are
> > leaked and the route limit is reached, so it may make more sense as
> > a PoolConcurrencyPolicy.UNLIMITED option. This design would not be
> > able to close leased connections when the pool is closed; they would
> > remain open until they are released, at which point the closed pool
> > could close the incoming connection.
> >
> > Have you encountered similar problems? If you would consider a
> > contribution with one of these designs, or another I have not yet
> > considered, I will open a ticket and plan to implement a solution.
> > Any and all feedback is greatly appreciated.
> >
> > Thanks,
> > Carter Kozak
>
> Hi Carter
>
> Automatic recovery of leaked connections has been actively discussed
> in HC 3.x and early HC 4.x days without ever producing any practical
> outcome. It is _massively_ easier to just release connections than to
> jump through many hoops to try and reclaim what might or might not be
> some leaked connections.
>
> You are very welcome to give it a shot, but I would kindly ask you to
> start it off as a completely new connection manager initially.
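[For illustration, the weak-reference sweep Carter proposes might look roughly like the sketch below. This is hypothetical code, not anything from HttpClient: LeakAwarePool, its single global permit counter (a real pool would track per-route permits), and all names are invented for the example. The pool holds only WeakReferences to leased connections, so a leaked connection stays eligible for garbage collection, and a periodic sweep drains the ReferenceQueue to reclaim permits for connections the GC has collected.]

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the pool holds only weak references to leased
// connections, so a leaked connection can still be garbage collected.
// A periodic sweep drains the reference queue and reclaims permits.
final class LeakAwarePool<C> {
    private final ReferenceQueue<C> collected = new ReferenceQueue<>();
    private final Set<Reference<C>> leased = ConcurrentHashMap.newKeySet();
    private final AtomicInteger permits;

    LeakAwarePool(int maxPermits) {
        this.permits = new AtomicInteger(maxPermits);
    }

    /** Lease a connection created by the caller; it is tracked only weakly. */
    Reference<C> lease(C connection) {
        if (permits.getAndDecrement() <= 0) {
            permits.incrementAndGet();
            throw new IllegalStateException("pool exhausted");
        }
        Reference<C> ref = new WeakReference<>(connection, collected);
        leased.add(ref);
        return ref;
    }

    /** Normal release path: a well-behaved caller returns its lease. */
    void release(Reference<C> ref) {
        if (leased.remove(ref)) {
            permits.incrementAndGet();
        }
    }

    /** Periodic sweep: reclaim permits held by leaked, GC'd connections. */
    int sweep() {
        int reclaimed = 0;
        Reference<? extends C> ref;
        while ((ref = collected.poll()) != null) {
            if (leased.remove(ref)) {
                permits.incrementAndGet();
                reclaimed++;
            }
        }
        return reclaimed;
    }

    int availablePermits() {
        return permits.get();
    }
}
```

[Note the trade-off Carter mentions: once the referent has been collected, the sweep can only reclaim the permit; it cannot close the underlying socket, so the connection survives until finalization or a full GC.]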
> Once there is a working solution we can decide how to fold it into the
> project and in what form.

This might be made easier by using Apache Commons Pool, which can track
borrowed but abandoned objects:

    The pool can also be configured to detect and remove "abandoned"
    objects, i.e. objects that have been checked out of the pool but
    neither used nor returned before the configured removeAbandonedTimeout
    <https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/impl/AbandonedConfig.html#getRemoveAbandonedTimeout-->.
    Abandoned object removal can be configured to happen when borrowObject
    is invoked and the pool is close to starvation, or it can be executed
    by the idle object evictor, or both. If pooled objects implement the
    TrackedUse
    <https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/TrackedUse.html>
    interface, their last use will be queried using the getLastUsed method
    on that interface; otherwise abandonment is determined by how long an
    object has been checked out from the pool.

See https://commons.apache.org/proper/commons-pool/apidocs/index.html?org/apache/commons/pool2/impl/GenericObjectPool.html

Gary

> Cheers
>
> Oleg

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@hc.apache.org
For additional commands, e-mail: dev-h...@hc.apache.org
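[To make Gary's suggestion concrete, here is a minimal sketch of configuring abandoned-object removal with Commons Pool 2's AbandonedConfig, per the documentation quoted above. Conn, ConnFactory, the 30-second timeout, and the pool size are illustrative assumptions, not anything from HttpClient.]

```java
import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.AbandonedConfig;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

// Hypothetical stand-in for a pooled connection.
final class Conn {}

final class ConnFactory extends BasePooledObjectFactory<Conn> {
    @Override
    public Conn create() {
        return new Conn();
    }

    @Override
    public PooledObject<Conn> wrap(Conn conn) {
        return new DefaultPooledObject<>(conn);
    }
}

public class AbandonedDemo {

    // Builds a pool that treats objects checked out for more than 30
    // seconds without being used or returned as abandoned.
    static GenericObjectPool<Conn> buildPool() {
        AbandonedConfig abandoned = new AbandonedConfig();
        abandoned.setRemoveAbandonedOnBorrow(true);      // reclaim when borrowObject nears starvation
        abandoned.setRemoveAbandonedOnMaintenance(true); // also reclaim from the idle-object evictor
        abandoned.setRemoveAbandonedTimeout(30);         // seconds since last use
        abandoned.setLogAbandoned(true);                 // record the borrow-site stack trace of leaks

        GenericObjectPoolConfig<Conn> config = new GenericObjectPoolConfig<>();
        config.setMaxTotal(2);

        return new GenericObjectPool<>(new ConnFactory(), config, abandoned);
    }

    public static void main(String[] args) throws Exception {
        try (GenericObjectPool<Conn> pool = buildPool()) {
            Conn conn = pool.borrowObject();
            pool.returnObject(conn); // a well-behaved caller returns the object
        }
    }
}
```

[setLogAbandoned(true) is particularly relevant to the leak-hunting discussed above: it captures the stack trace of each borrow, so when an object is declared abandoned the pool can log where the leaking code path acquired it.]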