What I would prefer we do is not reinvent yet another pooling feature, with
all its potential bugs and added maintenance load, when that wheel has been
invented over and over. Personally, I would prefer that this project adopt
a FOSS pooling library and improve _it_ if needed, thereby improving the
whole ecosystem.

Gary

On Fri, Jul 31, 2020, 16:22 Carter Kozak <cko...@ckozak.net> wrote:

> On Fri, Jul 31, 2020, at 14:31, Gary Gregory wrote:
> > On Fri, Jul 31, 2020 at 12:45 PM Oleg Kalnichevski <ol...@apache.org>
> wrote:
> >
> > > On Thu, 2020-07-30 at 13:28 -0400, Carter Kozak wrote:
> > > > Hello Friends,
> > > >
> > > > I recently had a service crash due to an out-of-memory error as the
> > > > result of a response leak. While this was caused by code that uses
> > > > closeable resources incorrectly, and that code should be fixed, I'd
> > > > like to discuss some ideas that could avoid catastrophic failure in
> > > > this scenario.
> > > >
> > > > A little background on my environment: I maintain an RPC library
> > > > which wraps an hc5 client. There's no limit set on the hc5 connection
> > > > pool (Integer.MAX_VALUE), and resource use is bounded using a
> > > > variation of additive-increase/multiplicative-decrease limiters to
> > > > scale based on load. Some endpoints are capable of returning a raw
> > > > binary stream, which introduces some risk of resource leaks -- ideally
> > > > we would take an InputStream consumer to fully own the lifecycle, but
> > > > unfortunately that's not an API change we can make at this point. We
> > > > return permits before passing the resulting InputStream to custom
> > > > code so that leaks cannot block the entire service.
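> > > >
> > > > For illustration, here is a rough sketch of the kind of AIMD limiter
> > > > I mean (class and method names are hypothetical, not our actual
> > > > implementation):
> > > >
> > > >     import java.util.concurrent.atomic.AtomicInteger;
> > > >
> > > >     /** Hypothetical AIMD permit limiter, for illustration only. */
> > > >     final class AimdLimiter {
> > > >         private final AtomicInteger inFlight = new AtomicInteger();
> > > >         private final AtomicInteger limit;
> > > >
> > > >         AimdLimiter(int initialLimit) {
> > > >             this.limit = new AtomicInteger(initialLimit);
> > > >         }
> > > >
> > > >         /** Take a permit, or fail fast at the current limit. */
> > > >         boolean tryAcquire() {
> > > >             while (true) {
> > > >                 int current = inFlight.get();
> > > >                 if (current >= limit.get()) {
> > > >                     return false;
> > > >                 }
> > > >                 if (inFlight.compareAndSet(current, current + 1)) {
> > > >                     return true;
> > > >                 }
> > > >             }
> > > >         }
> > > >
> > > >         /** Additive increase: grow the limit by one on success. */
> > > >         void onSuccess() {
> > > >             inFlight.decrementAndGet();
> > > >             limit.incrementAndGet();
> > > >         }
> > > >
> > > >         /** Multiplicative decrease: halve the limit on overload. */
> > > >         void onOverload() {
> > > >             inFlight.decrementAndGet();
> > > >             limit.updateAndGet(l -> Math.max(1, l / 2));
> > > >         }
> > > >     }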
> > > >
> > > > Unfortunately leased connections are anchored directly to both
> > > > StrictConnPool and LaxConnPool in order to validate that connections
> > > > are returned to the pool where they originated, which prevents a
> > > > relatively large swath of resources from being garbage collected in
> > > > the event of a leak. I realize my case is a bit different due to the
> > > > effectively unlimited pool size, where memory becomes the limiting
> > > > factor rather than requests blocking against the pool limit, but I
> > > > believe there are a few options that can help us to both detect and
> > > > endure leaks.
> > > >
> > > > A couple potential solutions:
> > > >
> > > > Use weak references to track leased connections with a periodic sweep
> > > > to detect leaks and reclaim per-route permits. This approach would be
> > > > complex, but allows the default configuration to detect and handle
> > > > rare response leaks without eventually blocking connection requests.
> > > > Using this approach, a leaked connection wouldn't necessarily be
> > > > closed when the pool is shut down, but might survive until a full GC.
> > > > In most cases I expect clients to be long-lived, and I'm not sure
> > > > what guarantees are made about leaked connections.
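> > > >
> > > > Roughly, the weak-reference idea could look like the following sketch
> > > > (names are hypothetical, and a real pool would key on the route type
> > > > rather than a String):
> > > >
> > > >     import java.lang.ref.ReferenceQueue;
> > > >     import java.lang.ref.WeakReference;
> > > >
> > > >     final class LeakAwareTracker<C> {
> > > >
> > > >         /** Weakly referenced lease that remembers its route. */
> > > >         static final class Lease<C> extends WeakReference<C> {
> > > >             final String route;
> > > >             Lease(C conn, String route, ReferenceQueue<C> queue) {
> > > >                 super(conn, queue);
> > > >                 this.route = route;
> > > >             }
> > > >         }
> > > >
> > > >         private final ReferenceQueue<C> leaked =
> > > >                 new ReferenceQueue<>();
> > > >
> > > >         Lease<C> onLease(C conn, String route) {
> > > >             return new Lease<>(conn, route, leaked);
> > > >         }
> > > >
> > > >         /** Normal release: clear the reference so it never enqueues. */
> > > >         void onRelease(Lease<C> lease) {
> > > >             lease.clear();
> > > >             returnRoutePermit(lease.route);
> > > >         }
> > > >
> > > >         /** Periodic sweep: anything the GC enqueued was leaked. */
> > > >         @SuppressWarnings("unchecked")
> > > >         void sweep() {
> > > >             Lease<C> lease;
> > > >             while ((lease = (Lease<C>) leaked.poll()) != null) {
> > > >                 returnRoutePermit(lease.route);
> > > >             }
> > > >         }
> > > >
> > > >         private void returnRoutePermit(String route) {
> > > >             // Placeholder: decrement the leased count for the route.
> > > >         }
> > > >     }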
> > > >
> > > > Use a counter to track leased connections instead of tracking leased
> > > > entries directly. This loses the safety guarantees that connections
> > > > aren't already leased and that they're being returned to the pool
> > > > where they originated; however, it eliminates the need for the
> > > > connection pool's hash table and its references to checked-out
> > > > values. This solution would not be helpful when connections are
> > > > leaked and the route limit is reached, so it may make more sense as
> > > > a PoolConcurrencyPolicy.UNLIMITED option. This design would not be
> > > > able to close leased connections when the pool is closed; they would
> > > > remain open until they are released, at which point the closed pool
> > > > could close the incoming connection.
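> > > >
> > > > In sketch form (again with hypothetical names), the counting variant
> > > > is quite small:
> > > >
> > > >     import java.util.concurrent.ConcurrentHashMap;
> > > >     import java.util.concurrent.atomic.AtomicInteger;
> > > >
> > > >     /** Counts leases per route instead of anchoring leased entries,
> > > >      *  so a leaked connection remains eligible for GC. */
> > > >     final class CountingLeaseTracker {
> > > >         private final ConcurrentHashMap<String, AtomicInteger> leased =
> > > >                 new ConcurrentHashMap<>();
> > > >         private volatile boolean closed;
> > > >
> > > >         void onLease(String route) {
> > > >             leased.computeIfAbsent(route, r -> new AtomicInteger())
> > > >                     .incrementAndGet();
> > > >         }
> > > >
> > > >         /** Note: no check that the connection came from this pool. */
> > > >         void onRelease(String route, AutoCloseable conn)
> > > >                 throws Exception {
> > > >             leased.get(route).decrementAndGet();
> > > >             if (closed) {
> > > >                 // A leaked connection can only be closed once it is
> > > >                 // finally returned, as described above.
> > > >                 conn.close();
> > > >             }
> > > >         }
> > > >
> > > >         void close() {
> > > >             closed = true;
> > > >         }
> > > >     }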
> > > >
> > > > Have you encountered similar problems? If you would consider a
> > > > contribution with one of these designs or another I have not yet
> > > > considered, I will open a ticket and plan to implement a solution.
> > > > Any and all feedback is greatly appreciated.
> > > >
> > > > Thanks,
> > > > Carter Kozak
> > > >
> > >
> > > Hi Carter
> > >
> > > Automatic recovery of leaked connections has been actively discussed in
> > > HC 3.x and early HC 4.x days without ever producing any practical
> > > outcome. It is _massively_ easier to just release connections than to
> > > jump through many hoops trying to reclaim what might or might not be
> > > leaked connections.
> > >
> > > You are very welcome to give it a shot but I would kindly ask you to
> > > start it off as a completely new connection manager initially. Once
> > > there is a working solution we can decide how to fold it into the
> > > project and in what form.
> > >
> >
> > This might be made easier by using Apache Commons Pool which can track
> > borrowed but abandoned objects:
> >
> > The pool can also be configured to detect and remove "abandoned" objects,
> > i.e. objects that have been checked out of the pool but neither used nor
> > returned before the configured removeAbandonedTimeout
> > <https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/impl/AbandonedConfig.html#getRemoveAbandonedTimeout-->.
> > Abandoned object removal can be configured to happen when borrowObject is
> > invoked and the pool is close to starvation, or it can be executed by the
> > idle object evictor, or both. If pooled objects implement the TrackedUse
> > <https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/TrackedUse.html>
> > interface, their last use will be queried using the getLastUsed method on
> > that interface; otherwise abandonment is determined by how long an object
> > has been checked out from the pool.
> >
> > See
> > https://commons.apache.org/proper/commons-pool/apidocs/index.html?org/apache/commons/pool2/impl/GenericObjectPool.html
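> >
> > For example, something along these lines -- an untested sketch against
> > the commons-pool2 API, where the pooled type, factory, and timeout
> > values are made up:
> >
> >     import org.apache.commons.pool2.BasePooledObjectFactory;
> >     import org.apache.commons.pool2.PooledObject;
> >     import org.apache.commons.pool2.impl.AbandonedConfig;
> >     import org.apache.commons.pool2.impl.DefaultPooledObject;
> >     import org.apache.commons.pool2.impl.GenericObjectPool;
> >     import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
> >
> >     // Stand-in for a pooled connection.
> >     final class Conn {}
> >
> >     final class ConnFactory extends BasePooledObjectFactory<Conn> {
> >         @Override
> >         public Conn create() {
> >             return new Conn();
> >         }
> >
> >         @Override
> >         public PooledObject<Conn> wrap(Conn conn) {
> >             return new DefaultPooledObject<>(conn);
> >         }
> >     }
> >
> >     public final class AbandonedExample {
> >         public static void main(String[] args) throws Exception {
> >             AbandonedConfig abandoned = new AbandonedConfig();
> >             // Treat objects checked out longer than 300s as abandoned.
> >             abandoned.setRemoveAbandonedTimeout(300);
> >             // Sweep when borrowObject runs near starvation...
> >             abandoned.setRemoveAbandonedOnBorrow(true);
> >             // ...and from the idle object evictor.
> >             abandoned.setRemoveAbandonedOnMaintenance(true);
> >
> >             GenericObjectPoolConfig<Conn> config =
> >                     new GenericObjectPoolConfig<>();
> >             config.setTimeBetweenEvictionRunsMillis(30_000L);
> >
> >             try (GenericObjectPool<Conn> pool = new GenericObjectPool<>(
> >                     new ConnFactory(), config, abandoned)) {
> >                 Conn conn = pool.borrowObject();
> >                 pool.returnObject(conn); // a leak would be reclaimed
> >             }
> >         }
> >     }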
> >
> > Gary
> >
> > Cheers
> > >
> > > Oleg
> > >
> > >
> >
>
> Thank you both,
>
> Oleg, I agree that it's much easier to close resources than it is to write
> a pool that is capable of recovery, and we shouldn't take on undue
> complexity to force a solution to work. That said, the surface area of code
> that has the potential to leak (code that uses HC) is orders of magnitude
> larger than the connection pool, and a fix could reduce JIRA load on the
> project. I will plan to implement an initial experimental version as an
> alternative to PoolingHttpClientConnectionManager so we can avoid impacting
> users, and I can validate it in a high-load production environment before
> it is considered for a release.
>
> Gary, thanks for pointing out the abandonment feature in commons-pool! I
> worry about detection based on a timeout because it's difficult to be
> certain an object is abandoned. While multi-hour requests waiting for
> response headers aren't a good idea, I've run into a few places that depend
> on them and wouldn't want to break users.
>
> Carter
>
