On Fri, Jul 31, 2020, at 14:31, Gary Gregory wrote:
> On Fri, Jul 31, 2020 at 12:45 PM Oleg Kalnichevski <ol...@apache.org> wrote:
> > On Thu, 2020-07-30 at 13:28 -0400, Carter Kozak wrote:
> > > Hello Friends,
> > >
> > > I recently had a service crash due to an out-of-memory error as the result of a response leak. While this was caused by code that uses closeable resources incorrectly, and that code should be fixed, I'd like to discuss some ideas which could avoid catastrophic failure in this scenario.
> > >
> > > A little background on my environment: I maintain an RPC library which wraps an hc5 client. There's no limit set on the hc5 connection pool (Integer.MAX_VALUE), and resource use is bounded using a variation of additive-increase/multiplicative-decrease limiters to scale based on load. Some endpoints are capable of returning a raw binary stream, which introduces some risk of resource leaks -- ideally we would take an InputStream consumer to fully own the lifecycle, but unfortunately that's not an API change we can make at this point. We return permits before passing the resulting InputStream to custom code so that leaks can't block the entire service.
> > >
> > > Unfortunately, leased connections are anchored directly to both StrictConnPool and LaxConnPool in order to validate that connections are returned to the pool where they originated, which prevents a relatively large swath of resources from being garbage collected in the event of a leak. I realize my case is a bit different due to the effectively unlimited pool size, where memory becomes the limiting factor rather than requests blocking against the pool limit, but I believe there are a few options that can help us to both detect and endure leaks.
> > >
> > > A couple of potential solutions:
> > >
> > > Use weak references to track leased connections, with a periodic sweep to detect leaks and reclaim per-route permits. This approach would be complex, but it allows the default configuration to detect and handle rare response leaks without eventually blocking connection requests. Using this approach, a leaked connection wouldn't necessarily be closed when the pool is shut down, but may survive until a full GC. In most cases I expect clients to be long-lived, and I'm not sure what guarantees are made about leaked connections.
> > >
> > > Use a counter to track leased connections instead of tracking leased entries directly. This loses the safety guarantees that connections aren't already leased and that they're being returned to the pool where they originated, but it eliminates the need for the connection pool's hash table and its references to checked-out values. This solution would not be helpful when connections are leaked and the route limit is reached, so it may make more sense as a PoolConcurrencyPolicy.UNLIMITED option. This design would not be able to close leased connections when the pool is closed; they would remain open until they are released, at which point the closed pool could close the incoming connection.
> > >
> > > Have you encountered similar problems? If you would consider a contribution with one of these designs, or another I have not yet considered, I will open a ticket and plan to implement a solution. Any and all feedback is greatly appreciated.
> > >
> > > Thanks,
> > > Carter Kozak
> >
> > Hi Carter
> >
> > Automatic recovery of leaked connections was actively discussed in HC 3.x and early HC 4.x days without ever producing any practical outcome. It is _massively_ easier to just release connections than to jump through many hoops trying to reclaim what might or might not be leaked connections.
> >
> > You are very welcome to give it a shot, but I would kindly ask you to start it off as a completely new connection manager initially. Once there is a working solution we can decide how to fold it into the project and in what form.
>
> This might be made easier by using Apache Commons Pool, which can track borrowed but abandoned objects:
>
> The pool can also be configured to detect and remove "abandoned" objects, i.e. objects that have been checked out of the pool but neither used nor returned before the configured removeAbandonedTimeout <https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/impl/AbandonedConfig.html#getRemoveAbandonedTimeout-->. Abandoned object removal can be configured to happen when borrowObject is invoked and the pool is close to starvation, or it can be executed by the idle object evictor, or both. If pooled objects implement the TrackedUse <https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/TrackedUse.html> interface, their last use will be queried using the getLastUsed method on that interface; otherwise abandonment is determined by how long an object has been checked out from the pool.
>
> See https://commons.apache.org/proper/commons-pool/apidocs/index.html?org/apache/commons/pool2/impl/GenericObjectPool.html
>
> Gary
>
> > Cheers
> >
> > Oleg
Thank you both,

Oleg, I agree that it's much easier to close resources than it is to write a pool that is capable of recovery, and we shouldn't take on undue complexity to force a solution to work. That said, the surface area of code that has the potential to leak (any code that uses HC) is orders of magnitude larger than the connection pool itself, and a fix could reduce JIRA load on the project. I plan to implement an initial experimental version as an alternative to PoolingHttpClientConnectionManager so we can avoid impacting users, and I can validate it in a high-load production environment before it is considered for a release.

Gary, thanks for pointing out the abandonment feature in commons-pool! I worry about detection based on a timeout because it's difficult to be certain an object is abandoned. While multi-hour requests waiting for response headers aren't a good idea, I've run into a few places that depend on them and wouldn't want to break users.

I've put rough sketches of what I have in mind below my signature for reference.

Carter
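
Here is a rough sketch of the weak-reference tracking idea from my original mail. None of this is proposed API -- the class and method names are placeholders -- it just shows the mechanics: each lease is recorded through a weak reference, the normal release path returns the per-route permit, and a periodic sweep returns permits for leases that were garbage collected without ever being released.

import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/**
 * Sketch: tracks leased objects per route through weak references. If a
 * lease is garbage collected without being released (a leak), a periodic
 * sweep sees the cleared reference and returns the route permit so the
 * pool does not starve; the leaked connection itself is left to the GC.
 */
final class WeakLeaseTracker<T> {

    /** Weak handle to a leased object. */
    static final class LeaseRef<T> extends WeakReference<T> {
        LeaseRef(T referent, ReferenceQueue<T> queue) {
            super(referent, queue);
        }
    }

    private final ReferenceQueue<T> reclaimed = new ReferenceQueue<>();
    // Removing the map entry is the single point that decides who returns
    // the permit, so the release path and the sweep cannot both release it.
    private final Map<Reference<? extends T>, Semaphore> leased = new ConcurrentHashMap<>();

    /** Records a lease; the caller has already acquired a permit from routePermits. */
    LeaseRef<T> onLease(T connection, Semaphore routePermits) {
        LeaseRef<T> ref = new LeaseRef<>(connection, reclaimed);
        leased.put(ref, routePermits);
        return ref;
    }

    /** Normal release path: return the permit and stop tracking the lease. */
    void onRelease(LeaseRef<T> ref) {
        Semaphore routePermits = leased.remove(ref);
        if (routePermits != null) {
            routePermits.release();
        }
    }

    /** Periodic sweep: reclaim permits for leases that were GC'd without a release. */
    int sweep() {
        int leaks = 0;
        Reference<? extends T> ref;
        while ((ref = reclaimed.poll()) != null) {
            Semaphore routePermits = leased.remove(ref);
            if (routePermits != null) {
                routePermits.release();
                leaks++;
            }
        }
        return leaks; // worth surfacing as a metric or warning log
    }
}

The important property is that the pool only holds weak references to checked-out entries, so a leaked connection stays eligible for collection, and the sweep count gives us a concrete leak signal to log.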
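The counter-based alternative would be much simpler, at the cost of the safety checks and of being able to close leased connections when the pool shuts down. A minimal sketch (again, placeholder names, not proposed API):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Sketch: counts leased connections per route instead of holding references
 * to the leased entries. The pool can no longer verify that a released
 * connection really came from it, and it cannot close leased connections on
 * shutdown, but a leaked connection is not anchored to the pool and remains
 * eligible for garbage collection.
 */
final class CountingRoutePool<R> {

    private final Map<R, AtomicInteger> leasedPerRoute = new ConcurrentHashMap<>();

    /** Records a new lease and returns the current leased count for the route. */
    int lease(R route) {
        return leasedPerRoute.computeIfAbsent(route, r -> new AtomicInteger()).incrementAndGet();
    }

    /** Records a release; the connection itself is owned and closed by the caller. */
    void release(R route) {
        AtomicInteger count = leasedPerRoute.get(route);
        if (count != null) {
            // Never drop below zero, even if release is called for an unknown lease.
            count.updateAndGet(current -> Math.max(0, current - 1));
        }
    }

    /** Current number of leased connections for the route. */
    int leasedCount(R route) {
        AtomicInteger count = leasedPerRoute.get(route);
        return count == null ? 0 : count.get();
    }
}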
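And this is roughly how I understand the commons-pool2 abandonment feature Gary pointed to would be wired up. The pooled type, pool size, and timeouts below are placeholders, and some of these setters differ slightly between commons-pool2 versions, so treat it as a sketch rather than a drop-in:

import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.AbandonedConfig;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

/** Placeholder for whatever would actually be pooled, e.g. a managed connection. */
final class PooledThing {
}

final class PooledThingFactory extends BasePooledObjectFactory<PooledThing> {
    @Override
    public PooledThing create() {
        return new PooledThing();
    }

    @Override
    public PooledObject<PooledThing> wrap(PooledThing thing) {
        return new DefaultPooledObject<>(thing);
    }
}

public final class AbandonedPoolExample {
    public static void main(String[] args) throws Exception {
        GenericObjectPoolConfig<PooledThing> config = new GenericObjectPoolConfig<>();
        config.setMaxTotal(50);
        // Run the idle-object evictor periodically so abandonment is also
        // checked when borrowObject is not under pressure.
        config.setTimeBetweenEvictionRunsMillis(30_000L);

        AbandonedConfig abandoned = new AbandonedConfig();
        abandoned.setRemoveAbandonedOnBorrow(true);      // reclaim when the pool nears starvation
        abandoned.setRemoveAbandonedOnMaintenance(true); // reclaim from the evictor thread
        abandoned.setRemoveAbandonedTimeout(300);        // seconds since last use/borrow
        abandoned.setLogAbandoned(true);                 // log the stack trace of the borrower

        GenericObjectPool<PooledThing> pool =
                new GenericObjectPool<>(new PooledThingFactory(), config, abandoned);
        try {
            PooledThing borrowed = pool.borrowObject();
            // ... hand the object to application code; if that code never returns
            // it, the pool reclaims the slot once the abandonment timeout elapses.
            pool.returnObject(borrowed);
        } finally {
            pool.close();
        }
    }
}

This is where my timeout concern comes in: removeAbandonedTimeout works well when "checked out for N minutes" reliably means leaked, which isn't true for the long-running streaming requests I mentioned above.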