Hi Gary,

That sounds reasonable; however, there are some subtle distinctions between general-purpose object pools and those used for HTTP connections, perhaps the most obvious of which is HTTP/2 multiplexed connections. It's certainly possible that it could work, but I imagine it could become a large project and may result in something very specific to HTTP clients. Pools used by blocking and asynchronous clients may also differ substantially, so I'm not sure how general this implementation could be.
I'd be happy to help in any way I can if folks more familiar than I am with the current HC pool implementations are supportive of your proposal, but I'm worried that it may become a blocker that prevents incremental improvements. Ultimately, I don't feel comfortable taking a firm stance on this, and I trust the team to make the right decision. To keep the discussion concrete, I've appended a few rough sketches of the ideas from this thread at the end of this mail.

-ck

On Fri, Jul 31, 2020, at 17:38, Gary Gregory wrote:
> What I would prefer we do is not reinvent yet another pooling feature, with all its potential bugs and added maintenance load, when that wheel has been invented over and over. For me personally, I would prefer that this project adopt a FOSS pooling library and improve _it_ if that were needed, thereby improving the whole ecosystem.
>
> Gary
>
> On Fri, Jul 31, 2020, 16:22 Carter Kozak <cko...@ckozak.net> wrote:
>
> > On Fri, Jul 31, 2020, at 14:31, Gary Gregory wrote:
> > > On Fri, Jul 31, 2020 at 12:45 PM Oleg Kalnichevski <ol...@apache.org> wrote:
> > > >
> > > > On Thu, 2020-07-30 at 13:28 -0400, Carter Kozak wrote:
> > > > > Hello Friends,
> > > > >
> > > > > I recently had a service crash due to an out of memory error as the result of a response leak. While this is caused by code that uses closeable resources incorrectly, and that code should be fixed, I'd like to discuss some ideas which could avoid catastrophic failure in this scenario.
> > > > >
> > > > > A little background on my environment: I maintain an RPC library which wraps an hc5 client. There's no limit set on the hc5 connection pool (Integer.MAX_VALUE), and resource use is bounded using a variation of additive-increase/multiplicative-decrease limiters to scale based on load. Some endpoints are capable of returning a raw binary stream, which introduces some risk of resource leaks -- ideally we would take an InputStream consumer to fully own the lifecycle, but unfortunately that's not an API change we can make at this point. We return permits before passing the resulting InputStream to custom code to prevent leaks from blocking the entire service.
> > > > >
> > > > > Unfortunately, leased connections are anchored directly to both StrictConnPool and LaxConnPool in order to validate that connections are returned to the pool where they originated, which prevents a relatively large swath of resources from being garbage collected in the event of a leak. I realize my case is a bit different due to the effectively unlimited pool size, where memory becomes the limiting factor rather than requests blocking against the pool limit, but I believe there are a few options that can help us to both detect and endure leaks.
> > > > >
> > > > > A couple of potential solutions:
> > > > >
> > > > > Use weak references to track leased connections, with a periodic sweep to detect leaks and reclaim per-route permits. This approach would be complex, but it allows the default configuration to detect and handle rare response leaks without eventually blocking connection requests. Using this approach, a leaked connection wouldn't necessarily be closed when the pool is shut down, but may survive until a full GC. In most cases I expect clients to be long-lived, and I'm not sure what guarantees are made about leaked connections.
> > > > >
> > > > > Use a counter to track leased connections instead of tracking leased entries directly. This loses the safety guarantees that connections aren't already leased and that they're being returned to the pool where they originated; however, it eliminates the need for the connection pool's hash table and references to checked-out values. This solution would not be helpful when connections are leaked and the route limit is reached, so it may make more sense as a PoolConcurrencyPolicy.UNLIMITED option. This design would not be able to close leased connections when the pool is closed; they would remain open until they are released, at which point the closed pool could close the incoming connection.
> > > > >
> > > > > Have you encountered similar problems? If you would consider a contribution with one of these designs, or another I have not yet considered, I will open a ticket and plan to implement a solution. Any and all feedback is greatly appreciated.
> > > > >
> > > > > Thanks,
> > > > > Carter Kozak
> > > >
> > > > Hi Carter,
> > > >
> > > > Automatic recovery of leaked connections has been actively discussed in the HC 3.x and early HC 4.x days without ever producing any practical outcome. It is _massively_ easier to just release connections than to jump through many hoops to try and reclaim what might or might not be some leaked connections.
> > > >
> > > > You are very welcome to give it a shot, but I would kindly ask you to start it off as a completely new connection manager initially. Once there is a working solution we can decide how to fold it into the project and in what form.
> > >
> > > This might be made easier by using Apache Commons Pool, which can track borrowed but abandoned objects:
> > >
> > > The pool can also be configured to detect and remove "abandoned" objects, i.e. objects that have been checked out of the pool but neither used nor returned before the configured removeAbandonedTimeout <https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/impl/AbandonedConfig.html#getRemoveAbandonedTimeout-->. Abandoned object removal can be configured to happen when borrowObject is invoked and the pool is close to starvation, or it can be executed by the idle object evictor, or both. If pooled objects implement the TrackedUse <https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/TrackedUse.html> interface, their last use will be queried using the getLastUsed method on that interface; otherwise abandonment is determined by how long an object has been checked out from the pool.
> > >
> > > See https://commons.apache.org/proper/commons-pool/apidocs/index.html?org/apache/commons/pool2/impl/GenericObjectPool.html
> > >
> > > Gary
> > >
> > > > Cheers
> > > >
> > > > Oleg
> >
> > Thank you both,
> >
> > Oleg, I agree that it's much easier to close resources than it is to write a pool that is capable of recovery, and we shouldn't take on undue complexity to force a solution to work.
> > That said, the surface area of code that has the potential to leak (code that uses HC) is orders of magnitude larger than the connection pool, and a fix could reduce the JIRA load on the project. I will plan to implement an initial experimental version as an alternative to PoolingHttpClientConnectionManager so we can avoid impacting users, and I can validate it in a high-load production environment before it is considered for a release.
> >
> > Gary, thanks for pointing out the abandonment feature in commons-pool! I worry about detection based on a timeout because it's difficult to be certain an object is abandoned. While multi-hour requests waiting for response headers aren't a good idea, I've run into a few places that depend on them and wouldn't want to break users.
> >
> > Carter
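
To make the weak-reference idea from my original message a bit more concrete, here is a rough, untested sketch. Everything below is hypothetical -- none of these types exist in HttpCore, and a real implementation would still need to integrate with routing, shutdown, and pool statistics:

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: lease bookkeeping holds only weak references, so a
// leaked connection does not anchor the pool's data structures. A periodic
// sweep polls the reference queue and returns per-route permits for
// connections that were garbage collected without being released.
final class WeakLeaseTracker<C> {

    private static final class LeaseRef<C> extends WeakReference<C> {
        final String route;

        LeaseRef(C connection, String route, ReferenceQueue<C> queue) {
            super(connection, queue);
            this.route = route;
        }
    }

    private final ReferenceQueue<C> reclaimed = new ReferenceQueue<>();
    private final Map<String, AtomicInteger> leasedPerRoute = new ConcurrentHashMap<>();

    // Record a lease without holding a strong reference to the connection.
    LeaseRef<C> onLease(C connection, String route) {
        leasedPerRoute.computeIfAbsent(route, r -> new AtomicInteger()).incrementAndGet();
        return new LeaseRef<>(connection, route, reclaimed);
    }

    // Normal release: clearing the reference prevents it from being enqueued later.
    void onRelease(LeaseRef<C> ref) {
        ref.clear();
        leasedPerRoute.get(ref.route).decrementAndGet();
    }

    // Periodic sweep: anything enqueued here was leaked and then garbage collected.
    @SuppressWarnings("unchecked")
    int sweep() {
        int leaks = 0;
        LeaseRef<C> ref;
        while ((ref = (LeaseRef<C>) reclaimed.poll()) != null) {
            leasedPerRoute.get(ref.route).decrementAndGet();
            leaks++;
        }
        return leaks;
    }
}

The important property is that nothing here holds a strong reference to a leased connection, so a leak can still be garbage collected and its per-route permit recovered by the sweep; the caveat, as noted above, is that a leaked connection is only detected after a GC and may never be closed explicitly.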
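
The counter-based option could look roughly like the following -- again purely illustrative, with hypothetical names; it is essentially what a PoolConcurrencyPolicy.UNLIMITED-style mode might track internally, with no references to the leased connections themselves:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Purely illustrative: track only a per-route count of leased connections and
// hold no references to the leased objects. The pool cannot verify that a
// returned connection actually originated here, and it cannot close leased
// connections on shutdown -- the trade-offs described earlier in the thread.
final class CountingRoutePool {

    private final Map<String, AtomicInteger> leasedPerRoute = new ConcurrentHashMap<>();
    private final int maxPerRoute;

    CountingRoutePool(int maxPerRoute) {
        this.maxPerRoute = maxPerRoute; // Integer.MAX_VALUE for an "unlimited" policy
    }

    // Attempt to lease a connection for the given route without blocking.
    boolean tryLease(String route) {
        AtomicInteger leased = leasedPerRoute.computeIfAbsent(route, r -> new AtomicInteger());
        while (true) {
            int current = leased.get();
            if (current >= maxPerRoute) {
                // A leak that reaches the limit is never recovered, which is why this
                // mode probably only makes sense with an unlimited concurrency policy.
                return false;
            }
            if (leased.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    // Return a permit; note that nothing checks the connection came from this pool.
    void release(String route) {
        leasedPerRoute.get(route).decrementAndGet();
    }
}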
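
For reference, the commons-pool abandonment feature Gary pointed out is configured roughly as follows. This is illustrative only: Conn is a stand-in for whatever a commons-pool-backed connection manager would actually pool, and the 300-second timeout is exactly the kind of guess that worries me for long streaming requests:

import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.AbandonedConfig;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

// Illustrative placeholder for the pooled connection type.
final class Conn {}

final class ConnFactory extends BasePooledObjectFactory<Conn> {
    @Override
    public Conn create() {
        return new Conn();
    }

    @Override
    public PooledObject<Conn> wrap(Conn conn) {
        return new DefaultPooledObject<>(conn);
    }
}

final class AbandonedPoolExample {
    static GenericObjectPool<Conn> newPool() {
        GenericObjectPoolConfig<Conn> config = new GenericObjectPoolConfig<>();
        config.setMaxTotal(200);

        AbandonedConfig abandoned = new AbandonedConfig();
        // Treat anything checked out for more than 300 seconds as abandoned --
        // a guess that could break legitimate long-running requests.
        abandoned.setRemoveAbandonedTimeout(300);
        // Reclaim on borrow when the pool is nearly exhausted, and from the evictor.
        abandoned.setRemoveAbandonedOnBorrow(true);
        abandoned.setRemoveAbandonedOnMaintenance(true);
        // Log the stack trace captured at checkout when an object is reclaimed.
        abandoned.setLogAbandoned(true);

        return new GenericObjectPool<>(new ConnFactory(), config, abandoned);
    }
}

If I understand the docs correctly, logAbandoned records where the object was borrowed, which would at least make leaks much easier to diagnose even if we never enable automatic removal.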