Re: Limiting the blast radius of connection leaks

Gary Gregory Mon, 03 Aug 2020 10:39:48 -0700

Hi Carter,

I don't have a super strong feeling for this either. I just would like us
to try and reuse a library before reinventing the wheel, but my time is
limited, like everyone else.


Gary

On Mon, Aug 3, 2020, 13:19 Carter Kozak <cko...@ckozak.net> wrote:

> Hi Gary,
>
> That sounds reasonable, however there are some subtle distinctions
> between general purpose object pools and those used for http
> connections, perhaps the most obvious of which is http/2 multiplexed
> connections. It's certainly possible that it could work, but I imagine it
> could become a large project and may result in something very
> specific to http clients. Pools used by blocking and asynchronous
> clients may also differ substantially, so I'm not sure how general this
> implementation could be.
>
> I'd be happy to help in any way I can if folks more familiar than myself
> with the current hc pool implementations are supportive of your proposal,
> but I'm worried that it may become a blocker that prevents incremental
> improvements.
>
> In conclusion, I don't feel comfortable taking a stance on this and I trust
> the team to make the right decision.
>
> -ck
>
> On Fri, Jul 31, 2020, at 17:38, Gary Gregory wrote:
> > What I would prefer we do is not reinvent yet another pooling feature,
> all
> > its potential bugs and added maintenance load when that wheel's been
> > invented over and over. For me personally, I would prefer that this
> project
> > adopt a FOSS pooling library and improve _it_ if that were needed,
> thereby
> > improving the whole ecosystem.
> >
> > Gary
> >
> > On Fri, Jul 31, 2020, 16:22 Carter Kozak <cko...@ckozak.net> wrote:
> >
> > > On Fri, Jul 31, 2020, at 14:31, Gary Gregory wrote:
> > > > On Fri, Jul 31, 2020 at 12:45 PM Oleg Kalnichevski <ol...@apache.org
> >
> > > wrote:
> > > >
> > > > > On Thu, 2020-07-30 at 13:28 -0400, Carter Kozak wrote:
> > > > > > Hello Friends,
> > > > > >
> > > > > > I recently had a service crash due to an out of memory error as
> the
> > > > > > result of a response leak. While this is caused by code that uses
> > > > > > closeable resources incorrectly, and that code should be fixed,
> I'd
> > > > > > like to discuss some ideas which could avoid catastrophic
> failure in
> > > > > > this scenario.
> > > > > >
> > > > > > A little background on my environment: I maintain an RPC library
> > > > > > which wraps an hc5 client. There's no limit set on the hc5
> connection
> > > > > > pool (Integer.MAX_VALUE), and resource use is bounded using a
> > > > > > variation of additive-increase/multiplicative-decrease limiters
> to
> > > > > > scale based on load. Some endpoints are capable of returning a
> raw
> > > > > > binary stream, this introduces some risk of resource leaks --
> ideally
> > > > > > we would take an inputstream consumer to fully own the
> lifecycle, but
> > > > > > unfortunately that's not an API change we can make at this
> point. We
> > > > > > return permits before passing the resulting inputstream to custom
> > > > > > code to prevent leaks from blocking the entire service.
> > > > > >
> > > > > > Unfortunately leased connections are anchored directly to both
> > > > > > StrictConnPool and LaxConnPool in order to validate that
> connections
> > > > > > are returned to the pool where they originated, which prevents a
> > > > > > relatively large swath of resources from being garbage collected
> in
> > > > > > the event of a leak. I realize my case is a bit different due to
> the
> > > > > > effectively unlimited pool size, where memory becomes the
> limiting
> > > > > > factor rather than requests blocking against the pool limit, but
> I
> > > > > > believe there are a few options that can help us to both detect
> and
> > > > > > endure leaks.
> > > > > >
> > > > > > A couple potential solutions:
> > > > > >
> > > > > > Use weak references to track leased connections with a periodic
> sweep
> > > > > > to detect leaks and reclaim per-route permits. This approach
> would be
> > > > > > complex, but allows the default configuration to detect and
> handle
> > > > > > rare response leaks without eventually blocking connection
> requests.
> > > > > > Using this approach, a leaked connection wouldn't necessarily be
> > > > > > closed when the pool is shut down, but may survive until a full
> GC.
> > > > > > In most cases I expect clients to be long-lived, and I'm not sure
> > > > > > what guarantees are made about leaked connections.
> > > > > >
> > > > > > Use a counter to track leased connections instead of tracking
> leased
> > > > > > entries directly. This loses the safety guarantees that
> connections
> > > > > > aren't already leased, and that they're being returned to the
> pool
> > > > > > where they originated, however it eliminates the need for the
> > > > > > connection pools hash table and references to checked-out values.
> > > > > > This solution would not be helpful when connections are leaked
> and
> > > > > > the route limit is reached, so it may make more sense as a
> > > > > > PoolConcurrencyPolicy.UNLIMITED option. This design would not be
> able
> > > > > > to close leased connections when the pool is closed, they would
> > > > > > remain open until they are released, when the closed pool could
> close
> > > > > > the incoming connection.
> > > > > >
> > > > > > Have you encountered similar problems? If you would consider a
> > > > > > contribution with one of these designs or another I have not yet
> > > > > > considered, I will open a ticket and plan to implement a
> solution.
> > > > > > Any and all feedback is greatly appreciated.
> > > > > >
> > > > > > Thanks,
> > > > > > Carter Kozak
> > > > > >
> > > > >
> > > > > Hi Carter
> > > > >
> > > > > Automatic recovery of leaked connections has been actively
> discussed in
> > > > > HC 3.x and early HC 4.x days without ever producing any practical
> > > > > outcome. It is _massively_ easier to just release connections then
> jump
> > > > > through many hoops to try and reclaim what might or might not be
> some
> > > > > leaked connections.
> > > > >
> > > > > You are very welcome to give it a shot but I would kindly ask you
> to
> > > > > start it off as a completely new connection manager initially. Once
> > > > > there is a working solution we can decide how to fold it into the
> > > > > project and in what form.
> > > > >
> > > >
> > > > This might be made easier by using Apache Commons Pool which can
> track
> > > > borrowed but abandoned objects:
> > > >
> > > > The pool can also be configured to detect and remove "abandoned"
> objects,
> > > > i.e. objects that have been checked out of the pool but neither used
> nor
> > > > returned before the configured removeAbandonedTimeout
> > > > <
> > >
> https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/impl/AbandonedConfig.html#getRemoveAbandonedTimeout--
> > > >.
> > > > Abandoned object removal can be configured to happen when
> borrowObject is
> > > > invoked and the pool is close to starvation, or it can be executed
> by the
> > > > idle object evictor, or both. If pooled objects implement the
> TrackedUse
> > > > <
> > >
> https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/TrackedUse.html
> > > >
> > > > interface,
> > > > their last use will be queried using the getLastUsed method on that
> > > > interface; otherwise abandonment is determined by how long an object
> has
> > > > been checked out from the pool.
> > > >
> > > > See
> > > >
> > >
> https://commons.apache.org/proper/commons-pool/apidocs/index.html?org/apache/commons/pool2/impl/GenericObjectPool.html
> > > >
> > > > Gary
> > > >
> > > > Cheers
> > > > >
> > > > > Oleg
> > > > >
> > > > >
> > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: dev-unsubscr...@hc.apache.org
> > > > > For additional commands, e-mail: dev-h...@hc.apache.org
> > > > >
> > > > >
> > > >
> > >
> > > Thank you both,
> > >
> > > Oleg, I agree that it's much easier to close resources than it is to
> write
> > > a pool that is capable of recovery, and we shouldn't take on undue
> > > complexity to force a solution to work. That said, the surface area of
> code
> > > that has the potential to leak (code that uses HC) is orders of
> magnitude
> > > larger than the connection pool, and a fix could reduce JIRA load on
> the
> > > project. I will plan to implement an initial experimental version as an
> > > alternative to PoolingHttpClientConnectionManager so we can avoid
> impacting
> > > users, and I can validate it in a high-load production environment
> before
> > > it considered for a release.
> > >
> > > Gary, thanks for pointing out the abandonment feature in commons-pool!
> I
> > > worry about detection based on a timeout because it's difficult to be
> > > certain an object is abandoned. While multi-hour requests waiting for
> > > response headers aren't a good idea, I've run into a few places that
> depend
> > > on them and wouldn't want to break users.
> > >
> > > Carter
> > >
> >
>

Re: Limiting the blast radius of connection leaks

Reply via email to