On Fri, Jul 31, 2020, at 14:31, Gary Gregory wrote:
> On Fri, Jul 31, 2020 at 12:45 PM Oleg Kalnichevski <ol...@apache.org> wrote:
> > On Thu, 2020-07-30 at 13:28 -0400, Carter Kozak wrote:
> > > Hello Friends,
> > >
> > > I recently had a service crash due to an out-of-memory error as the result of a response leak. While this was caused by code that uses closeable resources incorrectly, and that code should be fixed, I'd like to discuss some ideas which could avoid catastrophic failure in this scenario.
> > >
> > > A little background on my environment: I maintain an RPC library which wraps an hc5 client. There's no limit set on the hc5 connection pool (Integer.MAX_VALUE), and resource use is bounded using a variation of additive-increase/multiplicative-decrease limiters to scale based on load. Some endpoints are capable of returning a raw binary stream, which introduces some risk of resource leaks -- ideally we would take an InputStream consumer to fully own the lifecycle, but unfortunately that's not an API change we can make at this point. We return permits before passing the resulting InputStream to custom code so that leaks can't block the entire service.
> > >
> > > Unfortunately, leased connections are anchored directly to both StrictConnPool and LaxConnPool in order to validate that connections are returned to the pool where they originated, which prevents a relatively large swath of resources from being garbage collected in the event of a leak. I realize my case is a bit different due to the effectively unlimited pool size, where memory becomes the limiting factor rather than requests blocking against the pool limit, but I believe there are a few options that can help us to both detect and endure leaks.
> > >
> > > A couple of potential solutions:
> > >
> > > Use weak references to track leased connections, with a periodic sweep to detect leaks and reclaim per-route permits. This approach would be complex, but it allows the default configuration to detect and handle rare response leaks without eventually blocking connection requests. Using this approach, a leaked connection wouldn't necessarily be closed when the pool is shut down, but may survive until a full GC. In most cases I expect clients to be long-lived, and I'm not sure what guarantees are made about leaked connections.
> > >
> > > Use a counter to track leased connections instead of tracking leased entries directly. This loses the safety guarantees that connections aren't already leased and that they're being returned to the pool where they originated, but it eliminates the need for the connection pool's hash table and its references to checked-out values. This solution would not be helpful when connections are leaked and the route limit is reached, so it may make more sense as a PoolConcurrencyPolicy.UNLIMITED option. This design would not be able to close leased connections when the pool is closed; they would remain open until they are released, at which point the closed pool could close the incoming connection.
> > >
> > > Have you encountered similar problems? If you would consider a contribution with one of these designs, or another I have not yet considered, I will open a ticket and plan to implement a solution. Any and all feedback is greatly appreciated.
> > >
> > > Thanks,
> > > Carter Kozak
> >
> > Hi Carter
> >
> > Automatic recovery of leaked connections was actively discussed in HC 3.x and early HC 4.x days without ever producing any practical outcome. It is _massively_ easier to just release connections than to jump through many hoops trying to reclaim what might or might not be leaked connections.
> >
> > You are very welcome to give it a shot, but I would kindly ask you to start it off as a completely new connection manager initially. Once there is a working solution we can decide how to fold it into the project and in what form.
>
> This might be made easier by using Apache Commons Pool, which can track borrowed but abandoned objects:
>
> The pool can also be configured to detect and remove "abandoned" objects, i.e. objects that have been checked out of the pool but neither used nor returned before the configured removeAbandonedTimeout <https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/impl/AbandonedConfig.html#getRemoveAbandonedTimeout-->. Abandoned object removal can be configured to happen when borrowObject is invoked and the pool is close to starvation, or it can be executed by the idle object evictor, or both. If pooled objects implement the TrackedUse <https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/TrackedUse.html> interface, their last use will be queried using the getLastUsed method on that interface; otherwise abandonment is determined by how long an object has been checked out from the pool.
>
> See https://commons.apache.org/proper/commons-pool/apidocs/index.html?org/apache/commons/pool2/impl/GenericObjectPool.html
>
> Gary
>
> > Cheers
> >
> > Oleg
Thank you both,

Oleg, I agree that it's much easier to close resources than it is to write a pool that is capable of recovery, and we shouldn't take on undue complexity to force a solution to work. That said, the surface area of code that has the potential to leak (any code that uses HC) is orders of magnitude larger than the connection pool itself, and a fix could reduce JIRA load on the project. I plan to implement an initial experimental version as an alternative to PoolingHttpClientConnectionManager so we can avoid impacting users, and I can validate it in a high-load production environment before it is considered for a release.

Gary, thanks for pointing out the abandonment feature in commons-pool! I worry about detection based on a timeout because it's difficult to be certain an object is abandoned. While multi-hour requests waiting for response headers aren't a good idea, I've run into a few places that depend on them and wouldn't want to break users.

I've put rough sketches of what I have in mind below my signature for reference.

Carter
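
Here is a rough sketch of the weak-reference tracking idea from my original mail. None of this is proposed API -- the class and method names are placeholders -- it just shows the mechanics: each lease is recorded through a weak reference, the normal release path returns the per-route permit, and a periodic sweep returns permits for leases that were garbage collected without ever being released.

import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/**
 * Sketch: tracks leased objects per route through weak references. If a
 * lease is garbage collected without being released (a leak), a periodic
 * sweep sees the cleared reference and returns the route permit so the
 * pool does not starve; the leaked connection itself is left to the GC.
 */
final class WeakLeaseTracker<T> {

    /** Weak handle to a leased object. */
    static final class LeaseRef<T> extends WeakReference<T> {
        LeaseRef(T referent, ReferenceQueue<T> queue) {
            super(referent, queue);
        }
    }

    private final ReferenceQueue<T> reclaimed = new ReferenceQueue<>();
    // Removing the map entry is the single point that decides who returns
    // the permit, so the release path and the sweep cannot both release it.
    private final Map<Reference<? extends T>, Semaphore> leased = new ConcurrentHashMap<>();

    /** Records a lease; the caller has already acquired a permit from routePermits. */
    LeaseRef<T> onLease(T connection, Semaphore routePermits) {
        LeaseRef<T> ref = new LeaseRef<>(connection, reclaimed);
        leased.put(ref, routePermits);
        return ref;
    }

    /** Normal release path: return the permit and stop tracking the lease. */
    void onRelease(LeaseRef<T> ref) {
        Semaphore routePermits = leased.remove(ref);
        if (routePermits != null) {
            routePermits.release();
        }
    }

    /** Periodic sweep: reclaim permits for leases that were GC'd without a release. */
    int sweep() {
        int leaks = 0;
        Reference<? extends T> ref;
        while ((ref = reclaimed.poll()) != null) {
            Semaphore routePermits = leased.remove(ref);
            if (routePermits != null) {
                routePermits.release();
                leaks++;
            }
        }
        return leaks; // worth surfacing as a metric or warning log
    }
}

The important property is that the pool only holds weak references to checked-out entries, so a leaked connection stays eligible for collection, and the sweep count gives us a concrete leak signal to log.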
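The counter-based alternative would be much simpler, at the cost of the safety checks and of being able to close leased connections when the pool shuts down. A minimal sketch (again, placeholder names, not proposed API):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Sketch: counts leased connections per route instead of holding references
 * to the leased entries. The pool can no longer verify that a released
 * connection really came from it, and it cannot close leased connections on
 * shutdown, but a leaked connection is not anchored to the pool and remains
 * eligible for garbage collection.
 */
final class CountingRoutePool<R> {

    private final Map<R, AtomicInteger> leasedPerRoute = new ConcurrentHashMap<>();

    /** Records a new lease and returns the current leased count for the route. */
    int lease(R route) {
        return leasedPerRoute.computeIfAbsent(route, r -> new AtomicInteger()).incrementAndGet();
    }

    /** Records a release; the connection itself is owned and closed by the caller. */
    void release(R route) {
        AtomicInteger count = leasedPerRoute.get(route);
        if (count != null) {
            // Never drop below zero, even if release is called for an unknown lease.
            count.updateAndGet(current -> Math.max(0, current - 1));
        }
    }

    /** Current number of leased connections for the route. */
    int leasedCount(R route) {
        AtomicInteger count = leasedPerRoute.get(route);
        return count == null ? 0 : count.get();
    }
}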
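And this is roughly how I understand the commons-pool2 abandonment feature Gary pointed to would be wired up. The pooled type, pool size, and timeouts below are placeholders, and some of these setters differ slightly between commons-pool2 versions, so treat it as a sketch rather than a drop-in:

import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.AbandonedConfig;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

/** Placeholder for whatever would actually be pooled, e.g. a managed connection. */
final class PooledThing {
}

final class PooledThingFactory extends BasePooledObjectFactory<PooledThing> {
    @Override
    public PooledThing create() {
        return new PooledThing();
    }

    @Override
    public PooledObject<PooledThing> wrap(PooledThing thing) {
        return new DefaultPooledObject<>(thing);
    }
}

public final class AbandonedPoolExample {
    public static void main(String[] args) throws Exception {
        GenericObjectPoolConfig<PooledThing> config = new GenericObjectPoolConfig<>();
        config.setMaxTotal(50);
        // Run the idle-object evictor periodically so abandonment is also
        // checked when borrowObject is not under pressure.
        config.setTimeBetweenEvictionRunsMillis(30_000L);

        AbandonedConfig abandoned = new AbandonedConfig();
        abandoned.setRemoveAbandonedOnBorrow(true);      // reclaim when the pool nears starvation
        abandoned.setRemoveAbandonedOnMaintenance(true); // reclaim from the evictor thread
        abandoned.setRemoveAbandonedTimeout(300);        // seconds since last use/borrow
        abandoned.setLogAbandoned(true);                 // log the stack trace of the borrower

        GenericObjectPool<PooledThing> pool =
                new GenericObjectPool<>(new PooledThingFactory(), config, abandoned);
        try {
            PooledThing borrowed = pool.borrowObject();
            // ... hand the object to application code; if that code never returns
            // it, the pool reclaims the slot once the abandonment timeout elapses.
            pool.returnObject(borrowed);
        } finally {
            pool.close();
        }
    }
}

This is where my timeout concern comes in: removeAbandonedTimeout works well when "checked out for N minutes" reliably means leaked, which isn't true for the long-running streaming requests I mentioned above.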