Hello Friends,

I recently had a service crash with an out-of-memory error as the result of a
response leak. While the root cause is code that uses closeable resources
incorrectly, and that code should be fixed, I'd like to discuss some ideas
that could avoid catastrophic failure in this scenario.

A little background on my environment: I maintain an RPC library that wraps an
hc5 client. There's no limit set on the hc5 connection pool
(Integer.MAX_VALUE), and resource use is bounded using a variation of
additive-increase/multiplicative-decrease limiters to scale based on load. Some
endpoints are capable of returning a raw binary stream, which introduces some
risk of resource leaks -- ideally we would take an InputStream consumer to
fully own the lifecycle, but unfortunately that's not an API change we can make
at this point. We return permits before passing the resulting InputStream to
custom code so that leaks don't block the entire service.
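
To make the failure mode concrete, here's a rough sketch of the pattern (the
class, endpoint, and helper names are made up for illustration; this is not
our actual wrapper code):

    import java.io.InputStream;

    import org.apache.hc.client5.http.classic.methods.HttpGet;
    import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
    import org.apache.hc.client5.http.impl.classic.CloseableHttpResponse;
    import org.apache.hc.client5.http.impl.classic.HttpClients;

    /** Hypothetical illustration of the leak scenario, not our real wrapper. */
    public final class StreamLeakExample {
        public static void main(String[] args) throws Exception {
            CloseableHttpClient client = HttpClients.createDefault();
            // Our wrapper returns its limiter permit around here, before the
            // stream is handed off, so a leak can't starve the limiter.
            CloseableHttpResponse response =
                    client.execute(new HttpGet("https://example.com/binary"));
            InputStream body = response.getEntity().getContent();
            // The stream is handed to custom code; if that code never closes
            // it, the leased connection stays anchored in the connection pool.
            consume(body);
        }

        private static void consume(InputStream body) {
            // Caller-owned code; closing the stream is its responsibility.
        }
    }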

Unfortunately, leased connections are anchored directly to both StrictConnPool
and LaxConnPool in order to validate that connections are returned to the pool
where they originated, which prevents a relatively large swath of resources
from being garbage collected in the event of a leak. I realize my case is a bit
different due to the effectively unlimited pool size, where memory becomes the
limiting factor rather than requests blocking against the pool limit, but I
believe there are a few options that could help us both detect and endure
leaks.
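
Roughly, the behavior I'm describing looks like this (a deliberately
simplified sketch, not the actual StrictConnPool/LaxConnPool code):

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    /**
     * Simplified sketch of the behavior described above: leased entries are
     * held strongly by the pool, so a response leak keeps the entry, its
     * connection, and its buffers reachable even though nothing else
     * references them.
     */
    final class StronglyAnchoredPool<E> {
        private final Set<E> leased = ConcurrentHashMap.newKeySet();

        void lease(E entry) {
            leased.add(entry);
        }

        void release(E entry) {
            // The strong reference is what lets the pool validate that the
            // entry really originated here -- and what defeats GC on a leak.
            if (!leased.remove(entry)) {
                throw new IllegalStateException("Entry does not belong to this pool");
            }
        }
    }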

A couple of potential solutions:

Use weak references to track leased connections, with a periodic sweep to
detect leaks and reclaim per-route permits. This approach would be complex, but
it would allow the default configuration to detect and handle rare response
leaks without eventually blocking connection requests. Using this approach, a
leaked connection wouldn't necessarily be closed when the pool is shut down,
but might survive until a full GC. In most cases I expect clients to be
long-lived, and I'm not sure what guarantees are made about leaked connections.
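
Something along these lines, as a very rough sketch of the mechanism (generic
code, not tied to the hc5 pool internals; all names are made up):

    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Semaphore;

    /**
     * Sketch: leases are tracked via weak references so that a lease dropped
     * without being released can be detected after GC and its permit reclaimed.
     */
    final class LeakDetectingLimiter<T> {
        private final Semaphore permits;
        private final ReferenceQueue<T> collected = new ReferenceQueue<>();
        // Hold the WeakReference wrappers (not the leases themselves) strongly
        // so they stay registered with the queue until the sweep removes them.
        private final Set<WeakReference<T>> outstanding = ConcurrentHashMap.newKeySet();

        LeakDetectingLimiter(int maxPermits) {
            this.permits = new Semaphore(maxPermits);
        }

        /** Acquire a permit and track the lease weakly. */
        WeakReference<T> lease(T lease) throws InterruptedException {
            permits.acquire();
            WeakReference<T> ref = new WeakReference<>(lease, collected);
            outstanding.add(ref);
            return ref;
        }

        /** Normal path: the caller returned the lease explicitly. */
        void release(WeakReference<T> ref) {
            if (outstanding.remove(ref)) {
                permits.release();
            }
        }

        /** Periodic sweep: any lease GC'd without release is a leak. */
        int sweep() {
            int leaks = 0;
            Reference<? extends T> ref;
            while ((ref = collected.poll()) != null) {
                if (outstanding.remove(ref)) {
                    permits.release();
                    leaks++;
                }
            }
            return leaks;
        }
    }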

Use a counter to track leased connections instead of tracking leased entries
directly. This loses the safety guarantees that connections aren't already
leased and that they're being returned to the pool where they originated;
however, it eliminates the need for the connection pool's hash table and its
references to checked-out values. This solution would not be helpful when
connections are leaked and the route limit is reached, so it may make more
sense as a PoolConcurrencyPolicy.UNLIMITED option. This design would not be
able to close leased connections when the pool is closed; they would remain
open until they are released, at which point the closed pool could close the
incoming connection.
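
For illustration, the accounting might look roughly like this (again a
hypothetical sketch, not a proposed hc5 API):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    /**
     * Sketch: accounts for leased connections with a per-route counter rather
     * than holding references to the leased entries. The pool can't verify
     * that a returned connection originated here, and can't close leased
     * connections on shutdown, but it also never retains a leaked connection
     * in memory.
     */
    final class CountingRoutePool<R> {
        private final Map<R, AtomicInteger> leasedPerRoute = new ConcurrentHashMap<>();
        private volatile boolean closed;

        /** Record a lease; returns the current lease count for the route. */
        int lease(R route) {
            return leasedPerRoute
                    .computeIfAbsent(route, r -> new AtomicInteger())
                    .incrementAndGet();
        }

        /**
         * Record a release. Returns false if the pool has since been closed,
         * in which case the caller should close the incoming connection
         * rather than reuse it.
         */
        boolean release(R route) {
            AtomicInteger count = leasedPerRoute.get(route);
            if (count != null) {
                count.decrementAndGet();
            }
            return !closed;
        }

        void close() {
            closed = true;
        }
    }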

Have you encountered similar problems? If you would consider a contribution
based on one of these designs, or another that I have not yet considered, I
will open a ticket and plan to implement a solution. Any and all feedback is
greatly appreciated.

Thanks,
Carter Kozak
