One thing that's interesting to me - it seems like all of the past mail threads were focused on a situation different from mine, with lots of discussion about freeing up resources like CPU.
In the outage I saw, the system was idle and we completely exhausted max_connections because every session was waiting on a row lock. Importantly, the app was closing these connections, but the sockets were stacking up on the server in CLOSE-WAIT state - and Postgres simply never cleaned them up until we had an outage. The backend processes were completely idle, waiting for a row lock that was never going to be released. With this GUC, the impact could have been isolated to the sessions hitting that row; instead it escalated to a full system outage.

It's pretty simple to reproduce this (a minimal sketch is also at the end of this message): https://github.com/ardentperf/pg-idle-test/tree/main/conn_exhaustion

On Thu, 5 Feb 2026 09:26:34 -0800 Jacob Champion <[email protected]> wrote:
> On Wed, Feb 4, 2026 at 9:30 PM Jeremy Schneider
> <[email protected]> wrote:
> > While a fix has been merged in pgx for the most direct root cause of
> > the incident I saw, this setting just seems like a good behavior to
> > make Postgres more robust in general.
>
> At the risk of making perfect the enemy of better, the protocol-level
> heartbeat mentioned in the original thread [1] would cover more use
> cases, which might give it a better chance of eventually becoming
> default behavior. It might also be a lot of work, though.

It seems like a fair bit of the discussion there is around OS coverage - even Thomas' message references keepalive working as expected on Linux. Tom objects in 2023 that "the default behavior would then be platform-dependent and that's a documentation problem we could do without." But it's been five years - has there been further work on implementing a Postgres-level heartbeat? And I see other places in the docs where we note platform differences; is it really such a big problem to change the default here?

On Thu, 5 Feb 2026 10:00:29 -0500 Greg Sabino Mullane <[email protected]> wrote:
> I'm a weak -1 on this. Certainly not 2s! That's a lot of context
> switching for a busy system for no real reason. Also see this past
> discussion:

In the other thread, the larger performance concerns seem to be with some early implementations, before the patch was refactored. Konstantin's message on 2019-08-02 said he didn't see much difference and that the value of the timeout didn't seem to matter; if anything, the marginal effect was simply from the presence of any timer at all (the same effect as setting statement_timeout). Later in that thread, it seems like Thomas also saw minimal performance concern.

I did see a real system outage that could have been prevented by an appropriate default value here, since I didn't yet know to change it.

-Jeremy
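
P.S. For anyone who doesn't want to clone the repo, here is a minimal sketch of the same failure mode. It's illustrative only, not the exact repro from the link above: it assumes Python with psycopg2, a local server, and a one-row table created with "CREATE TABLE t (id int); INSERT INTO t VALUES (1);" - the DSN and table name are placeholders.

    import select
    import psycopg2
    import psycopg2.extensions

    DSN = "dbname=postgres"  # placeholder - adjust for your environment

    def wait_ready(conn):
        # Poll an async psycopg2 connection until the pending operation completes
        while True:
            state = conn.poll()
            if state == psycopg2.extensions.POLL_OK:
                return
            elif state == psycopg2.extensions.POLL_READ:
                select.select([conn.fileno()], [], [])
            elif state == psycopg2.extensions.POLL_WRITE:
                select.select([], [conn.fileno()], [])

    # Session 1: take the row lock and never release it
    holder = psycopg2.connect(DSN)
    holder.cursor().execute("SELECT id FROM t WHERE id = 1 FOR UPDATE")

    # Sessions 2..N: block behind the lock, then hang up. The backends are
    # sleeping in the lock wait, so they never read their sockets and never
    # notice the disconnect; on the server the sockets sit in CLOSE-WAIT
    # while the idle backends keep consuming max_connections slots.
    for _ in range(50):
        waiter = psycopg2.connect(DSN, async_=1)
        wait_ready(waiter)  # finish the async connection handshake
        waiter.cursor().execute("SELECT id FROM t WHERE id = 1 FOR UPDATE")
        waiter.close()      # client side is gone; server backend still waits

    input("check `ss -t state close-wait` and pg_stat_activity, then press Enter")

After the loop, the clients are all gone, but the server should show dozens of CLOSE-WAIT sockets and the same number of backends waiting on the lock - the pile-up described above, in miniature.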
