Thanks Tom Lane for detailing the issue.

> So ... how many is "a bunch"?

More than 85.
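For reference, a query along these lines shows how many sessions a role is holding against its limit (a sketch; "user_app" is the role named in the FATAL line, and rolconnlimit = -1 means no per-role limit):

  -- open sessions per role vs. that role's connection limit
  SELECT r.rolname, r.rolconnlimit, count(a.pid) AS open_sessions
  FROM pg_roles r
  LEFT JOIN pg_stat_activity a ON a.usename = r.rolname
  WHERE r.rolname = 'user_app'
  GROUP BY r.rolname, r.rolconnlimit;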
> Looking at the code, it seems like it'd be possible for a sufficiently
> aggressive spawner of incoming connections to reach the
> MaxLivePostmasterChildren limit.  While the postmaster would correctly
> reject additional connection attempts after that, what it would not do
> is ensure that any child slots are left for new parallel worker processes.
> So we could hypothesize that the error you're seeing in the log is from
> failure to spawn a parallel worker process, due to being out of child
> slots.

We have enabled max_parallel_workers_per_gather = 4. We ran into this issue 20 days ago.

> However, given that max_connections = 500, MaxLivePostmasterChildren()
> would be 1000-plus.  This would mean that reaching this condition would
> require *at least* 500 concurrent connection-attempts-that-haven't-yet-
> been-rejected, maybe well more than that if you didn't have close to
> 500 legitimately open sessions.  That seems like a lot, enough to suggest
> that you've got some pretty serious bug in your client-side logic.

The errors below were observed in the Postgres log file after the crash:

ERROR: xlog flush request is not satisfied
-- seen for a couple of tables; we ran VACUUM FULL on those tables and the error went away after that.

ERROR: right sibling's left-link doesn't match: block 273660 links to 273500 instead of expected 273661 in index
-- seen while running VACUUM FREEZE on the database; we dropped that index and created a new one.

Observations: the VACUUM FREEZE ANALYZE job that is initiated through a cron job is getting stuck on the database side, and pg_cancel_backend() / pg_terminate_backend() are not able to terminate those stuck processes; only restarting the database clears them. I am thinking this is happening due to corruption (if that is true, how can I detect it? pg_dump?). Is there any way to overcome this problem? Would migrating the database to a new instance (pg_basebackup and switching over to the new instance) solve this issue?
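For reference, a sketch of the checks that can be run from SQL here (this assumes the contrib module amcheck is available on this PostgreSQL version, and the index name below is just a placeholder):

  -- amcheck verifies btree structure, including the left/right sibling
  -- links, and raises an error if an index is inconsistent
  CREATE EXTENSION IF NOT EXISTS amcheck;
  SELECT bt_index_check('idx_placeholder');

  -- shows which phase each running vacuum is in and which backend it
  -- belongs to, useful for seeing where the cron-initiated VACUUM FREEZE is stuck
  SELECT p.pid, a.query, p.phase, p.heap_blks_scanned, p.heap_blks_total
  FROM pg_stat_progress_vacuum p
  JOIN pg_stat_activity a USING (pid);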
> Anyway, I think it's clearly a bug that canAcceptConnections() thinks the
> number of acceptable connections is identical to the number of allowed
> child processes; it needs to be less, by the number of background
> processes we want to support.  But it seems like a darn hard-to-hit bug,
> so I'm not quite sure that that explains your observation.

On Fri, 4 Oct 2019 at 03:49, Tom Lane <t...@sss.pgh.pa.us> wrote:
> bhargav kamineni <bhargavpostg...@gmail.com> writes:
> >> What was the database doing just before the FATAL line?
>
> > Postgres was rejecting a bunch of connections from a user who is having a
> > connection limit set. that was the the FATAL error that i could see in log
> > file.
> > FATAL,53300,"too many connections for role ""user_app"""
>
> So ... how many is "a bunch"?
>
> Looking at the code, it seems like it'd be possible for a sufficiently
> aggressive spawner of incoming connections to reach the
> MaxLivePostmasterChildren limit.  While the postmaster would correctly
> reject additional connection attempts after that, what it would not do
> is ensure that any child slots are left for new parallel worker processes.
> So we could hypothesize that the error you're seeing in the log is from
> failure to spawn a parallel worker process, due to being out of child
> slots.
>
> However, given that max_connections = 500, MaxLivePostmasterChildren()
> would be 1000-plus.  This would mean that reaching this condition would
> require *at least* 500 concurrent connection-attempts-that-haven't-yet-
> been-rejected, maybe well more than that if you didn't have close to
> 500 legitimately open sessions.  That seems like a lot, enough to suggest
> that you've got some pretty serious bug in your client-side logic.
>
> Anyway, I think it's clearly a bug that canAcceptConnections() thinks the
> number of acceptable connections is identical to the number of allowed
> child processes; it needs to be less, by the number of background
> processes we want to support.  But it seems like a darn hard-to-hit bug,
> so I'm not quite sure that that explains your observation.
>
> regards, tom lane