On Wed, Apr 5, 2017 at 12:35 PM, Kuntal Ghosh <kuntalghosh.2...@gmail.com> wrote:
> On Tue, Apr 4, 2017 at 11:22 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> On 04/04/2017 06:52 PM, Robert Haas wrote:
>>>
>>> On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <kuntalghosh.2...@gmail.com>
>>> wrote:
>>>>
>>>> On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmh...@gmail.com>
>>>> wrote:
>>>>>
>>>>> On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
>>>>> <kuntalghosh.2...@gmail.com> wrote:
>>>>>>
>>>>>> 2. the server restarts automatically, initialize
>>>>>> BackgroundWorkerData->parallel_register_count and
>>>>>> BackgroundWorkerData->parallel_terminate_count in the shared memory.
>>>>>> After that, it calls ForgetBackgroundWorker and it increments
>>>>>> parallel_terminate_count.
>>>>>
>>>>>
>>>>> Hmm. So this seems like the root of the problem. Presumably those
>>>>> things need to be reset AFTER forgetting any background workers from
>>>>> before the crash.
>>>>>
>>>> IMHO, the fix would be not to increase the terminated parallel worker
>>>> count whenever ForgetBackgroundWorker is called due to a bgworker
>>>> crash. I've attached a patch for the same. PFA.
>>>
>>>
>>> While I'm not opposed to that approach, I don't think this is a good
>>> way to implement it. If you want to pass an explicit flag to
>>> ForgetBackgroundWorker telling it whether or not it should performing
>>> the increment, fine. But with what you've got here, you're
>>> essentially relying on "spooky action at a distance". It would be
>>> easy for future code changes to break this, not realizing that
>>> somebody's got a hard dependency on 0 having a specific meaning.
>>>
>>
>> I'm probably missing something, but I don't quite understand how these
>> values actually survive the crash. I mean, what I observed is OOM followed
>> by a restart, so shouldn't BackgroundWorkerShmemInit() simply reset the
>> values back to 0? Or do we call ForgetBackgroundWorker() after the crash for
>> some reason?
> AFAICU, during crash recovery, we wait for all non-syslogger children
> to exit, then reset shmem(call BackgroundWorkerShmemInit) and perform
> StartupDataBase. While starting the startup process we check if any
> bgworker is scheduled for a restart.
>
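
To make the ordering problem described in the quoted text concrete, here is a
minimal standalone sketch. It is not the bgworker.c code: only the two counter
names come from this thread, while WorkerCounts, the helper functions, and the
"register - terminate >= limit" admission check are simplifying assumptions
made for illustration. It shows that resetting the counters first and
forgetting a pre-crash parallel worker afterwards leaves the terminate count
ahead of the register count, so the unsigned difference wraps around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the two shared-memory counters discussed above. */
typedef struct
{
    uint32_t parallel_register_count;
    uint32_t parallel_terminate_count;
} WorkerCounts;

/* Models the shared-memory reset at restart: both counters go back to 0. */
static void
reset_counts(WorkerCounts *c)
{
    c->parallel_register_count = 0;
    c->parallel_terminate_count = 0;
}

/* Models forgetting a parallel worker: the terminate counter is bumped. */
static void
forget_parallel_worker(WorkerCounts *c)
{
    c->parallel_terminate_count++;
}

/* Workers the counters claim are active; unsigned, so it can wrap around. */
static uint32_t
active_parallel_workers(const WorkerCounts *c)
{
    return c->parallel_register_count - c->parallel_terminate_count;
}

/* Assumed admission check: refuse once the active count reaches the limit. */
static bool
can_register_parallel_worker(WorkerCounts *c, uint32_t max_parallel_workers)
{
    if (active_parallel_workers(c) >= max_parallel_workers)
        return false;
    c->parallel_register_count++;
    return true;
}

/* Replays the reported order: reset first, then forget a pre-crash worker. */
static WorkerCounts
replay_reset_then_forget(void)
{
    WorkerCounts c;

    reset_counts(&c);
    forget_parallel_worker(&c);
    /* The terminate count is now ahead of the register count. */
    assert(c.parallel_terminate_count > c.parallel_register_count);
    return c;
}
```

Under these assumptions, calling can_register_parallel_worker() on the result
of replay_reset_then_forget() fails for any realistic limit, whereas doing the
forget before the reset leaves both counters at 0 and registration succeeds.
That matches the suggestion upthread that the reset needs to happen after the
old workers are forgotten.
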

In general, your theory appears right, but can you check how it behaves
on a standby server? The startup process behaves differently on a master
and on a standby: on a master it stops after recovery, whereas on a
standby it keeps running to receive WAL.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers