On Sun, Feb 19, 2017 at 2:17 PM, Robert Haas <robertmh...@gmail.com> wrote: > Such a change can be made, but as I pointed out in the part you didn't > quote, there are reasons to wonder whether that will be a constructive > change in real life even if it's better for the regression tests. > Optimizing PostgreSQL for the use case of running regression tests in > the buildfarm at the expense of other use cases wouldn't be very > smart. Maybe such a change is better in real-world applications too, > but that deserves at least a little bit of thought and substantive > discussion.
Rewind. Wait a minute. Looking at this code again, it looks like we're supposed to ALREADY BE DOING THIS. DestroyParallelContext() calls WaitForParallelWorkersToExit() which calls WaitForBackgroundWorkerShutdown() for each worker. That function returns only when the postmaster dies (which causes an error with that specific complaint) or when GetBackgroundWorkerPid() sets the status to BGWH_STOPPED. GetBackgroundWorkerPid() only returns BGWH_STOPPED when either (a) handle->generation != slot->generation (meaning that the slot got reused, and therefore must have been freed) or when (b) slot->pid == 0. The pid only gets set to 0 in BackgroundWorkerStateChange() when slot->terminate is set, or in ReportBackgroundWorkerPID() when it's called from CleanupBackgroundWorker. So this function should not be returning until after all workers have actually exited. However, it looks like there's a race condition here, because the slot doesn't get freed up at the same time that the PID gets set to 0. That actually happens later, when the postmaster calls maybe_start_bgworker() or DetermineSleepTime() and one of those functions calls ForgetBackgroundWorker(). We could tighten this up by changing CleanupBackgroundWorker() to also call ForgetBackgroundWorker() immediately after calling ReportBackgroundWorker() if rw->rw_terminate || rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART. If we do that BEFORE sending the notification to the starting process, that closes this hole. Almost. There's still a possibility that the waiting process might receive SIGUSR1 for some unrelated reason and check the state of shared memory just after slot->pid == 0 and before slot->in_use is cleared and (also relevantly) just before BackgroundWorkerData->parallel_terminate_count is incremented. Fixing that seems a bit tougher, since we're certainly not going to have the postmaster start taking locks on shared data structures, but just forgetting the worker sooner would probably improve things a lot. I think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers