I wrote:
> Rémi Zara <remi_z...@mac.com> writes:
>> coypu was not stuck (no buildfarm related process running), but failed to
>> clean-up shared memory and semaphores.
>> I’ve done the clean-up.
> Huh, that's even more interesting.

I installed NetBSD 5.1.5 on an old Mac G4; I believe this is a
reasonable approximation to coypu's environment.  With the pselect
patch installed, I can replicate the behavior we saw in the buildfarm
of connections immediately failing with "the database system is
starting up".  Investigation shows that pselect reports ready sockets
correctly (which is what allows connections to get in at all), and it
does stop waiting either for a signal or for a timeout.  What it
forgets to do is to actually service the signal.  The observed behavior
is caused by the fact that reaper() is never called so the postmaster
never realizes that the startup process has finished.

I experimented with putting

	PG_SETMASK(&UnBlockSig);
	PG_SETMASK(&BlockSig);

immediately after the pselect() call, and found that indeed that lets
signals get serviced, and things work pretty much normally.  However,
closer inspection finds that pselect only stops waiting when a signal
arrives *while it's waiting*, not if there was a signal already
pending.  So this is actually even more broken than the so called
"non atomic" behavior we had expected to see --- at least with that,
the pending signal would have gotten serviced promptly, even if
ServerLoop itself didn't iterate.

This is all giving me less than warm fuzzy feelings about the state of
pselect support out in the real world.  So at this point we seem to
have three plausible alternatives:

1. Let HEAD stand as it is.  We have a problem with slow response to
bgworker start requests that arrive while ServerLoop is active, but
that's a pretty tight window usually (although I believe I've seen it
hit at least once in testing).

2. Reinstall the pselect patch, blacklisting NetBSD and HPUX and
whatever else we find to be flaky.  Then only the blacklisted platforms
have the problem.

3. Go ahead with converting the postmaster to use WaitEventSet, a la
the draft patch I posted earlier.
I'd be happy to do this if we were at the start of a devel cycle, but
right now seems a bit late --- not to mention that we really need to
fix 9.6 as well.

We could substantially ameliorate the slow-response problem by allowing
maybe_start_bgworker to launch multiple workers per call, which is
something I think we should do regardless.  (I have a patch written to
allow it to launch up to N workers per call, but have held off
committing that till after the dust settles in ServerLoop.)

I'm leaning to doing #1 plus the maybe_start_bgworker change.  There's
certainly room for difference of opinion here, though.  Thoughts?

			regards, tom lane
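
For anyone wanting to check another platform, the pselect behavior
described in this message (a signal that is already pending when
pselect() is entered, followed by the unblock/reblock workaround) could
be probed with something like the standalone sketch below.  This is
only an illustration under my own assumptions, not PostgreSQL code:
ProbeResult, note_signal and probe_pselect are hypothetical names, and
SIGUSR1 with a one-second timeout merely stands in for the signals and
timeout the postmaster really uses.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical names throughout; none of this is PostgreSQL code. */
typedef struct ProbeResult
{
	int			by_pselect;		/* handler ran during pselect() itself */
	int			pselect_eintr;	/* pselect() returned -1 with EINTR */
	int			after_unblock;	/* handler had run after unblock/reblock */
} ProbeResult;

static volatile sig_atomic_t got_signal = 0;

static void
note_signal(int signo)
{
	(void) signo;
	got_signal = 1;
}

/* Fills *res and returns 0; returns -1 if a setup call fails. */
static int
probe_pselect(ProbeResult *res)
{
	struct sigaction sa, old_sa;
	sigset_t	block_set, orig_mask, wait_mask, blocked_mask;
	struct timespec timeout = {1, 0};	/* give up after one second */
	int			rc;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = note_signal;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR1, &sa, &old_sa) != 0)
		return -1;

	/* Keep SIGUSR1 blocked outside the wait, much as ServerLoop does. */
	sigemptyset(&block_set);
	sigaddset(&block_set, SIGUSR1);
	if (sigprocmask(SIG_BLOCK, &block_set, &orig_mask) != 0)
		return -1;

	wait_mask = orig_mask;		/* mask used while waiting */
	sigdelset(&wait_mask, SIGUSR1);
	blocked_mask = orig_mask;	/* mask used everywhere else */
	sigaddset(&blocked_mask, SIGUSR1);

	/* Make the signal pending *before* pselect() is entered. */
	got_signal = 0;
	if (kill(getpid(), SIGUSR1) != 0)
		return -1;

	rc = pselect(0, NULL, NULL, NULL, &timeout, &wait_mask);
	res->pselect_eintr = (rc == -1 && errno == EINTR);
	res->by_pselect = got_signal;

	/* The workaround: unblock and immediately reblock. */
	sigprocmask(SIG_SETMASK, &wait_mask, NULL);
	sigprocmask(SIG_SETMASK, &blocked_mask, NULL);
	res->after_unblock = got_signal;

	/* Put the caller's signal state back the way we found it. */
	sigprocmask(SIG_SETMASK, &orig_mask, NULL);
	sigaction(SIGUSR1, &old_sa, NULL);
	return 0;
}
```

Calling probe_pselect() from a small main() and printing the three
fields should be enough.  On a platform whose pselect() behaves as
POSIX describes, I would expect by_pselect and pselect_eintr both to
come back 1.  The behavior reported above for NetBSD 5.1.5 would
instead show up as by_pselect = 0 after the full timeout, with
after_unblock = 1 once the mask has been opened and closed again.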