On Sun, May 4, 2014 at 9:57 AM, Petr Jelinek <p...@2ndquadrant.com> wrote: > On 30/04/14 23:35, Robert Haas wrote: >> On Mon, Apr 28, 2014 at 4:19 PM, Petr Jelinek <p...@2ndquadrant.com> >> wrote: >>> On 28/04/14 16:27, Robert Haas wrote: >>>> It seems we have consensus on what to do about this, but what we >>>> haven't got is a patch. >>> >>> >>> If you mean the consensus that exit status 0 should mean permanent stop >>> then >>> I think the patch can be as simple as attached. >> >> Hmm. Well, at the very least, you need to update the comment. > > Obviously, also docs, the above was just demonstration that the patch is > simple (and also all I could manage to write and test on the way from > airport to hotel). > > The attached patch is more proper.
Hmm, I see some problems here. The current comment for bgworker_quickdie() says that we do exit(0) there rather than exit(2) to avoid having the postmaster delay restart based on bgw_restart_time, but with the proposed change in the meaning of exit(2), that would have the effect of unregistering the worker, which I guess is why you've changed it to call exit(2) instead, but then the bgw_restart_delay applies, which I think we do not want. To make matters more confusing, the existing comment is actually only correct for non-shmem-connected workers - shmem-connected workers will treat an exit status of anything other than 0 or 1 in exactly the same fashion as a failure to disengage the deadman switch. Which brings up another point: the behavior of non-shmem-connected workers is totally bizarre. An exit status other than 0 or 1 is not treated as a crash requiring a restart, but failure to disengage the deadman switch is still treated as a crash requiring a restart. Why? If the workers are not shmem-connected, then no crash requires a system-wide restart. Of course, there's the tiny problem that we aren't actually unmapping shared memory from supposedly non-shmem connected workers, which is a different bug, but ignoring that for the moment there's no reason for this logic to be like this. What I'm inclined to do is change the logic so that: (1) After a crash-and-restart sequence, zero rw->rw_crashed_at, so that anything which is still registered gets restarted immediately. (2) If a shmem-connected backend fails to release the deadman switch or exits with an exit code other than 0 or 1, we crash-and-restart. A non-shmem-connected backend never causes a crash-and-restart. (3) When a background worker exits without triggering a crash-and-restart, an exit code of precisely 0 causes the worker to be unregistered; any other exit code has no special effect, so bgw_restart_time controls. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers