Re: [HACKERS] bgworker crashed or not?

Robert Haas Tue, 06 May 2014 10:06:17 -0700

On Sun, May 4, 2014 at 9:57 AM, Petr Jelinek <[email protected]> wrote:
> On 30/04/14 23:35, Robert Haas wrote:
>> On Mon, Apr 28, 2014 at 4:19 PM, Petr Jelinek <[email protected]>
>> wrote:
>>> On 28/04/14 16:27, Robert Haas wrote:
>>>> It seems we have consensus on what to do about this, but what we
>>>> haven't got is a patch.
>>>
>>>
>>> If you mean the consensus that exit status 0 should mean permanent stop
>>> then
>>> I think the patch can be as simple as attached.
>>
>> Hmm.  Well, at the very least, you need to update the comment.
>
> Obviously, also docs, the above was just demonstration that the patch is
> simple (and also all I could manage to write and test on the way from
> airport to hotel).
>
> The attached patch is more proper.


Hmm, I see some problems here.  The current comment for
bgworker_quickdie() says that we do exit(0) there rather than exit(2)
to avoid having the postmaster delay restart based on
bgw_restart_time, but with the proposed change in the meaning of
exit(2), that would have the effect of unregistering the worker, which
I guess is why you've changed it to call exit(2) instead, but then the
bgw_restart_delay applies, which I think we do not want.  To make
matters more confusing, the existing comment is actually only correct
for non-shmem-connected workers - shmem-connected workers will treat
an exit status of anything other than 0 or 1 in exactly the same
fashion as a failure to disengage the deadman switch.

Which brings up another point: the behavior of non-shmem-connected
workers is totally bizarre.  An exit status other than 0 or 1 is not
treated as a crash requiring a restart, but failure to disengage the
deadman switch is still treated as a crash requiring a restart.  Why?
If the workers are not shmem-connected, then no crash requires a
system-wide restart.  Of course, there's the tiny problem that we
aren't actually unmapping shared memory from supposedly non-shmem
connected workers, which is a different bug, but ignoring that for the
moment there's no reason for this logic to be like this.

What I'm inclined to do is change the logic so that:

(1) After a crash-and-restart sequence, zero rw->rw_crashed_at, so
that anything which is still registered gets restarted immediately.

(2) If a shmem-connected backend fails to release the deadman switch
or exits with an exit code other than 0 or 1, we crash-and-restart.  A
non-shmem-connected backend never causes a crash-and-restart.

(3) When a background worker exits without triggering a
crash-and-restart, an exit code of precisely 0 causes the worker to be
unregistered; any other exit code has no special effect, so
bgw_restart_time controls.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] bgworker crashed or not?

Reply via email to