Re: [HACKERS] Archiver not exiting upon crash

Jeff Janes Wed, 23 May 2012 13:34:54 -0700

On Wed, May 23, 2012 at 1:10 PM, Tom Lane <[email protected]> wrote:
> Jeff Janes <[email protected]> writes:
>> It looks to me like the SIGQUIT from the postmaster is simply getting
>> lost.  And from what little I understand of signal handling, this is a
>> known race with system(3).  The archive_command, child of archiver,
>> exits before it can receive the signal sent to the entire archiver
>> process group, so it doesn't set its exit status to show it was
>> signalled.  But the signal sent directly to the archiver reaches it
>> while it is still ignoring SIGQUITs.
>
> Ugh.
>
>> If the SIGQUIT is getting lost in a race, could it just be blocked
>> during the system(3) call?
>> I don't know what happens if you call system(3) with SIGQUIT being blocked.
>
> On my machine, man system(3) saith:
>
>     system() ignores the SIGINT and SIGQUIT signals, and blocks the
>     SIGCHLD signal, while waiting for the command to terminate.  If this
>     might cause the application to miss a signal that would have killed
>     it, the application should examine the return value from system() and
>     take whatever action is appropriate to the application if the command
>     terminated due to receipt of a signal.


But what happens if the SIGQUIT is blocked before the system(3) is
invoked?  Does the ignore take precedence over the block, or does the
block take precedence over the ignore, and so the signal is still
waiting once the block is reversed after the system(3) is over?  I
could write a test program to see, but its result on one platform
wouldn't be very good evidence of portability.
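
For what it's worth, such a test program might look like the sketch below. It installs a handler for SIGQUIT, blocks the signal, has the shell started by system(3) send SIGQUIT back to its parent, and then records whether the signal is still pending once system(3) returns and whether the handler runs after unblocking. The names probe_blocked_sigquit, ProbeResult and on_sigquit are made up for illustration. As far as I know, POSIX leaves it unspecified whether a signal generated while it is both blocked and ignored is discarded or stays pending, so whatever this reports on one machine says nothing about another.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/* Set by the handler if SIGQUIT is actually delivered. */
static volatile sig_atomic_t got_sigquit = 0;

static void
on_sigquit(int signo)
{
    (void) signo;
    got_sigquit = 1;
}

/* Hypothetical result record for the experiment. */
typedef struct
{
    int system_rc;      /* raw return value of system() */
    int pending_after;  /* 1 if SIGQUIT was pending after system() returned */
    int delivered;      /* 1 if the handler ran once SIGQUIT was unblocked */
} ProbeResult;

/* Returns 0 if the experiment ran, -1 if setup failed. */
static int
probe_blocked_sigquit(ProbeResult *res)
{
    struct sigaction sa;
    sigset_t    block, old, pending;

    got_sigquit = 0;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigquit;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGQUIT, &sa, NULL) != 0)
        return -1;

    /* Block SIGQUIT *before* calling system(). */
    sigemptyset(&block);
    sigaddset(&block, SIGQUIT);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    /*
     * Inside the shell, $PPID is this process, so the signal arrives
     * while system() has SIGQUIT ignored and we have it blocked.
     */
    res->system_rc = system("kill -QUIT $PPID");

    if (sigpending(&pending) != 0)
        return -1;
    res->pending_after = sigismember(&pending, SIGQUIT);

    /* Restore the old mask; a still-pending SIGQUIT is delivered here. */
    if (sigprocmask(SIG_SETMASK, &old, NULL) != 0)
        return -1;
    res->delivered = got_sigquit;
    return 0;
}
```

Calling probe_blocked_sigquit() from a small main() and printing the three fields would show what a given platform does. Note that this only probes the case where the signal arrives after system(3) has already set SIGQUIT to be ignored, not the case where it arrives in the window just before that.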

>
> Now, the code that directly calls system(), namely pgarch_archiveXlog(),
> knows this perfectly well, as per the comment at lines 590ff in HEAD.
> However, the code that *calls* it did not get the memo :-(, and appears
> to be willing to retry regardless.

But if the signal is lost, how could it know to do anything differently?
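
To make the point concrete, here is a rough sketch of the kind of check the man page asks for. This is not the code in pgarch_archiveXlog(); the names classify_system_status and CmdOutcome are invented for illustration, and treating an exit code above 128 as "killed by a signal" is only the usual shell convention, which I am assuming here.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical classification of what system() reported. */
typedef enum
{
    CMD_OK,         /* command exited with status 0 */
    CMD_FAILED,     /* command exited with an ordinary nonzero status */
    CMD_SIGNALLED,  /* command (or something under it) died from a signal */
    CMD_NOT_RUN     /* system() itself failed */
} CmdOutcome;

static CmdOutcome
classify_system_status(int rc)
{
    if (rc == -1)
        return CMD_NOT_RUN;
    if (WIFSIGNALED(rc))
        return CMD_SIGNALLED;
    if (WIFEXITED(rc))
    {
        int code = WEXITSTATUS(rc);

        if (code == 0)
            return CMD_OK;
        /* Assumed shell convention: 128 + signo for a signalled child. */
        if (code > 128)
            return CMD_SIGNALLED;
        return CMD_FAILED;
    }
    return CMD_FAILED;
}
```

In the race described above, the command has already exited normally by the time the signal is sent, so a check like this would return CMD_OK or CMD_FAILED and the caller would see nothing that points to a signal.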

>> Or maybe the postmaster should not be infinitely patient, but send
>> another round of signals after a brief delay.
>
> If the first one was ignored, later ones might be too.

True, but I don't see how it would make anything worse.  If the postmaster
is going to hang for eternity anyway, it might as well do something
every now and then.  Eventually a signal is highly likely to get
through.  (And in this case it would be almost certain to, as it is a
rare race and the archiver would run out of things to archive anyway.)
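
A minimal sketch of what I mean by "another round of signals", written as a standalone helper rather than as postmaster code: it signals a child, waits a short interval, checks whether the child has exited, and repeats. The function name resend_until_exit and its parameters are hypothetical, and the real postmaster is structured quite differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Send signo to child every interval_ms until it exits.
 * Returns the round (counting from 1) after which the child was reaped,
 * 0 if it was still alive after max_rounds, or -1 if kill() failed.
 */
static int
resend_until_exit(pid_t child, int signo, long interval_ms, int max_rounds)
{
    struct timespec delay;

    delay.tv_sec = interval_ms / 1000;
    delay.tv_nsec = (interval_ms % 1000) * 1000000L;

    for (int round = 1; round <= max_rounds; round++)
    {
        if (kill(child, signo) != 0)
            return -1;

        /* Give the child a moment to act on the signal. */
        nanosleep(&delay, NULL);

        /* Non-blocking check: has the child gone away yet? */
        if (waitpid(child, NULL, WNOHANG) == child)
            return round;
    }
    return 0;
}
```

A signal that lands during one of the windows where it is ignored costs one extra round here instead of an indefinite hang.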

>
> I'm inclined to think that we should change pgarch_archiveXlog to
> detect these specific signal conditions and just directly exit(),
> rather than giving its caller a chance to blow the decision.

But that is exactly the problem.  How do you detect a signal that gets lost?

I wonder if this is a quirk of my current hardware/kernel.  I have
occasionally seen hangs on other systems during this same exercise,
but they seemed much less common and I never investigated them.

Cheers,

Jeff
