Re: [HACKERS] Platform-dependent(?) failure in timeout handling
On Tue, Nov 26, 2013 at 06:50:28PM -0500, Tom Lane wrote:
> I believe the reason for this is the mechanism that I speculated about in
> that previous thread.  The platform is blocking SIGALRM while it executes
> handle_sig_alarm(), and that calls LockTimeoutHandler() which does
> kill(MyProcPid, SIGINT), and that SIGINT is being delivered immediately
> (or at least before we can get out of handle_sig_alarm).  So now the
> platform blocks SIGINT, too, and calls StatementCancelHandler(), which
> proceeds to longjmp out of the whole signal-handling call stack.  So the
> signal unblocking that would have happened after the handlers returned
> doesn't happen.
>
> In simpler cases we don't see an issue because the longjmp returns to the
> sigsetjmp(foo,1) call in postgres.c, which will result in restoring the
> signal mask that was active at that stack level, so we're all good.
> However, PG_TRY() uses sigsetjmp(foo,0), which means that no signal mask
> restoration happens if we catch the longjmp and don't ever re-throw it.
> Which is exactly what happens in plpgsql because of the EXCEPTION clause
> in the above example.
>
> I don't know how many platforms block signals during handlers in this
> way, but I'm seeing it on Linux (RHEL6.5 to be exact) and we know that at
> least OpenBSD acts likewise, so that's a pretty darn large chunk of the
> world.

Isn't this why sigsetjmp/siglongjmp were invented? Is there a situation
where you don't want the signal mask restored?

BTW, it seems that on BSD systems sigsetjmp == setjmp:
http://www.gnu.org/software/libc/manual/html_node/Non_002dLocal-Exits-and-Signals.html

Have a nice day,
--
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he
> does not attach much importance to his own thoughts.
  -- Arthur Schopenhauer
Re: [HACKERS] Platform-dependent(?) failure in timeout handling
Hi,

On 2013-11-26 18:50:28 -0500, Tom Lane wrote:
> I don't know how many platforms block signals during handlers in this
> way, but I'm seeing it on Linux (RHEL6.5 to be exact) and we know that at
> least OpenBSD acts likewise, so that's a pretty darn large chunk of the
> world.

Just as a datapoint, I think on Linux (and likely on any system using
sigaction()) that depends on how you set up the signal handler. There's a
flag that controls that behaviour, namely SA_NODEFER. When using signal(2)
the behaviour depends on the flavor of system we're running on, and even
on some #defines.

I am not really sure whether using SA_NODEFER would be a good idea, even
if we could achieve that behaviour on all platforms.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Platform-dependent(?) failure in timeout handling
Tom Lane <t...@sss.pgh.pa.us> wrote:
> 3. Establish a coding rule that if you catch an error with PG_TRY()
> and don't re-throw it, you have to unblock signals in your PG_CATCH
> block.

Could that be done in the PG_END_TRY macro?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Platform-dependent(?) failure in timeout handling
Kevin Grittner <kgri...@ymail.com> writes:
> Tom Lane <t...@sss.pgh.pa.us> wrote:
>> 3. Establish a coding rule that if you catch an error with PG_TRY() and
>> don't re-throw it, you have to unblock signals in your PG_CATCH block.

> Could that be done in the PG_END_TRY macro?

Interesting idea ... [ thinks for a bit ... ] but I'm not sure it's
really a great fix.  PG_END_TRY would have to unblock *all* signals,
I think, and it's not clear you want that in that mechanism.  Consider
a PG_TRY executing inside a signal handler.  It would be semantically
reasonable to use sigsetjmp(foo,1) in PG_TRY, but as I said, I'm trying
to avoid that on performance grounds.

In any case, after sleeping on it I've realized that the HOLD_INTERRUPTS
dance I proposed adding to handle_sig_alarm() is necessary but not
sufficient to deal with interruptions from SIGINT.  The real reason that
we need HOLD_INTERRUPTS there is that the signal handler may be executing
while ImmediateInterruptsOK is true.  So if a random SIGINT happened to
arrive while we're updating the timeout data structures, we could lose
control and leave partially updated (ie, totally corrupted) timeout info.
Adding HOLD_INTERRUPTS prevents that case.

But, imagine that SIGINT happens just as we enter handle_sig_alarm(),
before we can do HOLD_INTERRUPTS.  Then the SIGINT handler could longjmp
out of the SIGALRM handler; the data structures are OK, but we've
effectively lost that interrupt, and no new one is scheduled.  This isn't
much of a problem for the existing timer event handlers, because we'd
just cancel them anyway on the way out of the (sub)transaction, cf
LockErrorCleanup.  But it could be a problem for future usages of the
timer infrastructure.

In the case where we longjmp all the way out to the idle loop, we cancel
all pending timeouts anyway in postgres.c's sigsetjmp cleanup stanza.  It
would be reasonable to add a PG_SETMASK(UnBlockSig) in the same area.

I am thinking that what we want to do is add a signal unblock step to
(sub)transaction abort, too, as well as a call to timeout.c to let it
re-establish its timer interrupt in case one got lost.  This would fix
the problem as long as you assume that anyone catching an arbitrary error
condition will do a subtransaction abort to get back into a good state.
But we're requiring that already.

regards, tom lane
[HACKERS] Platform-dependent(?) failure in timeout handling
Dan Wood sent me off-list the test case mentioned in
http://www.postgresql.org/message-id/CAPweHKfExEsbACRXQTBdu_4QxhHk-Cic_iwmbh5XHo_0Z=q...@mail.gmail.com

I've been poking at it, and one of the failure modes I'm seeing is that
all the backends hang up without crashing.  I thought at first it was an
undetected deadlock, but what it turns out to be is breakage in timeout
processing.  The symptom is reminiscent of what we saw on OpenBSD awhile
back:
http://www.postgresql.org/message-id/5151eddd.9080...@kaltenbrunner.cc
namely that backends end up with SIGALRM blocked and therefore can't
process any more timeout interrupts.

I've been able to devise a repeatable test case: do this setup

create table lock_bug(f1 int, f2 int);

create or replace function timeoutfail() returns integer as $$
BEGIN
  SET lock_timeout = 100;
  BEGIN
    INSERT INTO lock_bug VALUES (1, 2);
  EXCEPTION
    WHEN OTHERS THEN return 0;
  END;
  return 1;
END;
$$ LANGUAGE plpgsql;

and then try timeoutfail() a few times:

regression=# select timeoutfail();
 timeoutfail
-------------
           1
(1 row)

regression=# select timeoutfail();
 timeoutfail
-------------
           1
(1 row)

All fine so far.  Now, in another session, do

regression=# begin;
BEGIN
regression=# lock table lock_bug;
LOCK TABLE

and then go back and try timeoutfail() some more.

regression=# select timeoutfail();
 timeoutfail
-------------
           0
(1 row)

regression=# select timeoutfail();
... hangs ...

After the call that returns zero, the process has SIGALRM and SIGINT
blocked, so the timeout that should happen in the next call never comes.

I believe the reason for this is the mechanism that I speculated about in
that previous thread.  The platform is blocking SIGALRM while it executes
handle_sig_alarm(), and that calls LockTimeoutHandler() which does
kill(MyProcPid, SIGINT), and that SIGINT is being delivered immediately
(or at least before we can get out of handle_sig_alarm).
So now the platform blocks SIGINT, too, and calls
StatementCancelHandler(), which proceeds to longjmp out of the whole
signal-handling call stack.  So the signal unblocking that would have
happened after the handlers returned doesn't happen.

In simpler cases we don't see an issue because the longjmp returns to the
sigsetjmp(foo,1) call in postgres.c, which will result in restoring the
signal mask that was active at that stack level, so we're all good.
However, PG_TRY() uses sigsetjmp(foo,0), which means that no signal mask
restoration happens if we catch the longjmp and don't ever re-throw it.
Which is exactly what happens in plpgsql because of the EXCEPTION clause
in the above example.

I don't know how many platforms block signals during handlers in this
way, but I'm seeing it on Linux (RHEL6.5 to be exact) and we know that at
least OpenBSD acts likewise, so that's a pretty darn large chunk of the
world.

There are several ways that we might address this:

1. Make PG_TRY() use sigsetjmp(foo,1) instead.

2. Establish a coding rule that signal handlers that throw longjmp must
unblock signals first.

3. Establish a coding rule that if you catch an error with PG_TRY() and
don't re-throw it, you have to unblock signals in your PG_CATCH block.

I don't especially like #1 because of performance considerations --- I
assume (perhaps wrongly) that saving and restoring the signal mask would
add noticeably to the cost of PG_TRY, and most of the time we don't need
that.  Also, it's arguable that we don't *want* signals to be unblocked
right away if we're dealing with some arbitrary error condition.  At
least, it seems scary for the PG_TRY mechanism to be enforcing a policy
that we unblock before any cleanup can happen.

#2 is more or less what I proposed in the prior thread.  I experimented
with putting PG_SETMASK(UnBlockSig); into ProcessInterrupts() before any
of the ereport(ERROR) cases, and verified that this seems to resolve the
issue.
One could argue that this approach has the same defect of unblocking
signals before any cleanup can happen --- but the scope for problems
seems considerably narrower than if we make PG_TRY() itself change
behavior.

#3 is probably logically cleanest, but I'm worried about the prospects
for missing some places that would need to be changed to conform to this
new coding rule.  In particular, any third-party PL that has something
corresponding to exception catching would have an issue.  (#2 is less bad
on the how-much-code-changes scale because there are very few signal
handlers that ought to be throwing longjmps.)

So on balance I like approach #2, but I wonder if anyone has a different
opinion.

Also, while looking at this example, I see that there's another issue
internal to timeout.c: if a timeout handler function throws a longjmp,
then handle_sig_alarm() loses control, which means that (1) any other
handler functions that should have fired at the same time won't, and
(2) we fail to reschedule the SIGALRM interrupt, meaning that we may
forget