The following reply was made to PR general/792; it has been noted by GNATS.
From: Dean Gaudet <[EMAIL PROTECTED]>
To: Nathan Kurz <[EMAIL PROTECTED]>
Subject: Re: general/792: race condition with SIGUSR1 graceful restart
Date: Thu, 26 Jun 1997 12:30:13 -0700 (PDT)
When it's doing deferred_die it doesn't want to take the signal
immediately, because it's at a point where it may be about to receive a
request. USR1 will still break it out of accept() with EINTR, though, and
it'll die then. So everything is fine once it hits accept().
Before it hits accept() it has the "die immediately" signal handler
installed. So it's fine from the top of the loop through to the signal()
call that changes the handlers. In addition, it checks the generation in
that stretch, so it catches any restart that occurred before the "die
immediately" handler was put in place.
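For anyone without http_main.c at hand, here is a minimal sketch of the two-handler scheme described above. It is not the Apache 1.2 code: `just_die`, `deferred_die_handler`, `install_usr1`, `child_iteration`, `demo_deferred_die` and `CHILD_TOLD_TO_DIE` are hypothetical names, the generation check is only a comment, and the handlers are installed with `sigaction()` without `SA_RESTART` so that a blocked `accept()` really fails with `EINTR` (with BSD-style `signal()` semantics the call may simply be restarted instead). The comment in `child_iteration` marks the window this PR is about.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define CHILD_TOLD_TO_DIE (-2)

/* Set by the deferred handler, polled by the accept() loop. */
static volatile sig_atomic_t deferred_die = 0;

/* Top of the loop: nothing is in flight yet, so exit at once. */
static void just_die(int sig) { (void)sig; _exit(0); }

/* Around accept(): only record the request; the loop acts on it. */
static void deferred_die_handler(int sig) { (void)sig; deferred_die = 1; }

/* No SA_RESTART, so a blocked accept() fails with EINTR instead of
 * being transparently restarted. */
static void install_usr1(void (*fn)(int)) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = fn;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);
}

/* One pass of a simplified child loop. Returns an accepted fd,
 * CHILD_TOLD_TO_DIE, or -1 for any other accept() failure. */
static int child_iteration(int sd) {
    install_usr1(just_die);
    /* ... generation check goes here: exit if a restart already happened ... */
    install_usr1(deferred_die_handler);
    /* Race window (PR 792): a SIGUSR1 arriving here only sets the flag,
     * and the accept() below still blocks until a client connects. */
    for (;;) {
        struct sockaddr_in sa_client;
        socklen_t clen = sizeof sa_client;
        int csd = accept(sd, (struct sockaddr *)&sa_client, &clen);
        if (csd >= 0 || errno != EINTR)
            return csd;                  /* a connection, or a real error */
        if (deferred_die)
            return CHILD_TOLD_TO_DIE;    /* EINTR caused by USR1 */
    }
}

/* Demo: a helper process sends SIGUSR1 while we are blocked in accept(). */
static int demo_deferred_die(void) {
    struct sockaddr_in addr;
    int sd = socket(AF_INET, SOCK_STREAM, 0);
    if (sd < 0)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* port 0: kernel picks */
    if (bind(sd, (struct sockaddr *)&addr, sizeof addr) != 0 || listen(sd, 5) != 0) {
        close(sd);
        return -1;
    }
    pid_t parent = getpid();
    pid_t helper = fork();
    if (helper == 0) {
        struct timespec delay = {0, 200000000L};    /* 200 ms: let parent block */
        nanosleep(&delay, NULL);
        kill(parent, SIGUSR1);
        _exit(0);
    }
    int r = (helper > 0) ? child_iteration(sd) : -1;
    if (helper > 0)
        waitpid(helper, NULL, 0);
    close(sd);
    return r;
}
```

If the helper's signal instead landed between the second `install_usr1()` call and `accept()`, the flag would be set but the child would stay in `accept()`, which is the behaviour the PR reports.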
So the only place where there's a problem is between the signal(deferred)
call and the accept(). But if we do what you suggest then we run into a
problem *after* the accept call. We may have accepted a connection, and
then get hit with a signal before we can disable the longjmp. That is
something I deliberately tried to avoid. We also can't trust the values
of any non-volatile local variables after the longjmp, so it'd be hard to
know whether we accepted a connection or are just supposed to die.
I've been able to run a "while 1; kill -USR1" loop against the server and
surf without a broken link with the current code. But before I had the
deferred stuff in there I did have the slight race condition I talk about
above -- where an accept may succeed and then get signalled to death. And
when that was in there I did get broken links while surfing.
On architectures with serialization in that loop the current code lets one
child live for at most one more request. On other architectures, where
everyone gets plopped into accept() and the OS gets to wake 'em up, it's
possible for some children to be stuck "gracefully exiting" if the OS
starves them at the accept().
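"Serialization in that loop" means putting a lock around accept(), so that only one child at a time sits in accept() and the rest wait on the lock. The following is a minimal sketch of that pattern using an `fcntl()` record lock; it is an illustration under assumptions, not Apache's code. `set_accept_lock`, `serialized_accept` and `demo_serialized_accept` are hypothetical names, and the lock file is an anonymous `tmpfile()` purely for the demo.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Take (F_WRLCK) or release (F_UNLCK) a write lock on the whole file.
 * F_SETLKW blocks until the lock is granted; it can fail with EINTR. */
static int set_accept_lock(int lock_fd, int type) {
    struct flock fl;
    memset(&fl, 0, sizeof fl);
    fl.l_type = (short)type;
    fl.l_whence = SEEK_SET;   /* l_start = l_len = 0 covers the whole file */
    return fcntl(lock_fd, F_SETLKW, &fl);
}

/* Only the lock holder sits in accept(); other children wait in fcntl().
 * Failures (including EINTR) go back to the caller, which can then
 * check whether it has been told to die. */
static int serialized_accept(int lock_fd, int sd) {
    struct sockaddr_in sa_client;
    socklen_t clen = sizeof sa_client;
    if (set_accept_lock(lock_fd, F_WRLCK) < 0)
        return -1;
    int csd = accept(sd, (struct sockaddr *)&sa_client, &clen);
    int saved_errno = errno;
    set_accept_lock(lock_fd, F_UNLCK);   /* release even if accept() failed */
    errno = saved_errno;
    return csd;
}

/* Demo: queue one loopback connection, then accept it under the lock.
 * Returns 1 if a connection was accepted. */
static int demo_serialized_accept(void) {
    FILE *lock = tmpfile();              /* stand-in for a real lock file */
    int sd = socket(AF_INET, SOCK_STREAM, 0);
    int cl = socket(AF_INET, SOCK_STREAM, 0);
    int csd = -1;
    struct sockaddr_in addr;
    socklen_t alen = sizeof addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (lock && sd >= 0 && cl >= 0 &&
        bind(sd, (struct sockaddr *)&addr, sizeof addr) == 0 &&
        listen(sd, 5) == 0 &&
        getsockname(sd, (struct sockaddr *)&addr, &alen) == 0 &&
        connect(cl, (struct sockaddr *)&addr, sizeof addr) == 0)
        csd = serialized_accept(fileno(lock), sd);
    int accepted = csd >= 0;
    if (csd >= 0) close(csd);
    if (cl >= 0) close(cl);
    if (sd >= 0) close(sd);
    if (lock) fclose(lock);
    return accepted;
}
```

fcntl() locks are owned per process and are not inherited across fork(), which is what lets forked children contend for the same lock file.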
We could protect that with a timer for 1.2... but I'm thinking of
serializing to solve PR#467 so it may be moot in the future. Dunno. Am I
on crack?
Your longjmp thing is clever though. But I can't think of how to test if
we really got a connection after the loop exits due to deferred_die being
set. For example, we could get the signal after accept() returns but
before csd is set. Race conditions rule.
Dean
On Thu, 26 Jun 1997, Nathan Kurz wrote:
>
> >Number: 792
> >Category: general
> >Synopsis: race condition with SIGUSR1 graceful restart
> >Confidential: no
> >Severity: non-critical
> >Priority: medium
> >Responsible: apache (Apache HTTP Project)
> >State: open
> >Class: sw-bug
> >Submitter-Id: apache
> >Arrival-Date: Thu Jun 26 08:50:01 1997
> >Originator: [EMAIL PROTECTED]
> >Organization:
> apache
> >Release: 1.2.0
> >Environment:
> any
> >Description:
> There is a problem with the signal handling of SIGUSR1 in child_main()
> in http_main.c around line 1775. If a SIGUSR1 comes too early in the
> for loop it will be ignored and the process will wait in accept.
> It's none too critical, but could be improved.
> >How-To-Repeat:
> This condition can be tested by putting a pause() or sleep in the
> for loop just before the accept and then sending a SIGUSR1 to the
> process.
> >Fix:
> It needs a long jump. Something like:
>
> if (ap_setjmp(deferred_die_jump_buffer, 1) == 0) {
> signal(SIGUSR1, deferred_die_and_jump_handler);
> }
> while (! deferred_die) {
> clen = sizeof();
> csd = accept();
> if (csd >=0 || errno != EINTR) break;
> }
> signal(SIGUSR1, deferred_die_handler);
> >Audit-Trail:
> >Unformatted:
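Nathan's suggested fix, filled out as a runnable sketch. Assumptions: POSIX `sigsetjmp()`/`siglongjmp()` stand in for `ap_setjmp(buf, 1)`, with the second argument 1 saving the signal mask so that jumping out of the handler does not leave SIGUSR1 blocked. `accept_or_die`, `make_listener`, `install_usr1`, `CHILD_TOLD_TO_DIE` and the `raise_early` flag are hypothetical; the flag simulates a SIGUSR1 landing in the window between installing the handler and calling accept(), i.e. the How-To-Repeat case. The comment inside the loop marks the spot Dean worries about.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define CHILD_TOLD_TO_DIE (-2)

static sigjmp_buf deferred_die_jump_buffer;
static volatile sig_atomic_t deferred_die = 0;

static void deferred_die_handler(int sig) { (void)sig; deferred_die = 1; }

/* Record the request and jump out of wherever we are (e.g. accept()). */
static void deferred_die_and_jump_handler(int sig) {
    (void)sig;
    deferred_die = 1;
    siglongjmp(deferred_die_jump_buffer, 1);
}

static void install_usr1(void (*fn)(int)) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = fn;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);
}

static int accept_or_die(int sd, int raise_early) {
    /* volatile: a local changed after sigsetjmp() is indeterminate
     * once siglongjmp() lands back here unless it is volatile. */
    volatile int csd = -1;
    if (sigsetjmp(deferred_die_jump_buffer, 1) == 0) {
        install_usr1(deferred_die_and_jump_handler);
        if (raise_early)
            raise(SIGUSR1);   /* simulate a signal just before accept() */
    }
    while (!deferred_die) {
        struct sockaddr_in sa_client;
        socklen_t clen = sizeof sa_client;
        /* Dean's window: a jump after accept() returns but before the
         * result is stored in csd leaks the new connection. */
        csd = accept(sd, (struct sockaddr *)&sa_client, &clen);
        if (csd >= 0 || errno != EINTR)
            break;
    }
    install_usr1(deferred_die_handler);   /* stop jumping from here on */
    if (csd >= 0)
        return csd;                        /* serve it, then die if flagged */
    return deferred_die ? CHILD_TOLD_TO_DIE : -1;
}

/* Listening socket on an ephemeral loopback port, or -1. */
static int make_listener(void) {
    struct sockaddr_in addr;
    int sd = socket(AF_INET, SOCK_STREAM, 0);
    if (sd < 0)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(sd, (struct sockaddr *)&addr, sizeof addr) != 0 || listen(sd, 5) != 0) {
        close(sd);
        return -1;
    }
    return sd;
}
```

With `raise_early` set, the handler jumps back to `sigsetjmp()`, the loop is skipped and the child reports that it was told to die. Without the jump, the same early signal would only set the flag and the child would sit in accept(), which is exactly what the PR describes.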