[sorry for the long post, but I think this is important...]

We had three instances recently where daedalus's httpd parent died because of a
fatal error detected in a child.  There has been some discussion about what to
do about the remaining children, etc., but none about how screwed up our Unix
accept() error handling is, which is what led to the fatal error.

The kernel was periodically running out of fd's.  prefork happened to
(indirectly) invoke unixd_accept at an unlucky moment.  unixd_accept called APR,
which did the right thing and returned ENFILE.  unixd_accept then goes through a
rather unsightly switch statement where none of the cases match, so we hit the
default, which logs the error and returns APR_EGENERAL.  Back in prefork,

if (status == APR_EGENERAL) {
    clean_child_exit(APEXIT_CHILDFATAL);
}

Yikes!  Since Brian B had mentioned the fd problem, I scrambled to get a patch
ready for 2.0.31 so we could survive it.

While I was doing this, I told Jeff about it and he was astonished.  He had
recently been beating the crap out of worker on AIX and Solaris with the number
of fd's severely restricted to make sure we didn't do anything foolish, and
hadn't seen such a problem.  As it turns out, worker ignores the error and
re-issues the accept().  That's probably better, but if we had a third-party
module leaking descriptors, for example, it wouldn't help resolve the situation.
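
For anyone who wants to see descriptor exhaustion without waiting for the
kernel to run dry, here is a minimal sketch of the "severely restricted fd's"
kind of test.  It is not Jeff's actual setup; the function name and
MAX_TRACKED are made up for illustration.  It lowers the per-process soft
limit with setrlimit() and opens /dev/null until open() fails.  Note that this
provokes EMFILE (per-process limit), not the system-wide ENFILE we hit on
daedalus, and it assumes a POSIX system where /dev/null exists.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

#define MAX_TRACKED 1024   /* hypothetical cap on descriptors we track */

/* Lower the soft RLIMIT_NOFILE to `limit` (should be below MAX_TRACKED),
 * open /dev/null until open() fails, then close everything and restore
 * the original limit.  Returns the errno of the failing call, or 0 if
 * nothing failed within MAX_TRACKED opens. */
int errno_when_fds_run_out(rlim_t limit)
{
    struct rlimit saved, lowered;
    int fds[MAX_TRACKED];
    int count = 0;
    int err = 0;

    if (getrlimit(RLIMIT_NOFILE, &saved) != 0)
        return errno;

    lowered = saved;
    lowered.rlim_cur = limit;          /* only the soft limit changes */
    if (setrlimit(RLIMIT_NOFILE, &lowered) != 0)
        return errno;

    while (count < MAX_TRACKED) {
        int fd = open("/dev/null", O_RDONLY);
        if (fd < 0) {
            err = errno;               /* remember why open() failed */
            break;
        }
        fds[count++] = fd;
    }

    while (count > 0)                  /* release what we grabbed */
        close(fds[--count]);
    setrlimit(RLIMIT_NOFILE, &saved);  /* put the soft limit back */

    return err;
}
```

Calling errno_when_fds_run_out(16) should come back with EMFILE on a typical
Unix box; accept() fails the same way once the process is at its limit.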

1.3 has the same comments mentioning fd leaks, so the problem was well known.  In
1.3, ENFILE and EMFILE also hit the default (line 4419 in src/main/http_main.c),
but the Unix code then does a clean_child_exit(1).  TPF considers the default
fatal.

At first I thought somebody had accidentally copied the 1.3 TPF code into 2.0
Unix.  But then I had a look at the accept(2) man page on FreeBSD and Linux.
Most of the errnos look like should-not-occur programming errors, so having
default/APR_EGENERAL be a fatal error might be reasonable.

What I think we need is a new APR error category (APR_ENORESOURCE?
APR_ESICKCHILD?), similar to APR_EGENERAL, that advises the MPM to cleanly shut
down the child without affecting the parent.  This could be set in unixd_accept
for things like EMFILE, ENFILE, ENOBUFS, and ENOMEM, which deal with resource
shortages.
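
To make the idea concrete, here is a rough sketch of the classification I have
in mind.  It is not a patch against unixd_accept: the enum, its values, and the
function name are placeholders made up for illustration, and nothing here
exists in APR today.  It only shows the four resource-shortage errnos being
split out from the catch-all default.

```c
#include <errno.h>

/* Hypothetical categories; these names are placeholders, not APR symbols. */
typedef enum {
    ACCEPT_ERR_GENERAL,     /* should-not-occur error: today's fatal default */
    ACCEPT_ERR_NORESOURCE   /* resource shortage: shut this child down cleanly */
} accept_err_category;

/* Map an errno returned by accept() to one of the two categories above. */
accept_err_category classify_accept_errno(int err)
{
    switch (err) {
    case EMFILE:    /* per-process descriptor limit reached */
    case ENFILE:    /* system-wide file table full */
    case ENOBUFS:   /* no buffer space available */
    case ENOMEM:    /* out of memory */
        return ACCEPT_ERR_NORESOURCE;
    default:
        return ACCEPT_ERR_GENERAL;
    }
}
```

The MPM would then exit the child quietly on the NORESOURCE result and keep
the parent-killing exit for the GENERAL one.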

Comments?  Better names for the new error category?

Greg
