On Tue, Feb 5, 2008 at 7:53 AM, Joe Orton <jor...@redhat.com> wrote:
> On Fri, Feb 01, 2008 at 10:41:39AM +0100, Stefan Fritsch wrote:
> > Joe Orton wrote:
> > > I mentioned in the bug that the signal handler could cause undefined
> > > behaviour, but I'm not sure now whether that is true. On Linux I can
> > > reproduce some cases where this will happen, which are all due to
> > > well-defined behaviour:
> > >
> > > 1) with some (default on Linux) accept mutex types,
> > > apr_proc_mutex_lock() will loop on EINTR. Hence, children blocked
> > > waiting for the mutex do "hang" until the mutex is released. Fixing
> > > this would need some APR work, new interfaces, blah
> >
> > This is not a problem. On graceful-stop or reload the processes will get
> > the lock one by one and die (or hang somewhere else). I have never seen a
> > left over process hanging in this function.
>
> Well, normally all children will be woken up and take the accept mutex
> because of the dummy connections. But if you have one child blocked
> because of issue (3) - whilst holding the accept mutex - all the other
> children will also be blocked. If the EINTR could be processed at MPM
> level, this wouldn't happen. So I think it is a problem, though you
> could argue that solving (3) also sort of solves (1).
>
> > > I can also reproduce a third case, but I'm not sure about the cause:
> > >
> > > 3) apr_pollset_poll() is blocking despite the fact that the listening
> > > fds are supposedly already closed before entering the syscall.
> >
> > This is the main problem in my experience.
> ...
> > On Linux with epoll, the hanging processes just blocks in
> > apr_pollset_poll(), so checking the return value won't do any good.
> >
> > Maybe the problem is that (AIUI) poll() returns POLLNVAL if a fd is not
> > open, while epoll() does not have something similar. In epoll.c, a
> > comment says "APR_POLLNVAL is not handled by epoll". Or should epoll
> > return EPOLLHUP in this case?
>
> I did some more research on this: the case is covered in the epoll(7)
> man page - fds are removed from any containing epoll sets on closure.
> So it is well-defined behaviour, and the "hang" is expected; when all
> the listeners are closed, the poll set becomes empty, so the
> apr_pollset_poll() call will sleep forever, or until interrupted by
> signal!
>
> select() and poll() will indeed return POLLNVAL for the closed-fds case,
> and prefork needs to check for that.
>
> From some brief googling, FreeBSD kqueue appears to have the same
> guarantee. This PR has some investigation of what happens with Solaris
> ports: http://issues.apache.org/bugzilla/show_bug.cgi?id=42580
>
> For the graceful-stop case, it would be simple enough to just signal any
> dozy children again to wake them up in the wait-for-exit loop, but
> graceful-restart doesn't have that opportunity, so I'm not sure about a
> general solution. Reducing the poll timeout to some non-infinite time
> would work.

This holds up to some very light graceful-restart testing on OpenSolaris
(the same light testing that triggered a hang):

Index: server/mpm/prefork/prefork.c
===================================================================
--- server/mpm/prefork/prefork.c (revision 731724)
+++ server/mpm/prefork/prefork.c (working copy)
@@ -540,10 +540,12 @@
             apr_int32_t numdesc;
             const apr_pollfd_t *pdesc;
 
-            /* timeout == -1 == wait forever */
-            status = apr_pollset_poll(pollset, -1, &numdesc, &pdesc);
+            /* timeout == 10 seconds to avoid a hang at graceful restart/stop
+             * caused by the closing of sockets by the signal handler
+             */
+            status = apr_pollset_poll(pollset, apr_time_from_sec(10), &numdesc, &pdesc);
             if (status != APR_SUCCESS) {
-                if (APR_STATUS_IS_EINTR(status)) {
+                if (APR_STATUS_IS_TIMEUP(status) || APR_STATUS_IS_EINTR(status)) {
                     if (one_process && shutdown_pending) {
                         return;
                     }
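
For anyone who wants to see the underlying behaviour outside of httpd, here is a minimal Linux-only sketch. It is not taken from prefork or APR: it uses the raw epoll and poll() syscalls, a pipe stands in for a listening socket, and the helper names epoll_after_close() and poll_after_close() are made up for illustration. The idea is that once the only registered fd is closed, the epoll set is empty and only the finite timeout ends the wait, whereas poll() flags the closed fd straight away.

```c
#include <assert.h>
#include <poll.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: register one fd with epoll, close it, then wait.
 * Returns epoll_wait()'s result; 0 means the timeout expired with no events. */
static int epoll_after_close(int timeout_ms)
{
    int fds[2], epfd, rc, n;
    struct epoll_event ev = {0}, out;

    rc = pipe(fds);                 /* pipe read end stands in for a listener */
    assert(rc == 0);
    epfd = epoll_create1(0);
    assert(epfd >= 0);

    ev.events = EPOLLIN;
    ev.data.fd = fds[0];
    rc = epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev);
    assert(rc == 0);

    /* Closing the last descriptor for the file removes it from the epoll
     * set (see epoll(7)), so nothing is left that could wake the waiter. */
    close(fds[0]);

    /* With timeout_ms == -1 this would block until a signal arrives. */
    n = epoll_wait(epfd, &out, 1, timeout_ms);

    close(fds[1]);
    close(epfd);
    return n;
}

/* Hypothetical helper: poll() a descriptor that has already been closed.
 * Returns the revents reported for it, or 0 if poll() reported nothing. */
static short poll_after_close(void)
{
    int fds[2], rc, n;
    struct pollfd pfd;

    rc = pipe(fds);
    assert(rc == 0);
    close(fds[0]);

    pfd.fd = fds[0];
    pfd.events = POLLIN;
    pfd.revents = 0;
    n = poll(&pfd, 1, 100);

    close(fds[1]);
    return n == 1 ? pfd.revents : 0;
}
```

Calling epoll_after_close(50) should come back after roughly 50 ms with 0, which is the raw-syscall counterpart of the APR_STATUS_IS_TIMEUP branch in the patch, and poll_after_close() should return a value with POLLNVAL set. Whether kqueue or Solaris ports behave the same way is not something this sketch checks.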