proc_pthread accept mutex + graceful restarts = race condition

Greg Ames Tue, 15 Jan 2013 09:06:40 -0800

See PR 49504 https://issues.apache.org/bugzilla/show_bug.cgi?id=49504 for
an excellent analysis with supporting traces.  There have been various
other PRs that haven't led to resolution, perhaps because the problem is
highly timing dependent and easy to circumvent via AcceptMutex sysvsem.


The problem affects the worker and prefork MPMs.  Event has a "get out of
jail free" card - no accept mutex!  I've only heard of it on Solaris,
probably because APR_USE_PROC_PTHREAD_SERIALIZE is the default on Solaris.
We have a recent report of somebody hitting the problem consistently using
worker 2.2.x on Solaris 10 x64 running virtualized.

Basically the pthread accept mutex lives in pconf but the lifetime of pconf
is not quite right.  In server/main.c we have

    for (;;) {
        apr_hook_deregister_all();
        apr_pool_clear(pconf);

A graceful restart causes the MPM to notify the child processes to shut
down.  The MPM then immediately returns to main(), which causes this loop
to iterate.  Then the last generation's pconf is cleared and off we go with
the new config.  The problem is that there is no guarantee that the old
generation processes are done with the accept mutex when we clear pconf.
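
To make the lifetime issue concrete, here is a minimal standalone sketch.
It is not httpd or APR code: it assumes the accept mutex behaves like a
PTHREAD_PROCESS_SHARED mutex in anonymous shared memory, and every function
name in it is hypothetical.  It shows the ordering that is safe (the old
child is reaped before the mutex is destroyed); the race described above
corresponds to running the destroy step without waiting for the child.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the accept mutex: a process-shared pthread
 * mutex in anonymous shared memory, so forked children see the same lock. */
static pthread_mutex_t *accept_mutex_create(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return NULL;
    if (pthread_mutexattr_init(&attr) != 0) {
        munmap(m, sizeof(*m));
        return NULL;
    }
    if (pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) != 0
        || pthread_mutex_init(m, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        munmap(m, sizeof(*m));
        return NULL;
    }
    pthread_mutexattr_destroy(&attr);
    return m;
}

/* Stand-in for what clearing the owning pool does to the mutex. */
static int accept_mutex_destroy(pthread_mutex_t *m)
{
    int rv;
    assert(m != NULL);
    rv = pthread_mutex_destroy(m);
    munmap(m, sizeof(*m));
    return rv;
}

/* Old-generation child: serialize one pretend accept, then exit.
 * Returns the child's pid in the parent, or -1 if fork() failed. */
static pid_t spawn_child(pthread_mutex_t *m)
{
    pid_t pid = fork();
    if (pid == 0) {
        int ok = pthread_mutex_lock(m) == 0;
        usleep(10000);              /* pretend to poll and accept */
        ok = ok && pthread_mutex_unlock(m) == 0;
        _exit(ok ? 0 : 1);
    }
    return pid;
}

/* One simulated restart.  Returns 0 if the child used the mutex cleanly
 * and the mutex was destroyed only after the child was gone. */
static int restart_once(void)
{
    int status = 0;
    pid_t pid;
    pthread_mutex_t *m = accept_mutex_create();
    if (m == NULL)
        return -1;
    pid = spawn_child(m);
    if (pid < 0 || waitpid(pid, &status, 0) != pid) {
        accept_mutex_destroy(m);
        return -1;
    }
    /* Safe ordering: the child has exited, so nobody else holds the lock.
     * Destroying before the waitpid() above would be the race. */
    if (accept_mutex_destroy(m) != 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
}
```

Calling restart_once() from a small main() is enough to exercise it.  Note
that the waitpid() here is exactly the kind of blocking a graceful restart
is meant to avoid, so this only illustrates the ordering constraint; it is
not a proposed fix.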

Event appears to be doing fine with no accept mutex.  prefork definitely
needs one.  Some of the old timers may remember back
when we were trying to stabilize 2.0 prefork enough to ship a beta, and
Brian B's pager was going off at night due to high load on apache.org.  The
bug was due to no accept serialization in prefork around the poll for
listening sockets.

That makes me wonder whether worker really needs an accept mutex... is it
more like Event or prefork in this regard?  ThreadsPerChild 100 or higher
is reasonable on most systems these days, with the old linuxthreads library
dead and buried.  With that tuning worker needs 1% or fewer of the
processes prefork does for the same number of connections, so without an
accept mutex it would see 1% or less of the polling overhead seen on
prefork.  There's a chance that performance would actually improve due to
eliminating the lock/unlock path length.  I'm thinking "AcceptMutex none"
might be a good first step for worker.

Then what about prefork?  Some people apparently still care about it.  I
don't think we can blame APR's default mutex choice on Solaris for the
problem.  Inserting a sleep() or otherwise blocking the server main loop
seems like it would make graceful restarts not-so-graceful.  One idea is to
move the accept mutex into pconf's parent pool so it stays around forever.
The guy who did the detailed analysis of the problem suggested refcounting
the uses of the accept mutex so that its last user could do the cleanup.
That strikes me as too complex.
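
For reference, a rough sketch of what the refcounting idea could look like.
This is my reading of the suggestion, not code from the PR, from APR, or
from httpd: the struct and function names are hypothetical, and it assumes
the counter can sit next to a PTHREAD_PROCESS_SHARED mutex in the same
shared mapping and be updated with C11 atomics.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical: accept mutex plus a user count, both in shared memory. */
struct counted_mutex {
    pthread_mutex_t lock;
    atomic_int refs;
};

/* Create the mutex with one reference, held by the creator. */
static struct counted_mutex *counted_mutex_create(void)
{
    pthread_mutexattr_t attr;
    struct counted_mutex *cm = mmap(NULL, sizeof(*cm),
                                    PROT_READ | PROT_WRITE,
                                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (cm == MAP_FAILED)
        return NULL;
    if (pthread_mutexattr_init(&attr) != 0) {
        munmap(cm, sizeof(*cm));
        return NULL;
    }
    if (pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) != 0
        || pthread_mutex_init(&cm->lock, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        munmap(cm, sizeof(*cm));
        return NULL;
    }
    pthread_mutexattr_destroy(&attr);
    atomic_init(&cm->refs, 1);
    return cm;
}

/* A new user (e.g. a child process) takes a reference. */
static void counted_mutex_ref(struct counted_mutex *cm)
{
    assert(cm != NULL);
    atomic_fetch_add(&cm->refs, 1);
}

/* Drop a reference.  Returns 1 if this caller was the last user and
 * destroyed the mutex, 0 otherwise.  Non-final users only decrement;
 * their mapping goes away when their process exits. */
static int counted_mutex_unref(struct counted_mutex *cm)
{
    assert(cm != NULL);
    if (atomic_fetch_sub(&cm->refs, 1) != 1)
        return 0;
    pthread_mutex_destroy(&cm->lock);
    munmap(cm, sizeof(*cm));
    return 1;
}
```

In this sketch the parent would call counted_mutex_unref() where it clears
pconf today, and each child would call it on its way out.  A child that
dies without dropping its reference leaves the count stuck above zero,
which is one example of the bookkeeping this approach would have to handle.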

Thoughts/comments/patches?

Greg
