See PR 49504 https://issues.apache.org/bugzilla/show_bug.cgi?id=49504 for an excellent analysis with supporting traces. There have been various other PRs that haven't led to a resolution, perhaps because the problem is easy to circumvent via AcceptMutex sysvsem and is highly timing dependent.
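For anyone who just needs the workaround, it is a one-line config change. This is only a fragment for 2.2.x, and it assumes your build supports the sysvsem method:

```
# httpd.conf, 2.2.x: use a SysV semaphore for the accept mutex
AcceptMutex sysvsem
```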
The problem affects the worker and prefork MPMs. Event has a "get out of jail free" card - no accept mutex! I've only heard of it on Solaris, probably because APR_USE_PROC_PTHREAD_SERIALIZE is the default there. We have a recent report of somebody hitting the problem consistently using worker 2.2.x on Solaris 10 x64 running virtualized.

Basically the pthread accept mutex lives in pconf, but the lifetime of pconf is not quite right. In server/main.c we have

    for (;;) {
        apr_hook_deregister_all();
        apr_pool_clear(pconf);

A graceful restart causes the MPM to notify the child processes to shut down. The MPM then immediately exits and causes this loop to iterate. Then the last generation's pconf is cleared and off we go with the new config. The problem is that there is no guarantee that the old generation's processes are done with the accept mutex when we clear pconf.

Event appears to be doing fine with no accept mutex, while prefork definitely needs one. Some of the old timers may remember back when we were trying to stabilize 2.0 prefork enough to ship a beta, and Brian B's pager was going off at night due to high load on apache.org. The bug was due to no accept serialization in prefork around the poll for listening sockets.

That makes me wonder whether worker really needs an accept mutex... is it more like Event or prefork in this regard? ThreadsPerChild 100 or higher is reasonable on most systems these days, with the old linuxthreads library dead and buried. With that tuning we would only see 1% or less of the polling overhead seen on prefork if worker didn't use an accept mutex at all. There's a chance that performance would actually improve due to eliminating the lock/unlock path length.

I'm thinking "AcceptMutex none" might be a good first step for worker. Then what about prefork? Some people apparently still care about it. I don't think we can blame APR's default mutex choice on Solaris for the problem.
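To make the ordering problem concrete, here is a toy single-process model of it. This is not httpd or APR code: toy_mutex, toy_pool and both functions are made-up names, and "destroyed" is just a flag standing in for whatever the real pool cleanup does to the mutex. It replays one graceful restart with the mutex owned either by pconf or by a longer-lived pool that the main loop never clears.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the accept mutex; not an APR type. */
typedef struct {
    bool destroyed;               /* set when the owning pool is cleared */
} toy_mutex;

/* Hypothetical stand-in for a pool that owns at most one mutex. */
typedef struct {
    toy_mutex *owned;
} toy_pool;

/* Clearing a pool runs its cleanups, which in this model destroys the mutex. */
static void toy_pool_clear(toy_pool *pool)
{
    if (pool->owned != NULL) {
        pool->owned->destroyed = true;
        pool->owned = NULL;
    }
}

/*
 * Replays one graceful restart and reports whether an old-generation
 * child that is still running would find the mutex already destroyed.
 */
static bool old_child_sees_destroyed_mutex(bool mutex_in_pconf)
{
    toy_mutex accept_mutex = { false };
    toy_pool parent = { NULL };   /* never cleared by the main loop */
    toy_pool pconf = { NULL };    /* cleared on every restart */

    if (mutex_in_pconf) {
        pconf.owned = &accept_mutex;
    } else {
        parent.owned = &accept_mutex;
    }

    /* Old children have been told to exit, but nobody waits for them. */
    toy_pool_clear(&pconf);

    /* An old child now makes one last trip through the accept mutex. */
    return accept_mutex.destroyed;
}
```

In this model old_child_sees_destroyed_mutex(true) returns true and old_child_sees_destroyed_mutex(false) returns false. The price of the second arrangement is that the mutex is never cleaned up at all.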
Inserting a sleep() or otherwise blocking the server main loop seems like it would make graceful restarts not-so-graceful. One idea is to move the accept mutex into pconf's parent pool so it stays around forever. The guy who did the detailed analysis of the problem suggested refcounting the uses of the accept mutex so that its last user could do the cleanup. That strikes me as too complex.

Thoughts/comments/patches?

Greg