We have a problem where child processes of the mod_fcgid PM occasionally
terminate but are never cleaned up (reaped) by the PM.

Below I first state my understanding about how mod_fcgid process management
works and then I describe what I see and make some conclusions based on that.

Could you please check whether my understanding is correct?
Also, do you have any suggestions on how to debug, work around, or fix the
problem?

An important note is that we still use apache 2.2.23 and mod_fcgid 2.3.5.  Yeah,
I know!  We plan to upgrade really soon.

Thank you very much!

First, a note that the safest way to take care of a process's children is to
handle SIGCHLD.  This way no child termination can be missed.
The trade-off is that any program with non-default signal handlers becomes
asynchronously concurrent and that imposes certain rules and limitations on the
code.
I presume that this is the reason why mod_fcgid does not use a SIGCHLD handler.
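
To illustrate the SIGCHLD approach, here is a minimal sketch in Python.  It is
a toy example and not mod_fcgid code; the names on_sigchld and spawn_and_wait
are made up for the illustration.  The handler loops over a non-blocking
waitpid(-1) because several children may exit before the handler gets to run.

```python
import os
import signal
import time

reaped = {}  # pid -> raw wait status, filled in by the signal handler


def on_sigchld(signum, frame):
    # Several children may have exited before this handler runs, so keep
    # collecting until waitpid() says nothing more is pending.
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left at all
        if pid == 0:
            return  # children exist, but none has terminated yet
        reaped[pid] = status


def spawn_and_wait(exit_codes, timeout=5.0):
    """Fork one child per exit code; return {pid: exit code} for reaped ones."""
    signal.signal(signal.SIGCHLD, on_sigchld)
    pids = []
    for code in exit_codes:
        pid = os.fork()
        if pid == 0:
            os._exit(code)  # child: terminate immediately
        pids.append(pid)
    deadline = time.monotonic() + timeout
    while not all(p in reaped for p in pids) and time.monotonic() < deadline:
        time.sleep(0.01)
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)
    return {p: os.WEXITSTATUS(reaped[p]) for p in pids if p in reaped}
```

Note that the parent never has to know in advance which child it is waiting
for; that is what makes this approach hard to get wrong with respect to
zombies.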

My analysis of mod_fcgid's process management follows.
mod_fcgid keeps three lists of child processes. The lists are kept in a special
structure allocated in shared memory. Thus it can be inspected and modified by
multiple processes.
The lists are:
- idle list is a list of processes that currently do not perform any work and
can be re-used for a new request processing
- error list is a list of processes that had any communication problem (e.g. an
error writing to their socket or timeout waiting for a reply)
- busy list is a list of processes that are performing any work (or at least
supposed to be)

mod_fcgid code running in apache worker processes directly inspects the lists.
The code picks up a process, if any is available, from the idle list and inserts
it into the busy list.
Communication to the process is done directly via a local socket.
If there is no available process in the idle list, then the code sends a spawn
request to a special dedicated mod_fcgid process (it appears as another apache
process).
The process is known as Process Manager (PM).
The PM spawns a new process upon the request. Thus it is a parent process of all
fcgid workers. The new process is inserted into the idle list if spawning is
successful.
The original apache process waits a little bit after issuing the spawn request
and then re-examines the idle list.
There is a hardcoded limit on the number of retries / iterations that can be
done before the code gives up on the attempt to grab an idle process.
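
As I understand it, the acquire path described above looks roughly like the
following sketch.  This is a simplified Python model, not the actual C code:
acquire_process, request_spawn, max_tries and wait are hypothetical names and
values, and the real lists live in shared memory under a lock.

```python
import time


def acquire_process(idle, busy, request_spawn, max_tries=3, wait=0.01):
    """Move one pid from `idle` to `busy`, asking for a spawn if none is idle.

    Returns the pid, or None once `max_tries` examinations have failed.
    """
    for attempt in range(max_tries):
        if idle:
            pid = idle.pop(0)
            busy.append(pid)
            return pid
        if attempt + 1 < max_tries:
            request_spawn()   # ask the PM for a new process ...
            time.sleep(wait)  # ... and give it a moment before re-checking
    return None
```

For example, with an empty idle list and a request_spawn callback that appends
a new pid to it, the second examination succeeds; if the callback never
produces a process, the function returns None after max_tries examinations.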

The PM periodically (with configurable periods, default 3 seconds) walks the
idle and the error lists and executes a non-blocking waitpid() call on every
process in the lists.
This way the PM detects the idle or "errored" processes that have terminated in
any fashion.
It must be noted that until the waitpid call the terminated processes are kept
by Unix-like operating systems as "zombies".
The waitpid call collects their termination information, and only then are the
zombies reaped.
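
A minimal Python sketch of this kind of scan, assuming the scanner is the
parent of the processes (the function names are made up for the illustration
and this is not the mod_fcgid code):

```python
import os
import time


def spawn_exiting_child(code):
    """Fork a child that exits at once with `code`; return its pid."""
    pid = os.fork()
    if pid == 0:
        os._exit(code)
    return pid


def scan_once(pids):
    """One non-blocking waitpid() per listed pid; return {pid: raw status}."""
    reaped = {}
    for pid in pids:
        try:
            done, status = os.waitpid(pid, os.WNOHANG)
        except ChildProcessError:
            continue  # not a child of ours, or already reaped
        if done == pid:  # 0 would mean "still running"
            reaped[pid] = status
    return reaped


def scan_until_reaped(pids, period=0.01, timeout=5.0):
    """Repeat scan_once() every `period` seconds until all pids are reaped."""
    pending, reaped = set(pids), {}
    deadline = time.monotonic() + timeout
    while pending and time.monotonic() < deadline:
        found = scan_once(pending)
        reaped.update(found)
        pending -= set(found)
        if pending:
            time.sleep(period)
    return reaped
```

The important property is that only the pids passed to the scan are ever
waited on.  A terminated child that is not on a scanned list stays a zombie
indefinitely.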

The PM never walks the busy list.
A different mechanism is used for managing processes on the busy list.
Apache has a concept of resource pools. For example, all memory allocations must
refer to a pool.
When the pool is cleared or destroyed all memory allocated from it is
automatically cleaned up.
Additionally, it is possible to register an arbitrary object and a cleanup
callback with the pool.
When the pool is cleared or destroyed all the registered callbacks are called
upon their associated objects.

To avoid any memory / resource leaks apache creates a separate pool for each
configured server, for each connection and for each request.
All the code is supposed to use an appropriate pool based on the scope of its
operation.
When fcgid code grabs a process to handle a request and puts it on the busy list
the code also registers a process handle and a special callback with a pool
allocated for the request in question.
The callback function moves the process from the busy list back to the idle list
if there were no problems, or to the error list otherwise.
Thus, if the apache server and the apache framework work as expected /
documented, then the process should be "unbusied" as soon as the request is 
handled.
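
The following toy model in Python shows the mechanism as I understand it.
Pool and ProcTable are hypothetical stand-ins for an APR pool and for the
shared-memory process lists; none of this is the real API.

```python
class Pool:
    """Toy stand-in for a resource pool: runs registered cleanups on destroy."""

    def __init__(self):
        self._cleanups = []

    def cleanup_register(self, data, callback):
        self._cleanups.append((data, callback))

    def destroy(self):
        # Call every registered callback on its associated object.
        while self._cleanups:
            data, callback = self._cleanups.pop()
            callback(data)


class ProcTable:
    """Toy stand-in for the idle / busy / error lists."""

    def __init__(self, pids):
        self.idle = list(pids)
        self.busy = []
        self.error = []

    def acquire(self, request_pool):
        """Move a process idle -> busy and tie its release to the pool."""
        if not self.idle:
            return None
        handle = {"pid": self.idle.pop(0), "failed": False}
        self.busy.append(handle["pid"])
        request_pool.cleanup_register(handle, self._release)
        return handle

    def _release(self, handle):
        """Cleanup callback: busy -> idle, or busy -> error after a failure."""
        self.busy.remove(handle["pid"])
        target = self.error if handle["failed"] else self.idle
        target.append(handle["pid"])
```

In this model a process leaves the busy list only when destroy() is called on
the pool it was registered with.  If that never happens for some request, the
process stays on the busy list forever, whatever its real state is.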

Given the above, I cannot find any holes in the mod_fcgid logic itself which
could lead to unreaped zombies.

On the affected system I observe that mod_fcgid reports the zombie processes as
still working (being on the busy list).
For example:
$ sudo ps axwwl | fgrep -w Z
2084 67497 71375 0 20 0 0 0 - Z ?? 0:01.15 <defunct>
2125 82246 71375 0 20 0 0 0 - Z ?? 0:24.08 <defunct>

Process name: php-fastcgi-wrapper
Pid Active Idle Accesses State
67497 275184 275174 1 Working
Process name: php-fastcgi-wrapper
Pid Active Idle Accesses VirtualHost State
82246 335933 335672 119 Working

So, this leads me to conclude that the problem lies somewhere in the apache
server code or in the apache pool management code.
Apparently the process cleanup callback has never been called for these
processes and thus they are stuck on the busy list.
Even more obvious is that the processes terminated in some fashion, most likely
crashed.
Possibly there is a correlation between these two observations: perhaps some
error conditions result in request cleanup not being properly done.

-- 
Andriy Gapon
