We've got a problem where sometimes some child processes of the mod_fcgid Process Manager (PM) terminate, but are never cleaned up by the PM.
Below I first state my understanding of how mod_fcgid process management works, then I describe what I see and draw some conclusions from that. Could you please check my understanding for correctness? Perhaps you have suggestions on how to debug, work around or fix the problem?

An important note is that we still use Apache 2.2.23 and mod_fcgid 2.3.5. Yeah, I know! We plan to upgrade really soon.

Thank you very much!

First, a note that the safest way to take care of a process's children is to handle SIGCHLD. This way no child termination can be missed. The trade-off is that any program with non-default signal handlers becomes asynchronously concurrent, and that imposes certain rules and limitations on the code. I presume that this is the reason why mod_fcgid does not use a SIGCHLD handler.

My analysis of mod_fcgid's process management follows.

mod_fcgid keeps three lists of child processes. The lists are kept in a special structure allocated in shared memory, so they can be inspected and modified by multiple processes. The lists are:
- the idle list: processes that currently do not perform any work and can be re-used to process a new request
- the error list: processes that had any communication problem (e.g. an error writing to their socket or a timeout waiting for a reply)
- the busy list: processes that are performing some work (or at least are supposed to be)

mod_fcgid code running in Apache worker processes inspects the lists directly. The code picks a process from the idle list, if one is available, and inserts it into the busy list. Communication with the process is done directly via a local socket.

If there is no available process in the idle list, the code sends a spawn request to a special dedicated mod_fcgid process (it appears as another Apache process). That process is known as the Process Manager (PM). The PM spawns a new process upon the request. Thus it is the parent process of all fcgid workers.
The new process is inserted into the idle list if spawning is successful. The original Apache process waits a little after issuing the spawn request and then re-examines the idle list. There is a hardcoded limit on the number of retries that can be made before the code gives up on the attempt to grab an idle process.

The PM periodically (with configurable periods, default 3 seconds) walks the idle and the error lists and executes a non-blocking waitpid() call on every process in those lists. This way the PM detects idle or "errored" processes that have terminated in any fashion. It must be noted that until the waitpid() call, Unix-like operating systems keep the terminated processes around as "zombies". The waitpid() call collects their termination information, and with that the zombies are reaped.

The PM never walks the busy list. A different mechanism is used for managing processes on the busy list.

Apache has a concept of resource pools. For example, all memory allocations must refer to a pool. When the pool is cleared or destroyed, all memory allocated from it is automatically cleaned up. Additionally, it is possible to register an arbitrary object and a cleanup callback with the pool. When the pool is cleared or destroyed, all the registered callbacks are called on their associated objects. To avoid any memory or resource leaks, Apache creates a separate pool for each configured server, for each connection and for each request. All code is supposed to use the appropriate pool based on the scope of its operation.

When fcgid code grabs a process to handle a request and puts it on the busy list, the code also registers a process handle and a special callback with the pool allocated for the request in question. The callback function moves the process from the busy list back to the idle list if there were no problems, or to the error list otherwise.
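To make the consequence of that design concrete, here is a minimal self-contained sketch in plain POSIX C. It is not mod_fcgid code: the table layout and the names proc_node, scan_idle_and_error() and request_cleanup() are hypothetical stand-ins for the shared-memory lists, the PM's periodic scan and the request-pool cleanup callback described above. It only models the behaviour as I understand it: a dead child that sits on the busy list is never passed to waitpid(), so it stays a zombie until something moves it off that list.

```c
#define _POSIX_C_SOURCE 200809L  /* for waitid() and WNOWAIT on strict compilers */
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-ins for the three mod_fcgid lists, plus a "reaped" state. */
enum list_id { LIST_IDLE, LIST_ERROR, LIST_BUSY, LIST_REAPED };

struct proc_node {
    pid_t pid;
    enum list_id list;
};

/* Fork a child that exits at once, the way a crashed worker would.
 * For determinism of the demo, block until the child has really exited,
 * but leave it unreaped (WNOWAIT), i.e. leave it a zombie. */
static pid_t spawn_dying_child(void)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(1);
    if (pid > 0) {
        siginfo_t info;
        (void)waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT);
    }
    return pid;
}

/* Return 1 if pid is a terminated child that has not been reaped yet. */
static int is_unreaped_zombie(pid_t pid)
{
    siginfo_t info;
    info.si_pid = 0;
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOHANG | WNOWAIT) != 0)
        return 0;               /* e.g. ECHILD: already reaped */
    return info.si_pid == pid;
}

/* PM-style scan: non-blocking waitpid() on idle and error entries only.
 * Busy entries are skipped on purpose. Returns the number of reaped children. */
static int scan_idle_and_error(struct proc_node *table, int n)
{
    int reaped = 0;
    for (int i = 0; i < n; i++) {
        int status;
        if (table[i].list != LIST_IDLE && table[i].list != LIST_ERROR)
            continue;
        if (waitpid(table[i].pid, &status, WNOHANG) == table[i].pid) {
            table[i].list = LIST_REAPED;
            reaped++;
        }
    }
    return reaped;
}

/* Stand-in for the request-pool cleanup callback: take the process off the
 * busy list and put it on the idle list, or on the error list on failure. */
static void request_cleanup(struct proc_node *node, int had_error)
{
    if (node->list == LIST_BUSY)
        node->list = had_error ? LIST_ERROR : LIST_IDLE;
}
```

If one dead child is placed on the idle list and another on the busy list, a call to scan_idle_and_error() reaps only the first one. The second one is reaped by a later scan only after request_cleanup() has been called for it, which is exactly the step that I suspect is being skipped.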
Thus, if the Apache server and the Apache framework work as expected / documented, then the process should be "unbusied" as soon as the request is handled.

Given the above, I cannot find any holes in the mod_fcgid logic which could lead to unreaped zombies.

On the affected system I observe that mod_fcgid reports the zombie processes as still working (being on the busy list). For example:

    $ sudo ps axwwl | fgrep -w Z
    2084 67497 71375 0 20 0 0 0 - Z ?? 0:01.15 <defunct>
    2125 82246 71375 0 20 0 0 0 - Z ?? 0:24.08 <defunct>

    Process name: php-fastcgi-wrapper
    Pid    Active  Idle    Accesses  State
    67497  275184  275174  1         Working

    Process name: php-fastcgi-wrapper
    Pid    Active  Idle    Accesses  VirtualHost  State
    82246  335933  335672  119                    Working

So, this leads me to conclude that the problem lies somewhere in the Apache server code or in the Apache pool management code. Apparently the process cleanup callback has never been called for these processes, and thus they are stuck on the busy list.

Even more obvious is that the processes terminated in some fashion, most likely crashed. Possibly there is a correlation between these two observations: maybe some error conditions result in request cleanup not being properly done.

-- 
Andriy Gapon