> I presume that this is the reason why mod_fcgid does not use a SIGCHLD 
> handler.
mod_fcgid has to work on both UNIX and Win32, so it picks a portable way to
handle this on both platforms.
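
For comparison, the SIGCHLD approach you describe would look roughly like
the sketch below on UNIX. This is Python for brevity, not mod_fcgid code;
the names reaped, on_sigchld and install_sigchld_reaper are made up for the
illustration.

```python
import os
import signal

reaped = {}  # pid -> raw wait status, filled in by the handler (hypothetical name)

def on_sigchld(signum, frame):
    # Several children may exit before the handler runs, so keep
    # collecting until no terminated child is left.
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has terminated yet
        reaped[pid] = status

def install_sigchld_reaper():
    # UNIX only: there is no SIGCHLD on Win32, which is the portability
    # problem mentioned above.
    signal.signal(signal.SIGCHLD, on_sigchld)
```

Note that waitpid(-1, ...) collects any child of the process, so a handler
like this cannot coexist with code that waits for specific pids elsewhere.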

> Given the above, I can not find any holes in the mod_fcgid logic which could
> lead to unreaped zombies.
Yes, the logic is exactly as you described.
I think your problem is to find out why the zombies stay around. I can't tell
based on the information you gave, but I think you can find out like this:
1. Find the pid of the PM.
2. Use strace -p $PM_pid (Linux) or truss -p $PM_pid (Solaris). It will tell
you what the PM is doing: is waitpid() called? Does waitpid() return an error?
Or did the PM itself die for some reason? It also shows other useful
information.
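
To make the questions in step 2 concrete: a non-blocking waitpid() on one pid
has three possible outcomes, and the trace should show which one the PM gets.
A minimal sketch in Python (poll_child is a hypothetical helper, not part of
mod_fcgid):

```python
import os

def poll_child(pid):
    """Call waitpid(pid, WNOHANG) once and describe the outcome."""
    try:
        done, _status = os.waitpid(pid, os.WNOHANG)
    except ChildProcessError:
        # ECHILD: the pid is not a child of this process, or it has
        # already been reaped.
        return "error"
    if done == 0:
        # The child exists and has not terminated yet.
        return "running"
    # The child had terminated; this call collected it, so the zombie is gone.
    return "reaped"
```

If the trace never shows a wait call for the zombie's pid at all, the PM is
simply not looking at that process.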

Good luck :)


 
> 
> We've got a problem where sometimes some child processes of mod_fcgid PM
> terminate, but are never cleaned up by the PM.
> 
> Below I first state my understanding about how mod_fcgid process management
> works and then I describe what I see and make some conclusions based on that.
> 
> Could you please check my understanding for correctness?
> Perhaps you would have any suggestions on how to debug / workaround / fix the
> problem?
> 
> An important note is that we still use apache 2.2.23 and mod_fcgid 2.3.5.
> Yeah, I know!  We plan to upgrade really soon.
> 
> Thank you very much!
> 
> First a note that the safest way to take care of a process's children is to
> handle SIGCHLD.  This way no child termination can be missed.
> The trade-off is that any program with non-default signal handlers becomes
> asynchronously concurrent and that imposes certain rules and limitations on
> the code.
> I presume that this is the reason why mod_fcgid does not use a SIGCHLD 
> handler.
> 
> My analysis of mod_fcgid's process management follows.
> mod_fcgid keeps three lists of child processes. The lists are kept in a
> special structure allocated in shared memory. Thus it can be inspected and
> modified by multiple processes.
> The lists are:
> - idle list is a list of processes that currently do not perform any work and
> can be re-used for a new request processing
> - error list is a list of processes that had any communication problem (e.g.
> an error writing to their socket or timeout waiting for a reply)
> - busy list is a list of processes that are performing any work (or at least
> supposed to be)
> 
> mod_fcgid code running in apache worker processes directly inspects the lists.
> The code picks up a process, if any is available, from the idle list and
> inserts it into the busy list.
> Communication to the process is done directly via a local socket.
> If there is no available process in the idle list, then the code sends a spawn
> request to a special dedicated mod_fcgid process (it appears as another apache
> process).
> The process is known as Process Manager (PM).
> The PM spawns a new process upon the request. Thus it is the parent process
> of all fcgid workers. The new process is inserted into the idle list if
> spawning is successful.
> The original apache process waits a little bit after issuing the spawn request
> and then re-examines the idle list.
> There is a hardcoded limit on the number of retries / iterations that can be
> done until the code gives up on the attempt to grab an idle process.
> 
> The PM periodically (with configurable periods, default 3 seconds) walks the
> idle and the error lists and executes a non-blocking waitpid() call on every
> process in the lists.
> This way the PM detects the idle or "errored" processes that have terminated
> in any fashion.
> It must be noted that until the waitpid call the terminated processes are kept
> by Unix-like operating systems as "zombies".
> After waitpid call, which collects their termination information, the zombies
> are reaped.
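
The periodic scan described above can be modelled as follows. This is only a
sketch under stated assumptions: the three lists are plain Python lists in a
dict rather than mod_fcgid's shared-memory structure, and scan_for_dead is a
made-up name.

```python
import os

def scan_for_dead(lists):
    """One PM-style pass: poll the idle and error lists, never the busy list."""
    collected = []
    for name in ("idle", "error"):
        for pid in list(lists[name]):  # copy, since we remove while iterating
            try:
                done, _status = os.waitpid(pid, os.WNOHANG)
            except ChildProcessError:
                # Assumption for this sketch: a pid that is not our child
                # (or was already reaped) is dropped from the list too.
                done = pid
            if done == pid:
                lists[name].remove(pid)
                collected.append(pid)
    return collected
```

A process that terminates while it is still on the busy list is never passed
to waitpid() by such a scan, so it stays a zombie until something moves it to
the idle or error list.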
> 
> The PM never walks the busy list.
> A different mechanism is used for managing processes on the busy list.
> Apache has a concept of resource pools. For example, all memory allocations
> must refer to a pool.
> When the pool is cleared or destroyed all memory allocated from it is
> automatically cleaned up.
> Additionally, it is possible to register an arbitrary object and a cleanup
> callback with the pool.
> When the pool is cleared or destroyed all the registered callbacks are called
> upon their associated objects.
> 
> To avoid any memory / resource leaks apache creates separate pools per each
> configured server, per each connection and per each request.
> All the code is supposed to use an appropriate pool based on the scope of its
> operation.
> When fcgid code grabs a process to handle a request and puts it on the busy
> list, the code also registers a process handle and a special callback with a
> pool allocated for the request in question.
> The callback function moves the process from the busy list back to the idle
> list if there were no problems, or to the error list otherwise.
> Thus, if the apache server and the apache framework work as expected /
> documented, then the process should be "unbusied" as soon as the request is
> handled.
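
The request-pool mechanism described above can be sketched like this. Pool is
a toy stand-in for an APR pool, and grab_process and the "failed" flag are
hypothetical names for the illustration; none of this is the real mod_fcgid
or APR API.

```python
class Pool:
    """Toy resource pool: runs the registered cleanups when destroyed."""

    def __init__(self):
        self._cleanups = []

    def register_cleanup(self, func, arg):
        self._cleanups.append((func, arg))

    def destroy(self):
        # The order of cleanups does not matter for this sketch.
        while self._cleanups:
            func, arg = self._cleanups.pop()
            func(arg)


def grab_process(lists, pool):
    """Move one process from idle to busy and tie its release to the pool."""
    pid = lists["idle"].pop(0)
    lists["busy"].append(pid)
    ctx = {"pid": pid, "failed": False}  # the caller sets "failed" on errors

    def release(c):
        lists["busy"].remove(c["pid"])
        lists["error" if c["failed"] else "idle"].append(c["pid"])

    pool.register_cleanup(release, ctx)
    return ctx
```

In this model a process leaves the busy list only when its pool is destroyed.
If the request pool is never cleaned up, the process stays "busy" forever,
whatever has happened to it in the meantime.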
> 
> Given the above, I can not find any holes in the mod_fcgid logic which could
> lead to unreaped zombies.
> 
> On the affected system I observe that mod_fcgid reports the zombie processes
> as still working (being on the busy list).
> For example:
> $ sudo ps axwwl | fgrep -w Z
> 2084 67497 71375 0 20 0 0 0 - Z ?? 0:01.15 <defunct>
> 2125 82246 71375 0 20 0 0 0 - Z ?? 0:24.08 <defunct>
> 
> Process name: php-fastcgi-wrapper
> Pid Active Idle Accesses State
> 67497 275184 275174 1 Working
> Process name: php-fastcgi-wrapper
> Pid Active Idle Accesses VirtualHost State
> 82246 335933 335672 119 Working
> 
> So, this leads me to conclude that the problem lies somewhere in the apache
> server code or in the apache pool management code.
> Apparently the process cleanup callback has never been called for these
> processes and thus they are stuck on the busy list.
> Even more obvious is that the processes terminated in some fashion, most
> likely crashed.
> Possibly there is a correlation between these two observations, maybe some
> error conditions result in request cleanup not being properly done.
> 
> -- 
> Andriy Gapon
> 
> 

