Re: unreaped zombie children of mod_fcgid

2013-09-10 Thread Pqf 潘庆峰
> I presume that this is the reason why mod_fcgid does not use a SIGCHLD 
> handler.
mod_fcgid has to work on both UNIX and Win32, so it picks a portable way to
get this done on both platforms.

> Given the above, I can not find any holes in the mod_fcgid logic which could
> lead to unreaped zombies.
Yes, the logic is exactly what you said.
I think your problem is to find out why the zombies stay. I can't tell based
on the information you gave, but I think you can find out like this:
1. Find out the pid of the PM.
2. Use strace -p $PM_pid (Linux) or truss -p $PM_pid (Solaris). It will tell
you what the PM is doing: is waitpid() called? Does waitpid() return an error?
Or did the PM itself die for some reason? ...and other useful information.

Good luck :)


 
> 
> We've got a problem where sometimes some child processes of mod_fcgid PM
> terminate, but are never cleaned up by the PM.
> 
> Below I first state my understanding about how mod_fcgid process management
> works and then I describe what I see and make some conclusions based on that.
> 
> Could you please check my understanding for correctness?
> Perhaps you would have any suggestions on how to debug / workaround / fix the
> problem?
> 
> An important note is that we still use apache 2.2.23 and mod_fcgid 2.3.5.
> Yeah, I know!  We plan to upgrade really soon.
> 
> Thank you very much!
> 
> First a note that the safest way to take care of a process's children is to
> handle SIGCHLD.  This way no child termination can be missed.
> The trade-off is that any program with non-default signal handlers becomes
> asynchronously concurrent and that imposes certain rules and limitations on
> the code.
> I presume that this is the reason why mod_fcgid does not use a SIGCHLD
> handler.
> 
> My analysis of mod_fcgid's process management follows.
> mod_fcgid keeps three lists of child processes. The lists are kept in a
> special structure allocated in shared memory. Thus it can be inspected and
> modified by multiple processes.
> The lists are:
> - idle list is a list of processes that currently do not perform any work and
> can be re-used for a new request processing
> - error list is a list of processes that had any communication problem (e.g.
> an error writing to their socket or timeout waiting for a reply)
> - busy list is a list of processes that are performing any work (or at least
> supposed to be)
> 
> mod_fcgid code running in apache worker processes directly inspects the lists.
> The code picks up a process, if any is available, from the idle list and
> inserts it into the busy list.
> Communication to the process is done directly via a local socket.
> If there is no available process in the idle list, then the code sends a spawn
> request to a special dedicated mod_fcgid process (it appears as another apache
> process).
> The process is known as Process Manager (PM).
> The PM spawns a new process upon the request. Thus it is a parent process of
> all fcgid workers. The new process is inserted into the idle list if spawning
> is successful.
> The original apache process waits a little bit after issuing the spawn request
> and then re-examines the idle list.
> There is a hardcoded limit on the number of retries / iterations that can be
> done until the code gives up on the attempt to grab an idle process.
> 
> The PM periodically (with configurable periods, default 3 seconds) walks the
> idle and the error lists and executes a non-blocking waitpid() call on every
> process in the lists.
> This way the PM detects the idle or "errored" processes that have terminated
> in any fashion.
> It must be noted that until the waitpid call the terminated processes are kept
> by Unix-like operating systems as "zombies".
> After waitpid call, which collects their termination information, the zombies
> are reaped.
> 
> The PM never walks the busy list.
> A different mechanism is used for managing processes on the busy list.
> Apache has a concept of resource pools. For example, all memory allocations
> must refer to a pool.
> When the pool is cleared or destroyed all memory allocated from it is
> automatically cleaned up.
> Additionally, it is possible to register an arbitrary object and a cleanup
> callback with the pool.
> When the pool is cleared or destroyed all the registered callbacks are called
> upon their associated objects.
> 
> To avoid any memory / resource leaks apache creates separate pools per each
> configured server, per each connection and per each request.
> All the code is supposed to use an appropriate pool based on the scope of its
> operation.
> When fcgid code grabs a process to handle a request and puts it on the busy
> list the code also registers a process handle and a special callback with a
> pool allocated for the request in question.
> The callback function moves the process from the busy list back to the idle
> list if there were no problems, or to the error list.
> Thus, if the apache server and the apach

Re: unreaped zombie children of mod_fcgid

2013-09-10 Thread Andriy Gapon
on 11/09/2013 04:26 Pqf 潘庆峰 said the following:
> I think your problem is to find out why the zombies stay. I can't tell based
> on the information you gave, but I think you can find out like this:
> 1. Find out the pid of the PM.
> 2. Use strace -p $PM_pid (Linux) or truss -p $PM_pid (Solaris). It will tell
> you what the PM is doing: is waitpid() called? Does waitpid() return an error?
> Or did the PM itself die for some reason? ...and other useful information.

Sorry that I was not clear about this in my original post.
The PM is doing well: it's running and it's calling waitpid on other processes.
It does not call waitpid on the zombie processes in question because they are
still on the busy list.  And it seems that the PM never checks processes on the
busy list.
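
A minimal sketch of that situation, in Python rather than mod_fcgid's C (all
function names are made up for the example, and the "lists" are plain Python
lists, not the shared-memory structure): a periodic non-blocking waitpid()
pass over only the idle and error lists reaps a dead idle child, while an
equally dead child parked on the busy list is left for somebody else to wait
for.

```python
import os
import time


def spawn_short_lived_child():
    """Fork a child that exits immediately and return its pid."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit right away
    return pid


def reap_if_dead(pid):
    """Non-blocking waitpid(): True if pid had exited and is now reaped."""
    reaped_pid, _status = os.waitpid(pid, os.WNOHANG)
    return reaped_pid == pid


def scan(lists, names):
    """Walk only the named lists and drop entries that were reaped."""
    for name in names:
        lists[name] = [pid for pid in lists[name] if not reap_if_dead(pid)]


def demo():
    lists = {"idle": [spawn_short_lived_child()],
             "error": [],
             "busy": [spawn_short_lived_child()]}
    busy_pid = lists["busy"][0]

    # PM-style periodic pass: the busy list is deliberately not scanned.
    for _ in range(100):
        scan(lists, ("idle", "error"))
        if not lists["idle"]:
            break
        time.sleep(0.05)

    idle_reaped = not lists["idle"]
    busy_untouched = lists["busy"] == [busy_pid]
    # A blocking waitpid() succeeds only if nobody reaped this child before,
    # i.e. it was left behind (as a zombie once it had exited).
    waited_pid, _status = os.waitpid(busy_pid, 0)
    return idle_reaped, busy_untouched, waited_pid == busy_pid
```

Calling demo() on a Unix-like system runs the whole scenario; the final
blocking waitpid() in the sketch stands in for the reaping that nobody does
for busy-list entries.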

I've been thinking about this problem and the only theory that I have got so far
is that perhaps an owner httpd process could terminate ungracefully (e.g.
crash).  In that case the pool cleanup would never be run.  That's OK for
process local resources like memory or file descriptors, which would be freed by
OS because the process dies anyway.  But that's not OK for external resources
like other processes.
In other words, if an httpd process marks an fcgid process as busy and then
suddenly dies, then there is nobody to move the fcgid process back to the idle 
list.
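
The effect can be reproduced outside of Apache. The sketch below uses Python's
atexit as a stand-in for an APR pool cleanup (an analogy only, not the
mod_fcgid code; CHILD_SCRIPT and run_owner are names invented here): the
handler that would move the process from busy back to idle runs on a normal
exit and is skipped when the owner dies abruptly.

```python
import subprocess
import sys

# Stand-in for an httpd worker: registers a cleanup, then exits one way or
# another depending on its first argument.
CHILD_SCRIPT = """
import atexit, os, sys
atexit.register(lambda: print("busy -> idle", flush=True))
if sys.argv[1] == "crash":
    os._exit(1)  # abrupt termination: registered cleanups are skipped
"""


def run_owner(mode):
    """Run the stand-in owner process and return what its cleanup printed."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT, mode],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout.strip()
```

run_owner("graceful") returns the cleanup message, run_owner("crash") returns
an empty string: nothing moved the entry off the busy list.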

-- 
Andriy Gapon


Re: unreaped zombie children of mod_fcgid

2013-09-11 Thread Andriy Gapon
on 11/09/2013 09:11 Andriy Gapon said the following:
> I've been thinking about this problem and the only theory that I have got so
> far is that perhaps an owner httpd process could terminate ungracefully (e.g.
> crash).  In that case the pool cleanup would never be run.  That's OK for
> process local resources like memory or file descriptors, which would be freed
> by OS because the process dies anyway.  But that's not OK for external
> resources like other processes.
> In other words, if an httpd process marks an fcgid process as busy and then
> suddenly dies, then there is nobody to move the fcgid process back to the
> idle list.

Just an idea: perhaps scan_busylist should move a process to a different list
(the error list?) after treating it with proc_kill_force?
Currently the processes on the busy list are never checked for being a zombie.
With the proposed change they should be correctly reaped after all the waiting
and killing.
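
A rough sketch of the proposed change, in Python rather than as a patch. The
real scan_busylist and proc_kill_force are C functions whose signatures are
not reproduced here; ProcNode, busy_timeout and the kill callback are invented
for the example, and the busy-timeout condition is an assumption about when
the kill happens.

```python
from dataclasses import dataclass


@dataclass
class ProcNode:
    pid: int
    busy_since: float  # time at which the process was put on the busy list


def scan_busylist(busy, error, now, busy_timeout, kill):
    """Kill processes that were busy for too long and hand them to the error list.

    Once on the error list they are covered by the regular waitpid() pass.
    """
    still_busy = []
    for node in busy:
        if now - node.busy_since > busy_timeout:
            kill(node.pid)      # stand-in for proc_kill_force
            error.append(node)  # proposed: do not leave it on the busy list
        else:
            still_busy.append(node)
    busy[:] = still_busy
```

For example, with busy_timeout=30 and now=100, a node that became busy at
time 0 is killed and moved, while one that became busy at time 95 stays put.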

-- 
Andriy Gapon


Re: Re: unreaped zombie children of mod_fcgid

2013-09-11 Thread Pqf 潘庆峰
Hi,
   Yes, you are right: an httpd process that marks an fcgid process as busy
and then suddenly dies would cause the zombie issue.

   I agree with your idea, except: don't move the process to the error list,
but to a newly created process list (like a zombie list?). A process busy
timeout does not necessarily mean that the httpd process died. What if the
httpd process is not dead and then tries to modify the process slot, which has
meanwhile been acquired by another httpd process?

   This idea has a problem: how to deal with the slots in the "zombie list"?
These slots may or may not still be in use by the httpd processes that
acquired them. A reasonable timeout may solve this problem, but is there any
better idea?
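
One way to picture the timeout variant, as a Python sketch under assumptions
(the "zombie list" does not exist in mod_fcgid; ZombieSlot, grace_period and
release_slot are hypothetical names): a killed process is parked with a
timestamp, and its slot is only released once the grace period has passed, so
an httpd process that is slow but still alive gets time to finish with it.

```python
from dataclasses import dataclass


@dataclass
class ZombieSlot:
    pid: int
    parked_at: float  # time at which the slot was moved to the zombie list


def scan_zombielist(zombies, now, grace_period, release_slot):
    """Release slots that have been parked for at least grace_period."""
    still_parked = []
    for slot in zombies:
        if now - slot.parked_at >= grace_period:
            release_slot(slot.pid)  # e.g. waitpid() and make the slot reusable
        else:
            still_parked.append(slot)
    zombies[:] = still_parked
```

The grace period only narrows the race; it does not prove that the owning
httpd process is gone, which is why a better idea would still be welcome.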
  