The following issue has been UPDATED. 
====================================================================== 
http://www.dbmail.org/mantis/view.php?id=363 
====================================================================== 
Reported By:                ryo
Assigned To:                
====================================================================== 
Project:                    DBMail
Issue ID:                   363
Category:                   General
Reproducibility:            sometimes
Severity:                   minor
Priority:                   normal
Status:                     resolved
target:                     2.1.7 
Resolution:                 fixed
Fixed in Version:           SVN Trunk
====================================================================== 
Date Submitted:             12-Jun-06 09:22 CEST
Last Modified:              22-Jun-06 12:53 CEST
====================================================================== 
Summary:                    Sometimes the count of grandchild processes does not
decrease.
Description: 
I'm sorry, my English is poor.

After many connections to dbmail-imapd, the count of grandchild 
processes sometimes never decreases back to NCHILDREN.

Using the strace command, I found that the child process of 
dbmail-imapd was blocked in waitpid(), as follows.

  [EMAIL PROTECTED] ~]# strace -p 21208
  Process 21208 attached - interrupt to quit
  waitpid(3422,

When I sent SIGTERM to the grandchild process (pid 3422 in the example
above) with the kill command, the child process resumed and the count of 
grandchild processes decreased.

I think the cause is that waitpid() is called without the WNOHANG 
option in pool.c:reap_child(). Is this intentional? 
Any ideas?
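
The difference between a blocking and a non-blocking waitpid() can be
sketched in C as follows. This is a minimal illustration, not DBMail code:
spawn_stubborn_child() and try_reap() are hypothetical names, and a child
that ignores SIGTERM is only an assumed stand-in for a process that does
not exit when signalled.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that ignores SIGTERM and sleeps.
 * SIGTERM is set to SIG_IGN before fork() so the child inherits the
 * disposition without a race; the parent's old disposition is then
 * restored. Returns the child's pid, or -1 if fork() failed. */
pid_t spawn_stubborn_child(void)
{
    void (*old)(int) = signal(SIGTERM, SIG_IGN);
    pid_t pid = fork();

    if (pid == 0) {
        for (;;)
            pause();          /* child: never exits on its own */
    }
    signal(SIGTERM, old);     /* parent (or fork failure) */
    return pid;
}

/* Hypothetical non-blocking version of the kill()/waitpid() step.
 * Returns 1 if the child was reaped, 0 if it is still running,
 * and -1 if waitpid() failed. */
int try_reap(pid_t pid)
{
    pid_t w;

    assert(pid > 0);          /* kill(0 or -1, ...) would hit other processes */
    kill(pid, SIGTERM);

    /* With flags == 0 this call would block until the child exits.
     * With WNOHANG it returns 0 at once if the child is still alive. */
    w = waitpid(pid, NULL, WNOHANG);
    if (w == pid)
        return 1;
    if (w == 0)
        return 0;
    return -1;
}
```

With a child that ignores SIGTERM, try_reap() returns 0 immediately, whereas
waitpid(pid, NULL, 0) would not return until the child is ended by some
other means.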

======================================================================
Relationships       ID      Summary
----------------------------------------------------------------------
related to          0000361 IMAP zombies after about a day.
====================================================================== 

---------------------------------------------------------------------- 
 aaron - 12-Jun-06 18:13  
---------------------------------------------------------------------- 
For bug http://www.dbmail.org/mantis/view.php?id=361, I removed a trigger of
this bug, but it looks like the core
issue is reaping the exit status from child processes. 

---------------------------------------------------------------------- 
 kaname - 13-Jun-06 07:05  
---------------------------------------------------------------------- 
I think the parameters of waitpid() should be changed as follows.

Note that the caller blocks in waitpid() when kill() fails.
kill() almost always succeeds, but it sometimes fails.

A pid for which kill() failed will be killed again after a while,
because reap_child() is called again later.

-------------------------------------------------------------
# diff -urN -U 9 pool.c~ pool.c
--- pool.c~     2006-06-09 11:31:11.000000000 +0900
+++ pool.c      2006-06-13 13:47:44.939044486 +0900
@@ -461,19 +461,19 @@
 static pid_t reap_child()
 {
        pid_t chpid=0;

        if ((chpid = get_idle_spare()) < 0)
                return chpid;

        kill(chpid, SIGTERM);

-       if (waitpid(chpid, NULL, 0) == chpid)
+       if (waitpid(chpid, NULL, WNOHANG|WUNTRACED) == chpid)
                scoreboard_release(chpid);

        return chpid;

 }
 void manage_spare_children()
 {
        /*
         *
--------------------------------------------------------------- 

---------------------------------------------------------------------- 
 aaron - 13-Jun-06 08:50  
---------------------------------------------------------------------- 
This code example is clipped from man 2 waitpid on Linux:

               do {
                   w = waitpid(cpid, &status, WUNTRACED | WCONTINUED);
                   if (w == -1) {
                       perror("waitpid");
                       exit(EXIT_FAILURE);
                   }

                   if (WIFEXITED(status)) {
                       printf("exited, status=%d\n", WEXITSTATUS(status));
                   } else if (WIFSIGNALED(status)) {
                       printf("killed by signal %d\n", WTERMSIG(status));
                   } else if (WIFSTOPPED(status)) {
                       printf("stopped by signal %d\n", WSTOPSIG(status));
                   } else if (WIFCONTINUED(status)) {
                       printf("continued\n");
                   }
               } while (!WIFEXITED(status) && !WIFSIGNALED(status));
               exit(EXIT_SUCCESS);

It would at least be interesting to log the status of the unreapable
children. Reading through the pool.c code, I would like to make sure that
get_idle_spare does not return the same stopped child process every time.
If we want to scale down, we should loop through the idle children and try
killing each one. If some are stuck, we'll skip them and go on till we hit
the target population. Right? 
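
A rough sketch of that loop in C, under stated assumptions: this is not the
pool.c code, all function names here (spawn_idle_child,
terminate_with_timeout, scale_down) are hypothetical, the idle children are
assumed to be available as a plain array of pids with every slot populated,
and the 50 x 10 ms polling window is an arbitrary choice.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: fork an idle child. If ignore_term is nonzero the
 * child ignores SIGTERM, which simulates a child that cannot be reaped. */
pid_t spawn_idle_child(int ignore_term)
{
    void (*old)(int) = signal(SIGTERM, ignore_term ? SIG_IGN : SIG_DFL);
    pid_t pid = fork();

    if (pid == 0) {
        for (;;)
            pause();
    }
    signal(SIGTERM, old);
    return pid;
}

/* Send SIGTERM and poll with WNOHANG for a bounded time, logging the
 * exit status. Returns 1 if the child was reaped, 0 if it was skipped. */
int terminate_with_timeout(pid_t pid)
{
    struct timespec step = { 0, 10 * 1000 * 1000 };   /* 10 ms */
    int status = 0;
    int i;

    assert(pid > 0);
    if (kill(pid, SIGTERM) == -1)
        return 0;

    for (i = 0; i < 50; i++) {
        pid_t w = waitpid(pid, &status, WNOHANG);
        if (w == pid) {
            if (WIFEXITED(status))
                fprintf(stderr, "child %d exited, status=%d\n",
                        (int)pid, WEXITSTATUS(status));
            else if (WIFSIGNALED(status))
                fprintf(stderr, "child %d killed by signal %d\n",
                        (int)pid, WTERMSIG(status));
            return 1;
        }
        if (w == -1)
            return 0;
        nanosleep(&step, NULL);
    }
    fprintf(stderr, "child %d did not exit, skipping\n", (int)pid);
    return 0;
}

/* Walk the idle children until the population is down to target.
 * Reaped slots are set to 0; stuck children are left in place.
 * Returns the remaining population. */
size_t scale_down(pid_t *children, size_t n, size_t target)
{
    size_t alive = n;
    size_t i;

    for (i = 0; i < n && alive > target; i++) {
        if (children[i] > 0 && terminate_with_timeout(children[i])) {
            children[i] = 0;
            alive--;
        }
    }
    return alive;
}
```

Because each child gets a bounded wait, one stuck child costs at most one
polling window and cannot keep the parent from reaching the others.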

---------------------------------------------------------------------- 
 kaname - 13-Jun-06 12:25  
---------------------------------------------------------------------- 
The child process does not hang. The parent process hangs.
Zombie processes are not related to this.

The parent process must be prevented from blocking
by passing WNOHANG to waitpid().

The parent process must not block just because kill() on
the child process failed. 

---------------------------------------------------------------------- 
 paul - 21-Jun-06 16:36  
---------------------------------------------------------------------- 
fixed in svn 

---------------------------------------------------------------------- 
 paul - 22-Jun-06 12:53  
---------------------------------------------------------------------- 
I'm changing approach because Ryo's patch doesn't work 

Issue History 
Date Modified   Username       Field                    Change               
====================================================================== 
12-Jun-06 09:22 ryo            New Issue                                    
12-Jun-06 18:11 aaron          Relationship added       related to 0000361  
12-Jun-06 18:13 aaron          Note Added: 0001244                          
12-Jun-06 18:19 aaron          Relationship added       child of 0000364    
13-Jun-06 07:05 kaname         Note Added: 0001246                          
13-Jun-06 08:50 aaron          Note Added: 0001247                          
13-Jun-06 12:25 kaname         Note Added: 0001249                          
21-Jun-06 16:36 paul           target                    => 2.1.7           
21-Jun-06 16:36 paul           Note Added: 0001260                          
21-Jun-06 16:36 paul           Status                   new => resolved     
21-Jun-06 16:36 paul           Resolution               open => fixed       
21-Jun-06 16:36 paul           Fixed in Version          => SVN Trunk       
22-Jun-06 12:53 paul           Note Added: 0001267                          
======================================================================
