A NOTE has been added to this issue.
======================================================================
http://www.dbmail.org/mantis/view.php?id=363
======================================================================
Reported By:                ryo
Assigned To:
======================================================================
Project:                    DBMail
Issue ID:                   363
Category:                   General
Reproducibility:            sometimes
Severity:                   minor
Priority:                   normal
Status:                     new
target:
======================================================================
Date Submitted:             12-Jun-06 09:22 CEST
Last Modified:              13-Jun-06 12:25 CEST
======================================================================
Summary:                    Sometimes the count of grandchild processes does
                            not decrease.
Description:
I'm sorry, my English is poor.
After many accesses to dbmail-imapd, the count of grandchild processes sometimes stays above NCHILDREN and never decreases. Using the strace command, I found that the child process of dbmail-imapd was stuck in waitpid(), as follows.

[EMAIL PROTECTED] ~]# strace -p 21208
Process 21208 attached - interrupt to quit
waitpid(3422,

I sent SIGTERM to the grandchild process (pid 3422 in the example above) with the kill command. The child process then resumed, and the count of grandchild processes decreased.

I think the cause is that waitpid() is called without the WNOHANG option in pool.c:reap_child().
Is this intentional? Any ideas?
======================================================================
Relationships       ID      Summary
----------------------------------------------------------------------
related to          0000361 IMAP zombies after about a day.
======================================================================

----------------------------------------------------------------------
 aaron - 12-Jun-06 18:13
----------------------------------------------------------------------
For bug http://www.dbmail.org/mantis/view.php?id=361, I removed a trigger of this bug, but it looks like the core issue is reaping the exit status from child processes.

----------------------------------------------------------------------
 kaname - 13-Jun-06 07:05
----------------------------------------------------------------------
I think the waitpid() parameters should be changed as follows. Note that processing stops in waitpid() when kill() fails. kill() almost always succeeds, but it sometimes fails. A pid for which kill() failed will be signaled again some time later, because reap_child() is called again.
-------------------------------------------------------------
# diff -urN -U 9 pool.c~ pool.c
--- pool.c~     2006-06-09 11:31:11.000000000 +0900
+++ pool.c      2006-06-13 13:47:44.939044486 +0900
@@ -461,19 +461,19 @@
 static pid_t reap_child()
 {
        pid_t chpid=0;
 
        if ((chpid = get_idle_spare()) < 0)
                return chpid;
 
        kill(chpid, SIGTERM);
 
-       if (waitpid(chpid, NULL, 0) == chpid)
+       if (waitpid(chpid, NULL, WNOHANG|WUNTRACED) == chpid)
                scoreboard_release(chpid);
 
        return chpid;
 }
 
 void manage_spare_children()
 {
        /*
         *
---------------------------------------------------------------

----------------------------------------------------------------------
 aaron - 13-Jun-06 08:50
----------------------------------------------------------------------
This code example is clipped from man 2 waitpid on Linux:

    do {
        w = waitpid(cpid, &status, WUNTRACED | WCONTINUED);
        if (w == -1) { perror("waitpid"); exit(EXIT_FAILURE); }

        if (WIFEXITED(status)) {
            printf("exited, status=%d\n", WEXITSTATUS(status));
        } else if (WIFSIGNALED(status)) {
            printf("killed by signal %d\n", WTERMSIG(status));
        } else if (WIFSTOPPED(status)) {
            printf("stopped by signal %d\n", WSTOPSIG(status));
        } else if (WIFCONTINUED(status)) {
            printf("continued\n");
        }
    } while (!WIFEXITED(status) && !WIFSIGNALED(status));
    exit(EXIT_SUCCESS);

It would at least be interesting to log the status of the unreapable children.

Reading through the pool.c code, I would like to make sure that get_idle_spare does not return the same stopped child process every time. If we want to scale down, we should loop through the idle children and try killing each one. If some are stuck, we'll skip them and go on till we hit the target population. Right?

----------------------------------------------------------------------
 kaname - 13-Jun-06 12:25
----------------------------------------------------------------------
The child process does not stop. It is the parent process that stops. Zombie processes are not related to this.
Setting WNOHANG in the waitpid() options is necessary to keep the parent process from stopping. The parent process must not stop just because kill() on the child process failed.

Issue History
Date Modified      Username       Field                    Change
======================================================================
12-Jun-06 09:22    ryo            New Issue
12-Jun-06 18:11    aaron          Relationship added       related to 0000361
12-Jun-06 18:13    aaron          Note Added: 0001244
12-Jun-06 18:19    aaron          Relationship added       child of 0000364
13-Jun-06 07:05    kaname         Note Added: 0001246
13-Jun-06 08:50    aaron          Note Added: 0001247
13-Jun-06 12:25    kaname         Note Added: 0001249
======================================================================
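
For illustration only, here is a minimal sketch of the non-blocking reap discussed in the notes above, combined with the status logging suggested in note 0001247. It is not the dbmail code. The names try_reap() and spawn_stubborn_child() are hypothetical and do not exist in pool.c, and the return-value convention (1 = reaped, 0 = still running, -1 = error) is an assumption made for this example. The second helper only exists to create a child that ignores SIGTERM, so the "child does not go away" case can be reproduced.

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of pool.c): send `sig` to `chpid` and try
 * to collect its exit status without blocking.
 * Returns 1 if the child was reaped, 0 if it is still running, -1 on error.
 */
static int try_reap(pid_t chpid, int sig)
{
    int status = 0;
    pid_t w;

    if (kill(chpid, sig) == -1) {
        perror("kill");
        return -1;
    }

    /* WNOHANG: return 0 immediately instead of blocking when the child
     * has not terminated yet. */
    w = waitpid(chpid, &status, WNOHANG);
    if (w == -1) {
        perror("waitpid");
        return -1;
    }
    if (w == 0)
        return 0; /* not dead yet; the caller should try again later */

    /* Log how the child ended. */
    if (WIFEXITED(status))
        fprintf(stderr, "child %d exited, status=%d\n",
                (int)chpid, WEXITSTATUS(status));
    else if (WIFSIGNALED(status))
        fprintf(stderr, "child %d killed by signal %d\n",
                (int)chpid, WTERMSIG(status));
    return 1;
}

/*
 * Hypothetical demo helper: fork a child that ignores SIGTERM and idles
 * for up to 30 seconds. SIGTERM is set to "ignore" before fork() so the
 * child inherits that disposition without a race; the parent's previous
 * disposition is restored afterwards.
 */
static pid_t spawn_stubborn_child(void)
{
    void (*old)(int) = signal(SIGTERM, SIG_IGN);
    pid_t pid = fork();

    if (pid == 0) {
        sleep(30);
        _exit(0);
    }
    signal(SIGTERM, old);
    return pid;
}
```

With a child created by spawn_stubborn_child(), try_reap(pid, SIGTERM) returns 0 right away instead of hanging, which is the behaviour requested for the parent in this issue. Because signal delivery is asynchronous, a return value of 0 can also occur for a child that is about to die, so the caller is expected to call the helper again later, in the same way that reap_child() is called repeatedly.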