[Dbmail-dev] [DBMail 0000256]: Invalid child management after database restart etc.

bugtrack Tue, 23 Aug 2005 22:33:48 +0200 (CEST)

A NOTE has been added to this issue. 
====================================================================== 
http://www.dbmail.org/mantis/view.php?id=256 
====================================================================== 
Reported By:                idk
Assigned To:                paul
====================================================================== 
Project:                    DBMail
Issue ID:                   256
Category:                   General
Reproducibility:            always
Severity:                   major
Priority:                   normal
Status:                     feedback
====================================================================== 
Date Submitted:             20-Aug-05 23:54 CEST
Last Modified:              23-Aug-05 22:33 CEST
====================================================================== 
Summary:                    Invalid child management after database restart etc.
Description: 
After the mysql service is stopped, all children are killed by the pool
manager (pool.c, manage_stop_children: General stop requested. Killing
children..). After the mysql service is started again, only
MINSPARECHILDREN children are started, and no further children are
started even when they are requested.
======================================================================


---------------------------------------------------------------------- 
 idk - 21-Aug-05 00:06  
---------------------------------------------------------------------- 
My suggestions are:

1) call manage_start_children() instead of manage_spare_children() after
the database resumes

2) after the db connection resumes, call alarm(10) to restore the alarm
timer (I have not tested whether this is adequate)

3) correct the LIFO and the infinite loop described above
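
Suggestion 2 can be illustrated with a minimal C sketch. It assumes the
pool manager is driven by SIGALRM. The names alarm_fired, on_alarm,
install_alarm_handler and schedule_pool_check are hypothetical and are not
taken from pool.c; only alarm() and sigaction() are real POSIX calls.

```c
/* Ask for POSIX declarations even under strict -std= modes. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical flag: set by the handler, polled by the main loop. */
static volatile sig_atomic_t alarm_fired = 0;

/* The handler only records that the timer expired. */
static void on_alarm(int sig)
{
    (void)sig;
    alarm_fired = 1;
}

/* Install the SIGALRM handler; returns 0 on success, -1 on error. */
static int install_alarm_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGALRM, &sa, NULL);
}

/*
 * alarm() is one-shot: it has to be armed again after every expiry, and
 * (per suggestion 2) after the database connection resumes.
 * Returns the seconds left on any alarm it replaces, 0 if none was pending.
 */
static unsigned int schedule_pool_check(unsigned int seconds)
{
    assert(seconds > 0);   /* alarm(0) would cancel the timer, not arm it */
    alarm_fired = 0;
    return alarm(seconds);
}
```

In such a design the main loop would call schedule_pool_check(10) once at
startup, again each time alarm_fired is seen, and once more after a
successful database reconnect, so that a lost timer is always restored.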

---------------------------------------------------------------------- 
 paul - 22-Aug-05 10:15  
---------------------------------------------------------------------- 
I've fixed this problem. There was some faulty logic in
manage_spare_children, the alarm is reset after the database resumes, and
the missing breaks were added. Thanks a lot for working on this. Please
test the latest svn code. 

---------------------------------------------------------------------- 
 idk - 23-Aug-05 18:24  
---------------------------------------------------------------------- 
I added

trace(1, "spare: %d %d", count_children(), count_spare_children());

to manage_spare_children() just before the first loop, started the daemon,
and then made more than MAX connections. This was logged (maillog.txt is
attached, I hope):

Aug 23 17:13:21 start
Aug 23 17:13:21 spare: 5 5
Aug 23 17:13:51 spare: 5 5 - 20s, not 10
Aug 23 17:14:00 connect
Aug 23 17:14:01 spare: 5 4
Aug 23 17:14:05 disconnect
Aug 23 17:14:11 spare: 5 5
Aug 23 17:14:21 spare: 5 5
Aug 23 17:14:31 spare: 5 5
Aug 23 17:14:33+ connect 5 times
Aug 23 17:14:41 spare: 5 0 - last trace of this message, so alarm stopped
in 17:14:41-17:14:50
Aug 23 17:14:41 register children 5-19
Aug 23 17:14:41 child_register failed (21st, ok)

no more messages (alarm)

killall

Aug 23 17:23:31 got signal [15]
Aug 23 17:23:31 stop requested
Aug 23 17:23:31 child [19785] unregistered

all three messages appeared 20 times, but ps ax shows many zombies

19782 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
19783 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
19785 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
19787 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
19789 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
19791 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
19793 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20075 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20077 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20079 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20081 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20083 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20085 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20087 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20089 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20091 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20093 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20095 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20097 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20099 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20101 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20103 ?        Z      0:00 [lt-dbmail-imapd] <defunct>

Killing them one by one by pid had no effect. I tried to start a new
instance, but

Aug 23 17:23:39 File [/var/run/dbmail-imapd.pid] exists

So I deleted it

Aug 23 17:25:35 could not bind address to socket

Sorry, I only have this production server (no test servers). I had to
restart it immediately (due to the zombies), so I cannot test this issue
now; maybe later (tonight UTC+0200, or at the weekend).

---------------------------------------------------------------------- 
 idk - 23-Aug-05 22:01  
---------------------------------------------------------------------- 
Unfortunately, I have reproduced this issue:

- start a daemon (imap)
- register child 5 times
- initializing child_state using slot [0-4]
- spare: 5 5 (my message) many times (I waited about one minute)
- connect (nc localhost imap)
- incoming connection from []
- spare: 5 4
- disconnect (A001 LOGOUT)
- Closing connection for client from IP []
- spare: 5 5
- connect 20 times (nc localhost imap&)
- incoming connection from [] 5 times
- spare: 5 0
- register child 15 times
- initializing child_state using slot [5-19]
- spare: 20 0 (many times, I waited)
- connect a 21st time (MAX + 1)
- child_register failed
- no more spare: %d %d messages, so no more alarm

# ps ax | grep imapd
23938 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
23939 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
23941 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
23943 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
23945 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
23947 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
23949 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25057 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25059 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25124 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25126 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25128 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25218 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25220 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25222 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25224 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25226 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25228 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25230 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25509 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25547 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25549 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd

so 22 processes

# killall lt-dbmail-imapd
# ps ax | grep imapd
23938 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
23939 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
23941 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
23943 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
23945 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
23947 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
23949 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25057 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25059 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25124 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25126 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25128 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25218 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25220 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25222 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25224 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25226 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25228 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25230 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25509 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25547 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25549 ?        Z      0:00 [lt-dbmail-imapd] <defunct>

so 20 zombies (one per child) and two normal processes that did not
respond to killall

# kill -9 23938
# kill -9 23939

solves this; there are no more zombies

I will investigate this further.
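
The ps output above shows defunct children whose parent is still alive. As
general background, and not as a diagnosis of pool.c: on POSIX systems an
exited child stays a zombie until its parent collects it with wait() or
waitpid(). A minimal sketch of non-blocking reaping follows. The names
reap_children and spawn_short_lived_child are hypothetical and are not
functions from dbmail.

```c
/* Ask for POSIX declarations even under strict -std= modes. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Collect every child that has already exited, without blocking.
 * Returns the number of children reaped.
 */
static int reap_children(void)
{
    int reaped = 0;
    int status;
    int saved_errno = errno;   /* waitpid() may set errno to ECHILD */
    pid_t pid;

    /* WNOHANG: return 0 at once if children exist but none has exited. */
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0)
        reaped++;

    /* 0: remaining children still run; -1: no children left, or an error. */
    assert(pid == 0 || pid == -1);
    errno = saved_errno;
    return reaped;
}

/* Demo helper: fork a child that exits at once and so becomes a zombie
 * until the parent reaps it. */
static pid_t spawn_short_lived_child(void)
{
    pid_t pid = fork();

    if (pid == 0)
        _exit(0);
    return pid;   /* child's pid in the parent, -1 if fork() failed */
}
```

A parent that calls something like reap_children() periodically, or after
receiving SIGCHLD, leaves no defunct entries behind. Once the parent itself
exits, remaining zombies are adopted and reaped by init, which would be
consistent with the zombies disappearing after kill -9 of the two
remaining processes.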

---------------------------------------------------------------------- 
 idk - 23-Aug-05 22:33  
---------------------------------------------------------------------- 
Oh yes, "child_register failed" is logged when child_register() returns
-1, which happens when i == scoreboard->conf->maxChildren. The new loop
condition in manage_spare_children(), while ((count_children() <
scoreboard->conf->startChildren) || (count_spare_children() <
scoreboard->conf->minSpareChildren)), stays true even when
count_children() reaches the maximum (the previous condition was ANDed
with count_children() < scoreboard->conf->maxChildren). So if the
condition is meant to say "scale up to at least startChildren, but to no
more than maxChildren, in order to ensure at least minSpareChildren", it
must be like

while (((count_children() < scoreboard->conf->startChildren) ||
(count_spare_children() < scoreboard->conf->minSpareChildren)) &&
(count_children() < scoreboard->conf->maxChildren))

this works excellently, including across a database restart
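
The effect of the extra maxChildren bound can be checked with a small
stand-alone simulation. This is only a sketch: struct pool_conf, struct
pool_state and the sim_* functions are hypothetical stand-ins for the
scoreboard, child_register() and manage_spare_children(), and the
minSpareChildren value used with it is an assumption, since the report
does not state it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant scoreboard->conf fields. */
struct pool_conf {
    int startChildren;
    int minSpareChildren;
    int maxChildren;
};

/* Hypothetical pool state: total children and how many of them are idle. */
struct pool_state {
    int children;
    int spare;
};

/* Loop condition without an upper bound, as quoted in the note. */
static bool need_child_unbounded(const struct pool_conf *c,
                                 const struct pool_state *s)
{
    return (s->children < c->startChildren) ||
           (s->spare < c->minSpareChildren);
}

/* Proposed condition: the same test, additionally bounded by maxChildren. */
static bool need_child_bounded(const struct pool_conf *c,
                               const struct pool_state *s)
{
    return need_child_unbounded(c, s) && (s->children < c->maxChildren);
}

/* Simulated child_register(): fails with -1 once every slot is taken. */
static int sim_child_register(const struct pool_conf *c, struct pool_state *s)
{
    if (s->children >= c->maxChildren)
        return -1;
    s->children++;
    s->spare++;   /* a new child starts out idle */
    return 0;
}

/* Simulated spare-children loop using the bounded condition.
 * Returns the number of failed registrations. */
static int sim_manage_spare_children(const struct pool_conf *c,
                                     struct pool_state *s)
{
    int failures = 0;

    assert(c->startChildren <= c->maxChildren);
    while (need_child_bounded(c, s)) {
        if (sim_child_register(c, s) < 0) {
            failures++;
            break;
        }
    }
    return failures;
}
```

With startChildren = 5, maxChildren = 20 and an assumed minSpareChildren
of 2, a pool of 20 busy children makes need_child_unbounded() true, so the
unbounded loop would keep calling a register function that can only fail.
need_child_bounded() is false in the same state, and the loop exits
without attempting a 21st child.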

I tried to commit pool.c into svn, but I'm not familiar with the svn
command line (I use the TortoiseSVN Windows GUI in other projects). So the
condition above is the only patch:

pool.c
493,494c493,495
<       while ((count_children() < scoreboard->conf->startChildren) ||
<                       (count_spare_children() < scoreboard->conf->minSpareChildren)) {
---
>       while (((count_children() < scoreboard->conf->startChildren) ||
>                       (count_spare_children() < scoreboard->conf->minSpareChildren))
>                       && (count_children() < scoreboard->conf->maxChildren)) {

BUT I still don't know why the alarm stopped after the 21st child, in case
this happens again in the future.

Issue History 
Date Modified   Username       Field                    Change               
====================================================================== 
20-Aug-05 23:54 idk            New Issue                                    
21-Aug-05 00:06 idk            Note Added: 0000848                          
22-Aug-05 10:15 paul           Status                   new => resolved     
22-Aug-05 10:15 paul           Resolution               open => fixed       
22-Aug-05 10:15 paul           Assigned To               => paul            
22-Aug-05 10:15 paul           Note Added: 0000849                          
23-Aug-05 18:24 idk            Status                   resolved => feedback
23-Aug-05 18:24 idk            Resolution               fixed => reopened   
23-Aug-05 18:24 idk            Note Added: 0000872                          
23-Aug-05 18:25 idk            File Added: maillog.txt                      
23-Aug-05 22:01 idk            Note Added: 0000873                          
23-Aug-05 22:33 idk            Note Added: 0000874                          
======================================================================
