A NOTE has been added to this issue.
======================================================================
http://www.dbmail.org/mantis/view.php?id=256
======================================================================
Reported By:                idk
Assigned To:                paul
======================================================================
Project:                    DBMail
Issue ID:                   256
Category:                   General
Reproducibility:            always
Severity:                   major
Priority:                   normal
Status:                     feedback
======================================================================
Date Submitted:             20-Aug-05 23:54 CEST
Last Modified:              24-Aug-05 09:45 CEST
======================================================================
Summary:                    Invalid child management after database restart etc.
Description:
After the mysql service is stopped, all children are killed by the pool manager (pool.c, manage_stop_children: General stop requested. Killing children..). After the mysql service is started again, only MINSPARECHILDREN children were started, and no further children were started even when they were requested.
======================================================================

----------------------------------------------------------------------
 idk - 21-Aug-05 00:06
----------------------------------------------------------------------
My suggestions are:
1) call manage_start_children() instead of manage_spare_children() after the database resumes
2) after the db connection resumes, call alarm(10) to restore the alarm timer (I have not tested whether this is adequate)
3) make the corrections to the LIFO and the infinite loop described above

----------------------------------------------------------------------
 paul - 22-Aug-05 10:15
----------------------------------------------------------------------
I've fixed this problem. There was some faulty logic in manage_spare_children, the alarm is reset after the database resumes, and the missing breaks were added.

Thanks a lot for working on this. Please test the latest svn code.

----------------------------------------------------------------------
 idk - 23-Aug-05 18:24
----------------------------------------------------------------------
I tried adding

trace(1, "spare: %d %d", count_children(), count_spare_children());

to manage_spare_children() just before the first loop. I started the daemon, then made more than MAX connections. This was logged (maillog.txt is attached, I hope):

Aug 23 17:13:21 start
Aug 23 17:13:21 spare: 5 5
Aug 23 17:13:51 spare: 5 5 - 20s, not 10
Aug 23 17:14:00 connect
Aug 23 17:14:01 spare: 5 4
Aug 23 17:14:05 disconnect
Aug 23 17:14:11 spare: 5 5
Aug 23 17:14:21 spare: 5 5
Aug 23 17:14:31 spare: 5 5
Aug 23 17:14:33+ connect 5 times
Aug 23 17:14:41 spare: 5 0 - last trace of this message, so the alarm stopped in 17:14:41-17:14:50
Aug 23 17:14:41 register children 5-19
Aug 23 17:14:41 child_register failed (21st, ok)
no more messages (alarm)
killall
Aug 23 17:23:31 got signal [15]
Aug 23 17:23:31 stop requested
Aug 23 17:23:31 child [19785] unregistered
all three messages appeared 20 times, but ps ax shows many zombies:

19782 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
19783 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
19785 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
19787 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
19789 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
19791 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
19793 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20075 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20077 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20079 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20081 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20083 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20085 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20087 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20089 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20091 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20093 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20095 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20097 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20099 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20101 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
20103 ?        Z      0:00 [lt-dbmail-imapd] <defunct>

Killing them one by one by pid had no effect. I tried to start a new instance, but
Aug 23 17:23:39 File [/var/run/dbmail-imapd.pid] exists
So I deleted it
Aug 23 17:25:35 could not bind address to socket

Sorry, this production server is the only one I have (no test servers), and I had to restart immediately (because of the zombies). I cannot test this issue now, maybe later (tonight UTC+0200, or at the weekend).

----------------------------------------------------------------------
 idk - 23-Aug-05 22:01
----------------------------------------------------------------------
Unfortunately, I have confirmed this issue:
- start a daemon (imap)
  - register child 5 times
  - initializing child_state using slot [0-4]
  - spare: 5 5 (my message) many times (I waited for about one minute)
- connect (nc localhost imap)
  - incoming connection from []
  - spare: 5 4
- disconnect (A001 LOGOUT)
  - Closing connection for client from IP []
  - spare: 5 5
- connect 20 times (nc localhost imap&)
  - incoming connection from [] 5 times
  - spare: 5 0
  - register child 15 times
  - initializing child_state using slot [5-19]
  - spare: 20 0 (many times, I waited)
- connect a 21st time (MAX + 1)
  - child_register failed
  - no more "spare: %d %d" messages, so no more alarm

# ps ax | grep imapd
23938 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
23939 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
23941 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
23943 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
23945 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
23947 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
23949 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25057 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25059 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25124 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25126 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25128 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25218 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25220 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25222 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25224 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25226 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25228 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25230 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25509 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25547 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
25549 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd

so 22 processes

# killall lt-dbmail-imapd
# ps ax | grep imapd
23938 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
23939 ?        S      0:00 /_/dbmail/dbmail-2.0/.libs/lt-dbmail-imapd
23941 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
23943 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
23945 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
23947 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
23949 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25057 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25059 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25124 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25126 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25128 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25218 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25220 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25222 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25224 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25226 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25228 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25230 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25509 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25547 ?        Z      0:00 [lt-dbmail-imapd] <defunct>
25549 ?        Z      0:00 [lt-dbmail-imapd] <defunct>

so 20 zombies (one per child) and two normal processes that killall did not kill

# kill -9 23938
# kill -9 23939

solves this; there are no more zombies.

I will investigate this further.
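
The <defunct> entries above are processes that have exited but whose exit status has not yet been collected by their parent; the fact that they disappeared once the two remaining processes were killed is consistent with that. As a general POSIX illustration only (this report does not show how pool.c collects its children, and the function name reap_exited_children is hypothetical), a non-blocking reap loop might look like this:

```c
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical helper, not taken from pool.c.
 * Collects every child that has already exited, without blocking.
 * Returns the number of children reaped by this call. */
int reap_exited_children(void)
{
    int reaped = 0;
    int status;

    /* waitpid() returns a pid (> 0) for each reaped child, 0 when children
     * exist but none has exited yet, and -1 when there are no children. */
    while (waitpid(-1, &status, WNOHANG) > 0)
        reaped++;

    return reaped;
}
```

A parent would typically call such a helper from its main loop after SIGCHLD has been noted, or periodically from the same place that runs the spare-children check.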

----------------------------------------------------------------------
 idk - 23-Aug-05 22:33
----------------------------------------------------------------------
Oh yes, "child_register failed" is logged when child_register() returns -1. This happens when i == scoreboard->conf->maxChildren, because the new loop condition in manage_spare_children()

   while ((count_children() < scoreboard->conf->startChildren) ||
          (count_spare_children() < scoreboard->conf->minSpareChildren))

is true even when count_children() goes over max (the previous condition was ANDed with count_children() < scoreboard->conf->maxChildren). So if the condition is meant to be "scale up to at least startChildren, but to no more than maxChildren, in order to ensure at least minSpareChildren", it must be like

   while (((count_children() < scoreboard->conf->startChildren) ||
           (count_spare_children() < scoreboard->conf->minSpareChildren))
          && (count_children() < scoreboard->conf->maxChildren))

This works excellently, including after a database restart.

I tried to commit pool.c into svn, but I'm not familiar with the svn command line (I use the TortoiseSVN Win GUI in other projects). So the condition above is the only patch:

pool.c
493,494c493,495
<       while ((count_children() < scoreboard->conf->startChildren) ||
<              (count_spare_children() < scoreboard->conf->minSpareChildren)) {
---
>       while (((count_children() < scoreboard->conf->startChildren) ||
>               (count_spare_children() < scoreboard->conf->minSpareChildren))
>              && (count_children() < scoreboard->conf->maxChildren)) {

BUT I don't know why the alarm stopped (after the 21st child), in case this happens again in the future.

----------------------------------------------------------------------
 paul - 24-Aug-05 09:45
----------------------------------------------------------------------
idk,

Please try the latest svn updates. It looks much better now in my tests.
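
The bounded loop condition proposed in the 23-Aug-05 22:33 note can be exercised in isolation. The following is a minimal sketch, not the actual pool.c: the pool_conf_t struct, the two predicate functions and forks_needed are hypothetical names, and only the three field names and the shape of the two conditions come from the report. It assumes, for simplicity, that every newly forked child is idle (a spare).

```c
#include <stdbool.h>

/* Hypothetical stand-in for the pool configuration. The field names follow
 * the identifiers quoted in the report; the struct itself is assumed. */
typedef struct {
    int startChildren;
    int minSpareChildren;
    int maxChildren;
} pool_conf_t;

/* Condition as quoted for the unpatched loop: no upper bound on children. */
bool wants_child_unbounded(const pool_conf_t *conf, int children, int spare)
{
    return (children < conf->startChildren) ||
           (spare < conf->minSpareChildren);
}

/* Condition proposed in the patch: the same test, capped at maxChildren. */
bool wants_child_bounded(const pool_conf_t *conf, int children, int spare)
{
    return wants_child_unbounded(conf, children, spare) &&
           (children < conf->maxChildren);
}

/* Count how many forks the bounded loop would perform, assuming each new
 * child starts out idle. Terminates because children grows towards
 * maxChildren on every iteration. */
int forks_needed(const pool_conf_t *conf, int children, int spare)
{
    int forks = 0;

    while (wants_child_bounded(conf, children, spare)) {
        children++;
        spare++;
        forks++;
    }
    return forks;
}
```

With assumed settings of startChildren = 5, minSpareChildren = 2 and maxChildren = 20, a pool at 20 children and 0 spare still satisfies the unbounded condition but not the bounded one, so no further fork is attempted. That corresponds to the "spare: 20 0" state in which the 21st child_register() failed.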

Issue History
Date Modified    Username  Field                Change
======================================================================
20-Aug-05 23:54  idk       New Issue
21-Aug-05 00:06  idk       Note Added: 0000848
22-Aug-05 10:15  paul      Status               new => resolved
22-Aug-05 10:15  paul      Resolution           open => fixed
22-Aug-05 10:15  paul      Assigned To           => paul
22-Aug-05 10:15  paul      Note Added: 0000849
23-Aug-05 18:24  idk       Status               resolved => feedback
23-Aug-05 18:24  idk       Resolution           fixed => reopened
23-Aug-05 18:24  idk       Note Added: 0000872
23-Aug-05 18:25  idk       File Added: maillog.txt
23-Aug-05 22:01  idk       Note Added: 0000873
23-Aug-05 22:33  idk       Note Added: 0000874
24-Aug-05 09:45  paul      Note Added: 0000875
======================================================================