The following issue has been RESOLVED. ====================================================================== http://www.dbmail.org/mantis/view.php?id=361 ====================================================================== Reported By: blake Assigned To: paul ====================================================================== Project: DBMail Issue ID: 361 Category: IMAP daemon Reproducibility: always Severity: minor Priority: normal Status: resolved target: 2.1.7 Resolution: fixed Fixed in Version: SVN Trunk ====================================================================== Date Submitted: 09-Jun-06 01:36 CEST Last Modified: 20-Jul-06 16:00 CEST ====================================================================== Summary: IMAP zombies after about a day. Description: After starting up dbmail-imapd, I start to see zombies after about a day's use. I am the only regular user of the server, with the current Thunderbird client. Usually when it happens, I'll notice some weird laggy behavior on the client, check the server, and sure enough there will be one or more zombies. Here is a level 5 trace from the last command a child handled before becomign a zombie:
Jun 8 19:04:32 colo dbmail/imap4d[7313]: COMMAND: [6 UID fetch 1332440 (UID RFC822.SIZE BODY.PEEK[])] Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[0]: 'fetch' Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[1]: '1332440' Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[2]: '(' Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[3]: 'UID' Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[4]: 'RFC822.SIZE' Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[5]: 'BODY.PEEK' Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[6]: '[' Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[7]: ']' Jun 8 19:04:32 colo dbmail/imap4d[7313]: arg[8]: ')' Jun 8 19:04:32 colo dbmail/imap4d[7313]: imap4.c,IMAPClientHandler: Executing command uid... Jun 8 19:04:32 colo dbmail/imap4d[7313]: db.c,db_acl_has_right: checking ACL [read_flag] for user [1] on mailbox [5] Jun 8 19:04:32 colo dbmail/imap4d[7313]: db.c, db_acl_has_right: mailbox [5] is owned by user [1], is that also [1]? Jun 8 19:04:32 colo dbmail/imap4d[7313]: db.c, db_acl_has_right: mailbox [5] is owned by user [1], giving all rights Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: parse args[2] = [UID] Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: args[idx = 2] = UID (returning 3) Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: parse args[3] = [RFC822.SIZE] Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: args[idx = 3] = RFC822.SIZE (returning 4) Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: parse args[4] = [BODY.PEEK] Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmail-imapsession.c,_imap_session_fetch_parse_octet_range: parse token [)] Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: parse args[7] = [)] Jun 8 19:04:32 colo dbmail/imap4d[7313]: imapcommands.c,_ic_fetch: fetch_start [1332440] fetch_end [1332440] Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmysql.c,db_query: query [SELECT seen_flag, answered_flag, deleted_flag, flagged_flag, draft_flag, recent_flag, DATE_FORMAT(internal_date, '%Y-%m-%d %T'), rfcsize, message_idnr FROM dbmail_messages msg, dbmail_physmessage pm WHERE pm.id = msg.physmessage_id AND message_idnr BETWEEN '1332440' AND '1332440' AND mailbox_idnr = '5' AND status IN ('0','1') ORDER BY message_idnr ASC] Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [* 1981 FETCH (] Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [RFC822.SIZE 5541] Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [ ] Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [UID 1332440] Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [ ] Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmail-imapsession.c,_imap_cache_update: cache message [1332440] filter [1] Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmysql.c,db_query: query [SELECT physmessage_id FROM dbmail_messages WHERE message_idnr = '1332440'] Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmysql.c,db_query: query [SELECT messageblk FROM dbmail_messageblks WHERE physmessage_id = '810148' ORDER BY messageblk_idnr] Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmail-message.c,_set_content_from_stream: parse message Jun 8 19:04:32 colo dbmail/imap4d[7313]: dbmail-imapsession.c,_imap_cache_update: cache size [1088] Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [BODY[] {1088} ] Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [) ] Jun 8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [6 OK UID FETCH completed ] Jun 8 19:04:32 colo dbmail/imap4d[7313]: imap4.c,IMAPClientHandler: Finished command uid [0] Jun 8 19:09:33 colo dbmail/imap4d[7313]: dbmail-imapsession.c,dbmail_imap_session_readln: error reading from client Jun 8 19:09:33 colo dbmail/imap4d[7313]: imap4.c,IMAPClientHandler: error reading command -- bailing out Jun 8 19:09:33 colo dbmail/imap4d[7313]: dbmail-imapsession.c,dbmail_imap_session_printf: write to client socket failed FWIW: I also noticed this behavior with 2.0.10, but only used that for a short time before upgrading to trunk. ====================================================================== Relationships ID Summary ---------------------------------------------------------------------- related to 0000370 waitpid() in ParentSigHander() function... related to 0000363 Somtimes the count of grandchild proces... child of 0000303 Check for signal-safe calls in the pool... ====================================================================== ---------------------------------------------------------------------- aaron - 09-Jun-06 17:57 ---------------------------------------------------------------------- Please tell a little more about what's going on. Do all of the processes zombie? Only the parent? Only the middle? Only the children? Like this output from ps (trimmed a little) parent: root 9809 1 /usr/local/sbin/dbmail-lmtpd middle: nobody 9811 9809 /usr/local/sbin/dbmail-lmtpd children: nobody 9816 9811 /usr/local/sbin/dbmail-lmtpd nobody 9818 9811 /usr/local/sbin/dbmail-lmtpd ---------------------------------------------------------------------- aaron - 09-Jun-06 18:09 ---------------------------------------------------------------------- Ok, I think I found it. dbmail_imap_session_printf has a TRACE_FATAL when "write to client socket failed". I changed them to TRACE_ERROR with return code of -2 (unrecoverable errors) -- although callers clean up and die on return code -1, too, I figure it might be useful down the road to know what happened. Please test current SVN. Now that should send a SIGCHLD to the parent process letting it know that the child died. For some reason that SIGCHLD is probably not getting through, and the child process remains a zombie until its status is reported to the parent. Search for 'zombie' on this page for some good reading: http://www.win.tue.nl/~aeb/linux/lk/lk-5.html ---------------------------------------------------------------------- aaron - 11-Jun-06 02:34 ---------------------------------------------------------------------- I think I got this. ---------------------------------------------------------------------- aaron - 13-Jun-06 08:55 ---------------------------------------------------------------------- Wait just a minute. child_unregister is called with a child process exits, and it sets the STATE_WAIT flag so that the parent can reap the return process. Except I can't find the code that looks for STATE_WAIT children! In fact, reap_child should probably be recoded to use the same method as child_unregister -- a variant that takes a pid as an argument and gives it STATE_WAIT status. reap_child would kill the process, then mark the scoreboard entry as STATE_WAIT. The next call to manage_spare_children (or a new function, like manage_zombie_children ;-) would reap all STATE_WAIT statuses. ---------------------------------------------------------------------- paul - 03-Jul-06 12:55 ---------------------------------------------------------------------- I'm reopening this bug because it's still unresolved. I'm working on cleaning up the pooling code. ---------------------------------------------------------------------- paul - 03-Jul-06 12:59 ---------------------------------------------------------------------- I've committed a cleanup of the preforking code that seems to address this issue. I've been doing some heavy load-tests on this code, and it now appears to hold up to it's task. Please test svn-trunk ---------------------------------------------------------------------- ryo - 04-Jul-06 11:14 ---------------------------------------------------------------------- I have tested svn 2192. But I had any zombies of dbmail-pop3d. I could know by using strace command that the middle process of dbmail-pop3d stopped at the futex() as follows. [EMAIL PROTECTED] ~]# strace -p 13841 Process 13841 attached - interrupt to quit futex(0xc252ac, FUTEX_WAIT, 2, NULL And gdb trace is as follows. [EMAIL PROTECTED] ~]# gdb GNU gdb Red Hat Linux (6.1post-1.20040607.43rh) Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i386-redhat-linux-gnu". (gdb) attach 13841 Attaching to process 13841 : (snip) 0x00bd5bc9 in __lll_mutex_lock_wait () from /lib/tls/i686/libc.so.6 (gdb) where http://www.dbmail.org/mantis/view.php?id=0 0x00bd5bc9 in __lll_mutex_lock_wait () from /lib/tls/i686/libc.so.6 http://www.dbmail.org/mantis/view.php?id=1 0x00bc722b in _L_mutex_lock_541 () from /lib/tls/i686/libc.so.6 http://www.dbmail.org/mantis/view.php?id=2 0x00000000 in ?? () I think maybe that the cause of this problem is that the signal-unsafe (printf(), malloc()...)function is called in the signal handler. For example, the trace() function in dbmail calls g_strdup_vprintf() and fprintf(). The signal-safe functions are defined in SUSv3. http://www.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html#tag_02_04_03 I am going to test the dbmail without signal-unsafe function. ---------------------------------------------------------------------- ryo - 18-Jul-06 07:02 ---------------------------------------------------------------------- I continued testing the svn 2192 including signal-unsafe coding. Then the futex deadlock (like 0001292) occurred in not only middle process but also child process. I think that the cause of this deadlock is to call client_close() and disconnect_all()... in serverchild.c:active_child_sig_handle. (Becase the free() and fclose(), are not signal-safe functions.) "Secure Programming for Linux and Unix HOWTO" describes as follows. - Where possible, have your signal handlers unconditionally set a specific flag and do nothing else. Secure Programming for Linux and Unix HOWTO: http://www.dwheeler.com/secure-programs/Secure-Programs-HOWTO/signals.html I created the patch in consideration of this document. Please see the attached file(dbmail-signalsafe.patch). I patched this patch to svn 2192 and I am continuing a stress test for four days. So, I have no zombie and no deadlock yet. The digest of this patch is as follows. - only set flags in signal handlers. - The signal processings are performed at the outside of signal handlers. - The data type of flags is volatile sig_atomic_t. - nodaemonize mode have no signal processing yet. (any idea?) Please check my patch(dbmail-signalsafe.patch). ---------------------------------------------------------------------- paul - 18-Jul-06 13:07 ---------------------------------------------------------------------- patch accepted. Nice work Ryo! Issue History Date Modified Username Field Change ====================================================================== 09-Jun-06 01:36 blake New Issue 09-Jun-06 17:57 aaron Note Added: 0001236 09-Jun-06 18:09 aaron Note Added: 0001237 11-Jun-06 02:34 aaron Status new => resolved 11-Jun-06 02:34 aaron Fixed in Version => SVN Trunk 11-Jun-06 02:34 aaron Resolution open => fixed 11-Jun-06 02:34 aaron Assigned To => aaron 11-Jun-06 02:34 aaron Note Added: 0001240 12-Jun-06 18:11 aaron Relationship added related to 0000363 13-Jun-06 08:55 aaron Note Added: 0001248 03-Jul-06 12:55 paul Note Added: 0001290 03-Jul-06 12:55 paul Resolution fixed => reopened 03-Jul-06 12:56 paul Assigned To aaron => paul 03-Jul-06 12:56 paul Status resolved => assigned 03-Jul-06 12:56 paul Relationship added related to 0000370 03-Jul-06 12:59 paul Note Added: 0001291 04-Jul-06 11:14 ryo Note Added: 0001292 04-Jul-06 15:46 paul Relationship added child of 0000303 18-Jul-06 07:02 ryo Note Added: 0001304 18-Jul-06 07:02 ryo File Added: dbmail-signalsafe.patch 18-Jul-06 13:07 paul Note Added: 0001305 20-Jul-06 16:00 paul target => 2.1.7 20-Jul-06 16:00 paul Status assigned => resolved 20-Jul-06 16:00 paul Resolution reopened => fixed ======================================================================