The following issue has been RESOLVED. 
====================================================================== 
http://www.dbmail.org/mantis/view.php?id=361 
====================================================================== 
Reported By:                blake
Assigned To:                paul
====================================================================== 
Project:                    DBMail
Issue ID:                   361
Category:                   IMAP daemon
Reproducibility:            always
Severity:                   minor
Priority:                   normal
Status:                     resolved
target:                     2.1.7 
Resolution:                 fixed
Fixed in Version:           SVN Trunk
====================================================================== 
Date Submitted:             09-Jun-06 01:36 CEST
Last Modified:              20-Jul-06 16:00 CEST
====================================================================== 
Summary:                    IMAP zombies after about a day.
Description: 
After starting up dbmail-imapd, I start to see zombies after about a day's
use. I am the only regular user of the server, with the current
Thunderbird client. Usually when it happens, I'll notice some weird laggy
behavior on the client, check the server, and sure enough there will be
one or more zombies. Here is a level 5 trace from the last command a child
handled before becomign a zombie:

Jun  8 19:04:32 colo dbmail/imap4d[7313]: COMMAND: [6 UID fetch 1332440
(UID RFC822.SIZE BODY.PEEK[])]
Jun  8 19:04:32 colo dbmail/imap4d[7313]: arg[0]: 'fetch'
Jun  8 19:04:32 colo dbmail/imap4d[7313]: arg[1]: '1332440'
Jun  8 19:04:32 colo dbmail/imap4d[7313]: arg[2]: '('
Jun  8 19:04:32 colo dbmail/imap4d[7313]: arg[3]: 'UID'
Jun  8 19:04:32 colo dbmail/imap4d[7313]: arg[4]: 'RFC822.SIZE'
Jun  8 19:04:32 colo dbmail/imap4d[7313]: arg[5]: 'BODY.PEEK'
Jun  8 19:04:32 colo dbmail/imap4d[7313]: arg[6]: '['
Jun  8 19:04:32 colo dbmail/imap4d[7313]: arg[7]: ']'
Jun  8 19:04:32 colo dbmail/imap4d[7313]: arg[8]: ')'
Jun  8 19:04:32 colo dbmail/imap4d[7313]: imap4.c,IMAPClientHandler:
Executing command uid...
Jun  8 19:04:32 colo dbmail/imap4d[7313]: db.c,db_acl_has_right: checking
ACL [read_flag] for user [1] on mailbox [5]
Jun  8 19:04:32 colo dbmail/imap4d[7313]: db.c, db_acl_has_right: mailbox
[5] is owned by user [1], is that also [1]?
Jun  8 19:04:32 colo dbmail/imap4d[7313]: db.c, db_acl_has_right: mailbox
[5] is owned by user [1], giving all rights
Jun  8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: parse args[2] =
[UID]
Jun  8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: args[idx = 2] =
UID (returning 3)
Jun  8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: parse args[3] =
[RFC822.SIZE]
Jun  8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: args[idx = 3] =
RFC822.SIZE (returning 4)
Jun  8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: parse args[4] =
[BODY.PEEK]
Jun  8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,_imap_session_fetch_parse_octet_range: parse token
[)]
Jun  8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_fetch_parse_args: parse args[7] =
[)]
Jun  8 19:04:32 colo dbmail/imap4d[7313]: imapcommands.c,_ic_fetch:
fetch_start [1332440] fetch_end [1332440]
Jun  8 19:04:32 colo dbmail/imap4d[7313]: dbmysql.c,db_query: query
[SELECT seen_flag, answered_flag, deleted_flag, flagged_flag, draft_flag,
recent_flag, DATE_FORMAT(internal_date, '%Y-%m-%d %T'), rfcsize,
message_idnr FROM dbmail_messages msg, dbmail_physmessage pm WHERE pm.id =
msg.physmessage_id AND message_idnr BETWEEN '1332440' AND '1332440' AND
mailbox_idnr = '5' AND status IN ('0','1') ORDER BY message_idnr ASC]
Jun  8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [* 1981 FETCH (]
Jun  8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [RFC822.SIZE 5541]
Jun  8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [ ]
Jun  8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [UID 1332440]
Jun  8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [ ]
Jun  8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,_imap_cache_update: cache message [1332440] filter
[1]
Jun  8 19:04:32 colo dbmail/imap4d[7313]: dbmysql.c,db_query: query
[SELECT physmessage_id FROM dbmail_messages WHERE message_idnr =
'1332440']
Jun  8 19:04:32 colo dbmail/imap4d[7313]: dbmysql.c,db_query: query
[SELECT messageblk FROM dbmail_messageblks WHERE physmessage_id = '810148'
ORDER BY messageblk_idnr]
Jun  8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-message.c,_set_content_from_stream: parse message
Jun  8 19:04:32 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,_imap_cache_update: cache size [1088]
Jun  8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [BODY[] {1088}  ]
Jun  8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [)  ]
Jun  8 19:04:32 colo dbmail/imap4d[7313]: RESPONSE: [6 OK UID FETCH
completed  ]
Jun  8 19:04:32 colo dbmail/imap4d[7313]: imap4.c,IMAPClientHandler:
Finished command uid [0]
Jun  8 19:09:33 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_readln: error reading from
client
Jun  8 19:09:33 colo dbmail/imap4d[7313]: imap4.c,IMAPClientHandler: error
reading command -- bailing out
Jun  8 19:09:33 colo dbmail/imap4d[7313]:
dbmail-imapsession.c,dbmail_imap_session_printf: write to client socket
failed


FWIW: I also noticed this behavior with 2.0.10, but only used that for a
short time before upgrading to trunk.
======================================================================
Relationships       ID      Summary
----------------------------------------------------------------------
related to          0000370 waitpid() in ParentSigHander() function...
related to          0000363 Somtimes the count of grandchild proces...
child of            0000303 Check for signal-safe calls in the pool...
====================================================================== 

---------------------------------------------------------------------- 
 aaron - 09-Jun-06 17:57  
---------------------------------------------------------------------- 
Please tell a little more about what's going on. Do all of the processes
zombie? Only the parent? Only the middle? Only the children?

Like this output from ps (trimmed a little)

parent:
    root      9809     1    /usr/local/sbin/dbmail-lmtpd
middle:
    nobody    9811  9809    /usr/local/sbin/dbmail-lmtpd
children:
    nobody    9816  9811    /usr/local/sbin/dbmail-lmtpd
    nobody    9818  9811    /usr/local/sbin/dbmail-lmtpd 

---------------------------------------------------------------------- 
 aaron - 09-Jun-06 18:09  
---------------------------------------------------------------------- 
Ok, I think I found it. dbmail_imap_session_printf has a TRACE_FATAL when
"write to client socket failed". I changed them to TRACE_ERROR with return
code of -2 (unrecoverable errors) -- although callers clean up and die on
return code -1, too, I figure it might be useful down the road to know
what happened.

Please test current SVN.

Now that should send a SIGCHLD to the parent process letting it know that
the child died. For some reason that SIGCHLD is probably not getting
through, and the child process remains a zombie until its status is
reported to the parent. Search for 'zombie' on this page for some good
reading:

http://www.win.tue.nl/~aeb/linux/lk/lk-5.html 

---------------------------------------------------------------------- 
 aaron - 11-Jun-06 02:34  
---------------------------------------------------------------------- 
I think I got this. 

---------------------------------------------------------------------- 
 aaron - 13-Jun-06 08:55  
---------------------------------------------------------------------- 
Wait just a minute. child_unregister is called with a child process exits,
and it sets the STATE_WAIT flag so that the parent can reap the return
process. Except I can't find the code that looks for STATE_WAIT children!

In fact, reap_child should probably be recoded to use the same method as
child_unregister -- a variant that takes a pid as an argument and gives it
STATE_WAIT status. reap_child would kill the process, then mark the
scoreboard entry as STATE_WAIT. The next call to manage_spare_children (or
a new function, like manage_zombie_children ;-) would reap all STATE_WAIT
statuses. 

---------------------------------------------------------------------- 
 paul - 03-Jul-06 12:55  
---------------------------------------------------------------------- 
I'm reopening this bug because it's still unresolved. I'm working on
cleaning up the pooling code. 

---------------------------------------------------------------------- 
 paul - 03-Jul-06 12:59  
---------------------------------------------------------------------- 
I've committed a cleanup of the preforking code that seems to address this
issue. I've been doing some heavy load-tests on this code, and it now
appears to hold up to it's task. 

Please test svn-trunk 

---------------------------------------------------------------------- 
 ryo - 04-Jul-06 11:14  
---------------------------------------------------------------------- 
I have tested svn 2192. But I had any zombies of dbmail-pop3d.
I could know by using strace command that the middle process of
dbmail-pop3d stopped at the futex() as follows.

  [EMAIL PROTECTED] ~]# strace -p 13841
  Process 13841 attached - interrupt to quit
  futex(0xc252ac, FUTEX_WAIT, 2, NULL

And gdb trace is as follows.

  [EMAIL PROTECTED] ~]# gdb
  GNU gdb Red Hat Linux (6.1post-1.20040607.43rh)
  Copyright 2004 Free Software Foundation, Inc.
  GDB is free software, covered by the GNU General Public License, and you
are
  welcome to change it and/or distribute copies of it under certain
conditions.
  Type "show copying" to see the conditions.
  There is absolutely no warranty for GDB.  Type "show warranty" for
details.
  This GDB was configured as "i386-redhat-linux-gnu".
  (gdb) attach 13841
  Attaching to process 13841
    : (snip)
  0x00bd5bc9 in __lll_mutex_lock_wait () from /lib/tls/i686/libc.so.6
  (gdb) where
  http://www.dbmail.org/mantis/view.php?id=0  0x00bd5bc9 in
__lll_mutex_lock_wait () from /lib/tls/i686/libc.so.6
  http://www.dbmail.org/mantis/view.php?id=1  0x00bc722b in _L_mutex_lock_541 ()
from /lib/tls/i686/libc.so.6
  http://www.dbmail.org/mantis/view.php?id=2  0x00000000 in ?? ()

I think maybe that the cause of this problem is that the signal-unsafe
(printf(), malloc()...)function is called in the 
signal handler.

For example, the trace() function in dbmail calls g_strdup_vprintf() and
fprintf().

The signal-safe functions are defined in SUSv3.
http://www.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html#tag_02_04_03

I am going to test the dbmail without signal-unsafe function. 

---------------------------------------------------------------------- 
 ryo - 18-Jul-06 07:02  
---------------------------------------------------------------------- 
I continued testing the svn 2192 including signal-unsafe coding.
Then the futex deadlock (like 0001292) occurred in not only middle 
process but also child process.

I think that the cause of this deadlock is to call client_close() 
and disconnect_all()... in serverchild.c:active_child_sig_handle.
(Becase the free() and fclose(), are not signal-safe functions.)

"Secure Programming for Linux and Unix HOWTO" describes as follows.

  - Where possible, have your signal handlers unconditionally set 
    a specific flag and do nothing else.

 Secure Programming for Linux and Unix HOWTO:

http://www.dwheeler.com/secure-programs/Secure-Programs-HOWTO/signals.html

I created the patch in consideration of this document. Please see 
the attached file(dbmail-signalsafe.patch).

I patched this patch to svn 2192 and I am continuing a stress test 
for four days. So, I have no zombie and no deadlock yet.

The digest of this patch is as follows.

  - only set flags in signal handlers.
  - The signal processings are performed at the outside of signal
handlers.
  - The data type of flags is volatile sig_atomic_t.
  - nodaemonize mode have no signal processing yet. (any idea?)

Please check my patch(dbmail-signalsafe.patch). 

---------------------------------------------------------------------- 
 paul - 18-Jul-06 13:07  
---------------------------------------------------------------------- 
patch accepted. Nice work Ryo! 

Issue History 
Date Modified   Username       Field                    Change               
====================================================================== 
09-Jun-06 01:36 blake          New Issue                                    
09-Jun-06 17:57 aaron          Note Added: 0001236                          
09-Jun-06 18:09 aaron          Note Added: 0001237                          
11-Jun-06 02:34 aaron          Status                   new => resolved     
11-Jun-06 02:34 aaron          Fixed in Version          => SVN Trunk       
11-Jun-06 02:34 aaron          Resolution               open => fixed       
11-Jun-06 02:34 aaron          Assigned To               => aaron           
11-Jun-06 02:34 aaron          Note Added: 0001240                          
12-Jun-06 18:11 aaron          Relationship added       related to 0000363  
13-Jun-06 08:55 aaron          Note Added: 0001248                          
03-Jul-06 12:55 paul           Note Added: 0001290                          
03-Jul-06 12:55 paul           Resolution               fixed => reopened   
03-Jul-06 12:56 paul           Assigned To              aaron => paul       
03-Jul-06 12:56 paul           Status                   resolved => assigned
03-Jul-06 12:56 paul           Relationship added       related to 0000370  
03-Jul-06 12:59 paul           Note Added: 0001291                          
04-Jul-06 11:14 ryo            Note Added: 0001292                          
04-Jul-06 15:46 paul           Relationship added       child of 0000303    
18-Jul-06 07:02 ryo            Note Added: 0001304                          
18-Jul-06 07:02 ryo            File Added: dbmail-signalsafe.patch              
     
18-Jul-06 13:07 paul           Note Added: 0001305                          
20-Jul-06 16:00 paul           target                    => 2.1.7           
20-Jul-06 16:00 paul           Status                   assigned => resolved
20-Jul-06 16:00 paul           Resolution               reopened => fixed   
======================================================================

Reply via email to