Re: locking problems with 2.1.9
--On Friday, November 8, 2002 7:49 PM -0500 Peter Krotkov <[EMAIL PROTECTED]> wrote:

| Prior to a code fix to address the problems you observed, do you think it
| would be unreasonable to configure master so that imaps is not offered?
| We could revert to running stunnel for ssl support and then take our

Could this also be an entropy issue? On this Solaris 8 box, what are you
using for /dev/random, anyway? That Solaris patch?

Amos
Re: locking problems with 2.1.9
On Fri, 8 Nov 2002, Lawrence Greenfield wrote:

>Date: Fri, 8 Nov 2002 11:04:32 -0500 (EST)
>From: Peter Krotkov <[EMAIL PROTECTED]>
>
> [...]
>22335: imapd -s
> ff09b3bc read (0, 1c4bc8, 6f5)
> 0008e8c8 sock_read (0, 1c4bc8, 6f5, 8e8a0, 18edf8, 1) + 28
> 0008d670 BIO_read (1bd090, 1c4bc8, 6f5, 1c32a8, 1bccf0, 0) + d0
> 0007dec8 ssl3_read_n (5, 2010, 2010, 191b, 0, 0) + 148
> 0007e140 ssl3_get_record (1bb7e0, 1bccf0, 0, 0, 23138, ff0941d8) + 1e0
> 0007e8d4 ssl3_read_bytes (1bb7e0, 17, 1aa610, 1000, 0, 1bccf0) + 1d4
> 0007c6e8 ssl3_read (1bb7e0, 1aa610, 1000, 7c6a0, 19a1ac, 0) + 48
> 0006e730 SSL_read (1bb7e0, 1aa610, 1000, 1, ff0bd194, ffbec4b8) + 70
> 00060524 prot_fill (, 0, 1000, 19cfb0, ffbec7b8, 1) + 340
> 00060d8c prot_read (0, ffbec7b8, 1000, 19cfb0, 1, ffbec7b8) + 6c
> 00050894 message_copy_strict (0, 19cfb0, 8008c, eff8, 1a17a8, ff09c648) + 64
> 00044584 append_fromstream (ffbed830, 1a17a8, 9408c, 3cbce214, 1d0520, 1) + 14c
>
> This one looks like the one that's actually having the problem. If you
> kill this process, everything will return to normal.
>
> What caused this? Well, prot_fill() isn't supposed to call SSL_read if
> SSL_read is going to block. Unfortunately, it doesn't succeed in this
> case.
>
> Really, we should put the SSL socket into non-blocking mode and have
> some additional logic to make sure this doesn't happen. Since the prot
> layer itself is (generally) blocking, it's not totally trivial and we
> haven't done the work.
>
> Finally, there's the larger issue that we lock the mailbox during an
> APPEND which is a Bad Idea, since a client can be arbitrarily slow
> uploading data and thus creates a DoS for other clients. Avoiding this
> probably isn't that hard (the staging code used by lmtpd can probably
> be adapted by imapd) but we haven't done it, either.
>
> At the very least, I'd appreciate it if you open a bug on the SSL
> issue and include the backtrace on bugzilla.andrew.cmu.edu.
>
> Larry
>

Larry,

Thank you for your time and energy in investigating the problem. I will
open a bug for the SSL issue along with a backtrace.

Prior to a code fix to address the problems you observed, do you think it
would be unreasonable to configure master so that imaps is not offered?
We could revert to running stunnel for ssl support and then take our
chances with clients that initiate starttls.

Our client base has become quite accustomed to the overall reliability of
cyrus and would go ballistic with even an occasional imapd/lmtpd going
bonkers :-}.

Many thanks,
Pete
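
The "larger issue" in the quoted text (holding the mailbox lock while a slow client uploads an APPEND) can be illustrated with a staging pattern: receive the whole message into a staging file with no lock held, then take the lock only for the short final move. The following is a minimal Python sketch of that idea, not the Cyrus code; the function name `append_via_staging`, the `stage` subdirectory, and the `mailbox.lock` file are all hypothetical choices made for the example.

```python
import fcntl
import os
import tempfile


def append_via_staging(chunks, mailbox_dir, final_name):
    """Write a message to a staging file, then publish it under a brief lock.

    Hypothetical helper for illustration; names and layout are assumptions.
    """
    stage_dir = os.path.join(mailbox_dir, "stage")
    os.makedirs(stage_dir, exist_ok=True)

    # Phase 1 (slow): no mailbox lock is held while the data arrives.
    fd, stage_path = tempfile.mkstemp(dir=stage_dir)
    with os.fdopen(fd, "wb") as stage:
        for chunk in chunks:
            stage.write(chunk)
        stage.flush()
        os.fsync(stage.fileno())

    # Phase 2 (fast): hold the lock only for the rename into place.
    lock_path = os.path.join(mailbox_dir, "mailbox.lock")
    final_path = os.path.join(mailbox_dir, final_name)
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            os.rename(stage_path, final_path)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return final_path
```

With this split, a stalled upload only leaves a staging file behind; other clients waiting on the mailbox lock are delayed by the rename alone.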
Re: locking problems with 2.1.9
> Date: Fri, 8 Nov 2002 11:04:32 -0500 (EST)
> From: Peter Krotkov <[EMAIL PROTECTED]>
>
> [...]
> 22335: imapd -s
> ff09b3bc read (0, 1c4bc8, 6f5)
> 0008e8c8 sock_read (0, 1c4bc8, 6f5, 8e8a0, 18edf8, 1) + 28
> 0008d670 BIO_read (1bd090, 1c4bc8, 6f5, 1c32a8, 1bccf0, 0) + d0
> 0007dec8 ssl3_read_n (5, 2010, 2010, 191b, 0, 0) + 148
> 0007e140 ssl3_get_record (1bb7e0, 1bccf0, 0, 0, 23138, ff0941d8) + 1e0
> 0007e8d4 ssl3_read_bytes (1bb7e0, 17, 1aa610, 1000, 0, 1bccf0) + 1d4
> 0007c6e8 ssl3_read (1bb7e0, 1aa610, 1000, 7c6a0, 19a1ac, 0) + 48
> 0006e730 SSL_read (1bb7e0, 1aa610, 1000, 1, ff0bd194, ffbec4b8) + 70
> 00060524 prot_fill (, 0, 1000, 19cfb0, ffbec7b8, 1) + 340
> 00060d8c prot_read (0, ffbec7b8, 1000, 19cfb0, 1, ffbec7b8) + 6c
> 00050894 message_copy_strict (0, 19cfb0, 8008c, eff8, 1a17a8, ff09c648) + 64
> 00044584 append_fromstream (ffbed830, 1a17a8, 9408c, 3cbce214, 1d0520, 1) + 14c

This one looks like the one that's actually having the problem. If you
kill this process, everything will return to normal.

What caused this? Well, prot_fill() isn't supposed to call SSL_read if
SSL_read is going to block. Unfortunately, it doesn't succeed in this
case.

Really, we should put the SSL socket into non-blocking mode and have
some additional logic to make sure this doesn't happen. Since the prot
layer itself is (generally) blocking, it's not totally trivial and we
haven't done the work.

Finally, there's the larger issue that we lock the mailbox during an
APPEND which is a Bad Idea, since a client can be arbitrarily slow
uploading data and thus creates a DoS for other clients. Avoiding this
probably isn't that hard (the staging code used by lmtpd can probably
be adapted by imapd) but we haven't done it, either.

At the very least, I'd appreciate it if you open a bug on the SSL
issue and include the backtrace on bugzilla.andrew.cmu.edu.

Larry
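
The non-blocking approach described above can be sketched in a few lines. This is only an illustration in Python, with a plain socket standing in for the TLS connection; it is not how prot_fill() is written, and the helper name `read_with_deadline` is made up for the example. The point is that the read itself never blocks, and all waiting happens in a `select()` call with a bounded timeout.

```python
import select
import socket
import time


def read_with_deadline(sock, max_bytes, timeout):
    """Return up to max_bytes from sock, or None if nothing arrives in time.

    Illustrative only: a plain socket stands in for the TLS connection.
    """
    sock.setblocking(False)
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Never blocks: raises BlockingIOError when no data is ready.
            return sock.recv(max_bytes)
        except BlockingIOError:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return None
            # Wait until the socket is readable or the deadline passes.
            select.select([sock], [], [], remaining)
```

For a TLS connection the same loop shape applies, but "would block" is reported differently (in Python's `ssl` module a non-blocking read raises `ssl.SSLWantReadError`; in OpenSSL's C API, `SSL_get_error()` returns `SSL_ERROR_WANT_READ`), because a readable socket may hold only part of a TLS record.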
Re: locking problems with 2.1.9
> Date: Wed, 6 Nov 2002 14:07:11 -0500 (EST)
> From: Peter Krotkov <[EMAIL PROTECTED]>
>
> > Do the lmtpd acquire or are they _attempting_ to acquire the lock on
> > the cyrus.seen file?
> >
> > Are you using the seen_local backend instead of seen_db? This hasn't
> > been tested by us in a long time; we've been assuming everyone is
> > using seen_db.

Weird. You aren't using seen_local, which means that lmtpd should never
even be trying to acquire a lock on cyrus.seen (this file is used
exclusively read-only).

> --with-duplicate-db=skiplist

Just a warning: you'll probably experience performance problems using
the skiplist backend for duplicate delivery suppression. (If you have
duplicate delivery suppression turned off it probably doesn't matter.)

> We noticed this adventure happening yesterday and, in the end, 'master'
> was stopped and then started. Not a single hiccup since then (about
> 110,000 imap logins and 50,000 messages handed to lmtpd since midnight).
> Should this happen again I'll be sure to record the details concerning
> what process has/wants which locks.

Ok, that would be helpful.

Larry
Re: locking problems with 2.1.9
On Wed, 6 Nov 2002, Lawrence Greenfield wrote:

>Date: Wed, 6 Nov 2002 09:04:56 -0500 (EST)
>From: [EMAIL PROTECTED]
>
>We are experiencing locking problems with cyrus 2.1.9 on a Solaris 8
>system using fcntl and skiplist (except flat for subscriptions).
>We've seen the following issues:
>
> * Lmtpd's acquire a lock on a cyrus.seen file and never get it;
>they stack up as mail comes in.
>
> Do the lmtpd acquire or are they _attempting_ to acquire the lock on
> the cyrus.seen file?
>
> Are you using the seen_local backend instead of seen_db? This hasn't
> been tested by us in a long time; we've been assuming everyone is
> using seen_db.

./configure --with-com_err --prefix=/var/cyrus/local \
    --with-cyrus-prefix=/var/cyrus/local/cyrus \
    --with-cyrus-group=mail --with-sasl=/var/cyrus/local \
    --with-openssl=/usr/local/ssl2 --without-ucdsnmp \
    --with-dbdir=/usr/local/BerkeleyDB.3.3 --with-libwrap=/usr/local \
    --with-duplicate-db=skiplist --with-mboxlist-db=skiplist \
    --with-seen-db=skiplist --with-subs-db=flat --with-tls-db=skiplist

> Who is holding the lock? lsof can tell you.
>
> Larry

We noticed this adventure happening yesterday and, in the end, 'master'
was stopped and then started. Not a single hiccup since then (about
110,000 imap logins and 50,000 messages handed to lmtpd since midnight).
Should this happen again I'll be sure to record the details concerning
what process has/wants which locks.

Thank you for your time,
Pete
Re: locking problems with 2.1.9
> Date: Wed, 6 Nov 2002 09:04:56 -0500 (EST)
> From: [EMAIL PROTECTED]
>
> We are experiencing locking problems with cyrus 2.1.9 on a Solaris 8
> system using fcntl and skiplist (except flat for subscriptions).
> We've seen the following issues:
>
> * Lmtpd's acquire a lock on a cyrus.seen file and never get it;
>   they stack up as mail comes in.

Do the lmtpd acquire or are they _attempting_ to acquire the lock on
the cyrus.seen file?

Are you using the seen_local backend instead of seen_db? This hasn't
been tested by us in a long time; we've been assuming everyone is
using seen_db.

Who is holding the lock? lsof can tell you.

Larry
Re: locking problems with 2.1.9
> Date: Wed, 6 Nov 2002 14:02:52 -0200
> From: Henrique de Moraes Holschuh <[EMAIL PROTECTED]>
>
> On Wed, 06 Nov 2002, John Wade wrote:
> > I assume you are using flat seen files. If so, I ran into this
> > problem on 2.0.16 and came up with a workaround which others
> > ported to 2.1.3. This was based on flock, but you might be able
> > to use the same basic technique. see
> > http://servercc.oakton.edu/~jwade/cyrus/
>
> Yeah, Debian has that patch applied (and forward ported to 2.1.9, both
> fcntl and flock), files lib/lock*... It works wonderfully. I have never
> received any reports of seen file lock troubles in the Debian packages.
>
> I believe it is also in the CMU Bugzilla.

I haven't seen any evidence of this happening on non-Linux platforms,
nor have I heard that there is ever a time when the file gets replaced
even though it is locked.

If it is merely that a process which already holds a lock blocks while
trying to get the same lock again, then it is a kernel problem.

Could we have more detail on any other failures?

Larry
Re: locking problems with 2.1.9
On Wed, 06 Nov 2002, John Wade wrote:
> I assume you are using flat seen files. If so, I ran into this problem on
> 2.0.16 and came up with a workaround which others ported to 2.1.3. This
> was based on flock, but you might be able to use the same basic
> technique. see http://servercc.oakton.edu/~jwade/cyrus/

Yeah, Debian has that patch applied (and forward ported to 2.1.9, both
fcntl and flock), files lib/lock*... It works wonderfully. I have never
received any reports of seen file lock troubles in the Debian packages.

I believe it is also in the CMU Bugzilla.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
Re: locking problems with 2.1.9
Hi Pete,

I assume you are using flat seen files. If so, I ran into this problem on
2.0.16 and came up with a workaround which others ported to 2.1.3. This
was based on flock, but you might be able to use the same basic
technique. See http://servercc.oakton.edu/~jwade/cyrus/

The flat file locking code is very strangely broken. I attributed it to
Linux kernel problems since I could never reproduce the exact scenario
that I saw in my gdb stack traces. Others, however, reported this problem
on enough other platforms (including Solaris) that I think the bug is in
the cyrus code. It will take a far better C programmer than I to track it
down.

What I saw was that the initial process that held the lock that everyone
else was waiting on was invariably an imapd process, and it was trying to
lock a file that it already had a lock on. Meanwhile, even though the file
was locked, other processes had managed to replace it.

The workaround I came up with is to have all attempts at file locks time
out rather than wait indefinitely. This kills the initial imapd process
that has the problem, and the lmtpd's etc. are no longer blocked. For us,
this happens between one and three times a day. (The patch I created logs
it to syslog.)

Hope this helps,
John

[EMAIL PROTECTED] wrote:

> We are experiencing locking problems with cyrus 2.1.9 on a Solaris 8
> system using fcntl and skiplist (except flat for subscriptions).
> We've seen the following issues:
>
> * Lmtpd's acquire a lock on a cyrus.seen file and never get it;
>   they stack up as mail comes in.
> * In syslog we see 'IOERROR: reading message: unexpected end of file'
> * In various partition's 'stage.' directory we see hundreds of
>   messages stacked up waiting for - surprise - users who seem to
>   be having the locking issues.
> * Some users have cyrus.seen.NEW lying around in their folders.
>
> The above problems exist for only a handful of users; the other 12k
> users seem to be user'ing along without difficulty. But when the
> other 18k users move to this box it might get worse...
>
> All users were transferred from a different Solaris system (cyrus
> 1.5.27) to this new one using rsync (mail/folder dirs, quota,
> subscriptions).
>
> Any pointers or suggestions would be helpful!
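
The workaround John describes (lock attempts that time out instead of waiting indefinitely) can be sketched roughly as follows. This is a Python illustration of the general technique, assuming flock-style locks as in his original patch; it is not the patch itself, and the helper name `lock_with_timeout` and the polling interval are invented for the example.

```python
import fcntl
import time


def lock_with_timeout(fileobj, timeout, poll_interval=0.05):
    """Try to take an exclusive flock; give up after `timeout` seconds.

    Returns True if the lock was acquired, False on timeout.
    Hypothetical helper, not part of Cyrus.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # LOCK_NB makes the attempt fail immediately instead of blocking.
            fcntl.flock(fileobj, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except BlockingIOError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_interval)
```

A caller that gets `False` back would log the failure and exit rather than hang, which is what frees the processes queued behind it. Polling is used here for simplicity; an alarm-based timeout around a blocking lock call is another common way to get the same effect.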