Two bugs in 2.3.10-rc2 that cause segfaults
Hi Ken After some testing on our test server, we rolled out 2.3.10-rc2 onto one of our production IMAP stores and almost immediately starting noticing segfaults for some users. Bron and I spent quite a bit of time this afternoon debugging, and we're pretty sure we found both the causes. 1. In index_parse_sequence(), your realloc() doesn't multiply by sizeof(struct seq_range), so if you have a sequence list with more than about 8 items, it will start corrupting memory 2. The signed -> unsigned changes in the prot stream stuff weren't as innocent as they looked and created a subtle and nasty bug. You might want to audit the other code where you changed signed -> unsigned as well... Here's an example of the problem if you use IDLE and close the connection on it. In that case, what happens is: In cmd_idle(), we wait for data from the client by calling: c = getword(imapd_in, &arg); getword() calls prot_getc(), which looks like this: #define prot_getc(s) ((s)->cnt-- > 0 ? (int)*(s)->ptr++ : prot_fill(s)) Say there's no data in the prot buffer, then cnt == 0. However when we call the above macro, it does cnt--, which makes cnt wrap to 4294967295, and then calls prot_fill. prot_fill() calls read(), and if read() returns <= 0 (eg other end closed the connection) it does. if (n <= 0) { if (n) s->error = xstrdup(strerror(errno)); else s->eof = 1; return EOF; } So it sets the eof flag and returns EOF. Back in cmd_idle, we get the EOF, and exit out of cmd_idle and back into the main loop, which does. /* Parse tag */ c = getword(imapd_in, &tag); Now getword calls the prot_getc() macro again... but uh, oh. cnt is definitely > 0 now (it's 4294967295) and so we reading through unitialised memory trying to interpret it as IMAP commands until we hit a segfault somewhere. Our fix is to change prot_getc() to only decrement cnt if it's non-zero. Because it's an expression macro, it uses the C comma operator which I've found often even experienced C people don't know about (http://www.eskimo.com/~scs/cclass/int/sx4db.html). #define prot_getc(s) ((s)->cnt > 0 ? (--(s)->cnt, (int)*(s)->ptr++) : prot_fill(s)) Both patches are here: http://cyrus.brong.fastmail.fm/patches/cyrus-fix-segfaults-2.3.10.diff Rob
Re: delayed delete bug?
Bron Gondwana wrote: On Mon, Oct 15, 2007 at 03:25:28PM +0200, Rudy Gevaert wrote: Hi, I'm busy looking at delayed delete. I'm using unix hierarchy seperator. I deleted a mailbox and see it like this: kavula.ugent.be> lm DELETED/user/rudy.gevaert/Foo/[EMAIL PROTECTED] (\HasNoChildren) user/rudy.gevaert/[EMAIL PROTECTED] (\HasNoChildren) user/rudy.gevaert/[EMAIL PROTECTED] (\HasNoChildren) user/rudy.gevaert/[EMAIL PROTECTED] (\HasNoChildren) user/[EMAIL PROTECTED] (\HasChildren) However the file system location is: /var/cyrus/imap/domain/u/ugent.be/u/DELETED/user/rudy^gevaert/Foo/471364C1/ Yeah, that's a "feature" - you see, hashing is done by whatever's after the first dot. In theory you don't have to care. We wrote our userhash patch so the locations would be: /var/cyrus/imap/domain/u/ugent.be/user/rudy^gevaert/DELETED.user.rudy^gevaert.Foo.471364C1/ /var/cyrus/imap/domain/u/ugent.be/user/rudy^gevaert/user.rudy^gevaert /var/cyrus/imap/domain/u/ugent.be/user/rudy^gevaert/user.rudy^gevaert.Inbox2 /var/cyrus/imap/domain/u/ugent.be/user/rudy^gevaert/user.rudy^gevaert.Trash /var/cyrus/imap/domain/u/ugent.be/user/rudy^gevaert/user.rudy^gevaert.Trash2 Which makes much more sense to me and has the nice property that a directory is either a "trunk" or a "leaf" but not both - it contains either purely sub directories or purely mailbox contents. I would have thought it would be under domain/u/ugent.be/DELETED/user/... Is something going wrong here? No. I've run cyr_expire and it cleans it up nicely! Also, is it correct to undelete a deleted folder I need to rename it? (This works! And gets replicated). Yeah, it all works fine. It's just an odd location. I got over that pretty quickly given that my Perl libraries know where it is and Cyrus knows where it is so I just let them do the legwork. So, is it your opinion that this doesn't need to be fixed, or shouldn't be fixed? -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University
Re: Two bugs in 2.3.10-rc2 that cause segfaults
Applied to CVS. Thanks for hunting these down. Rob Mueller wrote: Hi Ken After some testing on our test server, we rolled out 2.3.10-rc2 onto one of our production IMAP stores and almost immediately starting noticing segfaults for some users. Bron and I spent quite a bit of time this afternoon debugging, and we're pretty sure we found both the causes. 1. In index_parse_sequence(), your realloc() doesn't multiply by sizeof(struct seq_range), so if you have a sequence list with more than about 8 items, it will start corrupting memory 2. The signed -> unsigned changes in the prot stream stuff weren't as innocent as they looked and created a subtle and nasty bug. You might want to audit the other code where you changed signed -> unsigned as well... Here's an example of the problem if you use IDLE and close the connection on it. In that case, what happens is: In cmd_idle(), we wait for data from the client by calling: c = getword(imapd_in, &arg); getword() calls prot_getc(), which looks like this: #define prot_getc(s) ((s)->cnt-- > 0 ? (int)*(s)->ptr++ : prot_fill(s)) Say there's no data in the prot buffer, then cnt == 0. However when we call the above macro, it does cnt--, which makes cnt wrap to 4294967295, and then calls prot_fill. prot_fill() calls read(), and if read() returns <= 0 (eg other end closed the connection) it does. if (n <= 0) { if (n) s->error = xstrdup(strerror(errno)); else s->eof = 1; return EOF; } So it sets the eof flag and returns EOF. Back in cmd_idle, we get the EOF, and exit out of cmd_idle and back into the main loop, which does. /* Parse tag */ c = getword(imapd_in, &tag); Now getword calls the prot_getc() macro again... but uh, oh. cnt is definitely > 0 now (it's 4294967295) and so we reading through unitialised memory trying to interpret it as IMAP commands until we hit a segfault somewhere. Our fix is to change prot_getc() to only decrement cnt if it's non-zero. Because it's an expression macro, it uses the C comma operator which I've found often even experienced C people don't know about (http://www.eskimo.com/~scs/cclass/int/sx4db.html). #define prot_getc(s) ((s)->cnt > 0 ? (--(s)->cnt, (int)*(s)->ptr++) : prot_fill(s)) Both patches are here: http://cyrus.brong.fastmail.fm/patches/cyrus-fix-segfaults-2.3.10.diff Rob -- Kenneth Murchison Systems Programmer Project Cyrus Developer/Maintainer Carnegie Mellon University
Re: delayed delete bug?
Ken Murchison wrote: So, is it your opinion that this doesn't need to be fixed, or shouldn't be fixed? Te be honest, I would like to see it fixed. Thanks in advance, Rudy
Re: delayed delete bug?
On Tue, 16 Oct 2007 18:55:55 +0200, "Rudy Gevaert" <[EMAIL PROTECTED]> said: > Ken Murchison wrote: > > > So, is it your opinion that this doesn't need to be fixed, or shouldn't > > be fixed? > > > > Te be honest, I would like to see it fixed. > > Thanks in advance, > > Rudy Bringing back some context: >> I'm busy looking at delayed delete. I'm using unix hierarchy seperator. I >> deleted a mailbox and see it like this: >> >> kavula.ugent.be> lm >> DELETED/user/rudy.gevaert/Foo/[EMAIL PROTECTED] (\HasNoChildren) >> user/rudy.gevaert/[EMAIL PROTECTED] (\HasNoChildren) >> user/rudy.gevaert/[EMAIL PROTECTED] (\HasNoChildren) >> user/rudy.gevaert/[EMAIL PROTECTED] (\HasNoChildren) >> user/[EMAIL PROTECTED] (\HasChildren) >> >> However the file system location is: >> /var/cyrus/imap/domain/u/ugent.be/u/DELETED/user/rudy^gevaert/Foo/471364C1/ >> >> I would have thought it would be under domain/u/ugent.be/DELETED/user/... Just to clarify exactly what you want, do you: a) think that the DELETED folders should not be hashed? (your example) b) think they should be hashed based on the username rather than into 'u'? c) think something entirely different should be done with them? My (c) is as I posted in my followup: > We wrote our userhash patch so the locations would be: > > /var/cyrus/imap/domain/u/ugent.be/user/rudy^gevaert/DELETED.user.rudy^gevaert.Foo.471364C1/ > /var/cyrus/imap/domain/u/ugent.be/user/rudy^gevaert/user.rudy^gevaert > /var/cyrus/imap/domain/u/ugent.be/user/rudy^gevaert/user.rudy^gevaert.Inbox2 > /var/cyrus/imap/domain/u/ugent.be/user/rudy^gevaert/user.rudy^gevaert.Trash So that all the folders related to the same user are in the same directory and have dots in their names as separators rather than being subdirectories of each other. This is obviously only one of many options, and has the downside that it's a completely new filesystem layout that requires a rehash run for every folder rather than just affecting a few deleted folders. Bron. -- Bron Gondwana [EMAIL PROTECTED]
Re: Two bugs in 2.3.10-rc2 that cause segfaults
On Tue, 16 Oct 2007 12:22:28 -0400, "Ken Murchison" <[EMAIL PROTECTED]> said: > Applied to CVS. Thanks for hunting these down. I tell you what - we were about 1/4 grumpy at you and 3/4 grumpy at the C language for being so sharp and unfriendly. Watching our least loaded "real store" segfaulting left, right and centre after we rolled this out to real users, and knowing we either had to downgrade the indexes and roll back, or figure this thing out. Nothing like pressure to make you read those gdb traces. (downgrading the indexes isn't actually insanely painful, it would have been 5 minutes work to rewrite my little Perl utility to switch back to version 9 format complete with pattern matching the GUID and deciding if it needed zeroing or not, and another 10 minutes or so to run it over all the mailboxes. It's like admitting defeat though!) Bron. -- Bron Gondwana [EMAIL PROTECTED]