On Fri, Aug 31, 2007 at 05:51:35PM +0100, David Carter wrote: > On Fri, 31 Aug 2007, Ken Murchison wrote: > >> I think your patch makes sense, but I'm not sure when "The cyrus index >> file format has this clever "ignore" junk at the end until the exists >> count changes trick" means. I know we leave junk in the cache file as a >> result of delayed expunge which gets cleaned up later, but I'm pretty sure >> the the index file is always tightly packed. The is extra space in the >> index header, but there shouldn't be any between the index records. > > I think that the danger is that if sync_server gets shut down uncleanly > (which I know was happening to Fastmail a lot at one point) then then you > can end up with a bogus entry at the end of a cyrus.index file which is not > overwritten by the next sync_append_commit() on that mailbox.
"Uncleanly" means any time it gets a sigterm in the middle of a delivery at the wrong point. On a busy server with a lot of IO this tends to be about every shutdown for a couple of mailboxes (as a datapoint, we failed over 5 slots on one server for about 2 hours while we upgraded the kernel and did some tests - we had about 15 of these according to our backup system (which does MD5 UUID tests on every message and tells us if any files don't match any index record) > The race condition is that the exists count in the header can only be > updated after the index record has been written. > > An explicit seek using mailbox->exists is definitely more robust, although > it probably doesn't help if power fails halfway through the fsync() on the > cyrus.index file after both updates have been made (data=journal maybe?) Consider the following mailbox - exists: 3, uids 12, 13, 14 Master Replica E: 3 E: 3 12 12 13 13 14 14 (15) The (15) there is a phantom record as we shut them down for a "controlled" failover. Replica Master E: 3 E: 3 12 12 13 13 14 14 (15) There's another delivery: Replica Master E: 4 E: 4 12 12 13 13 14 14 (15) 15 15 And another couple: Replica Master E: 6 E: 6 12 12 13 13 14 14 (15) 15 15 16 16 17 17 We fail back: Master Replica E: 6 E: 6 12 12 13 13 14 14 (15) 15 15 16 16 17 17 And have a delivery (which uses index offset): Master Replica E: 7 E: 7 12 12 13 13 14 14 (15) 15 15 16 16 17 18 18 17. < NO INDEX ENTRY The user deletes message 15: Master Replica E: 6 E: 6 12 12 13 13 14 14 15 16 16 17 18 18 EXPUNGE: (15) 15 17. < NO INDEX ENTRY cyr_expire runs: Master Replica E: 6 E: 6 12 12 13 13 14 14 15 < NO FILE 16 16 17 18 18 As you can see, this isn't trivial at a site of our size, it's a pretty big mess for the affected mailbox - especially if it's uncaught until cyr_expire runs and nukes the underlying rfc822 file due to duplicate entries in the index and expunge files. I think reconstruct can fix most of it except the "uid in both index and expunge" - so I fix that by removing the record from the expunge file first (using my Cyrus::IndexFile perl module) Presumably, if there was a multi-append you could even get out of order UIDs in a pathological enough series of failovers. Bron.