Re: Making Replication Robust
On Tue, 9 Oct 2007, David Carter wrote:

> I've never faced a split brain situation which involved more than two or three messages (the outstanding log on an old master system).

I suppose that it was predictable that a week after writing this I faced my first serious split brain (3000 messages lost after a hardware fault).

My solution was to write a little script which, given a list of mailboxes (the sync_log file on the old master), scanned over the cyrus.index files looking for messages with an internaldate greater than a given cutoff. These messages were then transferred across to the new master to be reinjected. Replication from the new master to the old master then resolves the split brain situation (master wins in case of ambiguity), which is the way it was designed to work. From memory Fastmail did something similar when they faced a split brain situation.

The procedure works well, but I think that it would be useful to have some tools in the Cyrus distribution rather than having to knock up one-off tools. I'm happy to work on this if we can beat out some requirements.

I'm not keen on trying to fix split brain situations within the replication protocol itself: at the moment sync_client doesn't try to mess with the data on the master, which is a property I like. There are also certain situations that replication just can't fix. Envision a hypothetical replication engine which can cope with GUID mismatches, adding messages to both master and replica. Then imagine:

1) Replication dies because of a hardware or software fault.
2) Master continues to limp along for a bit before dying. Split brain.
3) Message delivered to a non-user mailbox (Sieve or + addressing).
4) Master dies entirely: failover to replica with missing messages.
5) User logs in and deletes the mailbox in question on the new master, unaware that they are actually missing a message from that mailbox.
6) Sysadmin starts replication from new master to old master.
   They hope that this will automatically resolve all conflicts without losing anything, because we promise that replication is magic.
7) Replication engine deletes the entire mailbox (including the message that we want to recover), as it doesn't exist on the new master.

/* == */

Just for everyone's amusement, what happened to us on Tuesday evening. This isn't good:

Oct 16 20:56:21 cyrus-24 kernel: Uhhuh. NMI received for unknown reason 21 on CPU 0.
Oct 16 20:56:21 cyrus-24 kernel: Dazed and confused, but trying to continue
Oct 16 20:56:21 cyrus-24 kernel: Do you have a strange power saving mode enabled?

But it is nowhere near as bad as:

Oct 16 20:56:31 cyrus-24 sync_client[11985]: Unknown system flag: \snswered
                                                                   ^ Oops

You know that a machine is unhappy when sync_client -u on a given account randomly:

1) Works without problems
2) Segfaults
3) Attempts to reserve every message on the account on the server, presumably as a prelude to a mass UPLOAD.

I infer that the machine has a motherboard fault which caused kernel memory corruption in some small lump of buffer cache. I am amazed that the filesystems passed fsck when I attached the disks to a new machine. The original machine refused to reboot cleanly because umount segfaulted. It also failed two DIMMs on each POST until the machine ran out of memory.

-- 
David Carter                          Email: [EMAIL PROTECTED]
University Computing Service,         Phone: (01223) 334502
New Museums Site, Pembroke Street,    Fax:   (01223) 334679
Cambridge UK. CB2 3QH.
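The recovery script described above isn't in the Cyrus distribution. Its core selection step, pick every message whose internaldate falls after the split brain cutoff, might be sketched like this in C. Note the index_record struct below is a simplified stand-in for illustration only, not the real cyrus.index on-disk format:

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a parsed cyrus.index record; the real
 * on-disk format is binary and version-dependent. */
struct index_record {
    unsigned long uid;
    time_t internaldate;
};

/* Copy into 'out' every record whose internaldate is strictly after
 * the cutoff, i.e. delivered during the split brain window, so that
 * those messages can be transferred and reinjected on the new master.
 * Returns the number of records selected. */
size_t select_after_cutoff(const struct index_record *recs, size_t n,
                           time_t cutoff, struct index_record *out)
{
    size_t found = 0;
    for (size_t i = 0; i < n; i++) {
        if (recs[i].internaldate > cutoff)
            out[found++] = recs[i];
    }
    return found;
}
```

A real tool would iterate this over every mailbox named in the old master's sync_log and feed the selected messages to a reinjection step.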
Re: Making Replication Robust
On Sat, 13 Oct 2007, Bron Gondwana wrote:

> Apart from a couple of short-lived command line utilities it looks like the only use of signal() is a bunch of 'signal(SIGPIPE, SIG_IGN);' scattered through just about everything. Most of the interesting signal handling is done with sigaction already.

And indeed it uses an explicit SA_RESTART flag, so it looks like I've been worrying about nothing all along. By all means signal away!
Re: Making Replication Robust
> This would seem to be a significant advantage of running sync_client outside master. When I shut down master, sync_client continues to process the outstanding log. I can then use sync_shutdown_file when it has finished and is idle.

We do something similar. But it means you have to develop a bunch of your own infrastructure to make cyrus replication robust. It's not currently a "start it and it just works until you shut it down" solution, which means either people have to replicate the same extra infrastructure work everywhere separately, or people are going to get burnt not realising that what they're doing isn't safe. That's why I'd really like this to be in cyrus itself. I think we should be able to say in the documentation something like: "Shutting down a cyrus master with a SIGQUIT ensures that all actions have been replicated to the replica side." It makes writing init scripts and the like a lot easier.

> There seems to be a split of opinion between BSD and SVR4: BSD tries to retry while SVR4 throws EINTR. Linux of course can work either way:
> http://www.gnu.org/software/libc/manual/html_node/Interrupted-Primitives.html

Aren't a lot of writes already wrapped up in some retry_write() function? I admit I haven't looked closely. Anyway, is this really a problem? Basically shouldn't you be able to kill cyrus at any point, and files are left in a consistent restartable state? If so, if something returns EINTR, won't it just move on and eventually exit? Or is the problem that you have something like:

write to file 1
write to file 2

And if the first returns EINTR but is ignored, and then it writes the complete data to the second, things are in an inconsistent state?

Rob
Re: Making Replication Robust
On Wed, 10 Oct 2007, Rob Mueller wrote:

> I think the problem at the moment is that the process you really want is:
>
> 1. Stop new imap/pop/lmtp/sieve/etc connections
> 2. Finish and close existing connections cleanly but as quickly as possible
> 3. Finish running any sync log files
> 4. Fully shutdown
>
> There's currently no clean way to do this. Basically you have to SIGTERM master which hard kills it and all children, then manually run sync_client -f on any remaining log files.

This would seem to be a significant advantage of running sync_client outside master. When I shut down master, sync_client continues to process the outstanding log. I can then use sync_shutdown_file when it has finished and is idle. sync_client could catch SIGQUIT to initiate some form of clean shutdown.

I'm still a little bothered about signal handling and EINTR. I did some experiments after our last chat about signals. In practice disk IO system calls seem to be reasonably safe against EINTR on both Linux and Solaris, but a trip to Google suggests that there are few guarantees:

http://archives.postgresql.org/pgsql-hackers/2005-12/msg01259.php

There seems to be a split of opinion between BSD and SVR4: BSD tries to retry while SVR4 throws EINTR. Linux of course can work either way:

http://www.gnu.org/software/libc/manual/html_node/Interrupted-Primitives.html
Re: Making Replication Robust
On Fri, Oct 12, 2007 at 10:29:53AM -0400, Carson Gaspar wrote:

> David Carter wrote:
>> I'm still a little bothered about signal handling and EINTR. I did some experiments after our last chat about signals. In practice disk IO system calls seem to be reasonably safe against EINTR on both Linux and Solaris, but a trip to Google suggests that there are few guarantees:
>> http://archives.postgresql.org/pgsql-hackers/2005-12/msg01259.php
>> There seems to be a split of opinion between BSD and SVR4: BSD tries to retry while SVR4 throws EINTR. Linux of course can work either way:
>> http://www.gnu.org/software/libc/manual/html_node/Interrupted-Primitives.html
>
> I suggest reading the POSIX specs. Restart vs. EINTR is specified explicitly on any POSIX platform (including all the *BSDs, Solaris, Linux, ...). If cyrus is still using signal() and friends anywhere, they desperately need to be replaced with sigaction().

Apart from a couple of short-lived command line utilities it looks like the only use of signal() is a bunch of 'signal(SIGPIPE, SIG_IGN);' scattered through just about everything. Most of the interesting signal handling is done with sigaction already.

Bron.
Re: Making Replication Robust
> Or is the problem that you have something like:
>
> write to file 1
> write to file 2
>
> And if the first returns EINTR but is ignored, and then it writes the complete data to the second, things are in an inconsistent state?

This is my concern. Doing an "ack 'write\('" reveals a scary mix of write, retry_write and fwrite calls. My initial reaction was that binary files seem to use open/retry_write, and text files use fopen/fwrite, but that doesn't quite seem to be the case...

mailbox.c
1242: r = write(newheader_fd, MAILBOX_HEADER_MAGIC,
1359: n = retry_write(mailbox->index_fd, buf, header_size);
1428: n = retry_write(mailbox->index_fd, buf, INDEX_RECORD_SIZE);
1477: n = retry_write(mailbox->index_fd, buf, len);
1642: fwrite(buf, 1, INDEX_HEADER_SIZE, newindex);
1659: fwrite(bufp, INDEX_RECORD_SIZE, 1, newindex);
1710: fwrite(buf, INDEX_RECORD_SIZE, 1, newindex);
1721: fwrite(buf+OFFSET_DELETED,
1952: n = retry_write(expunge_fd, buf, mailbox->record_size);
1979: if (newindex) fwrite(buf, 1, mailbox->record_size, newindex);
1999: /* fwrite will automatically call write() in a sane way */
2000: fwrite(cacheitembegin, 1, cache_record_size, newcache);
2004: fwrite(buf, 1, mailbox->record_size, newindex);
2058: fwrite(buf, 1, mailbox->start_offset, newindex);
2215: fwrite(buf, 1, sizeof(bit32), newcache);
2219: fwrite(buf, 1, mailbox->start_offset, newindex);
2263: n = retry_write(expunge_fd, buf, mailbox->start_offset);
2342: r = quota_write(mailbox->quota, tid);
2363: fwrite(buf, 1, mailbox->start_offset, newexpungeindex);
2424: n = retry_write(expunge_fd, buf, mailbox->start_offset);
2719: n = retry_write(mailbox.cache_fd, (char *)mailbox.generation_no, 4);
2823: r = quota_write(mailbox->quota, tid);
3056: r = quota_write((newmailbox->quota), tid);
3309: r = quota_write(newmailbox.quota, tid);
3319: r2 = quota_write(newmailbox.quota, tid);
3398: n = retry_write(destfd, src_base, src_size);

It seems to mix up fds and FILE * structs all over the place. *sigh*

Does fwrite() retry a write on EINTR?
It looks like that's the whole point of retry_write() anyway. If fwrite() does retry, then about the only other work would be changing any naked write() calls to retry_write(), of which there actually don't seem to be many. Thoughts?

Rob
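On Rob's question: as far as I know POSIX does not require stdio to retry, so fwrite() may return a short item count with errno set to EINTR and needs the same care as a naked write(). The wrapper idea that retry_write() implements looks roughly like the following sketch (illustrative only, not Cyrus's exact code; the name retry_write_sketch is made up to avoid clashing with the real function):

```c
#include <errno.h>
#include <unistd.h>

/* Write all n bytes to fd, retrying on EINTR and on short writes.
 * Returns n on success, -1 on error. */
ssize_t retry_write_sketch(int fd, const char *buf, size_t n)
{
    size_t done = 0;
    while (done < n) {
        ssize_t r = write(fd, buf + done, n - done);
        if (r < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before transferring data: retry */
            return -1;          /* real I/O error */
        }
        done += (size_t)r;      /* short write: loop for the remainder */
    }
    return (ssize_t)n;
}
```

This handles both failure modes at once: the EINTR case Rob asked about, and the short-write case that a simple "if (write(...) != n)" check would misreport as an error.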
Re: Making Replication Robust
On Mon, 8 Oct 2007, Rudy Gevaert wrote:

> Note, we are running 2.3.7; I'm going to upgrade when 2.3.10 is out. We have replication in place, but daren't use it. If I have a method to check if the replica is in sync then I'll dare to do a failover.

I do this using -v -v to sync_client, which gives a running commentary about just what is going on:

cyrus-28[cyrus:~]$ replicate -s cyrus-27 -v -v -u dpc99
USER dpc99
USER_ALL dpc99
SELECT user.dpc99
UPLOAD [1 msgs]
ENDUSER

A very high tech "grep -v USER /tmp/out" picks out actual updates. This is one of the things which got dropped when replication was merged into 2.3 (my original implementation just didn't fit cleanly). I would like to put something similar into 2.3, as this is a quick and easy way to check for consistency while fixing up problems. A dry run mode which suppresses updates would also be useful, although probably more work. The kind of random sampling which Fastmail do probably wouldn't hurt as an extra sanity check.
Re: Making Replication Robust
On Mon, 8 Oct 2007, Bron Gondwana wrote:

> We already run a sync_server on our masters as well because we use it for user moves: Generally takes about 15 seconds for the critical path bit, and the initial sync doesn't matter how long it takes.

As do we. In fact when I first showed the replication system to Rob Siemborski (a few years back now), he was thinking about using replication to replace XFER in a murder environment. I have a special -y flag to sync_client which disables fsync() on the replica for fast seeding of replicas. We also use replication to dump data from the live systems to a tape spooling array each night. Replication is transaction-safe rsync for Cyrus, tailored around the cyrus.index files.

More later.
Re: Making Replication Robust
On Thu, 4 Oct 2007, Bron Gondwana wrote:

> a) MUST never lose a message that's been accepted for delivery except in the case of total drive failure.
>
> b) MUST have a standard way to integrity check and repair a replica-pair after a system crash.

A replica system is automatically repaired to match its master, but this doesn't help with the split brain scenarios that you are worried about. I've never faced a split brain situation which involved more than two or three messages (the outstanding log on an old master system). I suspect that this is simply because I've never had to run an unreliable replication engine which bails out on my production systems.

> c) MUST have a clean process to soft-failover to the replica machine, making sure that all replication events from the ex-master have been synchronised.

Something more than sync_shutdown_file plus automatic retries on recent work files?

> d) MUST have replication start/restart automatically when the replica is available rather than requiring it be online at master start time.

Work in progress from Ken.

> e) SHOULD be able to copy back messages which only exist on the replica due to a hard-failover, handling UIDs gracefully (more on this later),

This is the hard one. I think that assigning a new UIDvalidity and new UIDs for all the messages would be best, as messages can then be sorted in the replacement mailbox based on their arrival time. Actually this would look remarkably like the new sync_combine_commit() on the replica side. What I don't know is how we then synchronise back to the master. Up to now the replication engine has been very careful about _not_ making changes on the master, so that it only has the potential to mess up the spare system.
>    alternatively at least MUST (to satisfy point 'a') notify the administrator that the message has different GUIDs on the two copies and something will need to be done about it (to satisfy point 'd' this must be done without bailing out replication for the remaining messages in the folder)

At the moment we replace messages (on the "master knows best" principle). It would be easy enough to leave messages in place and generate warnings instead, although this would generate a lot of warnings: one for every bad message every time that a given mailbox is updated.

> f) SHOULD keep replicating in the face of an error which affects a single mailbox, keeping track of that mailbox so that a sysadmin can fix the issue and then replicate that mailbox by hand.

You could try disabling the MAILBOX -> USER promotion to see what happens: the 3 x MAILBOXES retry will fix most transient problems caused by mailboxes moving around, leaving just the permanent errors. The MAILBOX -> USER promotion was originally there on the principle that a mailbox disappearing under our feet was likely to appear somewhere else in the same account (without shared mailboxes to worry about). My nightmare scenario is a replication engine which carries on running in the face of mboxlist corruption on the master: you could lose a lot of mailboxes on the replica that way.

> g) MAY have a method to replicate to two different replicas concurrently (replay the same sync_log messages twice) allowing one replica to be taken out of service and a new one created while having no gaps in which there is no second copy alive (we use rsync, rsync again, stop replication, rsync a third time, start replication to the new site - but it's messy and gappy)

It would be easy enough to generate multiple replication log files. MySQL keeps a single transaction log for multiple replicas, but that file contains quite a lot of information about each transaction.
In contrast the Cyrus sync log is just a list of objects we need to pay attention to: the files have much less state, particularly without duplicates.
Re: Making Replication Robust
>> c) MUST have a clean process to soft-failover to the replica machine, making sure that all replication events from the ex-master have been synchronised.
>
> Something more than sync_shutdown_file plus automatic retries on recent work files?

I think the problem at the moment is that the process you really want is:

1. Stop new imap/pop/lmtp/sieve/etc connections
2. Finish and close existing connections cleanly but as quickly as possible
3. Finish running any sync log files
4. Fully shutdown

There's currently no clean way to do this. Basically you have to SIGTERM master, which hard kills it and all children, then manually run sync_client -f on any remaining log files.

We've got a patch which makes master handle SIGQUIT much more nicely. Basically it appears there was some existing infrastructure that was designed to handle a cleaner shutdown; look at the code in all the places that call signals_poll(). It looks like the idea was that you could send child processes SIGQUIT and they would continue their current action until their main loop, check there whether they'd been sent a QUIT, and then exit cleanly. Unfortunately if you sent SIGQUIT to master, it would just SIGTERM all children, not SIGQUIT them. This patch attempts to fix that, so that sending SIGQUIT to master sends SIGQUIT to all children, and then waits for them all to exit cleanly:

http://cyrus.brong.fastmail.fm/#cyrus-clean-shutdown-2.3.8.diff

This solves steps 1 and 2 above, though it doesn't deal with the case of a crazy child that doesn't respond to SIGQUIT. Personally our init script sends SIGQUIT, and if the master process is still there after 10 seconds, it sends SIGTERM to force an exit. In general we find that everything exits after a couple of seconds of SIGQUIT.

To do step 3, I think the best might be to have a new cyrus.conf section, a SHUTDOWN section, which gives some commands to run on shutdown.
Basically after all children have accepted a SIGQUIT and exited, we then run the SHUTDOWN section, which would run a final sync_client -r on the sync dir to finish up any remaining log files. With all of that in place, you could send a SIGQUIT to a cyrus master process on a master server, and it would cleanly shut down all children and ensure that all replication events have been correctly played to the replica. You could then do the same to the replica, then reverse their roles, and bring them both back up, and you've got a safe soft failover.

> At the moment we replace messages (on the "master knows best" principle). It would be easy enough to leave messages in place and generate warnings instead, although this would generate a lot of warnings, one for every bad message every time that a given mailbox is updated.

That's what this patch does:

http://cyrus.brong.fastmail.fm/#cyrus-warnmismatcheduuids-2.3.8.diff

In theory with clean soft failovers, you should NEVER have UIDs with mismatched UUIDs. After a hard failover, you obviously might, but in those cases just replacing the message means we're almost certainly overwriting a delivered message and losing it, which is bad. At least making it an option to overwrite or log is, I think, a sane idea.

> My nightmare scenario is a replication engine which carries on running in the face of mboxlist corruption on the master: you could lose a lot of mailboxes on the replica that way.

That would be bad, though hard to detect and stop. I guess that's what backups are for...

> It would be easy enough to generate multiple replication log files. MySQL keeps a single transaction log for multiple replicas, but that file contains quite a lot of information about each transaction. In contrast the Cyrus sync log is just a list of objects we need to pay attention to: the files have much less state, particularly without duplicates.
The other option is, rather than using the "rotate log, play it, delete it" system, you generate one log file but keep track of offsets within the file to tell you where each replica is up to. That's what MySQL does: you can have multiple replicas because each replica is playing off the same log files, they're just up to different offsets at any point in time.

Rob
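The offset-tracking scheme Rob describes can be sketched as below; the function name is made up for illustration, and a real implementation would persist each replica's offset across restarts and cope with log rotation. Each replica replays complete lines starting from its own saved offset, so one log file can serve several replicas independently:

```c
#include <stdio.h>
#include <string.h>

/* Replay complete lines from 'log' starting at *offset, advancing
 * *offset past each fully consumed line. A trailing line without a
 * newline is left for a later pass (it may still be being written).
 * Returns the number of lines consumed, or -1 on seek error. */
long replay_from_offset(FILE *log, long *offset)
{
    char line[1024];
    long count = 0;

    if (fseek(log, *offset, SEEK_SET) != 0)
        return -1;
    while (fgets(line, sizeof(line), log) != NULL) {
        if (line[strlen(line) - 1] != '\n')
            break;              /* incomplete trailing line: stop here */
        /* a real implementation would act on 'line' here, e.g. queue
         * the named user or mailbox for syncing to this replica */
        *offset = ftell(log);   /* remember how far this replica got */
        count++;
    }
    return count;
}
```

Because each replica only reads and never truncates the shared log, adding a second replica is just a matter of starting another replay loop with its own offset variable.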
Re: Making Replication Robust
Hello,

I agree with Bron. However I do think some parts are more important than others. I'll try to explain my point of view. Note, we are running 2.3.7; I'm going to upgrade when 2.3.10 is out. We have replication in place, but daren't use it. If I have a method to check if the replica is in sync then I'll dare to do a failover. For me points a, e and f are most important, but the others are also important.

Bron Gondwana wrote:

> So I'd like to start a dialogue on the topic of making Cyrus replication robust across failures with the following goals:
>
> a) MUST never lose a message that's been accepted for delivery except in the case of total drive failure.
>
> b) MUST have a standard way to integrity check and repair a replica-pair after a system crash.

Do you mean that if the replica crashes it should be able to catch up again?

> c) MUST have a clean process to soft-failover to the replica machine, making sure that all replication events from the ex-master have been synchronised.

Indeed this is nice, but it would still need a lot of site specific tools. E.g. I know (I think I do) that Fastmail runs master/replica in the same subnet. We don't. So soft-failover isn't that easy. For us it's more important that all mail that isn't delivered gets queued at the MTA (it's not on the same machine as cyrus). All delivered mails are replicated. We then still need to update the DNS or /etc/hosts file.

> d) MUST have replication start/restart automatically when the replica is available rather than requiring it be online at master start time.

This would be great if there are some tools available for doing automatic failover, recovery, ...
> e) SHOULD be able to copy back messages which only exist on the replica due to a hard-failover, handling UIDs gracefully (more on this later), alternatively at least MUST (to satisfy point 'a') notify the administrator that the message has different GUIDs on the two copies and something will need to be done about it (to satisfy point 'd' this must be done without bailing out replication for the remaining messages in the folder)
>
> f) SHOULD keep replicating in the face of an error which affects a single mailbox, keeping track of that mailbox so that a sysadmin can fix the issue and then replicate that mailbox by hand.
>
> g) MAY have a method to replicate to two different replicas concurrently (replay the same sync_log messages twice) allowing one replica to be taken out of service and a new one created while having no gaps in which there is no second copy alive (we use rsync, rsync again, stop replication, rsync a third time, start replication to the new site - but it's messy and gappy)

Is again a good idea, and would be very usable. But this depends on what you will be doing with the second replica. If it would be possible to take out the second replica, make it consistent and back it up, and then bring it up to date again, it would be a neat way to have a consistent backup.

Kind regards,

Rudy

-- 
Rudy Gevaert                        [EMAIL PROTECTED]    tel: +32 9 264 4734
Directie ICT, afd. Infrastructuur   ICT Department, Infrastructure office
Groep Systemen                      Systems group
Universiteit Gent                   Ghent University
Krijgslaan 281, gebouw S9, 9000 Gent, Belgie          www.UGent.be
Re: Making Replication Robust
On Mon, Oct 08, 2007 at 10:03:31AM +0200, Rudy Gevaert wrote:

> For me points a, e and f are most important, but the others are also important.
>
> Bron Gondwana wrote:
>> So I'd like to start a dialogue on the topic of making Cyrus replication robust across failures with the following goals:
>>
>> a) MUST never lose a message that's been accepted for delivery except in the case of total drive failure.
>>
>> b) MUST have a standard way to integrity check and repair a replica-pair after a system crash.
>
> Do you mean that if the replica crashes it should be able to catch up again?

No: when a master fails while replication wasn't 100% up to date and you decided to bring the replica online, then later switched back to the original master, you don't overwrite messages.

>> c) MUST have a clean process to soft-failover to the replica machine, making sure that all replication events from the ex-master have been synchronised.
>
> Indeed this is nice, but it would still need a lot of site specific tools. E.g. I know (I think I do) that Fastmail runs master/replica in the same subnet. We don't. So soft-failover isn't that easy.

True - it's easy for us because we have different configs that bind to the same IP address and use arp broadcasts, so nothing else needs to change. The bit I care more about is that you can shut a master down cleanly and guarantee that all replication events finish sending as part of the shutdown process. We already do this with an external init script (written in Perl) but would prefer that it's a general option available to everyone and supported upstream.

> For us it's more important that all mail that isn't delivered gets queued at the MTA (it's not on the same machine as cyrus). All delivered mails are replicated. We then still need to update the DNS or /etc/hosts file.
We have that too of course; it's more the ones that are delivered but not yet replicated when we call shutdown that matter (see also APPEND).

>> d) MUST have replication start/restart automatically when the replica is available rather than requiring it be online at master start time.
>
> This would be great if there are some tools available for doing automatic failover, recovery, ...

Yeah, we get this with the '-o' option to sync_client, meaning it just doesn't start replicating; but then monitorsync.pl runs every 10 minutes from cron, checks that there are running sync_client processes for each master, and attempts to start them if the replica is marked as up in the database. It also deals with old log files left lying around.

>> e) SHOULD be able to copy back messages which only exist on the replica due to a hard-failover, handling UIDs gracefully (more on this later), alternatively at least MUST (to satisfy point 'a') notify the administrator that the message has different GUIDs on the two copies and something will need to be done about it (to satisfy point 'd' this must be done without bailing out replication for the remaining messages in the folder)
>>
>> f) SHOULD keep replicating in the face of an error which affects a single mailbox, keeping track of that mailbox so that a sysadmin can fix the issue and then replicate that mailbox by hand.
>>
>> g) MAY have a method to replicate to two different replicas concurrently (replay the same sync_log messages twice) allowing one replica to be taken out of service and a new one created while having no gaps in which there is no second copy alive (we use rsync, rsync again, stop replication, rsync a third time, start replication to the new site - but it's messy and gappy)
>
> Is again a good idea, and would be very usable. But this depends on what you will be doing with the second replica.
> If it would be possible to take out the second replica, make it consistent and back it up, and then bring it up to date again, it would be a neat way to have a consistent backup.

Yeah, that's a point. That would be very nice :)

We're generally doing it because we want to take a drive unit out of service, or even a whole machine, and we'd rather not have a gap where there's only one live copy of data. I've been thinking evil thoughts about writing a sync_server protocol compatibility library and poking cyrus through it.

We already run a sync_server on our masters as well because we use it for user moves:

*) create custom config file and mailboxes.db snippet
*) sync user to new store using custom config
*) lock the user against lmtp/pop/imap at the proxy level and kill off all current connections (scans $confdir/proc)
*) sync user again
*) run checkreplication.pl in paranoid mode to make sure everything actually matches
*) update database field for store name and broadcast a cache invalidation packet to all the apps that cache user data (again, one subnet makes broadcast cache management reasonable)
*) re-enable delivery and logins.

Generally takes about 15 seconds for the critical path