> c) MUST have a clean process to "soft-failover" to the
>    replica machine, making sure that all replication
>    events from the ex-master have been synchronised.
>
> Something more than sync_shutdown_file plus automatic retries on
> recent work files?

I think the problem at the moment is that the process you really want is:

1. Stop new imap/pop/lmtp/sieve/etc connections
2. Finish and close existing connections cleanly but as quickly as possible
3. Finish running any sync log files
4. Fully shutdown

There's currently no clean way to do this. Basically you have to SIGTERM master, which hard kills it and all its children, then manually run sync_client -f on any remaining log files.

We've got a patch which makes master handle SIGQUIT much more nicely. It appears there was already some infrastructure designed for a cleaner shutdown: look at all the places in the code that call signals_poll(). The idea seems to have been that you could send child processes SIGQUIT and they would finish their current action, notice at the top of their main loop that they'd been sent a QUIT, and exit cleanly. Unfortunately, if you sent SIGQUIT to master, it would just SIGTERM all its children, not SIGQUIT them.
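
To make the pattern concrete, here's a minimal toy sketch of the idea (my code, not Cyrus's; quit_seen and poll_for_quit are made-up stand-ins for the real signals_poll() machinery). The handler only records the signal, and the main loop polls for it between actions:

  #include <signal.h>
  #include <string.h>
  #include <unistd.h>

  static volatile sig_atomic_t quit_seen = 0;

  static void sigquit_handler(int sig)
  {
      quit_seen = sig;      /* just record it; do no real work in a handler */
  }

  /* the moral equivalent of signals_poll(), called at the top of the
   * service's main loop */
  static int poll_for_quit(void)
  {
      return quit_seen;
  }

  int main(void)
  {
      struct sigaction sa;

      memset(&sa, 0, sizeof(sa));
      sa.sa_handler = sigquit_handler;
      sigemptyset(&sa.sa_mask);
      sigaction(SIGQUIT, &sa, NULL);

      for (;;) {
          if (poll_for_quit())
              break;        /* exit between actions, never mid-action */
          /* ... handle one protocol command or connection event ... */
          sleep(1);         /* stand-in for the real work */
      }
      /* flush state, close mailboxes cleanly, then exit */
      return 0;
  }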

This patch attempts to fix that: sending SIGQUIT to master now sends SIGQUIT to all children, and then waits for them all to exit cleanly.
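
The guts of it look something like this (a rough sketch of the idea rather than the patch itself; the child_pid/nchildren table is hypothetical):

  #include <errno.h>
  #include <signal.h>
  #include <sys/types.h>
  #include <sys/wait.h>

  extern pid_t child_pid[];   /* hypothetical table of service children */
  extern int nchildren;

  static void quit_all_children(void)
  {
      pid_t p;
      int i;

      /* forward SIGQUIT, not SIGTERM, so each child finishes its
       * current action before exiting */
      for (i = 0; i < nchildren; i++)
          kill(child_pid[i], SIGQUIT);

      /* reap every child before master itself exits */
      do {
          p = wait(NULL);
      } while (p > 0 || (p < 0 && errno == EINTR));
  }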

http://cyrus.brong.fastmail.fm/#cyrus-clean-shutdown-2.3.8.diff

This solves steps 1 & 2 above, though it doesn't deal with the case of a "crazy child" that doesn't respond to SIGQUIT. Personally, our init script sends SIGQUIT, and if the master process is still there after 10 seconds, it sends SIGTERM to force an exit. In general we find that everything exits within a couple of seconds of the SIGQUIT.

To do step 3, I think the best approach might be a new cyrus.conf section, a SHUTDOWN section, which lists commands to run on shutdown. After all children have accepted the SIGQUIT and exited, master would run the SHUTDOWN section, which for us would run a final sync_client -r on the sync dir to finish up any remaining log files.
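
Something like this, say (purely hypothetical: no SHUTDOWN section exists in 2.3.8, this is just what I imagine it looking like, modelled on the existing START/EVENTS syntax, with a made-up entry name):

  SHUTDOWN {
    # after all children have exited, play any remaining
    # replication log files to the replica
    syncflush    cmd="sync_client -r"
  }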

With all of that in place, you could send SIGQUIT to the cyrus master process on the master server, and it would cleanly shut down all children and ensure that all replication events have been correctly played to the replica. You could then do the same on the replica, reverse their roles, and bring them both back up, and you'd have a safe soft failover.

> At the moment we replace messages (on the "master knows best" principle).
>
> It would be easy enough to leave the message in place and generate
> warnings instead, although this would generate a lot of warnings: one
> for every bad message, every time a given mailbox is updated.

That's what this patch does.

http://cyrus.brong.fastmail.fm/#cyrus-warnmismatcheduuids-2.3.8.diff

In theory, with clean soft failovers you should NEVER have UIDs with mismatched UUIDs. After a hard failover you obviously might, but in those cases just replacing the message means we're almost certainly overwriting a delivered message and losing it, which is bad. Making it an option to either overwrite or log seems like a sane idea at least.

> My nightmare scenario is a replication engine which carries on running
> in the face of mboxlist corruption on the master: you could lose a lot
> of mailboxes on the replica that way.

That would be bad, though hard to detect and stop. I guess that's what backups are for...

> It would be easy enough to generate multiple replication log files.

MySQL keeps a single transaction log for multiple replicas, but that file contains quite a lot of information about each transaction. In contrast, the Cyrus sync log is just a list of objects that need attention: the log files carry much less state, particularly once duplicate entries are removed.
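
For illustration, a sync log is roughly just lines like these (made-up entries in the general shape of sync log lines, not copied from a real log):

  USER brong
  MAILBOX user.brong
  APPEND user.brong

sync_client then works out for each named object what actually needs transferring, which is why the log itself can stay so small.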

The other option is, rather than using the "rotate log, play it, delete it" system, to generate one log file but keep track of "offsets" within it that record how far each replica has got. That's what MySQL does: you can have multiple replicas because each replica is "playing" off the same log files, they're just at different offsets at any point in time.
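
As a toy sketch of the offset idea (nothing like this exists in Cyrus today, and all the names are mine):

  #include <stdio.h>

  /* Play every complete line from this replica's saved offset onwards,
   * then return the new offset to persist for next time (much like the
   * position MySQL records per replica in master.info). */
  long play_from_offset(const char *logfile, long offset)
  {
      FILE *f = fopen(logfile, "r");
      char line[1024];

      if (!f)
          return offset;
      if (fseek(f, offset, SEEK_SET) != 0) {
          fclose(f);
          return offset;
      }

      while (fgets(line, sizeof(line), f)) {
          /* ... replay this event to the replica ... */
          offset = ftell(f);   /* only advance past complete lines */
      }

      fclose(f);
      return offset;
  }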

Rob
