Re: Funding Cyrus High Availability

David Carter Fri, 17 Sep 2004 02:42:40 -0700

On Fri, 17 Sep 2004, Paul Dekkers wrote:

Isn't it possible to have equal roles? If all changes are put in some backlog, and a synchroniser process runs on both machines and pushes the backlog (as soon as there is any) to another machine... then you can have the some process on both (equal) servers... Of course there needs to be some more intelligence, but that's basicly what I would expect.

We have 16 servers: half the accounts on each system are master copies and half are replicas. Each machine has a small database (a CDB lookup file) to tell it whether a given account is master or slave. The replication engine (which runs independently from the normal master spawned jobs) bails out rapidly if the replica copy of an account is updated: it would proceed to transform the master into a copy of the replica, but that's probably not what you wanted :). I have a tool which allows me to switch the master and replica copy for any (inactive) account without having to shut anything down. This tool also lets me migrate data off onto a third system and immediately create a replica of that. This makes upgrading operating systems a much less fraught task.

In my sketch above (really not sure if it works of course) where both have something like a backlog you can like "tail" that backlog and push the update as soon as possible to the second machine. You solve the thing you mention with delays while pushing updates to two servers at the same time.

Yes, that's exactly how my code works. Asynchronous replication (which Ken called lazy replication) is fairly easy to do in Cyrus. Synchronous replication, where you only get a response to an IMAP/POP/LMTP command when the data is safely committed to the replica, would involve a much more substantial rewrite of the Cyrus code.

That's where block based replication schemes like DRDB have a big advantage: the state that they have to track is much less involved.

I'm currently running with a replication cycle of one second on my live servers for "rolling" replication (that's just a name I made up, its not an official term), so on average we would lose of half a second of update traffic for 1/16th of our user base if a single system failed. Further safeguards are possible by keeping copies of incoming mail for a short time on the MTA systems, but that's not really a Cyrus concern.

We also replicate to a tape backup spooling engine overnight. The replication engine is rather useful for fast incremental updates.

If one server is down it should mean that all tasks can be performed at the other one. I 'm curious how this would look if both servers are still running but cannot reach eachother. If there is indeeed a UUID: what if there are doubles... but I guess that has been taken into account.

UUIDs are just a convenient representation of message text, so that you can pass messages by reference rather than value. Duplicates don't matter (though I don't believe that they actual occur given my allocation scheme) so long as the message text is the same. I maintain databases of MD5 checksums for messages and cache text just to be on the safe side.

UUIDs were originally just Mailbox UniqueID + Message UID. Unfortunately, UniqueID isn't very Unique: its just a simple hash of the mailbox name. I ended up allocating UUIDs in large chunks from the master process on each machine. If a process runs out of UUIDS (which would take some going as they are allocated in chunks of 2**24), it falls back to call by value.

--
David Carter                             Email: [EMAIL PROTECTED]
University Computing Service,            Phone: (01223) 334502
New Museums Site, Pembroke Street,       Fax:   (01223) 334679
Cambridge UK. CB2 3QH.
---
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: Funding Cyrus High Availability

Reply via email to