Re: Making Replication Robust

2007-10-19 Thread David Carter
On Tue, 9 Oct 2007, David Carter wrote: I've never faced a split brain situation which involved more than two or three messages (the outstanding log on an old master system). I suppose that it was predictable that a week after writing this I faced my first serious split brain (3000 messages

Re: Making Replication Robust

2007-10-13 Thread David Carter
On Sat, 13 Oct 2007, Bron Gondwana wrote: Apart from a couple of short-lived command line utilities it looks like the only use of signal() is a bunch of 'signal(SIGPIPE, SIG_IGN);' scattered through just about everything. Most of the interesting signal handling is done with sigaction already.
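For readers comparing the two calls mentioned above, here is a minimal sketch of how a 'signal(SIGPIPE, SIG_IGN);' line could be written with sigaction() instead. The helper names ignore_sigpipe() and assert_sigpipe_ignored() are hypothetical and are not taken from the Cyrus source; this only illustrates the POSIX calls, not how Cyrus does it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical helper: ignore SIGPIPE using sigaction().
 * Returns 0 on success, -1 on error (errno set by sigaction). */
int ignore_sigpipe(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = SIG_IGN;   /* same disposition as signal(SIGPIPE, SIG_IGN) */
    sigemptyset(&sa.sa_mask);  /* no extra signals blocked */
    sa.sa_flags = 0;

    return sigaction(SIGPIPE, &sa, NULL);
}

/* Hypothetical debug check: query the current disposition
 * and assert that SIGPIPE is really being ignored. */
void assert_sigpipe_ignored(void)
{
    struct sigaction cur;
    int rc = sigaction(SIGPIPE, NULL, &cur);

    assert(rc == 0 && cur.sa_handler == SIG_IGN);
    (void)rc;
}
```

With SIGPIPE ignored either way, a write() to a closed socket or pipe fails with EPIPE instead of killing the process, so the caller still has to check the return value.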

Re: Making Replication Robust

2007-10-12 Thread Rob Mueller
This would seem to be a significant advantage of running sync_client outside master. When I shut down master, sync_client continues to process the outstanding log. I can then use sync_shutdown_file when it has finished and is idle. We do something similar. But it means you have to develop

Re: Making Replication Robust

2007-10-12 Thread David Carter
On Wed, 10 Oct 2007, Rob Mueller wrote: I think the problem at the moment is that the process you really want is:
1. Stop new imap/pop/lmtp/sieve/etc connections
2. Finish and close existing connections cleanly but as quickly as possible
3. Finish running any sync log files
4. Fully shutdown

Re: Making Replication Robust

2007-10-12 Thread Bron Gondwana
On Fri, Oct 12, 2007 at 10:29:53AM -0400, Carson Gaspar wrote: David Carter wrote: I'm still a little bothered about signal handling and EINTR. I did some experiments after our last chat about signals. In practice disk IO system calls seem to be reasonably safe against EINTR on both Linux

Re: Making Replication Robust

2007-10-12 Thread Rob Mueller
Or is the problem that you have something like:
  write to file 1
  write to file 2
And if the first returns EINTR but is ignored, and then it writes the complete data to the second, things are in an inconsistent state?
This is my concern. Doing an ack 'write\(' reveals a scary mix of write,
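To make the concern concrete, here is a minimal sketch of a write wrapper that retries on EINTR and on short writes, so a caller only moves on to the second file once the first write has fully succeeded. The name write_all() is hypothetical; this is not code from Cyrus, just the usual POSIX retry loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write all len bytes of buf to fd.
 * Retries when write() is interrupted (EINTR) or writes only part
 * of the buffer. Returns 0 once everything is written, or -1 on
 * any other error, with errno left as set by write(). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any data was written: retry */
            return -1;      /* real error: let the caller decide what to do */
        }
        p += n;             /* short write: advance past what was accepted */
        len -= (size_t)n;
    }
    return 0;
}
```

A caller would check 'write_all(fd1, ...) == 0' before touching the second file, so an interrupted first write cannot be silently skipped.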

Re: Making Replication Robust

2007-10-09 Thread David Carter
On Mon, 8 Oct 2007, Rudy Gevaert wrote: Note, we are running 2.3.7, I'm going to upgrade when 2.3.10 is out. We have replication in place, but daren't use it. If I have a method to check if the replica is in sync then I'll dare to do a fail over. I do this using -v -v to sync_client, which

Re: Making Replication Robust

2007-10-09 Thread David Carter
On Mon, 8 Oct 2007, Bron Gondwana wrote: We already run a sync_server on our masters as well because we use it for user moves: Generally takes about 15 seconds for the critical path bit, and the initial sync doesn't matter how long it takes. As do we. In fact when I first showed the

Re: Making Replication Robust

2007-10-09 Thread David Carter
On Thu, 4 Oct 2007, Bron Gondwana wrote:
a) MUST never lose a message that's been accepted for delivery except in the case of total drive failure.
b) MUST have a standard way to integrity check and repair a replica-pair after a system crash.
A replica system is automatically repaired to

Re: Making Replication Robust

2007-10-09 Thread Rob Mueller
c) MUST have a clean process to soft-failover to the replica machine, making sure that all replication events from the ex-master have been synchronised. Something more than sync_shutdown_file plus automatic retries on recent work files? I think the problem at the moment is that the

Re: Making Replication Robust

2007-10-08 Thread Rudy Gevaert
Hello, I agree with Bron. However I do think some parts are more important than others. I'll try to explain my point of view. Note, we are running 2.3.7, I'm going to upgrade when 2.3.10 is out. We have replication in place, but daren't use it. If I have a method to check if the replica

Re: Making Replication Robust

2007-10-08 Thread Bron Gondwana
On Mon, Oct 08, 2007 at 10:03:31AM +0200, Rudy Gevaert wrote: For me points a, e and f are most important, but the others are also important. Bron Gondwana wrote: So I'd like to start a dialogue on the topic of making Cyrus replication robust across failures with the following goals: a)

Making Replication Robust

2007-10-03 Thread Bron Gondwana
Hi, As I've mentioned on the mailing list, we have had to put quite a lot of infrastructure around Cyrus to make replication robust in all cases. While the core replication protocol seems pretty stable now, and with GUID stuff it will be easier to do integrity checks, it's still very much not a