Re: failover scenario's for replication

2006-08-29 Thread Wesley Craig

On 29 Aug 2006, at 04:11, Paul Dekkers wrote:

I haven't tried this, but does it hurt to define sync_server, imapd and
friends in the replica's cyrus.conf, so that it is identical to the
master's?


If you tell the replica where mupdate is, sync_server behaves  
incorrectly.  I'd also avoid running imapd on a replica, unless I  
could guarantee that users couldn't get to it.


(I'm still thinking of nice ways to control (/automatically restart) the
sync_client; It doesn't write out a pid, daemon (on FreeBSD) doesn't
create the right pidfile for the thing, so things like monit or the
restartwrapper fail to control the thing... It doesn't stay in
foreground while in rolling mode... Maybe just have check_procs from
Nagios look at the process-string (or any other thing that looks in
/proc or ps). What do others do? Wesley's suggestion for having a
separate init-script in the same runlevel still looks 'manual' to me,
and/or that's not the part that generates an alert.


We run the attached script periodically.
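
The archive does not reproduce the attached replnag script, so the following is not that script. It is only a minimal sketch of the process-string check Paul describes, written as a Nagios-style plugin. The match string "sync_client -r" (rolling mode), the function names, and the rule "exactly one process is healthy" are assumptions for illustration.

```python
import subprocess

OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes


def count_matching(ps_output, needle):
    """Count process command lines that contain `needle`."""
    return sum(1 for line in ps_output.splitlines() if needle in line)


def check(ps_output, needle="sync_client -r"):
    """Return (exit_code, message) for the given `ps` output."""
    n = count_matching(ps_output, needle)
    if n == 1:
        return OK, "OK: 1 process matching %r" % needle
    return CRITICAL, "CRITICAL: %d processes matching %r (expected 1)" % (n, needle)


def run_check():
    """Inspect the live process table and return the plugin exit code."""
    # `ps -A -o args` prints one full command line per process.
    out = subprocess.run(
        ["ps", "-A", "-o", "args"], capture_output=True, text=True, check=True
    ).stdout
    code, message = check(out)
    print(message)
    return code
```

Used as a plugin, the script would end with `sys.exit(run_check())`. A check like this only reports that the process is gone; it does not say why it exited.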


Maybe I'd write a patch for staying in foreground and/or writing out a
pidfile ;-))


We've toyed with the idea of making it stay in the foreground, so we  
could run it from init.  In the current implementation, tho, when it  
exits it needs operator intervention, so automatic restart is no use  
-- it will just exit again.  The real solution is to make the code  
more resilient.
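
To illustrate why blind automatic restart does not help here, a supervisor can at least cap restarts and hand over to an operator once the process keeps exiting. This is a sketch under assumptions: the class name, the limit of three restarts and the ten-minute window are all hypothetical, and it does nothing to make sync_client itself more resilient.

```python
import time
from collections import deque


class RestartBudget:
    """Allow at most `max_restarts` restarts within `window` seconds."""

    def __init__(self, max_restarts=3, window=600.0):
        self.max_restarts = max_restarts
        self.window = window
        self._stamps = deque()

    def allow(self, now=None):
        """Return True if one more restart is allowed at time `now`."""
        now = time.monotonic() if now is None else now
        # Forget restarts that have fallen out of the window.
        while self._stamps and now - self._stamps[0] > self.window:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_restarts:
            return False  # looks like a crash loop: alert an operator instead
        self._stamps.append(now)
        return True
```

A wrapper would call `allow()` each time the child exits and raise an alert as soon as it returns False, rather than restarting a process that will just exit again.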


And on a separate track, we need an overarching strategy for high  
availability.


:wes



replnag
Description: Binary data

Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

Re: failover scenario's for replication

2006-08-29 Thread Paul Dekkers
Hi,

Bron Gondwana wrote:
> On Mon, Aug 28, 2006 at 02:23:22PM -0400, Wesley Craig wrote:
>   
>> On 26 Aug 2006, at 16:09, Paul Dekkers wrote:
>> 
>>> Right now, it looks tricky to me to enable replication after failover,
>>> or the replicated machine itself if you're not sure that the replica is
>>> identical and the sync-processes finished completely: if a message-file
>>> is in place on machine A (say "7.") but it was not replicated to machine
>>> B while that one becomes the master, the machine B will create a new
>>> file 7. and both machines consider this file synchronised after that:
>>> also if roles switch back, you have two different (with one isolated)
>>> copies of 7.
>>>   
>> As I understand it, this is what replication uuids are for.  Not that  
>> I've experimented with this particular case.
>> 
>
> All that replication UUIDs do is make sure that the copy of '7' on
> the master overwrites the copy of '7' on the replica.  It doesn't make
> any attempt to retain '7' from the replica.
>   

It doesn't replace it either. As soon as I have a different copy of '7'
on my replica it never gets rewritten, as far as I can see from this
simple experiment:

'Haver' is master, 'gerst' is replica.

[EMAIL PROTECTED] paul]# md5sum 7.
599307e354e203b706a7ba88d6ad668c  7.
[EMAIL PROTECTED] paul]# md5sum 11.
md5sum: 11.: No such file or directory

[EMAIL PROTECTED] paul]# sudo -u cyrus /usr/lib/cyrus-imapd/sync_client -v -u paul
USER paul
[EMAIL PROTECTED] paul]# md5sum 7.
32187646fe6176e989b9b59c59f7af9e  7.
[EMAIL PROTECTED] paul]# md5sum 11.
41d62ed42df1f058a76831061fb0c4ca  11.

[EMAIL PROTECTED] paul]# md5sum 7.
599307e354e203b706a7ba88d6ad668c  7.
[EMAIL PROTECTED] paul]# md5sum 11.
41d62ed42df1f058a76831061fb0c4ca  11.

Still the previous '7', the new '11' gets replicated correctly. I don't
know if this is correct behavior or not, to be honest.
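
The manual md5sum comparison above could be automated along these lines. This is a rough sketch, not a Cyrus tool: the function names are made up, it assumes both mailbox directories are readable from one place (for example a copy or a mount of the replica's spool), and it only compares message files named like "7.".

```python
import hashlib
from pathlib import Path


def md5_of(path):
    """Return the hex MD5 digest of one file, read in chunks."""
    h = hashlib.md5()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def digest_map(mailbox_dir):
    """Map message file names such as '7.' to their MD5 digests."""
    return {
        p.name: md5_of(p)
        for p in Path(mailbox_dir).iterdir()
        if p.is_file() and p.name.endswith(".") and p.name[:-1].isdigit()
    }


def compare(master, replica):
    """Classify message files by comparing two digest maps."""
    both = set(master) & set(replica)
    return {
        "missing_on_replica": sorted(set(master) - set(replica)),
        "only_on_replica": sorted(set(replica) - set(master)),
        "differ": sorted(n for n in both if master[n] != replica[n]),
    }
```

Fed the digests from the transcript, `compare()` would list '7.' under "differ" both before and after the sync_client run, which is the situation that goes unnoticed without such a check.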

(From one perspective I think this is a good thing: if there is a
different message(-id) on the other host, no matter what number it has,
it should remain there, a new message should be added and maybe even the
extra message on the replica should get replicated back to the master.
On the other hand: this causes inconsistencies and this is no
bidirectional synchronization, the master should be right (unless there
was a failover ;-)) so just replace the thing (hmm, have to think that
over, still sounds a bit scary and I've had inconsistent 'but still
running' filesystems before...))

>>> Or is it only preferred to use a replica if there is a really serious
>>> crash on the (previous) master?
>>>   
>> That's certainly how I view the current system.  Until replication is  
>> more reliable, I'd be quite leery of any sort of automatic failover.
>> 
>
> Ditto.  Our 'init scripts' actually check a database table to see which
> role a particular instance on a machine has and then starts up in that
> mode.  Changing over the database table entry is a manual step.
>
> The master init script also attempts to run the remaining log files with
> sync client if there are any.

Hmm, sounds like a good idea. (Although you can't do that with an
unreachable replica ;-) For now I can live with that, maybe I'd just
put some other check in front to see if the replica is available or not,
check_tcp from Nagios or something.)
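
A check of that kind, done once before the init script starts sync_client on the leftover log files, could look roughly like this. It is a sketch that does the same job as check_tcp with a plain socket; the function name and the three-second timeout are assumptions.

```python
import socket


def replica_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

The init script would call something like `replica_reachable("gerst", sync_port)` with whatever port sync_server listens on at the site, and skip the log replay when it returns False. A successful connect only shows that the port is open, not that sync_server is healthy.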

>   Sadly, sync_client doesn't interact well
> with real-time requirements and the replica being away.  Bah.  I'll
> get back to my "-o" => "only try to connect ONCE" patch one day.
>
>   
>>> It sounds nice to me if I could use heartbeat or (u)carp (/ifstated)
>>> like systems to start and stop a sync_client or sync_server copy of
>>> cyrus (both different cyrus.conf) as soon as the state of the virtual
>>> interface changes, but then it is even more likely that some replication
>>> process is not finished without an admin even noticing it.
>>>   
>> I agree, this is a great goal.  I'd be interested in seeing a roadmap  
>> for how to achieve it, including how failback would occur.  There's a  
>> lot of opportunity to share operational experience with Cyrus.  If  
>> only there was a forum to publish such information...
>> 
>
> Yeah, I've had a play with using heartbeat.  The downside is that its
> colocation works, but ordering operations without having dependencies
> take the other side down as well doesn't work properly.  You can't say
> "always start the master in preference" and "start the replica first
> if you can" (makes master startup actually work at the moment!).
>   

I haven't tried this, but does it hurt to define sync_server, imapd and
friends in the replica's cyrus.conf, so that it is identical to the
master's?
If we're not using it while in replica mode I'm curious if it will hurt
(same for the sync_server on the master). Then the only switch you'd
make between being a replica or master is the sync_client, which is
something we currently take out of control of the c

Re: failover scenario's for replication

2006-08-28 Thread Bron Gondwana
On Mon, Aug 28, 2006 at 02:23:22PM -0400, Wesley Craig wrote:
> On 26 Aug 2006, at 16:09, Paul Dekkers wrote:
> >Right now, it looks tricky to me to enable replication after failover,
> >or the replicated machine itself if you're not sure that the replica is
> >identical and the sync-processes finished completely: if a message-file
> >is in place on machine A (say "7.") but it was not replicated to machine
> >B while that one becomes the master, the machine B will create a new
> >file 7. and both machines consider this file synchronised after that:
> >also if roles switch back, you have two different (with one isolated)
> >copies of 7.
> 
> As I understand it, this is what replication uuids are for.  Not that  
> I've experimented with this particular case.

All that replication UUIDs do is make sure that the copy of '7' on
the master overwrites the copy of '7' on the replica.  It doesn't make
any attempt to retain '7' from the replica.

> >Or is it only preferred to use a replica if there is a really serious
> >crash on the (previous) master?
> 
> That's certainly how I view the current system.  Until replication is  
> more reliable, I'd be quite leery of any sort of automatic failover.

Ditto.  Our 'init scripts' actually check a database table to see which
role a particular instance on a machine has and then starts up in that
mode.  Changing over the database table entry is a manual step.
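
For readers picturing the role lookup, here is a minimal sketch of how an init script might ask a database which role to start in. Nothing here is Bron's actual setup: the use of sqlite3, the table name `instance_roles`, its columns and the function name are all assumptions for illustration.

```python
import sqlite3


def role_for(conn, host, instance):
    """Return 'master' or 'replica' for one instance on one host."""
    row = conn.execute(
        "SELECT role FROM instance_roles WHERE host = ? AND instance = ?",
        (host, instance),
    ).fetchone()
    if row is None:
        # Refuse to guess: starting in the wrong role is worse than not starting.
        raise LookupError("no role recorded for %s on %s" % (instance, host))
    if row[0] not in ("master", "replica"):
        raise ValueError("unexpected role %r" % (row[0],))
    return row[0]
```

The point of the design is that the script only reads the table. Flipping a row from 'replica' to 'master' stays a deliberate, manual step.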

The master init script also attempts to run the remaining log files with
sync client if there are any.  Sadly, sync_client doesn't interact well
with real-time requirements and the replica being away.  Bah.  I'll
get back to my "-o" => "only try to connect ONCE" patch one day.

> >It sounds nice to me if I could use heartbeat or (u)carp (/ifstated)
> >like systems to start and stop a sync_client or sync_server copy of
> >cyrus (both different cyrus.conf) as soon as the state of the virtual
> >interface changes, but then it is even more likely that some replication
> >process is not finished without an admin even noticing it.
> 
> I agree, this is a great goal.  I'd be interested in seeing a roadmap  
> for how to achieve it, including how failback would occur.  There's a  
> lot of opportunity to share operational experience with Cyrus.  If  
> only there was a forum to publish such information...

Yeah, I've had a play with using heartbeat.  The downside is that its
colocation works, but ordering operations without having dependencies
take the other side down as well doesn't work properly.  You can't say
"always start the master in preference" and "start the replica first
if you can" (makes master startup actually work at the moment!).

I might look at it again in a bit though, 2.0.7 looks nicer than 2.0.5
was so far in terms of tools working sanely.

Bron.



Re: failover scenario's for replication

2006-08-28 Thread Wesley Craig

On 26 Aug 2006, at 16:09, Paul Dekkers wrote:

Right now, it looks tricky to me to enable replication after failover,
or the replicated machine itself if you're not sure that the replica is
identical and the sync-processes finished completely: if a message-file
is in place on machine A (say "7.") but it was not replicated to machine
B while that one becomes the master, the machine B will create a new
file 7. and both machines consider this file synchronised after that:
also if roles switch back, you have two different (with one isolated)
copies of 7.


As I understand it, this is what replication uuids are for.  Not that  
I've experimented with this particular case.



Or is it only preferred to use a replica if there is a really serious
crash on the (previous) master?


That's certainly how I view the current system.  Until replication is  
more reliable, I'd be quite leery of any sort of automatic failover.



It sounds nice to me if I could use heartbeat or (u)carp (/ifstated)
like systems to start and stop a sync_client or sync_server copy of
cyrus (both different cyrus.conf) as soon as the state of the virtual
interface changes, but then it is even more likely that some replication
process is not finished without an admin even noticing it.


I agree, this is a great goal.  I'd be interested in seeing a roadmap  
for how to achieve it, including how failback would occur.  There's a  
lot of opportunity to share operational experience with Cyrus.  If  
only there was a forum to publish such information...


:wes



failover scenario's for replication

2006-08-26 Thread Paul Dekkers
Hi,

I was wondering what people consider best practice as a failover scenario
in a replication environment.

Right now, it looks tricky to me to enable replication after failover,
or the replicated machine itself if you're not sure that the replica is
identical and the sync-processes finished completely: if a message-file
is in place on machine A (say "7.") but it was not replicated to machine
B while that one becomes the master, the machine B will create a new
file 7. and both machines consider this file synchronised after that:
also if roles switch back, you have two different (with one isolated)
copies of 7.
(Probably a thorough check is needed to ensure the consistency 'once
again' and copy files from one side to the other... so I hope that's
where best practice comes in ;-))
Or is it only preferred to use a replica if there is a really serious
crash on the (previous) master?

It sounds nice to me if I could use heartbeat or (u)carp (/ifstated)
like systems to start and stop a sync_client or sync_server copy of
cyrus (both different cyrus.conf) as soon as the state of the virtual
interface changes, but then it is even more likely that some replication
process is not finished without an admin even noticing it.

Curious,
Paul

