Re: [Linux-HA] Feedback: Master/Slave RA for Postgres / Slony Cluster?

Dominik Klein Tue, 06 Nov 2007 23:27:08 -0800

Hi Andrew

thanks for your reply.

So I thought I could implement "demote" as "return 0", as "promote" on the other machine will do the job anyway. Well, not the best idea, as a "monitor" action on the apparently demoted machine will still return Master status until "promote" on the second machine has finished.
What if the crm delayed the slave's monitor until after the other side was promoted... would that help significantly?

That would probably prevent one failed monitor action in this very special case.

Furthermore, the switchover command will fail if the other machine is not responding. In case the current master really has a problem, all you can do to get a writeable database on the current slave is to use the failover command. But Linux-HA only knows "promote" and "demote".
So I implemented promote and demote the following way:

#### promote
if switchover_to_me
then
    return 0
else
    if ! switchover_to_me
    then
        failover_to_me
        return $?
    fi
fi
####

#### demote
switchover_to_other_machine
# don't care if this works, as it cannot work if
# the other machine is not healthy
return 0
####
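
A minimal shell sketch of the two actions above, written as functions in the style of a resource agent. The helpers switchover_to_me, switchover_to_other_machine and failover_to_me stand in for whatever wraps the Slony switchover and failover commands; here they are stubs driven by a hypothetical PEER_HEALTHY variable so the control flow can be run without a database. Unlike the pseudocode, this sketch tries the switchover only once before falling back to failover, which is an assumption about the intent.

```shell
#!/bin/sh
# Hypothetical stubs: replace with real wrappers around the Slony commands.
# PEER_HEALTHY=1 simulates a responding postgres on the other node.
PEER_HEALTHY=${PEER_HEALTHY:-1}

switchover_to_me()            { [ "$PEER_HEALTHY" -eq 1 ]; }
switchover_to_other_machine() { [ "$PEER_HEALTHY" -eq 1 ]; }
failover_to_me()              { return 0; }

slony_promote() {
    # Prefer the controlled switchover; it needs a healthy peer.
    if switchover_to_me; then
        echo "switchover"
        return 0
    fi
    # Peer not reachable: failover is the only way to become master.
    failover_to_me || return 1
    echo "failover"
    return 0
}

slony_demote() {
    # Best effort: this cannot succeed if the other machine is not healthy,
    # so the result is deliberately ignored.
    switchover_to_other_machine || true
    return 0
}
```

With PEER_HEALTHY=0, slony_demote still returns 0 and slony_promote takes the failover branch, which is exactly the case that forces a complete resync.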
What you also need to know about slony-1 is the fact that you need to resync the COMPLETE data after a failover. In slony-1 it is not possible to let a failed node rejoin the slony-Cluster (even if it was healthy when the failover command was issued). It has to fetch ALL data from the new master. So you want to avoid failover if it is not absolutely necessary.
Up to now I thought my RA could handle a few cases, and it turns out: SOME it can handle (like master reboot or slave reboot or controlled switchover). But a simple thing such as killing postgres on the master machine causes a failover. Why?
Say A is master and B is slave at this moment:

1. monitor on A fails
2. Linux-HA executes demote on A
-> As you see above, this will work even if it does nothing
3. Linux-HA executes promote on B
-> This, as postgres on A is not running, will end up in a failover (see above)
Notifications might help.
The Filesystem agent (when operating in OCFS2 mode) keeps a list of who its peers are. If you did the same then I think you'd be able to recognize that you're all alone and that it was ok to switchover_to_me instead.

Read my first post again. Switchover is not possible if the other postgres instance is not available. The only way to make a single slave the new master is to use the failover command.


What *would* help here is:

1. monitor on A fails -> OCF_NOT_RUNNING
Now, instead of "demote A, promote B":
2. Stop/Start the resource on A

IIRC, "start" includes a monitor action (sometimes called a "probe" in this case). This would report "OCF_RUNNING_MASTER", so the problem would be solved.
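
To make this concrete, here is a rough sketch of a monitor action that separates the three states involved. The return codes are the standard OCF values (0 for running, 7 for not running, 8 for running as master). postgres_is_running and node_is_origin are hypothetical helpers, stubbed with the made-up variables PG_UP and IS_ORIGIN so the logic can be exercised without a database.

```shell
#!/bin/sh
# Standard OCF return codes used by master/slave resources.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7
OCF_RUNNING_MASTER=8

# Hypothetical stubs: a real agent would query postgres and Slony here.
PG_UP=${PG_UP:-1}          # 1 = postgres process is up
IS_ORIGIN=${IS_ORIGIN:-0}  # 1 = Slony regards this node as master (origin)

postgres_is_running() { [ "$PG_UP" -eq 1 ]; }
node_is_origin()      { [ "$IS_ORIGIN" -eq 1 ]; }

slony_monitor() {
    # A dead postgres is reported as "not running", whatever its Slony role.
    postgres_is_running || return $OCF_NOT_RUNNING
    # Postgres is up: report the role Slony still assigns to this node.
    if node_is_origin; then
        return $OCF_RUNNING_MASTER
    fi
    return $OCF_SUCCESS
}
```

After postgres on A is killed and then started again, A is still the master as far as Slony is concerned, so the probe would return 8 and no promote on B (and therefore no failover) would be needed.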

On the other hand, this is probably a pretty big change in Linux-HA's master/slave handling, and it should be discussed.


Regards
Dominik
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Feedback: Master/Slave RA for Postgres / Slony Cluster?
