Hi
a week ago I asked whether there was a resource agent that implements
Master/Slave for a PostgreSQL cluster using slony-1 replication.
There was not, so I tried to implement it myself.
I want to report back and explain why I think it is not possible (at
the moment) to implement this in heartbeat.
Here we go:
Short summary of slony-1 replication:
In a slony-1 replication setup
* Tables are put together to replication "sets"
* Each set has an "origin" (master)
* Only the origin can be written to
* There can be multiple sets with a different origin each
* There can be multiple "subscribers" (slaves) for each set
* Subscribers are read-only
As you have to tie the master role to the health of postgres itself
somehow, this restricts you to using only one set, or to managing all
sets at once. Well, okay, I think I could live with this.
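For reference, setting up such a replication looks roughly like the
following slonik sketch (cluster name, conninfo and IDs are made up for
illustration, and the syntax is from memory):

```
cluster name = mycluster;
node 1 admin conninfo = 'dbname=mydb host=nodeA';
node 2 admin conninfo = 'dbname=mydb host=nodeB';

# one set, node 1 as origin, node 2 as subscriber
create set (id = 1, origin = 1, comment = 'main set');
set add table (set id = 1, origin = 1, id = 1,
    fully qualified name = 'public.mytable');
subscribe set (id = 1, provider = 1, receiver = 2, forward = no);
```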
Slony-1 implements two commands for "switchover" and "failover". By
switchover I mean a planned switch of roles while all machines are
healthy; by failover I mean the slave taking over because the master
has a problem.
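In slonik terms (node and set IDs made up), the two operations are:

```
# planned switchover: both nodes are healthy,
# origin and subscriber swap roles in one step
lock set (id = 1, origin = 1);
move set (id = 1, old origin = 1, new origin = 2);

# failover: the origin is dead, the subscriber
# is forcibly made the new origin
failover (id = 1, backup node = 2);
```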
So now comes the tricky part.
In slony-1 you cannot make an origin a subscriber without making another
subscriber the new origin. This happens in ONE command. So there are no
independent "demote" and "promote" commands. In a two machine setup you
cannot have two slaves at a time.
In other words: "Promote" implicitly demotes the other machine, and
"Demote" implicitly promotes the other machine.
So I thought I could implement "demote" as "return 0", since "promote"
on the other machine will do the job anyway. Well, not the best idea: a
"monitor" action on the apparently demoted machine will still report
master status until "promote" on the second machine has finished.
Furthermore, the switchover command will fail if the other machine is
not responding. In case the current master really has a problem, all
you can do to get a writable database on the current slave is to use
the failover command. But Linux-HA only knows "promote" and "demote".
So I implemented promote and demote the following way:
#### promote
if switchover_to_me
then
    return 0
else
    # switchover failed (e.g. the other machine is not
    # responding), so failover is the only option left
    failover_to_me
    return $?
fi
####
#### demote
switchover_to_other_machine
# don't care if this works, as it cannot work if
# the other machine is not healthy
return 0
####
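To illustrate the control flow, here is a self-contained sketch with
the slonik calls stubbed out as dummy shell functions
(switchover_to_me, failover_to_me and so on are placeholders, not real
commands; the stubs simulate the case where the other machine is
unreachable):

```shell
#!/bin/sh
# Stubs simulating the "other machine is unreachable" case;
# a real RA would run slonik scripts here instead.
switchover_to_me()            { return 1; }   # fails: peer is down
switchover_to_other_machine() { return 1; }   # fails: peer is down
failover_to_me()              { echo "FAILOVER"; return 0; }

promote() {
    if switchover_to_me
    then
        return 0
    else
        # planned switchover failed, fall back to failover
        failover_to_me
        return $?
    fi
}

demote() {
    # don't care if this works; it cannot work when the
    # other machine is unhealthy, and promote over there
    # does the real work anyway
    switchover_to_other_machine || true
    return 0
}

promote   # prints FAILOVER, since switchover_to_me fails
demote    # always returns 0
```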
What you also need to know about slony-1 is that after a failover you
have to resync the COMPLETE data. In slony-1 it is not possible to let
a failed node rejoin the slony cluster (even if it was healthy when the
failover command was issued); it has to fetch ALL data from the new
master. So you want to avoid failover unless it is absolutely necessary.
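Rejoining the failed node after a failover therefore means rebuilding
it from scratch, roughly like this (slonik sketch from memory, path
setup omitted):

```
# the failed node has to be dropped and re-created ...
drop node (id = 1, event node = 2);
store node (id = 1, comment = 'rebuilt old master', event node = 2);
# ... and subscribing again copies the COMPLETE set data
subscribe set (id = 1, provider = 2, receiver = 1, forward = no);
```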
Up to now I thought my RA could handle a few cases, and it turns out it
can handle SOME (like a master reboot, a slave reboot or a controlled
switchover). But something as simple as killing postgres on the master
machine causes a failover. Why?
Say A is the master and B the slave at this moment:
1. monitor on A fails
2. Linux-HA executes demote on A
-> As shown above, this will "succeed" even though it does nothing
3. Linux-HA executes promote on B
-> As postgres on A is not running, this will end up in a failover (see
above)
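The sequence can be reproduced with a small simulation (again, stub
functions standing in for the real slonik calls):

```shell
#!/bin/sh
# A is the origin, B the subscriber; postgres on A was killed.
postgres_on_A_running=false

# switchover needs both nodes alive, so it fails while A is down
switchover() { $postgres_on_A_running; }
failover()   { echo "FAILOVER"; return 0; }

# step 2: Linux-HA runs demote on A -- a no-op that "succeeds"
demote_on_A()  { switchover || true; return 0; }
# step 3: Linux-HA runs promote on B -- switchover fails,
# so the RA falls back to failover
promote_on_B() { switchover || failover; }

demote_on_A           # returns 0 although nothing happened
promote_on_B          # prints FAILOVER
```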
This is pretty much it. If you have any ideas on how to improve this,
or if you also think that this is impossible with the current
master/slave implementation in Linux-HA, please respond.
The whole "separate demote and promote" approach in Linux-HA just does
not seem to fit the way slony-1 handles switchover and failover.
If you have any more questions (it may well be that I forgot
something), just ask - I'll be happy to help improve Linux-HA.
Best regards
Dominik
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems