On 9/21/2010 10:06 AM, Steve Davies wrote:

> - Kill the master (A).
> - The slave (B) is coming up
> - Some transient issue prevents the RC scripts running on (B).
> - (B) backs down and requests to become slave again
> - (A) is down, so (B) never gets confirmation of its slave request.
>
> Nothing more happens. A is down and B is sulking!
>
> Can a node be persuaded to retry under these circumstances?

Generally, no: there is no way to know how "transient" the "issue" is. 
E.g. if a backhoe ate your uplink fiber and telco techies will fix the 
cut "in a day or two" -- do you want the other node retrying for a day? 
Or two? Or a week?

> Perhaps there is a way to identify this odd intermediate state so we
> can force a heartbeat restart or reinitialise?

What I have is a separate Nagios setup that monitors the cluster IP and 
services and sends me nastygrams if they disappear.
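
For illustration only, a check of that kind could look roughly like the 
sketch below. It is not my actual configuration: the function name 
check_tcp, the 5-second default timeout and the choice of a plain TCP 
connect as the test are all assumptions. The only thing taken from Nagios 
itself is the plugin convention of returning 0 for OK and 2 for CRITICAL.

```python
import socket

# Nagios plugin convention: exit status 0 means OK, 2 means CRITICAL.
OK, CRITICAL = 0, 2


def check_tcp(host, port, timeout=5.0):
    """Try a TCP connect to host:port and return (status, message).

    'host' would be the cluster IP and 'port' one of the services the
    cluster is supposed to keep running (both are hypothetical inputs).
    """
    try:
        # create_connection raises OSError on refusal, timeout or no route.
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK - {host}:{port} is accepting connections"
    except OSError as exc:
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable ({exc})"
```

To use it as a plugin, a small wrapper would print the message and exit 
with the returned status, so that Nagios can raise the alert when neither 
node is holding the cluster IP.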

Dima
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
