On 9/21/2010 10:06 AM, Steve Davies wrote:
> - Kill the master (A).
> - The slave (B) is coming up
> - Some transient issue prevents the RC scripts running on (B).
> - (B) backs down and requests to become slave again
> - (A) is down, so (B) never gets confirmation of its slave request.
>
> Nothing more happens. A is down and B is sulking!
>
> Can a node be persuaded to retry under these circumstances?

Generally, no: there is no way to know how "transient" the "issue" is.
E.g. if a backhoe ate your uplink fiber and telco techies will fix the
cut "in a day or two" -- do you want the other node retrying for a day?
Or two? Or a week?

> Perhaps there is a way to identify this odd intermediate state so we
> can force a heartbeat restart or reinitialise?

What I have is a separate nagios setup that monitors cluster IP and
services and sends me nastygrams if they disappear.

Dima
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
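
For illustration, an external check of the kind described above (something
outside the cluster that probes the cluster IP and complains when it stops
answering) could look roughly like the sketch below. This is not the
poster's actual setup. The names check_tcp and main, the HOST PORT argument
layout, and the use of a plain TCP connect as the health test are all
assumptions made for the example. Only the exit codes (0 OK, 1 WARNING,
2 CRITICAL, 3 UNKNOWN) follow the usual Nagios plugin convention.

```python
import socket

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, timeout=5.0):
    """Try a TCP connect to host:port and return (exit_code, message).

    Hypothetical helper: a successful connect is treated as "service up",
    any socket-level failure (refused, timed out, unresolvable) as CRITICAL.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK - {host}:{port} is accepting connections"
    except OSError as exc:
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable: {exc}"


def main(argv):
    """Parse 'prog HOST PORT', run the check, print one status line.

    Returns the exit code so a wrapper can pass it to sys.exit().
    """
    if len(argv) != 3:
        print("UNKNOWN - usage: check_cluster_ip.py HOST PORT")
        return UNKNOWN
    try:
        port = int(argv[2])
    except ValueError:
        print(f"UNKNOWN - port must be an integer, got {argv[2]!r}")
        return UNKNOWN
    if not 0 < port < 65536:
        print(f"UNKNOWN - port out of range: {port}")
        return UNKNOWN
    code, message = check_tcp(argv[1], port)
    print(message)
    return code
```

To use this as a plugin, the script would end with
sys.exit(main(sys.argv)) so that the monitoring system sees the exit code.
Note that a check like this only reports that the cluster IP or service is
gone. It does not make node (B) retry, which matches the point made above:
a human decides what to do once the alert arrives.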