Hi Andrew,

> I'd say you removed no-quorum-policy=ignore

Actually, both no_quorum_policy and no-quorum-policy are set to
"ignore", and expected-quorum-votes is set to "2":

  <crm_config>
    <cluster_property_set id="cib-bootstrap-options">
      ...
      <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="2"/>
      <nvpair id="cib-bootstrap-options-no_quorum_policy" name="no_quorum_policy" value="ignore"/>
      <nvpair id="nvpair-1d2c923d-7619-4b45-989a-698357f9f8cb" name="no-quorum-policy" value="ignore"/>
      ...
    </cluster_property_set>
  </crm_config>

Removing no-quorum-policy=ignore and no_quorum_policy=ignore (as in,
deleting the variables) left the cluster unable to fail over, whether
triggered by an ifdown of the interface or by a node reboot. The state
displayed by the GUI did not agree with the state displayed by crm_mon:
the GUI showed the ifdown'd or rebooted node as still controlling
resources, whereas crm_mon showed the resources as unavailable. Both
showed the inaccessible node as offline.

Setting no-quorum-policy=stop had the same results, including the
resources not migrating to the working system until I restored
no-quorum-policy=ignore.

One of the tests led to filesystem corruption. Very messy. (This is a
test-only setup, so no real data is present.)

So, no, the change that I made was neither deleting no-quorum-policy
nor setting it to stop. Setting no-quorum-policy=ignore seems to be
required for the cluster to support migrations and failovers.

Cheers and thanks,
Bob Haxo

On Wed, 2009-05-20 at 11:17 +0200, Andrew Beekhof wrote:
> On Wed, May 20, 2009 at 1:31 AM, Bob Haxo <bh...@sgi.com> wrote:
> > Greetings,
> >
> > I liked the idea of not starting the cluster at boot, and found that the
> > fenced node would reboot and then openais start brought the node onboard
> > without triggering a reboot of the already running node.
> >
> > Then magic happened. I chkconfig'd openais to start with boot, re-ran the
> > "ifdown eth0" command that had been triggering STONITH and then the STONITH
> > deathmarch, and, well, everything worked. I've done this test many tens of
> > times without a STONITH deathmarch.
> >
> > Unfortunately, I haven't a clue as to what was changed that cleared the
> > issue.
>
> At a guess, I'd say you removed no-quorum-policy=ignore
> OpenAIS based clusters don't pretend they have quorum when only 1 of
> the 2 nodes is available (and you can't start shooting until you have
> quorum or the above option is set).
>
> >
> > Thanks for all the suggestions.
> >
> > Cheers,
> > Bob Haxo
> >
> >
> > On Tue, 2009-05-19 at 14:03 +0200, Andrew Beekhof wrote:
> >
> > On Mon, May 18, 2009 at 8:12 PM, Bob Haxo <bh...@sgi.com> wrote:
> >>
> >> Any suggestions as to what needs changing so that the stonith deathmarch
> >> can be avoided?
> >
> > If you only have two nodes, the only two ways have already been discussed:
> > use poweroff, or don't start the cluster at boot.
> > If you don't want to do either of those, the only way to terminate the
> > stonith loop is to fix the network failure.
> >
> > If you had 3 or more nodes, the returning node wouldn't have quorum
> > and therefore wouldn't be allowed to shoot anyone.
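
The quorum rule quoted above (a lone node in a two-node cluster has no quorum, and a node may not shoot anyone unless it has quorum or no-quorum-policy=ignore is set) can be sketched roughly as follows. This is only an illustration in Python, assuming a simple-majority rule over expected-quorum-votes; the function names has_quorum and may_fence are hypothetical and are not part of Pacemaker or OpenAIS.

```python
def has_quorum(votes_present: int, expected_votes: int) -> bool:
    """Assumed rule: quorum needs strictly more than half of the expected votes."""
    return 2 * votes_present > expected_votes


def may_fence(votes_present: int, expected_votes: int, no_quorum_policy: str) -> bool:
    """A partition may shoot (STONITH) only if it has quorum,
    or if no-quorum-policy is set to "ignore"."""
    return has_quorum(votes_present, expected_votes) or no_quorum_policy == "ignore"


if __name__ == "__main__":
    # Two-node cluster, one node reachable, expected-quorum-votes=2.
    print(may_fence(1, 2, "stop"))
    print(may_fence(1, 2, "ignore"))
    # Three-node cluster, a single returning node on its own.
    print(may_fence(1, 3, "stop"))
```

Under these assumptions, a single surviving node in a two-node cluster is only permitted to fence when the policy is "ignore", which would be consistent with failover stopping once the option was removed or set to "stop".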
_______________________________________________ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker