Hi,
I am still experimenting with node fencing in the case of several
simultaneous node failures and have a question about the stonith
configuration.
I am failing four nodes simultaneously on a 16-node cluster (the
four nodes are HP blades in an enclosure that is switched off).
Altogether it works surprisingly well: all resources are back
online within 50 seconds :-)
However, the logs show that the four nodes are shot sequentially,
one after the other. Here are the relevant log entries on the DC
(the node failures are detected at 17:08:50):
pengine[29350]: 2008/04/16_17:09:01 WARN: stage6: Scheduling Node bladed4 for STONITH
pengine[29350]: 2008/04/16_17:09:01 WARN: stage6: Scheduling Node bladed3 for STONITH
pengine[29350]: 2008/04/16_17:09:01 WARN: stage6: Scheduling Node bladed2 for STONITH
pengine[29350]: 2008/04/16_17:09:01 WARN: stage6: Scheduling Node bladed1 for STONITH
stonithd[3596]: 2008/04/16_17:09:01 info: client tengine [pid: 29349] want a STONITH operation RESET to node bladed4.
stonithd[3596]: 2008/04/16_17:09:07 info: Succeeded to STONITH the node bladed4: optype=RESET. whodoit: bladea4
stonithd[3596]: 2008/04/16_17:09:07 info: client tengine [pid: 29349] want a STONITH operation RESET to node bladed3.
stonithd[3596]: 2008/04/16_17:09:14 info: Succeeded to STONITH the node bladed3: optype=RESET. whodoit: bladeb4
stonithd[3596]: 2008/04/16_17:09:14 info: client tengine [pid: 29349] want a STONITH operation RESET to node bladed2.
stonithd[3596]: 2008/04/16_17:09:20 info: Succeeded to STONITH the node bladed2: optype=RESET. whodoit: bladea4
stonithd[3596]: 2008/04/16_17:09:20 info: client tengine [pid: 29349] want a STONITH operation RESET to node bladed1.
stonithd[3596]: 2008/04/16_17:09:27 info: Succeeded to STONITH the node bladed1: optype=RESET. whodoit: bladea4
Unfortunately, the stonith agents take their time, and since they
are called one after the other, it takes some 25 seconds to bring
all four hosts down.
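Reading the durations off the stonithd timestamps above confirms
the serialization (just a quick sanity check in Python):

```python
from datetime import datetime

# Timestamps from the stonithd log on the DC: the first RESET request
# and the completion of each of the four fencing operations.
stamps = ["17:09:01", "17:09:07", "17:09:14", "17:09:20", "17:09:27"]
times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]

# Per-node fencing duration and total time until all four are down.
deltas = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
total = int((times[-1] - times[0]).total_seconds())
print(deltas, total)  # each reset takes 6-7 s, ~26 s in total
```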
Now I wonder whether I can configure heartbeat to reset all failed
nodes immediately, without each reset waiting for the previous one
to finish.
Here are the stonith-relevant parts of the cib:
<crm_config>
  <cluster_property_set>
    <attributes>
      <nvpair name="stonith-enabled" value="true"/>
    </attributes>
  </cluster_property_set>
</crm_config>
<resources>
  <clone id="fencing_enclosure_a">
    <instance_attributes>
      <attributes>
        <nvpair name="clone_max" value="2"/>
        <nvpair name="clone_node_max" value="1"/>
      </attributes>
    </instance_attributes>
    <primitive id="stonith-hpoa-encla" class="stonith" provider="heartbeat"
               type="external/hpoa-encla">
      <operations>
        <op id="stonih-hpoa-encla_on" name="on" timeout="15s"/>
        <op id="stonih-hpoa-encla_off" name="off" timeout="15s"/>
        <op id="stonih-hpoa-encla_status" name="status" timeout="15s"/>
        <op id="stonih-hpoa-encla_reset" name="reset" timeout="15s"/>
      </operations>
    </primitive>
  </clone>
  <clone id="fencing_enclosure_b">
    ... same for enclosure b with stonith agent "external/hpoa-enclb"
  </clone>
  <clone id="fencing_enclosure_c">
    ...
  </clone>
  <clone id="fencing_enclosure_d">
    ...
  </clone>
</resources>
Is there any parameter I can tweak so nodes can be shot
without waiting for other stonith actions in progress?
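I was half hoping for a cluster property next to stonith-enabled,
along the lines of the snippet below -- but the concurrent-fencing
name is pure guesswork on my part (I believe later Pacemaker
versions have a property of that name) and I could not find it
documented for this heartbeat version:

```xml
<crm_config>
  <cluster_property_set>
    <attributes>
      <nvpair name="stonith-enabled" value="true"/>
      <!-- hypothetical: let stonith operations run in parallel -->
      <nvpair name="concurrent-fencing" value="true"/>
    </attributes>
  </cluster_property_set>
</crm_config>
```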
Thanks and best regards,
Martin
--
Dr. Martin Alt
System und Softwarearchitektur
Plath GmbH
Gotenstrasse 18
D - 20097 Hamburg
Tel: +49 40/237 34-361
Fax: +49 40/237 34-173
Email: [EMAIL PROTECTED]
http://www.plath.de
Hamburg HRB7401
Geschäftsführer: Dipl.-Kfm. Nico Scharfe
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems