Hi,
I am still experimenting with node fencing in the case of several
simultaneous node failures and have a question about the stonith
configuration.
I am failing four nodes simultaneously on a 16-node cluster (the
four nodes are HP blades in an enclosure that is switched off).
Altogether it works surprisingly well: all resources are back
online within 50 seconds :-)
However, the logs show that the four nodes are shot sequentially,
one after the other. Here are the relevant log entries on the DC
(the node failures are detected at 17:08:50):
pengine[29350]: 2008/04/16_17:09:01 WARN: stage6: Scheduling Node bladed4 for STONITH
pengine[29350]: 2008/04/16_17:09:01 WARN: stage6: Scheduling Node bladed3 for STONITH
pengine[29350]: 2008/04/16_17:09:01 WARN: stage6: Scheduling Node bladed2 for STONITH
pengine[29350]: 2008/04/16_17:09:01 WARN: stage6: Scheduling Node bladed1 for STONITH
stonithd[3596]: 2008/04/16_17:09:01 info: client tengine [pid: 29349] want a STONITH operation RESET to node bladed4.
stonithd[3596]: 2008/04/16_17:09:07 info: Succeeded to STONITH the node bladed4: optype=RESET. whodoit: bladea4
stonithd[3596]: 2008/04/16_17:09:07 info: client tengine [pid: 29349] want a STONITH operation RESET to node bladed3.
stonithd[3596]: 2008/04/16_17:09:14 info: Succeeded to STONITH the node bladed3: optype=RESET. whodoit: bladeb4
stonithd[3596]: 2008/04/16_17:09:14 info: client tengine [pid: 29349] want a STONITH operation RESET to node bladed2.
stonithd[3596]: 2008/04/16_17:09:20 info: Succeeded to STONITH the node bladed2: optype=RESET. whodoit: bladea4
stonithd[3596]: 2008/04/16_17:09:20 info: client tengine [pid: 29349] want a STONITH operation RESET to node bladed1.
stonithd[3596]: 2008/04/16_17:09:27 info: Succeeded to STONITH the node bladed1: optype=RESET. whodoit: bladea4
Unfortunately, the stonith agents take their time, and since they
are called one after the other, it takes some 25 seconds to bring
all four hosts down.
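Reading the durations off the stonithd timestamps above confirms
the serialization (just a quick sanity check in Python):

```python
from datetime import datetime

# Timestamps from the stonithd log on the DC: the first RESET request
# and the completion of each of the four fencing operations.
stamps = ["17:09:01", "17:09:07", "17:09:14", "17:09:20", "17:09:27"]
times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]

# Per-node fencing duration and total time until all four are down.
deltas = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
total = int((times[-1] - times[0]).total_seconds())
print(deltas, total)  # each reset takes 6-7 s, ~26 s in total
```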
Now I wonder whether I can configure heartbeat to reset all failed
nodes immediately, without each reset waiting for the previous one
to finish.
Here are the stonith-relevant parts of the cib:
<crm_config>
  <cluster_property_set>
    <attributes>
      <nvpair name="stonith-enabled" value="true"/>
    </attributes>
  </cluster_property_set>
</crm_config>
<resources>
  <clone id="fencing_enclosure_a">
    <instance_attributes>
      <attributes>
        <nvpair name="clone_max" value="2"/>
        <nvpair name="clone_node_max" value="1"/>
      </attributes>
    </instance_attributes>
    <primitive id="stonith-hpoa-encla" class="stonith" provider="heartbeat"
               type="external/hpoa-encla">
      <operations>
        <op id="stonih-hpoa-encla_on" name="on" timeout="15s"/>
        <op id="stonih-hpoa-encla_off" name="off" timeout="15s"/>
        <op id="stonih-hpoa-encla_status" name="status" timeout="15s"/>
        <op id="stonih-hpoa-encla_reset" name="reset" timeout="15s"/>
      </operations>
    </primitive>
  </clone>
  <clone id="fencing_enclosure_b">
    ... same for enclosure b with stonith agent "external/hpoa-enclb"
  </clone>
  <clone id="fencing_enclosure_c">
    ...
  </clone>
  <clone id="fencing_enclosure_d">
    ...
  </clone>
</resources>
Is there any parameter I can tweak so nodes can be shot
without waiting for other stonith actions in progress?
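I was half hoping for a cluster property next to stonith-enabled,
along the lines of the snippet below -- but the concurrent-fencing
name is pure guesswork on my part (I believe later Pacemaker
versions have a property of that name) and I could not find it
documented for this heartbeat version:

```xml
<crm_config>
  <cluster_property_set>
    <attributes>
      <nvpair name="stonith-enabled" value="true"/>
      <!-- hypothetical: let stonith operations run in parallel -->
      <nvpair name="concurrent-fencing" value="true"/>
    </attributes>
  </cluster_property_set>
</crm_config>
```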
Thanks and best regards,
Martin
--
Dr. Martin Alt
System und Softwarearchitektur
Plath GmbH
Gotenstrasse 18
D - 20097 Hamburg
Tel: +49 40/237 34-361
Fax: +49 40/237 34-173
Email: [EMAIL PROTECTED]
http://www.plath.de
Hamburg HRB7401
Geschäftsführer: Dipl.-Kfm. Nico Scharfe
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems