I am running heartbeat 2.1.4-04, the latest available on SLES10-SP2.  I have 
configured riloe as my stonith plugin on a four-node cluster (we have HP DL-365 
G5s with integrated iLO2). After muddling my way through an attempted clone 
setup, I finally settled on a primitive resource setup with one riloe stonith 
resource per node.

When I pkill heartbeat on any given node, all works well and the node is reset 
through iLO as expected... but only WHEN the stonith iLO resource for that node 
is active on the DC.

When it is not active on the DC, things start out as expected: the DC logs 
that it "want a STONITH operation RESET to node xxx", then "broadcast 
succeeded require others to stonith the node xxx", and the node that IS 
currently hosting the stonith resource for that node dutifully responds with 
"want a STONITH operation RESET to node xxx".  The node hosting the stonith 
resource then successfully stoniths the dead node and attempts to notify the 
DC that it was successful. According to the logs, the return code from running 
riloe is 0, the stonithing node thinks it was successful, and it successfully 
sends the notification to the DC. However, the DC's logs show "received 
T_STITmsg from myself, ignoring", followed by a message from the stonithing 
node to the effect that the stonith operation was already complete when this 
message was received.  This sequence then repeats indefinitely.

The net result is that if the stonith resource for a given node is NOT running 
on the DC and that node fails, the node winds up in an infinite reboot loop 
until I kill the stonith daemon on the node hosting that stonith resource. 
That totally confuses the cluster, and I wind up having to reboot all the 
nodes.


I will post the logs when I get to work tomorrow. It looks a lot like this: 
when the stonith daemon on the DC broadcasts for help to stonith a failed 
node, it either gives up too quickly or ignores the success response from the 
"whodoit" node that actually (successfully!) performs the stonith.

Thanks in advance for your help, 

Gary





_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
