I am running heartbeat 2.1.4-04, the latest available on SLES10-SP2. I have configured riloe as my stonith plugin on a four-node cluster (HP DL-365 G5s with integrated iLO 2). After muddling my way through an attempted clone setup, I settled on a primitive resource setup with one riloe stonith resource per node.

When I pkill heartbeat on any given node, all works well and the node is reset through iLO as expected, but only WHEN the stonith resource for that node is active on the DC.

When it is not active on the DC, things start out as expected: the DC logs that it "want a STONITH operation RESET to node xxx", then "broadcast succeeded require others to stonith the node xxx", and the node that IS currently hosting the stonith resource for that node dutifully responds with "want a STONITH operation RESET to node xxx". The node hosting the stonith resource then successfully stoniths the dead node and attempts to notify the DC that it was successful (according to the logs, the return code from running riloe is 0, the stonithing node thinks it succeeded, and it successfully sends the notification to the DC). However, the DC's logs show "received T_STITmsg from myself, ignoring", followed by a message from the stonithing node to the effect that the stonith operation was already complete when this message was received. This cycle then repeats indefinitely.
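For anyone trying to line up the same sequence across nodes, a small filter along these lines might help pull the relevant entries out of each node's log. This is only a sketch: the patterns are the fragments quoted above from memory, so the exact wording in a real heartbeat log may differ, and the function name and log path are hypothetical.

```python
import re
import sys

# Fragments as quoted in this post; they are paraphrased from memory,
# so adjust them to match the exact text in your own logs.
PATTERNS = [
    r"want a STONITH operation RESET to node",
    r"require others to stonith",
    r"T_STIT",
    r"from myself",
]
_MATCHER = re.compile("|".join(PATTERNS), re.IGNORECASE)


def stonith_lines(lines):
    """Return only the lines that mention one of the STONITH fragments."""
    return [line.rstrip("\n") for line in lines if _MATCHER.search(line)]


if __name__ == "__main__":
    # Hypothetical usage: python stonith_grep.py /var/log/messages
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for line in stonith_lines(fh):
                print(f"{path}: {line}")
```

Running it against the logs from the DC and from the node hosting the stonith resource, then sorting the combined output by timestamp, should make it easier to see which side stops responding first.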
The net result is that if the stonith resource for a given node is NOT running on the DC and that node fails, the node winds up in an infinite reboot loop until I kill the stonith daemon on the node hosting that stonith resource. That totally confuses the cluster, and I wind up having to reboot all the nodes. I will post the logs when I get to work tomorrow. It looks a lot like the stonith daemon on the DC, after broadcasting for help to stonith a failed node, either gives up too quickly or ignores the success response from the "whodoit" node that actually (successfully!) performs the stonith.

Thanks in advance for your help,
Gary