Alexander Hofmann wrote:

Hi,
Dejan Muhamedagic wrote:
Hi,
On Wed, Aug 27, 2008 at 01:56:25PM +0200, Alexander Hofmann wrote:
Hello list,
after many hours of trial and error, I got the iLO STONITH configuration working. During some tests I noticed the following issues:

Testcase 1: node1 has all resources and node2 is hard powered off. node1 tries to STONITH node2 but does not succeed, and retries every 30 seconds. If I now boot node2, it is shut down by node1 because of the retries. How can I configure STONITH so that the plugin is executed only once or twice within a very short interval?

Testcase 2: node2 has all resources and is hard powered off. node1 tries to STONITH node2 but does not succeed. node1 _doesn't_ start the resources! It retries to STONITH node2 every ~30 seconds.
Both problems are most probably in the external/riloe stonith
plugin: if a node is powered off, it should report success for the stonith operation. The point of a stonith operation is to ensure that a host is down or rebooted. This seems to be a serious issue with external/riloe.
I've browsed through the source code (python...brrrr :-) of external/riloe but could not find the piece of code where the error occurs. If I send "power off" twice at the iLO command line, I get the following string on the second execution:

Server power already Off

Perhaps the HTTP command returns the same string and the iLO plugin does not know how to interpret it:
# stonith -t external/riloe hostlist=node1 ilo_hostname=10.0.2.1 ilo_user=user ilo_password=**** ilo_protocol=2.0 ilo_powerdown_method=button ilo_can_reset=1 -T off tfdps01
** INFO: external_run_cmd: Calling '/usr/lib/stonith/plugins/external/riloe off node1' returned 256
** (process:27676): CRITICAL **: external_reset_req: 'riloe off' for host node1 failed with rc 256

# stonith -t external/riloe hostlist=tfdps01 ilo_hostname=10.0.2.1 ilo_user=tfdps ilo_password=startdfs ilo_protocol=2.0 ilo_powerdown_method=button ilo_can_reset=1 -S
stonith: external/riloe device OK.

Today, another problem crossed my mind: if I detach the power cable of one node, I cannot communicate with its iLO card. Would it solve the problem if I sent an ICMP echo request before connecting to the iLO card and returned 0 (success) if I got no response? Or: can I be sure that the node is already off when its iLO card doesn't respond? (point-to-point connection, no routing etc.)

Example: I detached the power cable and executed the following command:

# stonith -t external/riloe hostlist=tfdps01 ilo_hostname=10.0.2.1 ilo_user=tfdps ilo_password=startdfs ilo_protocol=2.0 ilo_powerdown_method=button ilo_can_reset=1 -T off tfdps01
** INFO: external_run_cmd: Calling '/usr/lib/stonith/plugins/external/riloe status' returned 256
** INFO: external_run_cmd: Calling '/usr/lib/stonith/plugins/external/riloe off tfdps01' returned 256
** (process:22349): CRITICAL **: external_reset_req: 'riloe off' for host tfdps01 failed with rc 256

PS: Where can I find a list explaining all possible STONITH plugin return codes?
I made a mistake:
Node: tfdps01 == node1
User: tfdps == user
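
The point raised above, that an "off" request against a node that is already powered off should count as success, could be sketched roughly as follows. This is only an illustration and not the actual external/riloe code: the function name `off_exit_code`, its parameters, and the constant are hypothetical, and the marker string is the one seen at the iLO command line, which the HTTP interface may or may not return in the same form.

```python
# Assumption: this is the text the iLO command line prints on a repeated
# "power off"; the HTTP/XML interface may word it differently.
ALREADY_OFF_MARKER = "server power already off"


def off_exit_code(ilo_message, command_failed):
    """Return the exit code an 'off' action could report to stonith.

    0 means the node is known to be off, 1 means the request failed.
    """
    if not command_failed:
        # The iLO accepted the power-off request.
        return 0
    if ALREADY_OFF_MARKER in ilo_message.lower():
        # The request was refused only because the node is already off,
        # which is the state a stonith "off" is meant to reach.
        return 0
    # Any other failure (login error, timeout, ...) stays a failure.
    return 1
```

With such a check, `off_exit_code("Server power already Off", True)` would give 0, while an unrelated error message would still give 1. An iLO card that does not answer at all is deliberately not treated as success here, since that question is still open in this thread.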


Thanks,
Dejan
Thanks,
Alex
_______________________________________________
Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems