Alexander Hofmann wrote:
Hi,
Dejan Muhamedagic wrote:
Hi,
On Wed, Aug 27, 2008 at 01:56:25PM +0200, Alexander Hofmann wrote:
Hello list,
after many hours of trial and error, I got the iLO STONITH configuration
working.
During some tests I noticed the following issue:
Testcase 1: node1 has all resources and node2 is hard powered off.
node1 tries to STONITH node2 but has no success.
node1 retries the STONITH of node2 every 30 seconds.
If I now boot node2, it is shut down by node1 because of the retries.
How can I configure STONITH so that the plugin is executed only
once or twice, within a very short interval?
Testcase 2: node2 has all resources and is hard powered off.
node1 tries to STONITH node2 but does not succeed.
node1 _doesn't_ start the resources! It retries the STONITH of node2
every ~30 seconds.
Both problems are most probably in the external/riloe stonith
plugin: if a node is powered off, it should report success for
the stonith operation. The point of a stonith operation is to
ensure that a host is down or rebooted. This seems to be a
serious issue with external/riloe.
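The behaviour described here (an "off" request for a host that is already powered off should count as success) could be sketched roughly as follows. This is only an illustration and not the actual external/riloe code: stonith_off, get_power_state and press_power_button are hypothetical names, and the convention of returning 0 for success and 1 for failure is an assumption.

```python
def stonith_off(get_power_state, press_power_button):
    """Return 0 if the host ends up powered off, 1 otherwise.

    Both arguments are hypothetical callables standing in for whatever
    the plugin really uses to talk to the management card.
    get_power_state() is assumed to return "ON" or "OFF".
    """
    if get_power_state() == "OFF":
        # Already off: the goal of the operation is met, so report success.
        return 0
    press_power_button()
    # Re-check the state instead of trusting the request blindly.
    return 0 if get_power_state() == "OFF" else 1
```

With this shape, a second "off" against a host that is already down returns 0 instead of an error.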
I've browsed through the source code (Python... brrrr :-) of external/riloe but
could not find the piece of code where the error occurs.
If I send "power off" twice on the iLO command line, I get the following string on the second execution:
Server power already Off
Perhaps the HTTP command returns the same string and the iLO plugin does not know
how to interpret it:
# stonith -t external/riloe hostlist=node1 ilo_hostname=10.0.2.1 ilo_user=user ilo_password=**** ilo_protocol=2.0 ilo_powerdown_method=button ilo_can_reset=1 -T off tfdps01
** INFO: external_run_cmd: Calling '/usr/lib/stonith/plugins/external/riloe off node1' returned 256
** (process:27676): CRITICAL **: external_reset_req: 'riloe off' for host node1 failed with rc 256
# stonith -t external/riloe hostlist=tfdps01 ilo_hostname=10.0.2.1 ilo_user=tfdps ilo_password=startdfs ilo_protocol=2.0 ilo_powerdown_method=button ilo_can_reset=1 -S
stonith: external/riloe device OK.
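As a side note on the "returned 256" in the log: if that number is a raw wait status (as handed back by wait() or pclose()), which is an assumption here and not something the log itself confirms, it corresponds to the plugin script exiting with code 1. A minimal sketch for decoding such a value on a Unix system, with describe_wait_status being a made-up helper name:

```python
import os


def describe_wait_status(status):
    """Split a raw Unix wait status into (exit_code, signal_number)."""
    if os.WIFEXITED(status):
        # Normal termination: the exit code sits in the high byte.
        return os.WEXITSTATUS(status), None
    if os.WIFSIGNALED(status):
        # The process was killed by a signal.
        return None, os.WTERMSIG(status)
    return None, None


print(describe_wait_status(256))  # on Linux: (1, None)
```

Read this way, rc 256 only says that the script exited with a generic failure code, not which step inside it failed.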
Today, another problem crossed my mind:
If I detach the power cable of one node, I cannot communicate with its iLO card.
Would it solve the problem if I sent an ICMP echo request before connecting
to the iLO card and returned 0 (success) if I get no response?
Or: can I be sure that the node is already off when its iLO card doesn't
respond? (point-to-point connection, no routing etc.)
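The ping-first idea could look roughly like the sketch below. It assumes the Linux iputils ping with its -c (count) and -W (timeout in seconds) options; ilo_answers_ping, off_with_probe and do_off are hypothetical names, not part of any existing plugin. Whether a silent iLO card really implies a powered-off node is exactly the open question: an unanswered ping looks the same for a pulled power cable and for a broken network link.

```python
import subprocess


def ilo_answers_ping(host, timeout_s=2):
    """Return True if a single ICMP echo request to host is answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def off_with_probe(host, do_off, probe=ilo_answers_ping):
    """Return 0 (success) if the card is silent, else the result of do_off().

    do_off is a hypothetical callable that performs the real power-off
    request and returns an exit code.
    """
    if not probe(host):
        # ASSUMPTION under discussion: a silent card means no power.
        return 0
    return do_off()
```

The probe is passed in as a parameter so that the decision logic can be exercised without touching the network.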
Example: I detached the power cable and executed the following command:
# stonith -t external/riloe hostlist=tfdps01 ilo_hostname=10.0.2.1 ilo_user=tfdps ilo_password=startdfs ilo_protocol=2.0 ilo_powerdown_method=button ilo_can_reset=1 -T off tfdps01
** INFO: external_run_cmd: Calling '/usr/lib/stonith/plugins/external/riloe status' returned 256
** INFO: external_run_cmd: Calling '/usr/lib/stonith/plugins/external/riloe off tfdps01' returned 256
** (process:22349): CRITICAL **: external_reset_req: 'riloe off' for host tfdps01 failed with rc 256
PS: Where can I find a list explaining all possible STONITH plugin return codes?
I made a mistake in the output above; these names are equivalent:
Node: tfdps01 == node1
User: tfdps == user
Thanks,
Dejan
Thanks,
Alex
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems