Hi,

Dejan Muhamedagic wrote:

Hi,

On Wed, Aug 27, 2008 at 01:56:25PM +0200, Alexander Hofmann wrote:
Hello list,

after many hours of trial and error, I got the iLO STONITH configuration working.
During some tests I noticed the following issue:

Testcase 1: node1 has all resources and node2 is hard powered off.
node1 tries to STONITH node2 but has no success.
node1 retries the STONITH of node2 every 30 seconds.
If I now boot node2, it is shut down by node1 because of the retries.
How can I configure STONITH so that the STONITH plugin is executed only once or twice
within a very short interval?

Testcase 2: node2 has all resources and is hard powered off.
node1 tries to STONITH node2 but does not succeed.
node1 _doesn't_ start the resources! It retries the STONITH of node2
every ~30 seconds.

Both problems are most probably in the external/riloe stonith
plugin: if a node is powered off, it should report success for
the stonith operation. The point of a stonith operation is to
ensure that a host is down or rebooted. This seems to be a
serious issue with external/riloe.
I've browsed through the source code (python...brrrr :-) of external/riloe but
could not find the piece of code where the error occurs.
If I send "power off" twice at the iLO command line, I get the following string on
the second execution:
Server power already Off
Perhaps the HTTP command returns the same string and the iLO plugin does not know
how to interpret it:

# stonith -t external/riloe hostlist=node1 ilo_hostname=10.0.2.1 ilo_user=user 
ilo_password=**** ilo_protocol=2.0 ilo_powerdown_method=button ilo_can_reset=1 
-T off tfdps01
** INFO: external_run_cmd: Calling '/usr/lib/stonith/plugins/external/riloe off 
node1' returned 256

** (process:27676): CRITICAL **: external_reset_req: 'riloe off' for host node1 
failed with rc 256


# stonith -t external/riloe hostlist=tfdps01 ilo_hostname=10.0.2.1 
ilo_user=tfdps ilo_password=startdfs ilo_protocol=2.0 
ilo_powerdown_method=button ilo_can_reset=1 -S
stonith: external/riloe device OK.
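
If the guess above is right, the fix would amount to treating the "already Off" reply as success. The following is only a minimal sketch in Python, not the actual external/riloe code: the function name off_request_succeeded and the assumption that the plugin sees the reply as plain text are made up for illustration. Only the string "Server power already Off" comes from the iLO command line output quoted above.

```python
# Hypothetical helper, NOT taken from external/riloe.
ALREADY_OFF = "Server power already Off"  # string seen at the iLO command line


def off_request_succeeded(command_ok, reply_text):
    """Decide whether an 'off' request left the server powered off.

    command_ok -- True if the iLO accepted the power-off command
    reply_text -- message text returned by the iLO (assumed to be plain text)
    """
    if command_ok:
        return True
    # A server that is already off is exactly what STONITH wants,
    # so this reply is counted as success rather than as an error.
    return ALREADY_OFF.lower() in reply_text.lower()
```

With a check like this, a second "off" sent to a node that is already down would be reported as success instead of rc 256.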



Today, another problem crossed my mind:
If I detach the power cable of one node, I cannot communicate with its iLO card.
Would it solve the problem if I sent an ICMP echo request before connecting to the iLO card and returned 0 (success) if I got no response?
Or: can I be sure that the node is already off when its iLO card doesn't
respond? (point-to-point connection, no routing etc.)
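
The ICMP idea could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested change to external/riloe: the function names, the power_off callback and the injectable run parameter are all hypothetical, and the ping options -c and -W are those of the Linux iputils ping. The central assumption, that a silent iLO card means the node has no power, holds only on a link where nothing else can fail, which is exactly the open question above.

```python
import subprocess


def ilo_answers_ping(address, count=2, timeout_s=2, run=subprocess.call):
    """Return True if the iLO card at `address` answers an ICMP echo request."""
    cmd = ["ping", "-c", str(count), "-W", str(timeout_s), address]
    # ping exits with 0 only if at least one reply was received.
    return run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0


def off_exit_code(address, power_off, run=subprocess.call):
    """Exit code for an 'off' request: 0 means the node is considered down.

    power_off -- hypothetical callback that talks to the iLO card and
                 returns True if the server ended up powered off
    """
    if not ilo_answers_ping(address, run=run):
        # ASSUMPTION: no answer from the iLO card means the node has no power.
        # A broken cable or switch would look the same, so this is unsafe
        # unless the link to the iLO card cannot fail independently.
        return 0
    return 0 if power_off() else 1
```

The run parameter exists only so the logic can be exercised without sending real pings, for example off_exit_code("10.0.2.1", lambda: False, run=lambda cmd, **kw: 1).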

Example: I detached the power cable and executed the following commands:
# stonith -t external/riloe hostlist=tfdps01 ilo_hostname=10.0.2.1 
ilo_user=tfdps ilo_password=startdfs ilo_protocol=2.0 
ilo_powerdown_method=button ilo_can_reset=1 -T off tfdps01
** INFO: external_run_cmd: Calling '/usr/lib/stonith/plugins/external/riloe 
status' returned 256
** INFO: external_run_cmd: Calling '/usr/lib/stonith/plugins/external/riloe off 
tfdps01' returned 256

** (process:22349): CRITICAL **: external_reset_req: 'riloe off' for host 
tfdps01 failed with rc 256
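
A note on the number in these messages: 256 looks like a raw wait status, as returned by system() or wait(), and not like the plugin's own exit code. Assuming that this is what external_run_cmd logs (not verified here), it can be decoded with a short Python sketch; the helper name is made up for illustration.

```python
import os


def exit_code_from_wait_status(status):
    """Return the exit code packed into a POSIX wait status.

    Returns None if the process did not exit normally
    (for example, if it was killed by a signal).
    """
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return None


print(exit_code_from_wait_status(256))
```

On Linux this prints 1, so under the assumption above the plugin simply exited with a generic failure code in both the "off" and the "status" calls.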


PS: Where can I find a list explaining all possible STONITH plugin return codes?


Thanks,

Dejan

Thanks,
        Alex


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


