Hi list. I've been using Heartbeat "heartbeat-2.1.3-22.1.i386" with Pacemaker "pacemaker-heartbeat-0.6.5-8.1.i386" on HP ProLiant DL380 G4 servers with iLO firmware "version 1.88 09/19/2006" with the external/riloe plugin for STONITH.

I noticed a STONITH issue when I originally set them up: if the server you're trying to reset is OFF and you run a stonith command like the one below, it returns exit status 0 but the server doesn't attempt to start:

    stonith -t external/riloe hostlist=192.168.33.21 ilo_hostname=192.168.33.20 ilo_user=Administrator ilo_password=xxxxxxxx ilo_can_reset=1 ilo_protocol=2.0 ilo_powerdown_method=button -T reset 192.168.33.20

This breaks rule #4 here: http://www.linux-ha.org/STONITH

I tried the same command with "-T off"; it returned status 0, turned the server on for about 1 second, then powered it off. I tried the same command with "-T on"; it returned status 0 and turned the server on. Note that when the server was ON, everything worked fine (all three returned exit status 0; reset reset it, off turned it off, and on kept it on).

I thought: okay, reset doesn't work when it's OFF, so what if I set the "ilo_can_reset" variable to "0"? I tried it with all three of the above commands and got exactly the same results.

I looked at the riloe script for a while (I'm not a Python guy) and it seemed that line 174 was checking the value of the "reset_ok" variable incorrectly, such that the second half of the if condition would always fail. The relevant section of the external/riloe script, starting at line 174, is:

    if cmd == 'reset' and not reset_ok:
        acmds.append(login + todo['off'] + logout)
        acmds.append(login + todo['on'] + logout)
    else:
        acmds.append(login + todo[cmd] + logout)

So I created an "external/my-riloe" script where line 174 looks like this instead:

    if cmd == 'reset' and reset_ok == '0':

When I used this new "external/my-riloe" script with "ilo_can_reset=0" and "-T reset" while the server was OFF, the server turned on for a second and then shut off (as the "off" command does), then turned on and stayed on. I used the "external/my-riloe" script with Heartbeat and it worked for my purposes. It was able to bring up the node when it was off.
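
To illustrate why I think the original check never takes the off/on branch, here is a minimal sketch. It assumes that reset_ok reaches line 174 as a string such as '0' or '1' (I have not confirmed how the script parses ilo_can_reset), and the two function names are made up for this illustration. In Python any non-empty string counts as true, so "not reset_ok" is False even when the value is '0':

```python
# Assumption: the plugin holds ilo_can_reset as a string ('0' or '1'),
# not as an integer or boolean.

def uses_off_on_original(cmd, reset_ok):
    """Mirror of the condition on the original line 174."""
    return cmd == 'reset' and not reset_ok


def uses_off_on_patched(cmd, reset_ok):
    """Mirror of the condition used in external/my-riloe."""
    return cmd == 'reset' and reset_ok == '0'


for value in ('0', '1'):
    # Columns: ilo_can_reset value, original condition, patched condition
    print(value,
          uses_off_on_original('reset', value),
          uses_off_on_patched('reset', value))
```

This prints "0 False True" and "1 False False", which matches what I saw: with the original condition, setting ilo_can_reset=0 made no difference.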
In practice, if I hard powered down either node, the other node powered it back on. So, all was well in the server room. :)

Recently we got some HP ProLiant DL380 G5 servers with iLO 2 firmware "version 1.50 03/21/2008", and I installed Heartbeat "heartbeat-2.99.0-3.1.i386" and Pacemaker "pacemaker-heartbeat-0.6.6-17.2.i386" on them. What I found when doing the STONITH command-line tests with the "external/riloe" plugin is that when the node is ON everything works as expected (reset resets it, off turns it off, and on turns it on). However, when the server is OFF, on turns it on, off fails loudly, and reset silently returns exit code 0 but the server stays off. The fact that "off" returns an error when the server is already OFF means that the workaround plugin I was using won't work anymore.

Here is the command output, starting from both nodes ON. You can see that the first "off" works fine, then the second one fails loudly. After that, the "reset" command returns exit status 0 but doesn't bring up the node:

    [EMAIL PROTECTED] stonith -t external/riloe hostlist=192.168.33.21 ilo_hostname=192.168.33.20 ilo_user=Administrator ilo_password=xxxxxxxx ilo_can_reset=1 ilo_protocol=2.0 ilo_powerdown_method=button -T off 192.168.33.20
    [EMAIL PROTECTED] stonith -t external/riloe hostlist=192.168.33.21 ilo_hostname=192.168.33.20 ilo_user=Administrator ilo_password=xxxxxxxx ilo_can_reset=1 ilo_protocol=2.0 ilo_powerdown_method=button -T off 192.168.33.20
    ** INFO: external_run_cmd: Calling '/usr/lib/stonith/plugins/external/riloe off 192.168.33.20' returned 256
    ** (process:14417): CRITICAL **: external_reset_req: 'riloe off' for host 192.168.33.20 failed with rc 256
    [EMAIL PROTECTED] echo $?
    5
    [EMAIL PROTECTED] stonith -t external/riloe hostlist=192.168.33.21 ilo_hostname=192.168.33.20 ilo_user=Administrator ilo_password=xxxxxxxx ilo_can_reset=1 ilo_protocol=2.0 ilo_powerdown_method=button -T reset 192.168.33.20
    [EMAIL PROTECTED] echo $?
    0

Has anyone else encountered this issue?
Will upgrading the iLO 2 firmware to version 1.60 fix it? I didn't see anything in HP's list of fixes that resembles the issue I'm having. Any ideas how I can get "reset" to turn the server on when it is OFF, either directly or by using "off" then "on"?

Thanks,
--
Tyler Sutherland
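
P.S. To make the last question concrete, here is a rough sketch of the behaviour I'm after, not a patch to the real plugin. The function name plan_commands and the power_is_on argument are hypothetical: it assumes the script has some way to learn the current power state from the iLO before choosing what to send, which I have not worked out. reset_ok is again assumed to be the string '0' or '1'.

```python
def plan_commands(cmd, power_is_on, reset_ok):
    """Return the list of power actions to send for one STONITH request.

    cmd         -- 'on', 'off' or 'reset', as given with -T
    power_is_on -- hypothetical input: True if the host is currently powered on
    reset_ok    -- the ilo_can_reset value as a string, '0' or '1' (assumed)
    """
    if cmd == 'off':
        # Avoid sending "off" to a host that is already off, since that
        # fails loudly on the G5 / iLO 2 machines.
        return ['off'] if power_is_on else []
    if cmd == 'on':
        return ['on']
    if cmd == 'reset':
        if not power_is_on:
            # A reset of a powered-off host should still leave it running.
            return ['on']
        if reset_ok == '1':
            return ['reset']
        return ['off', 'on']
    raise ValueError('unknown command: %r' % (cmd,))


# Example: a reset request for a host that is currently off
print(plan_commands('reset', False, '1'))
```

With a plan like this, "reset" on a powered-off host would come down to a plain "on", and "off" on a powered-off host would become a no-op that can still report success.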
