Hi,

On Sun, Aug 31, 2008 at 11:53:24PM -0500, Tyler Sutherland wrote:
> Hi list.
>
> I've been using Heartbeat "heartbeat-2.1.3-22.1.i386" with Pacemaker
> "pacemaker-heartbeat-0.6.5-8.1.i386" on HP ProLiant DL380 G4 servers
> with iLO firmware "version 1.88 09/19/2006" with the external/riloe
> plugin for STONITH.
>
> I noticed a STONITH issue when I originally set them up where if the
> server you're trying to reset is OFF and you run a stonith command like
> this one, it returns exit status 0 but the server didn't attempt to
> start:
> stonith -t external/riloe hostlist=192.168.33.21
> ilo_hostname=192.168.33.20 ilo_user=Administrator
> ilo_password=xxxxxxxx ilo_can_reset=1 ilo_protocol=2.0
> ilo_powerdown_method=button -T reset 192.168.33.20
>
> This breaks rule #4 here: http://www.linux-ha.org/STONITH
>
> I tried the same command with "-T off" and that returned status 0,
> turned the server on for about 1 second then powered it off. I tried
> the same command with "-T on" and it returned status 0 and turned the
> server on. Note that when the server was on, everything worked fine
> (they all returned exit status 0, reset reset it, off turned it off and
> on kept it on).
>
> I thought - okay, reset doesn't work when it's OFF, what if I set the
> "ilo_can_reset" variable to "0"? I tried it with all three of the above
> commands and got the exact same results. I looked at the riloe script
> for a while (I'm not a python guy) and it seemed that maybe line 174 was
> incorrectly checking the value of the "reset_ok" variable such that the
> second half of the if statement would always fail.
> The relevant section in the external/riloe script I'm referring to is:
>
> line 174: if cmd == 'reset' and not reset_ok:
>               acmds.append(login + todo['off'] + logout)
>               acmds.append(login + todo['on'] + logout)
>           else:
>               acmds.append(login + todo[cmd] + logout)
>
> So I created an "external/my-riloe" script where line 174 looks like
> this instead:
> line 174: if cmd == 'reset' and reset_ok == '0':
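
For illustration, here is a minimal sketch of why the original test cannot fire when the parameter arrives as a string, together with one way to normalize it. The helper name parse_flag and the list of accepted spellings are hypothetical and not part of the riloe plugin; the sketch also assumes the value reaches the script as a string.

```python
def parse_flag(value, default=True):
    """Interpret a plugin parameter such as ilo_can_reset as a boolean.

    Hypothetical helper: the accepted spellings are an assumption.
    """
    if value is None:
        return default
    return str(value).strip().lower() in ("1", "true", "yes", "on")


# Any non-empty string is true in Python, so "not reset_ok" is False
# for both '0' and '1' while the value is still a string.
assert (not "0") is False
assert (not "1") is False

# After conversion the original test behaves as intended.
reset_ok = parse_flag("0")
cmd = "reset"
use_off_then_on = cmd == "reset" and not reset_ok
```

Comparing against the string '0' as in the my-riloe change works as well; converting once up front just keeps later tests readable.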
Thanks for finding this. The author mixed up the string '0' with the
integer 0: in Python any non-empty string, '0' included, is true, so
"not reset_ok" is false whenever the parameter is set. One of the
problems with dynamically typed languages. I'll apply this fix.

> When I used this new "external/my-riloe" script with "ilo_can_reset=0"
> and "-T reset" when the server was OFF, the server turned on for a
> second then shut off like the "off" command does, then turned on and
> stayed on. I used the "external/my-riloe" script with Heartbeat and it
> worked for my purposes. It was able to bring up the node when it was
> off. For example if I hard powered down either node the other node
> powered it back on. So, all was well in the server room. :)
>
> Recently we got some HP ProLiant DL380 G5 servers with iLO 2 firmware
> "version 1.50 03/21/2008" and I installed Heartbeat
> "heartbeat-2.99.0-3.1.i386" and Pacemaker
> "pacemaker-heartbeat-0.6.6-17.2.i386" on them. What I found when doing
> the STONITH command line tests with the "external/riloe" plugin is that
> when the node is ON everything works as expected (reset resets, off
> turns it off and on turns it on). However when the server is OFF, on
> turns it on, off fails loudly, and reset silently returns exit code 0
> but the server stays off. The fact that OFF returns an error when it's
> OFF means that the workaround plugin I was using won't work anymore.

IIRC, somebody recently reported the same result.

> Here are the command outputs starting from both nodes ON. You can see
> that the first "OFF" works fine, then the second one fails loudly.
> After that the "reset" command returns exit status 0 but doesn't bring
> up the node:
>
> [EMAIL PROTECTED] stonith -t external/riloe hostlist=192.168.33.21
> ilo_hostname=192.168.33.20 ilo_user=Administrator
> ilo_password=xxxxxxxx ilo_can_reset=1 ilo_protocol=2.0
> ilo_powerdown_method=button -T off 192.168.33.20
> [EMAIL PROTECTED] stonith -t external/riloe hostlist=192.168.33.21
> ilo_hostname=192.168.33.20 ilo_user=Administrator
> ilo_password=xxxxxxxx ilo_can_reset=1 ilo_protocol=2.0
> ilo_powerdown_method=button -T off 192.168.33.20
> ** INFO: external_run_cmd: Calling
> '/usr/lib/stonith/plugins/external/riloe off 192.168.33.20' returned 256
>
> ** (process:14417): CRITICAL **: external_reset_req: 'riloe off' for
> host 192.168.33.20 failed with rc 256
> [EMAIL PROTECTED] echo $?
> 5
> [EMAIL PROTECTED] stonith -t external/riloe hostlist=192.168.33.21
> ilo_hostname=192.168.33.20 ilo_user=Administrator
> ilo_password=xxxxxxxx ilo_can_reset=1 ilo_protocol=2.0
> ilo_powerdown_method=button -T reset 192.168.33.20
> [EMAIL PROTECTED] echo $?
> 0
>
> Has anyone else encountered this issue?
>
> Will upgrading the iLO 2 firmware version to 1.60 fix it? I didn't see
> anything in HP's list of fixes that resembles the issue I'm having.
>
> Any ideas how I can get "reset" when the server is OFF to turn it on -
> either directly or by using "off" then "on"?

No idea why it doesn't work. Perhaps people at HP should know. The best
way would probably be to use a working client and see which commands
it sends to the device to reset the host. One of the issues that the
riloe plugin has is that it never checks replies from the device.
Unfortunately, I don't have such an iLO available to try it out.
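
A rough, untested sketch of what checking replies could look like, plus choosing the power commands from the current power state. Everything here is an assumption: the reply format (a STATUS="0x...." attribute with zero meaning success), the function names, and the idea that the power state has already been queried from the device. It is not what the riloe plugin does today.

```python
import re

# Assumption: each reply fragment from the device carries an attribute
# such as STATUS="0x0000", with zero meaning success.
STATUS_RE = re.compile(r"""STATUS=["'](0x[0-9A-Fa-f]+)["']""")


def failed_statuses(reply):
    """Return the non-zero status codes found in a device reply."""
    return [s for s in STATUS_RE.findall(reply) if int(s, 16) != 0]


def plan_reset(host_is_on, reset_ok):
    """Choose the power commands to send for a 'reset' request.

    host_is_on is assumed to come from an earlier power status query,
    which is not shown here.
    """
    if not host_is_on:
        # "off" was reported to fail on a host that is already off,
        # so only power it on.
        return ["on"]
    if reset_ok:
        return ["reset"]
    return ["off", "on"]
```

With something like this the plugin could return a non-zero exit status whenever failed_statuses() is not empty, instead of reporting success unconditionally.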

Thanks,

Dejan

> Thanks,
>
> --
> Tyler Sutherland
>
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
