Thanks for looking into this further. I pulled down the 'tip' version of Linux-HA and copied the new ./lib/plugins/stonith/external/riloe into the system install path (a diff shows significant changes), then rebooted both nodes in this cluster and started the same test again: Node 1 loses its primary network connection to the LAN and is therefore unable to get status from, or connect to, the Stonith device (ILO) for Node 2.
The monitor process for riloe appears to time out, and it is still downhill from there. Here are the log entries from Node 1, which lost the network connection:

Nov 4 13:25:28 wwwlb01 kernel: bnx2: eth0 NIC Copper Link is Down
Nov 4 13:25:58 wwwlb01 lrmd: [8224]: WARN: cl_stonith_lb02:0:monitor process (PID 9213) timed out (try 1). Killing with signal SIGTERM (15).
Nov 4 13:25:58 wwwlb01 lrmd: [9213]: ERROR: stonithd_receive_ops_result failed.
Nov 4 13:25:58 wwwlb01 lrmd: [8224]: WARN: mapped the invalid return code 254.
Nov 4 13:25:58 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM operation cl_stonith_lb02:0_monitor_30000 (call=10, rc=1) complete
...
Nov 4 13:25:59 wwwlb01 crmd: [8227]: info: do_lrm_rsc_op: Performing op=cl_stonith_lb02:0_stop_0 key=5:3:0:1eb0bdb2-c828-4b6d-b712-cf7049c775df)
Nov 4 13:25:59 wwwlb01 lrmd: [8224]: info: rsc:cl_stonith_lb02:0: stop
...
Nov 4 13:25:59 wwwlb01 lrmd: [9898]: info: Try to stop STONITH resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
...
Nov 4 13:26:00 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM operation cl_stonith_lb02:0_monitor_30000 (call=10, rc=-2) Cancelled
Nov 4 13:26:00 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM operation cl_stonith_lb02:0_stop_0 (call=12, rc=0) complete
Nov 4 13:26:01 wwwlb01 crmd: [8227]: info: do_lrm_rsc_op: Performing op=cl_stonith_lb02:0_start_0 key=19:3:0:1eb0bdb2-c828-4b6d-b712-cf7049c775df)
Nov 4 13:26:01 wwwlb01 lrmd: [8224]: info: rsc:cl_stonith_lb02:0: start
Nov 4 13:26:01 wwwlb01 lrmd: [9902]: info: Try to start STONITH resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
Nov 4 13:26:01 wwwlb01 stonithd: [8225]: info: Cannot get parameter ilo_can_reset from StonithNVpair
Nov 4 13:26:01 wwwlb01 stonithd: [8225]: info: Cannot get parameter ilo_protocol from StonithNVpair
Nov 4 13:26:01 wwwlb01 stonithd: [8225]: info: Cannot get parameter ilo_powerdown_method from StonithNVpair
...
Nov 4 13:26:13 wwwlb01 stonithd: [9904]: info: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/riloe status' returned 256
Nov 4 13:26:13 wwwlb01 stonithd: [8225]: WARN: start cl_stonith_lb02:0 failed, because its hostlist is empty
Nov 4 13:26:13 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM operation cl_stonith_lb02:0_start_0 (call=13, rc=1) complete
Nov 4 13:26:14 wwwlb01 crmd: [8227]: info: do_lrm_rsc_op: Performing op=cl_stonith_lb02:0_stop_0 key=4:4:0:1eb0bdb2-c828-4b6d-b712-cf7049c775df)
Nov 4 13:26:14 wwwlb01 lrmd: [8224]: info: rsc:cl_stonith_lb02:0: stop
Nov 4 13:26:14 wwwlb01 lrmd: [9917]: info: Try to stop STONITH resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
Nov 4 13:26:14 wwwlb01 stonithd: [8225]: notice: try to stop a resource cl_stonith_lb02:0 who is not in started resource queue.
Nov 4 13:26:14 wwwlb01 crmd: [8227]: info: process_lrm_event: LRM operation cl_stonith_lb02:0_stop_0 (call=14, rc=0) complete
Nov 4 13:26:19 wwwlb01 cib: [8223]: info: cib_stats: Processed 44 operations (3409.00us average, 0% utilization) in the last 10min
Nov 4 13:27:34 wwwlb01 kernel: bnx2: eth0 NIC Copper Link is Up, 100 Mbps full duplex
Nov 4 13:27:35 wwwlb01 heartbeat: [5969]: info: Link wwwlb02.microcenter.com:eth0 up.

From playing with the riloe Python script, I suspect that the call to HTTPSConnection hangs and is later killed by lrmd. It looks like Python 2.6 added a timeout argument to the HTTPSConnection constructor, but the system is running 2.4.3, so I couldn't test it. I do see that the socket timeout can be set like this:

    socket.setdefaulttimeout(1)

I should add that my Python skills are very rusty.

I am trying to find out what the expected behavior should be when a start or monitor command times out. Should Stonith agents follow the OCF resource agent specs?
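To make the socket.setdefaulttimeout idea concrete, here is a minimal sketch of how a status check could bound the HTTPS connection attempt so that it fails on its own before lrmd has to kill it. It is not taken from the riloe plugin: the helper name https_reachable, the default port, and the timeout value are assumptions for illustration only. It is also written for a current Python 3 (http.client); on 2.4 the module is httplib and the try/except/finally would need to be split into nested blocks.

```python
import socket
from http.client import HTTPSConnection


def https_reachable(host, port=443, timeout=5.0):
    """Hypothetical helper: return True if an HTTPS connection to
    host:port can be established within `timeout` seconds.

    The process-wide default socket timeout is set for the duration of
    the attempt and restored afterwards, so a dead network path makes
    the call fail after `timeout` seconds instead of hanging.
    """
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(timeout)
    conn = HTTPSConnection(host, port)
    try:
        # connect() performs the TCP connect and the TLS handshake;
        # both are subject to the default socket timeout set above.
        conn.connect()
        return True
    except OSError:
        # Covers timeouts, refused connections, unreachable networks,
        # and TLS errors (including certificate verification failures).
        return False
    finally:
        conn.close()
        socket.setdefaulttimeout(previous)
```

A monitor or status action could call something like https_reachable("10.100.254.162", timeout=10) and return a non-zero exit code when it is False, keeping the timeout below the operation timeout configured in the CIB. Note that with default certificate verification a device presenting a self-signed certificate would also be reported as unreachable by this sketch.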
Thanks,
-ab

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Dejan Muhamedagic
Sent: Tuesday, November 04, 2008 11:26 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Question on ILO stonith resource config and restarting

On Thu, Oct 30, 2008 at 03:07:24PM -0400, Aaron Bush wrote:
> Just realized that I only included the log entries from the node that
> was not experiencing a network disconnect. Attached are the log entries
> from the node (01) that had the stonith resource running before the
> cable disconnect, and it looks like they provide some more useful
> information. Also included up through when the network cable was
> reconnected.

The monitor operation on riloe failed. You should definitely upgrade.

Thanks,

Dejan

>
> -ab
>
>
> >> I have a 0.6 pacemaker/heartbeat cluster setup in a lab with resources
> >> as follows:
> >>
> >> Group-lvs(ordered): two primitives -> ocf/IPaddr2 and ocf/ldirectord.
> >> Clone-pingd: set to monitor a couple of IPs and used to set a weight for
> >> where to run the LVS group.
> >>
> >> -- This is the area that I have a question on --
> >> Clone-stonith-node1: HP ILO to shoot node1
> >> Clone-stonith-node2: HP ILO to shoot node2
> >>
> >> I read on the old linux-ha site that using a clone for ILO/stonith was
> >> the way to go. I'm not sure I see how this would work correctly and be
> >> preferred over a standard resource. What I am confused about is this:
> >> the external/riloe stonith plugin only knows how to shoot one node so
> >
> >Please make sure that you use the latest edition of
> >external/riloe. The previous one didn't work under all
> >circumstances.
>
> I am using the version that came with heartbeat-common-2.99.0-3.1
> (according to rpm -qf)
>
> To clear my current issue where the stonith resource was not started
> (and since this is still in the lab) I have rebooted both nodes to start
> with a somewhat clean slate.
> I have attempted to grab some more useful information from the logs on
> why the resource is not restarting. Again I disconnected the LAN cable
> connecting a node to the rest of the network (a private HB channel is
> still available and the ILO is still up). I noticed these entries in
> the log:
>
> Oct 30 13:33:07 wwwlb02 crmd: [6415]: info: do_lrm_rsc_op: Performing op=cl_stonith_lb02:0_start_0 key=18:7:0:efbdb124-d51a-4228-80bc-7a9464d7971a)
> Oct 30 13:33:07 wwwlb02 lrmd: [6412]: info: rsc:cl_stonith_lb02:0: start
> Oct 30 13:33:07 wwwlb02 lrmd: [30788]: info: Try to start STONITH resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter ilo_can_reset from StonithNVpair
> Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter ilo_protocol from StonithNVpair
> Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter ilo_powerdown_method from StonithNVpair
> Oct 30 13:33:08 wwwlb02 heartbeat: [6202]: info: Link wwwlb01.microcenter.com:eth0 dead.
> Oct 30 13:33:08 wwwlb02 pingd: [8475]: notice: pingd_lstatus_callback: Status update: Ping node wwwlb01.microcenter.com now has status [dead]
> Oct 30 13:33:08 wwwlb02 pingd: [8475]: notice: pingd_nstatus_callback: Status update: Ping node wwwlb01.microcenter.com now has status [dead]
> Oct 30 13:33:12 wwwlb02 stonithd: [30790]: WARN: host list for cl_stonith_lb02:0 is empty, please fix your constraints
> Oct 30 13:33:12 wwwlb02 stonithd: [6413]: WARN: start cl_stonith_lb02:0 failed, because its hostlist is empty
> Oct 30 13:33:12 wwwlb02 crmd: [6415]: info: process_lrm_event: LRM operation cl_stonith_lb02:0_start_0 (call=12, rc=2) complete
> Oct 30 13:33:13 wwwlb02 lrmd: [6412]: info: rsc:cl_stonith_lb02:0: stop
> Oct 30 13:33:13 wwwlb02 stonithd: [6413]: notice: try to stop a resource cl_stonith_lb02:0 who is not in started resource queue.
> Oct 30 13:33:13 wwwlb02 crmd: [6415]: info: do_lrm_rsc_op: Performing op=cl_stonith_lb02:0_stop_0 key=1:8:0:efbdb124-d51a-4228-80bc-7a9464d7971a)
> Oct 30 13:33:13 wwwlb02 lrmd: [30842]: info: Try to stop STONITH resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
> Oct 30 13:33:13 wwwlb02 crmd: [6415]: info: process_lrm_event: LRM operation cl_stonith_lb02:0_stop_0 (call=13, rc=0) complete
>
> Looks like I should specify some additional nvpairs for the ILOs. The
> "host list is empty" WARN message is what looks bad to me. Here is the
> cib section for the clone resource and the cib constraint for this
> resource. Please let me know if there are any obvious errors in this
> configuration. This is the stonith resource that is to shoot the 02
> node, intended to run on the 01 node (the 01 node was the node that had
> a network cable disconnect).
>
> <clone id="cl_stonithset_lb02">
>   <meta_attributes id="cl_stonithset_lb02_meta_attrs">
>     <attributes>
>       <nvpair id="cl_stonithset_lb02_metaattr_target_role" name="target_role" value="started"/>
>       <nvpair id="cl_stonithset_lb02_metaattr_clone_max" name="clone_max" value="1"/>
>       <nvpair id="cl_stonithset_lb02_metaattr_clone_node_max" name="clone_node_max" value="1"/>
>     </attributes>
>   </meta_attributes>
>   <primitive id="cl_stonith_lb02" class="stonith" type="external/riloe" provider="heartbeat">
>     <instance_attributes id="cl_stonith_lb02_instance_attrs">
>       <attributes>
>         <nvpair id="76163fb5-05ea-4cff-9786-a817774d8224" name="hostlist" value="wwwlb02.microcenter.com"/>
>         <nvpair id="238e0158-81d3-48fd-879a-494c76d96b80" name="ilo_hostname" value="10.100.254.162"/>
>         <nvpair id="82de3d5d-6f96-44f0-b98f-6eea75704b33" name="ilo_user" value="Administrator"/>
>         <nvpair id="0fdef60a-fe62-4a0d-8f8f-d8da1d42082a" name="ilo_password" value="PASSWORD"/>
>       </attributes>
>     </instance_attributes>
>     <operations>
>       <op id="2a33ffe8-371f-4d08-a1ea-373135e85aeb" name="monitor" interval="30" timeout="20" start_delay="15" disabled="false" role="Started" on_fail="restart"/>
>       <op id="4694393c-e89b-4371-af1c-a60d7f305e2f" name="start" timeout="20" start_delay="0" disabled="false" role="Started" on_fail="restart"/>
>     </operations>
>     <meta_attributes id="cl_stonith_lb02:0_meta_attrs">
>       <attributes>
>         <nvpair id="cl_stonith_lb02:0_metaattr_target_role" name="target_role" value="started"/>
>       </attributes>
>     </meta_attributes>
>   </primitive>
> </clone>
>
> <constraints>
>   <rsc_location id="location_on_lb01" rsc="cl_stonithset_lb02">
>     <rule id="prefered_location_on_lb01" score="INFINITY">
>       <expression attribute="#uname" id="c9e30917-97e2-4c35-86e7-9df6c7abc497" operation="eq" value="wwwlb01.microcenter.com"/>
>     </rule>
>   </rsc_location>
> </constraints>
>
> Thanks,
> -ab
>
> _______________________________________________
> Pacemaker mailing list
> Pacemaker@clusterlabs.org
> http://list.clusterlabs.org/mailman/listinfo/pacemaker