Just realized that I only included the log entries from the node that was not experiencing the network disconnect. Attached are the log entries from the node (01) that had the stonith resource running before the cable disconnect; they look like they provide some more useful information. I also included everything up through when the network cable was reconnected.
-ab

>> I have a 0.6 pacemaker/heartbeat cluster setup in a lab with resources
>> as follows:
>>
>> Group-lvs (ordered): two primitives -> ocf/IPaddr2 and ocf/ldirectord.
>> Clone-pingd: set to monitor a couple of IPs and used to set a weight
>> for where to run the LVS group.
>>
>> -- This is the area that I have a question on --
>> Clone-stonith-node1: HP iLO to shoot node1
>> Clone-stonith-node2: HP iLO to shoot node2
>>
>> I read on the old linux-ha site that using a clone for iLO/stonith was
>> the way to go. I'm not sure I see how this would work correctly and be
>> preferred over a standard resource. What I am confused about is this:
>> the external/riloe stonith plugin only knows how to shoot one node so
>
> Please make sure that you use the latest edition of
> external/riloe. The previous one didn't work under all
> circumstances.

I am using the version that came with heartbeat-common-2.99.0-3.1
(according to rpm -qf).

To clear my current issue where the stonith resource was not started
(and since this is still in the lab), I have rebooted both nodes to
start with a somewhat clean slate. I have attempted to grab some more
useful information from the logs on why the resource is not restarting.
Again I disconnected the LAN cable connecting a node to the rest of the
network (a private HB channel is still available and the iLO is still
up). I noticed these entries in the log:

Oct 30 13:33:07 wwwlb02 crmd: [6415]: info: do_lrm_rsc_op: Performing op=cl_stonith_lb02:0_start_0 key=18:7:0:efbdb124-d51a-4228-80bc-7a9464d7971a)
Oct 30 13:33:07 wwwlb02 lrmd: [6412]: info: rsc:cl_stonith_lb02:0: start
Oct 30 13:33:07 wwwlb02 lrmd: [30788]: info: Try to start STONITH resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter ilo_can_reset from StonithNVpair
Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter ilo_protocol from StonithNVpair
Oct 30 13:33:07 wwwlb02 stonithd: [6413]: info: Cannot get parameter ilo_powerdown_method from StonithNVpair
Oct 30 13:33:08 wwwlb02 heartbeat: [6202]: info: Link wwwlb01.microcenter.com:eth0 dead.
Oct 30 13:33:08 wwwlb02 pingd: [8475]: notice: pingd_lstatus_callback: Status update: Ping node wwwlb01.microcenter.com now has status [dead]
Oct 30 13:33:08 wwwlb02 pingd: [8475]: notice: pingd_nstatus_callback: Status update: Ping node wwwlb01.microcenter.com now has status [dead]
Oct 30 13:33:12 wwwlb02 stonithd: [30790]: WARN: host list for cl_stonith_lb02:0 is empty, please fix your constraints
Oct 30 13:33:12 wwwlb02 stonithd: [6413]: WARN: start cl_stonith_lb02:0 failed, because its hostlist is empty
Oct 30 13:33:12 wwwlb02 crmd: [6415]: info: process_lrm_event: LRM operation cl_stonith_lb02:0_start_0 (call=12, rc=2) complete
Oct 30 13:33:13 wwwlb02 lrmd: [6412]: info: rsc:cl_stonith_lb02:0: stop
Oct 30 13:33:13 wwwlb02 stonithd: [6413]: notice: try to stop a resource cl_stonith_lb02:0 who is not in started resource queue.
Oct 30 13:33:13 wwwlb02 crmd: [6415]: info: do_lrm_rsc_op: Performing op=cl_stonith_lb02:0_stop_0 key=1:8:0:efbdb124-d51a-4228-80bc-7a9464d7971a)
Oct 30 13:33:13 wwwlb02 lrmd: [30842]: info: Try to stop STONITH resource <rsc_id=cl_stonith_lb02:0> : Device=external/riloe
Oct 30 13:33:13 wwwlb02 crmd: [6415]: info: process_lrm_event: LRM operation cl_stonith_lb02:0_stop_0 (call=13, rc=0) complete

Looks like I should specify some additional nvpairs for the iLOs. The
"WARN: host list ... is empty" message is what looks bad to me.
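Based on those "Cannot get parameter" messages, I assume the plugin is
looking for something like the nvpairs below in the resource's
instance_attributes. The parameter names are taken straight from the
stonithd log above, but the values (and the nvpair ids) are just my
guesses at what an iLO2 would want, so treat this as a sketch rather
than a known-good config:

  <!-- guessed values; ids are mine, names come from the log above -->
  <nvpair id="cl_stonith_lb02_ilo_can_reset" name="ilo_can_reset" value="1"/>
  <nvpair id="cl_stonith_lb02_ilo_protocol" name="ilo_protocol" value="2.0"/>
  <nvpair id="cl_stonith_lb02_ilo_powerdown_method" name="ilo_powerdown_method" value="power"/>

If that's right, those would go inside the <attributes> element
alongside hostlist, ilo_hostname, ilo_user, and ilo_password in the
config below.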
Here is the cib section for the clone resource and the cib constraint
for this resource. Please let me know if there are any obvious errors in
this configuration. This is the stonith resource that is to shoot the 02
node, intended to run on the 01 node (the 01 node was the one whose
network cable was disconnected).

<clone id="cl_stonithset_lb02">
  <meta_attributes id="cl_stonithset_lb02_meta_attrs">
    <attributes>
      <nvpair id="cl_stonithset_lb02_metaattr_target_role" name="target_role" value="started"/>
      <nvpair id="cl_stonithset_lb02_metaattr_clone_max" name="clone_max" value="1"/>
      <nvpair id="cl_stonithset_lb02_metaattr_clone_node_max" name="clone_node_max" value="1"/>
    </attributes>
  </meta_attributes>
  <primitive id="cl_stonith_lb02" class="stonith" type="external/riloe" provider="heartbeat">
    <instance_attributes id="cl_stonith_lb02_instance_attrs">
      <attributes>
        <nvpair id="76163fb5-05ea-4cff-9786-a817774d8224" name="hostlist" value="wwwlb02.microcenter.com"/>
        <nvpair id="238e0158-81d3-48fd-879a-494c76d96b80" name="ilo_hostname" value="10.100.254.162"/>
        <nvpair id="82de3d5d-6f96-44f0-b98f-6eea75704b33" name="ilo_user" value="Administrator"/>
        <nvpair id="0fdef60a-fe62-4a0d-8f8f-d8da1d42082a" name="ilo_password" value="PASSWORD"/>
      </attributes>
    </instance_attributes>
    <operations>
      <op id="2a33ffe8-371f-4d08-a1ea-373135e85aeb" name="monitor" interval="30" timeout="20" start_delay="15" disabled="false" role="Started" on_fail="restart"/>
      <op id="4694393c-e89b-4371-af1c-a60d7f305e2f" name="start" timeout="20" start_delay="0" disabled="false" role="Started" on_fail="restart"/>
    </operations>
    <meta_attributes id="cl_stonith_lb02:0_meta_attrs">
      <attributes>
        <nvpair id="cl_stonith_lb02:0_metaattr_target_role" name="target_role" value="started"/>
      </attributes>
    </meta_attributes>
  </primitive>
</clone>

<constraints>
  <rsc_location id="location_on_lb01" rsc="cl_stonithset_lb02">
    <rule id="prefered_location_on_lb01" score="INFINITY">
      <expression attribute="#uname" id="c9e30917-97e2-4c35-86e7-9df6c7abc497" operation="eq" value="wwwlb01.microcenter.com"/>
    </rule>
  </rsc_location>
</constraints>

Thanks,
-ab
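P.S. For comparison, the standard (non-clone) variant I asked about
above would, as far as I can tell, just be the same primitive without
the clone wrapper, plus the same location rule pointing it at the 01
node. Something like this (I shortened the ids myself; the attribute
names and values are unchanged from the config above, and the extra
iLO nvpairs would presumably go in the same place):

  <primitive id="st_lb02" class="stonith" type="external/riloe" provider="heartbeat">
    <instance_attributes id="st_lb02_instance_attrs">
      <attributes>
        <nvpair id="st_lb02_hostlist" name="hostlist" value="wwwlb02.microcenter.com"/>
        <nvpair id="st_lb02_ilo_hostname" name="ilo_hostname" value="10.100.254.162"/>
        <nvpair id="st_lb02_ilo_user" name="ilo_user" value="Administrator"/>
        <nvpair id="st_lb02_ilo_password" name="ilo_password" value="PASSWORD"/>
      </attributes>
    </instance_attributes>
  </primitive>

  <rsc_location id="location_st_lb02_on_lb01" rsc="st_lb02">
    <rule id="st_lb02_prefers_lb01" score="INFINITY">
      <expression attribute="#uname" id="st_lb02_uname_expr" operation="eq" value="wwwlb01.microcenter.com"/>
    </rule>
  </rsc_location>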
Attachment: log.gz
_______________________________________________ Pacemaker mailing list Pacemaker@clusterlabs.org http://list.clusterlabs.org/mailman/listinfo/pacemaker