On 22/05/2013, at 9:00 PM, John McCabe <j...@johnmccabe.net> wrote:

> No joy with ipport sadly
>
> <nvpair id="st-rhevm-instance_attributes-ipport" name="ipport" value="443"/>
> <nvpair id="st-rhevm-instance_attributes-shell_timeout" name="shell_timeout" value="10"/>
>
> Can you share the changes you made to fence_rhevm for the API change? I've got what *should* be the latest packages from the HA channel on both systems.
>
> On Wed, May 22, 2013 at 11:34 AM, Andrew Beekhof <and...@beekhof.net> wrote:
>
> On 22/05/2013, at 7:31 PM, John McCabe <j...@johnmccabe.net> wrote:
>
> > Hi,
> > I've been trying to get fence_rhevm (fence-agents-3.1.5-25.el6_4.2.x86_64) working within pacemaker (pacemaker-1.1.8-7.el6.x86_64) but am unable to get it to work as intended. Using fence_rhevm on the command line works as expected, as does stonith_admin, but from within pacemaker (triggered by deliberately killing corosync on the node to be fenced) it fails:
> >
> > May 21 22:21:32 defiant corosync[1245]: [TOTEM ] A processor failed, forming new configuration.
> > May 21 22:21:34 defiant corosync[1245]: [QUORUM] Members[1]: 1
> > May 21 22:21:34 defiant corosync[1245]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> > May 21 22:21:34 defiant kernel: dlm: closing connection to node 2
> > May 21 22:21:34 defiant corosync[1245]: [CPG ] chosen downlist: sender r(0) ip(10.10.25.152) ; members(old:2 left:1)
> > May 21 22:21:34 defiant corosync[1245]: [MAIN ] Completed service synchronization, ready to provide service.
> > May 21 22:21:34 defiant crmd[1749]: notice: crm_update_peer_state: cman_event_callback: Node enterprise[2] - state is now lost
> > May 21 22:21:34 defiant crmd[1749]: warning: match_down_event: No match for shutdown action on enterprise
> > May 21 22:21:34 defiant crmd[1749]: notice: peer_update_callback: Stonith/shutdown of enterprise not matched
> > May 21 22:21:34 defiant crmd[1749]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
> > May 21 22:21:34 defiant fenced[1302]: fencing node enterprise
> > May 21 22:21:34 defiant logger: fence_pcmk[2219]: Requesting Pacemaker fence enterprise (reset)
> > May 21 22:21:34 defiant stonith_admin[2220]: notice: crm_log_args: Invoked: stonith_admin --reboot enterprise --tolerance 5s
> > May 21 22:21:35 defiant attrd[1747]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
> > May 21 22:21:35 defiant attrd[1747]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> > May 21 22:21:36 defiant pengine[1748]: notice: unpack_config: On loss of CCM Quorum: Ignore
> > May 21 22:21:36 defiant pengine[1748]: notice: process_pe_message: Calculated Transition 64: /var/lib/pacemaker/pengine/pe-input-60.bz2
> > May 21 22:21:36 defiant crmd[1749]: notice: run_graph: Transition 64 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-60.bz2): Complete
> > May 21 22:21:36 defiant crmd[1749]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> > May 21 22:21:44 defiant logger: fence_pcmk[2219]: Call to fence enterprise (reset) failed with rc=255
> > May 21 22:21:45 defiant fenced[1302]: fence enterprise dev 0.0 agent fence_pcmk result: error from agent
> > May 21 22:21:45 defiant fenced[1302]: fence enterprise failed
> > May 21 22:21:48 defiant fenced[1302]: fencing node enterprise
> > May 21 22:21:48 defiant logger: fence_pcmk[2239]: Requesting Pacemaker fence enterprise (reset)
> > May 21 22:21:48 defiant stonith_admin[2240]: notice: crm_log_args: Invoked: stonith_admin --reboot enterprise --tolerance 5s
> > May 21 22:21:58 defiant logger: fence_pcmk[2239]: Call to fence enterprise (reset) failed with rc=255
> > May 21 22:21:58 defiant fenced[1302]: fence enterprise dev 0.0 agent fence_pcmk result: error from agent
> > May 21 22:21:58 defiant fenced[1302]: fence enterprise failed
> > May 21 22:22:01 defiant fenced[1302]: fencing node enterprise
> >
> > and with corosync.log showing "warning: match_down_event: No match for shutdown action on enterprise", "notice: peer_update_callback: Stonith/shutdown of enterprise not matched"
> >
> > May 21 22:21:32 corosync [TOTEM ] A processor failed, forming new configuration.
> > May 21 22:21:34 corosync [QUORUM] Members[1]: 1
> > May 21 22:21:34 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> > May 21 22:21:34 [1749] defiant crmd: info: cman_event_callback: Membership 296: quorum retained
> > May 21 22:21:34 [1744] defiant cib: info: pcmk_cpg_membership: Left[5.0] cib.2
> > May 21 22:21:34 [1744] defiant cib: info: crm_update_peer_proc: pcmk_cpg_membership: Node enterprise[2] - corosync-cpg is now offline
> > May 21 22:21:34 [1744] defiant cib: info: pcmk_cpg_membership: Member[5.0] cib.1
> > May 21 22:21:34 [1745] defiant stonith-ng: info: pcmk_cpg_membership: Left[5.0] stonith-ng.2
> > May 21 22:21:34 [1745] defiant stonith-ng: info: crm_update_peer_proc: pcmk_cpg_membership: Node enterprise[2] - corosync-cpg is now offline
> > May 21 22:21:34 corosync [CPG ] chosen downlist: sender r(0) ip(10.10.25.152) ; members(old:2 left:1)
> > May 21 22:21:34 corosync [MAIN ] Completed service synchronization, ready to provide service.
> > May 21 22:21:34 [1745] defiant stonith-ng: info: pcmk_cpg_membership: Member[5.0] stonith-ng.1
> > May 21 22:21:34 [1749] defiant crmd: notice: crm_update_peer_state: cman_event_callback: Node enterprise[2] - state is now lost
> > May 21 22:21:34 [1749] defiant crmd: info: peer_update_callback: enterprise is now lost (was member)
> > May 21 22:21:34 [1744] defiant cib: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/150, version=0.22.3): OK (rc=0)
> > May 21 22:21:34 [1749] defiant crmd: info: pcmk_cpg_membership: Left[5.0] crmd.2
> > May 21 22:21:34 [1749] defiant crmd: info: crm_update_peer_proc: pcmk_cpg_membership: Node enterprise[2] - corosync-cpg is now offline
> > May 21 22:21:34 [1749] defiant crmd: info: peer_update_callback: Client enterprise/peer now has status [offline] (DC=true)
> > May 21 22:21:34 [1749] defiant crmd: warning: match_down_event: No match for shutdown action on enterprise
> > May 21 22:21:34 [1749] defiant crmd: notice: peer_update_callback: Stonith/shutdown of enterprise not matched
> > May 21 22:21:34 [1749] defiant crmd: info: crm_update_peer_expected: peer_update_callback: Node enterprise[2] - expected state is now down
> > May 21 22:21:34 [1749] defiant crmd: info: abort_transition_graph: peer_update_callback:211 - Triggered transition abort (complete=1) : Node failure
> > May 21 22:21:34 [1749] defiant crmd: info: pcmk_cpg_membership: Member[5.0] crmd.1
> > May 21 22:21:34 [1749] defiant crmd: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
> > May 21 22:21:34 [1749] defiant crmd: info: abort_transition_graph: do_te_invoke:163 - Triggered transition abort (complete=1) : Peer Halt
> > May 21 22:21:34 [1749] defiant crmd: info: join_make_offer: Making join offers based on membership 296
> > May 21 22:21:34 [1749] defiant crmd: info: do_dc_join_offer_all: join-7: Waiting on 1 outstanding join acks
> > May 21 22:21:34 [1749] defiant crmd: info: update_dc: Set DC to defiant (3.0.7)
> > May 21 22:21:34 [1749] defiant crmd: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
> > May 21 22:21:34 [1749] defiant crmd: info: do_dc_join_finalize: join-7: Syncing the CIB from defiant to the rest of the cluster
> > May 21 22:21:34 [1744] defiant cib: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/154, version=0.22.5): OK (rc=0)
> > May 21 22:21:34 [1744] defiant cib: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/155, version=0.22.6): OK (rc=0)
> > May 21 22:21:34 [1749] defiant crmd: info: stonith_action_create: Initiating action metadata for agent fence_rhevm (target=(null))
> > May 21 22:21:35 [1749] defiant crmd: info: do_dc_join_ack: join-7: Updating node state to member for defiant
> > May 21 22:21:35 [1749] defiant crmd: info: erase_status_tag: Deleting xpath: //node_state[@uname='defiant']/lrm
> > May 21 22:21:35 [1744] defiant cib: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='defiant']/lrm (origin=local/crmd/156, version=0.22.7): OK (rc=0)
> > May 21 22:21:35 [1749] defiant crmd: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
> > May 21 22:21:35 [1749] defiant crmd: info: abort_transition_graph: do_te_invoke:156 - Triggered transition abort (complete=1) : Peer Cancelled
> > May 21 22:21:35 [1747] defiant attrd: notice: attrd_local_callback: Sending full refresh (origin=crmd)
> > May 21 22:21:35 [1747] defiant attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> > May 21 22:21:35 [1744] defiant cib: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/158, version=0.22.9): OK (rc=0)
> > May 21 22:21:35 [1744] defiant cib: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/160, version=0.22.11): OK (rc=0)
> > May 21 22:21:36 [1748] defiant pengine: info: unpack_config: Startup probes: enabled
> > May 21 22:21:36 [1748] defiant pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
> > May 21 22:21:36 [1748] defiant pengine: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> > May 21 22:21:36 [1748] defiant pengine: info: unpack_domains: Unpacking domains
> > May 21 22:21:36 [1748] defiant pengine: info: determine_online_status_fencing: Node defiant is active
> > May 21 22:21:36 [1748] defiant pengine: info: determine_online_status: Node defiant is online
> > May 21 22:21:36 [1748] defiant pengine: info: native_print: st-rhevm (stonith:fence_rhevm): Started defiant
> > May 21 22:21:36 [1748] defiant pengine: info: LogActions: Leave st-rhevm (Started defiant)
> > May 21 22:21:36 [1748] defiant pengine: notice: process_pe_message: Calculated Transition 64: /var/lib/pacemaker/pengine/pe-input-60.bz2
> > May 21 22:21:36 [1749] defiant crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> > May 21 22:21:36 [1749] defiant crmd: info: do_te_invoke: Processing graph 64 (ref=pe_calc-dc-1369171296-118) derived from /var/lib/pacemaker/pengine/pe-input-60.bz2
> > May 21 22:21:36 [1749] defiant crmd: notice: run_graph: Transition 64 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-60.bz2): Complete
> > May 21 22:21:36 [1749] defiant crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> >
> > I can get the node enterprise to fence as expected from the command line with:
> >
> > stonith_admin --reboot enterprise --tolerance 5s
> >
> > fence_rhevm -o reboot -a <hypervisor ip> -l <user>@<domain> -p <password> -n enterprise -z
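For anyone reproducing this, a non-destructive sketch of checking the same two layers separately; the address, credentials and node name are placeholders standing in for the redacted values above, and stonith_admin's --list option is assumed to be available in this build:

# Ask the agent directly for the VM's power state, without rebooting anything:
fence_rhevm -o status -a <hypervisor ip> -l <user>@<domain> -p <password> -n enterprise -z

# Ask pacemaker's stonith layer which registered devices claim to be able to fence the node:
stonith_admin --list enterprise

If the first command succeeds but the second returns nothing, the agent itself is likely fine and the problem is in how the stonith resource is registered with pacemaker.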
I must have skipped over this last time... The first batch of logs do not show it working though:

> May 21 22:21:58 defiant fenced[1302]: fence enterprise dev 0.0 agent fence_pcmk result: error from agent

Is that from a manual invocation or after kill -9?
Assuming the latter, it would seem this is a pacemaker issue, not an agent issue.

I also just confirmed that I have the same version as you: fence-agents-3.1.5-25.el6.x86_64

Are you logging to a file as well as syslog? If so that file would be very useful to have (see http://blog.clusterlabs.org/blog/2013/pacemaker-logging/ if you're not :-)
Also /var/lib/pacemaker/pengine/pe-input-60.bz2 from defiant will be needed (a short sketch of both steps follows the quoted config below).

For those playing along at home, this is the "Did the crmd fail to perform recovery?" case in http://blog.clusterlabs.org/blog/2013/debugging-pacemaker/ :)

> > My config is as follows:
> >
> > cluster.conf -----------------------------------
> >
> > <?xml version="1.0"?>
> > <cluster config_version="1" name="cluster">
> >   <logging debug="off"/>
> >   <clusternodes>
> >     <clusternode name="defiant" nodeid="1">
> >       <fence>
> >         <method name="pcmk-redirect">
> >           <device name="pcmk" port="defiant"/>
> >         </method>
> >       </fence>
> >     </clusternode>
> >     <clusternode name="enterprise" nodeid="2">
> >       <fence>
> >         <method name="pcmk-redirect">
> >           <device name="pcmk" port="enterprise"/>
> >         </method>
> >       </fence>
> >     </clusternode>
> >   </clusternodes>
> >   <fencedevices>
> >     <fencedevice name="pcmk" agent="fence_pcmk"/>
> >   </fencedevices>
> >   <cman two_node="1" expected_votes="1">
> >   </cman>
> > </cluster>
> >
> > pacemaker cib ---------------------------------
> >
> > Stonith device created with:
> >
> > pcs stonith create st-rhevm fence_rhevm login="<user>@<domain>" passwd="<password>" ssl=1 ipaddr="<hypervisor ip>" verbose=1 debug="/tmp/debug.log"
> >
> > <cib epoch="18" num_updates="88" admin_epoch="0" validate-with="pacemaker-1.2" update-origin="defiant" update-client="cibadmin" cib-last-written="Tue May 21 07:55:31 2013" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="defiant">
> >   <configuration>
> >     <crm_config>
> >       <cluster_property_set id="cib-bootstrap-options">
> >         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.8-7.el6-394e906"/>
> >         <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="cman"/>
> >         <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
> >         <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="true"/>
> >       </cluster_property_set>
> >     </crm_config>
> >     <nodes>
> >       <node id="defiant" uname="defiant"/>
> >       <node id="enterprise" uname="enterprise"/>
> >     </nodes>
> >     <resources>
> >       <primitive class="stonith" id="st-rhevm" type="fence_rhevm">
> >         <instance_attributes id="st-rhevm-instance_attributes">
> >           <nvpair id="st-rhevm-instance_attributes-login" name="login" value="<user>@<domain>"/>
> >           <nvpair id="st-rhevm-instance_attributes-passwd" name="passwd" value="<password>"/>
> >           <nvpair id="st-rhevm-instance_attributes-debug" name="debug" value="/tmp/debug.log"/>
> >           <nvpair id="st-rhevm-instance_attributes-ssl" name="ssl" value="1"/>
> >           <nvpair id="st-rhevm-instance_attributes-verbose" name="verbose" value="1"/>
> >           <nvpair id="st-rhevm-instance_attributes-ipaddr" name="ipaddr" value="<hypervisor ip>"/>
> >         </instance_attributes>
> >       </primitive>
>
> Mine is:
>
> <primitive id="Fencing" class="stonith" type="fence_rhevm">
>   <instance_attributes id="Fencing-params">
>     <nvpair id="Fencing-ipport" name="ipport" value="443"/>
>     <nvpair id="Fencing-shell_timeout" name="shell_timeout" value="10"/>
>     <nvpair id="Fencing-passwd" name="passwd" value="{pass}"/>
>     <nvpair id="Fencing-ipaddr" name="ipaddr" value="{ip}"/>
>     <nvpair id="Fencing-ssl" name="ssl" value="1"/>
>     <nvpair id="Fencing-login" name="login" value="{user}@{domain}"/>
>   </instance_attributes>
>   <operations>
>     <op id="Fencing-monitor-120s" interval="120s" name="monitor" timeout="120s"/>
>     <op id="Fencing-stop-0" interval="0" name="stop" timeout="60s"/>
>     <op id="Fencing-start-0" interval="0" name="start" timeout="60s"/>
>   </operations>
> </primitive>
>
> Maybe ipport is important?
> Also, there was a RHEVM API change recently, I had to update the fence_rhevm agent before it would work again.
>
> >     </resources>
> >     <constraints/>
> >   </configuration>
> >   <status>
> >     <node_state id="defiant" uname="defiant" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
> >       <transient_attributes id="defiant">
> >         <instance_attributes id="status-defiant">
> >           <nvpair id="status-defiant-probe_complete" name="probe_complete" value="true"/>
> >         </instance_attributes>
> >       </transient_attributes>
> >       <lrm id="defiant">
> >         <lrm_resources>
> >           <lrm_resource id="st-rhevm" type="fence_rhevm" class="stonith">
> >             <lrm_rsc_op id="st-rhevm_last_0" operation_key="st-rhevm_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.7" transition-key="2:1:0:1e7972e8-6f9a-4325-b9c3-3d7e2950d996" transition-magic="0:0;2:1:0:1e7972e8-6f9a-4325-b9c3-3d7e2950d996" call-id="14" rc-code="0" op-status="0" interval="0" last-run="1369119332" last-rc-change="0" exec-time="232" queue-time="0" op-digest="3bc7e1ce413fe37998a289f77f85d159"/>
> >           </lrm_resource>
> >         </lrm_resources>
> >       </lrm>
> >     </node_state>
> >     <node_state id="enterprise" uname="enterprise" in_ccm="true" crmd="online" crm-debug-origin="do_update_resource" join="member" expected="member">
> >       <lrm id="enterprise">
> >         <lrm_resources>
> >           <lrm_resource id="st-rhevm" type="fence_rhevm" class="stonith">
> >             <lrm_rsc_op id="st-rhevm_last_0" operation_key="st-rhevm_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" transition-key="5:59:7:8170c498-f66b-4974-b3c0-c17eb45ba5cb" transition-magic="0:7;5:59:7:8170c498-f66b-4974-b3c0-c17eb45ba5cb" call-id="5" rc-code="7" op-status="0" interval="0" last-run="1369170800" last-rc-change="0" exec-time="4" queue-time="0" op-digest="3bc7e1ce413fe37998a289f77f85d159"/>
> >           </lrm_resource>
> >         </lrm_resources>
> >       </lrm>
> >       <transient_attributes id="enterprise">
> >         <instance_attributes id="status-enterprise">
> >           <nvpair id="status-enterprise-probe_complete" name="probe_complete" value="true"/>
> >         </instance_attributes>
> >       </transient_attributes>
> >     </node_state>
> >   </status>
> > </cib>
> >
> > The debug log output from fence_rhevm doesn't appear to show pacemaker trying to request the reboot, only a vms command sent to the hypervisor which responds with xml listing the VMs.
> >
> > I can't quite see why it's failing. Are you aware of any issues with fence_rhevm (fence-agents-3.1.5-25.el6_4.2.x86_64) not working with pacemaker (pacemaker-1.1.8-7.el6.x86_64) on RHEL6.4?
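A minimal sketch of the two debugging steps requested above, assuming the usual RHEL 6 locations; the log path and the use of crm_simulate are illustrative rather than taken from this thread:

# /etc/sysconfig/pacemaker: log to a dedicated file in addition to syslog
# (restart pacemaker afterwards; PCMK_debug is optional but useful here)
PCMK_logfile=/var/log/pacemaker.log
PCMK_debug=yes

# Replay the transition calculated when the node was lost, to see whether
# a fencing action was ever scheduled for it:
crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-60.bz2

The interesting question for the "Did the crmd fail to perform recovery?" case is whether that transition contains any stonith operation for enterprise at all.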
> > All the best,
> > /John

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org