Hello,

On 10/18/2011 02:59 PM, Brian J. Murrell wrote:
> I have a pacemaker 1.0.10 installation on rhel5 but I can't seem to
> manage to get a working stonith configuration. I have tested my stonith
> device manually using the stonith command and it works fine. What
> doesn't seem to be happening is pacemaker/stonithd actually asking for a
> stonith. In my log I get:
>
> Oct 18 08:54:23 mds1 stonithd: [4645]: ERROR: Failed to STONITH the node oss1: optype=RESET, op_result=TIMEOUT
> Oct 18 08:54:23 mds1 crmd: [4650]: info: tengine_stonith_callback: call=-975, optype=1, node_name=oss1, result=2, node_list=, action=17:1023:0:4e12e206-e0be-4915-bfb8-b4e052057f01
> Oct 18 08:54:23 mds1 crmd: [4650]: ERROR: tengine_stonith_callback: Stonith of oss1 failed (2)... aborting transition.
> Oct 18 08:54:23 mds1 crmd: [4650]: info: abort_transition_graph: tengine_stonith_callback:402 - Triggered transition abort (complete=0) : Stonith failed
> Oct 18 08:54:23 mds1 crmd: [4650]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
> Oct 18 08:54:23 mds1 crmd: [4650]: info: update_abort_priority: Abort action done superceeded by restart
> Oct 18 08:54:23 mds1 crmd: [4650]: info: run_graph: ====================================================
> Oct 18 08:54:23 mds1 crmd: [4650]: notice: run_graph: Transition 1023 (Complete=2, Pending=0, Fired=0, Skipped=7, Incomplete=0, Source=/var/lib/pengine/pe-warn-5799.bz2): Stopped
> Oct 18 08:54:23 mds1 crmd: [4650]: info: te_graph_trigger: Transition 1023 is now complete
> Oct 18 08:54:23 mds1 crmd: [4650]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
> Oct 18 08:54:23 mds1 crmd: [4650]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
> Oct 18 08:54:23 mds1 crmd: [4650]: info: do_pe_invoke: Query 1307: Requesting the current CIB: S_POLICY_ENGINE
> Oct 18 08:54:23 mds1 crmd: [4650]: info: do_pe_invoke_callback: Invoking the PE: query=1307, ref=pe_calc-dc-1318942463-1164, seq=16860, quorate=0
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 18 08:54:23 mds1 pengine: [4649]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: pe_fence_node: Node oss1 will be fenced because it is un-expectedly down
> Oct 18 08:54:23 mds1 pengine: [4649]: info: determine_online_status_fencing: #011ha_state=active, ccm_state=false, crm_state=online, join_state=pending, expected=member
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: determine_online_status: Node oss1 is unclean
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: pe_fence_node: Node mds2 will be fenced because it is un-expectedly down
> Oct 18 08:54:23 mds1 pengine: [4649]: info: determine_online_status_fencing: #011ha_state=active, ccm_state=false, crm_state=online, join_state=pending, expected=member
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: determine_online_status: Node mds2 is unclean
> Oct 18 08:54:23 mds1 pengine: [4649]: info: determine_online_status_fencing: Node oss2 is down
> Oct 18 08:54:23 mds1 pengine: [4649]: info: determine_online_status: Node mds1 is online
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: native_print: MGS_2#011(ocf::hydra:Target):#011Started mds1
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: native_print: testfs-MDT0000_3#011(ocf::hydra:Target):#011Started mds2
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: native_print: testfs-OST0000_4#011(ocf::hydra:Target):#011Started oss1
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: clone_print: Clone Set: fencing
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: short_print: Stopped: [ st-pm:0 st-pm:1 st-pm:2 st-pm:3 ]
>
> Oct 18 08:54:23 mds1 pengine: [4649]: info: get_failcount: testfs-MDT0000_3 has failed 10 times on mds1
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: common_apply_stickiness: testfs-MDT0000_3 can fail 999990 more times on mds1 before being forced off
> Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource testfs-OST0000_4 cannot run anywhere
> Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource st-pm:0 cannot run anywhere
> Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource st-pm:1 cannot run anywhere
> Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource st-pm:2 cannot run anywhere
> Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource st-pm:3 cannot run anywhere
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: custom_action: Action testfs-MDT0000_3_stop_0 on mds2 is unrunnable (offline)
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: custom_action: Marking node mds2 unclean
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: RecurringOp: Start recurring monitor (120s) for testfs-MDT0000_3 on mds1
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: custom_action: Action testfs-OST0000_4_stop_0 on oss1 is unrunnable (offline)
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: custom_action: Marking node oss1 unclean
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: stage6: Scheduling Node oss1 for STONITH
> Oct 18 08:54:23 mds1 pengine: [4649]: info: native_stop_constraints: testfs-OST0000_4_stop_0 is implicit after oss1 is fenced
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: stage6: Scheduling Node mds2 for STONITH
> Oct 18 08:54:23 mds1 pengine: [4649]: info: native_stop_constraints: testfs-MDT0000_3_stop_0 is implicit after mds2 is fenced
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource MGS_2#011(Started mds1)
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Move resource testfs-MDT0000_3#011(Started mds2 -> mds1)
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Stop resource testfs-OST0000_4#011(oss1)
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource st-pm:0#011(Stopped)
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource st-pm:1#011(Stopped)
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource st-pm:2#011(Stopped)
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource st-pm:3#011(Stopped)
None of your fencing clones is running.

> Oct 18 08:54:23 mds1 crmd: [4650]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Oct 18 08:54:23 mds1 crmd: [4650]: info: unpack_graph: Unpacked transition 1024: 9 actions in 9 synapses
> Oct 18 08:54:23 mds1 crmd: [4650]: info: do_te_invoke: Processing graph 1024 (ref=pe_calc-dc-1318942463-1164) derived from /var/lib/pengine/pe-warn-5800.bz2
> Oct 18 08:54:23 mds1 crmd: [4650]: info: te_pseudo_action: Pseudo action 15 fired and confirmed
> Oct 18 08:54:23 mds1 crmd: [4650]: info: te_fence_node: Executing reboot fencing operation (17) on oss1 (timeout=60000)
> Oct 18 08:54:23 mds1 stonithd: [4645]: info: client tengine [pid: 4650] requests a STONITH operation RESET on node oss1
> Oct 18 08:54:23 mds1 stonithd: [4645]: info: we can't manage oss1, broadcast request to other nodes
> Oct 18 08:54:23 mds1 stonithd: [4645]: info: Broadcasting the message succeeded: require others to stonith node oss1.
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: process_pe_message: Transition 1024: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-5800.bz2
> Oct 18 08:54:23 mds1 pengine: [4649]: info: process_pe_message: Configuration WARNINGs found during PE processing. Please run "crm_verify -L" to identify issues.
>
> My configuration is:
>
> # crm configure show
> node mds1
> node mds2
> node oss1
> node oss2
> primitive MGS_2 ocf:hydra:Target \
>         meta target-role="Started" \
>         operations $id="MGS_2-operations" \
>         op monitor interval="120" timeout="60" \
>         op start interval="0" timeout="300" \
>         op stop interval="0" timeout="300" \
>         params target="MGS"
> primitive st-pm stonith:external/powerman \
>         params serverhost="192.168.122.1:10101" poweroff="0"
> primitive testfs-MDT0000_3 ocf:hydra:Target \
>         meta target-role="Started" \
>         operations $id="testfs-MDT0000_3-operations" \
>         op monitor interval="120" timeout="60" \
>         op start interval="0" timeout="300" \
>         op stop interval="0" timeout="300" \
>         params target="testfs-MDT0000"
> primitive testfs-OST0000_4 ocf:hydra:Target \
>         meta target-role="Started" \
>         operations $id="testfs-OST0000_4-operations" \
>         op monitor interval="120" timeout="60" \
>         op start interval="0" timeout="300" \
>         op stop interval="0" timeout="300" \
>         params target="testfs-OST0000"
> clone fencing st-pm
> location MGS_2-primary MGS_2 20: mds1
> location MGS_2-secondary MGS_2 10: mds2
> location testfs-MDT0000_3-primary testfs-MDT0000_3 20: mds2
> location testfs-MDT0000_3-secondary testfs-MDT0000_3 10: mds1
> location testfs-OST0000_4-primary testfs-OST0000_4 20: oss1
> location testfs-OST0000_4-secondary testfs-OST0000_4 10: oss2
> property $id="cib-bootstrap-options" \
>         no-quorum-policy="ignore" \
>         expected-quorum-votes="4" \
>         symmetric-cluster="false" \

I'd expect this to be the problem. In an asymmetric cluster (symmetric-cluster="false"), a resource may only run on nodes where it has a positive location score, so you must add a location constraint for each resource you want to be able to run on a node. Either add a location constraint for the fencing clone for each node, or use a symmetric cluster.

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

> cluster-infrastructure="openais" \
> dc-version="1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3" \
> stonith-enabled="true"
>
> Any ideas why stonith is failing?
>
> Cheers,
> b.
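For concreteness, the fix Andreas suggests could look like the following crm configure fragment. This is a sketch: the constraint names and the score of 10 are illustrative, not taken from the thread.

```
# With symmetric-cluster="false", a resource needs an explicit positive
# location score before it may run anywhere. Give the fencing clone one
# score per node so its instances can start:
location fencing-on-mds1 fencing 10: mds1
location fencing-on-mds2 fencing 10: mds2
location fencing-on-oss1 fencing 10: oss1
location fencing-on-oss2 fencing 10: oss2
```

Alternatively, setting symmetric-cluster="true" gives every resource an implicit score of 0 on every node, which also lets the fencing clone run without per-node constraints.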
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker