[ClusterLabs] stonith in dual HMC environment

Alexander Markov Thu, 23 Mar 2017 00:12:07 -0700

> Please share your config along with the logs from the nodes that were
> affected.

I'm starting to think it's not about how the stonith resources are defined. If the whole box is down, with all the logical partitions defined on it, then the HMC cannot determine whether an LPAR (partition) is really dead or just inaccessible. This leads to an UNCLEAN OFFLINE node status, and Pacemaker refuses to do anything until that is resolved. Am I right? Anyway, the simplest Pacemaker config from my partitions is below.
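The reasoning above can be captured in a small model. This is a simplified sketch under stated assumptions, not Pacemaker's actual code: the names `FenceResult`, `node_state` and `can_recover_resources` are hypothetical, and the model assumes that a node which has left the cluster is only treated as safely down once some fencing device reports success.

```python
from enum import Enum
from typing import Sequence


class FenceResult(Enum):
    SUCCESS = "success"
    FAILED = "failed"            # agent ran but reported an error
    UNREACHABLE = "unreachable"  # the fencing device (e.g. an HMC) could not be contacted


def node_state(is_member: bool, fence_results: Sequence[FenceResult]) -> str:
    """Simplified node state with stonith enabled (a model, not Pacemaker code).

    A node that is still a cluster member is ONLINE. A node that has left
    the cluster only counts as cleanly OFFLINE once some fencing device has
    reported success; otherwise nothing proves it is down, so it is UNCLEAN.
    """
    if is_member:
        return "ONLINE"
    if any(r is FenceResult.SUCCESS for r in fence_results):
        return "OFFLINE"
    return "UNCLEAN"


def can_recover_resources(state: str) -> bool:
    # In this model, resources of a lost node are only started elsewhere once
    # the node is known to be down; an UNCLEAN node blocks their recovery.
    return state == "OFFLINE"


# Both HMC-based devices fail, as in the logs in this post: the node stays UNCLEAN.
state = node_state(False, [FenceResult.UNREACHABLE, FenceResult.FAILED])
print(state, can_recover_resources(state))  # prints: UNCLEAN False
```

In this model, adding more devices only helps if at least one of them can actually report success while the whole box is down.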


primitive sap_ASCS SAPInstance \
        params InstanceName=CAP_ASCS01_crmapp \
        op monitor timeout=60 interval=120 depth=0
primitive sap_D00 SAPInstance \
        params InstanceName=CAP_D00_crmapp \
        op monitor timeout=60 interval=120 depth=0
primitive sap_ip IPaddr2 \
        params ip=10.1.12.2 nic=eth0 cidr_netmask=24
primitive st_ch_hmc stonith:ibmhmc \
        params ipaddr=10.1.2.9 \
        op start interval=0 timeout=300
primitive st_hq_hmc stonith:ibmhmc \
        params ipaddr=10.1.2.8 \
        op start interval=0 timeout=300
group g_sap sap_ip sap_ASCS sap_D00 \
        meta target-role=Started
location l_ch_hq_hmc st_ch_hmc -inf: crmapp01
location l_st_hq_hmc st_hq_hmc -inf: crmapp02
location prefer_node_1 g_sap 100: crmapp01
property cib-bootstrap-options: \
        stonith-enabled=true \
        no-quorum-policy=ignore \
        placement-strategy=balanced \
        expected-quorum-votes=2 \
        dc-version=1.1.12-f47ea56 \
        cluster-infrastructure="classic openais (with plugin)" \
        last-lrm-refresh=1490009096 \
        maintenance-mode=false
rsc_defaults rsc-options: \
        resource-stickiness=200 \
        migration-threshold=3
op_defaults op-options: \
        timeout=600 \
        record-pending=true

The logs are pretty much going in circles: stonith cannot reset the logical partition via the HMC, the node stays UNCLEAN OFFLINE, and the resources are shown as staying on the node that is down.
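To see at a glance which device failed, against which host and with which exit code, the stonith-ng `log_operation` lines can be tallied. A minimal sketch, assuming log lines shaped like the excerpts in this post (`<device>:<pid> [ failed:<host> <rc> ]`); `summarize_failures` and `FAILED_RE` are hypothetical names, and the regex only covers that one message shape. The exit codes are counted as-is; their meaning is specific to the ibmhmc plugin and is not interpreted here.

```python
import re
from collections import Counter

# Matches stonith-ng lines such as
#   "log_operation: st_hq_hmc:0:6976 [ failed:crmapp02 8 ]"
# The device name may carry a clone suffix (":0"), so the regex backtracks
# until the last ":<digits>" before " [" is taken as the PID.
FAILED_RE = re.compile(
    r"log_operation: (?P<device>\S+):(?P<pid>\d+) \[ failed:\s*(?P<host>\S+) (?P<rc>\d+) \]"
)


def summarize_failures(log_text: str) -> Counter:
    """Count failed fence attempts per (device, target host, exit code)."""
    counts: Counter = Counter()
    # finditer also works when several log entries were run together on one line.
    for m in FAILED_RE.finditer(log_text):
        counts[(m["device"], m["host"], int(m["rc"]))] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "stonith-ng: warning: log_operation: st_ch_hmc:0:6942 [ failed:crmapp02 3 ]\n"
        "stonith-ng: warning: log_operation: st_hq_hmc:6955 [ failed:crmapp02 8 ]\n"
        "stonith-ng: warning: log_operation: st_hq_hmc:0:6976 [ failed:crmapp02 8 ]\n"
    )
    for (device, host, rc), n in sorted(summarize_failures(sample).items()):
        print(f"{device:12} -> {host}: rc={rc} x{n}")
```

Feeding it a whole log file (for example the contents of the stonith-ng log read with `open(...).read()`) shows whether the same devices keep failing with the same codes on every loop.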

stonith-ng: error: log_operation: Operation 'reboot' [6942] (call 6 from crmd.4568) for host 'crmapp02' with device 'st_ch_hmc:0'

Trying: st_ch_hmc:0

stonith-ng: warning: log_operation: st_ch_hmc:0:6942 [ Performing:stonith -t ibmhmc -T reset crmapp02 ]
stonith-ng: warning: log_operation: st_ch_hmc:0:6942 [ failed:crmapp02 3 ]
stonith-ng: info: internal_stonith_action_execute: Attempt 2 to execute fence_legacy (reboot). remaining timeout is 59
stonith-ng: info: update_remaining_timeout: Attempted to execute agent fence_legacy (reboot) the maximum number of times (2)

stonith-ng: error: log_operation: Operation 'reboot' [6955] (call 6 from crmd.4568) for host 'crmapp02' with device 'st_hq_hmc' re

Trying: st_hq_hmc

stonith-ng: warning: log_operation: st_hq_hmc:6955 [ Performing:stonith -t ibmhmc -T reset crmapp02 ]
stonith-ng: warning: log_operation: st_hq_hmc:6955 [ failed:crmapp02 8 ]
stonith-ng: info: internal_stonith_action_execute: Attempt 2 to execute fence_legacy (reboot). remaining timeout is 60
stonith-ng: info: update_remaining_timeout: Attempted to execute agent fence_legacy (reboot) the maximum number of times (2)

stonith-ng: error: log_operation: Operation 'reboot' [6976] (call 6 from crmd.4568) for host 'crmapp02' with device 'st_hq_hmc:0'

stonith-ng: warning: log_operation: st_hq_hmc:0:6976 [ Performing:stonith -t ibmhmc -T reset crmapp02 ]
stonith-ng: warning: log_operation: st_hq_hmc:0:6976 [ failed:crmapp02 8 ]
stonith-ng: notice: stonith_choose_peer: Couldn't find anyone to fence crmapp02 with <any>
stonith-ng: info: call_remote_stonith: None of the 1 peers are capable of terminating crmapp02 for crmd.4568 (1)
stonith-ng: error: remote_op_done: Operation reboot of crmapp02 by <no-one> for crmd.4568@crmapp01.6bf66b9c: No route to host
crmd: notice: tengine_stonith_callback: Stonith operation 6/31:3700:0:b1fed277-9156-48da-8afd-35db672cd1c8: No route to

crmd: notice: tengine_stonith_callback: Stonith operation 6 for crmapp02 failed (No route to host): aborting transition.
crmd: notice: abort_transition_graph: Transition aborted: Stonith failed (source=tengine_stonith_callback:699, 0)
crmd: notice: tengine_stonith_notify: Peer crmapp02 was not terminated (reboot) by <anyone> for crmapp01: No route to host (re

crmd: notice: run_graph: Transition 3700 (Complete=1, Pending=0, Fired=0, Skipped=18, Incomplete=2, Source=/var/lib/pacem

crmd: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_IN

pengine: info: process_pe_message: Input has not changed since last time, not saving to disk

pengine:   notice: unpack_config:    On loss of CCM Quorum: Ignore

pengine: info: determine_online_status_fencing: Node crmapp01 is active

pengine:     info: determine_online_status:  Node crmapp01 is online

pengine: warning: pe_fence_node: Node crmapp02 will be fenced because the node is no longer part of the cluster

pengine:  warning: determine_online_status:  Node crmapp02 is unclean

pengine: info: clone_print: Clone Set: cl_st_ch_hmc [st_ch_hmc]
pengine: info: native_print: st_ch_hmc (stonith:ibmhmc): Started crmapp02 (UNCLEAN)

pengine:     info: short_print:           Started: [ crmapp01 ]

pengine: info: clone_print: Clone Set: cl_st_hq_hmc [st_hq_hmc]
pengine: info: native_print: st_hq_hmc (stonith:ibmhmc): Started crmapp02 (UNCLEAN)

pengine:     info: short_print:           Started: [ crmapp01 ]
pengine:     info: group_print:       Resource Group: g_sap

pengine: info: native_print: sap_ip (ocf::heartbeat:IPaddr2): Started crmapp02 (UNCLEAN)
pengine: info: native_print: sap_ASCS (ocf::heartbeat:SAPInstance): Started crmapp02 (UNCLEAN)
pengine: info: native_print: sap_D00 (ocf::heartbeat:SAPInstance): Started crmapp02 (UNCLEAN)
pengine: info: native_color: Resource st_ch_hmc:1 cannot run anywhere
pengine: info: native_color: Resource st_hq_hmc:1 cannot run anywhere
pengine: warning: custom_action: Action st_ch_hmc:1_stop_0 on crmapp02 is unrunnable (offline)
pengine: warning: custom_action: Action st_ch_hmc:1_stop_0 on crmapp02 is unrunnable (offline)
pengine: warning: custom_action: Action st_hq_hmc:1_stop_0 on crmapp02 is unrunnable (offline)
pengine: warning: custom_action: Action st_hq_hmc:1_stop_0 on crmapp02 is unrunnable (offline)
pengine: warning: custom_action: Action sap_ip_stop_0 on crmapp02 is unrunnable (offline)
pengine: warning: custom_action: Action sap_ASCS_stop_0 on crmapp02 is unrunnable (offline)
pengine: info: RecurringOp: Start recurring monitor (120s) for sap_ASCS on crmapp01
pengine: warning: custom_action: Action sap_D00_stop_0 on crmapp02 is unrunnable (offline)
pengine: info: RecurringOp: Start recurring monitor (120s) for sap_D00 on crmapp01

pengine:  warning: stage6:   Scheduling Node crmapp02 for STONITH

pengine: info: native_stop_constraints: st_ch_hmc:1_stop_0 is implicit after crmapp02 is fenced
pengine: info: native_stop_constraints: st_hq_hmc:1_stop_0 is implicit after crmapp02 is fenced
pengine: info: native_stop_constraints: sap_ip_stop_0 is implicit after crmapp02 is fenced
pengine: info: native_stop_constraints: sap_ASCS_stop_0 is implicit after crmapp02 is fenced
pengine: info: native_stop_constraints: sap_D00_stop_0 is implicit after crmapp02 is fenced
pengine: info: LogActions: Leave st_ch_hmc:0 (Started crmapp01)

pengine:   notice: LogActions:       Stop    st_ch_hmc:1     (crmapp02)

pengine: info: LogActions: Leave st_hq_hmc:0 (Started crmapp01)

pengine:   notice: LogActions:       Stop    st_hq_hmc:1     (crmapp02)

pengine: notice: LogActions: Move sap_ip (Started crmapp02 -> crmapp01)
pengine: notice: LogActions: Move sap_ASCS (Started crmapp02 -> crmapp01)
pengine: notice: LogActions: Move sap_D00 (Started crmapp02 -> crmapp01)
pengine: warning: process_pe_message: Calculated Transition 3701: /var/lib/pacemaker/pengine/pe-warn-5.bz2
crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC

crmd: notice: do_te_invoke: Processing graph 3701 (ref=pe_calc-dc-1489966722-3790) derived from /var/lib/pacemaker/pengine/pe-warn-5.bz2

crmd: notice: te_fence_node: Executing reboot fencing operation (31) on crmapp02 (timeout=60000)
stonith-ng: notice: handle_request: Client crmd.4568.9cd8bc8b wants to fence (reboot) 'crmapp02' with device '(any)'
stonith-ng: notice: initiate_remote_stonith_op: Initiating remote operation reboot for crmapp02: ed7f7eae-4836-451d-b146-d6243b5

stonith-ng: notice: get_capable_devices: stonith-timeout duration 60 is low for the current configuration. Consider raising it to 80 seconds
stonith-ng: notice: can_fence_host_with_device: st_hq_hmc can fence (reboot) crmapp02: dynamic-list
stonith-ng: notice: can_fence_host_with_device: st_hq_hmc:0 can fence (reboot) crmapp02: dynamic-list
stonith-ng: warning: log_action: fence_legacy[6987] stderr: [ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
stonith-ng: warning: log_action: fence_legacy[6987] stderr: [Invalid config info for ibmhmc device ]
stonith-ng: info: internal_stonith_action_execute: Attempt 2 to execute fence_legacy (status). remaining timeout is 11
stonith-ng: warning: log_action: fence_legacy[6986] stderr: [ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
stonith-ng: warning: log_action: fence_legacy[6986] stderr: [Invalid config info for ibmhmc device ]
stonith-ng: info: internal_stonith_action_execute: Attempt 2 to execute fence_legacy (list). remaining timeout is 11
stonith-ng: warning: log_action: fence_legacy[6994] stderr: [ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
stonith-ng: warning: log_action: fence_legacy[6994] stderr: [Invalid config info for ibmhmc device ]
stonith-ng: info: update_remaining_timeout: Attempted to execute agent fence_legacy (list) the maximum number of times (2) allowed

stonith-ng: warning: log_action: fence_legacy[6993] stderr: [ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
stonith-ng: warning: log_action: fence_legacy[6993] stderr: [Invalid config info for ibmhmc device ]
stonith-ng: info: update_remaining_timeout: Attempted to execute agent fence_legacy (status) the maximum number of times (2)

stonith-ng: notice: status_search_cb: Unkown result when testing if st_ch_hmc can fence crmapp02: rc=-201
stonith-ng: info: process_remote_stonith_query: Query result 1 of 1 from crmapp01 for crmapp02/reboot (3 devices) ed7f7eae-4836-451d-b146-d6243b5c8bf3

stonith-ng: info: call_remote_stonith: Total remote op timeout set to 180 for fencing of node crmapp02 for crmd.4568.ed7f7eae
stonith-ng: info: call_remote_stonith: Requesting that crmapp01 perform op reboot crmapp02 for crmd.4568 (216s, 0s)
stonith-ng: notice: get_capable_devices: stonith-timeout duration 60 is low for the current configuration. Consider raising it to

stonith-ng: notice: can_fence_host_with_device: st_hq_hmc can fence (reboot) crmapp02: dynamic-list
stonith-ng: notice: can_fence_host_with_device: st_hq_hmc:0 can fence (reboot) crmapp02: dynamic-list
stonith-ng: warning: log_action: fence_legacy[6999] stderr: [ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
stonith-ng: warning: log_action: fence_legacy[6999] stderr: [Invalid config info for ibmhmc device ]
stonith-ng: info: internal_stonith_action_execute: Attempt 2 to execute fence_legacy (list). remaining timeout is 11
stonith-ng: warning: log_action: fence_legacy[7000] stderr: [ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
stonith-ng: warning: log_action: fence_legacy[7000] stderr: [Invalid config info for ibmhmc device ]
stonith-ng: info: internal_stonith_action_execute: Attempt 2 to execute fence_legacy (status). remaining timeout is 11
stonith-ng: warning: log_action: fence_legacy[7007] stderr: [ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
stonith-ng: warning: log_action: fence_legacy[7007] stderr: [Invalid config info for ibmhmc device ]
stonith-ng: info: update_remaining_timeout: Attempted to execute agent fence_legacy (list) the maximum number of times (2) allowed

stonith-ng: warning: log_action: fence_legacy[7008] stderr: [ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
stonith-ng: warning: log_action: fence_legacy[7008] stderr: [Invalid config info for ibmhmc device ]
stonith-ng: info: update_remaining_timeout: Attempted to execute agent fence_legacy (status) the maximum number of times (2)

stonith-ng: notice: status_search_cb: Unkown result when testing if st_ch_hmc can fence crmapp02: rc=-201
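The repeated `ssh: connect to host 10.1.2.9 port 22: No route to host` errors suggest checking, from the surviving node, whether each HMC address used in the stonith primitives accepts TCP connections on the SSH port at all. A minimal sketch, assuming a plain TCP connect is a fair first proxy for what the ibmhmc plugin's ssh call needs (it does not test authentication or any HMC command); `HMCS` and `check_port` are hypothetical names, and the addresses are copied from the `st_ch_hmc` and `st_hq_hmc` primitives above.

```python
import socket

# HMC addresses as configured in the st_ch_hmc / st_hq_hmc primitives.
HMCS = {"st_ch_hmc": "10.1.2.9", "st_hq_hmc": "10.1.2.8"}


def check_port(host: str, port: int = 22, timeout: float = 5.0) -> str:
    """Try a TCP connect to host:port and return 'open' or the error text.

    This only shows network reachability; it says nothing about whether
    ssh login or the HMC reset command would succeed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except OSError as exc:  # covers timeouts, refused connections, no route to host
        return f"unreachable ({exc})"


def report() -> None:
    # Print one line per configured HMC.
    for name, addr in HMCS.items():
        print(f"{name} {addr}:22 -> {check_port(addr)}")
```

Calling `report()` on crmapp01 while crmapp02's box is down would show whether 10.1.2.9 is reachable from the surviving node; if it is not, st_ch_hmc presumably cannot succeed there however the resource is defined.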



--
Regards,
Alexander

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
