I was doing a NIC firmware upgrade and forgot to stop the cluster on the node
I was working on. Something strange happened: both nodes were fenced
at the same time.
I'm using SBD as the STONITH device, with the following parameters:
watchdog timeout = 10 ; msgwait = 20 ; stonith-timeout = 40 (Pacemaker)
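For context, the timeouts were set up roughly like this (the device path below is a placeholder, not the real disk):

```shell
# Initialize the SBD device header with the timeouts above
# (-1 = watchdog timeout in seconds, -4 = msgwait in seconds;
#  /dev/disk/by-id/my-sbd-disk is a placeholder path).
sbd -d /dev/disk/by-id/my-sbd-disk -1 10 -4 20 create

# Verify the timeouts actually written to the device header.
sbd -d /dev/disk/by-id/my-sbd-disk dump

# Pacemaker side: stonith-timeout should exceed msgwait so the
# fencing operation is given time to complete (40 > 20 here).
crm configure property stonith-timeout=40s
```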
May 31 14:41:48 node01 cluster-dlm: stop_kernel: clvmd stop_kernel cg 2
May 31 14:41:48 node01 corosync[76539]: [CPG ] chosen downlist: sender r(0)
ip(191.255.5.201) ; members(old:2 left:1)
May 31 14:41:48 node01 cluster-dlm: do_sysfs: write "0" to
"/sys/kernel/dlm/clvmd/control"
May 31 14:41:48 node01 crmd: [76549]: info: do_state_transition: State
transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL
origin=check_dead_member ]
May 31 14:41:48 node01 crmd: [76549]: info: update_dc: Unset DC node02
May 31 14:41:48 node01 corosync[76539]: [MAIN ] Completed service
synchronization, ready to provide service.
May 31 14:41:48 node01 crmd: [76549]: info: do_state_transition: State
transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_election_check ]
May 31 14:41:48 node01 crmd: [76549]: info: do_te_control: Registering TE UUID:
3f4ffc02-37c8-471d-bb82-43b23b6c96c4
May 31 14:41:48 node01 crmd: [76549]: info: set_graph_functions: Setting custom
graph functions
May 31 14:41:48 node01 crmd: [76549]: info: unpack_graph: Unpacked transition
-1: 0 actions in 0 synapses
May 31 14:41:48 node01 crmd: [76549]: info: do_dc_takeover: Taking over DC
status for this partition
May 31 14:41:48 node01 cib: [76545]: info: cib_process_readwrite: We are now in
R/W mode
May 31 14:41:48 node01 cluster-dlm: fence_node_time: Node 1241907135/node02 has
not been shot yet
May 31 14:41:48 node01 cib: [76545]: info: cib_process_request: Operation
complete: op cib_master for section 'all' (origin=local/crmd/179,
version=0.1600.32): ok (rc=0)
May 31 14:41:48 node01 cib: [76545]: info: cib_process_request: Operation
complete: op cib_modify for section cib (origin=local/crmd/180,
version=0.1600.33): ok (rc=0)
May 31 14:41:48 node01 cib: [76545]: info: cib_process_request: Operation
complete: op cib_modify for section crm_config (origin=local/crmd/182,
version=0.1600.34): ok (rc=0)
May 31 14:41:48 node01 crmd: [76549]: info: join_make_offer: Making join offers
based on membership 1356
May 31 14:41:48 node01 crmd: [76549]: info: do_dc_join_offer_all: join-1:
Waiting on 1 outstanding join acks
May 31 14:41:48 node01 crmd: [76549]: info: ais_dispatch_message: Membership
1356: quorum still lost
May 31 14:41:48 node02 kernel: [905880.644815] qlcnic 0000:08:00.1: phy port: 1
switch_mode: 0,
May 31 14:41:48 node02 kernel: [905880.644818] max_tx_q: 1 max_rx_q: 16
min_tx_bw: 0x0,
May 31 14:41:48 node02 kernel: [905880.644820] max_tx_bw: 0x64
max_mtu:0x2580, capabilities: 0xdeea0fae
May 31 14:41:48 node02 crmd: [16192]: info: crmd_ais_dispatch: Setting expected
votes to 2
May 31 14:41:48 node02 sbd: [36423]: WARN: CIB: We do NOT have quorum!
May 31 14:41:48 node02 sbd: [36420]: WARN: Pacemaker health check: UNHEALTHY
May 31 14:41:48 node02 crmd: [16192]: WARN: match_down_event: No match for
shutdown action on node01
May 31 14:41:48 node02 crmd: [16192]: info: te_update_diff: Stonith/shutdown of
node01 not matched
May 31 14:41:48 node02 crmd: [16192]: info: abort_transition_graph:
te_update_diff:234 - Triggered transition abort (complete=1, tag=node_state,
id=s02srv002ch, magic=NA, cib=0.1600.33) : Node failure
May 31 14:41:48 node02 crmd: [16192]: info: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
May 31 14:41:48 node02 crmd: [16192]: info: do_state_transition: All 1 cluster
nodes are eligible to run resources.
May 31 14:41:48 node02 crmd: [16192]: info: do_pe_invoke: Query 1676:
Requesting the current CIB: S_POLICY_ENGINE
May 31 14:41:48 node02 cib: [16188]: info: cib_process_request: Operation
complete: op cib_modify for section crm_config (origin=local/crmd/1675,
version=0.1600.35): ok (rc=0)
May 31 14:41:48 node02 cluster-dlm: fence_node_time: Node 1225129919/node01 has
not been shot yet
May 31 14:41:48 node02 cluster-dlm: check_fencing_done: clvmd check_fencing
1225129919 wait add 1400654144 fail 1401540108 last 0
May 31 14:41:48 node02 kernel: [905880.676719] qlcnic 0000:08:00.1: Supports FW
dump capability
May 31 14:41:48 node02 kernel: [905880.676728] qlcnic 0000:08:00.1: firmware
v4.14.26
May 31 14:41:48 node02 crmd: [16192]: info: do_pe_invoke_callback: Invoking the
PE: query=1676, ref=pe_calc-dc-1401540108-4630, seq=1356, quorate=0
May 31 14:41:48 node02 pengine: [16191]: notice: unpack_config: On loss of CCM
Quorum: Ignore
May 31 14:41:48 node02 pengine: [16191]: WARN: pe_fence_node: Node node01 will
be fenced because it is un-expectedly down
May 31 14:41:48 node02 pengine: [16191]: WARN: determine_online_status: Node
s02srv002ch is unclean
May 31 14:41:48 node02 pengine: [16191]: WARN: custom_action: Action
dlm:1_stop_0 on node01 is unrunnable (offline)
May 31 14:41:48 node02 pengine: [16191]: WARN: custom_action: Marking node
node01 unclean
May 31 14:41:48 node02 pengine: [16191]: WARN: custom_action: Action
clvm:1_stop_0 on node01 is unrunnable (offline)
May 31 14:41:48 node02 pengine: [16191]: WARN: custom_action: Marking node
node01 unclean
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org