Hello,

In my Pacemaker/Corosync cluster it looks like I have an issue with the
fencing acknowledgement for DLM/cLVM.

When a node is fenced, DLM/cLVM are not aware of the fencing result, and LVM
commands hang unless I run "dlm_tool fence_ack <ID_OF_THE_NODE>".
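
For reference, here is roughly what I do by hand each time; the corosync node
ID of the fenced node comes from the membership messages (1084811078 for
nebula1 in the logs below):

    # blocked lockspaces still report that they are waiting on fencing
    dlm_tool ls

    # manually acknowledge the fencing of the dead node
    dlm_tool fence_ack 1084811078

After the fence_ack the hanging LVM commands complete.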

Here are some logs from around the fencing of nebula1:

Nov 24 09:51:06 nebula3 crmd[6043]:  warning: update_failcount: Updating 
failcount for clvm on nebula1 after failed stop: rc=1 (update=INFINITY, 
time=1416819066)
Nov 24 09:51:06 nebula3 pengine[6042]:  warning: unpack_rsc_op: Processing 
failed op stop for clvm:0 on nebula1: unknown error (1)
Nov 24 09:51:06 nebula3 pengine[6042]:  warning: pe_fence_node: Node nebula1 
will be fenced because of resource failure(s)
Nov 24 09:51:06 nebula3 pengine[6042]:  warning: stage6: Scheduling Node 
nebula1 for STONITH
Nov 24 09:51:06 nebula3 pengine[6042]:   notice: native_stop_constraints: Stop 
of failed resource clvm:0 is implicit after nebula1 is fenced
Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Move    
Stonith-nebula3-IPMILAN#011(Started nebula1 -> nebula2)
Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Stop    
dlm:0#011(nebula1)
Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Stop    
clvm:0#011(nebula1)
Nov 24 09:51:06 nebula3 pengine[6042]:  warning: process_pe_message: Calculated 
Transition 4: /var/lib/pacemaker/pengine/pe-warn-1.bz2
Nov 24 09:51:06 nebula3 pengine[6042]:  warning: unpack_rsc_op: Processing 
failed op stop for clvm:0 on nebula1: unknown error (1)
Nov 24 09:51:06 nebula3 pengine[6042]:  warning: pe_fence_node: Node nebula1 
will be fenced because of resource failure(s)
Nov 24 09:51:06 nebula3 pengine[6042]:  warning: stage6: Scheduling Node 
nebula1 for STONITH
Nov 24 09:51:06 nebula3 pengine[6042]:   notice: native_stop_constraints: Stop 
of failed resource clvm:0 is implicit after nebula1 is fenced
Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Move    
Stonith-nebula3-IPMILAN#011(Started nebula1 -> nebula2)
Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Stop    
dlm:0#011(nebula1)
Nov 24 09:51:06 nebula3 pengine[6042]:   notice: LogActions: Stop    
clvm:0#011(nebula1)
Nov 24 09:51:06 nebula3 pengine[6042]:  warning: process_pe_message: Calculated 
Transition 5: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Nov 24 09:51:06 nebula3 crmd[6043]:   notice: te_fence_node: Executing reboot 
fencing operation (79) on nebula1 (timeout=30000)
Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: handle_request: Client 
crmd.6043.5ec58277 wants to fence (reboot) 'nebula1' with device '(any)'
Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: initiate_remote_stonith_op: 
Initiating remote operation reboot for nebula1: 
50c93bed-e66f-48a5-bd2f-100a9e7ca7a1 (0)
Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: can_fence_host_with_device: 
Stonith-nebula1-IPMILAN can fence nebula1: static-list
Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: can_fence_host_with_device: 
Stonith-nebula2-IPMILAN can not fence nebula1: static-list
Nov 24 09:51:06 nebula3 stonith-ng[6039]:   notice: can_fence_host_with_device: 
Stonith-ONE-Frontend can not fence nebula1: static-list
Nov 24 09:51:09 nebula3 corosync[5987]:   [TOTEM ] A processor failed, forming 
new configuration.
Nov 24 09:51:13 nebula3 corosync[5987]:   [TOTEM ] A new membership 
(192.168.231.71:81200) was formed. Members left: 1084811078
Nov 24 09:51:13 nebula3 lvm[6311]: confchg callback. 0 joined, 1 left, 2 members
Nov 24 09:51:13 nebula3 corosync[5987]:   [QUORUM] Members[2]: 1084811079 
1084811080
Nov 24 09:51:13 nebula3 corosync[5987]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Nov 24 09:51:13 nebula3 pacemakerd[6036]:   notice: crm_update_peer_state: 
pcmk_quorum_notification: Node nebula1[1084811078] - state is now lost (was 
member)
Nov 24 09:51:13 nebula3 crmd[6043]:   notice: crm_update_peer_state: 
pcmk_quorum_notification: Node nebula1[1084811078] - state is now lost (was 
member)
Nov 24 09:51:13 nebula3 kernel: [  510.140107] dlm: closing connection to node 
1084811078
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence status 1084811078 receive 
1 from 1084811079 walltime 1416819073 local 509
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence request 1084811078 pid 
7142 nodedown time 1416819073 fence_all dlm_stonith
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence result 1084811078 pid 
7142 result 1 exit status
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence status 1084811078 receive 
1 from 1084811080 walltime 1416819073 local 509
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence request 1084811078 no 
actor
Nov 24 09:51:13 nebula3 stonith-ng[6039]:   notice: remote_op_done: Operation 
reboot of nebula1 by nebula2 for crmd.6043@nebula3.50c93bed: OK
Nov 24 09:51:13 nebula3 crmd[6043]:   notice: tengine_stonith_callback: Stonith 
operation 4/79:5:0:817919e5-fa6d-4381-b0bd-42141ce0bb41: OK (0)
Nov 24 09:51:13 nebula3 crmd[6043]:   notice: tengine_stonith_notify: Peer 
nebula1 was terminated (reboot) by nebula2 for nebula3: OK 
(ref=50c93bed-e66f-48a5-bd2f-100a9e7ca7a1) by client crmd.6043
Nov 24 09:51:13 nebula3 crmd[6043]:   notice: te_rsc_command: Initiating action 
22: start Stonith-nebula3-IPMILAN_start_0 on nebula2
Nov 24 09:51:14 nebula3 crmd[6043]:   notice: run_graph: Transition 5 
(Complete=11, Pending=0, Fired=0, Skipped=1, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Nov 24 09:51:14 nebula3 pengine[6042]:   notice: process_pe_message: Calculated 
Transition 6: /var/lib/pacemaker/pengine/pe-input-2.bz2
Nov 24 09:51:14 nebula3 crmd[6043]:   notice: te_rsc_command: Initiating action 
21: monitor Stonith-nebula3-IPMILAN_monitor_1800000 on nebula2
Nov 24 09:51:15 nebula3 crmd[6043]:   notice: run_graph: Transition 6 
(Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-2.bz2): Complete
Nov 24 09:51:15 nebula3 crmd[6043]:   notice: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Nov 24 09:52:10 nebula3 dlm_controld[6263]: 566 datastores wait for fencing
Nov 24 09:52:10 nebula3 dlm_controld[6263]: 566 clvmd wait for fencing
Nov 24 09:55:10 nebula3 dlm_controld[6263]: 747 fence status 1084811078 receive 
-125 from 1084811079 walltime 1416819310 local 747

When the node is fenced I get "clvmd wait for fencing" and "datastores
wait for fencing" ("datastores" is my GFS2 volume).
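
In case it is useful, here is what I look at when this happens (the
stonith_admin call is only my understanding of how to check the Pacemaker
side, it is not taken from the logs above):

    # dlm_controld's view: per-node status and the recent daemon debug buffer
    dlm_tool status
    dlm_tool dump | tail -n 50

    # Pacemaker's view of the fencing history for the node
    stonith_admin --history nebula1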

Any idea what I can check when this happens?

Regards.
-- 
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF


