Hello,

In my Pacemaker/Corosync cluster it looks like I have an issue with the fencing acknowledgement for DLM/cLVM.
When a node is fenced, DLM/cLVM are not aware of the fencing result, and LVM commands hang unless I run "dlm_tool fence_ack <NODE_ID>" manually.

Here are some logs from around the fencing of nebula1:

Nov 24 09:51:06 nebula3 crmd[6043]: warning: update_failcount: Updating failcount for clvm on nebula1 after failed stop: rc=1 (update=INFINITY, time=1416819066)
Nov 24 09:51:06 nebula3 pengine[6042]: warning: unpack_rsc_op: Processing failed op stop for clvm:0 on nebula1: unknown error (1)
Nov 24 09:51:06 nebula3 pengine[6042]: warning: pe_fence_node: Node nebula1 will be fenced because of resource failure(s)
Nov 24 09:51:06 nebula3 pengine[6042]: warning: stage6: Scheduling Node nebula1 for STONITH
Nov 24 09:51:06 nebula3 pengine[6042]: notice: native_stop_constraints: Stop of failed resource clvm:0 is implicit after nebula1 is fenced
Nov 24 09:51:06 nebula3 pengine[6042]: notice: LogActions: Move Stonith-nebula3-IPMILAN#011(Started nebula1 -> nebula2)
Nov 24 09:51:06 nebula3 pengine[6042]: notice: LogActions: Stop dlm:0#011(nebula1)
Nov 24 09:51:06 nebula3 pengine[6042]: notice: LogActions: Stop clvm:0#011(nebula1)
Nov 24 09:51:06 nebula3 pengine[6042]: warning: process_pe_message: Calculated Transition 4: /var/lib/pacemaker/pengine/pe-warn-1.bz2
Nov 24 09:51:06 nebula3 pengine[6042]: warning: unpack_rsc_op: Processing failed op stop for clvm:0 on nebula1: unknown error (1)
Nov 24 09:51:06 nebula3 pengine[6042]: warning: pe_fence_node: Node nebula1 will be fenced because of resource failure(s)
Nov 24 09:51:06 nebula3 pengine[6042]: warning: stage6: Scheduling Node nebula1 for STONITH
Nov 24 09:51:06 nebula3 pengine[6042]: notice: native_stop_constraints: Stop of failed resource clvm:0 is implicit after nebula1 is fenced
Nov 24 09:51:06 nebula3 pengine[6042]: notice: LogActions: Move Stonith-nebula3-IPMILAN#011(Started nebula1 -> nebula2)
Nov 24 09:51:06 nebula3 pengine[6042]: notice: LogActions: Stop dlm:0#011(nebula1)
Nov 24 09:51:06 nebula3 pengine[6042]: notice: LogActions: Stop clvm:0#011(nebula1)
Nov 24 09:51:06 nebula3 pengine[6042]: warning: process_pe_message: Calculated Transition 5: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Nov 24 09:51:06 nebula3 crmd[6043]: notice: te_fence_node: Executing reboot fencing operation (79) on nebula1 (timeout=30000)
Nov 24 09:51:06 nebula3 stonith-ng[6039]: notice: handle_request: Client crmd.6043.5ec58277 wants to fence (reboot) 'nebula1' with device '(any)'
Nov 24 09:51:06 nebula3 stonith-ng[6039]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for nebula1: 50c93bed-e66f-48a5-bd2f-100a9e7ca7a1 (0)
Nov 24 09:51:06 nebula3 stonith-ng[6039]: notice: can_fence_host_with_device: Stonith-nebula1-IPMILAN can fence nebula1: static-list
Nov 24 09:51:06 nebula3 stonith-ng[6039]: notice: can_fence_host_with_device: Stonith-nebula2-IPMILAN can not fence nebula1: static-list
Nov 24 09:51:06 nebula3 stonith-ng[6039]: notice: can_fence_host_with_device: Stonith-ONE-Frontend can not fence nebula1: static-list
Nov 24 09:51:09 nebula3 corosync[5987]: [TOTEM ] A processor failed, forming new configuration.
Nov 24 09:51:13 nebula3 corosync[5987]: [TOTEM ] A new membership (192.168.231.71:81200) was formed. Members left: 1084811078
Nov 24 09:51:13 nebula3 lvm[6311]: confchg callback. 0 joined, 1 left, 2 members
Nov 24 09:51:13 nebula3 corosync[5987]: [QUORUM] Members[2]: 1084811079 1084811080
Nov 24 09:51:13 nebula3 corosync[5987]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 24 09:51:13 nebula3 pacemakerd[6036]: notice: crm_update_peer_state: pcmk_quorum_notification: Node nebula1[1084811078] - state is now lost (was member)
Nov 24 09:51:13 nebula3 crmd[6043]: notice: crm_update_peer_state: pcmk_quorum_notification: Node nebula1[1084811078] - state is now lost (was member)
Nov 24 09:51:13 nebula3 kernel: [ 510.140107] dlm: closing connection to node 1084811078
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence status 1084811078 receive 1 from 1084811079 walltime 1416819073 local 509
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence request 1084811078 pid 7142 nodedown time 1416819073 fence_all dlm_stonith
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence result 1084811078 pid 7142 result 1 exit status
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence status 1084811078 receive 1 from 1084811080 walltime 1416819073 local 509
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence request 1084811078 no actor
Nov 24 09:51:13 nebula3 stonith-ng[6039]: notice: remote_op_done: Operation reboot of nebula1 by nebula2 for crmd.6043@nebula3.50c93bed: OK
Nov 24 09:51:13 nebula3 crmd[6043]: notice: tengine_stonith_callback: Stonith operation 4/79:5:0:817919e5-fa6d-4381-b0bd-42141ce0bb41: OK (0)
Nov 24 09:51:13 nebula3 crmd[6043]: notice: tengine_stonith_notify: Peer nebula1 was terminated (reboot) by nebula2 for nebula3: OK (ref=50c93bed-e66f-48a5-bd2f-100a9e7ca7a1) by client crmd.6043
Nov 24 09:51:13 nebula3 crmd[6043]: notice: te_rsc_command: Initiating action 22: start Stonith-nebula3-IPMILAN_start_0 on nebula2
Nov 24 09:51:14 nebula3 crmd[6043]: notice: run_graph: Transition 5 (Complete=11, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Nov 24 09:51:14 nebula3 pengine[6042]: notice: process_pe_message: Calculated Transition 6: /var/lib/pacemaker/pengine/pe-input-2.bz2
Nov 24 09:51:14 nebula3 crmd[6043]: notice: te_rsc_command: Initiating action 21: monitor Stonith-nebula3-IPMILAN_monitor_1800000 on nebula2
Nov 24 09:51:15 nebula3 crmd[6043]: notice: run_graph: Transition 6 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2.bz2): Complete
Nov 24 09:51:15 nebula3 crmd[6043]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Nov 24 09:52:10 nebula3 dlm_controld[6263]: 566 datastores wait for fencing
Nov 24 09:52:10 nebula3 dlm_controld[6263]: 566 clvmd wait for fencing
Nov 24 09:55:10 nebula3 dlm_controld[6263]: 747 fence status 1084811078 receive -125 from 1084811079 walltime 1416819310 local 747

So even though stonith-ng reports the reboot of nebula1 as OK, dlm_controld keeps logging "clvmd wait for fencing" and "datastores wait for fencing" (datastores is my GFS2 volume).

Any idea of something I can check when this happens?

Regards.

--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
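For reference, this is the manual workaround I currently apply when the lockspaces hang (a sketch of my admin session on a surviving node; the node ID 1084811078 is the one from the dlm_controld logs above, your IDs will differ):

# List DLM lockspaces; "wait fencing" in the flags shows which
# lockspaces (clvmd, datastores) are still blocked on a fence result
dlm_tool ls

# Show dlm_controld's daemon state, including pending fencing
dlm_tool status

# Manually acknowledge that the dead node was really fenced;
# after this, clvmd and the GFS2 lockspace resume normally
dlm_tool fence_ack 1084811078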
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org