> > Hi,
> >
> > On my two-node active/passive setup, I configured fencing via
> > fence_vmware_soap. I configured pcmk_delay=0 on both nodes, so I
> > expected that both nodes would be stonithed simultaneously.
> >
> > In my test scenario, Node1 has the ClusterIP resource. When I
> > disconnected the service/corosync link physically, Node1 was fenced
> > and Node2 stayed alive, despite pcmk_delay=0 on both nodes.
> >
> > Can you explain the behavior above?
>
> # Node1 could not connect to ESX because the links were disconnected.
> # That is the most obvious explanation.
>
> # You have the logs; you are the only one who can answer this question
> # with some certainty. Others can only guess.
>
> Oops, my bad, I forgot to mention: I have two interfaces on each virtual
> machine (node). The second interface is used for the ESX links, so fencing
> can be executed even though the corosync links were disconnected. Looking
> forward to your response. Thanks.
# Having no fence delay means a death match (each node killing the other)
# is possible, but it doesn't guarantee that it will happen. Some of the
# time, one node will detect the outage and fence the other one before
# the other one can react.
#
# It's basically an Old West shoot-out -- they may reach for their guns
# at the same time, but one may be quicker.
#
# As Andrei suggested, the logs from both nodes could give you a timeline
# of what happened when.

Hi Andrei, kindly see the logs below. Based on the timestamps, Node1 should
have fenced Node2 first, but in the actual test Node1 was the one fenced and
shut down by Node2. Is it possible to have a two-node active/passive setup in
Pacemaker/Corosync where the node whose interface goes down is the only one
that gets fenced? Thanks, guys.

*LOGS from Node2:*

Jul 17 13:33:27 ArcosRhel2 corosync[1048]: [TOTEM ] A processor failed, forming new configuration.
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [TOTEM ] A new membership (172.16.10.242:220) was formed. Members left: 1
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [TOTEM ] Failed to receive the leave message. failed: 1
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [QUORUM] Members[1]: 2
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [MAIN  ] Completed service synchronization, ready to provide service.
Jul 17 13:33:28 ArcosRhel2 attrd[1082]: notice: Node ArcosRhel1 state is now lost
Jul 17 13:33:28 ArcosRhel2 attrd[1082]: notice: Removing all ArcosRhel1 attributes for peer loss
Jul 17 13:33:28 ArcosRhel2 attrd[1082]: notice: Lost attribute writer ArcosRhel1
Jul 17 13:33:28 ArcosRhel2 attrd[1082]: notice: Purged 1 peers with id=1 and/or uname=ArcosRhel1 from the membership cache
Jul 17 13:33:28 ArcosRhel2 cib[1079]: notice: Node ArcosRhel1 state is now lost
Jul 17 13:33:28 ArcosRhel2 cib[1079]: notice: Purged 1 peers with id=1 and/or uname=ArcosRhel1 from the membership cache
Jul 17 13:33:28 ArcosRhel2 crmd[1084]: notice: Node ArcosRhel1 state is now lost
Jul 17 13:33:28 ArcosRhel2 crmd[1084]: warning: Our DC node (ArcosRhel1) left the cluster
Jul 17 13:33:28 ArcosRhel2 pacemakerd[1074]: notice: Node ArcosRhel1 state is now lost
Jul 17 13:33:28 ArcosRhel2 stonith-ng[1080]: notice: Node ArcosRhel1 state is now lost
Jul 17 13:33:28 ArcosRhel2 stonith-ng[1080]: notice: Purged 1 peers with id=1 and/or uname=ArcosRhel1 from the membership cache
Jul 17 13:33:28 ArcosRhel2 crmd[1084]: notice: State transition S_NOT_DC -> S_ELECTION
Jul 17 13:33:28 ArcosRhel2 crmd[1084]: notice: State transition S_ELECTION -> S_INTEGRATION
Jul 17 13:33:28 ArcosRhel2 crmd[1084]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Node ArcosRhel1 will be fenced because the node is no longer part of the cluster
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Node ArcosRhel1 is unclean
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Action fence2_stop_0 on ArcosRhel1 is unrunnable (offline)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Action ClusterIP_stop_0 on ArcosRhel1 is unrunnable (offline)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Scheduling Node ArcosRhel1 for STONITH
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: notice: Move fence2#011(Started ArcosRhel1 -> ArcosRhel2)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: notice: Move ClusterIP#011(Started ArcosRhel1 -> ArcosRhel2)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-20.bz2
Jul 17 13:33:30 ArcosRhel2 crmd[1084]: notice: Requesting fencing (reboot) of node ArcosRhel1
Jul 17 13:33:30 ArcosRhel2 crmd[1084]: notice: Initiating start operation fence2_start_0 locally on ArcosRhel2
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]: notice: Client crmd.1084.cd70178e wants to fence (reboot) 'ArcosRhel1' with device '(any)'
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]: notice: Requesting peer fencing (reboot) of ArcosRhel1
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]: notice: Fence1 can fence (reboot) ArcosRhel1: static-list
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]: notice: fence2 can not fence (reboot) ArcosRhel1: static-list
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]: notice: Fence1 can fence (reboot) ArcosRhel1: static-list
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]: notice: fence2 can not fence (reboot) ArcosRhel1: static-list
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]: warning: fence2 has 'action' parameter, which should never be specified in configuration
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]: warning: Mapping action='off' to pcmk_reboot_action='off'
Jul 17 13:33:49 ArcosRhel2 crmd[1084]: notice: Result of start operation for fence2 on ArcosRhel2: 0 (ok)
Jul 17 13:33:49 ArcosRhel2 crmd[1084]: notice: Initiating monitor operation fence2_monitor_60000 locally on ArcosRhel2
Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]: notice: Operation 'reboot' [2323] (call 2 from crmd.1084) for host 'ArcosRhel1' with device 'Fence1' returned: 0 (OK)
Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]: notice: Operation reboot of ArcosRhel1 by ArcosRhel2 for crmd.1084@ArcosRhel2.0426e6e1: OK
Jul 17 13:33:50 ArcosRhel2 crmd[1084]: notice: Stonith operation 2/12:0:0:f9418e1f-1f13-4033-9eaa-aec705f807ef: OK (0)
Jul 17 13:33:50 ArcosRhel2 crmd[1084]: notice: Peer ArcosRhel1 was terminated (reboot) by ArcosRhel2 for ArcosRhel2: OK (ref=0426e6e1-cfda-4475-b32d-8f7bce17027b) by client crmd.1084
Jul 17 13:33:50 ArcosRhel2 crmd[1084]: notice: Initiating start operation ClusterIP_start_0 locally on ArcosRhel2
Jul 17 13:33:50 ArcosRhel2 IPaddr2(ClusterIP)[2342]: INFO: Adding inet address 172.16.10.243/32 with broadcast address 172.16.10.255 to device ens192
Jul 17 13:33:51 ArcosRhel2 IPaddr2(ClusterIP)[2342]: INFO: Bringing device ens192 up
Jul 17 13:33:51 ArcosRhel2 IPaddr2(ClusterIP)[2342]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -c 5 -p /var/run/resource-agents/send_arp-172.16.10.243 -I ens192 -m auto 172.16.10.243
Jul 17 13:33:52 ArcosRhel2 ntpd[1821]: Listen normally on 8 ens192 172.16.10.243 UDP 123
Jul 17 13:33:55 ArcosRhel2 crmd[1084]: notice: Result of start operation for ClusterIP on ArcosRhel2: 0 (ok)
Jul 17 13:33:58 ArcosRhel2 crmd[1084]: notice: Initiating monitor operation ClusterIP_monitor_30000 locally on ArcosRhel2
Jul 17 13:33:58 ArcosRhel2 crmd[1084]: notice: Transition 0 (Complete=9, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-20.bz2): Complete
Jul 17 13:33:58 ArcosRhel2 crmd[1084]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Jul 17 13:34:43 ArcosRhel2 ntpd[1821]: 0.0.0.0 0612 02 freq_set kernel -40.734 PPM
Jul 17 13:34:43 ArcosRhel2 ntpd[1821]: 0.0.0.0 0615 05 clock_sync

*LOGS from NODE1:*

Jul 17 13:33:26 ArcoSRhel1 corosync[1464]: [TOTEM ] A processor failed, forming new configuration.
Jul 17 13:33:28 ArcoSRhel1 corosync[1464]: [TOTEM ] A new membership (172.16.10.241:220) was formed. Members left: 2
Jul 17 13:33:28 ArcoSRhel1 corosync[1464]: [TOTEM ] Failed to receive the leave message. failed: 2
Jul 17 13:33:28 ArcoSRhel1 corosync[1464]: [QUORUM] Members[1]: 1
Jul 17 13:33:28 ArcoSRhel1 corosync[1464]: [MAIN  ] Completed service synchronization, ready to provide service.
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: notice: Node ArcosRhel2 state is now lost
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: notice: Purged 1 peers with id=2 and/or uname=ArcosRhel2 from the membership cache
Jul 17 13:33:28 ArcoSRhel1 attrd[1475]: notice: Node ArcosRhel2 state is now lost
Jul 17 13:33:28 ArcoSRhel1 attrd[1475]: notice: Removing all ArcosRhel2 attributes for peer loss
Jul 17 13:33:28 ArcoSRhel1 attrd[1475]: notice: Purged 1 peers with id=2 and/or uname=ArcosRhel2 from the membership cache
Jul 17 13:33:28 ArcoSRhel1 cib[1472]: notice: Node ArcosRhel2 state is now lost
Jul 17 13:33:28 ArcoSRhel1 cib[1472]: notice: Purged 1 peers with id=2 and/or uname=ArcosRhel2 from the membership cache
Jul 17 13:33:28 ArcoSRhel1 crmd[1477]: notice: Node ArcosRhel2 state is now lost
Jul 17 13:33:28 ArcoSRhel1 crmd[1477]: warning: No reason to expect node 2 to be down
Jul 17 13:33:28 ArcoSRhel1 crmd[1477]: notice: Stonith/shutdown of ArcosRhel2 not matched
Jul 17 13:33:28 ArcoSRhel1 pacemakerd[1471]: notice: Node ArcosRhel2 state is now lost
Jul 17 13:33:28 ArcoSRhel1 crmd[1477]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jul 17 13:33:28 ArcoSRhel1 crmd[1477]: warning: No reason to expect node 2 to be down
Jul 17 13:33:28 ArcoSRhel1 crmd[1477]: notice: Stonith/shutdown of ArcosRhel2 not matched
Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: warning: Node ArcosRhel2 will be fenced because the node is no longer part of the cluster
Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: warning: Node ArcosRhel2 is unclean
Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: warning: Action Fence1_stop_0 on ArcosRhel2 is unrunnable (offline)
Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: warning: Scheduling Node ArcosRhel2 for STONITH
Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: notice: Move Fence1#011(Started ArcosRhel2 -> ArcosRhel1)
Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: warning: Calculated transition 4 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-8.bz2
Jul 17 13:33:28 ArcoSRhel1 crmd[1477]: notice: Requesting fencing (reboot) of node ArcosRhel2
Jul 17 13:33:28 ArcoSRhel1 crmd[1477]: notice: Initiating start operation Fence1_start_0 locally on ArcosRhel1
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: notice: Client crmd.1477.6d888347 wants to fence (reboot) 'ArcosRhel2' with device '(any)'
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: notice: Requesting peer fencing (reboot) of ArcosRhel2
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: warning: Fence1 has 'action' parameter, which should never be specified in configuration
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: warning: Mapping action='off' to pcmk_reboot_action='off'
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: notice: Fence1 can not fence (reboot) ArcosRhel2: static-list
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: notice: fence2 can fence (reboot) ArcosRhel2: static-list
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: notice: Fence1 can not fence (reboot) ArcosRhel2: static-list
Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: notice: fence2 can fence (reboot) ArcosRhel2: static-list
Jul 17 13:33:46 ArcoSRhel1 fence_vmware_soap: Unable to connect/login to fencing device
Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning: fence_vmware_soap[7157] stderr: [ Unable to connect/login to fencing device ]
Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning: fence_vmware_soap[7157] stderr: [ ]
Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning: fence_vmware_soap[7157] stderr: [ ]

> > See my config below:
> >
> > [root@ArcosRhel2 cluster]# pcs config
> > Cluster Name: ARCOSCLUSTER
> > Corosync Nodes:
> >  ArcosRhel1 ArcosRhel2
> > Pacemaker Nodes:
> >  ArcosRhel1 ArcosRhel2
> >
> > Resources:
> >  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
> >   Attributes: cidr_netmask=32 ip=172.16.10.243
> >   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
> >               start interval=0s timeout=20s (ClusterIP-start-interval-0s)
> >               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
> >
> > Stonith Devices:
> >  Resource: Fence1 (class=stonith type=fence_vmware_soap)
> >   Attributes: action=off ipaddr=172.16.10.151 login=admin passwd=123pass
> >    pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s port=ArcosRhel1(Joniel)
> >    ssl_insecure=1 pcmk_delay_max=0s
> >   Operations: monitor interval=60s (Fence1-monitor-interval-60s)
> >  Resource: fence2 (class=stonith type=fence_vmware_soap)
> >   Attributes: action=off ipaddr=172.16.10.152 login=admin passwd=123pass
> >    pcmk_delay_max=0s pcmk_host_list=ArcosRhel2 pcmk_monitor_timeout=60s
> >    port=ArcosRhel2(Ben) ssl_insecure=1
> >   Operations: monitor interval=60s (fence2-monitor-interval-60s)
> > Fencing Levels:
> >
> > Location Constraints:
> >   Resource: Fence1
> >     Enabled on: ArcosRhel2 (score:INFINITY) (id:location-Fence1-ArcosRhel2-INFINITY)
> >   Resource: fence2
> >     Enabled on: ArcosRhel1 (score:INFINITY) (id:location-fence2-ArcosRhel1-INFINITY)
> > Ordering Constraints:
> > Colocation Constraints:
> > Ticket Constraints:
> >
> > Alerts:
> >  No alerts defined
> >
> > Resources Defaults:
> >  No defaults set
> > Operations Defaults:
> >  No defaults set
> >
> > Cluster Properties:
> >  cluster-infrastructure: corosync
> >  cluster-name: ARCOSCLUSTER
> >  dc-version: 1.1.16-12.el7-94ff4df
> >  have-watchdog: false
> >  last-lrm-refresh: 1531810841
> >  stonith-enabled: true
> >
> > Quorum:
> >   Options:
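On the follow-up question (fencing only the node that lost its link): a fence delay cannot detect which side went down, but it can reliably decide who wins a simultaneous shoot-out. The sketch below works against the config quoted above; the fence_vmware_soap options are taken from the agent's man page, and note that the static pcmk_delay_base parameter was only added in Pacemaker 1.1.17, so on the 1.1.16 build shown here only the random pcmk_delay_max is available.

```shell
# First confirm each node can actually log in to the ESX host it fences
# through -- Node1's attempt above died with "Unable to connect/login to
# fencing device", which is why Node1 lost despite reacting first.
# (-a host, -l login, -p password, -n plug/VM name, -z SSL, -o action)
fence_vmware_soap -a 172.16.10.151 -l admin -p 123pass \
    -z --ssl-insecure -n "ArcosRhel1(Joniel)" -o status

# Fence1 is the device that kills ArcosRhel1 (the node holding ClusterIP).
# Delaying it makes ArcosRhel2 hesitate before shooting, so the active
# node wins any death match. Example delay value is arbitrary.
pcs stonith update Fence1 pcmk_delay_base=15s
pcs stonith update fence2 pcmk_delay_base=0s
```

This expresses a fixed preference for one node, not "fence whichever node is unhealthy". If the preference should follow whichever node currently runs resources, newer Pacemaker (2.0.4+) offers the priority-fencing-delay cluster property instead. Separately, the `action=off` attribute on both devices should be removed: stonith-ng warns in the logs above that 'action' should never be specified in configuration (use pcmk_reboot_action only if off-instead-of-reboot semantics are really wanted).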
_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org