>>>> Hi,
>>>>
>>>> On my two-node active/passive setup, I configured fencing via
>>>> fence_vmware_soap. I configured pcmk_delay=0 on both nodes, so I
>>>> expected that both nodes would be stonithed simultaneously.
>>>>
>>>> In my test scenario, Node1 holds the ClusterIP resource. When I
>>>> physically disconnect the service/corosync link, Node1 is fenced and
>>>> Node2 stays alive, even with pcmk_delay=0 on both nodes.
>>>>
>>>> Can you explain the behavior above?
>>>>
>>> #node1 could not connect to ESX because links were disconnected. As the
>>> #most obvious explanation.
>>>
>>> #You have logs, you are the only one who can answer this question with
>>> #some certainty. Others can only guess.
>>>
>>> Oops, my bad, I forgot to mention: I have two interfaces on each virtual
>>> machine (node). The second interface is used for the ESX link, so fencing
>>> can be executed even though the corosync link is disconnected. Looking
>>> forward to your response. Thanks.
>>
>> #Having no fence delay means a death match (each node killing the other)
>> #is possible, but it doesn't guarantee that it will happen. Some of the
>> #time, one node will detect the outage and fence the other one before
>> #the other one can react.
>>
>> #It's basically an Old West shoot-out -- they may reach for their guns
>> #at the same time, but one may be quicker.
>>
>> #As Andrei suggested, the logs from both nodes could give you a timeline
>> #of what happened when.
>>
>> Hi Andrei, kindly see the logs below. Based on the log timestamps, Node1
>> should have fenced Node2 first, but in the actual test Node1 was
>> fenced/shut down by Node2.
>>
> Node1 tried to fence but failed. It could be connectivity, it could be
> credentials.
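
A quick way for me to narrow that down is probably to run the fence agent by
hand from Node1 while the corosync link is pulled, with the same values as my
fence2 device in the config further down. This is only a sketch and assumes
the standard fence-agents command-line options (I still need to check
fence_vmware_soap --help for the exact spelling on my build):

    # From ArcosRhel1: ask vCenter/ESX for the power status of the peer VM.
    # If this also reports "Unable to connect/login to fencing device", the
    # problem is reachability or credentials, not Pacemaker itself.
    fence_vmware_soap --ip=172.16.10.152 --username=admin --password=123pass \
        --ssl-insecure --action=status --plug="ArcosRhel2(Ben)"

If the manual call normally succeeds but fails only while the corosync link is
down, that would point at the ESX traffic still depending on the disconnected
interface (routing) rather than at credentials.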
Maybe this is the reason, but it's still weird. I have run many tests and they
all follow the same pattern: the node that was physically disconnected is the
one that gets fenced. It's not random. See the diagram at this link:
https://drive.google.com/open?id=1pbJef_wJdQelJSv1L72c4H6NAvUqV_p-
Also, based on my tests, if Node1 gets fenced it does not automatically rejoin
the cluster after a reboot, whereas Node2 rejoins the cluster automatically
after a reboot (more on that after the config below).

>> Is it possible to have a two-node active/passive setup in pacemaker/corosync
>> where the node that gets disconnected (interface down) is the only one that
>> gets fenced?
>>
> If you could determine which node was disconnected you would not need
> any fencing at all.

#True, but there is still good reason to take connectivity into account.
#Of course the foreseen survivor can't know that his peer got
#disconnected directly.
#But what you can do is that if you see that you are disconnected
#yourself (e.g. ping-connection to routers, test-access to some
#web-servers, ...) you can decide to shoot with a delay or not
#shoot at all because starting services locally would anyway
#be no good.
#That is the basic idea behind the fence_heuristics_ping fence agent.
#There was some discussion just recently about approaches
#like that on the list.
#Regards,
#Klaus

fence_heuristics_ping does not seem to be available in my RHEL 7 version. I
wonder if it has been deprecated?
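
From what I can tell it is a fairly recent addition to fence-agents rather
than something deprecated, so it may simply be newer than the fence-agents
build shipped with my RHEL 7 release. If I can get hold of it, my understanding
of Klaus's idea is that the heuristic goes in front of the real device within
the same fencing level, so a node that has lost its own uplink refuses to
fence its peer. A rough sketch of what I think that would look like with pcs
(the device name ping-heuristic and the gateway address are placeholders, the
ping_targets parameter name is what I recall from the agent's metadata, and
the exact pcs level syntax can vary between releases, so this needs
double-checking against `pcs stonith describe fence_heuristics_ping`):

    # A "device" that does not fence anything: it only succeeds if the node
    # executing it can still ping the given target (e.g. the default gateway).
    pcs stonith create ping-heuristic fence_heuristics_ping \
        ping_targets=172.16.10.1 pcmk_host_list="ArcosRhel1 ArcosRhel2"

    # Put the heuristic in front of the real device in the same fencing level:
    # if the executing node cannot ping, the vmware device is never called.
    pcs stonith level add 1 ArcosRhel1 ping-heuristic,Fence1
    pcs stonith level add 1 ArcosRhel2 ping-heuristic,fence2

With that in place, the disconnected node should fail the heuristic and never
reach its real fence device, while the still-connected node fences as usual.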

>> Thanks guys
>>
>> *LOGS from Node2:*
>>
>> Jul 17 13:33:27 ArcosRhel2 corosync[1048]: [TOTEM ] A processor failed,
>> forming new configuration.
> ...
>> Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Node ArcosRhel1 will be
>> fenced because the node is no longer part of the cluster
> ...
>> Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]: notice: Operation 'reboot'
>> [2323] (call 2 from crmd.1084) for host 'ArcosRhel1' with device 'Fence1'
>> returned: 0 (OK)
>> Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]: notice: Operation reboot of
>> ArcosRhel1 by ArcosRhel2 for crmd.1084@ArcosRhel2.0426e6e1: OK
>> Jul 17 13:33:50 ArcosRhel2 crmd[1084]: notice: Stonith operation
>> 2/12:0:0:f9418e1f-1f13-4033-9eaa-aec705f807ef: OK (0)
>> Jul 17 13:33:50 ArcosRhel2 crmd[1084]: notice: Peer ArcosRhel1 was
>> terminated (reboot) by ArcosRhel2 for ArcosRhel2: OK
> ...
>>
>> *LOGS from Node1:*
>>
>> Jul 17 13:33:26 ArcoSRhel1 corosync[1464]: [TOTEM ] A processor failed,
>> forming new configuration....
>> Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: warning: Node ArcosRhel2 will be
>> fenced because the node is no longer part of the cluster
> ...
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: warning: Mapping action='off'
>> to pcmk_reboot_action='off'
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: notice: Fence1 can not fence
>> (reboot) ArcosRhel2: static-list
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: notice: fence2 can fence
>> (reboot) ArcosRhel2: static-list
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: notice: Fence1 can not fence
>> (reboot) ArcosRhel2: static-list
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: notice: fence2 can fence
>> (reboot) ArcosRhel2: static-list
>> Jul 17 13:33:46 ArcoSRhel1 fence_vmware_soap: Unable to connect/login to
>> fencing device
>> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
>> fence_vmware_soap[7157] stderr: [ Unable to connect/login to fencing device ]
>> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
>> fence_vmware_soap[7157] stderr: [ ]
>> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
>> fence_vmware_soap[7157] stderr: [ ]
>>
>>>> See my config below:
>>>>
>>>> [root@ArcosRhel2 cluster]# pcs config
>>>> Cluster Name: ARCOSCLUSTER
>>>> Corosync Nodes:
>>>>  ArcosRhel1 ArcosRhel2
>>>> Pacemaker Nodes:
>>>>  ArcosRhel1 ArcosRhel2
>>>>
>>>> Resources:
>>>>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>>>>   Attributes: cidr_netmask=32 ip=172.16.10.243
>>>>   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
>>>>               start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>>>>               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>>>>
>>>> Stonith Devices:
>>>>  Resource: Fence1 (class=stonith type=fence_vmware_soap)
>>>>   Attributes: action=off ipaddr=172.16.10.151 login=admin passwd=123pass
>>>>    pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s port=ArcosRhel1(Joniel)
>>>>    ssl_insecure=1 pcmk_delay_max=0s
>>>>   Operations: monitor interval=60s (Fence1-monitor-interval-60s)
>>>>  Resource: fence2 (class=stonith type=fence_vmware_soap)
>>>>   Attributes: action=off ipaddr=172.16.10.152 login=admin passwd=123pass
>>>>    pcmk_delay_max=0s pcmk_host_list=ArcosRhel2 pcmk_monitor_timeout=60s
>>>>    port=ArcosRhel2(Ben) ssl_insecure=1
>>>>   Operations: monitor interval=60s (fence2-monitor-interval-60s)
>>>> Fencing Levels:
>>>>
>>>> Location Constraints:
>>>>   Resource: Fence1
>>>>     Enabled on: ArcosRhel2 (score:INFINITY)
>>>>     (id:location-Fence1-ArcosRhel2-INFINITY)
>>>>   Resource: fence2
>>>>     Enabled on: ArcosRhel1 (score:INFINITY)
>>>>     (id:location-fence2-ArcosRhel1-INFINITY)
>>>> Ordering Constraints:
>>>> Colocation Constraints:
>>>> Ticket Constraints:
>>>>
>>>> Alerts:
>>>>  No alerts defined
>>>>
>>>> Resources Defaults:
>>>>  No defaults set
>>>> Operations Defaults:
>>>>  No defaults set
>>>>
>>>> Cluster Properties:
>>>>  cluster-infrastructure: corosync
>>>>  cluster-name: ARCOSCLUSTER
>>>>  dc-version: 1.1.16-12.el7-94ff4df
>>>>  have-watchdog: false
>>>>  last-lrm-refresh: 1531810841
>>>>  stonith-enabled: true
>>>>
>>>> Quorum:
>>>>   Options:
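
One more thought on the "death match" point above: with pcmk_delay_max=0s on
both stonith devices, whichever node reacts first wins, so the outcome can
look pattern-like without being guaranteed. If I want a deterministic survivor
(for example the node currently holding ClusterIP), my understanding is that
the delay goes only on the device that fences the node I want to keep, so its
peer has to wait before shooting. A minimal sketch (the 10s value is
arbitrary, and I assume `pcs stonith update` is available on this pcs version):

    # Fence1 is the device that reboots ArcosRhel1. Delaying it gives
    # ArcosRhel1 a head start in a simultaneous shoot-out, so ArcosRhel1
    # becomes the likely survivor of a split.
    pcs stonith update Fence1 pcmk_delay_max=10s

Newer Pacemaker versions also have pcmk_delay_base for a fixed (non-random)
delay, but I am not sure it is present in 1.1.16.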
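
And regarding Node1 not rejoining the cluster after it has been fenced and
rebooted: that usually comes down to whether the cluster services are enabled
at boot on that particular node, independent of the fencing behaviour. Worth
comparing on both nodes (plain systemd/pcs commands, nothing specific to this
setup):

    # 'disabled' on Node1 and 'enabled' on Node2 would explain the difference.
    systemctl is-enabled corosync pacemaker

    # Run on a node to make it start and join the cluster automatically at boot.
    pcs cluster enable

Many setups deliberately leave autostart disabled so that a fenced node does
not rejoin before someone has had a look at it.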