Re: [ClusterLabs] Weird Fencing Behavior

2018-07-18 Thread Confidential Company


Re: [ClusterLabs] Weird Fencing Behavior

2018-07-18 Thread Klaus Wenninger
On 07/18/2018 06:22 AM, Andrei Borzenkov wrote:
> 18.07.2018 04:21, Confidential Company пишет:
 Hi,

 On my two-node active/passive setup, I configured fencing via
 fence_vmware_soap. I configured pcmk_delay=0 on both nodes so I
>>> expected
 that both nodes would be stonithed simultaneously.

 On my test scenario, Node1 has ClusterIP resource. When I
>>> disconnect
 service/corosync link physically, Node1 was fenced and Node2 stayed
>>> alive
 given pcmk_delay=0 on both nodes.

 Can you explain the behavior above?

>>> #node1 could not connect to ESX because links were disconnected. As
>>> the
>>> #most obvious explanation.
>>>
>>> #You have logs, you are the only one who can answer this question
>>> with
>>> #some certainty. Others can only guess.
>>>
>>>
>>> Oops, my bad. I forgot to mention: I have two interfaces on each virtual
>>> machine (node). The second interface is used for the ESX links, so fencing
>>> can be executed even though the corosync links are disconnected. Looking
>>> forward to your response. Thanks
>> #Having no fence delay means a death match (each node killing the other)
>> #is possible, but it doesn't guarantee that it will happen. Some of the
>> #time, one node will detect the outage and fence the other one before
>> #the other one can react.
>>
>> #It's basically an Old West shoot-out -- they may reach for their guns
>> #at the same time, but one may be quicker.
>>
>> #As Andrei suggested, the logs from both nodes could give you a timeline
>> #of what happened when.
>>
>>
>> Hi Andrei, kindly see the logs below. Based on the log timestamps, Node1
>> should have fenced Node2 first, but in the actual test, Node1 was
>> fenced/shut down by Node2.
>>
> Node1 tried to fence but failed. It could be connectivity, it could be
> credentials.
>
>> Is it possible to have a 2-node active/passive setup in pacemaker/corosync
>> where the node whose interface went down is the only one that
>> gets fenced?
>>
> If you could determine which node was disconnected you would not need
> any fencing at all.

True, but there is still good reason to take connectivity into account.
Of course the intended survivor can't directly know that its peer got
disconnected. What a node can do, though, is check whether it is
disconnected itself (e.g. a ping connection to routers, test access to
some web servers, ...) and then decide to shoot with a delay, or not to
shoot at all, because starting services locally would be no good anyway.
That is the basic idea behind the fence_heuristics_ping fence agent.
There was some discussion about approaches like that on the list just
recently.
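If your fence-agents version ships that agent, the idea can be sketched
with pcs roughly like this (the device name and gateway IP are
illustrative; check `pcs stonith describe fence_heuristics_ping` for the
exact parameters):

```shell
# A heuristics device placed before the real device in the same fencing
# level: a node that cannot ping the gateway itself fails the heuristic,
# so the whole level fails and it never fires the real shot.
pcs stonith create ping-ok fence_heuristics_ping ping_targets=172.16.10.1
pcs stonith level add 1 ArcosRhel1 ping-ok,Fence1
pcs stonith level add 1 ArcosRhel2 ping-ok,fence2
```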

Regards,
Klaus
 
>> Thanks guys
>>
>> *LOGS from Node2:*
>>
>> Jul 17 13:33:27 ArcosRhel2 corosync[1048]: [TOTEM ] A processor failed,
>> forming new configuration.
> ...
>> Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Node ArcosRhel1 will be
>> fenced because the node is no longer part of the cluster
> ...
>> Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]:  notice: Operation 'reboot'
>> [2323] (call 2 from crmd.1084) for host 'ArcosRhel1' with device 'Fence1'
>> returned: 0 (OK)
>> Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]:  notice: Operation reboot of
>> ArcosRhel1 by ArcosRhel2 for crmd.1084@ArcosRhel2.0426e6e1: OK
>> Jul 17 13:33:50 ArcosRhel2 crmd[1084]:  notice: Stonith operation
>> 2/12:0:0:f9418e1f-1f13-4033-9eaa-aec705f807ef: OK (0)
>> Jul 17 13:33:50 ArcosRhel2 crmd[1084]:  notice: Peer ArcosRhel1 was
>> terminated (reboot) by ArcosRhel2 for ArcosRhel2: OK
> ...
>>
>>
>> *LOGS from NODE1*
>> Jul 17 13:33:26 ArcoSRhel1 corosync[1464]: [TOTEM ] A processor failed,
>> forming new configuration
>> Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: warning: Node ArcosRhel2 will be
>> fenced because the node is no longer part of the cluster
> ...
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: warning: Mapping action='off'
>> to pcmk_reboot_action='off'
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: Fence1 can not fence
>> (reboot) ArcosRhel2: static-list
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: fence2 can fence
>> (reboot) ArcosRhel2: static-list
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: Fence1 can not fence
>> (reboot) ArcosRhel2: static-list
>> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: fence2 can fence
>> (reboot) ArcosRhel2: static-list
>> Jul 17 13:33:46 ArcoSRhel1 fence_vmware_soap: Unable to connect/login to
>> fencing device
>> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
>> fence_vmware_soap[7157] stderr: [ Unable to connect/login to fencing device
>> ]
>> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
>> fence_vmware_soap[7157] stderr: [  ]
>> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
>> fence_vmware_soap[7157] stderr: [  ]

Re: [ClusterLabs] Weird Fencing Behavior

2018-07-17 Thread Andrei Borzenkov
18.07.2018 04:21, Confidential Company пишет:
>>> Hi,
>>>
>>> On my two-node active/passive setup, I configured fencing via
>>> fence_vmware_soap. I configured pcmk_delay=0 on both nodes so I
>> expected
>>> that both nodes would be stonithed simultaneously.
>>>
>>> On my test scenario, Node1 has ClusterIP resource. When I
>> disconnect
>>> service/corosync link physically, Node1 was fenced and Node2 stayed
>> alive
>>> given pcmk_delay=0 on both nodes.
>>>
>>> Can you explain the behavior above?
>>>
>>
>> #node1 could not connect to ESX because links were disconnected. As
>> the
>> #most obvious explanation.
>>
>> #You have logs, you are the only one who can answer this question
>> with
>> #some certainty. Others can only guess.
>>
>>
>> Oops, my bad. I forgot to mention: I have two interfaces on each virtual
>> machine (node). The second interface is used for the ESX links, so fencing
>> can be executed even though the corosync links are disconnected. Looking
>> forward to your response. Thanks
> 
> #Having no fence delay means a death match (each node killing the other)
> #is possible, but it doesn't guarantee that it will happen. Some of the
> #time, one node will detect the outage and fence the other one before
> #the other one can react.
> 
> #It's basically an Old West shoot-out -- they may reach for their guns
> #at the same time, but one may be quicker.
> 
> #As Andrei suggested, the logs from both nodes could give you a timeline
> #of what happened when.
> 
> 
> Hi Andrei, kindly see the logs below. Based on the log timestamps, Node1
> should have fenced Node2 first, but in the actual test, Node1 was
> fenced/shut down by Node2.
> 

Node1 tried to fence but failed. It could be connectivity, it could be
credentials.
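One quick way to tell which, from Node1, is to run the fence agent by
hand with the same parameters as the fence2 resource in the posted
config (a sketch using the standard fence-agent long options; see
`fence_vmware_soap --help`):

```shell
# Query the power status of Node2's VM via vCenter/ESXi. If this fails
# with "Unable to connect/login to fencing device", the problem is
# network reachability or credentials, not Pacemaker itself.
fence_vmware_soap --ip=172.16.10.152 --username=admin --password=123pass \
        --ssl --ssl-insecure --action=status --plug="ArcosRhel2(Ben)"
```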

> Is it possible to have a 2-node active/passive setup in pacemaker/corosync
> where the node whose interface went down is the only one that
> gets fenced?
> 

If you could determine which node was disconnected you would not need
any fencing at all.

> Thanks guys
> 
> *LOGS from Node2:*
> 
> Jul 17 13:33:27 ArcosRhel2 corosync[1048]: [TOTEM ] A processor failed,
> forming new configuration.
...
> Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Node ArcosRhel1 will be
> fenced because the node is no longer part of the cluster
...
> Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]:  notice: Operation 'reboot'
> [2323] (call 2 from crmd.1084) for host 'ArcosRhel1' with device 'Fence1'
> returned: 0 (OK)
> Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]:  notice: Operation reboot of
> ArcosRhel1 by ArcosRhel2 for crmd.1084@ArcosRhel2.0426e6e1: OK
> Jul 17 13:33:50 ArcosRhel2 crmd[1084]:  notice: Stonith operation
> 2/12:0:0:f9418e1f-1f13-4033-9eaa-aec705f807ef: OK (0)
> Jul 17 13:33:50 ArcosRhel2 crmd[1084]:  notice: Peer ArcosRhel1 was
> terminated (reboot) by ArcosRhel2 for ArcosRhel2: OK
...
> 
> 
> 
> *LOGS from NODE1*
> Jul 17 13:33:26 ArcoSRhel1 corosync[1464]: [TOTEM ] A processor failed,
> forming new configuration
> Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: warning: Node ArcosRhel2 will be
> fenced because the node is no longer part of the cluster
...
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: warning: Mapping action='off'
> to pcmk_reboot_action='off'
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: Fence1 can not fence
> (reboot) ArcosRhel2: static-list
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: fence2 can fence
> (reboot) ArcosRhel2: static-list
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: Fence1 can not fence
> (reboot) ArcosRhel2: static-list
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: fence2 can fence
> (reboot) ArcosRhel2: static-list
> Jul 17 13:33:46 ArcoSRhel1 fence_vmware_soap: Unable to connect/login to
> fencing device
> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
> fence_vmware_soap[7157] stderr: [ Unable to connect/login to fencing device
> ]
> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
> fence_vmware_soap[7157] stderr: [  ]
> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
> fence_vmware_soap[7157] stderr: [  ]

Re: [ClusterLabs] Weird Fencing Behavior

2018-07-17 Thread Confidential Company
> > Hi,
> >
> > On my two-node active/passive setup, I configured fencing via
> > fence_vmware_soap. I configured pcmk_delay=0 on both nodes so I
> expected
> > that both nodes would be stonithed simultaneously.
> >
> > On my test scenario, Node1 has ClusterIP resource. When I
> disconnect
> > service/corosync link physically, Node1 was fenced and Node2 stayed
> alive
> > given pcmk_delay=0 on both nodes.
> >
> > Can you explain the behavior above?
> >
>
> #node1 could not connect to ESX because links were disconnected. As
> the
> #most obvious explanation.
>
> #You have logs, you are the only one who can answer this question
> with
> #some certainty. Others can only guess.
>
>
> Oops, my bad. I forgot to mention: I have two interfaces on each virtual
> machine (node). The second interface is used for the ESX links, so fencing
> can be executed even though the corosync links are disconnected. Looking
> forward to your response. Thanks

#Having no fence delay means a death match (each node killing the other)
#is possible, but it doesn't guarantee that it will happen. Some of the
#time, one node will detect the outage and fence the other one before
#the other one can react.

#It's basically an Old West shoot-out -- they may reach for their guns
#at the same time, but one may be quicker.

#As Andrei suggested, the logs from both nodes could give you a timeline
#of what happened when.


Hi Andrei, kindly see the logs below. Based on the log timestamps, Node1
should have fenced Node2 first, but in the actual test, Node1 was
fenced/shut down by Node2.

Is it possible to have a 2-node active/passive setup in pacemaker/corosync
where the node whose interface went down is the only one that
gets fenced?

Thanks guys

*LOGS from Node2:*

Jul 17 13:33:27 ArcosRhel2 corosync[1048]: [TOTEM ] A processor failed,
forming new configuration.
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [TOTEM ] A new membership (
172.16.10.242:220) was formed. Members left: 1
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [TOTEM ] Failed to receive the
leave message. failed: 1
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [QUORUM] Members[1]: 2
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [MAIN  ] Completed service
synchronization, ready to provide service.
Jul 17 13:33:28 ArcosRhel2 attrd[1082]:  notice: Node ArcosRhel1 state is
now lost
Jul 17 13:33:28 ArcosRhel2 attrd[1082]:  notice: Removing all ArcosRhel1
attributes for peer loss
Jul 17 13:33:28 ArcosRhel2 attrd[1082]:  notice: Lost attribute writer
ArcosRhel1
Jul 17 13:33:28 ArcosRhel2 attrd[1082]:  notice: Purged 1 peers with id=1
and/or uname=ArcosRhel1 from the membership cache
Jul 17 13:33:28 ArcosRhel2 cib[1079]:  notice: Node ArcosRhel1 state is now
lost
Jul 17 13:33:28 ArcosRhel2 cib[1079]:  notice: Purged 1 peers with id=1
and/or uname=ArcosRhel1 from the membership cache
Jul 17 13:33:28 ArcosRhel2 crmd[1084]:  notice: Node ArcosRhel1 state is
now lost
Jul 17 13:33:28 ArcosRhel2 crmd[1084]: warning: Our DC node (ArcosRhel1)
left the cluster
Jul 17 13:33:28 ArcosRhel2 pacemakerd[1074]:  notice: Node ArcosRhel1 state
is now lost
Jul 17 13:33:28 ArcosRhel2 stonith-ng[1080]:  notice: Node ArcosRhel1 state
is now lost
Jul 17 13:33:28 ArcosRhel2 stonith-ng[1080]:  notice: Purged 1 peers with
id=1 and/or uname=ArcosRhel1 from the membership cache
Jul 17 13:33:28 ArcosRhel2 crmd[1084]:  notice: State transition S_NOT_DC
-> S_ELECTION
Jul 17 13:33:28 ArcosRhel2 crmd[1084]:  notice: State transition S_ELECTION
-> S_INTEGRATION
Jul 17 13:33:28 ArcosRhel2 crmd[1084]: warning: Input I_ELECTION_DC
received in state S_INTEGRATION from do_election_check
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Node ArcosRhel1 will be
fenced because the node is no longer part of the cluster
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Node ArcosRhel1 is
unclean
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Action fence2_stop_0 on
ArcosRhel1 is unrunnable (offline)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Action ClusterIP_stop_0
on ArcosRhel1 is unrunnable (offline)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Scheduling Node
ArcosRhel1 for STONITH
Jul 17 13:33:30 ArcosRhel2 pengine[1083]:  notice: Move
 fence2#011(Started ArcosRhel1 -> ArcosRhel2)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]:  notice: Move
 ClusterIP#011(Started ArcosRhel1 -> ArcosRhel2)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Calculated transition 0
(with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-20.bz2
Jul 17 13:33:30 ArcosRhel2 crmd[1084]:  notice: Requesting fencing (reboot)
of node ArcosRhel1
Jul 17 13:33:30 ArcosRhel2 crmd[1084]:  notice: Initiating start operation
fence2_start_0 locally on ArcosRhel2
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]:  notice: Client
crmd.1084.cd70178e wants to fence (reboot) 'ArcosRhel1' with device '(any)'
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]:  notice: Requesting peer
fencing (reboot) of ArcosRhel1
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]:  notice: Fence1 

Re: [ClusterLabs] Weird Fencing Behavior

2018-07-17 Thread Ken Gaillot
On Tue, 2018-07-17 at 21:29 +0800, Confidential Company wrote:
> 
> > Hi,
> >
> > On my two-node active/passive setup, I configured fencing via
> > fence_vmware_soap. I configured pcmk_delay=0 on both nodes so I
> expected
> > that both nodes would be stonithed simultaneously.
> >
> > On my test scenario, Node1 has ClusterIP resource. When I
> disconnect
> > service/corosync link physically, Node1 was fenced and Node2 stayed
> alive
> > given pcmk_delay=0 on both nodes.
> >
> > Can you explain the behavior above?
> >
> 
> #node1 could not connect to ESX because links were disconnected. As
> the
> #most obvious explanation.
> 
> #You have logs, you are the only one who can answer this question
> with
> #some certainty. Others can only guess.
> 
> 
> Oops, my bad. I forgot to mention: I have two interfaces on each virtual
> machine (node). The second interface is used for the ESX links, so fencing
> can be executed even though the corosync links are disconnected. Looking
> forward to your response. Thanks

Having no fence delay means a death match (each node killing the other)
is possible, but it doesn't guarantee that it will happen. Some of the
time, one node will detect the outage and fence the other one before
the other one can react.

It's basically an Old West shoot-out -- they may reach for their guns
at the same time, but one may be quicker.

As Andrei suggested, the logs from both nodes could give you a timeline
of what happened when.
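Given the posted config, one way to break the tie is to put a delay on
exactly one device, so the shoot-out has a predictable winner (a sketch;
pcmk_delay_max is a random delay, and newer Pacemaker releases also
offer a static pcmk_delay_base):

```shell
# Fence1 fences ArcosRhel1, so delaying it gives ArcosRhel1 a head start:
# in a true split, ArcosRhel2 waits up to 10s before shooting, while
# ArcosRhel1 shoots immediately and tends to survive.
pcs stonith update Fence1 pcmk_delay_max=10s
pcs stonith update fence2 pcmk_delay_max=0s
```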

> > See my config below:
> >
> > [root@ArcosRhel2 cluster]# pcs config
> > Cluster Name: ARCOSCLUSTER
> > Corosync Nodes:
> >  ArcosRhel1 ArcosRhel2
> > Pacemaker Nodes:
> >  ArcosRhel1 ArcosRhel2
> >
> > Resources:
> >  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
> >   Attributes: cidr_netmask=32 ip=172.16.10.243
> >   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
> >               start interval=0s timeout=20s (ClusterIP-start-
> interval-0s)
> >               stop interval=0s timeout=20s (ClusterIP-stop-
> interval-0s)
> >
> > Stonith Devices:
> >  Resource: Fence1 (class=stonith type=fence_vmware_soap)
> >   Attributes: action=off ipaddr=172.16.10.151 login=admin
> passwd=123pass
> > pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s
> port=ArcosRhel1(Joniel)
> > ssl_insecure=1 pcmk_delay_max=0s
> >   Operations: monitor interval=60s (Fence1-monitor-interval-60s)
> >  Resource: fence2 (class=stonith type=fence_vmware_soap)
> >   Attributes: action=off ipaddr=172.16.10.152 login=admin
> passwd=123pass
> > pcmk_delay_max=0s pcmk_host_list=ArcosRhel2
> pcmk_monitor_timeout=60s
> > port=ArcosRhel2(Ben) ssl_insecure=1
> >   Operations: monitor interval=60s (fence2-monitor-interval-60s)
> > Fencing Levels:
> >
> > Location Constraints:
> >   Resource: Fence1
> >     Enabled on: ArcosRhel2 (score:INFINITY)
> > (id:location-Fence1-ArcosRhel2-INFINITY)
> >   Resource: fence2
> >     Enabled on: ArcosRhel1 (score:INFINITY)
> > (id:location-fence2-ArcosRhel1-INFINITY)
> > Ordering Constraints:
> > Colocation Constraints:
> > Ticket Constraints:
> >
> > Alerts:
> >  No alerts defined
> >
> > Resources Defaults:
> >  No defaults set
> > Operations Defaults:
> >  No defaults set
> >
> > Cluster Properties:
> >  cluster-infrastructure: corosync
> >  cluster-name: ARCOSCLUSTER
> >  dc-version: 1.1.16-12.el7-94ff4df
> >  have-watchdog: false
> >  last-lrm-refresh: 1531810841
> >  stonith-enabled: true
> >
> > Quorum:
> >   Options:
> >
> >
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
-- 
Ken Gaillot 


Re: [ClusterLabs] Weird Fencing Behavior?

2018-07-17 Thread Andrei Borzenkov
On Tue, Jul 17, 2018 at 10:58 AM, Confidential Company
 wrote:
> Hi,
>
> On my two-node active/passive setup, I configured fencing via
> fence_vmware_soap. I configured pcmk_delay=0 on both nodes so I expected
> that both nodes would be stonithed simultaneously.
>
> On my test scenario, Node1 has ClusterIP resource. When I disconnect
> service/corosync link physically, Node1 was fenced and Node2 stayed alive
> given pcmk_delay=0 on both nodes.
>
> Can you explain the behavior above?
>

The most obvious explanation is that node1 could not connect to ESX
because the links were disconnected.

You have logs, you are the only one who can answer this question with
some certainty. Others can only guess.
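For example, something along these lines on each node would pull the
relevant timeline (a sketch, assuming default RHEL 7 logging; adjust the
time window to the incident):

```shell
# Fencing-related messages from syslog around the incident
grep -E 'stonith-ng|fence_vmware_soap|pengine|TOTEM' /var/log/messages
# Or the same from the systemd journal
journalctl --since '2018-07-17 13:30' --until '2018-07-17 13:40' \
        | grep -iE 'stonith|fence'
```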

>
>
> See my config below:
>
> [root@ArcosRhel2 cluster]# pcs config
> Cluster Name: ARCOSCLUSTER
> Corosync Nodes:
>  ArcosRhel1 ArcosRhel2
> Pacemaker Nodes:
>  ArcosRhel1 ArcosRhel2
>
> Resources:
>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: cidr_netmask=32 ip=172.16.10.243
>   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
>   start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>   stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>
> Stonith Devices:
>  Resource: Fence1 (class=stonith type=fence_vmware_soap)
>   Attributes: action=off ipaddr=172.16.10.151 login=admin passwd=123pass
> pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s port=ArcosRhel1(Joniel)
> ssl_insecure=1 pcmk_delay_max=0s
>   Operations: monitor interval=60s (Fence1-monitor-interval-60s)
>  Resource: fence2 (class=stonith type=fence_vmware_soap)
>   Attributes: action=off ipaddr=172.16.10.152 login=admin passwd=123pass
> pcmk_delay_max=0s pcmk_host_list=ArcosRhel2 pcmk_monitor_timeout=60s
> port=ArcosRhel2(Ben) ssl_insecure=1
>   Operations: monitor interval=60s (fence2-monitor-interval-60s)
> Fencing Levels:
>
> Location Constraints:
>   Resource: Fence1
> Enabled on: ArcosRhel2 (score:INFINITY)
> (id:location-Fence1-ArcosRhel2-INFINITY)
>   Resource: fence2
> Enabled on: ArcosRhel1 (score:INFINITY)
> (id:location-fence2-ArcosRhel1-INFINITY)
> Ordering Constraints:
> Colocation Constraints:
> Ticket Constraints:
>
> Alerts:
>  No alerts defined
>
> Resources Defaults:
>  No defaults set
> Operations Defaults:
>  No defaults set
>
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: ARCOSCLUSTER
>  dc-version: 1.1.16-12.el7-94ff4df
>  have-watchdog: false
>  last-lrm-refresh: 1531810841
>  stonith-enabled: true
>
> Quorum:
>   Options:
>
>
>
>