Re: [ClusterLabs] Linux 8.2 - high totem token requires manual setting of ping_interval and ping_timeout

2020-06-25 Thread Jan Friesse

Robert,
thank you for the info/report. More comments inside.


All,
Hello.  Hope all is well.   I have been researching Oracle Linux 8.2 and ran 
across a situation that is not well documented.   I decided to provide some 
details to the community in case I am missing something.

Basically, if you increase the totem token above approximately 33000 with the
knet transport, then a two-node cluster will not properly form.  The exact
threshold value fluctuates slightly, depending on hardware type and debugging,
but it will consistently fail above 4.


At least corosync with a 40 sec token timeout works just fine for me.

# corosync-cmapctl  | grep token
runtime.config.totem.token (u32) = 40650

# corosync-quorumtool
Quorum information
--
Date: Fri Jun 26 08:45:12 2020
Quorum provider:  corosync_votequorum
Nodes:2
Node ID:  1
Ring ID:  1.11be1
Quorate:  Yes

Votequorum information
--
Expected votes:   3
Highest expected: 3
Total votes:  2
Quorum:   2
Flags:Quorate

Membership information
--
Nodeid  Votes Name
 1  1 vmvlan-vmcos8-n05 (local)
 6  1 vmvlan-vmcos8-n06


It is indeed true that forming the membership took a bit more time (30 sec
to be more precise)




The failure to form a cluster would occur when running the "pcs cluster start
--all" command, or if I started one node, let it stabilize, and then started
the second.  When it fails to form a cluster, each side says it is ONLINE but
the other side is UNCLEAN (offline) (cluster state: partition WITHOUT quorum).
If I define proper stonith resources, they will not fence, since the cluster
never makes it to an initial quorum state.  So the cluster will stay in this
split state indefinitely.


Maybe some timeout in pcs?



After changing the transport back to udpu or udp, the higher totem token
values worked as expected.


Yup. You've correctly found out that the knet_* timeouts help. Basically,
knet does not consider a link up until it receives enough pongs. UDP/UDPU
doesn't have this concept, so it will form the cluster faster.
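
For anyone reproducing this, the knet link state can be watched while the
nodes try to form membership; this is only a convenience wrapper around the
same corosync-cfgtool mentioned further below:

# watch the per-node knet link status every 2 seconds (run on either node)
watch -n 2 corosync-cfgtool -s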




 From the debug logging, I suspect that the Election Trigger (20 seconds)
fires before all nodes are properly identified by the knet transport.  I
noticed that with a totem token above 32 seconds, the knet_ping* defaults
were pushing up against that 20 second mark.  The output of
"corosync-cfgtool -s" will show each node's link as enabled, but each side
will state that the other side's link is not connected.  Since each side
thinks the other node is not active, they fail to properly send a join
message to the other node during the election.  They will essentially form
a singleton cluster(??).


Your analysis up to this point is correct. Corosync is really unable to
send the join message and forms a single-node cluster.
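
For what it's worth, here is a rough back-of-the-envelope sketch of why the
defaults collide with the 20 second election trigger. It assumes the knet
defaults are derived from the token as ping_interval = token / (pong_count * 2)
and ping_timeout = token / pong_count, with the default pong_count of 2;
please double check corosync.conf(5) for your version before relying on it:

# plain shell arithmetic; the derivation above is an assumption
echo $(( 41000 / (2 * 2) ))  # assumed default ping_interval for token=41000: 10250 ms
echo $(( 41000 / 2 ))        # assumed default ping_timeout  for token=41000: 20500 ms
# A link needs pong_count (2) pongs before it is considered connected, so it
# can take roughly 2 * 10250 ms, about 20.5 s, to come up. That is already
# past the 20 second election trigger, which matches the behaviour above.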



It is more puzzling when you start one node at a time, waiting for the node to 
stabilize before starting the other.   It is like the first node will never see 
the remote knet interfaces become active, regardless of how long you wait.


This shouldn't happen. Knet will eventually receive enough pongs, so
corosync broadcasts a message to the other nodes, which then find out that
a new membership should be formed.




The solution is to manually set the knet ping_timeout and ping_interval to
lower values than the defaults derived from the totem token.  This seems to
allow the knet transport to determine the link status of all nodes before
the election timer pops.


These default timeouts are indeed not the best ones. I have had a few ideas
on how to improve them, because currently they favor clusters with multiple
links. Single-link clusters may work better with slightly different
defaults.
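
For clusters that are not managed with pcs, a minimal sketch of the
equivalent corosync.conf fragment (option names as documented in
corosync.conf(5); the values mirror the pcs workaround quoted further below,
and everything else is only illustrative):

totem {
    version: 2
    cluster_name: rhcs_test
    transport: knet
    token: 61000
    interface {
        linknumber: 0
        knet_ping_interval: 1250
        knet_ping_timeout: 2500
    }
}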




I tested this on both physical hardware and with VMs.  Both react similarly.

Bare bones test case to reproduce:
yum install pcs pacemaker fence-agents-all
firewall-cmd --permanent --add-service=high-availability
firewall-cmd --add-service=high-availability
systemctl start pcsd.service
systemctl enable pcsd.service
systemctl disable corosync
systemctl disable pacemaker
passwd hacluster
pcs host auth node1 node2
pcs cluster setup rhcs_test node1 node2 totem token=41000
pcs cluster start --all
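
After a minute or two, the split described above can be confirmed from either
node with the standard status commands (the exact output wording will vary):

# each node should report itself ONLINE and the peer UNCLEAN (offline),
# with corosync reporting a partition WITHOUT quorum
pcs status
corosync-quorumtool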

Example command to create a cluster that will properly form and reach quorum:
pcs cluster setup rhcs_test node1 node2 totem token=61000 transport knet \
    link ping_interval=1250 ping_timeout=2500
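
Once the cluster with the explicit ping values is up, the effective token and
the quorum/link state can be checked the same way as earlier in this thread:

# confirm the runtime token value, quorum state and knet link status
corosync-cmapctl | grep token
corosync-quorumtool
corosync-cfgtool -s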

Hope this helps someone in the future.


Yup. It is an interesting finding, and thanks for that.

Regards,
  Honza



Thanks
Robert


Robert Hayden | Lead Technology Architect | Cerner Corporation



Re: [ClusterLabs] Beginner with STONITH Problem

2020-06-25 Thread Strahil Nikolov
Hi Stefan,

this sounds like a firewall issue.

Check that port 1229/udp is open on the hypervisors and 1229/tcp on the VMs.

P.S.: The protocols are based on my fading memory, so double check them.
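
If it is the firewall, roughly the following should cover it. This is only a
sketch, so verify the ports and protocols against your fence_virt listener
configuration before relying on it:

# on the CentOS KVM hosts (firewalld)
firewall-cmd --permanent --add-port=1229/udp
firewall-cmd --reload

# on the Ubuntu VMs (ufw)
ufw allow 1229/tcp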

Best Regards,
Strahil Nikolov

On 25 June 2020 at 18:18:46 GMT+03:00, "stefan.schm...@farmpartner-tec.com"
wrote:
>Hello,
>
>I have now tried to use that "how to" to make things work. Sadly I have
>run into a couple of problems.
>
>I have installed and configured fence_xvm like it was told in the
>walk-through, but as expected fence_virtd does not find all VMs, only
>the one installed on the host itself.
>In the configuration I have chosen "bond0" as the listener's interface
>since the hosts have bonding configured. I have appended the complete
>fence_virt.conf at the end of the mail.
>All 4 servers, CentOS hosts and Ubuntu VMs, are in the same network.
>Also, the generated key is present on all 4 servers.
>
>Still, the "fence_xvm -o list" command only results in showing the
>local VM:
>
># fence_xvm -o list
>kvm101   beee402d-c6ac-4df4-9b97-bd84e637f2e7 on
>
>I have tried the "Alternative configuration for guests running on
>multiple hosts", but this fails right from the start, because the
>libvirt-qpid packages are not available:
>
># yum install -y libvirt-qpid qpidd
>[...]
>No package libvirt-qpid available.
>No package qpidd available.
>
>Could anyone please advise on how to proceed to get both nodes
>recognized by the CentOS hosts? As a side note, all 4 servers can ping
>each other, so they are present and available in the same network.
>
>In addition, I can't seem to find the correct packages for Ubuntu 18.04
>to install on the VMs. Trying to install fence_virt and/or fence_xvm
>just results in "E: Unable to locate package fence_xvm/fence_virt".
>Are those packages available at all for Ubuntu 18.04? I could only find
>them for 20.04, or are they just called something completely different
>so that I am not able to find them?
>
>Thank you in advance for your help!
>
>Kind regards
>Stefan Schmitz
>
>
>The current /etc/fence_virt.conf:
>
>fence_virtd {
> listener = "multicast";
> backend = "libvirt";
> module_path = "/usr/lib64/fence-virt";
>}
>
>listeners {
> multicast {
> key_file = "/etc/cluster/fence_xvm.key";
> address = "225.0.0.12";
> interface = "bond0";
> family = "ipv4";
> port = "1229";
> }
>
>}
>
>backends {
> libvirt {
> uri = "qemu:///system";
> }
>
>}
>
Re: [ClusterLabs] Beginner with STONITH Problem

2020-06-25 Thread stefan.schm...@farmpartner-tec.com

Hello and thank you both for the help,

>> Are the VMs in the same VLAN like the hosts?
Yes, the VMs and hosts are all in the same VLAN. So I will try the
fence_xvm solution.
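
For anyone following the same path, here is a minimal sketch of the host-side
preparation from the guest-fencing walk-through linked below. Package and
service names are the usual CentOS ones, so treat them as assumptions and
follow the wiki page for the authoritative steps:

# on each CentOS KVM host
yum install -y fence-virt fence-virtd fence-virtd-libvirt fence-virtd-multicast
mkdir -p /etc/cluster
dd if=/dev/urandom of=/etc/cluster/fence_xvm.key bs=4k count=1
fence_virtd -c                     # interactive configuration of listener/backend
systemctl enable --now fence_virtd
# then copy /etc/cluster/fence_xvm.key to the other host and to both VMs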


> https://wiki.clusterlabs.org/wiki/Guest_Fencing
Thank you for the pointer to that walk-through. Sadly, every VM is on its
own host, which is marked as "Not yet supported", but this how-to is still
a good starting point and I will try to work and tweak my way through it
for our setup.
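
Once fence_virtd answers on both hosts, a minimal sketch of what the eventual
stonith resource could look like; the resource name and the node-to-VM mapping
here are purely illustrative and need to be adapted:

# map each cluster node name to the libvirt domain name of its VM
pcs stonith create fence_kvm fence_xvm \
    pcmk_host_map="server2ubuntu1:kvm101;server4ubuntu1:kvm102" \
    key_file=/etc/cluster/fence_xvm.key
pcs property set stonith-enabled=true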


Thanks again!

Kind regards
Stefan Schmitz

On 24.06.2020 at 15:51, Ken Gaillot wrote:

On Wed, 2020-06-24 at 15:47 +0300, Strahil Nikolov wrote:

Hello Stefan,

There are multiple options for stonith, but it depends on the
environment.
Are the VMs in the same VLAN as the hosts? I am asking this, as the most
popular candidate is 'fence_xvm', but it requires the VM to send the
fencing request to the KVM host (multicast) where the partner VM is
hosted.


FYI a fence_xvm walk-through for the simple case is available on the
ClusterLabs wiki:

https://wiki.clusterlabs.org/wiki/Guest_Fencing


Another approach is to use a shared disk (either over iSCSI or SAN) and
use sbd for power-based fencing, or use SCSI-3 Persistent Reservations
(which can also be converted into power-based fencing).


Best Regards,
Strahil Nikolov


On 24 June 2020 at 13:44:27 GMT+03:00,
"stefan.schm...@farmpartner-tec.com" <stefan.schm...@farmpartner-tec.com>
wrote:

Hello,

I am an absolute beginner trying to set up our first HA cluster.
So far I have been working with the "Pacemaker 1.1 Clusters from
Scratch" guide, which worked for me perfectly up to the point where I
need to install and configure STONITH.

Current situation: 2 Ubuntu servers as the cluster. Both of those
servers are virtual machines running on 2 CentOS KVM hosts.
Those are the devices or resources we can use for a STONITH
implementation. In this and other guides I read a lot about external
devices, and in the "pcs stonith list" output there are some Xen agents,
but sadly I cannot find anything about KVM. At this point I am stumped
and have no clue how to proceed; I am not even sure what further
information I should provide that would be useful for giving advice.

The current pcs status is:

# pcs status
Cluster name: pacemaker_cluster
WARNING: corosync and pacemaker node names do not match (IPs used
in
setup?)
Stack: corosync
Current DC: server2ubuntu1 (version 1.1.18-2b07d5c5a9) - partition
with

quorum
Last updated: Wed Jun 24 12:43:24 2020
Last change: Wed Jun 24 12:35:17 2020 by root via cibadmin on
server4ubuntu1

2 nodes configured
12 resources configured

Online: [ server2ubuntu1 server4ubuntu1 ]

Full list of resources:

  Master/Slave Set: r0_pacemaker_Clone [r0_pacemaker]
  Masters: [ server4ubuntu1 ]
  Slaves: [ server2ubuntu1 ]
  Clone Set: dlm-clone [dlm]
  Stopped: [ server2ubuntu1 server4ubuntu1 ]
  Clone Set: ClusterIP-clone [ClusterIP] (unique)
  ClusterIP:0(ocf::heartbeat:IPaddr2):   Started
server4ubuntu1
  ClusterIP:1(ocf::heartbeat:IPaddr2):   Started
server4ubuntu1
  Master/Slave Set: WebDataClone [WebData]
  Masters: [ server2ubuntu1 server4ubuntu1 ]
  Clone Set: WebFS-clone [WebFS]
  Stopped: [ server2ubuntu1 server4ubuntu1 ]
  Clone Set: WebSite-clone [WebSite]
  Stopped: [ server2ubuntu1 server4ubuntu1 ]

Failed Actions:
* dlm_start_0 on server2ubuntu1 'not configured' (6): call=437,
status=complete, exitreason='',
 last-rc-change='Wed Jun 24 12:35:30 2020', queued=0ms,
exec=86ms
* r0_pacemaker_monitor_6 on server2ubuntu1 'master' (8):
call=438,
status=complete, exitreason='',
 last-rc-change='Wed Jun 24 12:36:30 2020', queued=0ms, exec=0ms
* dlm_start_0 on server4ubuntu1 'not configured' (6): call=441,
status=complete, exitreason='',
 last-rc-change='Wed Jun 24 12:35:30 2020', queued=0ms,
exec=74ms


Daemon Status:
   corosync: active/disabled
   pacemaker: active/disabled
   pcsd: active/enabled



I have researched the shown dlm problem, but everything I have found
says that configuring STONITH would solve that issue.
Could someone please advise on how to proceed?

Thank you in advance!

Kind regards
Stefan Schmitz


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/