Hi Stefan,
On 14/4/21 at 19:28, Stefan M. Radman wrote:
The redundant corosync rings would definitely have prevented the
fencing even in your scenario.
Yes that's for sure ;)
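For reference, a second ring is just an additional link per node in
/etc/pve/corosync.conf. A minimal sketch, assuming a second independent
network 192.168.92.0/24 for ring1 (addresses and cluster name are made
up; config_version in the totem section must also be bumped for the
change to propagate):

nodelist {
  node {
    name: proxmox1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.90.11
    ring1_addr: 192.168.92.11
  }
  # proxmox2 and proxmox3 likewise, each with ring0_addr + ring1_addr
}

totem {
  version: 2
  cluster_name: pvecluster
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}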
As a final note you should also consider replacing that 1GbE link
between the switches by an Nx1GbE bundle (LACP) for redundancy and
bandwidth reasons or at least by 2 x 1GbE secured by spanning tree (RSTP).
I think we should interlink the switches with SFP+. Backups don't need
that bandwidth, but the final say is not mine :(
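For reference, an LACP bundle between two of these Dell N-series
switches would be configured roughly like this on each side (a sketch
only; the SFP+ port range Te1/0/3-4 and the channel-group number are
assumptions, check the switch CLI guide):

configure
interface range Te1/0/3-4
channel-group 1 mode active
exit
interface port-channel 1
switchport mode trunk
exit

"channel-group 1 mode active" enables LACP on both member ports; the
same configuration has to be mirrored on the other switch.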
Thanks a lot
Eneko
Stefan
On Apr 14, 2021, at 18:26, Eneko Lacunza via pve-user
<[email protected]> wrote:
From: Eneko Lacunza <[email protected]>
Subject: Re: [PVE-User] PVE 6.2 Strange cluster node fence
Date: April 14, 2021 at 18:26:08 GMT+2
To: [email protected]
Hi,
So I have figured out what likely happened.
Indeed it was very likely network congestion: proxmox1 and proxmox2
were using one switch and proxmox3 the other, because proxmox1 and
proxmox2 had not properly loaded the bond-primary directive (the
primary slave was not shown in /proc/net/bonding/bond0, although it
was present in /etc/network/interfaces).
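A quick way to check whether the kernel actually picked up the primary
is the standard bonding procfs file:

# Show the configured primary and the currently active slave for bond0
grep -E 'Primary Slave|Currently Active Slave' /proc/net/bonding/bond0
# When correctly configured it reports something like:
#   Primary Slave: ens2f0np1 (primary_reselect always)
#   Currently Active Slave: ens2f0np1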
Additionally, I just found out that both switches are linked by a 1G
port, because the 4th SFP+ port is being used for the backup server...
(against my recommendation during the cluster setup, I must add...)
So very likely it was network congestion that kicked proxmox1 out of
the cluster.
It seems that bond directives should be present in the slave stanzas
too, like:
auto lo
iface lo inet loopback

iface ens2f0np0 inet manual
        bond-master bond0
        bond-primary ens2f0np1
# Switch2

iface ens2f1np1 inet manual
        bond-master bond0
        bond-primary ens2f0np1
# Switch1

iface eno1 inet manual

iface eno2 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves ens2f0np0 ens2f1np1
        bond-miimon 100
        bond-mode active-backup
        bond-primary ens2f0np1

auto bond0.91
iface bond0.91 inet static
        address 192.168.91.11
#Ceph

auto vmbr0
iface vmbr0 inet static
        address 192.168.90.11
        gateway 192.168.90.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
Otherwise, it seems the primary sometimes doesn't get configured properly...
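If a running bond is found without its primary, it can also be set on
the fly through the bonding sysfs interface, without restarting
networking (using the bond0/ens2f0np1 names from the config above):

# Set the primary slave of bond0 at runtime
echo ens2f0np1 > /sys/class/net/bond0/bonding/primary
# Verify it took effect
grep 'Primary Slave' /proc/net/bonding/bond0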
Thanks again Michael and Stefan!
Eneko
On 14/4/21 at 12:12, Eneko Lacunza via pve-user wrote:
Hi Michael,
On 14/4/21 at 11:21, Michael Rasmussen via pve-user wrote:
On Wed, 14 Apr 2021 11:04:10 +0200
Eneko Lacunza via pve-user <[email protected]> wrote:
Hi all,
Yesterday we had a strange fence happen in a PVE 6.2 cluster.
Cluster has 3 nodes (proxmox1, proxmox2, proxmox3) and has been
operating normally for a year. Last update was on January 21st 2021.
Storage is Ceph and nodes are connected to the same network switch
with active-passive bonds.
proxmox1 was fenced and automatically rebooted, then everything
recovered. HA restarted its VMs on other nodes too.
proxmox1 syslog: (no network link issues reported at device level)
I have seen this occasionally, and every time the cause was high
network load/congestion which caused a token timeout. The default
token timeout in corosync is IMHO very optimistically configured at
1000 ms, so I have changed this setting to 5000 ms, and since then I
have never seen fencing caused by network load/congestion again. You
could try this and see if it helps you.
PS. my cluster communication is on a dedicated gb bonded vlan.
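For reference, the timeout is the token parameter in the totem section
of corosync.conf; on PVE you would edit /etc/pve/corosync.conf and bump
config_version so the change propagates. A minimal sketch with the
5000 ms value suggested above (cluster name and version number are
made up):

totem {
  version: 2
  cluster_name: pvecluster
  config_version: 4
  token: 5000
}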
Thanks for the info. In this case the network is 10Gbit (I see I
didn't include this info), but only for the proxmox nodes:
- We have 2 Dell N1124T 24x1Gbit 4xSFP+ switches
- Both switches are interconnected with a SFP+ DAC
- Active-passive bonds in each proxmox node go to one SFP+ interface on
each switch. Primary interfaces are configured to be on the same switch.
- Connectivity to the LAN is done with a 1 Gbit link
- The Proxmox 2x10G bond is used for VM networking and the Ceph
public/private networks.
I wouldn't expect high network load/congestion because it's an
internal LAN with 1Gbit clients. No Ceph issues/backfilling were
occurring during the fence.
Network cards are Broadcom.
Thanks
Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project
Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/
_______________________________________________
pve-user mailing list
[email protected]
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user