Hi,

So I have figured out what likely happened.

Indeed, it was very likely network congestion: proxmox1 and proxmox2 were using one switch and proxmox3 the other, because proxmox1 and proxmox2 had not properly applied the bond-primary directive (the primary slave was not shown in /proc/net/bonding/bond0, even though it was present in /etc/network/interfaces).
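As a quick sanity check (just a sketch, assuming the standard layout of the bonding driver's proc file), something like this can confirm whether the kernel actually picked up the primary:

#!/usr/bin/env python3
# Sketch: report the primary and currently active slave of a bond, so a
# missing bond-primary (as described above) is easy to spot.
# Assumes the usual "Primary Slave:" / "Currently Active Slave:" lines in
# /proc/net/bonding/<bond>.
import sys

bond = sys.argv[1] if len(sys.argv) > 1 else "bond0"

with open(f"/proc/net/bonding/{bond}") as f:
    lines = [line.strip() for line in f]

def field(prefix):
    for line in lines:
        if line.startswith(prefix + ":"):
            return line.split(":", 1)[1].strip()
    return None

primary = field("Primary Slave")
active = field("Currently Active Slave")
print(f"{bond}: primary slave = {primary}, currently active slave = {active}")
if primary in (None, "None"):
    print("WARNING: no primary slave set, bond-primary was not applied")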

Additionally, I just found out that both switches are linked by a 1G port, because the 4th SFP+ port is being used for the backup server... (against my recommendation during the cluster setup, I must add...)

So very likely it was network congestion that kicked proxmox1 out of the cluster.

It seems that the bond directives should be present in the slave interfaces too, like:

auto lo
iface lo inet loopback

iface ens2f0np0 inet manual
    bond-master bond0
    bond-primary ens2f0np1
# Switch2

iface ens2f1np1 inet manual
    bond-master bond0
    bond-primary ens2f0np1
# Switch1

iface eno1 inet manual

iface eno2 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves ens2f0np0 ens2f1np1
    bond-miimon 100
    bond-mode active-backup
    bond-primary ens2f0np1

auto bond0.91
iface bond0.91 inet static
    address 192.168.91.11
#Ceph

auto vmbr0
iface vmbr0 inet static
    address 192.168.90.11
    gateway 192.168.90.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0

Otherwise, it seems the primary sometimes doesn't get configured properly...
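If a node is already up with the primary unset, it can also be (re)applied at runtime through the bonding sysfs interface instead of restarting the bond. A minimal sketch (run as root; the interface name is just the one from the config above):

#!/usr/bin/env python3
# Sketch: set the primary slave of a running bond via sysfs, for the case
# where ifupdown did not apply bond-primary at boot. Must be run as root.
import sys

if len(sys.argv) != 3:
    sys.exit(f"usage: {sys.argv[0]} BOND PRIMARY_IFACE   (e.g. bond0 ens2f0np1)")

bond, primary = sys.argv[1], sys.argv[2]
with open(f"/sys/class/net/{bond}/bonding/primary", "w") as f:
    f.write(primary + "\n")
print(f"{bond}: primary slave set to {primary}")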

Thanks again Michael and Stefan!
Eneko


On 14/4/21 at 12:12, Eneko Lacunza via pve-user wrote:
Hi Michael,

On 14/4/21 at 11:21, Michael Rasmussen via pve-user wrote:
On Wed, 14 Apr 2021 11:04:10 +0200
Eneko Lacunza via pve-user <[email protected]> wrote:

Hi all,

Yesterday we had a strange fence happen in a PVE 6.2 cluster.

Cluster has 3 nodes (proxmox1, proxmox2, proxmox3) and has been
operating normally for a year. Last update was on January 21st 2021.
Storage is Ceph and nodes are connected to the same network switch
with active-passive bonds.

proxmox1 was fenced and automatically rebooted, then everything
recovered. HA restarted the VMs on other nodes too.

proxmox1 syslog: (no network link issues reported at device level)
I have seen this occasionally, and every time the cause was high network load/congestion, which caused a token timeout. The default token timeout in corosync is, IMHO, very optimistically set to 1000 ms, so I changed it to 5000 ms, and since doing that I have never seen fencing caused by network load/congestion again. You could try this and see if it helps.
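(For reference, that would be the token value in the totem section of /etc/pve/corosync.conf; excerpt only, assuming the stock layout, and remember to bump config_version when editing the file:)

totem {
    # keep the existing entries (version, cluster_name, interface, ...)
    # and increase config_version by one when changing the file
    token: 5000
}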

PS: my cluster communication is on a dedicated Gb bonded VLAN.
Thanks for the info. In this case the network is 10Gbit (I see I didn't include this info), but only for the proxmox nodes:

- We have 2 Dell N1124T 24x1Gbit 4xSFP+ switches
- Both switches are interconnected with an SFP+ DAC
- The active-passive bond in each proxmox node uses one SFP+ interface on each switch. Primary interfaces are configured to be on the same switch.
- Connectivity to the LAN is done with a 1 Gbit link
- The Proxmox 2x10G bond is used for VM networking and the Ceph public/private networks.

I wouldn't expect high network load/congestion because it's an internal LAN with 1Gbit clients. No Ceph issues/backfilling were occurring during the fence.

Network cards are Broadcom.

Thanks

Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/

