--- Begin Message ---
Hi,
So I have figured out what likely happened.
It was indeed very likely network congestion: proxmox1 and proxmox2
were using one switch and proxmox3 the other, because proxmox1 and
proxmox2 had not properly loaded the bond-primary directive (no primary
slave was shown in /proc/net/bonding/bond0, although the directive was
present in /etc/network/interfaces).
Additionally, I just checked and found that both switches are linked by
a 1G port, because the 4th SFP+ port is being used for the backup
server... (against my recommendation during the cluster setup, I must
add...)
So very likely it was network congestion that kicked proxmox1 out of the
cluster.
It seems that the bond directives should be present in the slave stanzas too, like:

auto lo
iface lo inet loopback

iface ens2f0np0 inet manual
    bond-master bond0
    bond-primary ens2f0np1
# Switch2

iface ens2f1np1 inet manual
    bond-master bond0
    bond-primary ens2f0np1
# Switch1

iface eno1 inet manual

iface eno2 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves ens2f0np0 ens2f1np1
    bond-miimon 100
    bond-mode active-backup
    bond-primary ens2f0np1

auto bond0.91
iface bond0.91 inet static
    address 192.168.91.11
#Ceph

auto vmbr0
iface vmbr0 inet static
    address 192.168.90.11
    gateway 192.168.90.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0

Otherwise, it seems the primary sometimes doesn't get configured properly...
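
A quick way to spot this condition on each node is to compare what the
kernel reports in /proc/net/bonding/bond0 with the primary you expect.
The following is only a minimal sketch: the function names
(parse_bond_status, check_primary) and the SAMPLE text are made up for
illustration, the interface names eth0/eth1 are placeholders, and the
sample is not output captured from this cluster.

```python
# Sketch: detect a bond whose primary slave was not applied.
# SAMPLE is illustrative text in the layout of /proc/net/bonding/<bond>;
# it is NOT real output from the cluster discussed in this thread.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: up
"""


def parse_bond_status(text):
    """Return primary, active slave and slave names found in the text."""
    info = {"primary": None, "active": None, "slaves": []}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or line without a "key: value" pair
        key, value = key.strip(), value.strip()
        if key == "Primary Slave":
            # The value may carry a suffix such as "(primary_reselect always)"
            name = value.split()[0] if value else "None"
            info["primary"] = None if name == "None" else name
        elif key == "Currently Active Slave":
            info["active"] = None if value == "None" else value
        elif key == "Slave Interface":
            info["slaves"].append(value)
    return info


def check_primary(text, expected):
    """Return a list of human-readable problems (empty if all is fine)."""
    info = parse_bond_status(text)
    problems = []
    if info["primary"] != expected:
        problems.append(f"primary is {info['primary']}, expected {expected}")
    if info["active"] != expected:
        problems.append(f"active slave is {info['active']}, expected {expected}")
    return problems


for problem in check_primary(SAMPLE, "eth0"):
    print(problem)
```

On a real node the text would come from reading /proc/net/bonding/bond0,
and the expected name would be whatever bond-primary is set to in
/etc/network/interfaces.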
Thanks again Michael and Stefan!
Eneko
On 14/4/21 at 12:12, Eneko Lacunza via pve-user wrote:
Hi Michael,
On 14/4/21 at 11:21, Michael Rasmussen via pve-user wrote:
On Wed, 14 Apr 2021 11:04:10 +0200
Eneko Lacunza via pve-user<[email protected]> wrote:
Hi all,
Yesterday we had a strange fence happen in a PVE 6.2 cluster.
Cluster has 3 nodes (proxmox1, proxmox2, proxmox3) and has been
operating normally for a year. Last update was on January 21st 2021.
Storage is Ceph and nodes are connected to the same network switch
with active-passive bonds.
proxmox1 was fenced and automatically rebooted, then everything
recovered. HA restarted the VMs on other nodes too.
proxmox1 syslog: (no network link issues reported at device level)
I have seen this occasionally, and every time the cause was high network
load/network congestion, which caused a token timeout. The default token
timeout in corosync is IMHO very optimistically configured at 1000 ms,
so I changed this setting to 5000 ms, and since then I have never seen
fencing caused by network load/network congestion again. You could try
this and see if it helps you.
PS. My cluster communication is on a dedicated gigabit bonded VLAN.
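
For reference, a rough sketch of how the configured token value maps to
the timeout corosync actually applies. It assumes the rule described in
corosync.conf(5): with a nodelist of at least 3 nodes, the effective
timeout is token + (nodes - 2) * token_coefficient, with
token_coefficient defaulting to 650 ms. The function name
effective_token_timeout is made up for illustration, and the defaults
should be checked against the installed corosync version.

```python
def effective_token_timeout(token_ms, nodes, token_coefficient_ms=650):
    """Effective totem token timeout in ms, per the rule assumed above."""
    if nodes < 3:
        # The coefficient only applies with at least 3 nodes in the nodelist.
        return token_ms
    return token_ms + (nodes - 2) * token_coefficient_ms


# 3-node cluster, as in this thread
print(effective_token_timeout(1000, 3))  # 1650 (token: 1000)
print(effective_token_timeout(5000, 3))  # 5650 (token: 5000)
```

On Proxmox the value would be set as "token: 5000" in the totem section
of /etc/pve/corosync.conf, increasing config_version in the same edit.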
Thanks for the info. In this case the network is 10Gbit (I see I didn't
include this info), but only for the proxmox nodes:
- We have 2 Dell N1124T 24x1Gbit 4xSFP+ switches
- Both switches are interconnected with an SFP+ DAC
- Active-passive bonds in each proxmox node go to one SFP+ interface on
each switch. Primary interfaces are configured to be on the same switch.
- Connectivity to the LAN is done with a 1 Gbit link
- Proxmox 2x10G Bond is used for VM networking and Ceph public/private
networks.
I wouldn't expect high network load/congestion because it's on an
internal LAN, with 1Gbit clients. No Ceph issues/backfilling were
occurring during the fence.
Network cards are Broadcom.
Thanks
Eneko Lacunza
Technical Director
Binovo IT Human Project
Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/
--- End Message ---