Hi, I have a very strange networking problem on a Proxmox server, which emerged after upgrading from 6.4 to 7.

These are the results of pveversion on the server:

root@lama10:~# pveversion -V
proxmox-ve: 7.2-1 (running kernel: 5.15.35-1-pve)
pve-manager: 7.2-3 (running version: 7.2-3/c743d6c1)
pve-kernel-5.15: 7.2-3
pve-kernel-helper: 7.2-3
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.35-1-pve: 5.15.35-2
pve-kernel-5.13.19-6-pve: 5.13.19-15
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-8
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-6
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.2-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.1.8-1
proxmox-backup-file-restore: 2.1.8-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-10
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-1
pve-qemu-kvm: 6.2.0-5
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1

The server has 4 network interfaces, bound in pairs in active-passive mode, then bridged. This is its /etc/network/interfaces:

auto eth0
iface eth0 inet manual
auto eth1
iface eth1 inet manual
auto eth2
iface eth2 inet manual
auto eth3
iface eth3 inet manual
auto bond0
iface bond0 inet manual
        bond-slaves eth0 eth1
        bond-miimon 100
        bond-mode active-backup
        bond-primary eth0
auto bond1
iface bond1 inet manual
        bond-slaves eth2 eth3
        bond-miimon 100
        bond-mode active-backup
        bond-primary eth2
auto vmbr0
iface vmbr0 inet static
        address 192.168.250.110/23
        gateway 192.168.250.254
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
auto vmbr1
iface vmbr1 inet static
        address 192.168.223.110/24
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0

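As a first sanity check on this configuration, the active slave of each bond can be read from the kernel's bonding status files (a minimal sketch; /proc/net/bonding/<name> is the standard location, and the guard only handles hosts where a bond is absent):

```shell
# Print mode, MII status and currently active slave for each bond.
for b in bond0 bond1; do
    f="/proc/net/bonding/$b"
    if [ -r "$f" ]; then
        echo "== $b =="
        grep -E 'Bonding Mode|MII Status|Currently Active Slave' "$f"
    else
        echo "== $b: no bonding status file (interface absent?) =="
    fi
done
```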

The network problem affects only connections to the virtual machines hosted by the server (no containers are used); there is no problem at all connecting to the server itself. The only anomaly I could find is that the bridge seems to see the MAC addresses of some of the VMs on the wrong internal port, so those VMs become unreachable.

To explain what this means, I put 3 test VMs on the server (two Debian 11 and a Windows one, just to exclude problems at the operating-system level) using the vmbr1 bridge; their tap interfaces are:

root@lama10:~# brctl show vmbr1
bridge name     bridge id               STP enabled     interfaces
vmbr1           8000.7a576e974a37       no              bond1
                                                        tap403i0
                                                        tap404i0
                                                        tap603i0

Sometimes some of them work and some do not. While I was writing this email, VM 404 was not working. Looking at the tap404i0 MAC address I got:

root@lama10:~# ip -br link show dev tap404i0
tap404i0 UNKNOWN 26:6f:0c:19:95:58 <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP>

while VM 404's own MAC address is:

root@lama10:~# grep vmbr1 /etc/pve/qemu-server/404.conf
net0: virtio=BE:47:4C:D5:5D:A9,bridge=vmbr1

and when I look at how these MAC addresses are seen inside vmbr1 I get:

root@lama10:~# brctl showmacs vmbr1 | egrep -i '(26:6f:0c:19:95:58|BE:47:4C:D5:5D:A9)'
  4     26:6f:0c:19:95:58       yes                0.00
  4     26:6f:0c:19:95:58       yes                0.00
  1     be:47:4c:d5:5d:a9       no                 0.65

Doing the same for another VM that was working (MAC addresses were found as above) I got instead:

root@lama10:~# brctl showmacs vmbr1 | egrep -i '(92:4f:ec:7e:8a:e1|DE:A3:E6:96:0C:6E)'
  3     92:4f:ec:7e:8a:e1       yes                0.00
  3     92:4f:ec:7e:8a:e1       yes                0.00
  3     de:a3:e6:96:0c:6e       no                 2.32

Note: by "working" I mean that a VM is normally reachable over the network without packet loss. I checked multiple times, and on other servers, and in all working cases the ports inside the vmbrX bridge are the same for the TAP MAC and the VM MAC, as expected. When a VM is not working, its own MAC always seems to be associated with port 1 (the one of the bonding interface).
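For reference, the port numbers shown by brctl showmacs can be mapped to interface names through sysfs (a minimal sketch; /sys/class/net/<bridge>/brif/<port>/port_no is the standard bridge sysfs layout, and the kernel prints the values in hex, e.g. 0x1 for port 1):

```shell
# List each interface enslaved to vmbr1 together with its bridge port
# number (port_no is reported in hex by the kernel, e.g. 0x1).
BR=vmbr1
for p in "/sys/class/net/$BR/brif"/*; do
    [ -e "$p" ] || continue          # no ports, or bridge absent
    printf '%s -> port %s\n' "$(basename "$p")" "$(cat "$p/port_no")"
done
```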

What I find on a "not working" VM is that the ARP reply is never received (looking with tcpdump run from the console). The ARP requests are sent, and are seen on other VMs and on the host, but no replies are seen.

Whether a VM works seems almost random (or at least I could not find a pattern up to now). After stopping and restarting the working VM from above, it stopped working and its port on the bridge changed:

root@lama10:~# brctl showmacs vmbr1 | egrep -i '(92:4f:ec:7e:8a:e1|DE:A3:E6:96:0C:6E)'
  3     92:4f:ec:7e:8a:e1       yes                0.00
  3     92:4f:ec:7e:8a:e1       yes                0.00
  1     de:a3:e6:96:0c:6e       no                 0.86
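The suspicious pattern (a non-local MAC learned on port 1, the bond port) can be spotted quickly with a small filter over the brctl showmacs output; a sketch, using sample lines copied from the outputs above (on a live system, pipe brctl showmacs vmbr1 into the awk command instead of using the here-document):

```shell
# Flag non-local ("is local" == no) MACs that the bridge has learned on
# port 1; on this setup those are VM MACs wrongly seen behind the bond.
awk '$1 == 1 && $3 == "no" { print $2 }' <<'EOF'
  4     26:6f:0c:19:95:58       yes                0.00
  1     be:47:4c:d5:5d:a9       no                 0.65
  3     de:a3:e6:96:0c:6e       no                 2.32
EOF
# prints: be:47:4c:d5:5d:a9
```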

What makes this behaviour "strange" is that two other identical machines with the same Proxmox version (they are in a cluster with this one, inside a blade rack) are working just fine. And there is no problem with the cluster (as I said, no network problems at all for the server itself).

The only difference on the other two fully working nodes is that their bonding is configured as LACP. That was not possible for this one: it produced loop error messages when configured that way, so I had to remove that configuration to avoid disturbing the other two nodes, where all production VMs were migrated and are running without problems.

But another standalone server (with the same Proxmox version as all the others), which is outside the blade rack and is also configured with active-passive bonding, is working fine.

So despite the differences in network configuration between all these servers, I still cannot imagine how the kind of bonding, or the use of a different switch, can have an impact on this problem. In the example above I cannot ping VM 404 either from the server itself or from the other working VMs hosted on the same server, and that traffic is completely internal, handled inside vmbr1.

So I'm asking for directions on what to search for, and where to look to find out how the ports inside the bridge are allocated, or any other suggestion that could shed some light on this issue.

Simone
--
Simone Piccardi                                 Truelite Srl
[email protected] (email/jabber)             Via Monferrato, 6
Tel. +39-347-1032433                            50142 Firenze
http://www.truelite.it                          Tel. +39-055-7879597


_______________________________________________
pve-user mailing list
[email protected]
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
