Hi, I have a very strange networking problem on a Proxmox server, which emerged after upgrading from 6.4 to 7.

These are the results of pveversion on the server:

root@lama10:~# pveversion -V
proxmox-ve: 7.2-1 (running kernel: 5.15.35-1-pve)
pve-manager: 7.2-3 (running version: 7.2-3/c743d6c1)
pve-kernel-5.15: 7.2-3
pve-kernel-helper: 7.2-3
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.35-1-pve: 5.15.35-2
pve-kernel-5.13.19-6-pve: 5.13.19-15
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-8
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-6
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.2-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.1.8-1
proxmox-backup-file-restore: 2.1.8-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-10
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-1
pve-qemu-kvm: 6.2.0-5
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1

The server has 4 network interfaces, bound in pairs in active-passive mode, then bridged. This is its /etc/network/interfaces:

auto eth0
iface eth0 inet manual
auto eth1
iface eth1 inet manual
auto eth2
iface eth2 inet manual
auto eth3
iface eth3 inet manual
auto bond0
iface bond0 inet manual
        bond-slaves eth0 eth1
        bond-miimon 100
        bond-mode active-backup
        bond-primary eth0
auto bond1
iface bond1 inet manual
        bond-slaves eth2 eth3
        bond-miimon 100
        bond-mode active-backup
        bond-primary eth2
auto vmbr0
iface vmbr0 inet static
        address 192.168.250.110/23
        gateway 192.168.250.254
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
auto vmbr1
iface vmbr1 inet static
        address 192.168.223.110/24
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0

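As a first sanity check on this configuration, the active slave of each bond can be read from the kernel's bonding status files (a minimal sketch; /proc/net/bonding/<name> is the standard location, and the guard only handles hosts where a bond is absent):

```shell
# Print mode, MII status and currently active slave for each bond.
for b in bond0 bond1; do
    f="/proc/net/bonding/$b"
    if [ -r "$f" ]; then
        echo "== $b =="
        grep -E 'Bonding Mode|MII Status|Currently Active Slave' "$f"
    else
        echo "== $b: no bonding status file (interface absent?) =="
    fi
done
```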

The network problem affects only connections to the virtual machines hosted by the server (no containers are used); there is no problem at all connecting to the server itself. The only anomaly I could find is that the bridge seems to see the MAC addresses of some of the VMs on the wrong internal port, so those VMs become unreachable.

To explain what this means, I put 3 test VMs on the server (two Debian 11 and a Windows one, just to exclude problems at the operating-system level) using the vmbr1 bridge; their tap interfaces are:

root@lama10:~# brctl show vmbr1
bridge name     bridge id               STP enabled     interfaces
vmbr1           8000.7a576e974a37       no              bond1
                                                        tap403i0
                                                        tap404i0
                                                        tap603i0

Sometimes some of them work and some do not. While I was writing this email, VM 404 was not working. Looking at the tap404i0 MAC address I got:

root@lama10:~# ip -br link show dev tap404i0
tap404i0 UNKNOWN 26:6f:0c:19:95:58 <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP>

while VM 404's own MAC address is:

root@lama10:~# grep vmbr1 /etc/pve/qemu-server/404.conf
net0: virtio=BE:47:4C:D5:5D:A9,bridge=vmbr1

and when I look at how these MAC addresses are seen inside vmbr1 I get:

root@lama10:~# brctl showmacs vmbr1 | egrep -i '(26:6f:0c:19:95:58|BE:47:4C:D5:5D:A9)'
  4     26:6f:0c:19:95:58       yes                0.00
  4     26:6f:0c:19:95:58       yes                0.00
  1     be:47:4c:d5:5d:a9       no                 0.65

Doing the same for another VM that was working (MAC addresses were found as above) I got instead:

root@lama10:~# brctl showmacs vmbr1 | egrep -i '(92:4f:ec:7e:8a:e1|DE:A3:E6:96:0C:6E)'
  3     92:4f:ec:7e:8a:e1       yes                0.00
  3     92:4f:ec:7e:8a:e1       yes                0.00
  3     de:a3:e6:96:0c:6e       no                 2.32

Note: by "working" I mean that a VM is normally reachable over the network without packet loss. I checked multiple times, and on other servers, and in all working cases the ports inside the vmbrX bridge are the same for the TAP MAC and the VM MAC, as expected. When a VM is not working, its own MAC always seems to be associated with port 1 (the one of the bonding interface).
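For reference, the port numbers shown by brctl showmacs can be mapped to interface names through sysfs (a minimal sketch; /sys/class/net/<bridge>/brif/<port>/port_no is the standard bridge sysfs layout, and the kernel prints the values in hex, e.g. 0x1 for port 1):

```shell
# List each interface enslaved to vmbr1 together with its bridge port
# number (port_no is reported in hex by the kernel, e.g. 0x1).
BR=vmbr1
for p in "/sys/class/net/$BR/brif"/*; do
    [ -e "$p" ] || continue          # no ports, or bridge absent
    printf '%s -> port %s\n' "$(basename "$p")" "$(cat "$p/port_no")"
done
```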

What I find on a "not working" VM is that the ARP reply is never received (looking with tcpdump run from the console). The ARP requests are sent, and are seen on other VMs and on the host, but no replies are seen.

Whether a VM works seems almost random (or at least I could not find a pattern up to now). After stopping and restarting the working VM from above, it stopped working and its port on the bridge changed:

root@lama10:~# brctl showmacs vmbr1 | egrep -i '(92:4f:ec:7e:8a:e1|DE:A3:E6:96:0C:6E)'
  3     92:4f:ec:7e:8a:e1       yes                0.00
  3     92:4f:ec:7e:8a:e1       yes                0.00
  1     de:a3:e6:96:0c:6e       no                 0.86
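The suspicious pattern (a non-local MAC learned on port 1, the bond port) can be spotted quickly with a small filter over the brctl showmacs output; a sketch, using sample lines copied from the outputs above (on a live system, pipe brctl showmacs vmbr1 into the awk command instead of using the here-document):

```shell
# Flag non-local ("is local" == no) MACs that the bridge has learned on
# port 1; on this setup those are VM MACs wrongly seen behind the bond.
awk '$1 == 1 && $3 == "no" { print $2 }' <<'EOF'
  4     26:6f:0c:19:95:58       yes                0.00
  1     be:47:4c:d5:5d:a9       no                 0.65
  3     de:a3:e6:96:0c:6e       no                 2.32
EOF
# prints: be:47:4c:d5:5d:a9
```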

What makes this behaviour "strange" is that two other identical machines with the same Proxmox version (they are in a cluster with this one, inside a blade rack) are working just fine. And there is no problem with the cluster (as I said, no network problems at all for the server itself).

The only difference on the other two fully working nodes is that their bonding is configured as LACP. That was not possible for this one: it produced loop error messages when configured that way, so I had to remove that configuration to avoid disturbing the other two nodes, where all production VMs were migrated and are running without problems.

But another standalone server (with the same Proxmox version as all the others), which is outside the blade rack and is also configured with active-passive bonding, is working fine.

So despite the differences in network configuration between all these servers, I still cannot imagine how the kind of bonding, or the use of a different switch, can have an impact on this problem. In the example above I cannot ping VM 404 either from the server itself or from the other working VMs hosted on the same server, and that traffic is completely internal, handled inside vmbr1.

So I'm asking for directions on what to search for, and where to look to find out how the ports inside the bridge are allocated, or any other suggestion that could shed some light on this issue.

Simone
--
Simone Piccardi                                 Truelite Srl
[email protected] (email/jabber)             Via Monferrato, 6
Tel. +39-347-1032433                            50142 Firenze
http://www.truelite.it                          Tel. +39-055-7879597


_______________________________________________
pve-user mailing list
[email protected]
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
