On 6/8/26 10:06 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 07:12:14PM +0200, Paolo Abeni wrote:
>> On 6/8/26 12:41 PM, Fiona Ebner wrote:
>>> On 05.06.26 at 4:54 PM, Paolo Abeni wrote:
>>>> On 6/5/26 4:02 PM, Fiona Ebner wrote:
>>>>> On 09.11.25 at 4:10 PM, Michael S. Tsirkin wrote:
>>>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>>>> index 17ed0ef919..3b85560f6f 100644
>>>>>> --- a/hw/net/virtio-net.c
>>>>>> +++ b/hw/net/virtio-net.c
>>>>>> @@ -4299,19 +4299,19 @@ static const Property virtio_net_properties[] = {
>>>>>>      VIRTIO_DEFINE_PROP_FEATURE("host_tunnel", VirtIONet,
>>>>>>                                 host_features_ex,
>>>>>>                                 VIRTIO_NET_F_HOST_UDP_TUNNEL_GSO,
>>>>>> -                               false),
>>>>>> +                               true),
>>>>> it seems that the host_tunnel setting can cause issues when VXLAN
>>>>> traffic originating in a guest goes over a physical NIC which does not
>>>>> support the feature. We received several reports about the issue
>>>>> [0][1][2][3] and were able to reproduce it. Turning off the
>>>>> 'host_tunnel' property in the commandline for the VirtIO net device
>>>>> makes TCP traffic work. The network configuration from our reproducer
>>>>> setup is as follows:
>>>>>
>>>>> guest A (iperf3 -c)                     guest B (iperf3 -s)
>>>>> vxlan using vNIC as underlay            vxlan using vNIC as underlay
>>>>> virtualized NIC exposed to guest        virtualized NIC exposed to guest
>>>>> ---guest boundary---                    ---guest boundary---
>>>>> tap device connected to bridge          tap device connected to bridge
>>>>> bridge with physical NIC as port        bridge with physical NIC as port
>>>>> physical NIC <---host boundary--->      physical NIC
>>>>>
>>>>> Bridge configuration:
>>>>> iface vmbr0 inet static
>>>>>     address 10.48.0.109/20
>>>>>     gateway 10.48.0.1
>>>>>     bridge-ports nic3
>>>>>     bridge-stp off
>>>>>     bridge-fd 0
>>>>>     bridge-vlan-aware yes
>>>>>     bridge-vids 2-4094
>>>>>
>>>>> VXLAN created with:
>>>>> ip link add vxlan0 type vxlan id 100 remote X dstport 4789 dev eth1
>>>>> where eth1 is the virtualized NIC exposed to the guest
>>>>>
>>>>> The physical NIC does not have the feature:
>>>>> Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme
>>>>> BCM5719 Gigabit Ethernet PCIe [14e4:1657] (rev 01)
>>>>> tx-udp_tnl-segmentation: off [fixed]
>>>>> tx-udp_tnl-csum-segmentation: off [fixed]
>>>>>
>>>>> Using a physical NIC which does have the feature works:
>>>>> Ethernet controller [0200]: Broadcom Inc. and subsidiaries BCM57504
>>>>> NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb Ethernet [14e4:1751] (rev 11)
>>>>> tx-udp_tnl-segmentation: on
>>>>> tx-udp_tnl-csum-segmentation: on
>>>>>
>>>>> Host kernel:
>>>>> Proxmox VE with 7.0.2-6-pve
>>>>>
>>>>> Guest kernel:
>>>>> Alpine with 6.18.34-0-lts
>>>>>
>>>>> QEMU commandline for the vNIC:
>>>>>> -netdev
>>>>>> 'type=tap,id=net2,ifname=tap103i2,script=/usr/libexec/qemu-server/pve-bridge,downscript=/usr/libexec/qemu-server/pve-bridgedown,vhost=on'
>>>>>> \
>>>>>> -device
>>>>>> 'virtio-net-pci,mac=BC:24:11:78:C3:3B,netdev=net2,bus=pci.0,addr=0x14,id=net2,rx_queue_size=1024,tx_queue_size=256,host_mtu=1500'
>>>>>> \
>>>>>
>>>>> We can see that QEMU sets the features for the tap interface via ioctl()
>>>>> and the host kernel allows it:
>>>>> tx-udp_tnl-segmentation: on
>>>>> tx-udp_tnl-csum-segmentation: on
>>>>>
>>>>> As far as we understand, in the problematic scenario, nothing is ever
>>>>> filling in the checksums for the inner TCP packets, meaning the outer
>>>>> UDP checksum ends up being wrong on the target side. Is the host kernel
>>>>> responsible for doing that before passing the packet to the physical NIC
>>>>> (without the feature)? Or who would be?
>>>>>
>>>>> Turning off host_tunnel_csum without turning off host_tunnel does not
>>>>> help.
>>>>>
>>>>> Interestingly, turning off the features for the working physical NIC
>>>>> does not make it break:
>>>>> tx-udp_tnl-segmentation: off
>>>>> tx-udp_tnl-csum-segmentation: off
>>>>> Could it be that the NIC just always fills in the inner TCP checksums
>>>>> regardless of that setting?
>>>>>
>>>>> On the other hand, running
>>>>> localhost:~# ethtool -K eth2 tx-checksum-ip-generic off
>>>>> Actual changes:
>>>>> tx-checksum-ip-generic: off
>>>>> tx-tcp-segmentation: off [not requested]
>>>>> tx-tcp-ecn-segmentation: off [not requested]
>>>>> tx-tcp6-segmentation: off [not requested]
>>>>> tx-udp-segmentation: off [not requested]
>>>>> inside the guests makes it work for the physical NIC without the
>>>>> tx-udp_tnl* features.
>>>>>
>>>>> I wanted to ask if this configuration is expected to be unsupported and
>>>>> if the management is expected to turn off the feature on the commandline
>>>>> if the traffic might go over a physical NIC without the feature. Or if
>>>>> this could be a kernel or NIC bug that should be investigated further?
>>>>> In the former case, should the option really be turned on by default
>>>>> with new machine versions?
>>>>
>>>> Thank you for the detailed report. The configuration you describe is
>>>> supported and expected to work. The fact that different results are
>>>> obtained on top of a NIC with:
>>>>
>>>> [1] tx-udp_tnl-segmentation: off [fixed]
>>>>
>>>> compared to a similar setup on top of a NIC with:
>>>>
>>>> [2] tx-udp_tnl-segmentation: off
>>>>
>>>> is indeed strange/unexpected, as the two scenarios are indistinguishable
>>>> from the stack perspective. I suspect the issue is NIC driver dependent.
>>>>
>>>> I understand [1] is using a tg3 driver, and [2] bnxt, both running Linux
>>>> 7.0.2, am I correct?
>>>
>>> Yes.
>>>
>>>> If you disable csum offloading on the tg3 NIC, does that impact the
>>>> results?
>>>
>>> Yes, doing
>>>
>>> root@tamy3:~# ethtool -K nic3 tx-checksum-ipv4 off
>>> Actual changes:
>>> tx-checksum-ipv4: off
>>> tx-tcp-segmentation: off [not requested]
>>> tx-tcp-ecn-segmentation: off [not requested]
>>>
>>> on both hosts makes it work.
>>>
>>>> If you have such data handy, could you please share pcap captures on
>>>> both ends? Links to some accessible URL would be better than sending a
>>>> lot of data to the ML, I think.
>>>
>>> I captured the following while the problem is present with
>>> tcpdump -i foo udp port 4789 -w bar.pcap
>>> on the host interfaces (tap, bridge and physical NIC) just to be sure.
>>> Looking at it with tcpdump -envvvr, within the same host, only the
>>> timestamps change. Between the hosts, the UDP checksums do change, but
>>> the inner TCP checksums do not. So I suppose the NIC fills in the UDP
>>> checksum based on the still-wrong data, since the UDP checksum would
>>> already be correct if the TCP checksums were fixed up?
>>>
>>> For the NIC with the tx-udp_tnl features, the inner TCP checksums do get
>>> corrected and the UDP checksum stays the same. I did not include
>>> captures for this.
>>>
>>> IPs for the guest running iperf -s (on host tamy2)
>>> 10.48.6.81 for the virtualized NIC
>>> 10.0.123.102 for the VXLAN
>>>
>>> IPs for the guest running iperf -c (on host tamy3):
>>> 10.48.6.101 for the virtualized NIC
>>> 10.0.123.103 for the VXLAN
>>>
>>> The captures are short, so I am taking the liberty of providing them directly:
>>>
>>> [I] febner@enia ~> tar cf pcap.tar tamy*.pcap
>>> [I] febner@enia ~> xz pcap.tar
>>> [I] febner@enia ~> base64 -w 70 pcap.tar.xz
>>> /Td6WFoAAATm1rRGBMDXBYCgASEBFgAAAAAAAKQwOkPgT/8Cz10AOhhJ/551cIJN23SQMX
>>> Q2Us4cGiof2bxOS4FK4DxejNh+76NiWIpdIfOxrB5urac3FT0mPKMbUreSY+04/NhofcgS
>>> Zz41D6t/Xp+VkPxNYx7Xsp3xz4xUCsVuK205jz6G/NAY0bJ0+UrJuCkP0G5VBtn88hJstD
>>> 7qlaT7qcBLECseOO1OfqsLezxasbm5p614IL18cqAVMCMWucr/Kh2Oqth26v7zI4SVEJC/
>>> YSEgaOhfjCbQZSi85BEw9/NSZO6IqoyNLrEiPUPgXTWH63NssG+4RMuBswrkgN5Wld70B1
>>> mROOCwKbo9b9oXI4DumGHqgCV5jdAxzITpEjMQpvDKh6NvM5L/8v1cPiGjLFSL2JesZ0F5
>>> dTbstymv1q4eN+9f3ng+4AXCvDzaziYMwtGwxYyptK5qDI2oGsCIGwFDpP/ZEw7NYI9EMM
>>> G2+SDG6D8bKgKWl9Mi6EJcqSMVKFR1P1Z/P3XJ/9sWOMJug1IVYZGIJmtXXM3+roqOEGMF
>>> tco/LMUJHgdmfkitfuZ5tN1+0EVE0/f4GQiUpdidjqfZ2m9jL0svcGXUd5D3LN0tbh5vmP
>>> KzXQNtMQiMY6Fj7gbzDbOQGGW/L3/34B5YV+pWEpzhAbeTI9KL0ZF3vJ0OESlL9OMhrqgl
>>> WX23bxek2h9eG15eO9cderaoCOFb8NEKIjC+UTh2Ir7/ZFfDvlXeGB/3jXM8OTmWmJSr5b
>>> CrAvBQ4xvow3hwKq2Fbyu7aU6KycVpo03a+59LqxPyRfc3qRXcoUnp8MTi2YUk+kfYR6mI
>>> S9AE/5xYFzb7I40RUBPUm0OCzguzk9qlIcab3lnTFnrMWa+Cj9AMIkWEEf0tMzw0v9+17u
>>> VJg/8tWMad3d9Jc5Z6B9kOukzGvgVEWoq4z9snb/k6u2sBVY36q2iI1cmSPrI+UcF2GtSA
>>> Qs6bt/T/c1Xi2r0Up+tRDrIE9O2aNAAAAKAWpkIVfsvrAAHzBYCgAQAA6om1scRn+wIAAA
>>> AABFla
>>
>> Thanks for the data. The bug is in the virtio_net driver; non-GSO
>> packets requiring an inner header csum are passed unmodified from the
>> guest to the host, and the H/W NIC has to compute the inner transport
>> header csum for the encap packet.
>>
>> Could you please try the attached kernel patch? You will need to update
>> the kernel inside the guest.
>
> Paolo, was there supposed to be a patch here?
I'm sorry, ENOCOFFEE here. Thanks, Michael, for the heads-up!
Attaching the patch now.
/P
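
For readers following the checksum discussion quoted above, here is a small userspace sketch of the arithmetic. It is only an illustration under loud assumptions: the 4-byte "outer header", the 8-byte "inner segment", the checksum offset, and all function names are made up for the example, and the pseudo-headers that real UDP and TCP checksums cover are left out. It shows the RFC 1071 property that a block carrying a valid Internet checksum sums to 0xffff, so an enclosing checksum only matches the value derived from the outer header once the inner checksum has actually been filled in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 ones'-complement sum over big-endian 16-bit words, folded. */
static uint16_t ones_sum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Internet checksum: complement of the folded sum. */
static uint16_t inet_csum(const uint8_t *data, size_t len)
{
	return (uint16_t)~ones_sum(data, len);
}

/* Toy layout (hypothetical): outer header, then inner segment. */
enum { OUTER_LEN = 4, INNER_LEN = 8, INNER_CSUM_OFF = 2 };

/* Compute and store the inner checksum, as a stack or NIC would. */
static void fill_inner_csum(uint8_t *inner)
{
	uint16_t c;

	inner[INNER_CSUM_OFF] = 0;
	inner[INNER_CSUM_OFF + 1] = 0;
	c = inet_csum(inner, INNER_LEN);
	inner[INNER_CSUM_OFF] = (uint8_t)(c >> 8);
	inner[INNER_CSUM_OFF + 1] = (uint8_t)(c & 0xff);
}

static void csum_demo(void)
{
	uint8_t pkt[OUTER_LEN + INNER_LEN] = {
		0x12, 0x34, 0x00, 0x08,	/* toy outer header */
		0xde, 0xad, 0x00, 0x00,	/* inner data, empty csum field */
		0xbe, 0xef, 0x00, 0x01,	/* inner data */
	};
	uint16_t hdr_only = inet_csum(pkt, OUTER_LEN);

	/* Inner csum never filled in: full-packet csum != header-only csum. */
	assert(inet_csum(pkt, sizeof(pkt)) != hdr_only);

	/* Once the inner csum is valid, the inner block sums to 0xffff ... */
	fill_inner_csum(pkt + OUTER_LEN);
	assert(ones_sum(pkt + OUTER_LEN, INNER_LEN) == 0xffff);

	/* ... and the full-packet csum equals the header-only csum. */
	assert(inet_csum(pkt, sizeof(pkt)) == hdr_only);
}
```

Calling csum_demo() runs the three checks. This does not claim to reproduce what tg3 or bnxt do on the wire; it only makes the "outer checksum would already be correct once the inner one is fixed" reasoning concrete.
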
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f4adcfee7a80..725e6315036a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -6222,6 +6222,19 @@ static void virtnet_free_irq_moder(struct virtnet_info *vi)
 	rtnl_unlock();
 }

+static netdev_features_t virtnet_features_check(struct sk_buff *skb,
+						struct net_device *dev,
+						netdev_features_t features)
+{
+	/* Inner csum offload is only available for GSO packets. */
+	if (skb->encapsulation && !skb_is_gso(skb))
+		return features & ~NETIF_F_CSUM_MASK;
+
+	/* Passthru. */
+	return features;
+}
+
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
@@ -6235,7 +6248,7 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_bpf		= virtnet_xdp,
 	.ndo_xdp_xmit		= virtnet_xdp_xmit,
 	.ndo_xsk_wakeup         = virtnet_xsk_wakeup,
-	.ndo_features_check	= passthru_features_check,
+	.ndo_features_check	= virtnet_features_check,
 	.ndo_get_phys_port_name	= virtnet_get_phys_port_name,
 	.ndo_set_features	= virtnet_set_features,
 	.ndo_tx_timeout		= virtnet_tx_timeout,
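
To make the intent of the new check easier to see outside the kernel tree, the following is a minimal userspace model of its decision logic. Everything prefixed with model_ or MODEL_ is a hypothetical stand-in invented for this sketch: the struct is not struct sk_buff, and the bit positions are not the real NETIF_F_* values. The only parts taken from the patch are the condition (encapsulated and not GSO) and the effect (clear the checksum-offload feature bits, so the stack computes the checksum in software before the packet reaches the device).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits; NOT the kernel's NETIF_F_* bit positions. */
typedef uint64_t model_features_t;

#define MODEL_F_IP_CSUM		(1ULL << 0)
#define MODEL_F_HW_CSUM		(1ULL << 1)
#define MODEL_F_IPV6_CSUM	(1ULL << 2)
#define MODEL_F_SG		(1ULL << 3)
#define MODEL_F_CSUM_MASK \
	(MODEL_F_IP_CSUM | MODEL_F_HW_CSUM | MODEL_F_IPV6_CSUM)

/* Hypothetical stand-in carrying only the two fields the check reads. */
struct model_skb {
	bool encapsulation;	/* packet carries a tunnel (e.g. VXLAN) header */
	unsigned int gso_size;	/* non-zero means the packet is GSO */
};

static bool model_skb_is_gso(const struct model_skb *skb)
{
	return skb->gso_size != 0;
}

/* Mirrors the condition and the masking in virtnet_features_check(). */
static model_features_t model_features_check(const struct model_skb *skb,
					     model_features_t features)
{
	/* Encapsulated, non-GSO: drop csum offload for this packet. */
	if (skb->encapsulation && !model_skb_is_gso(skb))
		return features & ~MODEL_F_CSUM_MASK;

	/* Everything else passes through unchanged. */
	return features;
}

static void model_self_test(void)
{
	const model_features_t all = MODEL_F_CSUM_MASK | MODEL_F_SG;
	struct model_skb tunnel_small = { .encapsulation = true, .gso_size = 0 };
	struct model_skb tunnel_gso = { .encapsulation = true, .gso_size = 1400 };
	struct model_skb plain = { .encapsulation = false, .gso_size = 0 };

	assert(model_features_check(&tunnel_small, all) == MODEL_F_SG);
	assert(model_features_check(&tunnel_gso, all) == all);
	assert(model_features_check(&plain, all) == all);
}
```

In this model only the encapsulated non-GSO packet loses the checksum bits; unrelated features such as scatter-gather are left alone, and GSO or non-tunnel packets see no change compared to passthru_features_check.
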