On 6/26/23 08:32, Jason Wang wrote:
> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasow...@redhat.com> wrote:
>>
>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>
>>> AF_XDP is a network socket family that allows communication directly
>>> with the network device driver in the kernel, bypassing most or all
>>> of the kernel networking stack.  In essence, the technology is
>>> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
>>> and works with any network interface without driver modifications.
>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
>>> require access to character devices or unix sockets.  Only access to
>>> the network interface itself is necessary.
>>>
>>> This patch implements a network backend that communicates with the
>>> kernel by creating an AF_XDP socket.  A chunk of userspace memory
>>> is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
>>> Fill and Completion) are placed in that memory along with a pool of
>>> memory buffers for the packet data.  Data transmission is done by
>>> allocating one of the buffers, copying packet data into it and
>>> placing the pointer into the Tx ring.  After transmission, the device
>>> will return the buffer via the Completion ring.  On Rx, the device
>>> will take a buffer from a pre-populated Fill ring, write the packet
>>> data into it and place the buffer into the Rx ring.
>>>
>>> The AF_XDP network backend takes care of the communication with the
>>> host kernel and the network interface and forwards packets to/from
>>> the peer device in QEMU.
>>>
>>> Usage example:
>>>
>>>     -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>>>     -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>>>
>>> An XDP program bridges the socket with the network interface.  It can
>>> be attached to the interface in 2 different modes:
>>>
>>> 1. skb    - this mode should work for any interface and doesn't
>>>             require driver support, with the caveat of lower
>>>             performance.
>>>
>>> 2. native - this mode requires support from the driver and allows
>>>             bypassing skb allocation in the kernel and potentially
>>>             using zero-copy while getting packets in/out of userspace.
>>>
>>> By default, QEMU will try to use native mode and fall back to skb.
>>> The mode can be forced via the 'mode' option.  To force 'copy' even
>>> in native mode, use the 'force-copy=on' option.  This might be useful
>>> if there is some issue with the driver.
>>>
>>> The 'queues=N' option specifies how many device queues should be
>>> open.  Note that all the queues that are not open are still
>>> functional and can receive traffic, but it will not be delivered to
>>> QEMU.  So, the number of device queues should generally match the
>>> QEMU configuration, unless the device is shared with something else
>>> and the traffic redirection to the appropriate queues is correctly
>>> configured on the device level (e.g. with ethtool -N).
>>> The 'start-queue=M' option can be used to specify from which queue id
>>> QEMU should start configuring 'N' queues.  It might also be necessary
>>> to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
>>> for examples.
>>>
>>> In the general case QEMU will need the CAP_NET_ADMIN and CAP_SYS_ADMIN
>>> capabilities in order to load the default XSK/XDP programs to the
>>> network interface and configure BTF maps.
>>
>> I think you mean "BPF" actually?
"BPF Type Format maps" kind of makes some sense, but yes. :) >> >>> It is possible, however, >>> to run only with CAP_NET_RAW. >> >> Qemu often runs without any privileges, so we need to fix it. >> >> I think adding support for SCM_RIGHTS via monitor would be a way to go. I looked through the code and it seems like we can run completely non-privileged as far as kernel concerned. We'll need an API modification in libxdp though. The thing is, IIUC, the only syscall that requires CAP_NET_RAW is a base socket creation. Binding and other configuration doesn't require any privileges. So, we could create a socket externally and pass it to QEMU. Should work, unless it's an oversight from the kernel side that needs to be patched. :) libxdp doesn't have a way to specify externally created socket today, so we'll need to change that. Should be easy to do though. I can explore. In case the bind syscall will actually need CAP_NET_RAW for some reason, we could change the kernel and allow non-privileged bind by utilizing, e.g. SO_BINDTODEVICE. i.e., let the privileged process bind the socket to a particular device, so QEMU can't bind it to a random one. Might be a good use case to allow even if not strictly necessary. >> >> >>> For that to work, an external process >>> with admin capabilities will need to pre-load default XSK program >>> and pass an open file descriptor for this program's 'xsks_map' to >>> QEMU process on startup. Network backend will need to be configured >>> with 'inhibit=on' to avoid loading of the programs. The file >>> descriptor for 'xsks_map' can be passed via 'xsks-map-fd=N' option. >>> >>> There are few performance challenges with the current network backends. >>> >>> First is that they do not support IO threads. >> >> The current networking codes needs some major recatoring to support IO >> threads which I'm not sure is worthwhile. >> >>> This means that data >>> path is handled by the main thread in QEMU and may slow down other >>> work or may be slowed down by some other work. This also means that >>> taking advantage of multi-queue is generally not possible today. >>> >>> Another thing is that data path is going through the device emulation >>> code, which is not really optimized for performance. The fastest >>> "frontend" device is virtio-net. But it's not optimized for heavy >>> traffic either, because it expects such use-cases to be handled via >>> some implementation of vhost (user, kernel, vdpa). In practice, we >>> have virtio notifications and rcu lock/unlock on a per-packet basis >>> and not very efficient accesses to the guest memory. Communication >>> channels between backend and frontend devices do not allow passing >>> more than one packet at a time as well. >>> >>> Some of these challenges can be avoided in the future by adding better >>> batching into device emulation or by implementing vhost-af-xdp variant. >> >> It might require you to register(pin) the whole guest memory to XSK or >> there could be a copy. Both of them are sub-optimal. A single copy by itself shouldn't be a huge problem, right? vhost-user and -kernel do copy packets. >> >> A really interesting project is to do AF_XDP passthrough, then we >> don't need to care about pin and copy and we will get ultra speed in >> the guest. (But again, it might needs BPF support in virtio-net). I suppose, if we're doing pass-through we need a new device type and a driver in the kernel/dpdk. There is no point pretending it's a virtio-net and translating between different ring layouts. Or is there? 
>>
>>>
>>> There are also a few kernel limitations.  AF_XDP sockets do not
>>> support any kind of checksum or segmentation offloading.  Buffers
>>> are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
>>> support is not implemented for AF_XDP today.  Also, transmission in
>>> all non-zero-copy modes is synchronous, i.e. done in a syscall.
>>> That doesn't allow high packet rates on virtual interfaces.
>>>
>>> However, keeping in mind all of these challenges, the current
>>> implementation of the AF_XDP backend shows decent performance while
>>> running on top of a physical NIC with zero-copy support.
>>>
>>> Test setup:
>>>
>>> 2 VMs running on 2 physical hosts connected via a ConnectX6-Dx card.
>>> The network backend is configured to open the NIC directly in native
>>> mode.  The driver supports zero-copy.  The NIC is configured to use
>>> 1 queue.
>>>
>>> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
>>> for PPS testing.
>>>
>>> iperf3 result:
>>>  TCP stream      : 19.1 Gbps
>>>
>>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>>>  Tx only         : 3.4 Mpps
>>>  Rx only         : 2.0 Mpps
>>>  L2 FWD Loopback : 1.5 Mpps
>>
>> I don't object to merging this backend (considering we've already
>> merged netmap) once the code is fine, but the number is not amazing so
>> I wonder what is the use case for this backend?

I don't think there is a use case right now that would significantly
benefit from the current implementation, so I'm fine if the merge is
postponed.

It is noticeably more performant than a tap with vhost=on in terms of
PPS.  So, that might be one case.  Taking into account that just the
rcu lock and unlock in the virtio-net code takes more time than a
packet copy, some batching on the QEMU side should improve performance
significantly.  And it shouldn't be too hard to implement.

Performance over virtual interfaces may potentially be improved by
creating a kernel thread for async Tx, similarly to what io_uring
allows.  Currently Tx on non-zero-copy interfaces is synchronous, and
that doesn't allow it to scale well.

So, I do think that there is potential in this backend.  The main
benefit, assuming we can reach performance comparable with other
high-performance backends (vhost-user), I think, is the fact that it's
Linux-native and doesn't require talking to any other devices (like
chardevs/sockets), except for the network interface itself.  I.e., it
could be easier to manage in complex environments.

> A more ambitious method is to reuse DPDK via dedicated threads, then
> we can reuse any of its PMD like AF_XDP.

Linking with DPDK would make configuration much more complex.  I don't
think it makes sense to bring it in for AF_XDP specifically.  Might be
a separate project though, sure.

Best regards,
Ilya Maximets.