On Sat, Aug 5, 2023 at 2:20 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> AF_XDP is a network socket family that allows communication directly
> with the network device driver in the kernel, bypassing most or all
> of the kernel networking stack. In essence, the technology is pretty
> similar to netmap. But, unlike netmap, AF_XDP is Linux-native and
> works with any network interfaces without driver modifications.
> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
> require access to character devices or unix sockets. Only access to
> the network interface itself is necessary.
>
> This patch implements a network backend that communicates with the
> kernel by creating an AF_XDP socket. A chunk of userspace memory
> is shared between QEMU and the host kernel. 4 ring buffers (Tx, Rx,
> Fill and Completion) are placed in that memory along with a pool of
> memory buffers for the packet data. Data transmission is done by
> allocating one of the buffers, copying packet data into it and
> placing the pointer into the Tx ring. After transmission, the device
> will return the buffer via the Completion ring. On Rx, the device
> will take a buffer from a pre-populated Fill ring, write the packet
> data into it and place the buffer into the Rx ring.
>
> The AF_XDP network backend takes on the communication with the host
> kernel and the network interface and forwards packets to/from the
> peer device in QEMU.
>
> Usage example:
>
>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>
> An XDP program bridges the socket with a network interface. It can
> be attached to the interface in 2 different modes:
>
> 1. skb    - this mode should work for any interface and doesn't
>             require driver support, with the caveat of lower
>             performance.
>
> 2. native - this mode requires support from the driver and allows
>             bypassing skb allocation in the kernel and potentially
>             using zero-copy while getting packets in/out of
>             userspace.
>
> By default, QEMU will try to use native mode and fall back to skb.
> The mode can be forced via the 'mode' option. To force 'copy' even
> in native mode, use the 'force-copy=on' option. This might be useful
> if there is some issue with the driver.
>
> The 'queues=N' option allows specifying how many device queues should
> be open. Note that all the queues that are not open are still
> functional and can receive traffic, but it will not be delivered to
> QEMU. So, the number of device queues should generally match the
> QEMU configuration, unless the device is shared with something else
> and the traffic redirection to the appropriate queues is correctly
> configured at the device level (e.g. with ethtool -N).
> The 'start-queue=M' option can be used to specify from which queue id
> QEMU should start configuring 'N' queues. It might also be necessary
> to use this option with certain NICs, e.g. MLX5 NICs. See the docs
> for examples.
>
> In the general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
> or CAP_BPF capabilities in order to load default XSK/XDP programs to
> the network interface and configure BPF maps. It is possible,
> however, to run with no capabilities. For that to work, an external
> process with enough capabilities will need to pre-load the default
> XSK program, create AF_XDP sockets and pass their file descriptors
> to the QEMU process on startup via the 'sock-fds' option. The network
> backend will need to be configured with 'inhibit=on' to avoid loading
> the program.
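
As a side note for anyone reading along: below is roughly what the umem and
four-ring setup described above looks like with the libxdp API. It is only a
minimal sketch, not code from the patch; the interface name and queue id are
taken from the usage example above, default configs are used everywhere, and
there is no error cleanup.

    /* Minimal sketch (not the patch's code): create the shared umem with its
     * Fill/Completion rings, then an AF_XDP socket with its Rx/Tx rings. */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <xdp/xsk.h>

    int main(void)
    {
        struct xsk_ring_prod fill, tx;
        struct xsk_ring_cons comp, rx;
        struct xsk_umem *umem;
        struct xsk_socket *xsk;
        __u64 size = XSK_UMEM__DEFAULT_NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE;
        void *buffers;

        /* Chunk of userspace memory shared with the kernel (page aligned). */
        buffers = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buffers == MAP_FAILED) {
            return 1;
        }

        /* Register the umem; the Fill and Completion rings are set up here. */
        if (xsk_umem__create(&umem, buffers, size, &fill, &comp, NULL)) {
            return 1;
        }

        /* Bind a socket to queue 0 of the interface; the Rx and Tx rings are
         * set up here. With a default config libxdp also loads the default
         * XDP program, one of the operations needing extra capabilities. */
        if (xsk_socket__create(&xsk, "ens6f1np1", 0, umem, &rx, &tx, NULL)) {
            return 1;
        }

        /* The raw fd, e.g. what an external helper could hand over when QEMU
         * itself runs without capabilities ('sock-fds' / 'inhibit=on'). */
        printf("xsk fd: %d\n", xsk_socket__fd(xsk));
        return 0;
    }
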
> QEMU will need 32 MB of locked memory (RLIMIT_MEMLOCK) per queue
> or CAP_IPC_LOCK.
>
> There are a few performance challenges with the current network
> backends.
>
> The first is that they do not support IO threads. This means that
> the data path is handled by the main thread in QEMU and may slow
> down other work or may be slowed down by some other work. This also
> means that taking advantage of multi-queue is generally not possible
> today.
>
> Another is that the data path goes through the device emulation
> code, which is not really optimized for performance. The fastest
> "frontend" device is virtio-net. But it's not optimized for heavy
> traffic either, because it expects such use cases to be handled via
> some implementation of vhost (user, kernel, vdpa). In practice, we
> have virtio notifications and rcu lock/unlock on a per-packet basis
> and not very efficient accesses to the guest memory. Communication
> channels between backend and frontend devices do not allow passing
> more than one packet at a time either.
>
> Some of these challenges can be avoided in the future by adding
> better batching into device emulation or by implementing a
> vhost-af-xdp variant.
>
> There are also a few kernel limitations. AF_XDP sockets do not
> support any kind of checksum or segmentation offloading. Buffers
> are limited to a page size (4K), i.e. MTU is limited. Multi-buffer
> support implementation for AF_XDP is in progress, but not ready yet.
> Also, transmission in all non-zero-copy modes is synchronous, i.e.
> done in a syscall. That doesn't allow high packet rates on virtual
> interfaces.
>
> However, keeping in mind all of these challenges, the current
> implementation of the AF_XDP backend shows decent performance while
> running on top of a physical NIC with zero-copy support.
>
> Test setup:
>
> 2 VMs running on 2 physical hosts connected via ConnectX-6 Dx cards.
> The network backend is configured to open the NIC directly in native
> mode. The driver supports zero-copy. The NIC is configured to use
> 1 queue.
>
> Inside a VM - iperf3 for basic TCP performance testing and
> dpdk-testpmd for PPS testing.
>
> iperf3 result:
>   TCP stream      : 19.1 Gbps
>
> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>   Tx only         : 3.4 Mpps
>   Rx only         : 2.0 Mpps
>   L2 FWD Loopback : 1.5 Mpps
>
> In skb mode the same setup shows much lower performance, similar to
> the setup where the pair of physical NICs is replaced with a veth
> pair:
>
> iperf3 result:
>   TCP stream      : 9 Gbps
>
> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>   Tx only         : 1.2 Mpps
>   Rx only         : 1.0 Mpps
>   L2 FWD Loopback : 0.7 Mpps
>
> Results in skb mode or over the veth pair are close to the results
> of a tap backend with vhost=on and segmentation offloading disabled,
> bridged with a NIC.
>
> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
> ---
>
> Version 3:
>
>   - Bumped requirements to libxdp 1.4.0+. With that, removed all
>     the conditional compilation parts, since all the needed APIs
>     are available in this version of libxdp.
>
>   - Also removed the ability to pass the xsks map fd, since the
>     ability to just pass socket fds is now always available and it
>     doesn't require any capabilities, unlike manipulations with
>     BPF maps.
>
>   - Updated documentation to not call out specific vendors, memory
>     numbers or specific required capabilities.
>
>   - Changed the logic of returning peeked-at but unused descriptors.
>
>   - Minor cleanups.
>
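
And, in the same spirit, a minimal sketch of the descriptor flow over those
rings on the data path. Again this is not the patch's code: handle_packet()
is a hypothetical stand-in for handing a frame to the peer device, the batch
size is arbitrary, and wakeup and error handling are left out.

    #include <string.h>
    #include <xdp/xsk.h>

    /* Hypothetical stand-in for delivering a received frame to the peer. */
    void handle_packet(void *data, __u32 len);

    /* Tx: copy the packet into a free umem buffer at 'addr' and post it to
     * the Tx ring; the kernel later returns the buffer via the Completion
     * ring (not shown). */
    static void tx_one(struct xsk_ring_prod *tx, void *umem_area,
                       __u64 addr, const void *pkt, __u32 len)
    {
        struct xdp_desc *desc;
        __u32 idx;

        if (xsk_ring_prod__reserve(tx, 1, &idx) != 1) {
            return; /* Tx ring full. */
        }
        memcpy(xsk_umem__get_data(umem_area, addr), pkt, len);
        desc = xsk_ring_prod__tx_desc(tx, idx);
        desc->addr = addr;
        desc->len = len;
        xsk_ring_prod__submit(tx, 1);
        /* In non-zero-copy modes a syscall (e.g. sendto) still has to kick
         * the actual transmission, as noted above. */
    }

    /* Rx: consume buffers the kernel placed on the Rx ring after taking
     * them from the Fill ring, then give the buffers back to Fill. */
    static void rx_batch(struct xsk_ring_cons *rx, struct xsk_ring_prod *fill,
                         void *umem_area)
    {
        __u32 i, idx_rx, idx_fill;
        __u32 n = xsk_ring_cons__peek(rx, 32, &idx_rx);

        if (!n) {
            return;
        }
        if (xsk_ring_prod__reserve(fill, n, &idx_fill) != n) {
            /* Give the peeked-at but unused descriptors back; the v3 note
             * about returning such descriptors concerns this kind of case. */
            xsk_ring_cons__cancel(rx, n);
            return;
        }
        for (i = 0; i < n; i++) {
            const struct xdp_desc *desc = xsk_ring_cons__rx_desc(rx, idx_rx + i);

            handle_packet(xsk_umem__get_data(umem_area, desc->addr), desc->len);
            *xsk_ring_prod__fill_addr(fill, idx_fill + i) = desc->addr;
        }
        xsk_ring_prod__submit(fill, n);
        xsk_ring_cons__release(rx, n);
    }
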
Queued this for 8.2. Thanks