On 6/27/23 04:54, Jason Wang wrote:
> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>
>> On 6/26/23 08:32, Jason Wang wrote:
>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasow...@redhat.com> wrote:
>>>>
>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>>>
>>>>> AF_XDP is a network socket family that allows communication directly
>>>>> with the network device driver in the kernel, bypassing most or all
>>>>> of the kernel networking stack.  In essence, the technology is
>>>>> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
>>>>> and works with any network interface without driver modifications.
>>>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
>>>>> require access to character devices or unix sockets.  Only access to
>>>>> the network interface itself is necessary.
>>>>>
>>>>> This patch implements a network backend that communicates with the
>>>>> kernel by creating an AF_XDP socket.  A chunk of userspace memory
>>>>> is shared between QEMU and the host kernel.  Four ring buffers (Tx, Rx,
>>>>> Fill and Completion) are placed in that memory along with a pool of
>>>>> memory buffers for the packet data.  Data transmission is done by
>>>>> allocating one of the buffers, copying packet data into it and
>>>>> placing the pointer into the Tx ring.  After transmission, the device
>>>>> will return the buffer via the Completion ring.  On Rx, the device takes
>>>>> a buffer from a pre-populated Fill ring, writes the packet data into
>>>>> it and places the buffer into the Rx ring.
>>>>>
>>>>> The AF_XDP network backend takes on the communication with the host
>>>>> kernel and the network interface and forwards packets to/from the
>>>>> peer device in QEMU.
>>>>>
>>>>> Usage example:
>>>>>
>>>>>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>>>>>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>>>>>
>>>>> An XDP program bridges the socket with a network interface.  It can be
>>>>> attached to the interface in two different modes:
>>>>>
>>>>> 1. skb    - this mode should work for any interface and doesn't require
>>>>>             driver support, at the cost of lower performance.
>>>>>
>>>>> 2. native - this does require support from the driver and allows
>>>>>             bypassing skb allocation in the kernel and potentially using
>>>>>             zero-copy while getting packets in/out of userspace.
>>>>>
>>>>> By default, QEMU will try to use native mode and fall back to skb.
>>>>> The mode can be forced via the 'mode' option.  To force 'copy' even in
>>>>> native mode, use the 'force-copy=on' option.  This might be useful if
>>>>> there is some issue with the driver.
>>>>>
>>>>> The 'queues=N' option specifies how many device queues should
>>>>> be opened.  Note that all the queues that are not opened are still
>>>>> functional and can receive traffic, but it will not be delivered to
>>>>> QEMU.  So, the number of device queues should generally match the
>>>>> QEMU configuration, unless the device is shared with something
>>>>> else and the traffic redirection to the appropriate queues is correctly
>>>>> configured at the device level (e.g. with ethtool -N).
>>>>> The 'start-queue=M' option can be used to specify the queue id from
>>>>> which QEMU should start configuring 'N' queues.  It might also be
>>>>> necessary to use this option with certain NICs, e.g. MLX5 NICs.  See
>>>>> the docs for examples.
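
(As an aside, for readers not familiar with the AF_XDP data path described
above: the Tx side of the ring workflow looks roughly like the sketch below
when using the libxdp/libbpf xsk helpers.  This is only an illustration, not
code from this patch; umem/socket setup and error handling are omitted, and
the function name 'af_xdp_tx_one' and the 'free_addr' buffer bookkeeping are
made up.)

    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <xdp/xsk.h>   /* libxdp; older setups ship this as <bpf/xsk.h>. */

    /* Send one packet: copy it into a free umem buffer, publish a descriptor
     * in the Tx ring and kick the kernel.  'xsk', 'tx' and 'umem_area' are
     * assumed to be set up with xsk_umem__create()/xsk_socket__create();
     * 'free_addr' is a umem offset reclaimed from the Completion ring. */
    static int af_xdp_tx_one(struct xsk_socket *xsk, struct xsk_ring_prod *tx,
                             void *umem_area, uint64_t free_addr,
                             const void *pkt, uint32_t len)
    {
        uint32_t idx;

        if (xsk_ring_prod__reserve(tx, 1, &idx) != 1) {
            return -1;    /* Tx ring is full. */
        }

        memcpy(xsk_umem__get_data(umem_area, free_addr), pkt, len);
        xsk_ring_prod__tx_desc(tx, idx)->addr = free_addr;
        xsk_ring_prod__tx_desc(tx, idx)->len  = len;
        xsk_ring_prod__submit(tx, 1);

        /* In non-zero-copy modes this kick is where the actual transmission
         * happens, i.e. the "synchronous Tx" mentioned further down. */
        sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
        return 0;
    }

The Rx side is symmetric: buffers are handed to the kernel through the Fill
ring and come back with packet data through the Rx ring.
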
>>>>>
>>>>> In the general case, QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
>>>>> capabilities in order to load default XSK/XDP programs to the
>>>>> network interface and configure BTF maps.
>>>>
>>>> I think you mean "BPF" actually?
>>
>> "BPF Type Format maps" kind of makes some sense, but yes. :)
>>
>>>>
>>>>> It is possible, however,
>>>>> to run only with CAP_NET_RAW.
>>>>
>>>> QEMU often runs without any privileges, so we need to fix it.
>>>>
>>>> I think adding support for SCM_RIGHTS via the monitor would be a way to go.
>>
>> I looked through the code and it seems like we can run completely
>> non-privileged as far as the kernel is concerned.  We'll need an API
>> modification in libxdp though.
>>
>> The thing is, IIUC, the only syscall that requires CAP_NET_RAW is
>> the base socket creation.  Binding and other configuration don't
>> require any privileges.  So, we could create the socket externally
>> and pass it to QEMU.
>
> That's the way TAP works, for example.
>
>> It should work, unless it's an oversight from
>> the kernel side that needs to be patched. :)  libxdp doesn't have
>> a way to specify an externally created socket today, so we'll need
>> to change that.  Should be easy to do though.  I can explore.
>
> Please do that.
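
(To illustrate the idea, a rough sketch of what the privileged side could
look like; untested, and the helper names are made up.  Only the socket()
call needs CAP_NET_RAW, and the resulting fd can be handed to QEMU over a
unix socket with SCM_RIGHTS, the same way TAP fds are passed today.)

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #ifndef AF_XDP
    #define AF_XDP 44    /* In case an older libc doesn't define it yet. */
    #endif

    /* Pass an open fd to another process over a connected unix socket
     * using an SCM_RIGHTS control message. */
    static int send_fd(int unix_sock, int fd)
    {
        char byte = 0;
        struct iovec iov = { .iov_base = &byte, .iov_len = sizeof(byte) };
        union {
            char buf[CMSG_SPACE(sizeof(int))];   /* Space for one fd. */
            struct cmsghdr align;                /* Ensures alignment. */
        } u;
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg;

        memset(&u, 0, sizeof(u));
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(unix_sock, &msg, 0) < 0 ? -1 : 0;
    }

    /* Privileged helper: creating the AF_XDP socket is the only step that
     * needs CAP_NET_RAW; binding and the ring/umem setup could then be done
     * by the unprivileged process that receives the fd. */
    int create_and_pass_xsk(int unix_sock)
    {
        int xsk_fd = socket(AF_XDP, SOCK_RAW, 0);

        if (xsk_fd < 0) {
            perror("socket(AF_XDP)");
            return -1;
        }
        return send_fd(unix_sock, xsk_fd);
    }

The missing piece is on the receiving side: libxdp would need to accept such
an externally created socket instead of creating its own, which is what the
change below is about.
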
I have a prototype:
  https://github.com/igsilya/xdp-tools/commit/db73e90945e3aa5e451ac88c42c83cb9389642d3
Need to test it out and then submit a PR to the xdp-tools project.

>
>>
>> In case the bind syscall actually needs CAP_NET_RAW for some
>> reason, we could change the kernel to allow non-privileged bind
>> by utilizing, e.g., SO_BINDTODEVICE, i.e. let the privileged
>> process bind the socket to a particular device, so QEMU can't
>> bind it to a random one.  Might be a good use case to allow even
>> if not strictly necessary.
>
> Yes.

Will propose something for the kernel as well.  We might want something
more granular though, e.g. binding to a queue instead of a device, in case
we want better control in the device-sharing scenario.

>
>>
>>>>
>>>>
>>>>> For that to work, an external process
>>>>> with admin capabilities will need to pre-load the default XSK program
>>>>> and pass an open file descriptor for this program's 'xsks_map' to the
>>>>> QEMU process on startup.  The network backend will need to be configured
>>>>> with 'inhibit=on' to avoid loading the programs.  The file
>>>>> descriptor for 'xsks_map' can be passed via the 'xsks-map-fd=N' option.
>>>>>
>>>>> There are a few performance challenges with the current network backends.
>>>>>
>>>>> The first is that they do not support IO threads.
>>>>
>>>> The current networking code needs some major refactoring to support IO
>>>> threads, which I'm not sure is worthwhile.
>>>>
>>>>> This means that the data
>>>>> path is handled by the main thread in QEMU and may slow down other
>>>>> work or may be slowed down by some other work.  This also means that
>>>>> taking advantage of multi-queue is generally not possible today.
>>>>>
>>>>> Another thing is that the data path goes through the device emulation
>>>>> code, which is not really optimized for performance.  The fastest
>>>>> "frontend" device is virtio-net.  But it's not optimized for heavy
>>>>> traffic either, because it expects such use cases to be handled via
>>>>> some implementation of vhost (user, kernel, vdpa).  In practice, we
>>>>> have virtio notifications and rcu lock/unlock on a per-packet basis
>>>>> and not very efficient accesses to the guest memory.  Communication
>>>>> channels between backend and frontend devices also do not allow passing
>>>>> more than one packet at a time.
>>>>>
>>>>> Some of these challenges can be avoided in the future by adding better
>>>>> batching into device emulation or by implementing a vhost-af-xdp variant.
>>>>
>>>> It might require you to register (pin) the whole guest memory to XSK, or
>>>> there could be a copy.  Both of them are sub-optimal.
>>
>> A single copy by itself shouldn't be a huge problem, right?
>
> Probably.
>
>> vhost-user and -kernel do copy packets.
>>
>>>>
>>>> A really interesting project is to do AF_XDP passthrough, then we
>>>> don't need to care about pinning and copying and we will get ultra speed
>>>> in the guest.  (But again, it might need BPF support in virtio-net.)
>>
>> I suppose, if we're doing pass-through, we need a new device type and a
>> driver in the kernel/dpdk.  There is no point pretending it's a
>> virtio-net and translating between different ring layouts.
>
> Yes.
>
>> Or is there?
>>
>>>>
>>>>>
>>>>> There are also a few kernel limitations.  AF_XDP sockets do not
>>>>> support any kind of checksum or segmentation offloading.  Buffers
>>>>> are limited to a page size (4K), i.e. the MTU is limited.  Multi-buffer
>>>>> support is not implemented for AF_XDP today.  Also, transmission in
>>>>> all non-zero-copy modes is synchronous, i.e.
>>>>> done in a syscall.
>>>>> That doesn't allow high packet rates on virtual interfaces.
>>>>>
>>>>> However, keeping in mind all of these challenges, the current
>>>>> implementation of the AF_XDP backend shows decent performance while
>>>>> running on top of a physical NIC with zero-copy support.
>>>>>
>>>>> Test setup:
>>>>>
>>>>> 2 VMs running on 2 physical hosts connected via a ConnectX6-Dx card.
>>>>> The network backend is configured to open the NIC directly in native
>>>>> mode.  The driver supports zero-copy.  The NIC is configured to use
>>>>> 1 queue.
>>>>>
>>>>> Inside a VM: iperf3 for basic TCP performance testing and dpdk-testpmd
>>>>> for PPS testing.
>>>>>
>>>>> iperf3 result:
>>>>>   TCP stream      : 19.1 Gbps
>>>>>
>>>>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>>>>>   Tx only         : 3.4 Mpps
>>>>>   Rx only         : 2.0 Mpps
>>>>>   L2 FWD Loopback : 1.5 Mpps
>>>>
>>>> I don't object to merging this backend (considering we've already
>>>> merged netmap) once the code is fine, but the numbers are not amazing,
>>>> so I wonder what the use case for this backend is?
>>
>> I don't think there is a use case right now that would significantly
>> benefit from the current implementation, so I'm fine if the merge is
>> postponed.
>
> Just to be clear, I don't want to postpone this if we decide to
> invest in/enhance it.  I will go through the code and get back.

Ack.  Thanks.

>
>> It is noticeably more performant than a tap with vhost=on in terms of
>> PPS.  So, that might be one case.  Taking into account that just the rcu
>> lock and unlock in the virtio-net code take more time than a packet copy,
>> some batching on the QEMU side should improve performance significantly.
>> And it shouldn't be too hard to implement.
>>
>> Performance over virtual interfaces may potentially be improved by
>> creating a kernel thread for async Tx, similar to what io_uring allows.
>> Currently, Tx on non-zero-copy interfaces is synchronous, and that
>> doesn't scale well.
>
> Interestingly, there is actually a lot of "duplication" between
> io_uring and AF_XDP:
>
> 1) both have a similar memory model (user-registered memory)
> 2) both use rings for communication
>
> I wonder if we can let io_uring talk directly to AF_XDP.

Well, if we submit the poll() in the QEMU main loop via io_uring, then we
can avoid the cost of the synchronous Tx for non-zero-copy modes, i.e. for
virtual interfaces.  The io_uring thread in the kernel will be able to
perform the transmission for us.

But yeah, there are way too many way too similar ring buffer interfaces
in the kernel.

>
>>
>> So, I do think that there is potential in this backend.
>>
>> The main benefit, assuming we can reach performance comparable with other
>> high-performance backends (vhost-user), I think, is the fact that it's
>> Linux-native and doesn't require talking to any other devices
>> (like chardevs/sockets), except for the network interface itself, i.e. it
>> could be easier to manage in complex environments.
>
> Yes.
>
>>
>>> A more ambitious method is to reuse DPDK via dedicated threads, then
>>> we can reuse any of its PMDs, like AF_XDP.
>>
>> Linking with DPDK would make configuration much more complex.  I don't
>> think it makes sense to bring it in for AF_XDP specifically.  It might be
>> a separate project though, sure.
>
> Right.
>
> Thanks
>
>>
>> Best regards, Ilya Maximets.
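
(As a side note on the io_uring idea above: "submitting the poll() via
io_uring" would look roughly like the sketch below with liburing.  This is
only an illustration of the concept being discussed, not code from the
patch, and whether it actually moves the copy-mode Tx work off the QEMU
thread is exactly the open question.)

    #include <poll.h>
    #include <liburing.h>

    /* Queue a poll request for the XSK file descriptor instead of calling
     * poll()/sendto() synchronously from the main loop. */
    static int queue_xsk_poll(struct io_uring *ring, int xsk_fd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe) {
            return -1;    /* Submission queue is full. */
        }
        io_uring_prep_poll_add(sqe, xsk_fd, POLLIN | POLLOUT);
        io_uring_sqe_set_data(sqe, NULL);
        return io_uring_submit(ring) >= 1 ? 0 : -1;
    }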