On 6/26/23 08:32, Jason Wang wrote:
> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasow...@redhat.com> wrote:
>>
>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>
>>> AF_XDP is a network socket family that allows communication directly
>>> with the network device driver in the kernel, bypassing most or all
>>> of the kernel networking stack.  In essence, the technology is
>>> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
>>> and works with any network interfaces without driver modifications.
>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
>>> require access to character devices or unix sockets.  Only access to
>>> the network interface itself is necessary.
>>>
>>> This patch implements a network backend that communicates with the
>>> kernel by creating an AF_XDP socket.  A chunk of userspace memory
>>> is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
>>> Fill and Completion) are placed in that memory along with a pool of
>>> memory buffers for the packet data.  Data transmission is done by
>>> allocating one of the buffers, copying packet data into it and
>>> placing the pointer into Tx ring.  After transmission, device will
>>> return the buffer via Completion ring.  On Rx, device will take
>>> a buffer from a pre-populated Fill ring, write the packet data into
>>> it and place the buffer into Rx ring.
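
The Tx/Completion half of the buffer lifecycle described above can be
sketched as a toy model.  This is not the QEMU or libxdp code: the class
ToyXsk, its method names, the 4-frame pool and the 4 KiB frame size are all
assumptions made for illustration, and the "device" is just a method call.

```python
from collections import deque

FRAME_SIZE = 4096  # assumed: one packet buffer per 4 KiB frame
NUM_FRAMES = 4     # assumed: deliberately tiny pool to show exhaustion


class ToyXsk:
    """Toy model of a shared buffer pool with Tx and Completion rings."""

    def __init__(self):
        self.umem = bytearray(FRAME_SIZE * NUM_FRAMES)  # shared memory area
        self.free = deque(i * FRAME_SIZE for i in range(NUM_FRAMES))
        self.tx = deque()          # (addr, length) descriptors for the device
        self.completion = deque()  # addresses handed back after transmission
        self.wire = []             # what the pretend device has sent

    def send(self, payload: bytes) -> bool:
        """Allocate a buffer, copy the packet in, queue it on the Tx ring."""
        if len(payload) > FRAME_SIZE or not self.free:
            return False
        addr = self.free.popleft()
        self.umem[addr:addr + len(payload)] = payload
        self.tx.append((addr, len(payload)))
        return True

    def device_transmit(self) -> None:
        """Pretend device: drain Tx and return buffers via Completion."""
        while self.tx:
            addr, length = self.tx.popleft()
            self.wire.append(bytes(self.umem[addr:addr + length]))
            self.completion.append(addr)

    def reclaim(self) -> int:
        """Move completed buffers back to the free pool; return the count."""
        count = len(self.completion)
        self.free.extend(self.completion)
        self.completion.clear()
        return count
```

In this model send() starts failing once all frames are in flight and
succeeds again only after device_transmit() and reclaim() have run, which
is the back-pressure the Completion ring provides.  The Rx/Fill half would
mirror it with the roles reversed.
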
>>>
>>> AF_XDP network backend takes on the communication with the host
>>> kernel and the network interface and forwards packets to/from the
>>> peer device in QEMU.
>>>
>>> Usage example:
>>>
>>>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>>>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>>>
>>> XDP program bridges the socket with a network interface.  It can be
>>> attached to the interface in 2 different modes:
>>>
>>> 1. skb - this mode should work for any interface and doesn't require
>>>          driver support.  With a caveat of lower performance.
>>>
>>> 2. native - this does require support from the driver and allows
>>>             bypassing skb allocation in the kernel and potentially
>>>             using zero-copy while getting packets in/out of userspace.
>>>
>>> By default, QEMU will try to use native mode and fall back to skb.
>>> Mode can be forced via 'mode' option.  To force 'copy' even in native
>>> mode, use 'force-copy=on' option.  This might be useful if there is
>>> some issue with the driver.
>>>
>>> Option 'queues=N' allows specifying how many device queues should
>>> be open.  Note that all the queues that are not open are still
>>> functional and can receive traffic, but it will not be delivered to
>>> QEMU.  So, the number of device queues should generally match the
>>> QEMU configuration, unless the device is shared with something
>>> else and the traffic re-direction to appropriate queues is correctly
>>> configured on a device level (e.g. with ethtool -N).
>>> 'start-queue=M' option can be used to specify from which queue id
>>> QEMU should start configuring 'N' queues.  It might also be necessary
>>> to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
>>> for examples.
>>>
>>> In the general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
>>> capabilities in order to load default XSK/XDP programs to the
>>> network interface and configure BTF maps.
>>
>> I think you mean "BPF" actually?

"BPF Type Format maps" kind of makes some sense, but yes. :)

>>
>>>  It is possible, however,
>>> to run only with CAP_NET_RAW.
>>
>> Qemu often runs without any privileges, so we need to fix it.
>>
>> I think adding support for SCM_RIGHTS via monitor would be a way to go.

I looked through the code and it seems like we can run completely
non-privileged as far as the kernel is concerned.  We'll need an API
modification in libxdp though.

The thing is, IIUC, the only syscall that requires CAP_NET_RAW is
the base socket creation.  Binding and other configuration don't
require any privileges.  So, we could create a socket externally
and pass it to QEMU.  Should work, unless it's an oversight from
the kernel side that needs to be patched. :)  libxdp doesn't have
a way to specify an externally created socket today, so we'll need
to change that.  Should be easy to do though.  I can explore.
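
A minimal sketch of just the fd-passing part, assuming a Unix socket
between a privileged helper and the unprivileged process.  The helper
names send_fd/recv_fd/demo are hypothetical, and a pipe stands in for the
AF_XDP socket here, since creating the real one needs CAP_NET_RAW.  This
is not how QEMU or libxdp do it today; it only shows the SCM_RIGHTS
mechanism (Python 3.9+ wraps it as socket.send_fds/recv_fds).

```python
import os
import socket


def send_fd(chan: socket.socket, fd: int) -> None:
    """Privileged side: hand an already-open descriptor to the peer."""
    # SCM_RIGHTS ancillary data has to travel with at least one payload byte.
    socket.send_fds(chan, [b"x"], [fd])


def recv_fd(chan: socket.socket) -> int:
    """Unprivileged side: receive exactly one descriptor from the peer."""
    _msg, fds, _flags, _addr = socket.recv_fds(chan, 1, 1)
    if len(fds) != 1:
        raise RuntimeError("expected exactly one file descriptor")
    return fds[0]


def demo() -> bytes:
    """Pass the write end of a pipe across a socketpair and use it."""
    # Assumption: the pipe is a stand-in for an externally created socket.
    read_end, write_end = os.pipe()
    helper, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        send_fd(helper, write_end)
        received = recv_fd(receiver)
        os.write(received, b"ping")  # the new fd refers to the same pipe
        data = os.read(read_end, 4)
        os.close(received)
        return data
    finally:
        os.close(read_end)
        os.close(write_end)
        helper.close()
        receiver.close()
```

The receiver ends up with a new descriptor number that refers to the same
open file, so whatever was configured on it by the privileged side stays
in effect.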

In case the bind syscall actually needs CAP_NET_RAW for some
reason, we could change the kernel and allow non-privileged bind
by utilizing, e.g., SO_BINDTODEVICE.  I.e., let the privileged
process bind the socket to a particular device, so QEMU can't
bind it to a random one.  Might be a good use case to allow even
if not strictly necessary.

>>
>>
>>> For that to work, an external process
>>> with admin capabilities will need to pre-load default XSK program
>>> and pass an open file descriptor for this program's 'xsks_map' to
>>> QEMU process on startup.  Network backend will need to be configured
>>> with 'inhibit=on' to avoid loading of the programs.  The file
>>> descriptor for 'xsks_map' can be passed via 'xsks-map-fd=N' option.
>>>
>>> There are a few performance challenges with the current network backends.
>>>
>>> First is that they do not support IO threads.
>>
>> The current networking code needs some major refactoring to support IO
>> threads which I'm not sure is worthwhile.
>>
>>> This means that data
>>> path is handled by the main thread in QEMU and may slow down other
>>> work or may be slowed down by some other work.  This also means that
>>> taking advantage of multi-queue is generally not possible today.
>>>
>>> Another thing is that data path is going through the device emulation
>>> code, which is not really optimized for performance.  The fastest
>>> "frontend" device is virtio-net.  But it's not optimized for heavy
>>> traffic either, because it expects such use-cases to be handled via
>>> some implementation of vhost (user, kernel, vdpa).  In practice, we
>>> have virtio notifications and rcu lock/unlock on a per-packet basis
>>> and not very efficient accesses to the guest memory.  Communication
>>> channels between backend and frontend devices do not allow passing
>>> more than one packet at a time as well.
>>>
>>> Some of these challenges can be avoided in the future by adding better
>>> batching into device emulation or by implementing vhost-af-xdp variant.
>>
>> It might require you to register (pin) the whole guest memory to XSK or
>> there could be a copy. Both of them are sub-optimal.

A single copy by itself shouldn't be a huge problem, right?
vhost-user and -kernel do copy packets.

>>
>> A really interesting project is to do AF_XDP passthrough, then we
>> don't need to care about pin and copy and we will get ultra speed in
>> the guest. (But again, it might need BPF support in virtio-net).

I suppose, if we're doing pass-through we need a new device type and a
driver in the kernel/dpdk.  There is no point pretending it's a
virtio-net and translating between different ring layouts.  Or is there?

>>
>>>
>>> There are also a few kernel limitations.  AF_XDP sockets do not
>>> support any kinds of checksum or segmentation offloading.  Buffers
>>> are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
>>> support is not implemented for AF_XDP today.  Also, transmission in
>>> all non-zero-copy modes is synchronous, i.e. done in a syscall.
>>> That doesn't allow high packet rates on virtual interfaces.
>>>
>>> However, keeping in mind all of these challenges, current implementation
>>> of the AF_XDP backend shows a decent performance while running on top
>>> of a physical NIC with zero-copy support.
>>>
>>> Test setup:
>>>
>>> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
>>> Network backend is configured to open the NIC directly in native mode.
>>> The driver supports zero-copy.  NIC is configured to use 1 queue.
>>>
>>> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
>>> for PPS testing.
>>>
>>> iperf3 result:
>>>  TCP stream      : 19.1 Gbps
>>>
>>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>>>  Tx only         : 3.4 Mpps
>>>  Rx only         : 2.0 Mpps
>>>  L2 FWD Loopback : 1.5 Mpps
>>
>> I don't object to merging this backend (considering we've already
>> merged netmap) once the code is fine, but the number is not amazing so
>> I wonder what is the use case for this backend?

I don't think there is a use case right now that would significantly benefit
from the current implementation, so I'm fine if the merge is postponed.
It is noticeably more performant than a tap with vhost=on in terms of PPS.
So, that might be one case.  Taking into account that just rcu lock and
unlock in virtio-net code takes more time than a packet copy, some batching
on the QEMU side should improve performance significantly.  And it shouldn't be
too hard to implement.
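
To illustrate why batching helps, here is a toy model, not QEMU code.
CountingLock is a hypothetical stand-in for the per-packet fixed costs
(rcu lock/unlock, notifications), and the batch size of 16 is an arbitrary
assumption.  The point is only that the fixed cost is paid once per batch
instead of once per packet.

```python
import threading


class CountingLock:
    """A lock that counts acquisitions, standing in for per-packet overhead."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1
        return self

    def __exit__(self, *exc):
        self._lock.release()
        return False


def deliver_per_packet(packets, lock, sink):
    """Current shape of the data path: one lock/unlock cycle per packet."""
    for pkt in packets:
        with lock:
            sink.append(pkt)


def deliver_batched(packets, lock, sink, batch_size=16):
    """Batched variant: one lock/unlock cycle per batch_size packets."""
    for start in range(0, len(packets), batch_size):
        with lock:
            sink.extend(packets[start:start + batch_size])
```

For 64 packets the per-packet path takes the lock 64 times and the batched
path takes it 4 times, while both deliver the same packets in the same
order.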

Performance over virtual interfaces may potentially be improved by creating
a kernel thread for async Tx, similar to what io_uring allows.  Currently,
Tx on non-zero-copy interfaces is synchronous, and that doesn't scale well.

So, I do think that there is potential in this backend.

The main benefit, assuming we can reach performance comparable with other
high-performance backends (vhost-user), I think, is the fact that it's
Linux-native and doesn't require talking with any other devices
(like chardevs/sockets), except for the network interface itself, i.e. it
could be easier to manage in complex environments.

> A more ambitious method is to reuse DPDK via dedicated threads, then
> we can reuse any of its PMD like AF_XDP.

Linking with DPDK will make configuration much more complex.  I don't
think it makes sense to bring it in for AF_XDP specifically.  Might be
a separate project though, sure.  

Best regards, Ilya Maximets.
