On 6/28/23 05:27, Jason Wang wrote:
> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>
>> On 6/27/23 04:54, Jason Wang wrote:
>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>>
>>>> On 6/26/23 08:32, Jason Wang wrote:
>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasow...@redhat.com> wrote:
>>>>>>
>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>>>>>
>>>>>>> AF_XDP is a network socket family that allows communication directly
>>>>>>> with the network device driver in the kernel, bypassing most or all
>>>>>>> of the kernel networking stack.  In essence, the technology is
>>>>>>> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
>>>>>>> and works with any network interface without driver modifications.
>>>>>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
>>>>>>> require access to character devices or unix sockets.  Only access to
>>>>>>> the network interface itself is necessary.
>>>>>>>
>>>>>>> This patch implements a network backend that communicates with the
>>>>>>> kernel by creating an AF_XDP socket.  A chunk of userspace memory
>>>>>>> is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
>>>>>>> Fill and Completion) are placed in that memory along with a pool of
>>>>>>> memory buffers for the packet data.  Data transmission is done by
>>>>>>> allocating one of the buffers, copying packet data into it and
>>>>>>> placing the pointer into the Tx ring.  After transmission, the device
>>>>>>> will return the buffer via the Completion ring.  On Rx, the device will
>>>>>>> take a buffer from a pre-populated Fill ring, write the packet data
>>>>>>> into it and place the buffer into the Rx ring.
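
For illustration, a minimal sketch of the Tx path described above, using
libxdp's xsk ring helpers (error handling, umem setup and buffer management
are omitted; everything except the xsk_* helpers is made up):

  #include <stdint.h>
  #include <sys/socket.h>
  #include <xdp/xsk.h>    /* <bpf/xsk.h> with older libbpf */

  /* Transmit one packet already copied into the umem frame at 'addr'. */
  static void tx_one(struct xsk_socket *xsk, struct xsk_ring_prod *tx,
                     uint64_t addr, uint32_t len)
  {
      uint32_t idx;

      if (xsk_ring_prod__reserve(tx, 1, &idx) != 1) {
          return; /* Tx ring full; slots are freed via the Completion ring. */
      }
      xsk_ring_prod__tx_desc(tx, idx)->addr = addr;
      xsk_ring_prod__tx_desc(tx, idx)->len = len;
      xsk_ring_prod__submit(tx, 1);

      /* Kick the kernel so it starts processing the Tx ring. */
      sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
  }

The Rx side is symmetric: buffers are handed to the kernel via the Fill
ring and received packets are picked up from the Rx ring with
xsk_ring_cons__peek().
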
>>>>>>>
>>>>>>> The AF_XDP network backend handles the communication with the host
>>>>>>> kernel and the network interface and forwards packets to/from the
>>>>>>> peer device in QEMU.
>>>>>>>
>>>>>>> Usage example:
>>>>>>>
>>>>>>>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>>>>>>>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>>>>>>>
>>>>>>> An XDP program bridges the socket with a network interface.  It can be
>>>>>>> attached to the interface in two different modes:
>>>>>>>
>>>>>>> 1. skb - this mode should work for any interface and doesn't require
>>>>>>>          driver support, at the cost of lower performance.
>>>>>>>
>>>>>>> 2. native - this does require support from the driver and allows
>>>>>>>             bypassing skb allocation in the kernel and potentially
>>>>>>>             using zero-copy while getting packets into/out of userspace.
>>>>>>>
>>>>>>> By default, QEMU will try to use native mode and fall back to skb.
>>>>>>> The mode can be forced via the 'mode' option.  To force 'copy' even in
>>>>>>> native mode, use the 'force-copy=on' option.  This might be useful if
>>>>>>> there is some issue with the driver.
>>>>>>>
>>>>>>> The 'queues=N' option allows specifying how many device queues should
>>>>>>> be open.  Note that all the queues that are not open are still
>>>>>>> functional and can receive traffic, but it will not be delivered to
>>>>>>> QEMU.  So, the number of device queues should generally match the
>>>>>>> QEMU configuration, unless the device is shared with something else
>>>>>>> and the traffic redirection to the appropriate queues is correctly
>>>>>>> configured at the device level (e.g. with ethtool -N).
>>>>>>> The 'start-queue=M' option can be used to specify the queue id from
>>>>>>> which QEMU should start configuring 'N' queues.  It might also be
>>>>>>> necessary to use this option with certain NICs, e.g. MLX5 NICs.  See
>>>>>>> the docs for examples.
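
As an illustration of the device-sharing case (the queue counts and the
flow rule below are made-up values, not taken from this patch): the device
keeps 4 queues, QEMU opens only queue 3, and ethtool steers the guest's
traffic to that queue:

  ethtool -L ens6f1np1 combined 4
  ethtool -N ens6f1np1 flow-type udp4 dst-port 4789 action 3

  -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,start-queue=3,queues=1
  -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
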
>>>>>>>
>>>>>>> In the general case, QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
>>>>>>> capabilities in order to load default XSK/XDP programs to the
>>>>>>> network interface and configure BTF maps.
>>>>>>
>>>>>> I think you mean "BPF" actually?
>>>>
>>>> "BPF Type Format maps" kind of makes some sense, but yes. :)
>>>>
>>>>>>
>>>>>>>  It is possible, however,
>>>>>>> to run only with CAP_NET_RAW.
>>>>>>
>>>>>> Qemu often runs without any privileges, so we need to fix it.
>>>>>>
>>>>>> I think adding support for SCM_RIGHTS via monitor would be a way to go.
>>>>
>>>> I looked through the code and it seems like we can run completely
>>>> non-privileged as far as the kernel is concerned.  We'll need an API
>>>> modification in libxdp though.
>>>>
>>>> The thing is, IIUC, the only syscall that requires CAP_NET_RAW is
>>>> the base socket creation.  Binding and other configuration don't
>>>> require any privileges.  So, we could create a socket externally
>>>> and pass it to QEMU.
>>>
>>> That's the way TAP works for example.
>>>
>>>>  Should work, unless it's an oversight from
>>>> the kernel side that needs to be patched. :)  libxdp doesn't have
>>>> a way to specify an externally created socket today, so we'll need
>>>> to change that.  Should be easy to do though.  I can explore.
>>>
>>> Please do that.
>>
>> I have a prototype:
>>   
>> https://github.com/igsilya/xdp-tools/commit/db73e90945e3aa5e451ac88c42c83cb9389642d3
>>
>> Need to test it out and then submit a PR to the xdp-tools project.
>>
>>>
>>>>
>>>> In case the bind syscall actually needs CAP_NET_RAW for some
>>>> reason, we could change the kernel and allow a non-privileged bind
>>>> by utilizing, e.g., SO_BINDTODEVICE.  I.e., let the privileged
>>>> process bind the socket to a particular device, so QEMU can't
>>>> bind it to a random one.  Might be a good use case to allow even
>>>> if not strictly necessary.
>>>
>>> Yes.
>>
>> Will propose something for the kernel as well.  We might want something
>> more granular though, e.g. binding to a queue instead of a device, in
>> case we want better control in the device sharing scenario.
> 
> I may be missing something, but the bind is already done at dev plus queue
> right now, isn't it?


Yes, the bind() syscall will bind the socket to the dev+queue.  I was talking
about SO_BINDTODEVICE, which only ties the socket to a particular device,
but not to a queue.

Assuming SO_BINDTODEVICE is implemented for AF_XDP sockets and
assuming a privileged process does:

  fd = socket(AF_XDP, ...);
  setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, <device name>, <name length>);

And sends the fd to a non-privileged process.  That non-privileged process
will be able to call:

  bind(fd, <device>, <random queue>);

It will have to use the same device, but it can choose any queue, as long
as that queue is not already busy with another socket.
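
To make that concrete, a rough sketch of the split (assuming the kernel
enforced the SO_BINDTODEVICE restriction for AF_XDP sockets, which it
doesn't do today; the device name and queue id are just examples):

  /* Privileged process: create the socket and tie it to the device. */
  int fd = socket(AF_XDP, SOCK_RAW, 0);
  setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, "ens6f1np1", strlen("ens6f1np1"));
  /* ... pass 'fd' to QEMU, e.g. via SCM_RIGHTS ... */

  /* Non-privileged process (QEMU): can still pick any queue on that device. */
  struct sockaddr_xdp sxdp = {
      .sxdp_family   = AF_XDP,
      .sxdp_ifindex  = if_nametoindex("ens6f1np1"),
      .sxdp_queue_id = 3,   /* any free queue */
  };
  bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));

(Headers: <sys/socket.h>, <net/if.h>, <string.h> and <linux/if_xdp.h> for
struct sockaddr_xdp.)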

So, I was thinking about maybe implementing something like an XDP_BINDTOQID
socket option.  This way the privileged process could call:

  setsockopt(fd, SOL_XDP, XDP_BINDTOQID, <device>, <queue>);

And later the kernel will be able to refuse bind() to any other queue for
this particular socket.

Not sure if that is necessary though.
Since we're allocating the socket in the privileged process, that process
may add the socket to the BPF map on the correct queue id.  This way the
non-privileged process will not be able to receive any packets from any
other queue on this socket, even if bound to it.  No other AF_XDP
socket will be able to bind to that other queue either.  So, a
rogue QEMU will be able to hog one extra queue, but it will not be able
to intercept any traffic from it, AFAICT.  May not be a huge problem
after all.
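
For reference, the map update mentioned above would be roughly this on the
privileged side (the map fd, queue id and socket fd variables are
illustrative; the map is the XSKMAP that the XDP program redirects into):

  #include <bpf/bpf.h>

  /* Pin the socket into the XSKMAP slot for its queue, so the XDP program
   * can only ever redirect that queue's traffic to this socket. */
  int key = queue_id;    /* slot index == queue id in the default program */
  int value = xsk_fd;    /* the AF_XDP socket fd that is handed to QEMU */
  bpf_map_update_elem(xsks_map_fd, &key, &value, BPF_ANY);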

SO_BINDTODEVICE would still be nice to have, especially for cases where
we give the whole device to one VM.

Best regards, Ilya Maximets.
