On Mon, 10 Jul 2023 at 06:55, Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> On 7/10/23 05:51, Jason Wang wrote:
> > On Fri, Jul 7, 2023 at 7:21 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
> >>
> >> On 7/7/23 03:43, Jason Wang wrote:
> >>> On Fri, Jul 7, 2023 at 3:08 AM Stefan Hajnoczi <stefa...@gmail.com> wrote:
> >>>>
> >>>> On Wed, 5 Jul 2023 at 02:02, Jason Wang <jasow...@redhat.com> wrote:
> >>>>>
> >>>>> On Mon, Jul 3, 2023 at 5:03 PM Stefan Hajnoczi <stefa...@gmail.com> 
> >>>>> wrote:
> >>>>>>
> >>>>>> On Fri, 30 Jun 2023 at 09:41, Jason Wang <jasow...@redhat.com> wrote:
> >>>>>>>
> >>>>>>> On Thu, Jun 29, 2023 at 8:36 PM Stefan Hajnoczi <stefa...@gmail.com> 
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> On Thu, 29 Jun 2023 at 07:26, Jason Wang <jasow...@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On Wed, Jun 28, 2023 at 4:25 PM Stefan Hajnoczi 
> >>>>>>>>> <stefa...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Wed, 28 Jun 2023 at 10:19, Jason Wang <jasow...@redhat.com> 
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Jun 28, 2023 at 4:15 PM Stefan Hajnoczi 
> >>>>>>>>>>> <stefa...@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, 28 Jun 2023 at 09:59, Jason Wang <jasow...@redhat.com> 
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Jun 28, 2023 at 3:46 PM Stefan Hajnoczi 
> >>>>>>>>>>>>> <stefa...@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, 28 Jun 2023 at 05:28, Jason Wang <jasow...@redhat.com> 
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets 
> >>>>>>>>>>>>>>> <i.maxim...@ovn.org> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 6/27/23 04:54, Jason Wang wrote:
> >>>>>>>>>>>>>>>>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets 
> >>>>>>>>>>>>>>>>> <i.maxim...@ovn.org> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On 6/26/23 08:32, Jason Wang wrote:
> >>>>>>>>>>>>>>>>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang 
> >>>>>>>>>>>>>>>>>>> <jasow...@redhat.com> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets 
> >>>>>>>>>>>>>>>>>>>> <i.maxim...@ovn.org> wrote:
> >>>>>>>>>>>>>>>>>> It is noticeably more performant than a tap with vhost=on 
> >>>>>>>>>>>>>>>>>> in terms of PPS.
> >>>>>>>>>>>>>>>>>> So, that might be one case.  Taking into account that just 
> >>>>>>>>>>>>>>>>>> rcu lock and
> >>>>>>>>>>>>>>>>>> unlock in virtio-net code takes more time than a packet 
> >>>>>>>>>>>>>>>>>> copy, some batching
> >>>>>>>>>>>>>>>>>> on QEMU side should improve performance significantly.  
> >>>>>>>>>>>>>>>>>> And it shouldn't be
> >>>>>>>>>>>>>>>>>> too hard to implement.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Performance over virtual interfaces may potentially be 
> >>>>>>>>>>>>>>>>>> improved by creating
> >>>>>>>>>>>>>>>>>> a kernel thread for async Tx.  Similarly to what io_uring 
> >>>>>>>>>>>>>>>>>> allows.  Currently
> >>>>>>>>>>>>>>>>>> Tx on non-zero-copy interfaces is synchronous, and that
> >>>>>>>>>>>>>>>>>> doesn't scale well.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Interestingly, there is actually a lot of "duplication"
> >>>>>>>>>>>>>>>>>>> between io_uring and AF_XDP:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 1) both have a similar memory model (user-registered memory)
> >>>>>>>>>>>>>>>>>>> 2) both use rings for communication
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I wonder if we can let io_uring talk directly to AF_XDP.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Well, if we submit poll() in the QEMU main loop via
> >>>>>>>>>>>>>>>> io_uring, then we can avoid the cost of the synchronous Tx
> >>>>>>>>>>>>>>>> for non-zero-copy modes, i.e. for virtual interfaces.  The
> >>>>>>>>>>>>>>>> io_uring thread in the kernel will be able to perform the
> >>>>>>>>>>>>>>>> transmission for us.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> It would be nice if we could use an iothread/vhost rather
> >>>>>>>>>>>>>>> than the main loop, even if io_uring can use kthreads. That
> >>>>>>>>>>>>>>> way we can avoid the memory translation cost.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The QEMU event loop (AioContext) has io_uring code
> >>>>>>>>>>>>>> (util/fdmon-io_uring.c) but it's disabled at the moment. I'm
> >>>>>>>>>>>>>> working
> >>>>>>>>>>>>>> on patches to re-enable it and will probably send them in 
> >>>>>>>>>>>>>> July. The
> >>>>>>>>>>>>>> patches also add an API to submit arbitrary io_uring 
> >>>>>>>>>>>>>> operations so
> >>>>>>>>>>>>>> that you can do stuff besides file descriptor monitoring. Both 
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>> main loop and IOThreads will be able to use io_uring on Linux 
> >>>>>>>>>>>>>> hosts.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Just to make sure I understand: if we still need a copy from
> >>>>>>>>>>>>> the guest to an io_uring buffer, we still need to go through
> >>>>>>>>>>>>> the memory API for GPA translation, which seems expensive.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Vhost seems to be a shortcut for this.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm not sure how exactly you're thinking of using io_uring.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Simply using io_uring for the event loop (file descriptor 
> >>>>>>>>>>>> monitoring)
> >>>>>>>>>>>> doesn't involve an extra buffer, but the packet payload still 
> >>>>>>>>>>>> needs to
> >>>>>>>>>>>> reside in AF_XDP umem, so there is a copy between guest memory 
> >>>>>>>>>>>> and
> >>>>>>>>>>>> umem.
> >>>>>>>>>>>
> >>>>>>>>>>> So there would be a translation from GPA to HVA (unless io_uring
> >>>>>>>>>>> supports 2 stages) which needs to go via the QEMU memory core.
> >>>>>>>>>>> And this part seems to be very expensive according to my tests
> >>>>>>>>>>> in the past.
> >>>>>>>>>>
> >>>>>>>>>> Yes, but in the current approach where AF_XDP is implemented as a 
> >>>>>>>>>> QEMU
> >>>>>>>>>> netdev, there is already QEMU device emulation (e.g. virtio-net)
> >>>>>>>>>> happening. So the GPA to HVA translation will happen anyway in 
> >>>>>>>>>> device
> >>>>>>>>>> emulation.
> >>>>>>>>>
> >>>>>>>>> Just to make sure we're on the same page.
> >>>>>>>>>
> >>>>>>>>> I meant that AF_XDP can do more than, e.g., 10 Mpps. So if we still
> >>>>>>>>> use the QEMU netdev, it would be very hard to achieve that if we
> >>>>>>>>> stick to using the QEMU memory core translations, which need to
> >>>>>>>>> take care of too much extra stuff. That's why I suggest using vhost
> >>>>>>>>> in IOThreads, which only cares about RAM, so the translation can be
> >>>>>>>>> very fast.
> >>>>>>>>
> >>>>>>>> What does using "vhost in io threads" mean?
> >>>>>>>
> >>>>>>> It means a vhost userspace dataplane that is implemented via io 
> >>>>>>> threads.
> >>>>>>
> >>>>>> AFAIK this does not exist today. QEMU's built-in devices that use
> >>>>>> IOThreads don't use vhost code. QEMU vhost code is for vhost kernel,
> >>>>>> vhost-user, or vDPA but not built-in devices that use IOThreads. The
> >>>>>> built-in devices implement VirtioDeviceClass callbacks directly and
> >>>>>> use AioContext APIs to run in IOThreads.
> >>>>>
> >>>>> Yes.
> >>>>>
> >>>>>>
> >>>>>> Do you have an idea for using vhost code for built-in devices? Maybe
> >>>>>> it's fastest if you explain your idea and its advantages instead of me
> >>>>>> guessing.
> >>>>>
> >>>>> It's something like what I proposed in [1]:
> >>>>>
> >>>>> 1) a vhost that is implemented via IOThreads
> >>>>> 2) memory translation is done via the vhost memory table/IOTLB
> >>>>>
> >>>>> The advantages are:
> >>>>>
> >>>>> 1) No 3rd-party application like a DPDK application
> >>>>> 2) The attack surface is reduced
> >>>>> 3) Better understanding of/interaction with the device model for things
> >>>>> like RSS and IOMMU
> >>>>>
> >>>>> There could be some disadvantages, but they're not obvious to me :)
> >>>>
> >>>> Why is QEMU's native device emulation API not the natural choice for
> >>>> writing built-in devices? I don't understand why the vhost interface
> >>>> is desirable for built-in devices.
> >>>
> >>> Unless the memory helpers (like address translation) are fully optimized
> >>> to satisfy this 10M+ PPS.
> >>>
> >>> Not sure if this is too hard, but the last time I benchmarked, perf told
> >>> me most of the time was spent in the translation.
> >>>
> >>> Using vhost is a workaround since its memory model is much simpler, so it
> >>> can skip lots of memory sections like I/O, ROM, etc.
> >>
> >> So, we can have a thread running as part of the QEMU process that
> >> implements vhost functionality for a virtio-net device.  And this thread
> >> has an optimized way to access memory.  What prevents the current
> >> virtio-net emulation code from accessing memory in the same optimized way?
> >
> > The current emulation uses memory core accessors, which need to take care
> > of a lot of stuff like MMIO or even P2P. That kind of stuff has not been a
> > consideration for vhost since day 0. You can do some experiments on this,
> > e.g. just dropping packets after fetching them from the TX ring.
>
> If I'm reading that right, the virtio implementation is using address space
> caching by utilizing a memory listener and pre-translated addresses of
> interesting memory regions.  Then it performs address_space_read_cached,
> which bypasses all the memory address translation logic on a cache hit.
> That sounds pretty similar to how the memory table is prepared for vhost.

Exactly, but only for the vring memory structures (avail, used, and
descriptor rings in the Split Virtqueue Layout).

The packet headers and payloads are still translated through the
uncached virtqueue_pop() -> dma_memory_map() -> address_space_map()
call path.

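To make that contrast concrete, here is a minimal standalone C model of
the two paths (not QEMU code; all names and numbers are invented for
illustration): the ring is translated once, listener-style, so reading
the avail index is just a memcpy from a saved host pointer, while every
packet buffer still pays for a fresh guest-address lookup.

/* Toy model only -- not QEMU code. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Flat GPA -> HVA table, similar in spirit to a vhost memory table. */
struct mem_region {
    uint64_t gpa;
    uint64_t len;
    uint8_t *hva;
};

/* "Uncached" path: search the table on every access. */
static void *translate(const struct mem_region *tbl, int n, uint64_t gpa)
{
    for (int i = 0; i < n; i++) {
        if (gpa >= tbl[i].gpa && gpa - tbl[i].gpa < tbl[i].len) {
            return tbl[i].hva + (gpa - tbl[i].gpa);
        }
    }
    return NULL;
}

int main(void)
{
    static uint8_t guest_ram[4096];
    const struct mem_region tbl[] = { { 0x1000, sizeof(guest_ram), guest_ram } };

    /* One-time setup, analogous to what the memory listener does for the
     * vring: translate once and keep the host pointer. */
    uint8_t *ring = translate(tbl, 1, 0x1000);

    /* "Cached" read: no lookup, just a copy at a fixed offset. */
    uint16_t avail_idx;
    memcpy(&avail_idx, ring + 2, sizeof(avail_idx));

    /* "Uncached" read: each packet buffer goes through the lookup. */
    void *payload = translate(tbl, 1, 0x1800);

    printf("avail_idx=%u payload=%p\n", avail_idx, payload);
    return 0;
}
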
Running a tx packet drop benchmark, as Jason suggested, and checking
whether memory translation is the bottleneck seems worthwhile. Improving
dma_memory_map() performance would speed up all built-in QEMU devices.

Jason: When you noticed this bottleneck, were you using a normal
virtio-net-pci device without vIOMMU?
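
(For clarity, by "without vIOMMU" I mean a plain setup along the lines of:

  -netdev tap,id=net0,vhost=off
  -device virtio-net-pci,netdev=net0

versus a vIOMMU setup roughly like:

  -machine q35,kernel-irqchip=split
  -device intel-iommu,intremap=on,device-iotlb=on
  -netdev tap,id=net0,vhost=off
  -device virtio-net-pci,netdev=net0,iommu_platform=on,ats=on

I'm writing these from memory, so the exact property names may need
double-checking.)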

Stefan
