On Wed, Aug 16, 2023 at 08:30:58PM +0200, Ilya Maximets wrote:
> On 8/16/23 17:30, Stefan Hajnoczi wrote:
> > On Wed, Aug 16, 2023 at 03:36:32PM +0200, Ilya Maximets wrote:
> >> On 8/15/23 14:08, Stefan Hajnoczi wrote:
> >>> virtio-blk and virtio-scsi invoke virtio_irqfd_notify() to send Used
> >>> Buffer Notifications from an IOThread. This involves an eventfd
> >>> write(2) syscall. Calling this repeatedly when completing multiple I/O
> >>> requests in a row is wasteful.
> >>
> >> Hi, Stefan.  This is an interesting change!
> >>
> >> There is more or less exactly the same problem with fast network backends,
> >> and I was playing around with similar ideas in this area while working on
> >> the af-xdp network backend recently.  Primarily, implementing an Rx BH for
> >> the virtio-net device and taking the RCU read lock before passing packets
> >> from the backend to the device one by one.
> >>
> >>>
> >>> Use the blk_io_plug_call() API to batch together virtio_irqfd_notify()
> >>> calls made during Linux AIO (aio=native) or io_uring (aio=io_uring)
> >>> completion processing. Do not modify the thread pool (aio=threads) to
> >>> avoid introducing a dependency from util/ onto the block layer.
> >>
> >> But you're introducing a dependency from generic virtio code onto the
> >> block layer in this patch.  This seems to break the module abstraction.
> >>
> >> It looks like there are 2 options to resolve the semantics issue here:
> > 
> > Yes, it's a layering violation.
> > 
> >>
> >> 1. Move virtio_notify_irqfd() from virtio.c down to the block layer.
> >>    Block layer is the only user, so that may be justified, but it
> >>    doesn't seem like a particularly good solution.  (I'm actually not
> >>    sure why block devices are the only ones using this function...)
> > 
> > Yes, this is the easiest way to avoid the layering violation for now.
> > 
> > The virtio_notify_irqfd() API is necessary when running in an IOThread
> > because the normal QEMU irq API must run under the Big QEMU Lock. Block
> > devices are the only ones that raise interrupts from IOThreads at the
> > moment.
> 
> Ack.  Thanks for the explanation!
> 
> > 
> >>
> >> 2. Move and rename the block/plug library somewhere generic.  The plug
> >>    library seems to have no dependencies on the block layer other than
> >>    its name, so it should not be hard to generalize (well, the naming
> >>    might be hard).
> > 
> > Yes, it should be possible to make it generic quite easily. I will give
> > this a try in the next version of the patch.
> 
> OK.  Sounds good to me.
> 
> > 
> >> In general, the plug library seems to do what is typically implemented
> >> via RCU frameworks: the delayed function call.  The only difference is
> >> that RCU doesn't check for duplicates and the callbacks are global.  It
> >> should not be hard to add new functionality to the RCU framework to
> >> address these, e.g. an rcu_call_local() for calls that should be
> >> executed once the current thread exits its own critical section.
> > 
> > This rcu_call_local() idea is unrelated to Read Copy Update, so I don't
> > think it should be part of the RCU API.
> 
> Agreed.
> 
> > Another deferred function call mechanism is QEMUBH. It already supports
> > coalescing. However, BHs are invoked once per AioContext event loop
> > iteration and there is no way to invoke the BH earlier. Also, the BH
> > pointer needs to be passed to every function that wishes to schedule a
> > deferred call, which can be tedious (e.g. block/linux-aio.c should defer
> > the io_submit(2) syscall until the end of virtio-blk request processing -
> > there are a lot of layers of code between those two components).
> > 
> >>
> >> Using RCU for non-RCU-protected things might be considered an abuse.
> >> However, we might solve two issues in one shot if, instead of entering
> >> a blk_io_plug/unplug section, we enter an RCU critical section and
> >> call the callbacks on exit.  The first issue is the notification batching
> >> that this patch is trying to fix; the second is an excessive number of
> >> thread fences on RCU exits every time virtio_notify_irqfd() and other
> >> virtio functions are invoked.  The second issue can be avoided by using
> >> RCU_READ_LOCK_GUARD() in completion functions.  Not sure if that will
> >> improve performance, but it definitely removes a lot of noise from
> >> perf top for network backends.  This makes the code a bit less explicit,
> >> though; the lock guard will definitely need a comment.  Then again, the
> >> reason for the blk_io_plug() calls is not fully clear from the module
> >> code alone either.
> > 
> > util/aio-posix.c:run_poll_handlers() has a top-level
> > RCU_READ_LOCK_GUARD() for this reason.
> 
> Nice, didn't know that.
> 
> > Maybe we should do the same
> > around aio_bh_poll() + aio_dispatch_ready_handlers() in
> > util/aio-posix.c:aio_poll()? The downside is that latency-sensitive
> > call_rcu() callbacks perform worse.
> 
> "latency-sensitive call_rcu() callback" is a bit of an oxymoron.
> There is no real way to tell when the other thread will exit the
> critical section.  But I'm not familiar with the code enough to
> make a decision here.

I see your point :).

Another issue occurred to me: drain_call_rcu() is invoked by some QEMU
monitor commands, and the monitor runs in QEMU's event loop.
drain_call_rcu() waits for the call_rcu thread, which needs
synchronize_rcu() to finish, so I guess it will hang if called with the
RCU read lock held. There is a potential deadlock here, though it might
be solvable.

> 
> >>
> >> I'm not sure what the best way forward is.  I'm trying to figure out
> >> that myself, since a similar solution will in the end be needed for
> >> network backends and so it might be better to have something more generic.
> >> What do you think?
> > 
> > I suggest moving blk_io_plug() to util/ and renaming it to
> > defer_begin/defer_end()/defer_call(). The interface and implementation
> > can also be improved as needed for other users.
> 
> Sounds good to me.
> 
> > 
> >>
> >>>
> >>> Behavior is unchanged for emulated devices that do not use blk_io_plug()
> >>> since blk_io_plug_call() immediately invokes the callback when called
> >>> outside a blk_io_plug()/blk_io_unplug() region.
> >>>
> >>> fio rw=randread bs=4k iodepth=64 numjobs=8 IOPS increases by ~9% with a
> >>> single IOThread and 8 vCPUs. iodepth=1 decreases by ~1% but this could
> >>> be noise. Detailed performance data and configuration specifics are
> >>> available here:
> >>> https://gitlab.com/stefanha/virt-playbooks/-/tree/blk_io_plug-irqfd
> >>>
> >>> This duplicates the BH that virtio-blk uses for batching. The next
> >>> commit will remove it.
> >>
> >> I'm likely missing something, but could you explain why it is safe to
> >> batch unconditionally here?  The current BH code, as you mentioned in
> >> the second patch, is only batching if EVENT_IDX is not set.
> >> Maybe worth adding a few words in the commit message for people like
> >> me, who are a bit rusty on QEMU/virtio internals. :)
> > 
> > The BH is conditional on EVENT_IDX not for correctness/safety, but for
> > performance.
> 
> I see.  Thanks!
> Do you think using a deferred call in the non-irqfd virtio_notify()
> would make sense?  Not in this patch set, but in general, e.g. for
> a future use from a network backend.

Yes, a deferred call may improve performance in the non-irqfd case too.

> > Commit 12c1c7d7cefb4dbffeb5712e75a33e4692f0a76b
> > ("virtio-blk: dataplane: Don't batch notifications if EVENT_IDX is
> > present") argued that VIRTIO's EVENT_IDX feature already provides
> > batching so the BH is not needed in that case.
> > 
> > Actually I'm not sure exactly why deferring virtio_notify_irqfd() helps,
> > but IOPS improves on my machine. One theory:
> > 
> > Linux drivers/virtio/virtio_ring.c:virtqueue_get_buf_ctx_split() (or the
> > end of the irq handler function) updates EVENT_IDX so that QEMU will
> > inject an interrupt again when the next request is completed. If the
> > vCPU and QEMU are running in parallel, then N requests could result in N
> > interrupts when the vCPU updates EVENT_IDX before QEMU completes the
> > next request. Maybe that's why I see better performance when deferring
> > virtio_notify_irqfd().
> 
> Sounds very plausible.  I actually see only latency and interrupt
> count measurements in both 12c1c7d7cefb ("virtio-blk: dataplane:
> Don't batch notifications if EVENT_IDX is present") and 5b2ffbe4d998
> ("virtio-blk: dataplane: notify guest as a batch"), as well as in the
> mailing list discussions about these changes.  So it's possible that
> the impact of batching on throughput in terms of IOPS was not actually
> measured before (it probably was, but not published).
> 
> > 
> > (I'm thinking of experimenting with rate limiting irq injection so that
> > it's possible to batch even more efficiently, but I worry that it will
> > be hard to avoid latency vs throughput regressions.)
> > 
> > Stefan
> > 
> >>
> >> Best regards, Ilya Maximets.
> >>
> >>>
> >>> Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
> >>> ---
> >>>  block/io_uring.c   |  6 ++++++
> >>>  block/linux-aio.c  |  4 ++++
> >>>  hw/virtio/virtio.c | 10 +++++++++-
> >>>  3 files changed, 19 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/block/io_uring.c b/block/io_uring.c
> >>> index 69d9820928..749cf83934 100644
> >>> --- a/block/io_uring.c
> >>> +++ b/block/io_uring.c
> >>> @@ -124,6 +124,9 @@ static void luring_process_completions(LuringState *s)
> >>>  {
> >>>      struct io_uring_cqe *cqes;
> >>>      int total_bytes;
> >>> +
> >>> +    blk_io_plug();
> >>> +
> >>>      /*
> >>>       * Request completion callbacks can run the nested event loop.
> >>>       * Schedule ourselves so the nested event loop will "see" remaining
> >>> @@ -216,7 +219,10 @@ end:
> >>>              aio_co_wake(luringcb->co);
> >>>          }
> >>>      }
> >>> +
> >>>      qemu_bh_cancel(s->completion_bh);
> >>> +
> >>> +    blk_io_unplug();
> >>>  }
> >>>  
> >>>  static int ioq_submit(LuringState *s)
> >>> diff --git a/block/linux-aio.c b/block/linux-aio.c
> >>> index 561c71a9ae..cef3d6b1c7 100644
> >>> --- a/block/linux-aio.c
> >>> +++ b/block/linux-aio.c
> >>> @@ -204,6 +204,8 @@ static void qemu_laio_process_completions(LinuxAioState *s)
> >>>  {
> >>>      struct io_event *events;
> >>>  
> >>> +    blk_io_plug();
> >>> +
> >>>      /* Reschedule so nested event loops see currently pending completions */
> >>>      qemu_bh_schedule(s->completion_bh);
> >>>  
> >>> @@ -230,6 +232,8 @@ static void qemu_laio_process_completions(LinuxAioState *s)
> >>>       * own `for` loop.  If we are the last all counters droped to zero. */
> >>>      s->event_max = 0;
> >>>      s->event_idx = 0;
> >>> +
> >>> +    blk_io_unplug();
> >>>  }
> >>>  
> >>>  static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
> >>> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >>> index 309038fd46..a691e8526b 100644
> >>> --- a/hw/virtio/virtio.c
> >>> +++ b/hw/virtio/virtio.c
> >>> @@ -28,6 +28,7 @@
> >>>  #include "hw/virtio/virtio-bus.h"
> >>>  #include "hw/qdev-properties.h"
> >>>  #include "hw/virtio/virtio-access.h"
> >>> +#include "sysemu/block-backend.h"
> >>>  #include "sysemu/dma.h"
> >>>  #include "sysemu/runstate.h"
> >>>  #include "virtio-qmp.h"
> >>> @@ -2426,6 +2427,13 @@ static bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq)
> >>>      }
> >>>  }
> >>>  
> >>> +/* Batch irqs while inside a blk_io_plug()/blk_io_unplug() section */
> >>> +static void virtio_notify_irqfd_unplug_fn(void *opaque)
> >>> +{
> >>> +    EventNotifier *notifier = opaque;
> >>> +    event_notifier_set(notifier);
> >>> +}
> >>> +
> >>>  void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
> >>>  {
> >>>      WITH_RCU_READ_LOCK_GUARD() {
> >>> @@ -2452,7 +2460,7 @@ void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
> >>>       * to an atomic operation.
> >>>       */
> >>>      virtio_set_isr(vq->vdev, 0x1);
> >>> -    event_notifier_set(&vq->guest_notifier);
> >>> +    blk_io_plug_call(virtio_notify_irqfd_unplug_fn, &vq->guest_notifier);
> 
> There is a trace call in this function.  Should we move it or add a new
> trace call to the callback?  The trace was printed right before the
> actual notification before, and now the actual notification is delayed
> potentially reducing the usefulness of the trace.
> 
> >>>  }
> >>>  
> >>>  static void virtio_irq(VirtQueue *vq)
> >>
> 
