On Wed, Aug 16, 2023 at 08:30:58PM +0200, Ilya Maximets wrote:
> On 8/16/23 17:30, Stefan Hajnoczi wrote:
> > On Wed, Aug 16, 2023 at 03:36:32PM +0200, Ilya Maximets wrote:
> >> On 8/15/23 14:08, Stefan Hajnoczi wrote:
> >>> virtio-blk and virtio-scsi invoke virtio_irqfd_notify() to send Used
> >>> Buffer Notifications from an IOThread. This involves an eventfd
> >>> write(2) syscall. Calling this repeatedly when completing multiple
> >>> I/O requests in a row is wasteful.
> >>
> >> Hi, Stefan. This is an interesting change!
> >>
> >> There is more or less exactly the same problem with fast network
> >> backends, and I was playing around with similar ideas in this area
> >> while working on the af-xdp network backend recently: primarily, an
> >> implementation of the Rx BH for the virtio-net device and locking the
> >> RCU before passing packets from the backend to the device one by one.
> >>
> >>> Use the blk_io_plug_call() API to batch together virtio_irqfd_notify()
> >>> calls made during Linux AIO (aio=native) or io_uring (aio=io_uring)
> >>> completion processing. Do not modify the thread pool (aio=threads) to
> >>> avoid introducing a dependency from util/ onto the block layer.
> >>
> >> But you're introducing a dependency from generic virtio code onto the
> >> block layer in this patch. This seems to break the module abstraction.
> >>
> >> It looks like there are 2 options to resolve the layering issue here:
> >
> > Yes, it's a layering violation.
> >
> >> 1. Move virtio_notify_irqfd() from virtio.c down to the block layer.
> >>    The block layer is the only user, so that may be justified, but it
> >>    doesn't seem like a particularly good solution. (I'm actually not
> >>    sure why block devices are the only ones using this function...)
> >
> > Yes, this is the easiest way to avoid the layering violation for now.
> >
> > The virtio_notify_irqfd() API is necessary when running in an IOThread
> > because the normal QEMU irq API must run under the Big QEMU Lock.
> > Block devices are the only ones that raise interrupts from IOThreads
> > at the moment.
>
> Ack. Thanks for the explanation!
>
> >> 2. Move and rename the block/plug library somewhere generic. The plug
> >>    library seems to not have any dependencies on the block layer,
> >>    other than the name, so it should not be hard to generalize it
> >>    (well, the naming might be hard).
> >
> > Yes, it should be possible to make it generic quite easily. I will
> > give this a try in the next version of the patch.
>
> OK. Sounds good to me.
>
> >> In general, while looking at the plug library, it seems to do what is
> >> typically implemented via RCU frameworks - the delayed function call.
> >> The only difference is that RCU doesn't check for duplicates and the
> >> callbacks are global. It should not be hard to add some new
> >> functionality to the RCU framework in order to address these, e.g.
> >> rcu_call_local() for calls that should be executed once the current
> >> thread exits its own critical section.
> >
> > This rcu_call_local() idea is unrelated to Read Copy Update, so I
> > don't think it should be part of the RCU API.
>
> Agreed.
>
> > Another deferred function call mechanism is QEMUBH. It already
> > supports coalescing. However, BHs are invoked once per AioContext
> > event loop iteration and there is no way to invoke the BH earlier.
> > Also, the BH pointer needs to be passed to every function that wishes
> > to schedule a deferred call, which can be tedious (e.g.
> > block/linux-aio.c should defer the io_submit(2) syscall until the end
> > of virtio-blk request processing - there are a lot of layers of code
> > between those two components).
>
> >> Using RCU for non-RCU-protected things might be considered an abuse.
> >> However, we might solve two issues in one shot if, instead of
> >> entering a blk_io_plug/unplug section, we enter an RCU critical
> >> section and call the callbacks at the exit.
> >> The first issue is the notification batching that this patch is
> >> trying to fix; the second is an excessive number of thread fences on
> >> RCU exits every time virtio_notify_irqfd() and other virtio functions
> >> are invoked. The second issue can be avoided by using
> >> RCU_READ_LOCK_GUARD() in completion functions. Not sure if that will
> >> improve performance, but it definitely removes a lot of noise from
> >> the perf top for network backends. This makes the code a bit less
> >> explicit, though; the lock guard will definitely need a comment. Then
> >> again, the reason for the blk_io_plug() calls is not fully clear from
> >> the module code alone either.
> >
> > util/aio-posix.c:run_poll_handlers() has a top-level
> > RCU_READ_LOCK_GUARD() for this reason.
>
> Nice, didn't know that.
>
> > Maybe we should do the same around aio_bh_poll() +
> > aio_dispatch_ready_handlers() in util/aio-posix.c:aio_poll()? The
> > downside is that latency-sensitive call_rcu() callbacks perform worse.
>
> "latency-sensitive call_rcu() callback" is a bit of an oxymoron. There
> is no real way to tell when the other thread will exit the critical
> section. But I'm not familiar enough with the code to make a decision
> here.
I see your point :). Another issue occurred to me: drain_call_rcu() is
invoked by some QEMU monitor commands and the monitor runs in QEMU's
event loop. drain_call_rcu() waits for the call_rcu thread, which needs
synchronize_rcu() to finish, so I guess it will hang if called with the
RCU read lock held. There is the potential for a deadlock here. It might
be solvable.

> >> I'm not sure what is the best way forward. I'm trying to figure that
> >> out myself, since a similar solution will in the end be needed for
> >> network backends, so it might be better to have something more
> >> generic. What do you think?
> >
> > I suggest moving blk_io_plug() to util/ and renaming it to
> > defer_begin()/defer_end()/defer_call(). The interface and
> > implementation can also be improved as needed for other users.
>
> Sounds good to me.
>
> >>> Behavior is unchanged for emulated devices that do not use
> >>> blk_io_plug() since blk_io_plug_call() immediately invokes the
> >>> callback when called outside a blk_io_plug()/blk_io_unplug() region.
> >>>
> >>> fio rw=randread bs=4k iodepth=64 numjobs=8 IOPS increases by ~9%
> >>> with a single IOThread and 8 vCPUs. iodepth=1 decreases by ~1% but
> >>> this could be noise. Detailed performance data and configuration
> >>> specifics are available here:
> >>> https://gitlab.com/stefanha/virt-playbooks/-/tree/blk_io_plug-irqfd
> >>>
> >>> This duplicates the BH that virtio-blk uses for batching. The next
> >>> commit will remove it.
> >>
> >> I'm likely missing something, but could you explain why it is safe to
> >> batch unconditionally here? The current BH code, as you mentioned in
> >> the second patch, only batches if EVENT_IDX is not set. It may be
> >> worth adding a few words in the commit message for people like me,
> >> who are a bit rusty on QEMU/virtio internals. :)
> >
> > The BH is conditional on EVENT_IDX not for correctness/safety, but
> > for performance.
>
> I see. Thanks!
> Do you think using a deferred call in the non-irqfd virtio_notify()
> would make sense? Not in this patch set, but in general, e.g. for
> future use from a network backend.

Yes, a deferred call may improve performance in the non-irqfd case too.

> > Commit 12c1c7d7cefb4dbffeb5712e75a33e4692f0a76b
> > ("virtio-blk: dataplane: Don't batch notifications if EVENT_IDX is
> > present") argued that VIRTIO's EVENT_IDX feature already provides
> > batching, so the BH is not needed in that case.
> >
> > Actually I'm not sure exactly why deferring virtio_notify_irqfd()
> > helps, but IOPS improves on my machine. One theory:
> >
> > Linux drivers/virtio/virtio_ring.c:virtqueue_get_buf_ctx_split() (or
> > the end of the irq handler function) updates EVENT_IDX so that QEMU
> > will inject an interrupt again when the next request is completed. If
> > the vCPU and QEMU are running in parallel, then N requests could
> > result in N interrupts when the vCPU updates EVENT_IDX before QEMU
> > completes the next request. Maybe that's why I see better performance
> > when deferring virtio_notify_irqfd().
>
> Sounds very plausible. I actually see only latency and interrupt count
> measurements in both 12c1c7d7cefb ("virtio-blk: dataplane: Don't batch
> notifications if EVENT_IDX is present") and 5b2ffbe4d998 ("virtio-blk:
> dataplane: notify guest as a batch"), as well as in mailing list
> discussions about these changes. So, it's possible that the impact of
> batching on throughput in terms of IOPS was not actually measured
> before (it probably was, but not published).
>
> > (I'm thinking of experimenting with rate-limiting irq injection so
> > that it's possible to batch even more efficiently, but I worry that
> > it will be hard to avoid latency vs throughput regressions.)
> >
> > Stefan
>
> >> Best regards, Ilya Maximets.
> >>
> >>> Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
> >>> ---
> >>>  block/io_uring.c   |  6 ++++++
> >>>  block/linux-aio.c  |  4 ++++
> >>>  hw/virtio/virtio.c | 10 +++++++++-
> >>>  3 files changed, 19 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/block/io_uring.c b/block/io_uring.c
> >>> index 69d9820928..749cf83934 100644
> >>> --- a/block/io_uring.c
> >>> +++ b/block/io_uring.c
> >>> @@ -124,6 +124,9 @@ static void luring_process_completions(LuringState *s)
> >>>  {
> >>>      struct io_uring_cqe *cqes;
> >>>      int total_bytes;
> >>> +
> >>> +    blk_io_plug();
> >>> +
> >>>      /*
> >>>       * Request completion callbacks can run the nested event loop.
> >>>       * Schedule ourselves so the nested event loop will "see" remaining
> >>> @@ -216,7 +219,10 @@ end:
> >>>              aio_co_wake(luringcb->co);
> >>>          }
> >>>      }
> >>> +
> >>>      qemu_bh_cancel(s->completion_bh);
> >>> +
> >>> +    blk_io_unplug();
> >>>  }
> >>>
> >>>  static int ioq_submit(LuringState *s)
> >>>
> >>> diff --git a/block/linux-aio.c b/block/linux-aio.c
> >>> index 561c71a9ae..cef3d6b1c7 100644
> >>> --- a/block/linux-aio.c
> >>> +++ b/block/linux-aio.c
> >>> @@ -204,6 +204,8 @@ static void qemu_laio_process_completions(LinuxAioState *s)
> >>>  {
> >>>      struct io_event *events;
> >>>
> >>> +    blk_io_plug();
> >>> +
> >>>      /* Reschedule so nested event loops see currently pending completions */
> >>>      qemu_bh_schedule(s->completion_bh);
> >>>
> >>> @@ -230,6 +232,8 @@ static void qemu_laio_process_completions(LinuxAioState *s)
> >>>       * own `for` loop. If we are the last all counters droped to zero.
> >>>       */
> >>>      s->event_max = 0;
> >>>      s->event_idx = 0;
> >>> +
> >>> +    blk_io_unplug();
> >>>  }
> >>>
> >>>  static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
> >>>
> >>> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >>> index 309038fd46..a691e8526b 100644
> >>> --- a/hw/virtio/virtio.c
> >>> +++ b/hw/virtio/virtio.c
> >>> @@ -28,6 +28,7 @@
> >>>  #include "hw/virtio/virtio-bus.h"
> >>>  #include "hw/qdev-properties.h"
> >>>  #include "hw/virtio/virtio-access.h"
> >>> +#include "sysemu/block-backend.h"
> >>>  #include "sysemu/dma.h"
> >>>  #include "sysemu/runstate.h"
> >>>  #include "virtio-qmp.h"
> >>> @@ -2426,6 +2427,13 @@ static bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq)
> >>>      }
> >>>  }
> >>>
> >>> +/* Batch irqs while inside a blk_io_plug()/blk_io_unplug() section */
> >>> +static void virtio_notify_irqfd_unplug_fn(void *opaque)
> >>> +{
> >>> +    EventNotifier *notifier = opaque;
> >>> +    event_notifier_set(notifier);
> >>> +}
> >>> +
> >>>  void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
> >>>  {
> >>>      WITH_RCU_READ_LOCK_GUARD() {
> >>> @@ -2452,7 +2460,7 @@ void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
> >>>       * to an atomic operation.
> >>>       */
> >>>      virtio_set_isr(vq->vdev, 0x1);
> >>> -    event_notifier_set(&vq->guest_notifier);
> >>> +    blk_io_plug_call(virtio_notify_irqfd_unplug_fn, &vq->guest_notifier);
>
> There is a trace call in this function. Should we move it, or add a new
> trace call to the callback? The trace was printed right before the
> actual notification before, and now the actual notification is delayed,
> potentially reducing the usefulness of the trace.
>
> >>>  }
> >>>
> >>>  static void virtio_irq(VirtQueue *vq)