Re: [PATCH for-9.0?] usb-storage: Fix BlockConf defaults
On 12.04.24 16:42, Kevin Wolf wrote:

Commit 30896374 started to pass the full BlockConf from usb-storage to scsi-disk, while previously only a few select properties would be forwarded. This enables the user to set more properties, e.g. the block size, that actually take effect.

However, now the calls to blkconf_apply_backend_options() and blkconf_blocksizes() in usb_msd_storage_realize() that modify some of these properties take effect, too, instead of being silently ignored. This means at least that the block sizes get an unconditional default of 512 bytes before the configuration is passed to scsi-disk. Before commit 30896374, the property wouldn't be set for scsi-disk and therefore the device-dependent defaults would apply: 512 for scsi-hd, but 2048 for scsi-cd. The latter default has now become 512, too, which makes at least Windows 11 installation fail when installing from usb-storage.

Fix this by simply not calling these functions any more in usb-storage and passing BlockConf on unmodified (except for the BlockBackend). The same functions are called by the SCSI code anyway, and it sets the right defaults for the actual media type.

Fixes: 308963746169 ('scsi: Don't ignore most usb-storage properties')
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2260
Reported-by: Jonas Svensson
Signed-off-by: Kevin Wolf
---
Considering this a candidate for 9.0 given that we already have an rc4: it's a regression from 8.2 and breaks installing Windows from USB.

 hw/usb/dev-storage-classic.c | 9 ---------
 1 file changed, 9 deletions(-)

Reviewed-by: Hanna Czenczek
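A rough sketch of the change described above (argument lists abbreviated with "..."; the actual patch is the 9-line deletion in hw/usb/dev-storage-classic.c). scsi-disk calls both functions during its own realize and applies the media-dependent block-size defaults (512 for scsi-hd, 2048 for scsi-cd):

 static void usb_msd_storage_realize(USBDevice *dev, Error **errp)
 {
     ...
-    if (!blkconf_apply_backend_options(&s->conf, ..., errp)) {
-        return;
-    }
-    if (!blkconf_blocksizes(&s->conf, errp)) {
-        return;
-    }
     ...
 }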
Re: [PATCH 0/2] block: Allow concurrent BB context changes
On 10.02.24 09:46, Michael Tokarev wrote: 09.02.2024 19:51, Hanna Czenczek : On 09.02.24 15:08, Michael Tokarev wrote: 02.02.2024 17:47, Hanna Czenczek : Hi, Without the AioContext lock, a BB's context may kind of change at any time (unless it has a root node, and I/O requests are pending). That also means that its own context (BlockBackend.ctx) and that of its root node can differ sometimes (while the context is being changed). How relevant is this for -stable (8.2 at least), which does not have the "scsi: eliminate AioContext lock" patchset, and in particular v8.2.0-124-geaad0fe260 "scsi: only access SCSIDevice->requests from one thread"? The issue the first patch "block-backend: Allow concurrent context changes" fixes (RHEL-19381) seems to be for 8.1.something, so it exists in 8.2 too, and this particular fix applies to 8.2. But with other changes around all this, I'm a bit lost as to what should be done on stable. Not even thinking about 7.2 here :) Ah, sorry, yes. Since we do still have the AioContext lock, this series won't be necessary in -stable. Sorry for the noise! Hm. Now I'm confused even more.. :) ad89367202 "block-backend: Allow concurrent context changes" - the first one in this series - apparently is needed, as it fixes an issue reported for qemu 8.1 (https://issues.redhat.com/browse/RHEL-19381). Or is it not the case? Ah, yes, I got confused there. There are two (unfortunately? fortunately? Red-Hat-internal) comments, one of which describes the crash that's fixed here, so I thought that bug described this crash. But the actual description in the report describes something different (more like what's fixed by https://lists.nongnu.org/archive/html/qemu-devel/2024-01/msg03649.html, but I'm not entirely sure yet). So basically I got the bug link wrong. We now have https://issues.redhat.com/browse/RHEL-24593, which has been reported only against 8.2. Hanna FWIW, truth is born in the noise, not in silence ;) Thanks, /mjt
Re: [PATCH v2 3/3] virtio-blk: Use ioeventfd_attach in start_ioeventfd
On 09.02.24 15:38, Michael Tokarev wrote: 02.02.2024 18:31, Hanna Czenczek : Commit d3f6f294aeadd5f88caf0155e4360808c95b3146 ("virtio-blk: always set ioeventfd during startup") has made virtio_blk_start_ioeventfd() always kick the virtqueue (set the ioeventfd), regardless of whether the BB is drained. That is no longer necessary, because attaching the host notifier will now set the ioeventfd, too; this happens either immediately right here in virtio_blk_start_ioeventfd(), or later when the drain ends, in virtio_blk_ioeventfd_attach(). With event_notifier_set() removed, the code becomes the same as the one in virtio_blk_ioeventfd_attach(), so we can reuse that function. The mentioned commit is v8.2.0-812-gd3f6f294ae, i.e. past 8.2. Is this new change still relevant for stable? Sorry again. :/ This patch is a clean-up patch that won't apply to 8.2. Now, 8.2 does have basically the same logic as described in the patch message (d3f6f294aea restored it after it was broken), so a similar patch could be made for it (removing the event_notifier_set() from virtio_blk_data_plane_start()), but whether we kick the virtqueues once or twice on start-up probably won't make a difference, certainly not in terms of correctness. Hanna
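For reference, such an 8.2-only patch would presumably just drop the kick from virtio_blk_data_plane_start(); a sketch, since the exact surrounding context in 8.2's hw/block/dataplane/virtio-blk.c is not reproduced here (the two removed lines appear verbatim in the master-branch hunk in patch 3/3 below):

-        /* Kick right away to begin processing requests already in vring */
-        event_notifier_set(virtio_queue_get_host_notifier(vq));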
Re: [PATCH 0/2] block: Allow concurrent BB context changes
On 09.02.24 15:08, Michael Tokarev wrote: 02.02.2024 17:47, Hanna Czenczek : Hi, Without the AioContext lock, a BB's context may kind of change at any time (unless it has a root node, and I/O requests are pending). That also means that its own context (BlockBackend.ctx) and that of its root node can differ sometimes (while the context is being changed). How relevant is this for -stable (8.2 at least), which does not have the "scsi: eliminate AioContext lock" patchset, and in particular v8.2.0-124-geaad0fe260 "scsi: only access SCSIDevice->requests from one thread"? The issue the first patch "block-backend: Allow concurrent context changes" fixes (RHEL-19381) seems to be for 8.1.something, so it exists in 8.2 too, and this particular fix applies to 8.2. But with other changes around all this, I'm a bit lost as to what should be done on stable. Not even thinking about 7.2 here :) Ah, sorry, yes. Since we do still have the AioContext lock, this series won't be necessary in -stable. Sorry for the noise! Hanna
Re: [PATCH 0/2] block: Allow concurrent BB context changes
On 06.02.24 17:53, Stefan Hajnoczi wrote: On Fri, Feb 02, 2024 at 03:47:53PM +0100, Hanna Czenczek wrote: Hi, Without the AioContext lock, a BB's context may kind of change at any time (unless it has a root node, and I/O requests are pending). That also means that its own context (BlockBackend.ctx) and that of its root node can differ sometimes (while the context is being changed). blk_get_aio_context() doesn't know this yet and asserts that both are always equal (if there is a root node). Because it's no longer true, and because callers don't seem to really care about the root node's context, we can and should remove the assertion and just return the BB's context. Beyond that, the question is whether the callers of blk_get_aio_context() are OK with the context potentially changing concurrently. Honestly, it isn't entirely clear to me; most look OK, except for the virtio-scsi code, which operates under the general assumption that the BB's context is always equal to that of the virtio-scsi device. I doubt that this assumption always holds (it is definitely not obvious to me that it would), but then again, this series will not make matters worse in that regard, and that is what counts for me now. One clear point of contention is scsi_device_for_each_req_async(), which is addressed by patch 2. Right now, it schedules a BH in the BB context, then the BH double-checks whether the context still fits, and if not, re-schedules itself. Because virtio-scsi's context is fixed, this seems to indicate to me that it wants to be able to deal with a case where BB and virtio-scsi context differ, which seems to break that aforementioned general virtio-scsi assumption. I don't agree with the last sentence: virtio-scsi's context isn't fixed. The AioContext changes when dataplane is started/stopped. virtio-scsi switches AioContext between the IOThread's AioContext and the main loop's qemu_aio_context. However, virtio-scsi virtqueue processing only happens in the IOThread's AioContext. Maybe this is what you meant when you said the AioContext is fixed? Specifically, I meant VirtIOSCSI.ctx, which is set only once in virtio_scsi_dataplane_setup(). That’s at least where the virtqueue notifiers are registered, so yes, virtqueue processing should at least be fixed to that context. It seems like it’s always possible some things are processed in the main thread (not just setup/teardown, but also e.g. TMF_LOGICAL_UNIT_RESET), so to me it seems like virtio-scsi kind of runs in two contexts simultaneously. Yes, when virtqueue processing is paused, all processing in VirtIOSCSI.ctx is stopped, but I wouldn’t say it switches contexts there. It just stops processing some requests. Either way, virtio-scsi request processing doesn’t stop just because a scsi-hd device is hot-plugged or -unplugged. If the BB changes contexts in the hot-unplug path (while vq request processing is continuing in the I/O thread), its context will differ from that of virtio-scsi. So should I just replace the “the context is fixed” wording and say that in this specific instance, virtio-scsi vq processing continues in the I/O thread? The BH function is aware that the current AioContext might not be the same as the AioContext at the time the BH was scheduled. That doesn't break assumptions in the code.
(It may be possible to rewrite virtio-blk, virtio-scsi, and core VirtIODevice ioeventfd code to use the simpler model where the AioContext really is fixed because things have changed significantly over the years, but I looked a few weeks ago and it's difficult work.) I'm just pointing out that I think this description is incomplete. I *do* agree with what this patch series is doing :). Well, this description won’t land in any commit log, so from my side, I’m not too worried about its correctness. O:) Hanna Unfortunately, I definitely have to touch that code, because accepting concurrent changes of AioContexts breaks the double-check (just because the BB has the right context in that place does not mean it stays in that context); instead, we must prevent any concurrent change until the BH is done. Because changing contexts generally involves a drained section, we can prevent it by keeping the BB in-flight counter elevated. Question is, how to reason for that. I’d really rather not include the need to follow the BB context in my argument, because I find that part a bit fishy. Luckily, there’s a second, completely different reason for having scsi_device_for_each_req_async() increment the in-flight counter: Specifically, scsi_device_purge_requests() probably wants to await full completion of scsi_device_for_each_req_async(), and we can do that most easily in the very same way by incrementing the in-flight counter. This way, the blk_drain() in scsi_device_purge_requests() will not only await all (cancelled) I/O requests, but also the non-I/O requests. The fact that this prevents the BB AioContext from changing while the BH is scheduled/running then is just a nice side effect.
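In code, the counter scheme described above boils down to the following pattern (a minimal sketch; the real implementation is patch 2 of this series, "scsi: Await request purging"):

    /* Scheduling side: pin the BB's AioContext, then schedule the BH there */
    blk_inc_in_flight(s->conf.blk);
    aio_bh_schedule_oneshot(blk_get_aio_context(s->conf.blk), bh_fn, data);

    /* BH side (in bh_fn): the context cannot have changed, because changing
     * it requires a drained section, and that drain waits for the in-flight
     * counter to drop */
    assert(blk_get_aio_context(s->conf.blk) == qemu_get_current_aio_context());
    ...
    blk_dec_in_flight(s->conf.blk);  /* lets blk_drain() settle everything */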
Re: [PATCH] virtio-blk: do not use C99 mixed declarations
On 06.02.24 15:04, Stefan Hajnoczi wrote: QEMU's coding style generally forbids C99 mixed declarations. Signed-off-by: Stefan Hajnoczi --- hw/block/virtio-blk.c | 25 ++--- 1 file changed, 14 insertions(+), 11 deletions(-) Reviewed-by: Hanna Czenczek
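For illustration of the style rule (a generic example, not a hunk from this patch):

#include <stdio.h>

/* C99 mixed declaration, generally forbidden by QEMU's coding style: */
void bad(int x)
{
    printf("%d\n", x);
    int y = x + 1;   /* declaration after a statement */
    printf("%d\n", y);
}

/* QEMU style: declarations at the top of the block */
void good(int x)
{
    int y;

    printf("%d\n", x);
    y = x + 1;
    printf("%d\n", y);
}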
Re: [PATCH 5/5] monitor: use aio_co_reschedule_self()
On 05.02.24 18:26, Stefan Hajnoczi wrote: The aio_co_reschedule_self() API is designed to avoid the race condition between scheduling the coroutine in another AioContext and yielding. The QMP dispatch code uses the open-coded version that appears susceptible to the race condition at first glance:

  aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
  qemu_coroutine_yield();

The code is actually safe because the iohandler and qemu_aio_context AioContext run under the Big QEMU Lock. Nevertheless, set a good example and use aio_co_reschedule_self() so it's obvious that there is no race.

Suggested-by: Hanna Reitz
Signed-off-by: Stefan Hajnoczi
---
 qapi/qmp-dispatch.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

Reviewed-by: Hanna Czenczek
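Concretely, the replacement is presumably:

-    aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
-    qemu_coroutine_yield();
+    aio_co_reschedule_self(qemu_get_aio_context());

aio_co_reschedule_self() also returns immediately when the coroutine is already running in the target context, so no extra check is needed at the call site.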
Re: [PATCH 4/5] virtio-blk: declare VirtIOBlock::rq with a type
On 05.02.24 18:26, Stefan Hajnoczi wrote: The VirtIOBlock::rq field has had the type void * since its introduction in commit 869a5c6df19a ("Stop VM on error in virtio-blk. (Gleb Natapov)"). Perhaps this was done to avoid the forward declaration of VirtIOBlockReq. Hanna Czenczek pointed out the missing type. Specify the actual type because there is no need to use void * here. Suggested-by: Hanna Czenczek Signed-off-by: Stefan Hajnoczi --- include/hw/virtio/virtio-blk.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: Hanna Czenczek
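The change itself is presumably a one-liner in include/hw/virtio/virtio-blk.h (a sketch; whether a forward declaration of struct VirtIOBlockReq is also needed depends on the surrounding header, as the commit message hints):

-    void *rq;
+    struct VirtIOBlockReq *rq;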
Re: [PATCH 3/5] virtio-blk: add vq_rq[] bounds check in virtio_blk_dma_restart_cb()
On 05.02.24 18:26, Stefan Hajnoczi wrote: Hanna Czenczek noted that the array index in virtio_blk_dma_restart_cb() is not bounds-checked:

  g_autofree VirtIOBlockReq **vq_rq = g_new0(VirtIOBlockReq *, num_queues);
  ...
  while (rq) {
      VirtIOBlockReq *next = rq->next;
      uint16_t idx = virtio_get_queue_index(rq->vq);

      rq->next = vq_rq[idx];
                 ^^

The code is correct because both rq->vq and vq_rq[] depend on num_queues, but this is indirect and not 100% obvious. Add an assertion.

Suggested-by: Hanna Czenczek
Signed-off-by: Stefan Hajnoczi
---
 hw/block/virtio-blk.c | 1 +
 1 file changed, 1 insertion(+)

Reviewed-by: Hanna Czenczek
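The added assertion presumably sits right before the array access, along these lines (a sketch; the comment wording is an assumption):

      uint16_t idx = virtio_get_queue_index(rq->vq);

+     /* Only num_queues vqs were created, so vq_rq[] bounds are safe */
+     assert(idx < num_queues);
+
      rq->next = vq_rq[idx];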
Re: [PATCH 2/5] virtio-blk: clarify that there is at least 1 virtqueue
On 05.02.24 18:26, Stefan Hajnoczi wrote: It is not possible to instantiate a virtio-blk device with 0 virtqueues. The following check is located in ->realize():

  if (!conf->num_queues) {
      error_setg(errp, "num-queues property must be larger than 0");
      return;
  }

Later on we access s->vq_aio_context[0] under the assumption that there is at least one virtqueue. Hanna Czenczek noted that it would help to show that the array index is already valid. Add an assertion to document that s->vq_aio_context[0] is always safe...and catch future code changes that break this assumption.

Suggested-by: Hanna Czenczek
Signed-off-by: Stefan Hajnoczi
---
 hw/block/virtio-blk.c | 1 +
 1 file changed, 1 insertion(+)

Reviewed-by: Hanna Czenczek
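The assertion is presumably a single line before the first use of the array (a sketch; names follow the surrounding code):

+     assert(s->conf.num_queues > 0);  /* ->realize() rejects num_queues == 0 */
      ... s->vq_aio_context[0] ...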
Re: [PATCH 1/5] virtio-blk: enforce iothread-vq-mapping validation
On 05.02.24 18:26, Stefan Hajnoczi wrote: Hanna Czenczek noticed that the safety of `vq_aio_context[vq->value] = ctx;` with user-defined vq->value inputs is not obvious. The code is structured in validate() + apply() steps so input validation is there, but it happens way earlier and there is nothing that guarantees apply() can only be called with validated inputs. This patch moves the validate() call inside the apply() function so validation is guaranteed. I also added the bounds checking assertion that Hanna suggested.

Signed-off-by: Stefan Hajnoczi
---
 hw/block/virtio-blk.c | 192 +++---
 1 file changed, 107 insertions(+), 85 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 227d83569f..e8b37fd5f4 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c

[...]

@@ -1660,6 +1681,14 @@ static bool virtio_blk_vq_aio_context_init(VirtIOBlock *s, Error **errp)
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
 
+    if (conf->iothread && conf->iothread_vq_mapping_list) {
+        if (conf->iothread) {

This inner condition should probably be dropped. With that done:

Reviewed-by: Hanna Czenczek

+            error_setg(errp, "iothread and iothread-vq-mapping properties "
+                       "cannot be set at the same time");
+            return false;
+        }
+    }
+
     if (conf->iothread || conf->iothread_vq_mapping_list) {
         if (!k->set_guest_notifiers || !k->ioeventfd_assign) {
             error_setg(errp,
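With the redundant inner check dropped as suggested, the hunk would presumably read:

    if (conf->iothread && conf->iothread_vq_mapping_list) {
        error_setg(errp, "iothread and iothread-vq-mapping properties "
                   "cannot be set at the same time");
        return false;
    }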
[PATCH v2 2/3] virtio: Re-enable notifications after drain
During drain, we do not care about virtqueue notifications, which is why we remove the handlers on it. When removing those handlers, whether vq notifications are enabled or not depends on whether we were in polling mode or not; if not, they are enabled (by default); if so, they have been disabled by the io_poll_start callback.

Because we do not care about those notifications after removing the handlers, this is fine. However, we have to explicitly ensure they are enabled when re-attaching the handlers, so we will resume receiving notifications. We do this in virtio_queue_aio_attach_host_notifier*(). If such a function is called while we are in a polling section, attaching the notifiers will then invoke the io_poll_start callback, re-disabling notifications.

Because we will always miss virtqueue updates in the drained section, we also need to poll the virtqueue once after attaching the notifiers.

Buglink: https://issues.redhat.com/browse/RHEL-3934
Signed-off-by: Hanna Czenczek
---
 include/block/aio.h |  7 ++-
 hw/virtio/virtio.c  | 42 ++
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 5d0a114988..8378553eb9 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -480,9 +480,14 @@ void aio_set_event_notifier(AioContext *ctx,
                             AioPollFn *io_poll,
                             EventNotifierHandler *io_poll_ready);
 
-/* Set polling begin/end callbacks for an event notifier that has already been
+/*
+ * Set polling begin/end callbacks for an event notifier that has already been
  * registered with aio_set_event_notifier. Do nothing if the event notifier is
  * not registered.
+ *
+ * Note that if the io_poll_end() callback (or the entire notifier) is removed
+ * during polling, it will not be called, so an io_poll_begin() is not
+ * necessarily always followed by an io_poll_end().
  */
 void aio_set_event_notifier_poll(AioContext *ctx,
                                  EventNotifier *notifier,
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 7549094154..d229755eae 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3556,6 +3556,17 @@ static void virtio_queue_host_notifier_aio_poll_end(EventNotifier *n)
 
 void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
+    /*
+     * virtio_queue_aio_detach_host_notifier() can leave notifications disabled.
+     * Re-enable them. (And if detach has not been used before, notifications
+     * being enabled is still the default state while a notifier is attached;
+     * see virtio_queue_host_notifier_aio_poll_end(), which will always leave
+     * notifications enabled once the polling section is left.)
+     */
+    if (!virtio_queue_get_notification(vq)) {
+        virtio_queue_set_notification(vq, 1);
+    }
+
     aio_set_event_notifier(ctx, &vq->host_notifier,
                            virtio_queue_host_notifier_read,
                            virtio_queue_host_notifier_aio_poll,
@@ -3563,6 +3574,13 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
     aio_set_event_notifier_poll(ctx, &vq->host_notifier,
                                 virtio_queue_host_notifier_aio_poll_begin,
                                 virtio_queue_host_notifier_aio_poll_end);
+
+    /*
+     * We will have ignored notifications about new requests from the guest
+     * while no notifiers were attached, so "kick" the virt queue to process
+     * those requests now.
+     */
+    event_notifier_set(&vq->host_notifier);
 }
 
 /*
@@ -3573,14 +3591,38 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
  */
 void virtio_queue_aio_attach_host_notifier_no_poll(VirtQueue *vq, AioContext *ctx)
 {
+    /* See virtio_queue_aio_attach_host_notifier() */
+    if (!virtio_queue_get_notification(vq)) {
+        virtio_queue_set_notification(vq, 1);
+    }
+
     aio_set_event_notifier(ctx, &vq->host_notifier,
                            virtio_queue_host_notifier_read,
                            NULL, NULL);
+
+    /*
+     * See virtio_queue_aio_attach_host_notifier().
+     * Note that this may be unnecessary for the type of virtqueues this
+     * function is used for. Still, it will not hurt to have a quick look into
+     * whether we can/should process any of the virtqueue elements.
+     */
+    event_notifier_set(&vq->host_notifier);
 }
 
 void virtio_queue_aio_detach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
     aio_set_event_notifier(ctx, &vq->host_notifier, NULL, NULL, NULL);
+
+    /*
+     * aio_set_event_notifier_poll() does not guarantee whether io_poll_end()
+     * will run after io_poll_begin(), so by removing the notifier, we do not
+     * know whether virtio_queue_host_notifier_aio_poll_end() has run after a
+     * previous virtio_queue_host_notifier_aio_poll_begin(), i.e. whether
+     * notifications are enabled or disabled.
[PATCH v2 1/3] virtio-scsi: Attach event vq notifier with no_poll
As of commit 38738f7dbbda90fbc161757b7f4be35b52205552 ("virtio-scsi: don't waste CPU polling the event virtqueue"), we only attach an io_read notifier for the virtio-scsi event virtqueue instead, and no polling notifiers. During operation, the event virtqueue is typically non-empty, but none of the buffers are intended to be used immediately. Instead, they only get used when certain events occur. Therefore, it makes no sense to continuously poll it when non-empty, because it is supposed to be and stay non-empty.

We do this by using virtio_queue_aio_attach_host_notifier_no_poll() instead of virtio_queue_aio_attach_host_notifier() for the event virtqueue.

Commit 766aa2de0f29b657148e04599320d771c36fd126 ("virtio-scsi: implement BlockDevOps->drained_begin()") however has virtio_scsi_drained_end() use virtio_queue_aio_attach_host_notifier() for all virtqueues, including the event virtqueue. This can lead to it being polled again, undoing the benefit of commit 38738f7dbbda90fbc161757b7f4be35b52205552.

Fix it by using virtio_queue_aio_attach_host_notifier_no_poll() for the event virtqueue.

Reported-by: Fiona Ebner
Fixes: 766aa2de0f29b657148e04599320d771c36fd126 ("virtio-scsi: implement BlockDevOps->drained_begin()")
Reviewed-by: Stefan Hajnoczi
Tested-by: Fiona Ebner
Reviewed-by: Fiona Ebner
Signed-off-by: Hanna Czenczek
---
 hw/scsi/virtio-scsi.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 690aceec45..9f02ceea09 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -1149,6 +1149,7 @@ static void virtio_scsi_drained_begin(SCSIBus *bus)
 static void virtio_scsi_drained_end(SCSIBus *bus)
 {
     VirtIOSCSI *s = container_of(bus, VirtIOSCSI, bus);
+    VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
     VirtIODevice *vdev = VIRTIO_DEVICE(s);
     uint32_t total_queues = VIRTIO_SCSI_VQ_NUM_FIXED +
                             s->parent_obj.conf.num_queues;
@@ -1166,7 +1167,11 @@ static void virtio_scsi_drained_end(SCSIBus *bus)
 
     for (uint32_t i = 0; i < total_queues; i++) {
         VirtQueue *vq = virtio_get_queue(vdev, i);
-        virtio_queue_aio_attach_host_notifier(vq, s->ctx);
+        if (vq == vs->event_vq) {
+            virtio_queue_aio_attach_host_notifier_no_poll(vq, s->ctx);
+        } else {
+            virtio_queue_aio_attach_host_notifier(vq, s->ctx);
+        }
     }
 }

-- 
2.43.0
[PATCH v2 3/3] virtio-blk: Use ioeventfd_attach in start_ioeventfd
Commit d3f6f294aeadd5f88caf0155e4360808c95b3146 ("virtio-blk: always set ioeventfd during startup") has made virtio_blk_start_ioeventfd() always kick the virtqueue (set the ioeventfd), regardless of whether the BB is drained. That is no longer necessary, because attaching the host notifier will now set the ioeventfd, too; this happens either immediately right here in virtio_blk_start_ioeventfd(), or later when the drain ends, in virtio_blk_ioeventfd_attach().

With event_notifier_set() removed, the code becomes the same as the one in virtio_blk_ioeventfd_attach(), so we can reuse that function.

Signed-off-by: Hanna Czenczek
---
 hw/block/virtio-blk.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 227d83569f..22b8eef69b 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -37,6 +37,8 @@
 #include "hw/virtio/virtio-blk-common.h"
 #include "qemu/coroutine.h"
 
+static void virtio_blk_ioeventfd_attach(VirtIOBlock *s);
+
 static void virtio_blk_init_request(VirtIOBlock *s, VirtQueue *vq,
                                     VirtIOBlockReq *req)
 {
@@ -1808,17 +1810,14 @@ static int virtio_blk_start_ioeventfd(VirtIODevice *vdev)
     s->ioeventfd_started = true;
     smp_wmb(); /* paired with aio_notify_accept() on the read side */
 
-    /* Get this show started by hooking up our callbacks */
-    for (i = 0; i < nvqs; i++) {
-        VirtQueue *vq = virtio_get_queue(vdev, i);
-        AioContext *ctx = s->vq_aio_context[i];
-
-        /* Kick right away to begin processing requests already in vring */
-        event_notifier_set(virtio_queue_get_host_notifier(vq));
-
-        if (!blk_in_drain(s->conf.conf.blk)) {
-            virtio_queue_aio_attach_host_notifier(vq, ctx);
-        }
+    /*
+     * Get this show started by hooking up our callbacks. If drained now,
+     * virtio_blk_drained_end() will do this later.
+     * Attaching the notifier also kicks the virtqueues, processing any requests
+     * they may already have.
+     */
+    if (!blk_in_drain(s->conf.conf.blk)) {
+        virtio_blk_ioeventfd_attach(s);
     }
 
     return 0;

-- 
2.43.0
[PATCH v2 0/3] virtio: Re-enable notifications after drain
v1: https://lists.nongnu.org/archive/html/qemu-block/2024-01/msg00336.html

Hi,

This is basically the same series as v1: When using aio_set_event_notifier_poll(), the io_poll_end() callback is only invoked when polling ends, not when the notifier is being removed while in a polling section. This can leave the virtqueue notifier disabled during drained sections, which however is not a bad thing. We just need to ensure they are re-enabled after the drain, and kick the virtqueue once to pick up all the requests that came in during the drained section.

Patch 1 is a technically unrelated fix, but addresses a problem that became visible with patch 2 applied. Patch 3 is a small (optional) clean-up patch.

v2:
- Changed the title of this series and patch 2 (was: "Keep notifications disabled during drain"): Keeping the notifier disabled was something the initial RFC did, this version (v1 too) just ensures the notifier is enabled after the drain, regardless of its state before.
- Use event_notifier_set() instead of virtio_queue_notify() in patch 2
- Added patch 3

Hanna Czenczek (3):
  virtio-scsi: Attach event vq notifier with no_poll
  virtio: Re-enable notifications after drain
  virtio-blk: Use ioeventfd_attach in start_ioeventfd

 include/block/aio.h   |  7 ++-
 hw/block/virtio-blk.c | 21 ++---
 hw/scsi/virtio-scsi.c |  7 ++-
 hw/virtio/virtio.c    | 42 ++
 4 files changed, 64 insertions(+), 13 deletions(-)

-- 
2.43.0
[PATCH 2/2] scsi: Await request purging
scsi_device_for_each_req_async() currently does not provide any way to be awaited. One of its callers is scsi_device_purge_requests(), which therefore currently does not guarantee that all requests are fully settled when it returns.

We want all requests to be settled, because scsi_device_purge_requests() is called through the unrealize path, including the one invoked by virtio_scsi_hotunplug() through qdev_simple_device_unplug_cb(), which most likely assumes that all SCSI requests are done then.

In fact, scsi_device_purge_requests() already contains a blk_drain(), but this will not fully await scsi_device_for_each_req_async(), only the I/O requests it potentially cancels (not the non-I/O requests). However, we can have scsi_device_for_each_req_async() increment the BB in-flight counter, and have scsi_device_for_each_req_async_bh() decrement it when it is done. This way, the blk_drain() will fully await all SCSI requests to be purged.

This also removes the need for scsi_device_for_each_req_async_bh() to double-check the current context and potentially re-schedule itself, should it now differ from the BB's context: Changing a BB's AioContext with a root node is done through bdrv_try_change_aio_context(), which creates a drained section. With this patch, we keep the BB in-flight counter elevated throughout, so we know the BB's context cannot change.

Signed-off-by: Hanna Czenczek
---
 hw/scsi/scsi-bus.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 0a2eb11c56..230313022c 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -120,17 +120,13 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
     SCSIRequest *next;
 
     /*
-     * If the AioContext changed before this BH was called then reschedule into
-     * the new AioContext before accessing ->requests. This can happen when
-     * scsi_device_for_each_req_async() is called and then the AioContext is
-     * changed before BHs are run.
+     * The BB cannot have changed contexts between this BH being scheduled and
+     * now: BBs' AioContexts, when they have a node attached, can only be
+     * changed via bdrv_try_change_aio_context(), in a drained section. While
+     * we have the in-flight counter incremented, that drain must block.
      */
     ctx = blk_get_aio_context(s->conf.blk);
-    if (ctx != qemu_get_current_aio_context()) {
-        aio_bh_schedule_oneshot(ctx, scsi_device_for_each_req_async_bh,
-                                g_steal_pointer(&data));
-        return;
-    }
+    assert(ctx == qemu_get_current_aio_context());
 
     QTAILQ_FOREACH_SAFE(req, &s->requests, next, next) {
         data->fn(req, data->fn_opaque);
@@ -138,11 +134,16 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
 
     /* Drop the reference taken by scsi_device_for_each_req_async() */
     object_unref(OBJECT(s));
+
+    /* Paired with blk_inc_in_flight() in scsi_device_for_each_req_async() */
+    blk_dec_in_flight(s->conf.blk);
 }
 
 /*
  * Schedule @fn() to be invoked for each enqueued request in device @s. @fn()
  * runs in the AioContext that is executing the request.
+ * Keeps the BlockBackend's in-flight counter incremented until everything is
+ * done, so draining it will settle all scheduled @fn() calls.
  */
 static void scsi_device_for_each_req_async(SCSIDevice *s,
                                            void (*fn)(SCSIRequest *, void *),
@@ -163,6 +164,8 @@ static void scsi_device_for_each_req_async(SCSIDevice *s,
      */
     object_ref(OBJECT(s));
 
+    /* Paired with blk_dec_in_flight() in scsi_device_for_each_req_async_bh() */
+    blk_inc_in_flight(s->conf.blk);
     aio_bh_schedule_oneshot(blk_get_aio_context(s->conf.blk),
                             scsi_device_for_each_req_async_bh,
                             data);
@@ -1728,11 +1731,20 @@ static void scsi_device_purge_one_req(SCSIRequest *req, void *opaque)
     scsi_req_cancel_async(req, NULL);
 }
 
+/**
+ * Cancel all requests, and block until they are deleted.
+ */
 void scsi_device_purge_requests(SCSIDevice *sdev, SCSISense sense)
 {
     scsi_device_for_each_req_async(sdev, scsi_device_purge_one_req, NULL);
 
+    /*
+     * Await all the scsi_device_purge_one_req() calls scheduled by
+     * scsi_device_for_each_req_async(), and all I/O requests that were
+     * cancelled this way, but may still take a bit of time to settle.
+     */
     blk_drain(sdev->conf.blk);
+
     scsi_device_set_ua(sdev, sense);
 }

-- 
2.43.0
[PATCH 1/2] block-backend: Allow concurrent context changes
Since AioContext locks have been removed, a BlockBackend's AioContext may really change at any time (only exception is that it is often confined to a drained section, as noted in this patch). Therefore, blk_get_aio_context() cannot rely on its root node's context always matching that of the BlockBackend.

In practice, whether they match does not matter anymore anyway: Requests can be sent to BDSs from any context, so anyone who requests the BB's context should have no reason to require the root node to have the same context. Therefore, we can and should remove the assertion to that effect.

In addition, because the context can be set and queried from different threads concurrently, it has to be accessed with atomic operations.

Buglink: https://issues.redhat.com/browse/RHEL-19381
Suggested-by: Kevin Wolf
Signed-off-by: Hanna Czenczek
---
 block/block-backend.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 209eb07528..9c4de79e6b 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -44,7 +44,7 @@ struct BlockBackend {
     char *name;
     int refcnt;
     BdrvChild *root;
-    AioContext *ctx;
+    AioContext *ctx; /* access with atomic operations only */
     DriveInfo *legacy_dinfo;    /* null unless created by drive_new() */
     QTAILQ_ENTRY(BlockBackend) link;         /* for block_backends */
     QTAILQ_ENTRY(BlockBackend) monitor_link; /* for monitor_block_backends */
@@ -2414,22 +2414,22 @@ void blk_op_unblock_all(BlockBackend *blk, Error *reason)
     }
 }
 
+/**
+ * Return BB's current AioContext. Note that this context may change
+ * concurrently at any time, with one exception: If the BB has a root node
+ * attached, its context will only change through bdrv_try_change_aio_context(),
+ * which creates a drained section. Therefore, incrementing such a BB's
+ * in-flight counter will prevent its context from changing.
+ */
 AioContext *blk_get_aio_context(BlockBackend *blk)
 {
-    BlockDriverState *bs;
     IO_CODE();
 
     if (!blk) {
         return qemu_get_aio_context();
     }
 
-    bs = blk_bs(blk);
-    if (bs) {
-        AioContext *ctx = bdrv_get_aio_context(blk_bs(blk));
-        assert(ctx == blk->ctx);
-    }
-
-    return blk->ctx;
+    return qatomic_read(&blk->ctx);
 }
 
 int blk_set_aio_context(BlockBackend *blk, AioContext *new_context,
@@ -2442,7 +2442,7 @@ int blk_set_aio_context(BlockBackend *blk, AioContext *new_context,
     GLOBAL_STATE_CODE();
 
     if (!bs) {
-        blk->ctx = new_context;
+        qatomic_set(&blk->ctx, new_context);
         return 0;
     }
 
@@ -2471,7 +2471,7 @@ static void blk_root_set_aio_ctx_commit(void *opaque)
     AioContext *new_context = s->new_ctx;
     ThrottleGroupMember *tgm = &blk->public.throttle_group_member;
 
-    blk->ctx = new_context;
+    qatomic_set(&blk->ctx, new_context);
 
     if (tgm->throttle_state) {
         throttle_group_detach_aio_context(tgm);
         throttle_group_attach_aio_context(tgm, new_context);

-- 
2.43.0
[PATCH 0/2] block: Allow concurrent BB context changes
Hi,

Without the AioContext lock, a BB's context may kind of change at any time (unless it has a root node, and I/O requests are pending). That also means that its own context (BlockBackend.ctx) and that of its root node can differ sometimes (while the context is being changed).

blk_get_aio_context() doesn't know this yet and asserts that both are always equal (if there is a root node). Because it's no longer true, and because callers don't seem to really care about the root node's context, we can and should remove the assertion and just return the BB's context.

Beyond that, the question is whether the callers of blk_get_aio_context() are OK with the context potentially changing concurrently. Honestly, it isn't entirely clear to me; most look OK, except for the virtio-scsi code, which operates under the general assumption that the BB's context is always equal to that of the virtio-scsi device. I doubt that this assumption always holds (it is definitely not obvious to me that it would), but then again, this series will not make matters worse in that regard, and that is what counts for me now.

One clear point of contention is scsi_device_for_each_req_async(), which is addressed by patch 2. Right now, it schedules a BH in the BB context, then the BH double-checks whether the context still fits, and if not, re-schedules itself. Because virtio-scsi's context is fixed, this seems to indicate to me that it wants to be able to deal with a case where BB and virtio-scsi context differ, which seems to break that aforementioned general virtio-scsi assumption.

Unfortunately, I definitely have to touch that code, because accepting concurrent changes of AioContexts breaks the double-check (just because the BB has the right context in that place does not mean it stays in that context); instead, we must prevent any concurrent change until the BH is done. Because changing contexts generally involves a drained section, we can prevent it by keeping the BB in-flight counter elevated.

Question is, how to reason for that. I’d really rather not include the need to follow the BB context in my argument, because I find that part a bit fishy. Luckily, there’s a second, completely different reason for having scsi_device_for_each_req_async() increment the in-flight counter: Specifically, scsi_device_purge_requests() probably wants to await full completion of scsi_device_for_each_req_async(), and we can do that most easily in the very same way by incrementing the in-flight counter. This way, the blk_drain() in scsi_device_purge_requests() will not only await all (cancelled) I/O requests, but also the non-I/O requests. The fact that this prevents the BB AioContext from changing while the BH is scheduled/running then is just a nice side effect.

Hanna Czenczek (2):
  block-backend: Allow concurrent context changes
  scsi: Await request purging

 block/block-backend.c | 22 +++---
 hw/scsi/scsi-bus.c    | 30 +-
 2 files changed, 32 insertions(+), 20 deletions(-)

-- 
2.43.0
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 01.02.24 16:25, Hanna Czenczek wrote: On 01.02.24 15:28, Stefan Hajnoczi wrote: [...] Did you find a scenario where the virtio-scsi AioContext is different from the scsi-hd BB's AioContext? Technically, that’s the reason for this thread, specifically that virtio_scsi_hotunplug() switches the BB back to the main context while scsi_device_for_each_req_async_bh() is running. Yes, we can fix that specific case via the in-flight counter, but I’m wondering whether there’s really any merit in requiring the BB to always be in virtio-scsi’s context, or whether it would make more sense to schedule everything in virtio-scsi’s context. Now that BBs/BDSs can receive requests from any context, that is. Now that I know that wouldn’t be easy, let me turn this around: As far as I understand, scsi_device_for_each_req_async_bh() should still run in virtio-scsi’s context, but that’s hard, so we take the BB’s context, which we therefore require to be the same one. Further, (again AFAIU,) virtio-scsi’s context cannot change (only set in virtio_scsi_dataplane_setup(), which is run in virtio_scsi_device_realize()). Therefore, why does the scsi_device_for_each_req_async() code accommodate BB context changes? Hanna
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 01.02.24 16:25, Hanna Czenczek wrote: [...] It just seems simpler to me to not rely on the BB's context at all. Hm, I now see the problem is that the processing (and scheduling) is largely done in generic SCSI code, which doesn’t have access to virtio-scsi’s context, only to that of the BB. That makes my idea quite impossible. :/
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 01.02.24 15:28, Stefan Hajnoczi wrote: On Thu, Feb 01, 2024 at 03:10:12PM +0100, Hanna Czenczek wrote: On 31.01.24 21:35, Stefan Hajnoczi wrote: On Fri, Jan 26, 2024 at 04:24:49PM +0100, Hanna Czenczek wrote: On 26.01.24 14:18, Kevin Wolf wrote: Am 25.01.2024 um 18:32 hat Hanna Czenczek geschrieben: On 23.01.24 18:10, Kevin Wolf wrote: Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben: On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID: <20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd: Note: We (on issues.redhat.com) have a separate report that seems to be concerning this very problem: https://issues.redhat.com/browse/RHEL-19381 {"execute":"device_del","arguments":{"id":"stg0"}} {"return": {}} qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed. [...] I don’t know anything about the problem yet, but as usual, I like speculation and discovering how wrong I was later on, so one thing I came across that’s funny about virtio-scsi is that requests can happen even while a disk is being attached or detached. That is, Linux seems to probe all LUNs when a new virtio-scsi device is being attached, and it won’t stop just because a disk is being attached or removed. So maybe that’s part of the problem, that we get a request while the BB is being detached, and temporarily in an inconsistent state (BDS context differs from BB context). I don't know anything about the problem either, but since you already speculated about the cause, let me speculate about the solution: Can we hold the graph writer lock for the tran_commit() call in bdrv_try_change_aio_context()? And of course take the reader lock for blk_get_aio_context(), but that should be completely unproblematic. Actually, now that completely unproblematic part is giving me trouble. I wanted to just put a graph lock into blk_get_aio_context() (making it a coroutine with a wrapper) Which is the first thing I neglected and already not great. We have calls of blk_get_aio_context() in the SCSI I/O path, and creating a coroutine and doing at least two context switches simply for this call is a lot of overhead... but callers of blk_get_aio_context() generally assume the context is going to stay the BB’s context for as long as their AioContext * variable is in scope. I'm not so sure about that. And taking another step back, I'm actually also not sure how much it still matters now that they can submit I/O from any thread. That’s my impression, too, but “not sure” doesn’t feel great.
:) scsi_device_for_each_req_async_bh() specifically double-checks whether it’s still in the right context before invoking the specified function, so it seems there was some intention to continue to run in the context associated with the BB. (Not judging whether that intent makes sense or not, yet.) Maybe the correct solution is to remove the assertion from blk_get_aio_context() and just always return blk->ctx. If it's in the middle of a change, you'll either get the old one or the new one. Either one is fine to submit I/O from, and if you care about changes for other reasons (like SCSI does), then you need explicit code to protect it anyway (which SCSI apparently has, but it doesn't work). I think most callers do just assume the BB stays in the context they got (without any proof, admittedly), but I agree that under re-evaluation, it probably doesn’t actually matter to them, really. And yes, basically, if the caller doesn’t need to take a lock because it doesn’t really matter whether blk->ctx changes while it’s still using the old value, blk_get_aio_context() in turn doesn’t need to double-check blk->ctx against the root node’s context either, and nobody needs a lock. So I agree, it’s on the caller to protect against a potentially changing context, blk_get_aio_context() should just return blk->ctx and not check against the root node.
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 31.01.24 21:35, Stefan Hajnoczi wrote: On Fri, Jan 26, 2024 at 04:24:49PM +0100, Hanna Czenczek wrote: On 26.01.24 14:18, Kevin Wolf wrote: Am 25.01.2024 um 18:32 hat Hanna Czenczek geschrieben: On 23.01.24 18:10, Kevin Wolf wrote: Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben: On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID: <20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd: Note: We (on issues.redhat.com) have a separate report that seems to be concerning this very problem: https://issues.redhat.com/browse/RHEL-19381 {"execute":"device_del","arguments":{"id":"stg0"}} {"return": {}} qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed. [...] I don’t know anything about the problem yet, but as usual, I like speculation and discovering how wrong I was later on, so one thing I came across that’s funny about virtio-scsi is that requests can happen even while a disk is being attached or detached. That is, Linux seems to probe all LUNs when a new virtio-scsi device is being attached, and it won’t stop just because a disk is being attached or removed. So maybe that’s part of the problem, that we get a request while the BB is being detached, and temporarily in an inconsistent state (BDS context differs from BB context). I don't know anything about the problem either, but since you already speculated about the cause, let me speculate about the solution: Can we hold the graph writer lock for the tran_commit() call in bdrv_try_change_aio_context()? And of course take the reader lock for blk_get_aio_context(), but that should be completely unproblematic. Actually, now that completely unproblematic part is giving me trouble. I wanted to just put a graph lock into blk_get_aio_context() (making it a coroutine with a wrapper) Which is the first thing I neglected and already not great. We have calls of blk_get_aio_context() in the SCSI I/O path, and creating a coroutine and doing at least two context switches simply for this call is a lot of overhead... but callers of blk_get_aio_context() generally assume the context is going to stay the BB’s context for as long as their AioContext * variable is in scope. I'm not so sure about that. And taking another step back, I'm actually also not sure how much it still matters now that they can submit I/O from any thread. That’s my impression, too, but “not sure” doesn’t feel great.
:) scsi_device_for_each_req_async_bh() specifically double-checks whether it’s still in the right context before invoking the specified function, so it seems there was some intention to continue to run in the context associated with the BB. (Not judging whether that intent makes sense or not, yet.) Maybe the correct solution is to remove the assertion from blk_get_aio_context() and just always return blk->ctx. If it's in the middle of a change, you'll either get the old one or the new one. Either one is fine to submit I/O from, and if you care about changes for other reasons (like SCSI does), then you need explicit code to protect it anyway (which SCSI apparently has, but it doesn't work). I think most callers do just assume the BB stays in the context they got (without any proof, admittedly), but I agree that under re-evaluation, it probably doesn’t actually matter to them, really. And yes, basically, if the caller doesn’t need to take a lock because it doesn’t really matter whether blk->ctx changes while it’s still using the old value, blk_get_aio_context() in turn doesn’t need to double-check blk->ctx against the root node’s context either, and nobody needs a lock. So I agree, it’s on the caller to protect against a potentially changing context, blk_get_aio_context() should just return blk->ctx and not check against the root node. (On a tangent: blk_drain() is a caller of blk_get_aio_context(), and it polls that context. Why does it need to
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 01.02.24 11:21, Kevin Wolf wrote: Am 01.02.2024 um 10:43 hat Hanna Czenczek geschrieben: On 31.01.24 11:17, Kevin Wolf wrote: Am 29.01.2024 um 17:30 hat Hanna Czenczek geschrieben: I don’t like using drain as a form of lock specifically against AioContext changes, but maybe Stefan is right, and we should use it in this specific case to get just the single problem fixed. (Though it’s not quite trivial either. We’d probably still want to remove the assertion from blk_get_aio_context(), so we don’t have to require all of its callers to hold a count in the in-flight counter.) Okay, fair, maybe fixing the specific problem is more important than solving the more generic blk_get_aio_context() race. In this case, wouldn't it be enough to increase the in-flight counter so that the drain before switching AioContexts would run the BH before anything bad can happen? Does the following work? Yes, that’s what I had in mind (Stefan, too, I think), and in testing, it looks good. Oh, sorry, I completely misunderstood then. I thought you were talking about adding a new drained section somewhere and that sounded a bit more complicated. :-) If it works, let's do this. Would you like to pick this up and send it as a formal patch (possibly in a more polished form), or should I do that? Sure, I can do it. Hanna
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 31.01.24 11:17, Kevin Wolf wrote: Am 29.01.2024 um 17:30 hat Hanna Czenczek geschrieben: I don’t like using drain as a form of lock specifically against AioContext changes, but maybe Stefan is right, and we should use it in this specific case to get just the single problem fixed. (Though it’s not quite trivial either. We’d probably still want to remove the assertion from blk_get_aio_context(), so we don’t have to require all of its callers to hold a count in the in-flight counter.) Okay, fair, maybe fixing the specific problem is more important than solving the more generic blk_get_aio_context() race. In this case, wouldn't it be enough to increase the in-flight counter so that the drain before switching AioContexts would run the BH before anything bad can happen? Does the following work? Yes, that’s what I had in mind (Stefan, too, I think), and in testing, it looks good. Hanna Kevin

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 0a2eb11c56..dc09eb8024 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -120,17 +120,11 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
     SCSIRequest *next;
 
     /*
-     * If the AioContext changed before this BH was called then reschedule into
-     * the new AioContext before accessing ->requests. This can happen when
-     * scsi_device_for_each_req_async() is called and then the AioContext is
-     * changed before BHs are run.
+     * The AioContext can't have changed because we increased the in-flight
+     * counter for s->conf.blk.
      */
     ctx = blk_get_aio_context(s->conf.blk);
-    if (ctx != qemu_get_current_aio_context()) {
-        aio_bh_schedule_oneshot(ctx, scsi_device_for_each_req_async_bh,
-                                g_steal_pointer(&data));
-        return;
-    }
+    assert(ctx == qemu_get_current_aio_context());
 
     QTAILQ_FOREACH_SAFE(req, &s->requests, next, next) {
         data->fn(req, data->fn_opaque);
@@ -138,6 +132,7 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
 
     /* Drop the reference taken by scsi_device_for_each_req_async() */
     object_unref(OBJECT(s));
+    blk_dec_in_flight(s->conf.blk);
 }
 
 /*
@@ -163,6 +158,7 @@ static void scsi_device_for_each_req_async(SCSIDevice *s,
      */
     object_ref(OBJECT(s));
 
+    blk_inc_in_flight(s->conf.blk);
     aio_bh_schedule_oneshot(blk_get_aio_context(s->conf.blk),
                             scsi_device_for_each_req_async_bh,
                             data);
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 23.01.24 18:10, Kevin Wolf wrote: Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben: On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID: <20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd:

{"execute":"device_del","arguments":{"id":"stg0"}}
{"return": {}}
qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed.

(gdb) bt
#0  0x7f32a668d83c in () at /usr/lib/libc.so.6
#1  0x7f32a663d668 in raise () at /usr/lib/libc.so.6
#2  0x7f32a66254b8 in abort () at /usr/lib/libc.so.6
#3  0x7f32a66253dc in () at /usr/lib/libc.so.6
#4  0x7f32a6635d26 in () at /usr/lib/libc.so.6
#5  0x556e6b4880a4 in blk_get_aio_context (blk=0x556e6e89ccf0) at ../block/block-backend.c:2429
#6  blk_get_aio_context (blk=0x556e6e89ccf0) at ../block/block-backend.c:2417
#7  0x556e6b112d87 in scsi_device_for_each_req_async_bh (opaque=0x556e6e2c6d10) at ../hw/scsi/scsi-bus.c:128
#8  0x556e6b5d1966 in aio_bh_poll (ctx=ctx@entry=0x556e6d8aa290) at ../util/async.c:218
#9  0x556e6b5bb16a in aio_poll (ctx=0x556e6d8aa290, blocking=blocking@entry=true) at ../util/aio-posix.c:722
#10 0x556e6b4564b6 in iothread_run (opaque=opaque@entry=0x556e6d89d920) at ../iothread.c:63
#11 0x556e6b5bde58 in qemu_thread_start (args=0x556e6d8aa9b0) at ../util/qemu-thread-posix.c:541
#12 0x7f32a668b9eb in () at /usr/lib/libc.so.6
#13 0x7f32a670f7cc in () at /usr/lib/libc.so.6

I don’t know anything about the problem yet, but as usual, I like speculation and discovering how wrong I was later on, so one thing I came across that’s funny about virtio-scsi is that requests can happen even while a disk is being attached or detached. That is, Linux seems to probe all LUNs when a new virtio-scsi device is being attached, and it won’t stop just because a disk is being attached or removed. So maybe that’s part of the problem, that we get a request while the BB is being detached, and temporarily in an inconsistent state (BDS context differs from BB context). I don't know anything about the problem either, but since you already speculated about the cause, let me speculate about the solution: Can we hold the graph writer lock for the tran_commit() call in bdrv_try_change_aio_context()? I tried removing the drain so that all of bdrv_try_change_aio_context() could require GRAPH_WRLOCK, but that broke tests/unit/test-block-iothread, because without draining, block jobs would need to switch AioContexts while running, and job_set_aio_context() doesn’t like that.
Similarly to blk_get_aio_context(), I assume we can in theory just drop the assertion there and change the context while the job is running, because then the job can just change AioContexts on the next pause point (and in the meantime send requests from the old context, which is fine), but this does get quite murky. (One rather theoretical (but present) problem is that test-block-iothread itself contains some assertions in the job that its AioContext is actually the one it's running in, but this assertion would no longer necessarily hold true.) I don’t like using drain as a form of lock specifically against AioContext changes, but maybe Stefan is right, and we should use it in this specific case to get just the single problem fixed. (Though it’s not quite trivial either. We’d probably still want to remove the assertion from blk_get_aio_context(), so we don’t have to require all of its callers to hold a count in the in-flight counter.) Hanna
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 26.01.24 14:18, Kevin Wolf wrote: Am 25.01.2024 um 18:32 hat Hanna Czenczek geschrieben: On 23.01.24 18:10, Kevin Wolf wrote: Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben: On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID: <20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd: Note: We (on issues.redhat.com) have a separate report that seems to be concerning this very problem: https://issues.redhat.com/browse/RHEL-19381 {"execute":"device_del","arguments":{"id":"stg0"}} {"return": {}} qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed. [...] I don’t know anything about the problem yet, but as usual, I like speculation and discovering how wrong I was later on, so one thing I came across that’s funny about virtio-scsi is that requests can happen even while a disk is being attached or detached. That is, Linux seems to probe all LUNs when a new virtio-scsi device is being attached, and it won’t stop just because a disk is being attached or removed. So maybe that’s part of the problem, that we get a request while the BB is being detached, and temporarily in an inconsistent state (BDS context differs from BB context). I don't know anything about the problem either, but since you already speculated about the cause, let me speculate about the solution: Can we hold the graph writer lock for the tran_commit() call in bdrv_try_change_aio_context()? And of course take the reader lock for blk_get_aio_context(), but that should be completely unproblematic. Actually, now that completely unproblematic part is giving me trouble. I wanted to just put a graph lock into blk_get_aio_context() (making it a coroutine with a wrapper) Which is the first thing I neglected and already not great. We have calls of blk_get_aio_context() in the SCSI I/O path, and creating a coroutine and doing at least two context switches simply for this call is a lot of overhead... but callers of blk_get_aio_context() generally assume the context is going to stay the BB’s context for as long as their AioContext * variable is in scope. I'm not so sure about that. And taking another step back, I'm actually also not sure how much it still matters now that they can submit I/O from any thread. That’s my impression, too, but “not sure” doesn’t feel great.
:) scsi_device_for_each_req_async_bh() specifically double-checks whether it’s still in the right context before invoking the specified function, so it seems there was some intention to continue to run in the context associated with the BB. (Not judging whether that intent makes sense or not, yet.) Maybe the correct solution is to remove the assertion from blk_get_aio_context() and just always return blk->ctx. If it's in the middle of a change, you'll either get the old one or the new one. Either one is fine to submit I/O from, and if you care about changes for other reasons (like SCSI does), then you need explicit code to protect it anyway (which SCSI apparently has, but it doesn't work). I think most callers do just assume the BB stays in the context they got (without any proof, admittedly), but I agree that under re-evaluation, it probably doesn’t actually matter to them, really. And yes, basically, if the caller doesn’t need to take a lock because it doesn’t really matter whether blk->ctx changes while it’s still using the old value, blk_get_aio_context() in turn doesn’t need to double-check blk->ctx against the root node’s context either, and nobody needs a lock. So I agree, it’s on the caller to protect against a potentially changing context, blk_get_aio_context() should just return blk->ctx and not check against the root node. (On a tangent: blk_drain() is a caller of blk_get_aio_context(), and it polls that context. Why does it need to
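Concretely, the accessor being proposed here would shrink to something like this (a sketch, assuming the root-node comparison and the assertion are dropped entirely):

AioContext *blk_get_aio_context(BlockBackend *blk)
{
    IO_CODE();

    /*
     * No assertion against the root node's context: while a context change
     * is in flight, callers may see either the old or the new context, and
     * either one is fine to submit I/O from.
     */
    return blk->ctx;
}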
Re: [PATCH 2/2] virtio: Keep notifications disabled during drain
On 25.01.24 19:18, Hanna Czenczek wrote: On 25.01.24 19:03, Stefan Hajnoczi wrote: On Wed, Jan 24, 2024 at 06:38:30PM +0100, Hanna Czenczek wrote: [...] @@ -3563,6 +3574,13 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx) aio_set_event_notifier_poll(ctx, &vq->host_notifier, virtio_queue_host_notifier_aio_poll_begin, virtio_queue_host_notifier_aio_poll_end); + + /* + * We will have ignored notifications about new requests from the guest + * during the drain, so "kick" the virt queue to process those requests + * now. + */ + virtio_queue_notify(vq->vdev, vq->queue_index); event_notifier_set(&vq->host_notifier) is easier to understand because it doesn't contain a non-host_notifier code path that we must not take. Is there a reason why you used virtio_queue_notify() instead? Not a good one anyway! virtio_queue_notify() is just what seemed obvious to me (i.e. to notify the virtqueue). Before removal of the AioContext lock, calling handle_output seemed safe. But, yes, there was the discussion on the RFC that it really isn’t. I didn’t consider that this means we must rely on virtio_queue_notify() calling event_notifier_set(), so we may as well call it explicitly here. I’ll fix it, thanks for pointing it out! (I think together with this change, I’ll also remove the event_notifier_set() call from virtio_blk_data_plane_start(). It’d obviously be a duplicate, and removing it shows why virtio_queue_aio_attach_host_notifier() should always kick the queue.)
Re: [PATCH 2/2] virtio: Keep notifications disabled during drain
On 25.01.24 19:03, Stefan Hajnoczi wrote: On Wed, Jan 24, 2024 at 06:38:30PM +0100, Hanna Czenczek wrote: During drain, we do not care about virtqueue notifications, which is why we remove the handlers on it. When removing those handlers, whether vq notifications are enabled or not depends on whether we were in polling mode or not; if not, they are enabled (by default); if so, they have been disabled by the io_poll_start callback. Because we do not care about those notifications after removing the handlers, this is fine. However, we have to explicitly ensure they are enabled when re-attaching the handlers, so we will resume receiving notifications. We do this in virtio_queue_aio_attach_host_notifier*(). If such a function is called while we are in a polling section, attaching the notifiers will then invoke the io_poll_start callback, re-disabling notifications. Because we will always miss virtqueue updates in the drained section, we also need to poll the virtqueue once after attaching the notifiers. Buglink: https://issues.redhat.com/browse/RHEL-3934 Signed-off-by: Hanna Czenczek --- include/block/aio.h | 7 ++- hw/virtio/virtio.c | 42 ++ 2 files changed, 48 insertions(+), 1 deletion(-) diff --git a/include/block/aio.h b/include/block/aio.h index 5d0a114988..8378553eb9 100644 --- a/include/block/aio.h +++ b/include/block/aio.h @@ -480,9 +480,14 @@ void aio_set_event_notifier(AioContext *ctx, AioPollFn *io_poll, EventNotifierHandler *io_poll_ready); -/* Set polling begin/end callbacks for an event notifier that has already been +/* + * Set polling begin/end callbacks for an event notifier that has already been * registered with aio_set_event_notifier. Do nothing if the event notifier is * not registered. + * + * Note that if the io_poll_end() callback (or the entire notifier) is removed + * during polling, it will not be called, so an io_poll_begin() is not + * necessarily always followed by an io_poll_end(). */ void aio_set_event_notifier_poll(AioContext *ctx, EventNotifier *notifier, diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index 7549094154..4166da9e97 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -3556,6 +3556,17 @@ static void virtio_queue_host_notifier_aio_poll_end(EventNotifier *n) void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx) { +/* + * virtio_queue_aio_detach_host_notifier() can leave notifications disabled. + * Re-enable them. (And if detach has not been used before, notifications + * being enabled is still the default state while a notifier is attached; + * see virtio_queue_host_notifier_aio_poll_end(), which will always leave + * notifications enabled once the polling section is left.) + */ +if (!virtio_queue_get_notification(vq)) { +virtio_queue_set_notification(vq, 1); +} + aio_set_event_notifier(ctx, &vq->host_notifier, virtio_queue_host_notifier_read, virtio_queue_host_notifier_aio_poll, @@ -3563,6 +3574,13 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx) aio_set_event_notifier_poll(ctx, &vq->host_notifier, virtio_queue_host_notifier_aio_poll_begin, virtio_queue_host_notifier_aio_poll_end); + +/* + * We will have ignored notifications about new requests from the guest + * during the drain, so "kick" the virt queue to process those requests + * now. + */ +virtio_queue_notify(vq->vdev, vq->queue_index); event_notifier_set(&vq->host_notifier) is easier to understand because it doesn't contain a non-host_notifier code path that we must not take. Is there a reason why you used virtio_queue_notify() instead? 
Not a good one anyway! virtio_queue_notify() is just what seemed obvious to me (i.e. to notify the virtqueue). Before removal of the AioContext lock, calling handle_output seemed safe. But, yes, there was the discussion on the RFC that it really isn’t. I didn’t consider that this means we must rely on virtio_queue_notify() calling event_notifier_set(), so we may as well call it explicitly here. I’ll fix it, thanks for pointing it out! Otherwise: Reviewed-by: Stefan Hajnoczi Thanks! Hanna
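The change agreed on here would therefore end virtio_queue_aio_attach_host_notifier() with a direct event_notifier_set() call, roughly like this (a sketch of the outcome of this exchange, not the final committed hunk):

    /*
     * We will have ignored notifications about new requests from the guest
     * during the drain, so "kick" the virtqueue to process those requests
     * now.  Use event_notifier_set() rather than virtio_queue_notify() so
     * that the vq->handle_output() fallback path, which must not be taken
     * here, cannot be taken.
     */
    event_notifier_set(&vq->host_notifier);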
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 23.01.24 18:10, Kevin Wolf wrote: Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben: On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID:<20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd: {"execute":"device_del","arguments":{"id":"stg0"}} {"return": {}} qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed. [...] I don’t know anything about the problem yet, but as usual, I like speculation and discovering how wrong I was later on, so one thing I came across that’s funny about virtio-scsi is that requests can happen even while a disk is being attached or detached. That is, Linux seems to probe all LUNs when a new virtio-scsi device is being attached, and it won’t stop just because a disk is being attached or removed. So maybe that’s part of the problem, that we get a request while the BB is being detached, and temporarily in an inconsistent state (BDS context differs from BB context). I don't know anything about the problem either, but since you already speculated about the cause, let me speculate about the solution: Can we hold the graph writer lock for the tran_commit() call in bdrv_try_change_aio_context()? And of course take the reader lock for blk_get_aio_context(), but that should be completely unproblematic. Actually, now that completely unproblematic part is giving me trouble. I wanted to just put a graph lock into blk_get_aio_context() (making it a coroutine with a wrapper), but callers of blk_get_aio_context() generally assume the context is going to stay the BB’s context for as long as their AioContext * variable is in scope. I was tempted to think callers know what happens to the BB they pass to blk_get_aio_context(), and it won’t change contexts so easily, but then I remembered this is exactly what happens in this case; we run scsi_device_for_each_req_async_bh() in one thread (which calls blk_get_aio_context()), and in the other, we change the BB’s context. It seems like there are very few blk_* functions right now that require taking a graph lock around them, so I’m hesitant to go that route. But if we’re protecting a BB’s context via the graph write lock, I can’t think of a way around having to take a read lock whenever reading a BB’s context, and holding it for as long as we assume that context to remain the BB’s context. 
It’s also hard to figure out how long that is, case by case; for example, dma_blk_read() schedules an AIO function in the BB context; but we probably don’t care that this context remains the BB’s context until the request is done. In the case of scsi_device_for_each_req_async_bh(), we already take care to re-schedule it when it turns out the context is outdated, so it does seem quite important here, and we probably want to keep the lock until after the QTAILQ_FOREACH_SAFE() loop. On a tangent, this TOCTTOU problem makes me wary of other blk_* functions that query information. For example, fuse_read() (in block/export/fuse.c) truncates requests to the BB length. But what if the BB length changes concurrently between blk_getlength() and blk_pread()? While we can justify using the graph lock for a BB’s AioContext, we can’t use it for other metadata like its length. Hanna
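To illustrate that tangent, the pattern in question looks roughly like this (heavily simplified from block/export/fuse.c; the reply handling and error paths are omitted, and the exact field names are assumptions):

static void fuse_read(fuse_req_t req, fuse_ino_t inode,
                      size_t size, off_t offset, struct fuse_file_info *fi)
{
    FuseExport *exp = fuse_req_userdata(req);
    int64_t length = blk_getlength(exp->common.blk);
    char *buf = g_malloc(size);

    /*
     * TOCTTOU window: the image may be truncated between blk_getlength()
     * and blk_pread(), so this clamp may be based on a stale length.
     */
    if (length >= 0 && offset + size > length) {
        size = length - offset;
    }

    if (blk_pread(exp->common.blk, offset, size, buf, 0) >= 0) {
        fuse_reply_buf(req, buf, size);
    }
    g_free(buf);
}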
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 24.01.24 22:53, Stefan Hajnoczi wrote: On Wed, Jan 24, 2024 at 01:12:47PM +0100, Hanna Czenczek wrote: On 23.01.24 18:10, Kevin Wolf wrote: Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben: On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID:<20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd: {"execute":"device_del","arguments":{"id":"stg0"}} {"return": {}} qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed. [...] I don't know anything about the problem either, but since you already speculated about the cause, let me speculate about the solution: Can we hold the graph writer lock for the tran_commit() call in bdrv_try_change_aio_context()? And of course take the reader lock for blk_get_aio_context(), but that should be completely unproblematic. I tried this, and it’s not easy taking the lock just for tran_commit(), because some callers of bdrv_try_change_aio_context() already hold the write lock (specifically bdrv_attach_child_common(), bdrv_attach_child_common_abort(), and bdrv_root_unref_child()[1]), and qmp_x_blockdev_set_iothread() holds the read lock. Other callers don’t hold any lock[2]. So I’m not sure whether we should mark all of bdrv_try_change_aio_context() as GRAPH_WRLOCK and then make all callers take the lock, or really only take it for tran_commit(), and have callers release the lock around bdrv_try_change_aio_context(). Former sounds better to naïve me. (In any case, FWIW, having blk_set_aio_context() take the write lock, and scsi_device_for_each_req_async_bh() take the read lock[3], does fix the assertion failure.) I wonder if a simpler solution is blk_inc_in_flight() in scsi_device_for_each_req_async() and blk_dec_in_flight() in scsi_device_for_each_req_async_bh() so that drain waits for the BH. There is a drain around the AioContext change, so as long as scsi_device_for_each_req_async() was called before blk_set_aio_context() and not _during_ aio_poll(), we would prevent inconsistent BB vs BDS aio_contexts. Actually, Kevin has suggested on IRC that we drop the whole drain. :) Dropping the write lock in or outside of bdrv_try_change_aio_context() for callers that have already taken it seems unsafe, so the only option would be to make the whole function write-lock-able. The drained section can cause problems with that if it ends up wanting to reorganize the graph, so AFAIU drain should never be done while under a write lock. 
This is already a problem now because there are three callers that do call bdrv_try_change_aio_context() while under a write lock, so it seems like we shouldn’t keep the drain as-is. So, Kevin suggested just dropping that drain, because I/O requests are no longer supposed to care about a BDS’s native AioContext anyway, so it seems like the need for the drain has gone away with multiqueue. Then we could make the whole function GRAPH_WRLOCK. Hanna
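In terms of the calling convention, that plan amounts to roughly the following (a sketch only; the exact annotation placement and the per-caller adjustments would still need to be worked out, and example_caller is a made-up name):

/* Declaration: the whole function becomes a graph-writer function */
int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
                                BdrvChild *ignore_child, Error **errp)
    GRAPH_WRLOCK;

/*
 * Call sites that do not hold the writer lock yet (e.g.
 * blk_set_aio_context()) would take it around the call, with no drained
 * section around it anymore:
 */
void example_caller(BlockDriverState *bs, AioContext *new_ctx, Error **errp)
{
    bdrv_graph_wrlock();
    bdrv_try_change_aio_context(bs, new_ctx, NULL, errp);
    bdrv_graph_wrunlock();
}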
[PATCH 2/2] virtio: Keep notifications disabled during drain
During drain, we do not care about virtqueue notifications, which is why we remove the handlers on it. When removing those handlers, whether vq notifications are enabled or not depends on whether we were in polling mode or not; if not, they are enabled (by default); if so, they have been disabled by the io_poll_start callback. Because we do not care about those notifications after removing the handlers, this is fine. However, we have to explicitly ensure they are enabled when re-attaching the handlers, so we will resume receiving notifications. We do this in virtio_queue_aio_attach_host_notifier*(). If such a function is called while we are in a polling section, attaching the notifiers will then invoke the io_poll_start callback, re-disabling notifications. Because we will always miss virtqueue updates in the drained section, we also need to poll the virtqueue once after attaching the notifiers. Buglink: https://issues.redhat.com/browse/RHEL-3934 Signed-off-by: Hanna Czenczek --- include/block/aio.h | 7 ++- hw/virtio/virtio.c | 42 ++ 2 files changed, 48 insertions(+), 1 deletion(-) diff --git a/include/block/aio.h b/include/block/aio.h index 5d0a114988..8378553eb9 100644 --- a/include/block/aio.h +++ b/include/block/aio.h @@ -480,9 +480,14 @@ void aio_set_event_notifier(AioContext *ctx, AioPollFn *io_poll, EventNotifierHandler *io_poll_ready); -/* Set polling begin/end callbacks for an event notifier that has already been +/* + * Set polling begin/end callbacks for an event notifier that has already been * registered with aio_set_event_notifier. Do nothing if the event notifier is * not registered. + * + * Note that if the io_poll_end() callback (or the entire notifier) is removed + * during polling, it will not be called, so an io_poll_begin() is not + * necessarily always followed by an io_poll_end(). */ void aio_set_event_notifier_poll(AioContext *ctx, EventNotifier *notifier, diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index 7549094154..4166da9e97 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -3556,6 +3556,17 @@ static void virtio_queue_host_notifier_aio_poll_end(EventNotifier *n) void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx) { +/* + * virtio_queue_aio_detach_host_notifier() can leave notifications disabled. + * Re-enable them. (And if detach has not been used before, notifications + * being enabled is still the default state while a notifier is attached; + * see virtio_queue_host_notifier_aio_poll_end(), which will always leave + * notifications enabled once the polling section is left.) + */ +if (!virtio_queue_get_notification(vq)) { +virtio_queue_set_notification(vq, 1); +} + aio_set_event_notifier(ctx, &vq->host_notifier, virtio_queue_host_notifier_read, virtio_queue_host_notifier_aio_poll, @@ -3563,6 +3574,13 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx) aio_set_event_notifier_poll(ctx, &vq->host_notifier, virtio_queue_host_notifier_aio_poll_begin, virtio_queue_host_notifier_aio_poll_end); + +/* + * We will have ignored notifications about new requests from the guest + * during the drain, so "kick" the virt queue to process those requests + * now. 
+ */ +virtio_queue_notify(vq->vdev, vq->queue_index); } /* @@ -3573,14 +3591,38 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx) */ void virtio_queue_aio_attach_host_notifier_no_poll(VirtQueue *vq, AioContext *ctx) { +/* See virtio_queue_aio_attach_host_notifier() */ +if (!virtio_queue_get_notification(vq)) { +virtio_queue_set_notification(vq, 1); +} + aio_set_event_notifier(ctx, &vq->host_notifier, virtio_queue_host_notifier_read, NULL, NULL); + +/* + * See virtio_queue_aio_attach_host_notifier(). + * Note that this may be unnecessary for the type of virtqueues this + * function is used for. Still, it will not hurt to have a quick look into + * whether we can/should process any of the virtqueue elements. + */ +virtio_queue_notify(vq->vdev, vq->queue_index); } void virtio_queue_aio_detach_host_notifier(VirtQueue *vq, AioContext *ctx) { aio_set_event_notifier(ctx, &vq->host_notifier, NULL, NULL, NULL); + +/* + * aio_set_event_notifier_poll() does not guarantee whether io_poll_end() + * will run after io_poll_begin(), so by removing the notifier, we do not + * know whether virtio_queue_host_notifier_aio_poll_end() has run after a + * previous virtio_queue_host_notifier_aio_poll_begin(), i.e. whether + * notifications are enabled or disabled. + */ }
[PATCH 1/2] virtio-scsi: Attach event vq notifier with no_poll
As of commit 38738f7dbbda90fbc161757b7f4be35b52205552 ("virtio-scsi: don't waste CPU polling the event virtqueue"), we only attach an io_read notifier for the virtio-scsi event virtqueue instead, and no polling notifiers. During operation, the event virtqueue is typically non-empty, but none of the buffers are intended to be used immediately. Instead, they only get used when certain events occur. Therefore, it makes no sense to continuously poll it when non-empty, because it is supposed to be and stay non-empty. We do this by using virtio_queue_aio_attach_host_notifier_no_poll() instead of virtio_queue_aio_attach_host_notifier() for the event virtqueue. Commit 766aa2de0f29b657148e04599320d771c36fd126 ("virtio-scsi: implement BlockDevOps->drained_begin()") however has virtio_scsi_drained_end() use virtio_queue_aio_attach_host_notifier() for all virtqueues, including the event virtqueue. This can lead to it being polled again, undoing the benefit of commit 38738f7dbbda90fbc161757b7f4be35b52205552. Fix it by using virtio_queue_aio_attach_host_notifier_no_poll() for the event virtqueue. Reported-by: Fiona Ebner Fixes: 766aa2de0f29b657148e04599320d771c36fd126 ("virtio-scsi: implement BlockDevOps->drained_begin()") Signed-off-by: Hanna Czenczek --- hw/scsi/virtio-scsi.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c index 690aceec45..9f02ceea09 100644 --- a/hw/scsi/virtio-scsi.c +++ b/hw/scsi/virtio-scsi.c @@ -1149,6 +1149,7 @@ static void virtio_scsi_drained_begin(SCSIBus *bus) static void virtio_scsi_drained_end(SCSIBus *bus) { VirtIOSCSI *s = container_of(bus, VirtIOSCSI, bus); +VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s); VirtIODevice *vdev = VIRTIO_DEVICE(s); uint32_t total_queues = VIRTIO_SCSI_VQ_NUM_FIXED + s->parent_obj.conf.num_queues; @@ -1166,7 +1167,11 @@ static void virtio_scsi_drained_end(SCSIBus *bus) for (uint32_t i = 0; i < total_queues; i++) { VirtQueue *vq = virtio_get_queue(vdev, i); -virtio_queue_aio_attach_host_notifier(vq, s->ctx); +if (vq == vs->event_vq) { +virtio_queue_aio_attach_host_notifier_no_poll(vq, s->ctx); +} else { +virtio_queue_aio_attach_host_notifier(vq, s->ctx); +} } } -- 2.43.0
[PATCH 0/2] virtio: Keep notifications disabled during drain
Hi, When registering callbacks via aio_set_event_notifier_poll(), the io_poll_end() callback is only invoked when polling actually ends. If the notifiers are removed while in a polling section, it is not called. Therefore, io_poll_start() is not necessarily followed up by io_poll_end(). It is not entirely clear whether this is good or bad behavior. On one hand, it may be unexpected to callers. On the other, it may be counterproductive to call io_poll_end() when the polling section has not ended yet. Right now, there is only one user of aio_set_event_notifier_poll(), which is virtio_queue_aio_attach_host_notifier(). It does not expect this behavior, which leads to virtqueue notifiers remaining disabled if virtio_queue_aio_detach_host_notifier() is called while polling. That can happen e.g. through virtio_scsi_drained_begin() or virtio_blk_drained_begin() (through virtio_blk_data_plane_detach()). In such a case, the virtqueue may not be processed for a while, letting the guest driver hang. This can be reproduced by repeatedly hot-plugging and -unplugging a virtio-scsi device with a scsi-hd disk, because the guest will try to enumerate the virtio-scsi device while we’re attaching the scsi-hd disk, which causes a drain, which can cause the virtio-scsi virtqueue to stall as described. Stefan has suggested ensuring we always follow up io_poll_start() by io_poll_end(): https://lists.nongnu.org/archive/html/qemu-block/2023-12/msg00163.html I prefer changing the caller instead, because I don’t think we actually want the virtqueue notifier to be force-enabled when removing our AIO notifiers. So I believe we actually only want to take care to force-enable it when we re-attach the AIO notifiers, and to kick virtqueue processing once, in case we missed any events while the AIO notifiers were not attached. That is done by patch 2. We have already discussed a prior version of it here: https://lists.nongnu.org/archive/html/qemu-block/2024-01/msg1.html And compared to that, based on the discussion, there are some changes: 1. Used virtio_queue_notify() instead of virtio_queue_notify_vq(), as suggested by Paolo, because it’s thread-safe 2. Moved virtio_queue_notify() into virtio_queue_aio_attach_host_notifier*(), because we always want it 3. Dropped virtio_queue_set_notification(vq, 0) from virtio_queue_aio_detach_host_notifier(): Paolo wasn’t sure whether that was safe to do from any context. We don’t really need to call it anyway, so I just dropped it. 4. Added patch 1: Patch 1 fixes virtio_scsi_drained_end() so it won’t attach polling notifiers for the event virtqueue. That didn’t turn out to be an issue so far, but with patch 2, Fiona saw virtqueue processing spin in a loop as described in 38738f7dbbda90fbc161757b7f4be35b52205552 ("virtio-scsi: don't waste CPU polling the event virtqueue"). Note that as of eaad0fe26050c227dc5dad63205835bac4912a51 ("scsi: only access SCSIDevice->requests from one thread") there’s a different problem when trying to reproduce the bug via hot-plugging and -unplugging a virtio-scsi device, specifically, when unplugging, qemu may crash with an assertion failure[1]. 
I don’t have a full fix for that yet, but in case you need a work-around for the specific case of virtio-scsi hot-plugging and -unplugging, you can use this patch: https://czenczek.de/0001-DONTMERGE-Fix-crash-on-scsi-unplug.patch [1] https://lists.nongnu.org/archive/html/qemu-block/2024-01/msg00317.html Hanna Czenczek (2): virtio-scsi: Attach event vq notifier with no_poll virtio: Keep notifications disabled during drain include/block/aio.h | 7 ++- hw/scsi/virtio-scsi.c | 7 ++- hw/virtio/virtio.c | 42 ++ 3 files changed, 54 insertions(+), 2 deletions(-) -- 2.43.0
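For reference, the hot-plug/unplug reproducer described in this cover letter can be driven over QMP roughly as follows (illustrative only; the device IDs, the iothread, and the drive are made up and must exist in the actual configuration):

{"execute":"device_add","arguments":{"driver":"virtio-scsi-pci","id":"vscsi0","iothread":"iot0"}}
{"execute":"device_add","arguments":{"driver":"scsi-hd","id":"stg0","bus":"vscsi0.0","drive":"drv0"}}
{"execute":"device_del","arguments":{"id":"stg0"}}
{"execute":"device_del","arguments":{"id":"vscsi0"}}

Repeating this loop while the guest is still enumerating the virtio-scsi device is what triggers the drain and, with it, the stall described above.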
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 23.01.24 18:10, Kevin Wolf wrote: Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben: On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID:<20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd: {"execute":"device_del","arguments":{"id":"stg0"}} {"return": {}} qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed. [...] I don't know anything about the problem either, but since you already speculated about the cause, let me speculate about the solution: Can we hold the graph writer lock for the tran_commit() call in bdrv_try_change_aio_context()? And of course take the reader lock for blk_get_aio_context(), but that should be completely unproblematic. I tried this, and it’s not easy taking the lock just for tran_commit(), because some callers of bdrv_try_change_aio_context() already hold the write lock (specifically bdrv_attach_child_common(), bdrv_attach_child_common_abort(), and bdrv_root_unref_child()[1]), and qmp_x_blockdev_set_iothread() holds the read lock. Other callers don’t hold any lock[2]. So I’m not sure whether we should mark all of bdrv_try_change_aio_context() as GRAPH_WRLOCK and then make all callers take the lock, or really only take it for tran_commit(), and have callers release the lock around bdrv_try_change_aio_context(). Former sounds better to naïve me. (In any case, FWIW, having blk_set_aio_context() take the write lock, and scsi_device_for_each_req_async_bh() take the read lock[3], does fix the assertion failure.) Hanna [1] bdrv_root_unref_child() is not marked as GRAPH_WRLOCK, but its callers generally seem to ensure that the lock is taken when calling it. [2] blk_set_aio_context() (evidently), blk_exp_add(), external_snapshot_abort(), {blockdev,drive}_backup_action(), qmp_{blockdev,drive}_mirror() [3] I’ve made the _bh a coroutine (for bdrv_graph_co_rdlock()) and replaced the aio_bh_schedule_oneshot() by aio_co_enter() – hope that’s right.
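Spelled out, [3] would look roughly like this (the shape is assumed, not the patch that was actually tested; scsi_device_for_each_req_async_co is a made-up name, and the context re-check is elided):

/* Scheduling side: create a coroutine instead of a bare BH */
    Coroutine *co = qemu_coroutine_create(scsi_device_for_each_req_async_co,
                                          data);
    aio_co_enter(blk_get_aio_context(s->conf.blk), co);

/* The former BH body becomes a coroutine that can take the reader lock: */
static void coroutine_fn scsi_device_for_each_req_async_co(void *opaque)
{
    SCSIDeviceForEachReqAsyncData *data = opaque;
    SCSIDevice *s = data->s;
    SCSIRequest *req, *next_req;

    /* Hold the graph reader lock for as long as blk->ctx is relied upon */
    bdrv_graph_co_rdlock();
    QTAILQ_FOREACH_SAFE(req, &s->requests, next, next_req) {
        data->fn(req, data->fn_opaque);
    }
    bdrv_graph_co_rdunlock();

    g_free(data);
}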
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 23.01.24 18:10, Kevin Wolf wrote: Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben: On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID:<20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd: {"execute":"device_del","arguments":{"id":"stg0"}} {"return": {}} qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed. (gdb) bt #0 0x7f32a668d83c in () at /usr/lib/libc.so.6 #1 0x7f32a663d668 in raise () at /usr/lib/libc.so.6 #2 0x7f32a66254b8 in abort () at /usr/lib/libc.so.6 #3 0x7f32a66253dc in () at /usr/lib/libc.so.6 #4 0x7f32a6635d26 in () at /usr/lib/libc.so.6 #5 0x556e6b4880a4 in blk_get_aio_context (blk=0x556e6e89ccf0) at ../block/block-backend.c:2429 #6 blk_get_aio_context (blk=0x556e6e89ccf0) at ../block/block-backend.c:2417 #7 0x556e6b112d87 in scsi_device_for_each_req_async_bh (opaque=0x556e6e2c6d10) at ../hw/scsi/scsi-bus.c:128 #8 0x556e6b5d1966 in aio_bh_poll (ctx=ctx@entry=0x556e6d8aa290) at ../util/async.c:218 #9 0x556e6b5bb16a in aio_poll (ctx=0x556e6d8aa290, blocking=blocking@entry=true) at ../util/aio-posix.c:722 #10 0x556e6b4564b6 in iothread_run (opaque=opaque@entry=0x556e6d89d920) at ../iothread.c:63 #11 0x556e6b5bde58 in qemu_thread_start (args=0x556e6d8aa9b0) at ../util/qemu-thread-posix.c:541 #12 0x7f32a668b9eb in () at /usr/lib/libc.so.6 #13 0x7f32a670f7cc in () at /usr/lib/libc.so.6 I don’t know anything about the problem yet, but as usual, I like speculation and discovering how wrong I was later on, so one thing I came across that’s funny about virtio-scsi is that requests can happen even while a disk is being attached or detached. That is, Linux seems to probe all LUNs when a new virtio-scsi device is being attached, and it won’t stop just because a disk is being attached or removed. So maybe that’s part of the problem, that we get a request while the BB is being detached, and temporarily in an inconsistent state (BDS context differs from BB context). I don't know anything about the problem either, but since you already speculated about the cause, let me speculate about the solution: Can we hold the graph writer lock for the tran_commit() call in bdrv_try_change_aio_context()? And of course take the reader lock for blk_get_aio_context(), but that should be completely unproblematic. At the first sight I don't see a reason why this would break something, but I've learnt not to trust my first impression with the graph locking work... 
Of course, I also didn't check if there are more things inside of the device emulation that need additional locking in this case, too. But even if so, blk_get_aio_context() should never see an inconsistent state. Ah, sorry, saw your reply only now that I hit “send”. I forgot that we do have that big lock that I rather wanted to avoid :) Sounds good and very reasonable to me. Changing the contexts in the graph sounds like a graph change operation, and reading and comparing contexts in the graph sounds like reading the graph. Hanna
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 23.01.24 17:40, Hanna Czenczek wrote: On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID:<20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) [...] I don’t know anything about the problem yet, but as usual, I like speculation and discovering how wrong I was later on, so one thing I came across that’s funny about virtio-scsi is that requests can happen even while a disk is being attached or detached. That is, Linux seems to probe all LUNs when a new virtio-scsi device is being attached, and it won’t stop just because a disk is being attached or removed. So maybe that’s part of the problem, that we get a request while the BB is being detached, and temporarily in an inconsistent state (BDS context differs from BB context). I’ll look more into it. What I think happens is that scsi_device_purge_requests() runs (perhaps through virtio_scsi_hotunplug() -> qdev_simple_device_unplug_cb() -> scsi_qdev_unrealize()?), which schedules scsi_device_for_each_req_async_bh() to run, but doesn’t await it. We go on, begin to move the BB and its BDS back to the main context (via blk_set_aio_context() in virtio_scsi_hotunplug()), but scsi_device_for_each_req_async_bh() still runs in the I/O thread, it calls blk_get_aio_context() while the contexts are inconsistent, and we get the crash. There is a comment above blk_get_aio_context() in scsi_device_for_each_req_async_bh() about the BB potentially being moved to a different context prior to the BH running, but it doesn’t consider the possibility that that move may occur *concurrently*. I don’t know how to fix this, though. The whole idea of anything happening to a BB while it is being moved to a different context seems so wrong to me that I’d want to slap a big lock on it, but I have the feeling that that isn’t what we want. Hanna
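Laid out as a timeline, the interleaving suspected in this description is:

main loop thread                          IOThread (BB's old context)
----------------                          ---------------------------
scsi_qdev_unrealize()
  -> scsi_device_purge_requests()
     -> schedules scsi_device_for_each_req_async_bh(),
        does not await it
blk_set_aio_context(blk, main context)
  -> BDS context and blk->ctx are briefly
     inconsistent while the change is
     being committed
                                          scsi_device_for_each_req_async_bh()
                                            -> blk_get_aio_context(blk)
                                            -> assert(ctx == blk->ctx) fails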
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID:<20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd: {"execute":"device_del","arguments":{"id":"stg0"}} {"return": {}} qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed. (gdb) bt #0 0x7f32a668d83c in () at /usr/lib/libc.so.6 #1 0x7f32a663d668 in raise () at /usr/lib/libc.so.6 #2 0x7f32a66254b8 in abort () at /usr/lib/libc.so.6 #3 0x7f32a66253dc in () at /usr/lib/libc.so.6 #4 0x7f32a6635d26 in () at /usr/lib/libc.so.6 #5 0x556e6b4880a4 in blk_get_aio_context (blk=0x556e6e89ccf0) at ../block/block-backend.c:2429 #6 blk_get_aio_context (blk=0x556e6e89ccf0) at ../block/block-backend.c:2417 #7 0x556e6b112d87 in scsi_device_for_each_req_async_bh (opaque=0x556e6e2c6d10) at ../hw/scsi/scsi-bus.c:128 #8 0x556e6b5d1966 in aio_bh_poll (ctx=ctx@entry=0x556e6d8aa290) at ../util/async.c:218 #9 0x556e6b5bb16a in aio_poll (ctx=0x556e6d8aa290, blocking=blocking@entry=true) at ../util/aio-posix.c:722 #10 0x556e6b4564b6 in iothread_run (opaque=opaque@entry=0x556e6d89d920) at ../iothread.c:63 #11 0x556e6b5bde58 in qemu_thread_start (args=0x556e6d8aa9b0) at ../util/qemu-thread-posix.c:541 #12 0x7f32a668b9eb in () at /usr/lib/libc.so.6 #13 0x7f32a670f7cc in () at /usr/lib/libc.so.6 I don’t know anything about the problem yet, but as usual, I like speculation and discovering how wrong I was later on, so one thing I came across that’s funny about virtio-scsi is that requests can happen even while a disk is being attached or detached. That is, Linux seems to probe all LUNs when a new virtio-scsi device is being attached, and it won’t stop just because a disk is being attached or removed. So maybe that’s part of the problem, that we get a request while the BB is being detached, and temporarily in an inconsistent state (BDS context differs from BB context). I’ll look more into it. Hanna
Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler
On 02.01.24 16:24, Hanna Czenczek wrote: [...] I’ve attached the preliminary patch that I didn’t get to send (or test much) last year. Not sure if it has the same CPU-usage-spike issue Fiona was seeing, the only functional difference is that I notify the vq after attaching the notifiers instead of before. It’ll take me some more time to send a patch mail to that effect, because now there’s an assertion failure on hotunplug, which bisects back to eaad0fe26050c227dc5dad63205835bac4912a51 (“scsi: only access SCSIDevice->requests from one thread”): {"execute":"device_del","arguments":{"id":"stg0"}} {"return": {}} qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed. (gdb) bt #0 0x7f32a668d83c in () at /usr/lib/libc.so.6 #1 0x7f32a663d668 in raise () at /usr/lib/libc.so.6 #2 0x7f32a66254b8 in abort () at /usr/lib/libc.so.6 #3 0x7f32a66253dc in () at /usr/lib/libc.so.6 #4 0x7f32a6635d26 in () at /usr/lib/libc.so.6 #5 0x556e6b4880a4 in blk_get_aio_context (blk=0x556e6e89ccf0) at ../block/block-backend.c:2429 #6 blk_get_aio_context (blk=0x556e6e89ccf0) at ../block/block-backend.c:2417 #7 0x556e6b112d87 in scsi_device_for_each_req_async_bh (opaque=0x556e6e2c6d10) at ../hw/scsi/scsi-bus.c:128 #8 0x556e6b5d1966 in aio_bh_poll (ctx=ctx@entry=0x556e6d8aa290) at ../util/async.c:218 #9 0x556e6b5bb16a in aio_poll (ctx=0x556e6d8aa290, blocking=blocking@entry=true) at ../util/aio-posix.c:722 #10 0x556e6b4564b6 in iothread_run (opaque=opaque@entry=0x556e6d89d920) at ../iothread.c:63 #11 0x556e6b5bde58 in qemu_thread_start (args=0x556e6d8aa9b0) at ../util/qemu-thread-posix.c:541 #12 0x7f32a668b9eb in () at /usr/lib/libc.so.6 #13 0x7f32a670f7cc in () at /usr/lib/libc.so.6
Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler
On 23.01.24 12:12, Fiona Ebner wrote: [...] I noticed poll_set_started() is not called, because ctx->fdmon_ops->need_wait(ctx) was true, i.e. ctx->poll_disable_cnt was positive (I'm using fdmon_poll). I then found this is because of the notifier for the event vq, being attached with virtio_queue_aio_attach_host_notifier_no_poll(vs->event_vq, s->ctx); in virtio_scsi_dataplane_start(). But in virtio_scsi_drained_end() it is attached with virtio_queue_aio_attach_host_notifier() instead of the _no_poll() variant. So that might be the actual issue here? From a quick test, I cannot see the CPU-usage-spike issue with the following either: diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c index 690aceec45..ba1ab8e410 100644 --- a/hw/scsi/virtio-scsi.c +++ b/hw/scsi/virtio-scsi.c @@ -1166,7 +1166,15 @@ static void virtio_scsi_drained_end(SCSIBus *bus) for (uint32_t i = 0; i < total_queues; i++) { VirtQueue *vq = virtio_get_queue(vdev, i); -virtio_queue_aio_attach_host_notifier(vq, s->ctx); +if (!virtio_queue_get_notification(vq)) { +virtio_queue_set_notification(vq, true); +} +if (vq == VIRTIO_SCSI_COMMON(s)->event_vq) { +virtio_queue_aio_attach_host_notifier_no_poll(vq, s->ctx); +} else { +virtio_queue_aio_attach_host_notifier(vq, s->ctx); +} +virtio_queue_notify(vdev, i); } } Perfect, so we agree on trying it that way. :) Hanna
Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler
On 22.01.24 18:52, Hanna Czenczek wrote: On 22.01.24 18:41, Hanna Czenczek wrote: On 05.01.24 15:30, Fiona Ebner wrote: Am 05.01.24 um 14:43 schrieb Fiona Ebner: Am 03.01.24 um 14:35 schrieb Paolo Bonzini: On 1/3/24 12:40, Fiona Ebner wrote: I'm happy to report that I cannot reproduce the CPU-usage-spike issue with the patch, but I did run into an assertion failure when trying to verify that it fixes my original stuck-guest-IO issue. See below for the backtrace [0]. Hanna wrote in https://issues.redhat.com/browse/RHEL-3934 I think it’s sufficient to simply call virtio_queue_notify_vq(vq) after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, because both virtio-scsi’s and virtio-blk’s .handle_output() implementations acquire the device’s context, so this should be directly callable from any context. I guess this is not true anymore now that the AioContext locking was removed? Good point and, in fact, even before it was much safer to use virtio_queue_notify() instead. Not only does it use the event notifier handler, but it also calls it in the right thread/AioContext just by doing event_notifier_set(). But with virtio_queue_notify() using the event notifier, the CPU-usage-spike issue is present: Back to the CPU-usage-spike issue: I experimented around and it doesn't seem to matter whether I notify the virt queue before or after attaching the notifiers. But there's another functional difference. My patch called virtio_queue_notify() which contains this block: if (vq->host_notifier_enabled) { event_notifier_set(&vq->host_notifier); } else if (vq->handle_output) { vq->handle_output(vdev, vq); In my testing, the first branch was taken, calling event_notifier_set(). Hanna's patch uses virtio_queue_notify_vq() and there, vq->handle_output() will be called. That seems to be the relevant difference regarding the CPU-usage-spike issue. I should mention that this is with a VirtIO SCSI disk. I also attempted to reproduce the CPU-usage-spike issue with a VirtIO block disk, but didn't manage yet. What I noticed is that in virtio_queue_host_notifier_aio_poll(), one of the queues (but only one) will always show as nonempty. And then, run_poll_handlers_once() will always detect progress which explains the CPU usage. The following shows 1. vq address 2. number of times vq was passed to virtio_queue_host_notifier_aio_poll() 3. number of times the result of virtio_queue_host_notifier_aio_poll() was true for the vq 0x555fd93f9c80 17162000 0 0x555fd93f9e48 17162000 6 0x555fd93f9ee0 17162000 0 0x555fd93f9d18 17162000 17162000 0x555fd93f9db0 17162000 0 0x555fd93f9f78 17162000 0 And for the problematic one, the reason it is seen as nonempty is: 0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0 vring_avail_idx(vq) also gives 8 here. This is the vs->event_vq and s->events_dropped is false in my testing, so virtio_scsi_handle_event_vq() doesn't do anything. Those values stay like this while the call counts above increase. So something going wrong with the indices when the event notifier is set from QEMU side (in the main thread)? The guest is Debian 12 with a 6.1 kernel. So, trying to figure out a new RFC version: About the stack trace you, Fiona, posted: As far as I understand, that happens because virtio_blk_drained_end() calling virtio_queue_notify_vq() wasn’t safe after all, and instead we need to use virtio_queue_notify(). Right? However, you say using virtio_queue_notify() instead causes busy loops of doing nothing in virtio-scsi (what you describe above). I mean, better than a crash, but, you know. 
:) I don’t have any prior knowledge about the event handling, unfortunately. The fact that 8 buffers are available but we don’t use any sounds OK to me; as far as I understand, we only use those buffers if we have any events to push into them, so as long as we don’t, we won’t. Question is, should we not have its poll handler return false if we don’t have any events (i.e. events_dropped is false)? Would that solve it? Or actually, maybe we could just skip the virtio_queue_notify() call for the event vq? I.e. have it be `if (vq != VIRTIO_SCSI_COMMON(s)->event_vq) { virtio_queue_notify(vdev, i); }`? I wouldn’t like that very much, (a) because this would make it slightly cumbersome to put that into virtio_queue_aio_attach_host_notifier*(), and (b) in case that does fix it, I do kind of feel like the real problem is that we use virtio_queue_host_notifier_aio_poll() for the event vq, which tells the polling code to poll whenever the vq is non-empty, but we (AFAIU) expect the event vq to be non-empty all the time. Turns out there was commit 38738f7dbbda90fbc161757b7f4be35b52205552 (“virtio-scsi: don't waste CPU polling the event virtqueue”) by Stefan, which specifically intended to not use virtio_queue_host_notifier_aio_poll() for the event vq. I think the problem is that virtio_scsi_drained
Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler
On 22.01.24 18:41, Hanna Czenczek wrote: On 05.01.24 15:30, Fiona Ebner wrote: Am 05.01.24 um 14:43 schrieb Fiona Ebner: Am 03.01.24 um 14:35 schrieb Paolo Bonzini: On 1/3/24 12:40, Fiona Ebner wrote: I'm happy to report that I cannot reproduce the CPU-usage-spike issue with the patch, but I did run into an assertion failure when trying to verify that it fixes my original stuck-guest-IO issue. See below for the backtrace [0]. Hanna wrote in https://issues.redhat.com/browse/RHEL-3934 I think it’s sufficient to simply call virtio_queue_notify_vq(vq) after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, because both virtio-scsi’s and virtio-blk’s .handle_output() implementations acquire the device’s context, so this should be directly callable from any context. I guess this is not true anymore now that the AioContext locking was removed? Good point and, in fact, even before it was much safer to use virtio_queue_notify() instead. Not only does it use the event notifier handler, but it also calls it in the right thread/AioContext just by doing event_notifier_set(). But with virtio_queue_notify() using the event notifier, the CPU-usage-spike issue is present: Back to the CPU-usage-spike issue: I experimented around and it doesn't seem to matter whether I notify the virt queue before or after attaching the notifiers. But there's another functional difference. My patch called virtio_queue_notify() which contains this block: if (vq->host_notifier_enabled) { event_notifier_set(&vq->host_notifier); } else if (vq->handle_output) { vq->handle_output(vdev, vq); In my testing, the first branch was taken, calling event_notifier_set(). Hanna's patch uses virtio_queue_notify_vq() and there, vq->handle_output() will be called. That seems to be the relevant difference regarding the CPU-usage-spike issue. I should mention that this is with a VirtIO SCSI disk. I also attempted to reproduce the CPU-usage-spike issue with a VirtIO block disk, but didn't manage yet. What I noticed is that in virtio_queue_host_notifier_aio_poll(), one of the queues (but only one) will always show as nonempty. And then, run_poll_handlers_once() will always detect progress which explains the CPU usage. The following shows 1. vq address 2. number of times vq was passed to virtio_queue_host_notifier_aio_poll() 3. number of times the result of virtio_queue_host_notifier_aio_poll() was true for the vq 0x555fd93f9c80 17162000 0 0x555fd93f9e48 17162000 6 0x555fd93f9ee0 17162000 0 0x555fd93f9d18 17162000 17162000 0x555fd93f9db0 17162000 0 0x555fd93f9f78 17162000 0 And for the problematic one, the reason it is seen as nonempty is: 0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0 vring_avail_idx(vq) also gives 8 here. This is the vs->event_vq and s->events_dropped is false in my testing, so virtio_scsi_handle_event_vq() doesn't do anything. Those values stay like this while the call counts above increase. So something going wrong with the indices when the event notifier is set from QEMU side (in the main thread)? The guest is Debian 12 with a 6.1 kernel. So, trying to figure out a new RFC version: About the stack trace you, Fiona, posted: As far as I understand, that happens because virtio_blk_drained_end() calling virtio_queue_notify_vq() wasn’t safe after all, and instead we need to use virtio_queue_notify(). Right? However, you say using virtio_queue_notify() instead causes busy loops of doing nothing in virtio-scsi (what you describe above). I mean, better than a crash, but, you know. 
:) I don’t have any prior knowledge about the event handling, unfortunately. The fact that 8 buffers are available but we don’t use any sounds OK to me; as far as I understand, we only use those buffers if we have any events to push into them, so as long as we don’t, we won’t. Question is, should we not have its poll handler return false if we don’t have any events (i.e. events_dropped is false)? Would that solve it? Or actually, maybe we could just skip the virtio_queue_notify() call for the event vq? I.e. have it be `if (vq != VIRTIO_SCSI_COMMON(s)->event_vq) { virtio_queue_notify(vdev, i); }`? I wouldn’t like that very much, (a) because this would make it slightly cumbersome to put that into virtio_queue_aio_attach_host_notifier*(), and (b) in case that does fix it, I do kind of feel like the real problem is that we use virtio_queue_host_notifier_aio_poll() for the event vq, which tells the polling code to poll whenever the vq is non-empty, but we (AFAIU) expect the event vq to be non-empty all the time. Hanna
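For context, the polling callback in question boils down to a non-emptiness check, along these lines (reproduced from hw/virtio/virtio.c from memory, so details may differ):

static bool virtio_queue_host_notifier_aio_poll(void *opaque)
{
    EventNotifier *n = opaque;
    VirtQueue *vq = container_of(n, VirtQueue, host_notifier);

    /*
     * Reports progress whenever the vq is non-empty, which for the event
     * vq is the expected steady state.
     */
    return vq->vring.desc && !virtio_queue_empty(vq);
}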
Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler
On 05.01.24 15:30, Fiona Ebner wrote: Am 05.01.24 um 14:43 schrieb Fiona Ebner: Am 03.01.24 um 14:35 schrieb Paolo Bonzini: On 1/3/24 12:40, Fiona Ebner wrote: I'm happy to report that I cannot reproduce the CPU-usage-spike issue with the patch, but I did run into an assertion failure when trying to verify that it fixes my original stuck-guest-IO issue. See below for the backtrace [0]. Hanna wrote in https://issues.redhat.com/browse/RHEL-3934 I think it’s sufficient to simply call virtio_queue_notify_vq(vq) after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, because both virtio-scsi’s and virtio-blk’s .handle_output() implementations acquire the device’s context, so this should be directly callable from any context. I guess this is not true anymore now that the AioContext locking was removed? Good point and, in fact, even before it was much safer to use virtio_queue_notify() instead. Not only does it use the event notifier handler, but it also calls it in the right thread/AioContext just by doing event_notifier_set(). But with virtio_queue_notify() using the event notifier, the CPU-usage-spike issue is present: Back to the CPU-usage-spike issue: I experimented around and it doesn't seem to matter whether I notify the virt queue before or after attaching the notifiers. But there's another functional difference. My patch called virtio_queue_notify() which contains this block: if (vq->host_notifier_enabled) { event_notifier_set(&vq->host_notifier); } else if (vq->handle_output) { vq->handle_output(vdev, vq); In my testing, the first branch was taken, calling event_notifier_set(). Hanna's patch uses virtio_queue_notify_vq() and there, vq->handle_output() will be called. That seems to be the relevant difference regarding the CPU-usage-spike issue. I should mention that this is with a VirtIO SCSI disk. I also attempted to reproduce the CPU-usage-spike issue with a VirtIO block disk, but didn't manage yet. What I noticed is that in virtio_queue_host_notifier_aio_poll(), one of the queues (but only one) will always show as nonempty. And then, run_poll_handlers_once() will always detect progress which explains the CPU usage. The following shows 1. vq address 2. number of times vq was passed to virtio_queue_host_notifier_aio_poll() 3. number of times the result of virtio_queue_host_notifier_aio_poll() was true for the vq 0x555fd93f9c80 17162000 0 0x555fd93f9e48 17162000 6 0x555fd93f9ee0 17162000 0 0x555fd93f9d18 17162000 17162000 0x555fd93f9db0 17162000 0 0x555fd93f9f78 17162000 0 And for the problematic one, the reason it is seen as nonempty is: 0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0 vring_avail_idx(vq) also gives 8 here. This is the vs->event_vq and s->events_dropped is false in my testing, so virtio_scsi_handle_event_vq() doesn't do anything. Those values stay like this while the call counts above increase. So something going wrong with the indices when the event notifier is set from QEMU side (in the main thread)? The guest is Debian 12 with a 6.1 kernel. So, trying to figure out a new RFC version: About the stack trace you, Fiona, posted: As far as I understand, that happens because virtio_blk_drained_end() calling virtio_queue_notify_vq() wasn’t safe after all, and instead we need to use virtio_queue_notify(). Right? However, you say using virtio_queue_notify() instead causes busy loops of doing nothing in virtio-scsi (what you describe above). I mean, better than a crash, but, you know. :) I don’t have any prior knowledge about the event handling, unfortunately. 
The fact that 8 buffers are available but we don’t use any sounds OK to me; as far as I understand, we only use those buffers if we have any events to push into them, so as long as we don’t, we won’t. Question is, should we not have its poll handler return false if we don’t have any events (i.e. events_dropped is false)? Would that solve it? Hanna
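That idea might look something like the following. This is purely hypothetical: virtio_scsi_event_vq_aio_poll is a made-up name, VirtQueue's internals are private to virtio.c (so a real version would need a helper there), and the thread ultimately went a different way, attaching the event vq without polling handlers at all.

static bool virtio_scsi_event_vq_aio_poll(void *opaque)
{
    EventNotifier *n = opaque;
    VirtQueue *vq = container_of(n, VirtQueue, host_notifier);
    VirtIOSCSI *s = VIRTIO_SCSI(vq->vdev);

    /* Only report poll progress when there are events to deliver */
    if (!s->events_dropped) {
        return false;
    }
    return virtio_queue_host_notifier_aio_poll(opaque);
}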
Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler
On 02.01.24 16:53, Paolo Bonzini wrote: On Tue, Jan 2, 2024 at 4:24 PM Hanna Czenczek wrote: I’ve attached the preliminary patch that I didn’t get to send (or test much) last year. Not sure if it has the same CPU-usage-spike issue Fiona was seeing, the only functional difference is that I notify the vq after attaching the notifiers instead of before. I think the patch makes sense and cleaning up the logic of aio_poll (which is one of those functions that grew and grew without much clarity into who did what) can be done on top. Just one small thing, the virtio_queue_notify_vq() call is required because the virtqueue interrupt and eventfd are edge-triggered rather than level-triggered; so perhaps it should be placed in the function(s) that establish the handlers, virtio_queue_aio_attach_host_notifier() and virtio_queue_aio_attach_host_notifier_no_poll()? Neither virtio_blk_drained_end() nor virtio_scsi_drained_end() are particularly special, and the comment applies just as well: /* * We will have ignored notifications about new requests from the guest * while handlers were not attached, so "kick" the virt queue to process * those requests now. */ I wasn’t entirely sure whether we want to call notify_vq() if we have instated the handlers for the first time (in which case someone ought to look for existing unprocessed requests anyway), so I decided to limit it to drained_end. But considering that it must be safe to call notify_vq() right after instating the handlers (virtio_queue_host_notifier_read() may then be called after all), we might as well do so every time, yes. Hanna
Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler
On 13.12.23 22:15, Stefan Hajnoczi wrote: Hanna and Fiona encountered a bug in aio_set_fd_handler(): there is no matching io_poll_end() call upon removing an AioHandler when io_poll_begin() was previously called. The missing io_poll_end() call leaves virtqueue notifications disabled and the virtqueue's ioeventfd will never become readable anymore. The details of how virtio-scsi devices using IOThreads can hang after hotplug/unplug are covered here: https://issues.redhat.com/browse/RHEL-3934 Hanna is currently away over the December holidays. I'm sending these RFC patches in the meantime. They demonstrate running aio_set_fd_handler() in the AioContext home thread and adding the missing io_poll_end() call. I agree with Paolo that if the handlers are removed, the caller probably isn’t interested in notifications: In our specific case, we remove the handlers because the device is to be drained, so we won’t poll the virtqueue anyway. Therefore, if io_poll_end() is to be called to complete the begin-end pair, it shouldn’t be done when the handlers are removed, but instead when they are reinstated. I believe that’s quite infeasible to do in generic code: We’d have to basically remember that we haven’t called a specific io_poll_end callback yet, and so once it is reinstated while we’re not in a polling phase, we would need to call it then. That in itself is already hairy, but in addition, the callback consists of a function pointer and an opaque pointer, and it seems impossible to reliably establish identity between two opaque pointers to know whether a newly instated io_poll_end callback is the same one as one that had been removed before (pointer identity may work, but also may not). That’s why I think the responsibility should fall on the caller. I believe both virtio_queue_aio_attach_host_notifier*() functions should enable vq notifications before instating the handler – if we’re in polling mode, io_poll_begin() will then immediately be called and disable vq notifications again. Having them enabled briefly shouldn’t hurt anything but performance (which I don’t think is terrible in the drain case). For completeness’ sake, we may even have virtio_queue_aio_detach_host_notifier() disable vq notifications, because otherwise it’s unknown whether notifications are enabled or disabled after removing the notifiers. That isn’t terrible, but I think (A) if any of the two, we want them to be disabled, because we won’t check for requests anyway, and (B) this gives us a clearer state machine, where removing the notifiers will always leave vq notifications disabled, so when they are reinstated (i.e. when calling virtio_queue_aio_attach_host_notifier*()), it’s clear that we must poll once to check for new requests. I’ve attached the preliminary patch that I didn’t get to send (or test much) last year. Not sure if it has the same CPU-usage-spike issue Fiona was seeing, the only functional difference is that I notify the vq after attaching the notifiers instead of before. Hanna
From 451aae74fc19a6ea5cd6381247cd9202571651e8 Mon Sep 17 00:00:00 2001 From: Hanna Czenczek Date: Wed, 6 Dec 2023 18:24:55 +0100 Subject: [PATCH] Keep notifications disabled during drain Preliminary patch with a preliminary description: During drain, we do not care about virtqueue notifications, which is why we remove the handlers on it.
When removing those handlers, whether vq notifications are enabled or not depends on whether we were in polling mode or not; if not, they are enabled (by default); if so, they have been disabled by the io_poll_begin callback. Ideally, we would rather have the vq notifications be disabled, because we will not process requests during a drained section anyway. Therefore, have virtio_queue_aio_detach_host_notifier() explicitly disable them, and virtio_queue_aio_attach_host_notifier*() re-enable them (if we are in a polling section, attaching them will invoke the io_poll_begin callback, which will re-disable them). Because we will miss virtqueue updates in the drained section, we also need to poll the virtqueue once in the drained_end functions after attaching the notifiers. --- include/block/aio.h | 7 ++- include/hw/virtio/virtio.h | 1 + hw/block/virtio-blk.c | 10 ++ hw/scsi/virtio-scsi.c | 10 ++ hw/virtio/virtio.c | 30 +- 5 files changed, 56 insertions(+), 2 deletions(-) diff --git a/include/block/aio.h b/include/block/aio.h index f08b358077..795a375ff2 100644 --- a/include/block/aio.h +++ b/include/block/aio.h @@ -497,9 +497,14 @@ void aio_set_event_notifier(AioContext *ctx, AioPollFn *io_poll, EventNotifierHandler *io_poll_ready); -/* Set polling begin/end callbacks for an event notifier that has already been +/* + * Set polling begin/end callbacks for an event notifier that has already been * registered
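A compact standalone model of the state machine this patch aims for (local names, not the actual QEMU functions): detaching always leaves notifications disabled, attaching re-enables them, and in a polling section the begin callback immediately disables them again:

#include <stdbool.h>
#include <stdio.h>

struct vq {
    bool notifications_enabled;
    bool polling;
};

static void detach_host_notifier(struct vq *vq)
{
    vq->notifications_enabled = false; /* known state after detach */
}

static void attach_host_notifier(struct vq *vq)
{
    vq->notifications_enabled = true;
    if (vq->polling) {
        /* what the io_poll_begin callback would do */
        vq->notifications_enabled = false;
    }
}

int main(void)
{
    struct vq vq = { .notifications_enabled = true, .polling = false };
    detach_host_notifier(&vq); /* drained_begin */
    attach_host_notifier(&vq); /* drained_end; caller then polls once */
    printf("notifications enabled: %d\n", vq.notifications_enabled);
    return 0;
}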
[PULL 0/3] Block patches
The following changes since commit 3e01f1147a16ca566694b97eafc941d62fa1e8d8: Merge tag 'pull-sp-20231105' of https://gitlab.com/rth7680/qemu into staging (2023-11-06 09:34:22 +0800) are available in the Git repository at: https://gitlab.com/hreitz/qemu.git tags/pull-block-2023-11-06 for you to fetch changes up to ad4feaca61d76fecad784e6d5e7bae40d0411c46: file-posix: fix over-writing of returning zone_append offset (2023-11-06 16:15:07 +0100) Block patches: - One patch to make qcow2's discard-no-unref option better do what it is supposed to do (i.e. prevent fragmentation) - Two fixes for zoned requests Jean-Louis Dupond (1): qcow2: keep reference on zeroize with discard-no-unref enabled Naohiro Aota (1): file-posix: fix over-writing of returning zone_append offset Sam Li (1): block/file-posix: fix update_zones_wp() caller qapi/block-core.json | 24 ++-- block/file-posix.c | 23 --- block/qcow2-cluster.c | 22 ++ qemu-options.hx | 10 +++--- 4 files changed, 51 insertions(+), 28 deletions(-) -- 2.41.0
[PULL 2/3] block/file-posix: fix update_zones_wp() caller
From: Sam Li When a zoned request fails, it needs to update only the wp of the target zones, so as not to disrupt in-flight writes on other zones. The wp is updated successfully after the request completes. Fix the callers to pass the right offset and nr_zones. Signed-off-by: Sam Li Message-Id: <20230825040556.4217-1-faithilike...@gmail.com> Reviewed-by: Stefan Hajnoczi [hreitz: Rebased and fixed comment spelling] Signed-off-by: Hanna Czenczek --- block/file-posix.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index 50e2b20d5c..3c0ce9c258 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2523,7 +2523,10 @@ out: } } } else { -update_zones_wp(bs, s->fd, 0, 1); +/* + * write and append write are not allowed to cross zone boundaries + */ +update_zones_wp(bs, s->fd, offset, 1); } qemu_co_mutex_unlock(&wps->colock); @@ -3470,7 +3473,7 @@ static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op, len >> BDRV_SECTOR_BITS); ret = raw_thread_pool_submit(handle_aiocb_zone_mgmt, &acb); if (ret != 0) { -update_zones_wp(bs, s->fd, offset, i); +update_zones_wp(bs, s->fd, offset, nrz); error_report("ioctl %s failed %d", op_name, ret); return ret; } -- 2.41.0
[PULL 1/3] qcow2: keep reference on zeroize with discard-no-unref enabled
From: Jean-Louis Dupond When the discard-no-unref flag is enabled, we keep the reference for normal discard requests. But when a discard is executed on a snapshot/qcow2 image with backing, the discards are saved as zero clusters in the snapshot image. When committing the snapshot to the backing file, it is not discard_in_l2_slice() that is called but zero_in_l2_slice(), which did not have any logic to keep the reference when discard-no-unref is enabled. Therefore, we add logic to the zero_in_l2_slice() call to keep the reference on commit. Fixes: https://gitlab.com/qemu-project/qemu/-/issues/1621 Signed-off-by: Jean-Louis Dupond Message-Id: <20231003125236.216473-2-jean-lo...@dupond.be> [hreitz: Made the documentation change more verbose, as discussed on-list] Signed-off-by: Hanna Czenczek --- qapi/block-core.json | 24 ++-- block/qcow2-cluster.c | 22 ++ qemu-options.hx | 10 +++--- 3 files changed, 39 insertions(+), 17 deletions(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index 99961256f2..ca390c5700 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -3528,16 +3528,20 @@ # @pass-discard-other: whether discard requests for the data source # should be issued on other occasions where a cluster gets freed # -# @discard-no-unref: when enabled, discards from the guest will not -# cause cluster allocations to be relinquished. This prevents -# qcow2 fragmentation that would be caused by such discards. -# Besides potential performance degradation, such fragmentation -# can lead to increased allocation of clusters past the end of the -# image file, resulting in image files whose file length can grow -# much larger than their guest disk size would suggest. If image -# file length is of concern (e.g. when storing qcow2 images -# directly on block devices), you should consider enabling this -# option. (since 8.1) +# @discard-no-unref: when enabled, data clusters will remain +# preallocated when they are no longer used, e.g. because they are +# discarded or converted to zero clusters. As usual, whether the +# old data is discarded or kept on the protocol level (i.e. in the +# image file) depends on the setting of the pass-discard-request +# option. Keeping the clusters preallocated prevents qcow2 +# fragmentation that would otherwise be caused by freeing and +# re-allocating them later. Besides potential performance +# degradation, such fragmentation can lead to increased allocation +# of clusters past the end of the image file, resulting in image +# files whose file length can grow much larger than their guest disk +# size would suggest. If image file length is of concern (e.g. when +# storing qcow2 images directly on block devices), you should +# consider enabling this option.
(since 8.1) # # @overlap-check: which overlap checks to perform for writes to the # image, defaults to 'cached' (since 2.2) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 904f00d1b3..5af439bd11 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -1983,7 +1983,7 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t offset, uint64_t nb_clusters, /* If we keep the reference, pass on the discard still */ bdrv_pdiscard(s->data_file, old_l2_entry & L2E_OFFSET_MASK, s->cluster_size); - } +} } qcow2_cache_put(s->l2_table_cache, (void **) &l2_slice); @@ -2061,9 +2061,15 @@ zero_in_l2_slice(BlockDriverState *bs, uint64_t offset, QCow2ClusterType type = qcow2_get_cluster_type(bs, old_l2_entry); bool unmap = (type == QCOW2_CLUSTER_COMPRESSED) || ((flags & BDRV_REQ_MAY_UNMAP) && qcow2_cluster_is_allocated(type)); -uint64_t new_l2_entry = unmap ? 0 : old_l2_entry; +bool keep_reference = +(s->discard_no_unref && type != QCOW2_CLUSTER_COMPRESSED); +uint64_t new_l2_entry = old_l2_entry; uint64_t new_l2_bitmap = old_l2_bitmap; +if (unmap && !keep_reference) { +new_l2_entry = 0; +} + if (has_subclusters(s)) { new_l2_bitmap = QCOW_L2_BITMAP_ALL_ZEROES; } else { @@ -2081,9 +2087,17 @@ zero_in_l2_slice(BlockDriverState *bs, uint64_t offset, set_l2_bitmap(s, l2_slice, l2_index + i, new_l2_bitmap); } -/* Then decrease the refcount */ if (unmap) { -qcow2_free_any_cluster(bs, old_l2_entry, QCOW2_DISCARD_REQUEST); +if (!keep_reference) { +/* Then decrease the refcount */ +qcow2_free_any_cluster(bs, old_l2_entry, QCOW2_DISCARD_REQUEST); +} else if (s->discard_passthrough[QCOW2_DISCARD_REQUEST] && + (type == QCOW2_CLUSTER_NORMAL
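The decision the zero_in_l2_slice() hunk above adds can be modeled in isolation like this (a standalone sketch; the enum and helper are local stand-ins for qcow2’s types):

#include <stdbool.h>
#include <stdio.h>

enum ctype { NORMAL, COMPRESSED, UNALLOCATED };

/* Whether a zeroed cluster keeps its allocation: only with
 * discard-no-unref enabled, and never for compressed clusters. */
static bool keep_reference(bool discard_no_unref, enum ctype type)
{
    return discard_no_unref && type != COMPRESSED;
}

int main(void)
{
    printf("%d\n", keep_reference(true, NORMAL));     /* 1: stays allocated */
    printf("%d\n", keep_reference(true, COMPRESSED)); /* 0: freed */
    printf("%d\n", keep_reference(false, NORMAL));    /* 0: freed */
    return 0;
}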
[PULL 3/3] file-posix: fix over-writing of returning zone_append offset
From: Naohiro Aota raw_co_zone_append() sets "s->offset" where "BDRVRawState *s". This pointer is used later at raw_co_prw() to save the block address where the data is written. When multiple IOs are on-going at the same time, a later IO's raw_co_zone_append() call over-writes a former IO's offset address before raw_co_prw() completes. As a result, the former zone append IO returns the initial value (= the start address of the writing zone), instead of the proper address. Fix the issue by passing the offset pointer to raw_co_prw() instead of passing it through s->offset. Also, remove "offset" from BDRVRawState as there is no usage anymore. Fixes: 4751d09adcc3 ("block: introduce zone append write for zoned devices") Signed-off-by: Naohiro Aota Message-Id: <20231030073853.2601162-1-naohiro.a...@wdc.com> Reviewed-by: Sam Li Reviewed-by: Stefan Hajnoczi Signed-off-by: Hanna Czenczek --- block/file-posix.c | 16 +++- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index 3c0ce9c258..b862406c71 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -160,7 +160,6 @@ typedef struct BDRVRawState { bool has_write_zeroes:1; bool use_linux_aio:1; bool use_linux_io_uring:1; -int64_t *offset; /* offset of zone append operation */ int page_cache_inconsistent; /* errno from fdatasync failure */ bool has_fallocate; bool needs_alignment; @@ -2445,12 +2444,13 @@ static bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov) return true; } -static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset, +static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr, uint64_t bytes, QEMUIOVector *qiov, int type) { BDRVRawState *s = bs->opaque; RawPosixAIOData acb; int ret; +uint64_t offset = *offset_ptr; if (fd_open(bs) < 0) return -EIO; @@ -2513,8 +2513,8 @@ out: uint64_t *wp = &wps->wp[offset / bs->bl.zone_size]; if (!BDRV_ZT_IS_CONV(*wp)) { if (type & QEMU_AIO_ZONE_APPEND) { -*s->offset = *wp; -trace_zbd_zone_append_complete(bs, *s->offset +*offset_ptr = *wp; +trace_zbd_zone_append_complete(bs, *offset_ptr >> BDRV_SECTOR_BITS); } /* Advance the wp if needed */ @@ -2539,14 +2539,14 @@ static int coroutine_fn raw_co_preadv(BlockDriverState *bs, int64_t offset, int64_t bytes, QEMUIOVector *qiov, BdrvRequestFlags flags) { -return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_READ); +return raw_co_prw(bs, &offset, bytes, qiov, QEMU_AIO_READ); } static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, int64_t offset, int64_t bytes, QEMUIOVector *qiov, BdrvRequestFlags flags) { -return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_WRITE); +return raw_co_prw(bs, &offset, bytes, qiov, QEMU_AIO_WRITE); } static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs) @@ -3509,8 +3509,6 @@ static int coroutine_fn raw_co_zone_append(BlockDriverState *bs, int64_t zone_size_mask = bs->bl.zone_size - 1; int64_t iov_len = 0; int64_t len = 0; -BDRVRawState *s = bs->opaque; -s->offset = offset; if (*offset & zone_size_mask) { error_report("sector offset %" PRId64 " is not aligned to zone size " @@ -3531,7 +3529,7 @@ static int coroutine_fn raw_co_zone_append(BlockDriverState *bs, } trace_zbd_zone_append(bs, *offset >> BDRV_SECTOR_BITS); -return raw_co_prw(bs, *offset, len, qiov, QEMU_AIO_ZONE_APPEND); +return raw_co_prw(bs, offset, len, qiov, QEMU_AIO_ZONE_APPEND); } #endif -- 2.41.0
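The essence of this fix is a classic pattern: per-request results must travel through per-request storage. A standalone sketch with illustrative names (not the actual file-posix code):

#include <inttypes.h>
#include <stdio.h>

struct shared_state {
    int64_t offset; /* shared by every in-flight request: racy */
};

static void complete_shared(struct shared_state *s, int64_t wp)
{
    s->offset = wp; /* a later request can overwrite an earlier result */
}

static void complete_per_request(int64_t *offset_ptr, int64_t wp)
{
    *offset_ptr = wp; /* each request owns its own result slot */
}

int main(void)
{
    struct shared_state s;
    int64_t a, b;

    complete_shared(&s, 100); /* request A's zone append completes */
    complete_shared(&s, 200); /* request B overwrites A's result */
    printf("A reads back %" PRId64 " (wrong)\n", s.offset);

    complete_per_request(&a, 100);
    complete_per_request(&b, 200);
    printf("A reads back %" PRId64 ", B reads back %" PRId64 "\n", a, b);
    return 0;
}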
Re: [PATCH] file-posix: fix over-writing of returning zone_append offset
On 30.10.23 08:38, Naohiro Aota wrote: raw_co_zone_append() sets "s->offset" where "BDRVRawState *s". This pointer is used later at raw_co_prw() to save the block address where the data is written. When multiple IOs are on-going at the same time, a later IO's raw_co_zone_append() call over-writes a former IO's offset address before raw_co_prw() completes. As a result, the former zone append IO returns the initial value (= the start address of the writing zone), instead of the proper address. Fix the issue by passing the offset pointer to raw_co_prw() instead of passing it through s->offset. Also, remove "offset" from BDRVRawState as there is no usage anymore. Fixes: 4751d09adcc3 ("block: introduce zone append write for zoned devices") Signed-off-by: Naohiro Aota --- block/file-posix.c | 16 +++- 1 file changed, 7 insertions(+), 9 deletions(-) Thanks, applied to my block branch: https://gitlab.com/hreitz/qemu/-/commits/block Hanna
Re: [PATCH v2 09/10] block: Convert qmp_query_block() to coroutine_fn
On 09.06.23 22:19, Fabiano Rosas wrote: This is another caller of bdrv_get_allocated_file_size() that needs to be converted to a coroutine because that function will be made asynchronous when called (indirectly) from the QMP dispatcher. This QMP command is a candidate because it calls bdrv_do_query_node_info(), which in turn calls bdrv_get_allocated_file_size(). We've determined bdrv_do_query_node_info() to be coroutine-safe (see previous commits), so we can just put this QMP command in a coroutine. Since qmp_query_block() now expects to run in a coroutine, its callers need to be converted as well. Convert hmp_info_block(), which calls only coroutine-safe code, including qmp_query_named_block_nodes() which has been converted to coroutine in the previous patches. Now that all callers of bdrv_[co_]block_device_info() are using the coroutine version, a few things happen: - we can return to using bdrv_block_device_info() without a wrapper; - bdrv_get_allocated_file_size() can stop being mixed; - bdrv_co_get_allocated_file_size() needs to be put under the graph lock because it is being called without the wrapper; - bdrv_do_query_node_info() doesn't need to acquire the AioContext because it doesn't call aio_poll anymore; Signed-off-by: Fabiano Rosas --- block.c | 2 +- block/monitor/block-hmp-cmds.c | 2 +- block/qapi.c | 18 +- hmp-commands-info.hx | 1 + include/block/block-hmp-cmds.h | 2 +- include/block/block-io.h | 2 +- include/block/qapi.h | 12 qapi/block-core.json | 2 +- 8 files changed, 19 insertions(+), 22 deletions(-) After this series has been sent, we got some usages of GRAPH_RDLOCK_GUARD_MAINLOOP() that no longer fit with this patch – I’ve also mentioned one case on patch 7, not yet realizing that this was a new thing. Those must now be fixed (e.g. in qmp_query_block(), or in bdrv_snapshot_list()), or they’ll crash. Hanna
Re: [PATCH v2 10/10] block: Add a thread-pool version of fstat
On 09.06.23 22:19, Fabiano Rosas wrote: From: João Silva The fstat call can take a long time to finish when running over NFS. Add a version of it that runs in the thread pool. Adapt one of its users, raw_co_get_allocated_file_size(), to use the new version. That function is called via QMP under the qemu_global_mutex, so it has a large chance of blocking VCPU threads in case it takes too long to finish. Reviewed-by: Claudio Fontana Signed-off-by: João Silva Signed-off-by: Fabiano Rosas --- block/file-posix.c | 40 +--- include/block/raw-aio.h | 4 +++- 2 files changed, 40 insertions(+), 4 deletions(-) Reviewed-by: Hanna Czenczek
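The idea of the patch, modeled standalone with plain pthreads rather than QEMU’s thread pool (names and plumbing here are stand-ins, not the raw-aio API): run the potentially slow fstat() on a worker thread so the caller is not blocked on NFS:

#include <pthread.h>
#include <stdio.h>
#include <sys/stat.h>

struct fstat_job {
    int fd;
    struct stat st;
    int ret;
};

static void *fstat_worker(void *opaque)
{
    struct fstat_job *job = opaque;
    job->ret = fstat(job->fd, &job->st); /* may block for a long time */
    return NULL;
}

int main(void)
{
    struct fstat_job job = { .fd = 0 };
    pthread_t thread;

    pthread_create(&thread, NULL, fstat_worker, &job);
    /* ... the calling thread could keep doing useful work here ... */
    pthread_join(&thread, NULL);

    printf("fstat ret=%d size=%lld\n", job.ret, (long long)job.st.st_size);
    return 0;
}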
Re: [PATCH v2 05/10] block: Convert bdrv_query_block_graph_info to coroutine
On 09.06.23 22:19, Fabiano Rosas wrote: We're converting callers of bdrv_get_allocated_file_size() to run in coroutines because that function will be made asynchronous when called (indirectly) from the QMP dispatcher. This function is a candidate because it calls bdrv_do_query_node_info(), which in turn calls bdrv_get_allocated_file_size(). All the functions called from bdrv_do_query_node_info() onwards are coroutine-safe, either have a coroutine version themselves[1] or are mostly simple code/string manipulation[2]. 1) bdrv_getlength(), bdrv_get_allocated_file_size(), bdrv_get_info(), bdrv_get_specific_info(); 2) bdrv_refresh_filename(), bdrv_get_format_name(), bdrv_get_full_backing_filename(), bdrv_query_snapshot_info_list(); Signed-off-by: Fabiano Rosas --- block/qapi.c | 12 +++- include/block/qapi.h | 6 +- qemu-img.c | 2 -- 3 files changed, 12 insertions(+), 8 deletions(-) Reviewed-by: Hanna Czenczek
Re: [PATCH v2 04/10] block: Temporarily mark bdrv_co_get_allocated_file_size as mixed
On 09.06.23 22:19, Fabiano Rosas wrote: Some callers of this function are about to be converted to run in coroutines, so allow it to be executed both inside and outside a coroutine while we convert all the callers. This will be reverted once all callers of bdrv_do_query_node_info run in a coroutine. Signed-off-by: Fabiano Rosas Reviewed-by: Eric Blake --- include/block/block-io.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: Hanna Czenczek
Re: [PATCH v2 03/10] block: Allow the wrapper script to see functions declared in qapi.h
On 09.06.23 22:19, Fabiano Rosas wrote: The following patches will add co_wrapper annotations to functions declared in qapi.h. Add that header to the set of files used by block-coroutine-wrapper.py. Signed-off-by: Fabiano Rosas --- block/meson.build | 1 + scripts/block-coroutine-wrapper.py | 1 + 2 files changed, 2 insertions(+) Reviewed-by: Hanna Czenczek
Re: [PATCH v2 09/10] block: Convert qmp_query_block() to coroutine_fn
On 09.06.23 22:19, Fabiano Rosas wrote: This is another caller of bdrv_get_allocated_file_size() that needs to be converted to a coroutine because that function will be made asynchronous when called (indirectly) from the QMP dispatcher. This QMP command is a candidate because it calls bdrv_do_query_node_info(), which in turn calls bdrv_get_allocated_file_size(). We've determined bdrv_do_query_node_info() to be coroutine-safe (see previous commits), so we can just put this QMP command in a coroutine. Since qmp_query_block() now expects to run in a coroutine, its callers need to be converted as well. Convert hmp_info_block(), which calls only coroutine-safe code, including qmp_query_named_block_nodes() which has been converted to coroutine in the previous patches. Now that all callers of bdrv_[co_]block_device_info() are using the coroutine version, a few things happen: - we can return to using bdrv_block_device_info() without a wrapper; - bdrv_get_allocated_file_size() can stop being mixed; - bdrv_co_get_allocated_file_size() needs to be put under the graph lock because it is being called without the wrapper; But bdrv_do_query_node_info() is marked GRAPH_RDLOCK, so the whole function must not be called without holding the lock. I don’t understand why we need to explicitly take it another time. - bdrv_do_query_node_info() doesn't need to acquire the AioContext because it doesn't call aio_poll anymore; In the past (very likely outdated, but still mentioning it) you’d need to take the AioContext just in general when operating on a block device that might be processed in a different AioContext than the main one, and the current code runs in the main context, which is the situation we have here. Speaking of contexts, I wonder how the threading is actually supposed to work. I assume QMP coroutines run in the main thread, so now we run bdrv_co_get_allocated_file_size() in the main thread – is that correct, or do we need to use bdrv_co_enter() like qmp_block_resize() does? And so, if we run it in the main thread, is it OK not to acquire the AioContext around it to prevent interference from a potential I/O thread? Signed-off-by: Fabiano Rosas --- block.c | 2 +- block/monitor/block-hmp-cmds.c | 2 +- block/qapi.c | 18 +- hmp-commands-info.hx | 1 + include/block/block-hmp-cmds.h | 2 +- include/block/block-io.h | 2 +- include/block/qapi.h | 12 qapi/block-core.json | 2 +- 8 files changed, 19 insertions(+), 22 deletions(-) This patch implicitly assumes that quite a lot of functions (at least bdrv_query_info(), bdrv_query_image_info(), bdrv_do_query_node_info()) are now run in coroutine context. This assumption must be formalized by annotating them all with coroutine_fn, and ideally adding a _co_ into their name. Also, those functions should be checked as to whether they call coroutine wrappers, and made to use the native coroutine version now if so. (At least I’d find that nicer, FWIW.) I’ve seen at least bdrv_getlength() in bdrv_do_query_node_info(), which could be a bdrv_co_getlength().
diff --git a/block.c b/block.c index abed744b60..f94cee8930 100644 --- a/block.c +++ b/block.c @@ -6148,7 +6148,7 @@ BlockDeviceInfoList *bdrv_named_nodes_list(bool flat, list = NULL; QTAILQ_FOREACH(bs, &graph_bdrv_states, node_list) { -BlockDeviceInfo *info = bdrv_co_block_device_info(NULL, bs, flat, errp); +BlockDeviceInfo *info = bdrv_block_device_info(NULL, bs, flat, errp); if (!info) { qapi_free_BlockDeviceInfoList(list); return NULL; diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c index 26116fe831..1049f9b006 100644 --- a/block/monitor/block-hmp-cmds.c +++ b/block/monitor/block-hmp-cmds.c @@ -742,7 +742,7 @@ static void print_block_info(Monitor *mon, BlockInfo *info, } } -void hmp_info_block(Monitor *mon, const QDict *qdict) +void coroutine_fn hmp_info_block(Monitor *mon, const QDict *qdict) { BlockInfoList *block_list, *info; BlockDeviceInfoList *blockdev_list, *blockdev; diff --git a/block/qapi.c b/block/qapi.c index 20660e15d6..3b4bc0b782 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -41,10 +41,10 @@ #include "qemu/qemu-print.h" #include "sysemu/block-backend.h" -BlockDeviceInfo *coroutine_fn bdrv_co_block_device_info(BlockBackend *blk, -BlockDriverState *bs, -bool flat, -Error **errp) +BlockDeviceInfo *coroutine_fn bdrv_block_device_info(BlockBackend *blk, + BlockDriverState *bs, + bool flat, +
Re: [PATCH v2 08/10] block: Don't query all block devices at hmp_nbd_server_start
On 09.06.23 22:19, Fabiano Rosas wrote: We're currently doing a full query-block just to enumerate the devices for qmp_nbd_server_add and then discarding the BlockInfoList afterwards. Alter hmp_nbd_server_start to instead iterate explicitly over the block_backends list. This allows the removal of the dependency on qmp_query_block from hmp_nbd_server_start. This is desirable because we're about to move qmp_query_block into a coroutine and don't need to change the NBD code at the same time. Signed-off-by: Fabiano Rosas --- block/monitor/block-hmp-cmds.c | 20 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c index ca2599de44..26116fe831 100644 --- a/block/monitor/block-hmp-cmds.c +++ b/block/monitor/block-hmp-cmds.c @@ -394,7 +394,7 @@ void hmp_nbd_server_start(Monitor *mon, const QDict *qdict) bool writable = qdict_get_try_bool(qdict, "writable", false); bool all = qdict_get_try_bool(qdict, "all", false); Error *local_err = NULL; -BlockInfoList *block_list, *info; +BlockBackend *blk; SocketAddress *addr; NbdServerAddOptions export; @@ -419,18 +419,24 @@ void hmp_nbd_server_start(Monitor *mon, const QDict *qdict) return; } -/* Then try adding all block devices. If one fails, close all and +/* + * Then try adding all block devices. If one fails, close all and * exit. */ -block_list = qmp_query_block(NULL); +for (blk = blk_all_next(NULL); blk; blk = blk_all_next(blk)) { +BlockDriverState *bs = blk_bs(blk); -for (info = block_list; info; info = info->next) { -if (!info->value->inserted) { +if (!*blk_name(blk) && !blk_get_attached_dev(blk)) { I’d like a comment here that historically, we’ve used qmp_query_block() here, and this is the same condition that it uses. (Otherwise, it’s hard to see why it matters whether a device is attached or not.) +continue; +} + +bs = bdrv_skip_implicit_filters(bs); +if (!bs || !bs->drv) { Same here. Just checking blk_is_inserted() would make more sense in this place, but if you want to absolutely keep behavior unchanged, then there should be a comment here why this check is done (because bdrv_query_info() does it to determine whether info->inserted should be set, which was historically used to determine whether this BlockBackend can be exported). continue; } export = (NbdServerAddOptions) { -.device = info->value->device, +.device = g_strdup(blk_name(blk)), Do we need to g_strdup() here? We didn’t before, so I think this will leak export.device. I know bdrv_query_info() uses g_strdup(), but that was released by the qapi_free_BlockInfoList() below, which is now removed without replacement. (On that note, it also looks like qmp_nbd_server_add() can leak arg->name (i.e. device.name) if it is not set by the caller. It also uses g_strdup() there, but never frees it. It does free the export_opts it creates, and `arg` is put into it, but as a deep copy, so anything in `arg` is leaked.) Hanna .has_writable = true, .writable = writable, }; @@ -443,8 +449,6 @@ void hmp_nbd_server_start(Monitor *mon, const QDict *qdict) } } -qapi_free_BlockInfoList(block_list); - exit: hmp_handle_error(mon, local_err); }
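On the g_strdup() ownership question, a standalone sketch of the leak pattern under discussion (illustrative struct, not the real NbdServerAddOptions):

#include <glib.h>
#include <stdio.h>

typedef struct Export {
    char *device;
} Export;

int main(void)
{
    const char *name = "drive0";

    /* Filling the field with a copy means something must free it later;
     * borrowing the original string (no g_strdup()) avoids the question. */
    Export e = { .device = g_strdup(name) };
    printf("%s\n", e.device);
    g_free(e.device); /* the counterpart the review finds missing */
    return 0;
}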
Re: [PATCH v2 06/10] block: Convert bdrv_block_device_info into co_wrapper
On 09.06.23 22:19, Fabiano Rosas wrote: We're converting callers of bdrv_get_allocated_file_size() to run in coroutines because that function will be made asynchronous when called (indirectly) from the QMP dispatcher. This function is a candidate because it calls bdrv_query_image_info() -> bdrv_do_query_node_info() -> bdrv_get_allocated_file_size(). It is safe to turn this is a coroutine because the code it calls is made up of either simple accessors and string manipulation functions [1] or it has already been determined to be safe [2]. 1) bdrv_refresh_filename(), bdrv_is_read_only(), blk_enable_write_cache(), bdrv_cow_bs(), blk_get_public(), throttle_group_get_name(), bdrv_write_threshold_get(), bdrv_query_dirty_bitmaps(), throttle_group_get_config(), bdrv_filter_or_cow_bs(), bdrv_skip_implicit_filters() 2) bdrv_do_query_node_info() (see previous commit); Signed-off-by: Fabiano Rosas --- block/qapi.c | 8 include/block/qapi.h | 12 2 files changed, 12 insertions(+), 8 deletions(-) diff --git a/block/qapi.c b/block/qapi.c index a2e71edaff..20660e15d6 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -41,10 +41,10 @@ #include "qemu/qemu-print.h" #include "sysemu/block-backend.h" -BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk, -BlockDriverState *bs, -bool flat, -Error **errp) +BlockDeviceInfo *coroutine_fn bdrv_co_block_device_info(BlockBackend *blk, +BlockDriverState *bs, +bool flat, +Error **errp) { ImageInfo **p_image_info; ImageInfo *backing_info; diff --git a/include/block/qapi.h b/include/block/qapi.h index 7035bcd1ae..5cb0202791 100644 --- a/include/block/qapi.h +++ b/include/block/qapi.h @@ -30,10 +30,14 @@ #include "block/snapshot.h" #include "qapi/qapi-types-block-core.h" -BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk, -BlockDriverState *bs, -bool flat, -Error **errp); +BlockDeviceInfo *coroutine_fn bdrv_co_block_device_info(BlockBackend *blk, +BlockDriverState *bs, +bool flat, +Error **errp); +BlockDeviceInfo *co_wrapper bdrv_block_device_info(BlockBackend *blk, + BlockDriverState *bs, + bool flat, + Error **errp); bdrv_co_block_device_info() is now marked as GRAPH_RDLOCK, so should this use a co_wrapper_bdrv_rdlock instead? Hanna int bdrv_query_snapshot_info_list(BlockDriverState *bs, SnapshotInfoList **p_list, Error **errp);
Re: [PATCH v2 07/10] block: Convert qmp_query_named_block_nodes to coroutine
On 09.06.23 22:19, Fabiano Rosas wrote: From: Lin Ma We're converting callers of bdrv_get_allocated_file_size() to run in coroutines because that function will be made asynchronous when called (indirectly) from the QMP dispatcher. This QMP command is a candidate because it indirectly calls bdrv_get_allocated_file_size() through bdrv_block_device_info() -> bdrv_query_image_info() -> bdrv_do_query_node_info(). The previous patches have determined that bdrv_query_image_info() and bdrv_do_query_node_info() are coroutine-safe so we can just make the QMP command run in a coroutine. Signed-off-by: Lin Ma Signed-off-by: Fabiano Rosas --- block.c | 2 +- blockdev.c | 6 +++--- qapi/block-core.json | 3 ++- 3 files changed, 6 insertions(+), 5 deletions(-) (I see patch 9 does something with HMP code, but) hmp_info_block() calls qmp_query_named_block_nodes(), and I don’t think it may call such a coroutine_fn directly. diff --git a/block.c b/block.c index f94cee8930..abed744b60 100644 --- a/block.c +++ b/block.c @@ -6148,7 +6148,7 @@ BlockDeviceInfoList *bdrv_named_nodes_list(bool flat, list = NULL; QTAILQ_FOREACH(bs, &graph_bdrv_states, node_list) { -BlockDeviceInfo *info = bdrv_block_device_info(NULL, bs, flat, errp); +BlockDeviceInfo *info = bdrv_co_block_device_info(NULL, bs, flat, errp); As far as I understand, only functions marked as coroutine_fn may call other coroutine_fn. Regardless of whether bdrv_named_nodes_list() is only called by another coroutine_fn, we still have to mark it as coroutine_fn, too (and probably rename it to bdrv_co_named_nodes_list()). Also, this function (bdrv_named_nodes_list()) uses GRAPH_RDLOCK_GUARD_MAINLOOP(). Is that the correct thing to use in a coroutine context? Hanna if (!info) { qapi_free_BlockDeviceInfoList(list); return NULL; diff --git a/blockdev.c b/blockdev.c index e6eba61484..8b5f7d06c8 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2818,9 +2818,9 @@ void qmp_drive_backup(DriveBackup *backup, Error **errp) blockdev_do_action(&action, errp); } -BlockDeviceInfoList *qmp_query_named_block_nodes(bool has_flat, - bool flat, - Error **errp) +BlockDeviceInfoList *coroutine_fn qmp_query_named_block_nodes(bool has_flat, + bool flat, + Error **errp) { bool return_flat = has_flat && flat; diff --git a/qapi/block-core.json b/qapi/block-core.json index 5dd5f7e4b0..9d4c92f2c9 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -1972,7 +1972,8 @@ { 'command': 'query-named-block-nodes', 'returns': [ 'BlockDeviceInfo' ], 'data': { '*flat': 'bool' }, - 'allow-preconfig': true } + 'allow-preconfig': true, + 'coroutine': true} ## # @XDbgBlockGraphNodeType:
Re: [PATCH 7/7] iotests/271: check disk usage on subcluster-based discard/unmap
On 03.11.23 16:51, Hanna Czenczek wrote: On 20.10.23 23:56, Andrey Drobyshev wrote: [...] @@ -528,6 +543,14 @@ for use_backing_file in yes no; do else _make_test_img -o extended_l2=on 1M fi + # Write cluster #0 and discard its subclusters #0-#3 + $QEMU_IO -c 'write -q 0 64k' "$TEST_IMG" + before=$(disk_usage "$TEST_IMG") + $QEMU_IO -c 'discard -q 0 8k' "$TEST_IMG" + after=$(disk_usage "$TEST_IMG") + _verify_du_delta $before $after 8192 + alloc="$(seq 4 31)"; zero="$(seq 0 3)" + _verify_l2_bitmap 0 # Write clusters #0-#2 and then discard them $QEMU_IO -c 'write -q 0 128k' "$TEST_IMG" $QEMU_IO -c 'discard -q 0 128k' "$TEST_IMG" Similarly to above, I think it would be good if we combined this following case with the one you added, i.e. to write 128k from the beginning, drop the write here, and change the discard to be “discard -q 8k 120k”, i.e. skip the subclusters we have already discarded, to see that this is still combined to discard the whole first cluster. ...Ah, see, and when I try this, the following assertion fails: qemu-io: ../block/qcow2-cache.c:156: qcow2_cache_destroy: Assertion `c->entries[i].ref == 0' failed. ./common.rc: line 220: 128894 Aborted (core dumped) ( VALGRIND_QEMU="${VALGRIND_QEMU_IO}" _qemu_proc_exec "${VALGRIND_LOGFILE}" "$QEMU_IO_PROG" $QEMU_IO_ARGS "$@" ) Looks like an L2 table is leaked somewhere. That’s why SCRI should be a g_auto()-able type. Forgot to add: This single test case here is the only place where we test the added functionality. I think there should be more cases. It doesn’t really make sense now that 271 has so many cases for writing zeroes, but so few for discarding, now that discarding works on subclusters. Most of them should at least be considered whether we can run them for discard as well. I didn’t want to push for such an extensive set of tests, but, well, now it turned out I overlooked a bug in patch 4, and only found it because I thought “this place might also make a nice test case for this series”. Hanna
Re: [PATCH 4/7] qcow2: make subclusters discardable
On 20.10.23 23:56, Andrey Drobyshev wrote: This commit makes the discard operation work on the subcluster level rather than the cluster level. It introduces the discard_l2_subclusters() function and makes use of it in the qcow2 discard implementation, much like it's done with zero_in_l2_slice() / zero_l2_subclusters(). It also changes the qcow2 driver pdiscard_alignment to subcluster_size. That way subcluster-aligned discards lead to an actual fallocate(PUNCH_HOLE) operation and free host disk space. This feature will let us gain additional disk space on guest TRIM/discard requests, especially when using large enough clusters (1M, 2M) with subclusters enabled. Signed-off-by: Andrey Drobyshev --- block/qcow2-cluster.c | 100 -- block/qcow2.c | 8 ++-- 2 files changed, 101 insertions(+), 7 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 7c6fa5524c..cf40f2dc12 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -2042,6 +2042,74 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t offset, uint64_t nb_clusters, return nb_clusters; } +static int coroutine_fn GRAPH_RDLOCK +discard_l2_subclusters(BlockDriverState *bs, uint64_t offset, + uint64_t nb_subclusters, + enum qcow2_discard_type type, + bool full_discard, + SubClusterRangeInfo *pscri) +{ +BDRVQcow2State *s = bs->opaque; +uint64_t new_l2_bitmap, l2_bitmap_mask; +int ret, sc = offset_to_sc_index(s, offset); +SubClusterRangeInfo scri = { 0 }; + +if (!pscri) { +ret = get_sc_range_info(bs, offset, nb_subclusters, &scri); +if (ret < 0) { +goto out; +} +} else { +scri = *pscri; +} + +l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters); +new_l2_bitmap = scri.l2_bitmap; +new_l2_bitmap &= ~l2_bitmap_mask; + +/* + * If there're no allocated subclusters left, we might as well discard + * the entire cluster. That way we'd also update the refcount table. + */ +if (!(new_l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC)) { +return discard_in_l2_slice(bs, + QEMU_ALIGN_DOWN(offset, s->cluster_size), + 1, type, full_discard); +} scri.l2_slice is leaked here. Hanna
Re: [PATCH 7/7] iotests/271: check disk usage on subcluster-based discard/unmap
On 20.10.23 23:56, Andrey Drobyshev wrote: Add _verify_du_delta() checker which is used to check that real disk usage delta meets the expectations. For now we use it for checking that subcluster-based discard/unmap operations lead to actual disk usage decrease (i.e. PUNCH_HOLE operation is performed). I’m not too happy about checking the disk usage because that relies on the underlying filesystem actually accepting and executing the unmap. Why is it not enough to check the L2 bitmap? …Coming back later (I had to fix the missing `ret = ` I mentioned in patch 2, or this test would hang, so I couldn’t run it at first), I note that checking the disk usage in fact doesn’t work on tmpfs. I usually run the iotests in tmpfs, so that’s not great. Also add separate test case for discarding particular subcluster within one cluster. Signed-off-by: Andrey Drobyshev --- tests/qemu-iotests/271 | 25 - tests/qemu-iotests/271.out | 2 ++ 2 files changed, 26 insertions(+), 1 deletion(-) diff --git a/tests/qemu-iotests/271 b/tests/qemu-iotests/271 index c7c2cadda0..5fcb209f5f 100755 --- a/tests/qemu-iotests/271 +++ b/tests/qemu-iotests/271 @@ -81,6 +81,15 @@ _verify_l2_bitmap() fi } +# Check disk usage delta after a discard/unmap operation +# _verify_du_delta $before $after $expected_delta +_verify_du_delta() +{ +if [ $(($1 - $2)) -ne $3 ]; then +printf "ERROR: unexpected delta: $1 - $2 = $(($1 - $2)) != $3\n" +fi +} + # This should be called as _run_test c=XXX sc=XXX off=XXX len=XXX cmd=XXX # c: cluster number (0 if unset) # sc: subcluster number inside cluster @c (0 if unset) @@ -198,9 +207,12 @@ for use_backing_file in yes no; do alloc="$(seq 0 31)"; zero="" _run_test sc=0 len=64k -### Zero and unmap half of cluster #0 (this won't unmap it) +### Zero and unmap half of cluster #0 (this will unmap it) I think “it” refers to the cluster, and it is not unmapped. This test case does not use a discard, but write -z instead, so it worked before. (The L2 bitmap shown in the output doesn’t change, so functionally, this patch series didn’t change this case.) alloc="$(seq 16 31)"; zero="$(seq 0 15)" +before=$(disk_usage "$TEST_IMG") _run_test sc=0 len=32k cmd=unmap +after=$(disk_usage "$TEST_IMG") +_verify_du_delta $before $after 32768 ### Zero and unmap cluster #0 alloc=""; zero="$(seq 0 31)" For this following case shown truncated here, why don’t we try “_run_test sc=16 len=32k cmd=unmap” instead of “sc=0 len=64k”? I.e. unmap only the second half, which, thanks to patch 3, should still unmap the whole cluster, because the first half is already unmapped. 
@@ -447,7 +459,10 @@ for use_backing_file in yes no; do # Subcluster-aligned request from clusters #12 to #14 alloc="$(seq 0 15)"; zero="$(seq 16 31)" +before=$(disk_usage "$TEST_IMG") _run_test c=12 sc=16 len=128k cmd=unmap +after=$(disk_usage "$TEST_IMG") +_verify_du_delta $before $after $((128 * 1024)) alloc=""; zero="$(seq 0 31)" _verify_l2_bitmap 13 alloc="$(seq 16 31)"; zero="$(seq 0 15)" @@ -528,6 +543,14 @@ for use_backing_file in yes no; do else _make_test_img -o extended_l2=on 1M fi +# Write cluster #0 and discard its subclusters #0-#3 +$QEMU_IO -c 'write -q 0 64k' "$TEST_IMG" +before=$(disk_usage "$TEST_IMG") +$QEMU_IO -c 'discard -q 0 8k' "$TEST_IMG" +after=$(disk_usage "$TEST_IMG") +_verify_du_delta $before $after 8192 +alloc="$(seq 4 31)"; zero="$(seq 0 3)" +_verify_l2_bitmap 0 # Write clusters #0-#2 and then discard them $QEMU_IO -c 'write -q 0 128k' "$TEST_IMG" $QEMU_IO -c 'discard -q 0 128k' "$TEST_IMG" Similarly to above, I think it would be good if we combined this following case with the one you added, i.e. to write 128k from the beginning, drop the write here, and change the discard to be “discard -q 8k 120k”, i.e. skip the subclusters we have already discarded, to see that this is still combined to discard the whole first cluster. ...Ah, see, and when I try this, the following assertion fails: qemu-io: ../block/qcow2-cache.c:156: qcow2_cache_destroy: Assertion `c->entries[i].ref == 0' failed. ./common.rc: line 220: 128894 Aborted (core dumped) ( VALGRIND_QEMU="${VALGRIND_QEMU_IO}" _qemu_proc_exec "${VALGRIND_LOGFILE}" "$QEMU_IO_PROG" $QEMU_IO_ARGS "$@" ) Looks like an L2 table is leaked somewhere. That’s why SCRI should be a g_auto()-able type. Hanna diff --git a/tests/qemu-iotests/271.out b/tests/qemu-iotests/271.out index 5be780de76..0da8d72cde 100644 --- a/tests/qemu-iotests/271.out +++ b/tests/qemu-iotests/271.out @@ -426,6 +426,7 @@ L2 entry #29: 0x ### Discarding clusters with non-zero bitmaps (backing file: yes) ### Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
Re: [PATCH 6/7] iotests/common.rc: add disk_usage function
On 20.10.23 23:56, Andrey Drobyshev wrote: Move the definition from iotests/250 to common.rc. This is used to detect real disk usage of sparse files. In particular, we want to use it for checking subclusters-based discards. Signed-off-by: Andrey Drobyshev --- tests/qemu-iotests/250 | 5 - tests/qemu-iotests/common.rc | 6 ++ 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/tests/qemu-iotests/250 b/tests/qemu-iotests/250 index af48f83aba..c0a0dbc0ff 100755 --- a/tests/qemu-iotests/250 +++ b/tests/qemu-iotests/250 @@ -52,11 +52,6 @@ _unsupported_imgopts data_file # bdrv_co_truncate(bs->file) call in qcow2_co_truncate(), which might succeed # anyway. -disk_usage() -{ -du --block-size=1 $1 | awk '{print $1}' -} - size=2100M _make_test_img -o "cluster_size=1M,preallocation=metadata" $size diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc index 95c12577dd..5d2ea26c7f 100644 --- a/tests/qemu-iotests/common.rc +++ b/tests/qemu-iotests/common.rc @@ -140,6 +140,12 @@ _optstr_add() fi } +# report real disk usage for sparse files +disk_usage() +{ +du --block-size=1 $1 | awk '{print $1}' Pre-existing, but since you’re touching this now: Can you please change the $1 to "$1"? Hanna +} + # Set the variables to the empty string to turn Valgrind off # for specific processes, e.g. # $ VALGRIND_QEMU_IO= ./check -qcow2 -valgrind 015
Re: [PATCH 5/7] qcow2: zero_l2_subclusters: fall through to discard operation when requested
On 20.10.23 23:56, Andrey Drobyshev wrote: When zeroizing subclusters within a single cluster, detect usage of the BDRV_REQ_MAY_UNMAP flag and fall through to the subcluster-based discard operation, much like it's done with the cluster-based discards. That way subcluster-aligned operations "qemu-io -c 'write -z -u ...'" will lead to an actual unmap. Ever since the introduction of discard-no-unref, I wonder whether that is actually an advantage or not. I can’t think of a reason why it’d be advantageous to drop the allocation. On the other hand, zero_in_l2_slice() does it indeed, so consistency is a reasonable argument. Signed-off-by: Andrey Drobyshev --- block/qcow2-cluster.c | 17 ++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index cf40f2dc12..040251f2c3 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -2242,7 +2242,7 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset, unsigned nb_subclusters, int flags) { BDRVQcow2State *s = bs->opaque; -uint64_t new_l2_bitmap; +uint64_t new_l2_bitmap, l2_bitmap_mask; int ret, sc = offset_to_sc_index(s, offset); SubClusterRangeInfo scri = { 0 }; @@ -2251,9 +2251,10 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset, goto out; } +l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters); “l2_bitmap_mask” already wasn’t a great name in patch 4 because it doesn’t say what mask it is, but in patch 4, there was at least only one relevant mask. Here, we have two (ALLOC_RANGE and ZERO_RANGE), so the name should reflect what kind of mask it is. new_l2_bitmap = scri.l2_bitmap; -new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters); -new_l2_bitmap &= ~QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters); +new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters); +new_l2_bitmap &= ~l2_bitmap_mask; /* * If there're no non-zero subclusters left, we might as well zeroize @@ -2266,6 +2267,16 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset, 1, flags); } +/* + * If the request allows discarding subclusters and they're actually + * allocated, we go down the discard path since after the discard + * operation the subclusters are going to be read as zeroes anyway. + */ +if ((flags & BDRV_REQ_MAY_UNMAP) && (scri.l2_bitmap & l2_bitmap_mask)) { +return discard_l2_subclusters(bs, offset, nb_subclusters, + QCOW2_DISCARD_REQUEST, false, &scri); +} I wonder why it matters whether any subclusters are allocated, i.e. why can’t we just immediately use discard_l2_subclusters() whenever BDRV_REQ_MAY_UNMAP is set, without even fetching SCRI (that would also save us passing SCRI here, which I’ve already said on patch 4, I don’t find great). Doesn’t discard_l2_subclusters() guarantee the clusters read as zero when full_discard=false? There is this case where it won’t mark the subclusters as zero if there is no backing file, and none of the subclusters is allocated. But still, (A) those subclusters will read as zero, and (B) if there is a problem there, why don’t we just always mark those subclusters as zero? This optimization only has effect when the guest discards/zeroes subclusters (not whole clusters) that were already discarded, so sounds minuscule. Hanna + if (new_l2_bitmap != scri.l2_bitmap) { set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap); qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice);
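For reference, the subcluster bitmap arithmetic in the hunks above can be modeled standalone like this (local helpers that mirror the semantics of QCOW_OFLAG_SUB_ALLOC_RANGE()/QCOW_OFLAG_SUB_ZERO_RANGE(), not the actual macros):

#include <stdint.h>
#include <stdio.h>

/* Low 32 bits: "allocated" flags; high 32 bits: "reads as zero" flags. */
static uint64_t sub_alloc_range(int from, int to)
{
    return ((1ULL << (to - from)) - 1) << from;
}

static uint64_t sub_zero_range(int from, int to)
{
    return sub_alloc_range(from, to) << 32;
}

int main(void)
{
    uint64_t bitmap = sub_alloc_range(0, 32); /* fully allocated cluster */
    int sc = 0, n = 4;                        /* zeroize subclusters 0-3 */

    bitmap |= sub_zero_range(sc, sc + n);     /* set the zero bits */
    bitmap &= ~sub_alloc_range(sc, sc + n);   /* clear the alloc bits */
    printf("bitmap: %016llx\n", (unsigned long long)bitmap);
    return 0;
}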
Re: [PATCH] block-jobs: add final flush
On 01.11.23 20:53, Vladimir Sementsov-Ogievskiy wrote: On 31.10.23 17:05, Hanna Czenczek wrote: On 04.10.23 15:56, Vladimir Sementsov-Ogievskiy wrote: From: Vladimir Sementsov-Ogievskiy Actually, a block job is not complete without the final flush. It's rather unexpected to end up with a broken target when the job completed successfully long ago, and we then fail to flush, or the process crashed or was killed. The mirror job already has mirror_flush() for this, so it's OK. Add similar handling for the other jobs: backup, stream, commit. Note that stream has a (documented) different treatment of the IGNORE action: it doesn't retry the operation, but continues execution and reports the error at the end. We keep that behavior for the final flush too. Signed-off-by: Vladimir Sementsov-Ogievskiy --- Was: [PATCH v4] block-jobs: flush target at the end of .run() But now rewritten. Supersedes: <20230725174008.1147467-1-vsement...@yandex-team.ru> block/backup.c | 2 +- block/block-copy.c | 7 +++ block/commit.c | 16 block/stream.c | 21 + include/block/block-copy.h | 1 + 5 files changed, 38 insertions(+), 9 deletions(-) [...] diff --git a/block/commit.c b/block/commit.c index aa45beb0f0..5205c77ec9 100644 --- a/block/commit.c +++ b/block/commit.c [...] @@ -187,7 +187,15 @@ static int coroutine_fn commit_run(Job *job, Error **errp) } } - return 0; + do { + ret = blk_co_flush(s->base); + if (ret < 0) { + action = block_job_error_action(&s->common, s->on_error, + false, -ret); + } + } while (ret < 0 && action != BLOCK_ERROR_ACTION_REPORT); Do we need to yield in this loop somewhere so that BLOCK_ERROR_ACTION_STOP can pause the job? block_job_error_action() calls job_pause_locked() itself in this case. But that doesn’t really pause the job, does it? As far as I understand, it increases job->pause_count, then enters the job, and the job is then supposed to yield at some point so job_pause_point_locked() is called, which sees the increased job->pause_count and will actually pause the job. Hanna
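To make the pause-point question concrete, a standalone model (local types, not QEMU’s job layer) of why a STOP error action needs the loop to contain an explicit pause point:

#include <stdbool.h>
#include <stdio.h>

struct job {
    int pause_count; /* incremented by a pause request */
    bool paused;
};

/* Only here does a pause request take effect; in the real code this is
 * where the job coroutine would yield. */
static void job_pause_point(struct job *job)
{
    if (job->pause_count > 0) {
        job->paused = true;
    }
}

int main(void)
{
    struct job job = { 0, false };

    job.pause_count++; /* what a STOP error action effectively does */
    /* Without a pause point inside the retry loop, job.paused would
     * never become true, no matter how often the loop iterates. */
    job_pause_point(&job);
    printf("paused: %d\n", job.paused);
    return 0;
}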
Re: [PATCH 4/7] qcow2: make subclusters discardable
(Sorry, opened another reply window, forgot I already had one open...) On 20.10.23 23:56, Andrey Drobyshev wrote: This commit makes the discard operation work on the subcluster level rather than the cluster level. It introduces the discard_l2_subclusters() function and makes use of it in the qcow2 discard implementation, much like it's done with zero_in_l2_slice() / zero_l2_subclusters(). It also changes the qcow2 driver pdiscard_alignment to subcluster_size. That way subcluster-aligned discards lead to an actual fallocate(PUNCH_HOLE) operation and free host disk space. This feature will let us gain additional disk space on guest TRIM/discard requests, especially when using large enough clusters (1M, 2M) with subclusters enabled. Signed-off-by: Andrey Drobyshev --- block/qcow2-cluster.c | 100 -- block/qcow2.c | 8 ++-- 2 files changed, 101 insertions(+), 7 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 7c6fa5524c..cf40f2dc12 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -2042,6 +2042,74 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t offset, uint64_t nb_clusters, return nb_clusters; } +static int coroutine_fn GRAPH_RDLOCK +discard_l2_subclusters(BlockDriverState *bs, uint64_t offset, + uint64_t nb_subclusters, + enum qcow2_discard_type type, + bool full_discard, + SubClusterRangeInfo *pscri) +{ +BDRVQcow2State *s = bs->opaque; +uint64_t new_l2_bitmap, l2_bitmap_mask; +int ret, sc = offset_to_sc_index(s, offset); +SubClusterRangeInfo scri = { 0 }; + +if (!pscri) { +ret = get_sc_range_info(bs, offset, nb_subclusters, &scri); +if (ret < 0) { +goto out; +} +} else { +scri = *pscri; Allowing this to be taken from the caller sounds dangerous, considering we need to track who takes care of freeing scri.l2_slice. +} + +l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters); +new_l2_bitmap = scri.l2_bitmap; +new_l2_bitmap &= ~l2_bitmap_mask; + +/* + * If there're no allocated subclusters left, we might as well discard + * the entire cluster. That way we'd also update the refcount table. + */ +if (!(new_l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC)) { What if there are subclusters in the cluster that are marked as zero, outside of the discarded range? It sounds wrong to apply a discard with either full_discard set or cleared to them. +return discard_in_l2_slice(bs, + QEMU_ALIGN_DOWN(offset, s->cluster_size), + 1, type, full_discard); +} + +/* + * Full discard means we fall through to the backing file, thus we only + * need to mark the subclusters as deallocated. I think it also means we need to clear the zero bits. Hanna + * + * Non-full discard means subclusters should be explicitly marked as + * zeroes. In this case QCOW2 specification requires the corresponding + * allocation status bits to be unset as well. If the subclusters are + * deallocated in the first place and there's no backing, the operation + * can be skipped. + */ +if (!full_discard && +(bs->backing || scri.l2_bitmap & l2_bitmap_mask)) { +new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters); +}
Re: [PATCH 4/7] qcow2: make subclusters discardable
On 20.10.23 23:56, Andrey Drobyshev wrote: This commit makes the discard operation work on the subcluster level rather than the cluster level. It introduces the discard_l2_subclusters() function and makes use of it in the qcow2 discard implementation, much like it's done with zero_in_l2_slice() / zero_l2_subclusters(). It also changes the qcow2 driver pdiscard_alignment to subcluster_size. That way subcluster-aligned discards lead to an actual fallocate(PUNCH_HOLE) operation and free host disk space. This feature will let us gain additional disk space on guest TRIM/discard requests, especially when using large enough clusters (1M, 2M) with subclusters enabled. Signed-off-by: Andrey Drobyshev --- block/qcow2-cluster.c | 100 -- block/qcow2.c | 8 ++-- 2 files changed, 101 insertions(+), 7 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 7c6fa5524c..cf40f2dc12 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c [...] +if (scri.l2_bitmap != new_l2_bitmap) { +set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap); +qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice); +} + +if (s->discard_passthrough[type]) { +qcow2_queue_discard(bs, (scri.l2_entry & L2E_OFFSET_MASK) + +offset_into_cluster(s, offset), +nb_subclusters * s->subcluster_size); Are we sure that the cluster is allocated, i.e. that scri.l2_entry & L2E_OFFSET_MASK != 0? As a side note, I guess discard_in_l2_slice() should also use qcow2_queue_discard() instead of bdrv_pdiscard() then. +} + +ret = 0; +out: +qcow2_cache_put(s->l2_table_cache, (void **) &scri.l2_slice); + +return ret; +} + int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset, uint64_t bytes, enum qcow2_discard_type type, bool full_discard) @@ -2049,19 +2117,36 @@ int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset, BDRVQcow2State *s = bs->opaque; uint64_t end_offset = offset + bytes; uint64_t nb_clusters; +unsigned head, tail; int64_t cleared; int ret; /* Caller must pass aligned values, except at image end */ -assert(QEMU_IS_ALIGNED(offset, s->cluster_size)); -assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size) || +assert(QEMU_IS_ALIGNED(offset, s->subcluster_size)); +assert(QEMU_IS_ALIGNED(end_offset, s->subcluster_size) || end_offset == bs->total_sectors << BDRV_SECTOR_BITS); -nb_clusters = size_to_clusters(s, bytes); +head = MIN(end_offset, ROUND_UP(offset, s->cluster_size)) - offset; +offset += head; + +tail = (end_offset >= bs->total_sectors << BDRV_SECTOR_BITS) ? 0 : + end_offset - MAX(offset, start_of_cluster(s, end_offset)); +end_offset -= tail; s->cache_discards = true; +if (head) { +ret = discard_l2_subclusters(bs, offset - head, + size_to_subclusters(s, head), type, + full_discard, NULL); +if (ret < 0) { +goto fail; +} +} + /* Each L2 slice is handled by its own loop iteration */ +nb_clusters = size_to_clusters(s, end_offset - offset); + while (nb_clusters > 0) { I think the comment should stay attached to the `while`. Hanna cleared = discard_in_l2_slice(bs, offset, nb_clusters, type, full_discard);
Re: [PATCH 3/7] qcow2: zeroize the entire cluster when there're no non-zero subclusters
On 20.10.23 23:56, Andrey Drobyshev wrote: When zeroizing the last non-zero subclusters within single cluster, it makes sense to go zeroize the entire cluster and go down zero_in_l2_slice() path right away. That way we'd also update the corresponding refcount table. Signed-off-by: Andrey Drobyshev --- block/qcow2-cluster.c | 18 +++--- 1 file changed, 15 insertions(+), 3 deletions(-) Reviewed-by: Hanna Czenczek
Re: [PATCH 2/7] qcow2: add get_sc_range_info() helper for working with subcluster ranges
On 20.10.23 23:56, Andrey Drobyshev wrote: This helper simply obtains the l2 table parameters of the cluster which contains the given subclusters range. Right now this info is being obtained and used by zero_l2_subclusters(). As we're about to introduce the subclusters discard operation, this helper would let us avoid code duplication. Also introduce struct SubClusterRangeInfo, which would contain all the needed params. Signed-off-by: Andrey Drobyshev --- block/qcow2-cluster.c | 90 +-- 1 file changed, 62 insertions(+), 28 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 904f00d1b3..8801856b93 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -32,6 +32,13 @@ #include "qemu/memalign.h" #include "trace.h" +typedef struct SubClusterRangeInfo { +uint64_t *l2_slice; We should document that this is a strong reference that must be returned via qcow2_cache_put(). Maybe you could also define a clean-up function using G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC() that does this, allowing this type to be used with g_auto() (see the sketch after this quoted patch). +int l2_index; +uint64_t l2_entry; +uint64_t l2_bitmap; +} SubClusterRangeInfo; + int coroutine_fn qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t exact_size) { @@ -1892,6 +1899,50 @@ again: return 0; } +static int get_sc_range_info(BlockDriverState *bs, uint64_t offset, + unsigned nb_subclusters, + SubClusterRangeInfo *scri) It would be good to have documentation for this function, for example that it only works on a single cluster, i.e. that the range denoted by @offset and @nb_subclusters must not cross a cluster boundary, and that @offset must be aligned to subclusters. In general, it is unclear to me at this point what this function does. OK, it gets the SCRI for all subclusters in the cluster at @offset (this is what its name implies), but then it also has this loop that checks whether there are compressed clusters among the @nb_subclusters. It has a comment about being unable to zero/discard subclusters in compressed clusters, but the function name says nothing about this scope of zeroing/discarding. +{ +BDRVQcow2State *s = bs->opaque; +int ret, sc_cleared = 0, sc_index = offset_to_sc_index(s, offset); +QCow2SubclusterType sctype; + +/* Here we only work with the subclusters within single cluster. */ +assert(nb_subclusters > 0 && nb_subclusters < s->subclusters_per_cluster); +assert(sc_index + nb_subclusters <= s->subclusters_per_cluster); +assert(offset_into_subcluster(s, offset) == 0); + +ret = get_cluster_table(bs, offset, &scri->l2_slice, &scri->l2_index); +if (ret < 0) { +return ret; +} + +scri->l2_entry = get_l2_entry(s, scri->l2_slice, scri->l2_index); +scri->l2_bitmap = get_l2_bitmap(s, scri->l2_slice, scri->l2_index); + +do { +qcow2_get_subcluster_range_type(bs, scri->l2_entry, scri->l2_bitmap, +sc_index, &sctype); I think there’s a `ret = ` missing here. +if (ret < 0) { +return ret; +} + +switch (sctype) { +case QCOW2_SUBCLUSTER_COMPRESSED: +/* We cannot partially zeroize/discard compressed clusters. */ +return -ENOTSUP; +case QCOW2_SUBCLUSTER_INVALID: +return -EINVAL; +default: +break; +} + +sc_cleared += ret; +} while (sc_cleared < nb_subclusters); + +return 0; +} + /* * This discards as many clusters of nb_clusters as possible at once (i.e.
* all clusters in the same L2 slice) and returns the number of discarded @@ -2097,44 +2148,27 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset, unsigned nb_subclusters) { BDRVQcow2State *s = bs->opaque; -uint64_t *l2_slice; -uint64_t old_l2_bitmap, l2_bitmap; -int l2_index, ret, sc = offset_to_sc_index(s, offset); +uint64_t new_l2_bitmap; +int ret, sc = offset_to_sc_index(s, offset); +SubClusterRangeInfo scri = { 0 }; -/* For full clusters use zero_in_l2_slice() instead */ -assert(nb_subclusters > 0 && nb_subclusters < s->subclusters_per_cluster); -assert(sc + nb_subclusters <= s->subclusters_per_cluster); -assert(offset_into_subcluster(s, offset) == 0); - -ret = get_cluster_table(bs, offset, &l2_slice, &l2_index); +ret = get_sc_range_info(bs, offset, nb_subclusters, &scri); if (ret < 0) { -return ret; -} - -switch (qcow2_get_cluster_type(bs, get_l2_entry(s, l2_slice, l2_index))) { -case QCOW2_CLUSTER_COMPRESSED: -ret = -ENOTSUP; /* We cannot partially zeroize compressed clusters */ goto out; -case QCOW2_CLUSTER_NORMAL: -case QCOW2_CLUSTER_UNALLOCATED: -break; -default: -g_assert_not_reached(); } -
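To illustrate the g_auto() suggestion above: a minimal sketch, assuming SubClusterRangeInfo additionally carries the Qcow2Cache pointer so that the clean-up function knows which cache the slice has to be returned to (subcluster_range_info_clear() is a hypothetical name, not part of the patch):

```c
#include <glib.h>

typedef struct SubClusterRangeInfo {
    Qcow2Cache *l2_cache;   /* assumed addition, needed by the cleanup */
    uint64_t *l2_slice;     /* strong reference, owned by this struct */
    int l2_index;
    uint64_t l2_entry;
    uint64_t l2_bitmap;
} SubClusterRangeInfo;

static void subcluster_range_info_clear(SubClusterRangeInfo *scri)
{
    if (scri->l2_slice) {
        /* Returns the strong reference taken by get_cluster_table() */
        qcow2_cache_put(scri->l2_cache, (void **)&scri->l2_slice);
    }
}

G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC(SubClusterRangeInfo,
                                 subcluster_range_info_clear)
```

Callers could then declare `g_auto(SubClusterRangeInfo) scri = { 0 };` and drop the explicit qcow2_cache_put() on every exit path.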
Re: [PATCH 1/7] qcow2: make function update_refcount_discard() global
On 20.10.23 23:56, Andrey Drobyshev wrote: We are going to need it for discarding separate subclusters. The function itself doesn't do anything with the refcount tables, I think the idea behind the naming was that updating refcounts can lead to clusters being discarded, i.e. update_refcount_discard() did the “discard” part of “update_refcount”. it simply adds a discard request to the queue, so rename it to qcow2_queue_discard(). But that’s fine, too. Signed-off-by: Andrey Drobyshev --- block/qcow2-refcount.c | 8 block/qcow2.h | 2 ++ 2 files changed, 6 insertions(+), 4 deletions(-) Reviewed-by: Hanna Czenczek
Re: [PATCH v2 1/1] qemu-img: do not erase destination file in qemu-img dd command
On 01.10.23 22:46, Denis V. Lunev wrote: Can you please not top-post. This makes the discussion complex. This approach is followed on this mailing list and on other similar lists like LKML. On 10/1/23 19:08, Mike Maslenkin wrote: I thought about "conv=notrunc", but my main concern is changed virtual disk metadata. It depends on how qemu-img is used. Maybe I followed a wrong pattern, but the pros and cons of adding a "conv" parameter were not on my mind in the scope of the first patch version. I see 4 obvious ways of using `qemu-img dd`: 1. Copy virtual disk data between images of the same format. I think disk geometry must be preserved in this case. 2. Copy virtual disk data between different formats. Is that a valid pattern? Maybe `qemu-img convert` should be used instead? 3. Merge snapshots into a specified disk image, i.e. read the current state and write it to a new disk image. 4. Copy virtual disk data to a raw binary file. Actually this patch breaks 'dd' behavior for this case when the source image is smaller (in terms of logical blocks) than the existing raw binary file. Maybe for this case the condition can be improved to something like if (strcmp(fmt, "raw") || !g_file_test(out.filename, G_FILE_TEST_EXISTS)). And the parameter "conv=notrunc" may be implemented additionally for this case. My personal opinion is that qemu-img dd is for when you need to extract some data from the original image and process it further; thus I use it to copy some data into a raw binary file. My next goal here would be to add the ability to write data to stdout, which would be beneficial for that. Though this is out of the equation at the moment. Speaking about the approach, though, I would say that the patch changes current behavior, which is not exactly buggy, just a matter of this or that taste. It should be noted that we are in the Linux world here, not in the Mac world, where we would be in a position to avoid options and selections. Thus my opinion is that the original behavior is to be preserved, as somebody is relying on it. The option you are proposing seems valuable to me as well, and thus the switch is to be added. The switch is well-defined in the original 'dd' world, so either conv= option would be good: nocreat or notrunc. For me 'nocreat' seems more natural. Anyway, the last word here belongs to either Hanna or Kevin ;) Personally, and honestly, I see no actual use for qemu-img dd at all, because we're trying to mimic a subset of an interface of a rather complex program that has been designed to do what it does. We can only fail at that. Personally, whenever I need dd functionality, I use qemu-storage-daemon's fuse export, and then use the actual dd program on top. Alternatively, qemu-img convert is our native interface; unfortunately, its feature set is lacking when compared to qemu-img dd, but I think it would be better to improve that rather than working on qemu-img dd. I tend to agree with you, Denis, though, that this patch changes behavior, and users may be relying on the current behavior. Comparing to "what would dd do" is difficult because dd just has no concept of file formats. (That's another reason why I think it's bad to provide a dd-like interface, and users should rather use dd itself, or qemu-img convert.) But in any case, adding conv=notrunc to keep the existing target file would seem fair to me. I understand conv=nocreat would advise dd to never create the target file, which is something slightly different (i.e. 
conv=notrunc will still create a new target file if it doesn’t exist yet, but won’t create/truncate one if it already exists; while conv=nocreat would disable creating a new target file if it doesn’t exist yet, but still truncate it if it does exist; using both together would ensure that the target file is never created/truncated). Summary: If we do this under a new conv=notrunc, fine with me. I just don’t think qemu-img dd is something that should be used at all. Hanna
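As a concrete illustration of the FUSE-export workflow described above (a hedged sketch; image and file names are made up):

```
# Export a qcow2 image through qemu-storage-daemon's FUSE export, then
# run the real dd on the exported file.  The mountpoint must be an
# existing regular file.
touch mnt-file
qemu-storage-daemon \
    --blockdev file,filename=disk.qcow2,node-name=proto \
    --blockdev qcow2,file=proto,node-name=fmt \
    --export type=fuse,id=exp0,node-name=fmt,mountpoint=mnt-file &

dd if=mnt-file of=out.raw bs=1M
```

This gets dd's full option set (skip=, seek=, conv=, ...) without qemu-img having to reimplement any of it.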
Re: [PATCH] block-jobs: add final flush
On 04.10.23 15:56, Vladimir Sementsov-Ogievskiy wrote: From: Vladimir Sementsov-Ogievskiy Actually, a block job is not complete without the final flush. It's rather unexpected to end up with a broken target when the job completed successfully long ago, but we then failed to flush, or the process crashed or was killed. Mirror job already has mirror_flush() for this, so it's OK. Add similar things for the other jobs: backup, stream, commit. Note that stream has a (documented) different treatment of the IGNORE action: it doesn't retry the operation, but continues execution and reports the error at the end. We keep that for the final flush, too. Signed-off-by: Vladimir Sementsov-Ogievskiy --- Was: [PATCH v4] block-jobs: flush target at the end of .run() But now rewritten. Supersedes: <20230725174008.1147467-1-vsement...@yandex-team.ru> block/backup.c | 2 +- block/block-copy.c | 7 +++ block/commit.c | 16 block/stream.c | 21 + include/block/block-copy.h | 1 + 5 files changed, 38 insertions(+), 9 deletions(-) [...] diff --git a/block/commit.c b/block/commit.c index aa45beb0f0..5205c77ec9 100644 --- a/block/commit.c +++ b/block/commit.c [...] @@ -187,7 +187,15 @@ static int coroutine_fn commit_run(Job *job, Error **errp) } } -return 0; +do { +ret = blk_co_flush(s->base); +if (ret < 0) { +action = block_job_error_action(&s->common, s->on_error, +false, -ret); +} +} while (ret < 0 && action != BLOCK_ERROR_ACTION_REPORT); Do we need to yield in this loop somewhere so that BLOCK_ERROR_ACTION_STOP can pause the job? Hanna
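One way such a yield could look: a hedged sketch using the existing job_pause_point() helper (whether this is the right place for it is exactly what the question above is about):

```c
do {
    ret = blk_co_flush(s->base);
    if (ret < 0) {
        action = block_job_error_action(&s->common, s->on_error,
                                        false, -ret);
        if (action == BLOCK_ERROR_ACTION_STOP) {
            /* Gives the job a chance to actually enter the paused
             * state before the flush is retried */
            job_pause_point(&s->common.job);
        }
    }
} while (ret < 0 && action != BLOCK_ERROR_ACTION_REPORT);
```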
Re: [PATCH 2/2] iotests: Test media change with iothreads
On 13.10.23 17:33, Kevin Wolf wrote: iotests case 118 already tests all relevant operations for media change with multiple devices, however never with iothreads. This changes the test so that the virtio-scsi tests run with an iothread. Signed-off-by: Kevin Wolf --- tests/qemu-iotests/118 | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) Reviewed-by: Hanna Czenczek
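For reference, the configuration the updated test exercises corresponds roughly to this command line (a sketch; IDs and file names are made up): a virtio-scsi controller whose I/O runs in a separate iothread, with a removable scsi-cd attached to it.

```
qemu-system-x86_64 \
    -object iothread,id=iothread0 \
    -device virtio-scsi-pci,id=scsi0,iothread=iothread0 \
    -blockdev file,filename=cd.iso,node-name=cd-file \
    -device scsi-cd,bus=scsi0.0,drive=cd-file,id=cd0
```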
Re: [PATCH 1/2] block: Fix locking in media change monitor commands
On 13.10.23 17:33, Kevin Wolf wrote: blk_insert_bs() requires that the caller holds the AioContext lock for the node to be inserted. Since commit c066e808e11, neglecting to do so causes a crash when the child has to be moved to a different AioContext to attach it to the BlockBackend. This fixes qmp_blockdev_insert_anon_medium(), which is called for the QMP commands 'blockdev-insert-medium' and 'blockdev-change-medium', to correctly take the lock. Cc: qemu-sta...@nongnu.org Fixes: https://issues.redhat.com/browse/RHEL-3922 Fixes: c066e808e11a5c181b625537b6c78e0de27a4801 Signed-off-by: Kevin Wolf --- block/qapi-sysemu.c | 5 + 1 file changed, 5 insertions(+) Do we need to take the lock for the dev_ops tray callbacks, too? I suppose not, and it also wouldn’t really matter in light of the lock being supposed to go away anyway, but still thought I should ask. In any case, this change here is necessary, so: Reviewed-by: Hanna Czenczek
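In code terms, the fix boils down to the usual pattern of holding the node's AioContext lock around blk_insert_bs(); roughly (a sketch of the pattern, not the verbatim patch):

```c
AioContext *ctx = bdrv_get_aio_context(bs);

aio_context_acquire(ctx);
ret = blk_insert_bs(blk, bs, errp);
aio_context_release(ctx);
```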
Re: [PATCH v2] block/file-posix: fix update_zones_wp() caller
On 25.08.23 06:05, Sam Li wrote: When a zoned request fails, it needs to update only the wp of the target zones, so as not to disrupt in-flight writes on these other zones. The wp is updated successfully after the request completes. Fix the callers to pass the right offset and nr_zones. Signed-off-by: Sam Li --- block/file-posix.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) Thanks, applied to my block branch: https://gitlab.com/hreitz/qemu/-/commits/block (Rebased on master, and I’ve also fixed the comment to read “boundaries” instead of “bounaries”. Hope that’s OK!) Hanna
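Concretely, the fix presumably amounts to passing the failed request's offset instead of a hard-coded zone 0 in raw_co_prw()'s error path, along these lines (a hedged sketch, not the verbatim patch):

```c
/* Refresh the write pointer of the zone the failed request actually
 * targeted, rather than always zone 0 */
if (ret < 0 && (type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND))) {
    update_zones_wp(bs, s->fd, offset, 1);
}
```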
Re: [PATCH v8 1/5] qemu-iotests: Filter warnings about block migration being deprecated
On 18.10.23 13:55, Juan Quintela wrote: Create a new filter that removes the two warnings for test 183. Signed-off-by: Juan Quintela --- tests/qemu-iotests/183 | 2 +- tests/qemu-iotests/common.filter | 7 +++ 2 files changed, 8 insertions(+), 1 deletion(-) Reviewed-by: Hanna Czenczek
Re: [PATCH v2 8/8] iotests: add tests for "qemu-img rebase" with compression
On 15.09.23 18:20, Andrey Drobyshev wrote: The test cases considered so far: 314 (new test suite): 1. Check that compression mode isn't compatible with "-f raw" (raw format doesn't support compression). 2. Check that rebasing an image onto no backing file preserves the data and writes the copied clusters actually compressed. 3. Same as 2, but with a raw backing file (i.e. the clusters copied from the backing are originally uncompressed -- we check they end up compressed after being merged). 4. Remove a single delta from a backing chain, perform the same checks as in 2. 5. Check that even when backing and overlay are initially uncompressed, copied clusters end up compressed when rebase with compression is performed. 271: 1. Check that when target image has subclusters, rebase with compression will make an entire cluster containing the written subcluster compressed. Signed-off-by: Andrey Drobyshev --- tests/qemu-iotests/271 | 65 +++ tests/qemu-iotests/271.out | 40 + tests/qemu-iotests/314 | 165 + tests/qemu-iotests/314.out | 75 + 4 files changed, 345 insertions(+) create mode 100755 tests/qemu-iotests/314 create mode 100644 tests/qemu-iotests/314.out Reviewed-by: Hanna Czenczek
Re: [PATCH v2 6/8] iotests/{024, 271}: add testcases for qemu-img rebase
On 15.09.23 18:20, Andrey Drobyshev wrote: As the previous commit changes the logic of "qemu-img rebase" (it's using write alignment now), let's add a couple more test cases to ensure it works correctly. In particular, the following scenarios: 024: add a test case for rebase within one backing chain when the overlay cluster size is larger than the backing's cluster size; 271: add a test case for rebasing images that contain subclusters. Check that no extra allocations are being made. Signed-off-by: Andrey Drobyshev --- tests/qemu-iotests/024 | 60 ++ tests/qemu-iotests/024.out | 43 + tests/qemu-iotests/271 | 66 ++ tests/qemu-iotests/271.out | 42 4 files changed, 211 insertions(+) Interestingly, the new case in 024 hangs before this series. So it isn’t just an optimization, but a fix, actually! Reviewed-by: Hanna Czenczek
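For illustration, the 024 scenario (overlay cluster size larger than the backing's) can be reproduced by hand roughly like this (a sketch with made-up file names and sizes):

```
qemu-img create -f qcow2 -o cluster_size=4k base.qcow2 1M
qemu-img create -f qcow2 -o cluster_size=4k -b base.qcow2 -F qcow2 mid.qcow2 1M
qemu-img create -f qcow2 -o cluster_size=64k -b mid.qcow2 -F qcow2 top.qcow2 1M
# Rebase top.qcow2 from mid.qcow2 directly onto base.qcow2:
qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 top.qcow2
```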
Re: Deadlock with SATA CD I/O and eject
On 19.09.23 12:54, Kevin Wolf wrote: Am 18.09.2023 um 19:28 hat John Levon geschrieben: Observed with base of qemu 6.2.0, but from code inspection it looks to me like it's still current in upstream master. Apologies if I have missed a fix in this area. Symptom: run a UEFI-booted SATA CD Windows installer. When it hits the "Loading files.." screen, run an eject, e.g. virsh qemu-monitor-command 64c6e190-ea7f-49e2-b2d5-6ba1814b00ae '{"execute":"eject", "arguments": { "id": "sata0-0-0" } }' qemu will get stuck like so: (gdb) bt #0 0x7f8ba4b16036 in ppoll () from /lib64/libc.so.6 #1 0x561813c48ed5 in ppoll (__ss=0x0, __timeout=0x7ffcbd981a70, __nfds=, __fds=) at /usr/include/bits/poll2.h:62 #2 qemu_poll_ns (fds=, nfds=, timeout=timeout@entry=999896128) at ../util/qemu-timer.c:348 #3 0x561813c29be9 in fdmon_poll_wait (ctx=0x56181516e070, ready_list=0x7ffcbd981af0, timeout=999896128) at ../util/fdmon-poll.c:80 #4 0x561813c297e1 in aio_poll (ctx=ctx@entry=0x56181516e070, blocking=blocking@entry=true) at ../util/aio-posix.c:607 #5 0x561813ae2fad in bdrv_do_drained_begin (poll=true, ignore_bds_parents=false, parent=0x0, recursive=false, bs=0x56181533fcc0) at ../block/io.c:483 #6 bdrv_do_drained_begin (bs=0x56181533fcc0, recursive=, parent=0x0, ignore_bds_parents=, poll=) at ../block/io.c:446 #7 0x561813ad9982 in blk_drain (blk=0x5618161c1f10) at ../block/block-backend.c:1741 #8 0x561813ad9b8c in blk_remove_bs (blk=blk@entry=0x5618161c1f10) at ../block/block-backend.c:852 #9 0x56181382b8ab in blockdev_remove_medium (has_device=, device=, has_id=, id=, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:232 #10 0x56181382bfb1 in qmp_eject (has_device=, device=0x0, has_id=, id=0x561815e6efe0 "sata0-0-0", has_force=, force=, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:45 We are stuck forever here: 351 static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent, 352 bool poll) ... 380 if (poll) { 381 BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, parent)); 382 } Because the blk root's ->in_flight is 1, as tested by the condition blk_root_drained_poll(). Our blk->in_flight user is stuck here: 1298 static void coroutine_fn blk_wait_while_drained(BlockBackend *blk) ... 1310 blk_dec_in_flight(blk); 1311 qemu_co_queue_wait(&blk->queued_requests, &blk->queued_requests_lock); 1312 blk_inc_in_flight(blk); Note that before entering this stanza, blk->in_flight was 2. This turns out to be due to the ide atapi code. In particular, the guest VM is generating lots of read I/O. The "first IO" arrives into blk via: cd_read_sector()->ide_buffered_readv()->blk_aio_preadv() This initial IO completes: blk_aio_read_entry()->blk_aio_complete() 1560 static void blk_aio_complete(BlkAioEmAIOCB *acb) 1561 { 1562 if (acb->has_returned) { 1563 acb->common.cb(acb->common.opaque, acb->rwco.ret); 1564 blk_dec_in_flight(acb->rwco.blk); 1565 qemu_aio_unref(acb); 1566 } 1567 } Line 1564 is what we need to move blk->in_flight down to zero, but that is never reached! This is because of what happens at :1563 acb->common.cb()->cd_read_sector_cb()->ide_atapi_cmd_reply_end()->cd_read_sector_sync()->blk_pread() That is, the IO callback in the atapi code itself triggers more - synchronous - IO. In the meantime, we start processing the blk_drain() code, so by the time this blk_pread() actually gets handled, quiesce is set, and we get stuck in blk_wait_while_drained(). I don't know the qemu block stack well enough to propose an actual fix. 
Experimentally, waiting for ->in_flight to drop to zero *before* we quiesce in blk_remove_bs() via an AIO_WAIT_WHILE() avoids the symptom, but I'm pretty sure that's just a band-aid instead of fixing the deadlock. Any suggestions/clues/thoughts? Related discussion: https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00284.html Actually, it seems we never fixed that problem either? We didn’t, from my POV mainly because everyone had different suggestions and none of them looked nice enough that it was clear which one to pursue. I.e., there are also https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00709.html and https://lists.gnu.org/archive/html/qemu-block/2023-04/msg00118.html . Back then I suggested that blk_drain*() should disable request queuing because its callers don't want to quiesce the BlockBackend, but rather get their own requests completed. I think this approach would solve this one as well. Your experiment with moving the queuing later is basically what Paolo suggested, though I'd still argue it's not the right thing to do because while it seems to mostly work with both use cases of drain (give me exclusive access vs. wait for my requests to complete), it's not optimal for either one. Kevin
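To make Kevin's suggestion concrete, a hedged sketch (blk_set_disable_request_queuing() is the existing helper; whether blk_drain() may simply toggle it like this is exactly the open design question, so this is a possible direction, not the agreed fix):

```c
void blk_drain(BlockBackend *blk)
{
    BlockDriverState *bs = blk_bs(blk);

    /* We want our own in-flight requests to complete, not to be
     * queued until the drained section ends. */
    blk_set_disable_request_queuing(blk, true);

    if (bs) {
        bdrv_ref(bs);
        bdrv_drained_begin(bs);
    }

    /* Wait for the BlockBackend's own requests to finish */
    AIO_WAIT_WHILE(blk_get_aio_context(blk),
                   qatomic_read(&blk->in_flight) > 0);

    if (bs) {
        bdrv_drained_end(bs);
        bdrv_unref(bs);
    }

    blk_set_disable_request_queuing(blk, false);
}
```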
Re: [PATCH v2 5/8] qemu-img: rebase: avoid unnecessary COW operations
On 15.09.23 18:20, Andrey Drobyshev wrote: When rebasing an image from one backing file to another, we need to compare data from old and new backings. If the diff between that data happens to be unaligned to the target cluster size, we might end up doing partial writes, which would lead to copy-on-write and additional IO. Consider the following simple case (virtual_size == cluster_size == 64K): base <-- inc1 <-- inc2 qemu-io -c "write -P 0xaa 0 32K" base.qcow2 qemu-io -c "write -P 0xcc 32K 32K" base.qcow2 qemu-io -c "write -P 0xbb 0 32K" inc1.qcow2 qemu-io -c "write -P 0xcc 32K 32K" inc1.qcow2 qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 inc2.qcow2 While doing rebase, we'll write a half of the cluster to inc2, and block layer will have to read the 2nd half of the same cluster from the base image inc1 while doing this write operation, although the whole cluster is already read earlier to perform data comparison. In order to avoid these unnecessary IO cycles, let's make sure every write request is aligned to the overlay subcluster boundaries. Using subcluster size is universal as for the images which don't have them this size equals to the cluster size, so in any case we end up aligning to the smallest unit of allocation. Signed-off-by: Andrey Drobyshev --- qemu-img.c | 76 -- 1 file changed, 56 insertions(+), 20 deletions(-) Looks good, I like the changes from v1! Two minor things: diff --git a/qemu-img.c b/qemu-img.c index fcd31d7b5b..83950af42b 100644 --- a/qemu-img.c +++ b/qemu-img.c [...] @@ -3844,33 +3861,48 @@ static int img_rebase(int argc, char **argv) } } +/* + * At this point we know that the region [offset; offset + n) + * is unallocated within the target image. This region might be + * unaligned to the target image's (sub)cluster boundaries, as + * old backing may have smaller clusters (or have subclusters). + * We extend it to the aligned boundaries to avoid CoW on + * partial writes in blk_pwrite(), + */ +n += offset - QEMU_ALIGN_DOWN(offset, write_align); +offset = QEMU_ALIGN_DOWN(offset, write_align); +n += QEMU_ALIGN_UP(offset + n, write_align) - (offset + n); +n = MIN(n, size - offset); +assert(!bdrv_is_allocated(unfiltered_bs, offset, n, &n_alloc) && + n_alloc == n); + +/* + * Much like the with the target image, we'll try to read as much s/the with the/with the/ + * of the old and new backings as we can. +
 */ +n_old = MIN(n, MAX(0, old_backing_size - (int64_t) offset)); +if (blk_new_backing) { +n_new = MIN(n, MAX(0, new_backing_size - (int64_t) offset)); +} If we don’t have a check for blk_old_backing (old_backing_size is 0 if blk_old_backing is NULL), why do we have a check for blk_new_backing (new_backing_size is 0 if blk_new_backing is NULL)? (Perhaps because the previous check was `offset >= new_backing_size || !blk_new_backing`, i.e. included exactly such a check – but I don’t think it’s necessary, new_backing_size will be 0 if blk_new_backing is NULL.) + /* * Read old and new backing file and take into consideration that * backing files may be smaller than the COW image. 
*/ -if (offset >= old_backing_size) { -memset(buf_old, 0, n); -buf_old_is_zero = true; +memset(buf_old + n_old, 0, n - n_old); +if (!n_old) { +old_backing_eof = true; } else { -if (offset + n > old_backing_size) { -n = old_backing_size - offset; -} - -ret = blk_pread(blk_old_backing, offset, n, buf_old, 0); +ret = blk_pread(blk_old_backing, offset, n_old, buf_old, 0); if (ret < 0) { error_report("error while reading from old backing file"); goto out; } } -if (offset >= new_backing_size || !blk_new_backing) { -memset(buf_new, 0, n); -} else { -if (offset + n > new_backing_size) { -n = new_backing_size - offset; -} - -ret = blk_pread(blk_new_backing, offset, n, buf_new, 0); +memset(buf_new + n_new, 0, n - n_new); +if (blk_new_backing && n_new) { Same as above, I think we can drop the blk_new_backing check, just so that both blocks (for old and new) look the same. (Also, the memset() already has to trust that n_new is 0 if blk_new_backing is NULL, so the check doesn’t make much sense from that perspective either, and makes the memset() look wrong.) +
Re: [PATCH v2 4/8] qemu-img: add chunk size parameter to compare_buffers()
On 15.09.23 18:20, Andrey Drobyshev wrote: Add @chsize param to the function which, if non-zero, would represent the chunk size to be used for comparison. If it's zero, then BDRV_SECTOR_SIZE is used as default chunk size, which is the previous behaviour. In particular, we're going to use this param in img_rebase() to make the write requests aligned to a predefined alignment value. Signed-off-by: Andrey Drobyshev --- qemu-img.c | 22 ++ 1 file changed, 14 insertions(+), 8 deletions(-) With the comment fix Eric has suggested: Reviewed-by: Hanna Czenczek
Re: [PATCH v2 3/8] qemu-img: rebase: use backing files' BlockBackend for buffer alignment
On 15.09.23 18:20, Andrey Drobyshev wrote: Since commit bb1c05973cf ("qemu-img: Use qemu_blockalign"), buffers for the data read from the old and new backing files are aligned using BlockDriverState (or BlockBackend later on) referring to the target image. However, this isn't quite right, because buf_new is only being used for reading from the new backing, while buf_old is being used for both reading from the old backing and writing to the target. Let's take that into account and use more appropriate values as alignments. Signed-off-by: Andrey Drobyshev --- qemu-img.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/qemu-img.c b/qemu-img.c index 50660ba920..d12e4a4753 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -3750,8 +3750,13 @@ static int img_rebase(int argc, char **argv) int64_t n; float local_progress = 0; -buf_old = blk_blockalign(blk, IO_BUF_SIZE); -buf_new = blk_blockalign(blk, IO_BUF_SIZE); +if (blk_old_backing && bdrv_opt_mem_align(blk_bs(blk)) > +bdrv_opt_mem_align(blk_bs(blk_old_backing))) { +buf_old = blk_blockalign(blk, IO_BUF_SIZE); +} else { +buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE); +} As I read this, if blk_old_backing is NULL, we will go to the blk_blockalign(blk_old_backing, IO_BUF_SIZE) path. I think if it is NULL, we should align on blk instead. Hanna +buf_new = blk_blockalign(blk_new_backing, IO_BUF_SIZE); size = blk_getlength(blk); if (size < 0) {
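In code, the NULL-safe variant Hanna suggests could read like this (a sketch of the suggested change, not a committed patch):

```c
/* If there is no old backing file, simply align on blk */
if (!blk_old_backing ||
    bdrv_opt_mem_align(blk_bs(blk)) >
        bdrv_opt_mem_align(blk_bs(blk_old_backing))) {
    buf_old = blk_blockalign(blk, IO_BUF_SIZE);
} else {
    buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE);
}
```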
Re: [PULL 00/14] Block patches
On 07.09.23 13:21, Hanna Czenczek wrote: On 06.09.23 15:18, Stefan Hajnoczi wrote: On Fri, 1 Sept 2023 at 04:18, Hanna Czenczek wrote: The following changes since commit f5fe7c17ac4e309e47e78f0f9761aebc8d2f2c81: Merge tag 'pull-tcg-20230823-2' of https://gitlab.com/rth7680/qemu into staging (2023-08-28 16:07:04 -0400) are available in the Git repository at: https://gitlab.com/hreitz/qemu.git tags/pull-block-2023-09-01 Hi Hanna, Please push a signed tag (git tag -s). Thanks! By the way, I meant to imply that I’m not going to push a new tag at this time. I have generated this pull request as I always do, using a script that automatically signs the tag. gpg asked me to enter my key password, indicating something had been signed, and the methods I know of tell me that the tag is indeed signed. So as far as I can tell, the tag is signed as usual. If I were to create a new pull request, I would do it the exact same way, which I expect would yield the same result, so we’d have the same problem again. To get a different result, I need to know how you determine the tag not to be signed, so I can reproduce it and potentially fix the problem in my workflow. Hanna Is it not signed? I don’t think gitlab has support to show that, but github shows it as verified: https://github.com/XanClic/qemu/releases/tag/pull-block-2023-09-01 And when I clone it: ``` $ git clone https://gitlab.com/hreitz/qemu -b pull-block-2023-09-01 --depth=1 [...] $ cd qemu $ git tag -v pull-block-2023-09-01 LANG=C git tag -v pull-block-2023-09-01 object 380448464dd89291cf7fd7434be6c225482a334d type commit tag pull-block-2023-09-01 tagger Hanna Reitz 1693555853 +0200 Block patches - Fix for file-posix's zoning code crashing on I/O errors - Throttling refactoring gpg: Signature made Fri Sep 1 10:11:46 2023 CEST gpg: using RSA key CB62D7A0EE3829E45F004D34A1FA40D098019CDF gpg: issuer "hre...@redhat.com" gpg: Good signature from "Hanna Reitz " [ultimate] Primary key fingerprint: CB62 D7A0 EE38 29E4 5F00 4D34 A1FA 40D0 9801 9CDF ``` Hanna
Re: [PULL 00/14] Block patches
On 06.09.23 15:18, Stefan Hajnoczi wrote: On Fri, 1 Sept 2023 at 04:18, Hanna Czenczek wrote: The following changes since commit f5fe7c17ac4e309e47e78f0f9761aebc8d2f2c81: Merge tag 'pull-tcg-20230823-2' of https://gitlab.com/rth7680/qemu into staging (2023-08-28 16:07:04 -0400) are available in the Git repository at: https://gitlab.com/hreitz/qemu.git tags/pull-block-2023-09-01 Hi Hanna, Please push a signed tag (git tag -s). Thanks! Is it not signed? I don’t think gitlab has support to show that, but github shows it as verified: https://github.com/XanClic/qemu/releases/tag/pull-block-2023-09-01 And when I clone it: ``` $ git clone https://gitlab.com/hreitz/qemu -b pull-block-2023-09-01 --depth=1 [...] $ cd qemu $ git tag -v pull-block-2023-09-01 LANG=C git tag -v pull-block-2023-09-01 object 380448464dd89291cf7fd7434be6c225482a334d type commit tag pull-block-2023-09-01 tagger Hanna Reitz 1693555853 +0200 Block patches - Fix for file-posix's zoning code crashing on I/O errors - Throttling refactoring gpg: Signature made Fri Sep 1 10:11:46 2023 CEST gpg: using RSA key CB62D7A0EE3829E45F004D34A1FA40D098019CDF gpg: issuer "hre...@redhat.com" gpg: Good signature from "Hanna Reitz " [ultimate] Primary key fingerprint: CB62 D7A0 EE38 29E4 5F00 4D34 A1FA 40D0 9801 9CDF ``` Hanna
[PULL 09/14] block/throttle-groups: Use ThrottleDirection instread of bool is_write
From: zhenwei pi 'bool is_write' style is obsolete from throttle framework, adapt block throttle groups to the new style: - use ThrottleDirection instead of 'bool is_write'. Ex, schedule_next_request(ThrottleGroupMember *tgm, bool is_write) -> schedule_next_request(ThrottleGroupMember *tgm, ThrottleDirection direction) - use THROTTLE_MAX instead of hard code. Ex, ThrottleGroupMember *tokens[2] -> ThrottleGroupMember *tokens[THROTTLE_MAX] - use ThrottleDirection instead of hard code on iteration. Ex, (i = 0; i < 2; i++) -> for (dir = THROTTLE_READ; dir < THROTTLE_MAX; dir++) Use a simple python script to test the new style: #!/usr/bin/python3 import subprocess import random import time commands = ['virsh blkdeviotune jammy vda --write-bytes-sec ', \ 'virsh blkdeviotune jammy vda --write-iops-sec ', \ 'virsh blkdeviotune jammy vda --read-bytes-sec ', \ 'virsh blkdeviotune jammy vda --read-iops-sec '] for loop in range(1, 1000): time.sleep(random.randrange(3, 5)) command = commands[random.randrange(0, 3)] + str(random.randrange(0, 100)) subprocess.run(command, shell=True, check=True) This works fine. Signed-off-by: zhenwei pi Message-Id: <20230728022006.1098509-10-pizhen...@bytedance.com> Reviewed-by: Hanna Czenczek Signed-off-by: Hanna Czenczek --- include/block/throttle-groups.h | 6 +- block/block-backend.c | 4 +- block/throttle-groups.c | 161 block/throttle.c| 8 +- 4 files changed, 90 insertions(+), 89 deletions(-) diff --git a/include/block/throttle-groups.h b/include/block/throttle-groups.h index ff282fc0f8..2355e8d9de 100644 --- a/include/block/throttle-groups.h +++ b/include/block/throttle-groups.h @@ -37,7 +37,7 @@ typedef struct ThrottleGroupMember { AioContext *aio_context; /* throttled_reqs_lock protects the CoQueues for throttled requests. */ CoMutex throttled_reqs_lock; -CoQueue throttled_reqs[2]; +CoQueue throttled_reqs[THROTTLE_MAX]; /* Nonzero if the I/O limits are currently being ignored; generally * it is zero. Accessed with atomic operations. @@ -54,7 +54,7 @@ typedef struct ThrottleGroupMember { * throttle_state tells us if I/O limits are configured. 
*/ ThrottleState *throttle_state; ThrottleTimers throttle_timers; -unsigned pending_reqs[2]; +unsigned pending_reqs[THROTTLE_MAX]; QLIST_ENTRY(ThrottleGroupMember) round_robin; } ThrottleGroupMember; @@ -78,7 +78,7 @@ void throttle_group_restart_tgm(ThrottleGroupMember *tgm); void coroutine_fn throttle_group_co_io_limits_intercept(ThrottleGroupMember *tgm, int64_t bytes, -bool is_write); +ThrottleDirection direction); void throttle_group_attach_aio_context(ThrottleGroupMember *tgm, AioContext *new_context); void throttle_group_detach_aio_context(ThrottleGroupMember *tgm); diff --git a/block/block-backend.c b/block/block-backend.c index 4009ed5fed..47d360c97a 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -1341,7 +1341,7 @@ blk_co_do_preadv_part(BlockBackend *blk, int64_t offset, int64_t bytes, /* throttling disk I/O */ if (blk->public.throttle_group_member.throttle_state) { throttle_group_co_io_limits_intercept(&blk->public.throttle_group_member, -bytes, false); +bytes, THROTTLE_READ); } ret = bdrv_co_preadv_part(blk->root, offset, bytes, qiov, qiov_offset, @@ -1415,7 +1415,7 @@ blk_co_do_pwritev_part(BlockBackend *blk, int64_t offset, int64_t bytes, /* throttling disk I/O */ if (blk->public.throttle_group_member.throttle_state) { throttle_group_co_io_limits_intercept(&blk->public.throttle_group_member, -bytes, true); +bytes, THROTTLE_WRITE); } if (!blk->enable_write_cache) { diff --git a/block/throttle-groups.c b/block/throttle-groups.c index 3847d27801..3eda4c4e3d 100644 --- a/block/throttle-groups.c +++ b/block/throttle-groups.c @@ -37,7 +37,7 @@ static void throttle_group_obj_init(Object *obj); static void throttle_group_obj_complete(UserCreatable *obj, Error **errp); -static void timer_cb(ThrottleGroupMember *tgm, bool is_write); +static void timer_cb(ThrottleGroupMember *tgm, ThrottleDirection direction); /* The ThrottleGroup structure (with its ThrottleState) is shared * among different ThrottleGroupMembers and it's independent from @@ -73,8 +73,8 @@ struct ThrottleGroup { QemuMutex lock; /* This lock protects the following four fields */ ThrottleState ts; QLIST_HEAD(, ThrottleGroupMember) head; -ThrottleGroupMember *tokens[2]; -bool any_timer_armed[2]; +T
[PULL 13/14] file-posix: Simplify raw_co_prw's 'out' zone code
We duplicate the same condition three times here, pull it out to the top level. Signed-off-by: Hanna Czenczek Message-Id: <20230824155345.109765-5-hre...@redhat.com> Reviewed-by: Sam Li --- block/file-posix.c | 18 +- 1 file changed, 5 insertions(+), 13 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index a050682e97..aa89789737 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2506,11 +2506,10 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset, out: #if defined(CONFIG_BLKZONED) -{ -BlockZoneWps *wps = bs->wps; -if (ret == 0) { -if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && -bs->bl.zoned != BLK_Z_NONE) { +if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && +bs->bl.zoned != BLK_Z_NONE) { +BlockZoneWps *wps = bs->wps; +if (ret == 0) { uint64_t *wp = &wps->wp[offset / bs->bl.zone_size]; if (!BDRV_ZT_IS_CONV(*wp)) { if (type & QEMU_AIO_ZONE_APPEND) { @@ -2523,19 +2522,12 @@ out: *wp = offset + bytes; } } -} -} else { -if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && -bs->bl.zoned != BLK_Z_NONE) { +} else { update_zones_wp(bs, s->fd, 0, 1); } -} -if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && -bs->bl.zoned != BLK_Z_NONE) { qemu_co_mutex_unlock(&wps->colock); } -} #endif return ret; } -- 2.41.0
[PULL 07/14] throttle: use THROTTLE_MAX/ARRAY_SIZE for hard code
From: zhenwei pi The first dimension of both to_check and bucket_types_size/bucket_types_units is used as throttle direction, use THROTTLE_MAX instead of hard coded number. Also use ARRAY_SIZE() to avoid hard coded number for the second dimension. Hanna noticed that the two array should be static. Yes, turn them into static variables. Reviewed-by: Hanna Czenczek Signed-off-by: zhenwei pi Message-Id: <20230728022006.1098509-8-pizhen...@bytedance.com> Signed-off-by: Hanna Czenczek --- util/throttle.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/util/throttle.c b/util/throttle.c index 7d3eb6032f..9582899da3 100644 --- a/util/throttle.c +++ b/util/throttle.c @@ -142,7 +142,8 @@ int64_t throttle_compute_wait(LeakyBucket *bkt) static int64_t throttle_compute_wait_for(ThrottleState *ts, ThrottleDirection direction) { -BucketType to_check[2][4] = { {THROTTLE_BPS_TOTAL, +static const BucketType to_check[THROTTLE_MAX][4] = { + {THROTTLE_BPS_TOTAL, THROTTLE_OPS_TOTAL, THROTTLE_BPS_READ, THROTTLE_OPS_READ}, @@ -153,7 +154,7 @@ static int64_t throttle_compute_wait_for(ThrottleState *ts, int64_t wait, max_wait = 0; int i; -for (i = 0; i < 4; i++) { +for (i = 0; i < ARRAY_SIZE(to_check[THROTTLE_READ]); i++) { BucketType index = to_check[direction][i]; wait = throttle_compute_wait(&ts->cfg.buckets[index]); if (wait > max_wait) { @@ -469,11 +470,11 @@ bool throttle_schedule_timer(ThrottleState *ts, void throttle_account(ThrottleState *ts, ThrottleDirection direction, uint64_t size) { -const BucketType bucket_types_size[2][2] = { +static const BucketType bucket_types_size[THROTTLE_MAX][2] = { { THROTTLE_BPS_TOTAL, THROTTLE_BPS_READ }, { THROTTLE_BPS_TOTAL, THROTTLE_BPS_WRITE } }; -const BucketType bucket_types_units[2][2] = { +static const BucketType bucket_types_units[THROTTLE_MAX][2] = { { THROTTLE_OPS_TOTAL, THROTTLE_OPS_READ }, { THROTTLE_OPS_TOTAL, THROTTLE_OPS_WRITE } }; @@ -486,7 +487,7 @@ void throttle_account(ThrottleState *ts, ThrottleDirection direction, units = (double) size / ts->cfg.op_size; } -for (i = 0; i < 2; i++) { +for (i = 0; i < ARRAY_SIZE(bucket_types_size[THROTTLE_READ]); i++) { LeakyBucket *bkt; bkt = &ts->cfg.buckets[bucket_types_size[direction][i]]; -- 2.41.0
[PULL 10/14] file-posix: Clear bs->bl.zoned on error
bs->bl.zoned is what indicates whether the zone information is present and valid; it is the only thing that raw_refresh_zoned_limits() sets if CONFIG_BLKZONED is not defined, and it is also the only thing that it sets if CONFIG_BLKZONED is defined, but there are no zones. Make sure that it is always set to BLK_Z_NONE if there is an error anywhere in raw_refresh_zoned_limits() so that we do not accidentally announce zones while our information is incomplete or invalid. This also fixes a memory leak in the last error path in raw_refresh_zoned_limits(). Signed-off-by: Hanna Czenczek Message-Id: <20230824155345.109765-2-hre...@redhat.com> Reviewed-by: Sam Li --- block/file-posix.c | 21 - 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index b16e9c21a1..2b88b9eefa 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -1412,11 +1412,9 @@ static void raw_refresh_zoned_limits(BlockDriverState *bs, struct stat *st, BlockZoneModel zoned; int ret; -bs->bl.zoned = BLK_Z_NONE; - ret = get_sysfs_zoned_model(st, &zoned); if (ret < 0 || zoned == BLK_Z_NONE) { -return; +goto no_zoned; } bs->bl.zoned = zoned; @@ -1437,10 +1435,10 @@ static void raw_refresh_zoned_limits(BlockDriverState *bs, struct stat *st, if (ret < 0) { error_setg_errno(errp, -ret, "Unable to read chunk_sectors " "sysfs attribute"); -return; +goto no_zoned; } else if (!ret) { error_setg(errp, "Read 0 from chunk_sectors sysfs attribute"); -return; +goto no_zoned; } bs->bl.zone_size = ret << BDRV_SECTOR_BITS; @@ -1448,10 +1446,10 @@ static void raw_refresh_zoned_limits(BlockDriverState *bs, struct stat *st, if (ret < 0) { error_setg_errno(errp, -ret, "Unable to read nr_zones " "sysfs attribute"); -return; +goto no_zoned; } else if (!ret) { error_setg(errp, "Read 0 from nr_zones sysfs attribute"); -return; +goto no_zoned; } bs->bl.nr_zones = ret; @@ -1472,10 +1470,15 @@ static void raw_refresh_zoned_limits(BlockDriverState *bs, struct stat *st, ret = get_zones_wp(bs, s->fd, 0, bs->bl.nr_zones, 0); if (ret < 0) { error_setg_errno(errp, -ret, "report wps failed"); -bs->wps = NULL; -return; +goto no_zoned; } qemu_co_mutex_init(&bs->wps->colock); +return; + +no_zoned: +bs->bl.zoned = BLK_Z_NONE; +g_free(bs->wps); +bs->wps = NULL; } #else /* !defined(CONFIG_BLKZONED) */ static void raw_refresh_zoned_limits(BlockDriverState *bs, struct stat *st, -- 2.41.0
[PULL 06/14] throttle: use enum ThrottleDirection instead of bool is_write
From: zhenwei pi enum ThrottleDirection is already there, use ThrottleDirection instead of 'bool is_write' for throttle API, also modify related codes from block, fsdev, cryptodev and tests. Reviewed-by: Hanna Czenczek Signed-off-by: zhenwei pi Message-Id: <20230728022006.1098509-7-pizhen...@bytedance.com> Signed-off-by: Hanna Czenczek --- include/qemu/throttle.h | 5 +++-- backends/cryptodev.c| 9 + block/throttle-groups.c | 6 -- fsdev/qemu-fsdev-throttle.c | 8 +--- tests/unit/test-throttle.c | 4 ++-- util/throttle.c | 31 +-- 6 files changed, 36 insertions(+), 27 deletions(-) diff --git a/include/qemu/throttle.h b/include/qemu/throttle.h index 9ca5ab8197..181245d29b 100644 --- a/include/qemu/throttle.h +++ b/include/qemu/throttle.h @@ -154,9 +154,10 @@ void throttle_config_init(ThrottleConfig *cfg); /* usage */ bool throttle_schedule_timer(ThrottleState *ts, ThrottleTimers *tt, - bool is_write); + ThrottleDirection direction); -void throttle_account(ThrottleState *ts, bool is_write, uint64_t size); +void throttle_account(ThrottleState *ts, ThrottleDirection direction, + uint64_t size); void throttle_limits_to_config(ThrottleLimits *arg, ThrottleConfig *cfg, Error **errp); void throttle_config_to_limits(ThrottleConfig *cfg, ThrottleLimits *var); diff --git a/backends/cryptodev.c b/backends/cryptodev.c index c2356550c8..e5006bd215 100644 --- a/backends/cryptodev.c +++ b/backends/cryptodev.c @@ -252,10 +252,11 @@ static void cryptodev_backend_throttle_timer_cb(void *opaque) continue; } -throttle_account(&backend->ts, true, ret); +throttle_account(&backend->ts, THROTTLE_WRITE, ret); cryptodev_backend_operation(backend, op_info); if (throttle_enabled(&backend->tc) && -throttle_schedule_timer(&backend->ts, &backend->tt, true)) { +throttle_schedule_timer(&backend->ts, &backend->tt, +THROTTLE_WRITE)) { break; } } @@ -271,7 +272,7 @@ int cryptodev_backend_crypto_operation( goto do_account; } -if (throttle_schedule_timer(&backend->ts, &backend->tt, true) || +if (throttle_schedule_timer(&backend->ts, &backend->tt, THROTTLE_WRITE) || !QTAILQ_EMPTY(&backend->opinfos)) { QTAILQ_INSERT_TAIL(&backend->opinfos, op_info, next); return 0; @@ -283,7 +284,7 @@ do_account: return ret; } -throttle_account(&backend->ts, true, ret); +throttle_account(&backend->ts, THROTTLE_WRITE, ret); return cryptodev_backend_operation(backend, op_info); } diff --git a/block/throttle-groups.c b/block/throttle-groups.c index fb203c3ced..3847d27801 100644 --- a/block/throttle-groups.c +++ b/block/throttle-groups.c @@ -270,6 +270,7 @@ static bool throttle_group_schedule_timer(ThrottleGroupMember *tgm, ThrottleState *ts = tgm->throttle_state; ThrottleGroup *tg = container_of(ts, ThrottleGroup, ts); ThrottleTimers *tt = &tgm->throttle_timers; +ThrottleDirection direction = is_write ? THROTTLE_WRITE : THROTTLE_READ; bool must_wait; if (qatomic_read(&tgm->io_limits_disabled)) { @@ -281,7 +282,7 @@ static bool throttle_group_schedule_timer(ThrottleGroupMember *tgm, return true; } -must_wait = throttle_schedule_timer(ts, tt, is_write); +must_wait = throttle_schedule_timer(ts, tt, direction); /* If a timer just got armed, set tgm as the current token */ if (must_wait) { @@ -364,6 +365,7 @@ void coroutine_fn throttle_group_co_io_limits_intercept(ThrottleGroupMember *tgm bool must_wait; ThrottleGroupMember *token; ThrottleGroup *tg = container_of(tgm->throttle_state, ThrottleGroup, ts); +ThrottleDirection direction = is_write ? THROTTLE_WRITE : THROTTLE_READ; assert(bytes >= 0); @@ -386,7 +388,7 @@ void coroutine_fn throttle_group_co_io_limits_intercept(ThrottleGroupMember *tgm } /* The I/O will be executed, so do the accounting */ -throttle_account(tgm->throttle_state, is_write, bytes); +throttle_account(tgm->throttle_state, direction, bytes); /* Schedule the next request */ schedule_next_request(tgm, is_write); diff --git a/fsdev/qemu-fsdev-throttle.c b/fsdev/qemu-fsdev-throttle.c index 5c83a1cc09..1c137d6f0f 100644 --- a/fsdev/qemu-fsdev-throttle.c +++ b/fsdev/qemu-fsdev-throttle.c @@ -97,16 +97,18 @@ void fsdev_throttle_init(FsThrottle *fst) void coroutine_fn fsdev_co_throttle_request(FsThrottle *fst, bool is_write, struct iovec *iov, int iovcnt) { +ThrottleDirection direction = is_write ? THROTTLE_WRITE : THROTTLE_READ; + if (throttle_enabled(&fst->cfg)) { -if (throttle_schedule_timer(&fst->ts, &fst->tt, is_write) || +if (throttle_schedule
[PULL 11/14] file-posix: Check bs->bl.zoned for zone info
Instead of checking bs->wps or bs->bl.zone_size for whether zone information is present, check bs->bl.zoned. That is the flag that raw_refresh_zoned_limits() reliably sets to indicate zone support. If it is set to something other than BLK_Z_NONE, other values and objects like bs->wps and bs->bl.zone_size must be non-null/zero and valid; if it is not, we cannot rely on their validity. Signed-off-by: Hanna Czenczek Message-Id: <20230824155345.109765-3-hre...@redhat.com> Reviewed-by: Sam Li --- block/file-posix.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index 2b88b9eefa..46e22403fe 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2455,9 +2455,10 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset, if (fd_open(bs) < 0) return -EIO; #if defined(CONFIG_BLKZONED) -if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && bs->wps) { +if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && +bs->bl.zoned != BLK_Z_NONE) { qemu_co_mutex_lock(&bs->wps->colock); -if (type & QEMU_AIO_ZONE_APPEND && bs->bl.zone_size) { +if (type & QEMU_AIO_ZONE_APPEND) { int index = offset / bs->bl.zone_size; offset = bs->wps->wp[index]; } @@ -2508,8 +2509,8 @@ out: { BlockZoneWps *wps = bs->wps; if (ret == 0) { -if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) -&& wps && bs->bl.zone_size) { +if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && +bs->bl.zoned != BLK_Z_NONE) { uint64_t *wp = &wps->wp[offset / bs->bl.zone_size]; if (!BDRV_ZT_IS_CONV(*wp)) { if (type & QEMU_AIO_ZONE_APPEND) { @@ -2529,7 +2530,8 @@ out: } } -if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && wps) { +if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && +bs->bl.zoned != BLK_Z_NONE) { qemu_co_mutex_unlock(&wps->colock); } } -- 2.41.0
[PULL 14/14] tests/file-io-error: New test
This is a regression test for https://bugzilla.redhat.com/show_bug.cgi?id=2234374. All this test needs to do is trigger an I/O error inside of file-posix (specifically raw_co_prw()). One reliable way to do this without requiring special privileges is to use a FUSE export, which allows us to inject any error that we want, e.g. via blkdebug. Signed-off-by: Hanna Czenczek Message-Id: <20230824155345.109765-6-hre...@redhat.com> [hreitz: Fixed test to be skipped when there is no FUSE support, to suppress fusermount's allow_other warning, and to be skipped with $IMGOPTSSYNTAX enabled] Signed-off-by: Hanna Czenczek --- tests/qemu-iotests/tests/file-io-error | 119 + tests/qemu-iotests/tests/file-io-error.out | 33 ++ 2 files changed, 152 insertions(+) create mode 100755 tests/qemu-iotests/tests/file-io-error create mode 100644 tests/qemu-iotests/tests/file-io-error.out diff --git a/tests/qemu-iotests/tests/file-io-error b/tests/qemu-iotests/tests/file-io-error new file mode 100755 index 00..88ee5f670c --- /dev/null +++ b/tests/qemu-iotests/tests/file-io-error @@ -0,0 +1,119 @@ +#!/usr/bin/env bash +# group: rw +# +# Produce an I/O error in file-posix, and hope that it is not catastrophic. +# Regression test for: https://bugzilla.redhat.com/show_bug.cgi?id=2234374 +# +# Copyright (C) 2023 Red Hat, Inc. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# + +seq=$(basename "$0") +echo "QA output created by $seq" + +status=1 # failure is the default! + +_cleanup() +{ +_cleanup_qemu +rm -f "$TEST_DIR/fuse-export" +} +trap "_cleanup; exit \$status" 0 1 2 3 15 + +# get standard environment, filters and checks +. ../common.rc +. ../common.filter +. ../common.qemu + +# Format-agnostic (we do not use any), but we do test the file protocol +_supported_proto file +_require_drivers blkdebug null-co + +if [ "$IMGOPTSSYNTAX" = "true" ]; then +# We need `$QEMU_IO -f file` to work; IMGOPTSSYNTAX uses --image-opts, +# breaking -f. +_unsupported_fmt $IMGFMT +fi + +# This is a regression test of a bug in which file-posix would access zone +# information in case of an I/O error even when there is no zone information, +# resulting in a division by zero. +# To reproduce the problem, we need to trigger an I/O error inside of +# file-posix, which can be done (rootless) by providing a FUSE export that +# presents only errors when accessed. + +_launch_qemu +_send_qemu_cmd $QEMU_HANDLE \ +"{'execute': 'qmp_capabilities'}" \ +'return' + +_send_qemu_cmd $QEMU_HANDLE \ +"{'execute': 'blockdev-add', + 'arguments': { + 'driver': 'blkdebug', + 'node-name': 'node0', + 'inject-error': [{'event': 'none'}], + 'image': { + 'driver': 'null-co' + } + }}" \ +'return' + +# FUSE mountpoint must exist and be a regular file +touch "$TEST_DIR/fuse-export" + +# The grep -v to filter fusermount's (benign) error when /etc/fuse.conf does +# not contain user_allow_other and the subsequent check for missing FUSE support +# have both been taken from iotest 308. 
+output=$(_send_qemu_cmd $QEMU_HANDLE \ +"{'execute': 'block-export-add', + 'arguments': { + 'id': 'exp0', + 'type': 'fuse', + 'node-name': 'node0', + 'mountpoint': '$TEST_DIR/fuse-export', + 'writable': true + }}" \ +'return' \ +| grep -v 'option allow_other only allowed if') + +if echo "$output" | grep -q "Parameter 'type' does not accept value 'fuse'"; then +_notrun 'No FUSE support' +fi +echo "$output" + +echo +# This should fail, but gracefully, i.e. just print an I/O error, not crash. +$QEMU_IO -f file -c 'write 0 64M' "$TEST_DIR/fuse-export" | _filter_qemu_io +echo + +_send_qemu_cmd $QEMU_HANDLE \ +"{'execute': 'block-export-del', + 'arguments': {'id': 'exp0'}}" \ +'return' + +_send_qemu_cmd $QEMU_HANDLE \ +'' \ +'BLOCK_EXPORT_DELETED' + +_send_qemu_cmd $QEMU_HANDLE \ +"{'execute': 'blockdev-del', + 'arguments': {'node-name': 'node0'}}" \ +'return' + +# success, all done +echo "*** done" +rm -f $seq.full +status=0 diff --git a/tests/qemu-iotests/tests/fi
[PULL 12/14] file-posix: Fix zone update in I/O error path
We must check that zone information is present before running update_zones_wp(). Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2234374 Fixes: Coverity CID 1512459 Signed-off-by: Hanna Czenczek Message-Id: <20230824155345.109765-4-hre...@redhat.com> Reviewed-by: Sam Li --- block/file-posix.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/block/file-posix.c b/block/file-posix.c index 46e22403fe..a050682e97 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2525,7 +2525,8 @@ out: } } } else { -if (type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) { +if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && +bs->bl.zoned != BLK_Z_NONE) { update_zones_wp(bs, s->fd, 0, 1); } } -- 2.41.0
[PULL 08/14] fsdev: Use ThrottleDirection instread of bool is_write
From: zhenwei pi 'bool is_write' style is obsolete from throttle framework, adapt fsdev to the new style. Cc: Greg Kurz Reviewed-by: Hanna Czenczek Signed-off-by: zhenwei pi Message-Id: <20230728022006.1098509-9-pizhen...@bytedance.com> Reviewed-by: Greg Kurz Signed-off-by: Hanna Czenczek --- fsdev/qemu-fsdev-throttle.h | 4 ++-- fsdev/qemu-fsdev-throttle.c | 14 +++--- hw/9pfs/cofile.c| 4 ++-- 3 files changed, 11 insertions(+), 11 deletions(-) diff --git a/fsdev/qemu-fsdev-throttle.h b/fsdev/qemu-fsdev-throttle.h index a21aecddc7..daa8ca2494 100644 --- a/fsdev/qemu-fsdev-throttle.h +++ b/fsdev/qemu-fsdev-throttle.h @@ -23,14 +23,14 @@ typedef struct FsThrottle { ThrottleState ts; ThrottleTimers tt; ThrottleConfig cfg; -CoQueue throttled_reqs[2]; +CoQueue throttled_reqs[THROTTLE_MAX]; } FsThrottle; int fsdev_throttle_parse_opts(QemuOpts *, FsThrottle *, Error **); void fsdev_throttle_init(FsThrottle *); -void coroutine_fn fsdev_co_throttle_request(FsThrottle *, bool , +void coroutine_fn fsdev_co_throttle_request(FsThrottle *, ThrottleDirection , struct iovec *, int); void fsdev_throttle_cleanup(FsThrottle *); diff --git a/fsdev/qemu-fsdev-throttle.c b/fsdev/qemu-fsdev-throttle.c index 1c137d6f0f..d912da906d 100644 --- a/fsdev/qemu-fsdev-throttle.c +++ b/fsdev/qemu-fsdev-throttle.c @@ -94,22 +94,22 @@ void fsdev_throttle_init(FsThrottle *fst) } } -void coroutine_fn fsdev_co_throttle_request(FsThrottle *fst, bool is_write, +void coroutine_fn fsdev_co_throttle_request(FsThrottle *fst, +ThrottleDirection direction, struct iovec *iov, int iovcnt) { -ThrottleDirection direction = is_write ? THROTTLE_WRITE : THROTTLE_READ; - +assert(direction < THROTTLE_MAX); if (throttle_enabled(&fst->cfg)) { if (throttle_schedule_timer(&fst->ts, &fst->tt, direction) || -!qemu_co_queue_empty(&fst->throttled_reqs[is_write])) { -qemu_co_queue_wait(&fst->throttled_reqs[is_write], NULL); +!qemu_co_queue_empty(&fst->throttled_reqs[direction])) { +qemu_co_queue_wait(&fst->throttled_reqs[direction], NULL); } throttle_account(&fst->ts, direction, iov_size(iov, iovcnt)); -if (!qemu_co_queue_empty(&fst->throttled_reqs[is_write]) && +if (!qemu_co_queue_empty(&fst->throttled_reqs[direction]) && !throttle_schedule_timer(&fst->ts, &fst->tt, direction)) { -qemu_co_queue_next(&fst->throttled_reqs[is_write]); +qemu_co_queue_next(&fst->throttled_reqs[direction]); } } } diff --git a/hw/9pfs/cofile.c b/hw/9pfs/cofile.c index 9c5344039e..71174c3e4a 100644 --- a/hw/9pfs/cofile.c +++ b/hw/9pfs/cofile.c @@ -252,7 +252,7 @@ int coroutine_fn v9fs_co_pwritev(V9fsPDU *pdu, V9fsFidState *fidp, if (v9fs_request_cancelled(pdu)) { return -EINTR; } -fsdev_co_throttle_request(s->ctx.fst, true, iov, iovcnt); +fsdev_co_throttle_request(s->ctx.fst, THROTTLE_WRITE, iov, iovcnt); v9fs_co_run_in_worker( { err = s->ops->pwritev(&s->ctx, &fidp->fs, iov, iovcnt, offset); @@ -272,7 +272,7 @@ int coroutine_fn v9fs_co_preadv(V9fsPDU *pdu, V9fsFidState *fidp, if (v9fs_request_cancelled(pdu)) { return -EINTR; } -fsdev_co_throttle_request(s->ctx.fst, false, iov, iovcnt); +fsdev_co_throttle_request(s->ctx.fst, THROTTLE_READ, iov, iovcnt); v9fs_co_run_in_worker( { err = s->ops->preadv(&s->ctx, &fidp->fs, iov, iovcnt, offset); -- 2.41.0
[PULL 05/14] cryptodev: use NULL throttle timer cb for read direction
From: zhenwei pi Operations on a cryptodev are considered as *write* only, the callback of read direction is never invoked. Use NULL instead of an unreachable path(cryptodev_backend_throttle_timer_cb on read direction). The dummy read timer(never invoked) is already removed here, it means that the 'FIXME' tag is no longer needed. Reviewed-by: Alberto Garcia Reviewed-by: Hanna Czenczek Signed-off-by: zhenwei pi Message-Id: <20230728022006.1098509-6-pizhen...@bytedance.com> Signed-off-by: Hanna Czenczek --- backends/cryptodev.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/backends/cryptodev.c b/backends/cryptodev.c index 4d183f7237..c2356550c8 100644 --- a/backends/cryptodev.c +++ b/backends/cryptodev.c @@ -341,8 +341,7 @@ static void cryptodev_backend_set_throttle(CryptoDevBackend *backend, int field, if (!enabled) { throttle_init(&backend->ts); throttle_timers_init(&backend->tt, qemu_get_aio_context(), - QEMU_CLOCK_REALTIME, - cryptodev_backend_throttle_timer_cb, /* FIXME */ + QEMU_CLOCK_REALTIME, NULL, cryptodev_backend_throttle_timer_cb, backend); } -- 2.41.0
[PULL 01/14] throttle: introduce enum ThrottleDirection
From: zhenwei pi Use enum ThrottleDirection instead of number index. Reviewed-by: Alberto Garcia Reviewed-by: Hanna Czenczek Signed-off-by: zhenwei pi Message-Id: <20230728022006.1098509-2-pizhen...@bytedance.com> Signed-off-by: Hanna Czenczek --- include/qemu/throttle.h | 11 --- util/throttle.c | 16 +--- 2 files changed, 17 insertions(+), 10 deletions(-) diff --git a/include/qemu/throttle.h b/include/qemu/throttle.h index 05f6346137..9ca5ab8197 100644 --- a/include/qemu/throttle.h +++ b/include/qemu/throttle.h @@ -99,13 +99,18 @@ typedef struct ThrottleState { int64_t previous_leak;/* timestamp of the last leak done */ } ThrottleState; +typedef enum { +THROTTLE_READ = 0, +THROTTLE_WRITE, +THROTTLE_MAX +} ThrottleDirection; + typedef struct ThrottleTimers { -QEMUTimer *timers[2]; /* timers used to do the throttling */ +QEMUTimer *timers[THROTTLE_MAX];/* timers used to do the throttling */ QEMUClockType clock_type; /* the clock used */ /* Callbacks */ -QEMUTimerCB *read_timer_cb; -QEMUTimerCB *write_timer_cb; +QEMUTimerCB *timer_cb[THROTTLE_MAX]; void *timer_opaque; } ThrottleTimers; diff --git a/util/throttle.c b/util/throttle.c index 81f247a8d1..5642e61763 100644 --- a/util/throttle.c +++ b/util/throttle.c @@ -199,10 +199,12 @@ static bool throttle_compute_timer(ThrottleState *ts, void throttle_timers_attach_aio_context(ThrottleTimers *tt, AioContext *new_context) { -tt->timers[0] = aio_timer_new(new_context, tt->clock_type, SCALE_NS, - tt->read_timer_cb, tt->timer_opaque); -tt->timers[1] = aio_timer_new(new_context, tt->clock_type, SCALE_NS, - tt->write_timer_cb, tt->timer_opaque); +tt->timers[THROTTLE_READ] = +aio_timer_new(new_context, tt->clock_type, SCALE_NS, + tt->timer_cb[THROTTLE_READ], tt->timer_opaque); +tt->timers[THROTTLE_WRITE] = +aio_timer_new(new_context, tt->clock_type, SCALE_NS, + tt->timer_cb[THROTTLE_WRITE], tt->timer_opaque); } /* @@ -236,8 +238,8 @@ void throttle_timers_init(ThrottleTimers *tt, memset(tt, 0, sizeof(ThrottleTimers)); tt->clock_type = clock_type; -tt->read_timer_cb = read_timer_cb; -tt->write_timer_cb = write_timer_cb; +tt->timer_cb[THROTTLE_READ] = read_timer_cb; +tt->timer_cb[THROTTLE_WRITE] = write_timer_cb; tt->timer_opaque = timer_opaque; throttle_timers_attach_aio_context(tt, aio_context); } @@ -256,7 +258,7 @@ void throttle_timers_detach_aio_context(ThrottleTimers *tt) { int i; -for (i = 0; i < 2; i++) { +for (i = 0; i < THROTTLE_MAX; i++) { throttle_timer_destroy(&tt->timers[i]); } } -- 2.41.0
[PULL 04/14] test-throttle: test read only and write only
From: zhenwei pi Reviewed-by: Alberto Garcia Reviewed-by: Hanna Czenczek Signed-off-by: zhenwei pi Message-Id: <20230728022006.1098509-5-pizhen...@bytedance.com> Signed-off-by: Hanna Czenczek --- tests/unit/test-throttle.c | 66 ++ 1 file changed, 66 insertions(+) diff --git a/tests/unit/test-throttle.c b/tests/unit/test-throttle.c index a60b5fe22e..5547837a58 100644 --- a/tests/unit/test-throttle.c +++ b/tests/unit/test-throttle.c @@ -184,6 +184,70 @@ static void test_init(void) throttle_timers_destroy(tt); } +static void test_init_readonly(void) +{ +int i; + +tt = &tgm.throttle_timers; + +/* fill the structures with crap */ +memset(&ts, 1, sizeof(ts)); +memset(tt, 1, sizeof(*tt)); + +/* init structures */ +throttle_init(&ts); +throttle_timers_init(tt, ctx, QEMU_CLOCK_VIRTUAL, + read_timer_cb, NULL, &ts); + +/* check initialized fields */ +g_assert(tt->clock_type == QEMU_CLOCK_VIRTUAL); +g_assert(tt->timers[THROTTLE_READ]); +g_assert(!tt->timers[THROTTLE_WRITE]); + +/* check other fields where cleared */ +g_assert(!ts.previous_leak); +g_assert(!ts.cfg.op_size); +for (i = 0; i < BUCKETS_COUNT; i++) { +g_assert(!ts.cfg.buckets[i].avg); +g_assert(!ts.cfg.buckets[i].max); +g_assert(!ts.cfg.buckets[i].level); +} + +throttle_timers_destroy(tt); +} + +static void test_init_writeonly(void) +{ +int i; + +tt = &tgm.throttle_timers; + +/* fill the structures with crap */ +memset(&ts, 1, sizeof(ts)); +memset(tt, 1, sizeof(*tt)); + +/* init structures */ +throttle_init(&ts); +throttle_timers_init(tt, ctx, QEMU_CLOCK_VIRTUAL, + NULL, write_timer_cb, &ts); + +/* check initialized fields */ +g_assert(tt->clock_type == QEMU_CLOCK_VIRTUAL); +g_assert(!tt->timers[THROTTLE_READ]); +g_assert(tt->timers[THROTTLE_WRITE]); + +/* check other fields where cleared */ +g_assert(!ts.previous_leak); +g_assert(!ts.cfg.op_size); +for (i = 0; i < BUCKETS_COUNT; i++) { +g_assert(!ts.cfg.buckets[i].avg); +g_assert(!ts.cfg.buckets[i].max); +g_assert(!ts.cfg.buckets[i].level); +} + +throttle_timers_destroy(tt); +} + static void test_destroy(void) { int i; @@ -752,6 +816,8 @@ int main(int argc, char **argv) g_test_add_func("/throttle/leak_bucket",test_leak_bucket); g_test_add_func("/throttle/compute_wait", test_compute_wait); g_test_add_func("/throttle/init", test_init); +g_test_add_func("/throttle/init_readonly", test_init_readonly); +g_test_add_func("/throttle/init_writeonly", test_init_writeonly); g_test_add_func("/throttle/destroy",test_destroy); g_test_add_func("/throttle/have_timer", test_have_timer); g_test_add_func("/throttle/detach_attach", test_detach_attach); -- 2.41.0
[PULL 02/14] test-throttle: use enum ThrottleDirection
From: zhenwei pi

Use enum ThrottleDirection in the throttle test code as well.

Reviewed-by: Alberto Garcia
Reviewed-by: Hanna Czenczek
Signed-off-by: zhenwei pi
Message-Id: <20230728022006.1098509-3-pizhen...@bytedance.com>
Signed-off-by: Hanna Czenczek
---
 tests/unit/test-throttle.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tests/unit/test-throttle.c b/tests/unit/test-throttle.c
index 7adb5e6652..a60b5fe22e 100644
--- a/tests/unit/test-throttle.c
+++ b/tests/unit/test-throttle.c
@@ -169,8 +169,8 @@ static void test_init(void)
 
     /* check initialized fields */
     g_assert(tt->clock_type == QEMU_CLOCK_VIRTUAL);
-    g_assert(tt->timers[0]);
-    g_assert(tt->timers[1]);
+    g_assert(tt->timers[THROTTLE_READ]);
+    g_assert(tt->timers[THROTTLE_WRITE]);
 
     /* check other fields where cleared */
     g_assert(!ts.previous_leak);
@@ -191,7 +191,7 @@ static void test_destroy(void)
     throttle_timers_init(tt, ctx, QEMU_CLOCK_VIRTUAL,
                          read_timer_cb, write_timer_cb, &ts);
     throttle_timers_destroy(tt);
-    for (i = 0; i < 2; i++) {
+    for (i = 0; i < THROTTLE_MAX; i++) {
         g_assert(!tt->timers[i]);
     }
 }
--
2.41.0
[PULL 00/14] Block patches
The following changes since commit f5fe7c17ac4e309e47e78f0f9761aebc8d2f2c81:

  Merge tag 'pull-tcg-20230823-2' of https://gitlab.com/rth7680/qemu into staging (2023-08-28 16:07:04 -0400)

are available in the Git repository at:

  https://gitlab.com/hreitz/qemu.git tags/pull-block-2023-09-01

for you to fetch changes up to 380448464dd89291cf7fd7434be6c225482a334d:

  tests/file-io-error: New test (2023-08-29 13:01:24 +0200)

----------------------------------------------------------------
Block patches

- Fix for file-posix's zoning code crashing on I/O errors
- Throttling refactoring

----------------------------------------------------------------
Hanna Czenczek (5):
  file-posix: Clear bs->bl.zoned on error
  file-posix: Check bs->bl.zoned for zone info
  file-posix: Fix zone update in I/O error path
  file-posix: Simplify raw_co_prw's 'out' zone code
  tests/file-io-error: New test

Zhenwei Pi (9):
  throttle: introduce enum ThrottleDirection
  test-throttle: use enum ThrottleDirection
  throttle: support read-only and write-only
  test-throttle: test read only and write only
  cryptodev: use NULL throttle timer cb for read direction
  throttle: use enum ThrottleDirection instead of bool is_write
  throttle: use THROTTLE_MAX/ARRAY_SIZE for hard code
  fsdev: Use ThrottleDirection instread of bool is_write
  block/throttle-groups: Use ThrottleDirection instread of bool is_write

 fsdev/qemu-fsdev-throttle.h                |   4 +-
 include/block/throttle-groups.h            |   6 +-
 include/qemu/throttle.h                    |  16 +-
 backends/cryptodev.c                       |  12 +-
 block/block-backend.c                      |   4 +-
 block/file-posix.c                         |  42 +++---
 block/throttle-groups.c                    | 163 +++++++++++++--------
 block/throttle.c                           |   8 +-
 fsdev/qemu-fsdev-throttle.c                |  18 ++-
 hw/9pfs/cofile.c                           |   4 +-
 tests/unit/test-throttle.c                 |  76 +++++++++-
 util/throttle.c                            |  84 ++++++-----
 tests/qemu-iotests/tests/file-io-error     | 119 ++++++++++++++++
 tests/qemu-iotests/tests/file-io-error.out |  33 +++++
 14 files changed, 418 insertions(+), 171 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/file-io-error
 create mode 100644 tests/qemu-iotests/tests/file-io-error.out
--
2.41.0
[PULL 03/14] throttle: support read-only and write-only
From: zhenwei pi

Only one direction is necessary in several scenarios:
- a read-only disk
- operations on a device that are considered *write* only. For example,
  the encrypt/decrypt/sign/verify operations on a cryptodev use a single
  *write* timer (a read timer callback is defined, but never invoked).

Allow a single direction in throttle. This reduces memory, and the upper
layer no longer needs a dummy callback.

Reviewed-by: Alberto Garcia
Reviewed-by: Hanna Czenczek
Signed-off-by: zhenwei pi
Message-Id: <20230728022006.1098509-4-pizhen...@bytedance.com>
Signed-off-by: Hanna Czenczek
---
 util/throttle.c | 42 ++++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

diff --git a/util/throttle.c b/util/throttle.c
index 5642e61763..0439028d21 100644
--- a/util/throttle.c
+++ b/util/throttle.c
@@ -199,12 +199,15 @@ static bool throttle_compute_timer(ThrottleState *ts,
 void throttle_timers_attach_aio_context(ThrottleTimers *tt,
                                         AioContext *new_context)
 {
-    tt->timers[THROTTLE_READ] =
-        aio_timer_new(new_context, tt->clock_type, SCALE_NS,
-                      tt->timer_cb[THROTTLE_READ], tt->timer_opaque);
-    tt->timers[THROTTLE_WRITE] =
-        aio_timer_new(new_context, tt->clock_type, SCALE_NS,
-                      tt->timer_cb[THROTTLE_WRITE], tt->timer_opaque);
+    ThrottleDirection dir;
+
+    for (dir = THROTTLE_READ; dir < THROTTLE_MAX; dir++) {
+        if (tt->timer_cb[dir]) {
+            tt->timers[dir] =
+                aio_timer_new(new_context, tt->clock_type, SCALE_NS,
+                              tt->timer_cb[dir], tt->timer_opaque);
+        }
+    }
 }
 
 /*
@@ -235,6 +238,7 @@ void throttle_timers_init(ThrottleTimers *tt,
                           QEMUTimerCB *write_timer_cb,
                           void *timer_opaque)
 {
+    assert(read_timer_cb || write_timer_cb);
     memset(tt, 0, sizeof(ThrottleTimers));
 
     tt->clock_type = clock_type;
@@ -247,7 +251,9 @@ void throttle_timers_init(ThrottleTimers *tt,
 /* destroy a timer */
 static void throttle_timer_destroy(QEMUTimer **timer)
 {
-    assert(*timer != NULL);
+    if (*timer == NULL) {
+        return;
+    }
 
     timer_free(*timer);
     *timer = NULL;
@@ -256,10 +262,10 @@ static void throttle_timer_destroy(QEMUTimer **timer)
 /* Remove timers from event loop */
 void throttle_timers_detach_aio_context(ThrottleTimers *tt)
 {
-    int i;
+    ThrottleDirection dir;
 
-    for (i = 0; i < THROTTLE_MAX; i++) {
-        throttle_timer_destroy(&tt->timers[i]);
+    for (dir = THROTTLE_READ; dir < THROTTLE_MAX; dir++) {
+        throttle_timer_destroy(&tt->timers[dir]);
     }
 }
 
@@ -272,8 +278,12 @@ void throttle_timers_destroy(ThrottleTimers *tt)
 /* is any throttling timer configured */
 bool throttle_timers_are_initialized(ThrottleTimers *tt)
 {
-    if (tt->timers[0]) {
-        return true;
+    ThrottleDirection dir;
+
+    for (dir = THROTTLE_READ; dir < THROTTLE_MAX; dir++) {
+        if (tt->timers[dir]) {
+            return true;
+        }
     }
 
     return false;
@@ -424,8 +434,12 @@ bool throttle_schedule_timer(ThrottleState *ts,
 {
     int64_t now = qemu_clock_get_ns(tt->clock_type);
     int64_t next_timestamp;
+    QEMUTimer *timer;
     bool must_wait;
 
+    timer = is_write ? tt->timers[THROTTLE_WRITE] : tt->timers[THROTTLE_READ];
+    assert(timer);
+
     must_wait = throttle_compute_timer(ts,
                                        is_write,
                                        now,
@@ -437,12 +451,12 @@ bool throttle_schedule_timer(ThrottleState *ts,
     }
 
     /* request throttled and timer pending -> do nothing */
-    if (timer_pending(tt->timers[is_write])) {
+    if (timer_pending(timer)) {
         return true;
     }
 
     /* request throttled and timer not pending -> arm timer */
-    timer_mod(tt->timers[is_write], next_timestamp);
+    timer_mod(timer, next_timestamp);
     return true;
 }
--
2.41.0
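As a usage sketch (not taken from this series; the callback name is invented), a write-only user such as the cryptodev case described above would now simply pass NULL for the read callback:

    #include "qemu/osdep.h"
    #include "qemu/throttle.h"

    static void my_write_timer_cb(void *opaque)
    {
        /* restart throttled write requests here */
    }

    /* With this patch, only the write timer is created; the read slot
     * stays NULL, and no dummy read callback is needed any more. */
    static void init_write_only(ThrottleTimers *tt, ThrottleState *ts,
                                AioContext *ctx, void *opaque)
    {
        throttle_init(ts);
        throttle_timers_init(tt, ctx, QEMU_CLOCK_VIRTUAL,
                             NULL, my_write_timer_cb, opaque);
    }

The assert() added to throttle_schedule_timer() is the flip side of this: a caller that registered only a write callback must never schedule the read timer.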
Re: [PATCH v2 2/3] qemu-img: map: report compressed data blocks
On 29.08.23 08:44, Andrey Drobyshev wrote:
> On 8/25/23 17:14, Hanna Czenczek wrote:
>> On 06.07.23 18:30, Andrey Drobyshev wrote:
>>> Right now "qemu-img map" reports compressed blocks as containing data
>>> but having no host offset. This is not very informative. Instead,
>>> let's add another boolean field named "compressed" in case JSON output
>>> mode is specified. This is achieved by utilizing the new allocation
>>> status flag BDRV_BLOCK_COMPRESSED for bdrv_block_status().
>>>
>>> Signed-off-by: Andrey Drobyshev
>>> ---
>>>  qapi/block-core.json |  7 +++++--
>>>  qemu-img.c           | 16 +++++++++++---
>>>  2 files changed, 18 insertions(+), 5 deletions(-)
>>
>> Patch 3 must be merged into this patch. Every test must pass on every
>> commit so we don't break bisecting.
>
> Agreed, should've figured that myself.
>
>>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>>> index 5dd5f7e4b0..b263d2cd30 100644
>>> --- a/qapi/block-core.json
>>> +++ b/qapi/block-core.json
>>> @@ -409,6 +409,9 @@
>>>  #
>>>  # @zero: whether the virtual blocks read as zeroes
>>>  #
>>> +# @compressed: true indicates that data is stored compressed. Optional,
>>> +#     only valid for the formats whith support compression
>>
>> This is missing information for when this field was introduced (i.e. a
>> "(since 8.2)").
>
> Noted.
>
>> I also wonder why this field is optional. We have compression
>> information even for formats that don't support compression,
>> specifically, nothing is compressed. I would just make this field
>> mandatory and print it always.
>>
>> (A technical reason to do so is that this patch uses
>> block_driver_can_compress() to figure out whether there is compression
>> support; but that function only tells whether the driver can write
>> compressed data. Even if it cannot do that, the format may still
>> support compression, and the driver may be able to read compressed
>> data, just not write it.)
>
> I figured that for the formats which surely can't support compression,
> such as "raw", this information would simply be redundant. AFAICT right
> now the only drivers which can read compressed data are exactly the
> ones which can write it: vmdk, qcow and qcow2. But if we assume that
> this might change, and that we'd better show the field value no matter
> what (as we do with the "zero" field) -- I'm OK with it.

It is indeed redundant, but as this is intended to be the machine-readable output, I don’t mind the output becoming arguably needlessly verbose.
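For illustration, with the field made mandatory as discussed, "qemu-img map --output=json" output would look roughly like this (offsets, sizes, and the image name are invented):

    $ qemu-img map --output=json image.qcow2
    [{ "start": 0, "length": 65536, "depth": 0, "present": true,
       "zero": false, "data": true, "compressed": true},
     { "start": 65536, "length": 65536, "depth": 0, "present": true,
       "zero": false, "data": true, "compressed": false, "offset": 327680}]

Note that the compressed extent carries no "offset" key, which is exactly the behavior the commit message calls "not very informative" on its own: the new "compressed" flag makes the reason for the missing offset explicit.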
Re: [PATCH 3/6] qemu-img: rebase: use backing files' BlockBackend for buffer alignment
On 29.08.23 09:06, Andrey Drobyshev wrote:
> On 8/25/23 17:29, Hanna Czenczek wrote:
>> On 01.06.23 21:28, Andrey Drobyshev via wrote:
>>> Since commit bb1c05973cf ("qemu-img: Use qemu_blockalign"), buffers
>>> for the data read from the old and new backing files are aligned
>>> using the BlockDriverState (or BlockBackend later on) referring to
>>> the target image. However, this isn't quite right, because the target
>>> image is only being written to and has nothing to do with those
>>> buffers. Let's fix that.
>>
>> I don't understand. The write to the target image does use one of
>> those buffers (buf_old, specifically). This change is correct for
>> buf_new/blk_new_backing, but for buf_old, in theory, we need a buffer
>> that fulfills both the alignment requirements of blk and
>> blk_old_backing. (Not that this patch really makes the situation worse
>> for buf_old.)
>>
>> Hanna
>
> Hmm, you're right. In which case the right thing to do would probably
> be something like:
>
>      float local_progress = 0;
>
> -    buf_old = blk_blockalign(blk, IO_BUF_SIZE);
> -    buf_new = blk_blockalign(blk, IO_BUF_SIZE);
> +    if (bdrv_opt_mem_align(blk_bs(blk)) >
> +        bdrv_opt_mem_align(blk_bs(blk_old_backing))) {
> +        buf_old = blk_blockalign(blk, IO_BUF_SIZE);
> +    } else {
> +        buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE);
> +    }
> +    buf_new = blk_blockalign(blk_new_backing, IO_BUF_SIZE);
>
>      size = blk_getlength(blk);
>
> I'll include this in v2 if you don't have any objections.

Looks good to me, thanks!
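For readers following along: the constraint Hanna describes (buf_old must satisfy both backends' alignment requirements) can also be written explicitly by taking the stricter of the two values. A sketch of that alternative (the helper name is invented; this is not the code proposed for v2):

    #include "qemu/osdep.h"
    #include "qemu/memalign.h"
    #include "block/block.h"
    #include "sysemu/block-backend.h"

    /* Sketch: allocate a buffer aligned for both the target (blk) and
     * the old backing file, since buf_old is read from one and written
     * to the other. */
    static void *blockalign_for_both(BlockBackend *blk,
                                     BlockBackend *blk_old_backing,
                                     size_t size)
    {
        size_t align = MAX(bdrv_opt_mem_align(blk_bs(blk)),
                           bdrv_opt_mem_align(blk_bs(blk_old_backing)));

        return qemu_memalign(align, size);
    }

Since memory alignment requirements are powers of two in practice, the larger of the two values satisfies both, which is what the if/else in the quoted snippet relies on as well.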