Re: [PATCH for-9.0?] usb-storage: Fix BlockConf defaults
On 12.04.24 16:42, Kevin Wolf wrote:

Commit 30896374 started to pass the full BlockConf from usb-storage to scsi-disk, while previously only a few select properties would be forwarded. This enables the user to set more properties, e.g. the block size, that actually take effect.

However, now the calls to blkconf_apply_backend_options() and blkconf_blocksizes() in usb_msd_storage_realize() that modify some of these properties take effect, too, instead of being silently ignored. This means at least that the block sizes get an unconditional default of 512 bytes before the configuration is passed to scsi-disk. Before commit 30896374, the property wouldn't be set for scsi-disk and therefore the device-dependent defaults would apply: 512 for scsi-hd, but 2048 for scsi-cd. The latter default has now become 512, too, which makes at least Windows 11 installation fail when installing from usb-storage.

Fix this by simply not calling these functions any more in usb-storage and passing BlockConf on unmodified (except for the BlockBackend). The same functions are called by the SCSI code anyway, and it sets the right defaults for the actual media type.

Fixes: 308963746169 ('scsi: Don't ignore most usb-storage properties')
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2260
Reported-by: Jonas Svensson
Signed-off-by: Kevin Wolf
---
Considering this a candidate for 9.0 given that we already have an rc4: it's a regression from 8.2 and breaks installing Windows from USB.

 hw/usb/dev-storage-classic.c | 9 ---------
 1 file changed, 9 deletions(-)

Reviewed-by: Hanna Czenczek
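A rough sketch of the change described above (argument lists abbreviated with "..."; the actual patch is the 9-line deletion in hw/usb/dev-storage-classic.c). scsi-disk calls both functions during its own realize and applies the media-dependent block-size defaults (512 for scsi-hd, 2048 for scsi-cd):

 static void usb_msd_storage_realize(USBDevice *dev, Error **errp)
 {
     ...
-    if (!blkconf_apply_backend_options(&s->conf, ..., errp)) {
-        return;
-    }
-    if (!blkconf_blocksizes(&s->conf, errp)) {
-        return;
-    }
     ...
 }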
Re: [PATCH 0/2] block: Allow concurrent BB context changes
On 10.02.24 09:46, Michael Tokarev wrote: 09.02.2024 19:51, Hanna Czenczek : On 09.02.24 15:08, Michael Tokarev wrote: 02.02.2024 17:47, Hanna Czenczek : Hi, Without the AioContext lock, a BB's context may kind of change at any time (unless it has a root node, and I/O requests are pending). That also means that its own context (BlockBackend.ctx) and that of its root node can differ sometimes (while the context is being changed). How relevant is this for -stable (8.2 at least), which does not have the "scsi: eliminate AioContext lock" patchset, and in particular v8.2.0-124-geaad0fe260 "scsi: only access SCSIDevice->requests from one thread"? The issue the first patch "block-backend: Allow concurrent context changes" fixes (RHEL-19381) seems to be for 8.1.something, so it exists in 8.2 too, and this particular fix applies to 8.2. But with other changes around all this, I'm a bit lost as to what should be done on stable. Not even thinking about 7.2 here :) Ah, sorry, yes. Since we do still have the AioContext lock, this series won't be necessary in -stable. Sorry for the noise! Hm. Now I'm confused even more.. :) ad89367202 "block-backend: Allow concurrent context changes" - the first one in this series - apparently is needed, as it fixes an issue reported for qemu 8.1 (https://issues.redhat.com/browse/RHEL-19381). Or is it not the case? Ah, yes, I got confused there. There are two (unfortunately? fortunately? Red-Hat-internal) comments, one of which describes the crash that's fixed here, so I thought that bug described this crash. But the actual description in the report describes something different (more like what's fixed by https://lists.nongnu.org/archive/html/qemu-devel/2024-01/msg03649.html, but I'm not entirely sure yet). So basically I got the bug link wrong. We now have https://issues.redhat.com/browse/RHEL-24593, which has been reported only against 8.2. Hanna FWIW, truth is born in the noise, not in silence ;) Thanks, /mjt
Re: [PATCH v2 3/3] virtio-blk: Use ioeventfd_attach in start_ioeventfd
On 09.02.24 15:38, Michael Tokarev wrote: 02.02.2024 18:31, Hanna Czenczek : Commit d3f6f294aeadd5f88caf0155e4360808c95b3146 ("virtio-blk: always set ioeventfd during startup") has made virtio_blk_start_ioeventfd() always kick the virtqueue (set the ioeventfd), regardless of whether the BB is drained. That is no longer necessary, because attaching the host notifier will now set the ioeventfd, too; this happens either immediately right here in virtio_blk_start_ioeventfd(), or later when the drain ends, in virtio_blk_ioeventfd_attach(). With event_notifier_set() removed, the code becomes the same as the one in virtio_blk_ioeventfd_attach(), so we can reuse that function. The mentioned commit is v8.2.0-812-gd3f6f294ae, i.e. past 8.2. Is this new change still relevant for stable? Sorry again. :/ This patch is a clean-up patch that won't apply to 8.2. Now, 8.2 does have basically the same logic as described in the patch message (d3f6f294aea restored it after it was broken), so a similar patch could be made for it (removing the event_notifier_set() from virtio_blk_data_plane_start()), but whether we kick the virtqueues once or twice on start-up probably won't make a difference, certainly not in terms of correctness. Hanna
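For reference, such an 8.2-only patch would presumably just drop the kick from virtio_blk_data_plane_start(); a sketch, since the exact surrounding context in 8.2's hw/block/dataplane/virtio-blk.c is not reproduced here (the two removed lines appear verbatim in the master-branch hunk in patch 3/3 below):

-        /* Kick right away to begin processing requests already in vring */
-        event_notifier_set(virtio_queue_get_host_notifier(vq));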
Re: [PATCH 0/2] block: Allow concurrent BB context changes
On 09.02.24 15:08, Michael Tokarev wrote: 02.02.2024 17:47, Hanna Czenczek : Hi, Without the AioContext lock, a BB's context may kind of change at any time (unless it has a root node, and I/O requests are pending). That also means that its own context (BlockBackend.ctx) and that of its root node can differ sometimes (while the context is being changed). How relevant is this for -stable (8.2 at least), which does not have the "scsi: eliminate AioContext lock" patchset, and in particular v8.2.0-124-geaad0fe260 "scsi: only access SCSIDevice->requests from one thread"? The issue the first patch "block-backend: Allow concurrent context changes" fixes (RHEL-19381) seems to be for 8.1.something, so it exists in 8.2 too, and this particular fix applies to 8.2. But with other changes around all this, I'm a bit lost as to what should be done on stable. Not even thinking about 7.2 here :) Ah, sorry, yes. Since we do still have the AioContext lock, this series won't be necessary in -stable. Sorry for the noise! Hanna
Re: [PATCH 0/2] block: Allow concurrent BB context changes
On 06.02.24 17:53, Stefan Hajnoczi wrote: On Fri, Feb 02, 2024 at 03:47:53PM +0100, Hanna Czenczek wrote: Hi, Without the AioContext lock, a BB's context may kind of change at any time (unless it has a root node, and I/O requests are pending). That also means that its own context (BlockBackend.ctx) and that of its root node can differ sometimes (while the context is being changed). blk_get_aio_context() doesn't know this yet and asserts that both are always equal (if there is a root node). Because it's no longer true, and because callers don't seem to really care about the root node's context, we can and should remove the assertion and just return the BB's context. Beyond that, the question is whether the callers of blk_get_aio_context() are OK with the context potentially changing concurrently. Honestly, it isn't entirely clear to me; most look OK, except for the virtio-scsi code, which operates under the general assumption that the BB's context is always equal to that of the virtio-scsi device. I doubt that this assumption always holds (it is definitely not obvious to me that it would), but then again, this series will not make matters worse in that regard, and that is what counts for me now. One clear point of contention is scsi_device_for_each_req_async(), which is addressed by patch 2. Right now, it schedules a BH in the BB context, then the BH double-checks whether the context still fits, and if not, re-schedules itself. Because virtio-scsi's context is fixed, this seems to indicate to me that it wants to be able to deal with a case where BB and virtio-scsi context differ, which seems to break that aforementioned general virtio-scsi assumption. I don't agree with the last sentence: virtio-scsi's context isn't fixed. The AioContext changes when dataplane is started/stopped. virtio-scsi switches AioContext between the IOThread's AioContext and the main loop's qemu_aio_context. However, virtio-scsi virtqueue processing only happens in the IOThread's AioContext. Maybe this is what you meant when you said the AioContext is fixed? Specifically, I meant VirtIOSCSI.ctx, which is set only once in virtio_scsi_dataplane_setup(). That’s at least where the virtqueue notifiers are registered, so yes, virtqueue processing should at least be fixed to that context. It seems like it’s always possible some things are processed in the main thread (not just setup/teardown, but also e.g. TMF_LOGICAL_UNIT_RESET), so to me it seems like virtio-scsi kind of runs in two contexts simultaneously. Yes, when virtqueue processing is paused, all processing in VirtIOSCSI.ctx is stopped, but I wouldn’t say it switches contexts there. It just stops processing some requests. Either way, virtio-scsi request processing doesn’t stop just because a scsi-hd device is hot-plugged or -unplugged. If the BB changes contexts in the hot-unplug path (while vq request processing is continuing in the I/O thread), its context will differ from that of virtio-scsi. So should I just replace the “the context is fixed” wording and say that in this specific instance, virtio-scsi vq processing continues in the I/O thread? The BH function is aware that the current AioContext might not be the same as the AioContext at the time the BH was scheduled. That doesn't break assumptions in the code.
(It may be possible to rewrite virtio-blk, virtio-scsi, and core VirtIODevice ioeventfd code to use the simpler model where the AioContext really is fixed because things have changed significantly over the years, but I looked a few weeks ago and it's difficult work.) I'm just pointing out that I think this description is incomplete. I *do* agree with what this patch series is doing :). Well, this description won’t land in any commit log, so from my side, I’m not too worried about its correctness. O:) Hanna Unfortunately, I definitely have to touch that code, because accepting concurrent changes of AioContexts breaks the double-check (just because the BB has the right context in that place does not mean it stays in that context); instead, we must prevent any concurrent change until the BH is done. Because changing contexts generally involves a drained section, we can prevent it by keeping the BB in-flight counter elevated. Question is, how to reason for that. I’d really rather not include the need to follow the BB context in my argument, because I find that part a bit fishy. Luckily, there’s a second, completely different reason for having scsi_device_for_each_req_async() increment the in-flight counter: Specifically, scsi_device_purge_requests() probably wants to await full completion of scsi_device_for_each_req_async(), and we can do that most easily in the very same way by incrementing the in-flight counter. This way, the blk_drain() in scsi_device_purge_requests() will not only await all (cancelled) I/O requests, but also the non-I/O requests. The fact that this prevents the BB AioContext from changing while the BH is scheduled/running then is just a nice side effect.
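In code, the counter scheme described above boils down to the following pattern (a minimal sketch; the real implementation is patch 2 of this series, "scsi: Await request purging"):

    /* Scheduling side: pin the BB's AioContext, then schedule the BH there */
    blk_inc_in_flight(s->conf.blk);
    aio_bh_schedule_oneshot(blk_get_aio_context(s->conf.blk), bh_fn, data);

    /* BH side (in bh_fn): the context cannot have changed, because changing
     * it requires a drained section, and that drain waits for the in-flight
     * counter to drop */
    assert(blk_get_aio_context(s->conf.blk) == qemu_get_current_aio_context());
    ...
    blk_dec_in_flight(s->conf.blk);  /* lets blk_drain() settle everything */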
Re: [PATCH] virtio-blk: do not use C99 mixed declarations
On 06.02.24 15:04, Stefan Hajnoczi wrote: QEMU's coding style generally forbids C99 mixed declarations. Signed-off-by: Stefan Hajnoczi --- hw/block/virtio-blk.c | 25 ++--- 1 file changed, 14 insertions(+), 11 deletions(-) Reviewed-by: Hanna Czenczek
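For illustration of the style rule (a generic example, not a hunk from this patch):

#include <stdio.h>

/* C99 mixed declaration, generally forbidden by QEMU's coding style: */
void bad(int x)
{
    printf("%d\n", x);
    int y = x + 1;   /* declaration after a statement */
    printf("%d\n", y);
}

/* QEMU style: declarations at the top of the block */
void good(int x)
{
    int y;

    printf("%d\n", x);
    y = x + 1;
    printf("%d\n", y);
}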
Re: [PATCH 5/5] monitor: use aio_co_reschedule_self()
On 05.02.24 18:26, Stefan Hajnoczi wrote: The aio_co_reschedule_self() API is designed to avoid the race condition between scheduling the coroutine in another AioContext and yielding. The QMP dispatch code uses the open-coded version that appears susceptible to the race condition at first glance:

  aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
  qemu_coroutine_yield();

The code is actually safe because the iohandler and qemu_aio_context AioContext run under the Big QEMU Lock. Nevertheless, set a good example and use aio_co_reschedule_self() so it's obvious that there is no race.

Suggested-by: Hanna Reitz
Signed-off-by: Stefan Hajnoczi
---
 qapi/qmp-dispatch.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

Reviewed-by: Hanna Czenczek
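Concretely, the replacement is presumably:

-    aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
-    qemu_coroutine_yield();
+    aio_co_reschedule_self(qemu_get_aio_context());

aio_co_reschedule_self() also returns immediately when the coroutine is already running in the target context, so no extra check is needed at the call site.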
Re: [PATCH 4/5] virtio-blk: declare VirtIOBlock::rq with a type
On 05.02.24 18:26, Stefan Hajnoczi wrote: The VirtIOBlock::rq field has had the type void * since its introduction in commit 869a5c6df19a ("Stop VM on error in virtio-blk. (Gleb Natapov)"). Perhaps this was done to avoid the forward declaration of VirtIOBlockReq. Hanna Czenczek pointed out the missing type. Specify the actual type because there is no need to use void * here. Suggested-by: Hanna Czenczek Signed-off-by: Stefan Hajnoczi --- include/hw/virtio/virtio-blk.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: Hanna Czenczek
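The change itself is presumably a one-liner in include/hw/virtio/virtio-blk.h (a sketch; whether a forward declaration of struct VirtIOBlockReq is also needed depends on the surrounding header, as the commit message hints):

-    void *rq;
+    struct VirtIOBlockReq *rq;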
Re: [PATCH 3/5] virtio-blk: add vq_rq[] bounds check in virtio_blk_dma_restart_cb()
On 05.02.24 18:26, Stefan Hajnoczi wrote: Hanna Czenczek noted that the array index in virtio_blk_dma_restart_cb() is not bounds-checked:

  g_autofree VirtIOBlockReq **vq_rq = g_new0(VirtIOBlockReq *, num_queues);
  ...
  while (rq) {
      VirtIOBlockReq *next = rq->next;
      uint16_t idx = virtio_get_queue_index(rq->vq);

      rq->next = vq_rq[idx];
                 ^^

The code is correct because both rq->vq and vq_rq[] depend on num_queues, but this is indirect and not 100% obvious. Add an assertion.

Suggested-by: Hanna Czenczek
Signed-off-by: Stefan Hajnoczi
---
 hw/block/virtio-blk.c | 1 +
 1 file changed, 1 insertion(+)

Reviewed-by: Hanna Czenczek
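The added assertion presumably sits right before the array access, along these lines (a sketch; the comment wording is an assumption):

      uint16_t idx = virtio_get_queue_index(rq->vq);

+     /* Only num_queues vqs were created, so vq_rq[] bounds are safe */
+     assert(idx < num_queues);
+
      rq->next = vq_rq[idx];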
Re: [PATCH 2/5] virtio-blk: clarify that there is at least 1 virtqueue
On 05.02.24 18:26, Stefan Hajnoczi wrote: It is not possible to instantiate a virtio-blk device with 0 virtqueues. The following check is located in ->realize():

  if (!conf->num_queues) {
      error_setg(errp, "num-queues property must be larger than 0");
      return;
  }

Later on we access s->vq_aio_context[0] under the assumption that there is at least one virtqueue. Hanna Czenczek noted that it would help to show that the array index is already valid. Add an assertion to document that s->vq_aio_context[0] is always safe...and catch future code changes that break this assumption.

Suggested-by: Hanna Czenczek
Signed-off-by: Stefan Hajnoczi
---
 hw/block/virtio-blk.c | 1 +
 1 file changed, 1 insertion(+)

Reviewed-by: Hanna Czenczek
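The assertion is presumably a single line before the first use of the array (a sketch; names follow the surrounding code):

+     assert(s->conf.num_queues > 0);  /* ->realize() rejects num_queues == 0 */
      ... s->vq_aio_context[0] ...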
Re: [PATCH 1/5] virtio-blk: enforce iothread-vq-mapping validation
On 05.02.24 18:26, Stefan Hajnoczi wrote: Hanna Czenczek noticed that the safety of `vq_aio_context[vq->value] = ctx;` with user-defined vq->value inputs is not obvious. The code is structured in validate() + apply() steps so input validation is there, but it happens way earlier and there is nothing that guarantees apply() can only be called with validated inputs. This patch moves the validate() call inside the apply() function so validation is guaranteed. I also added the bounds checking assertion that Hanna suggested.

Signed-off-by: Stefan Hajnoczi
---
 hw/block/virtio-blk.c | 192 +++---
 1 file changed, 107 insertions(+), 85 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 227d83569f..e8b37fd5f4 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c

[...]

@@ -1660,6 +1681,14 @@ static bool virtio_blk_vq_aio_context_init(VirtIOBlock *s, Error **errp)
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
 
+    if (conf->iothread && conf->iothread_vq_mapping_list) {
+        if (conf->iothread) {

This inner condition should probably be dropped. With that done:

Reviewed-by: Hanna Czenczek

+            error_setg(errp, "iothread and iothread-vq-mapping properties "
+                       "cannot be set at the same time");
+            return false;
+        }
+    }
+
     if (conf->iothread || conf->iothread_vq_mapping_list) {
         if (!k->set_guest_notifiers || !k->ioeventfd_assign) {
             error_setg(errp,
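With the redundant inner check dropped as suggested, the hunk would presumably read:

    if (conf->iothread && conf->iothread_vq_mapping_list) {
        error_setg(errp, "iothread and iothread-vq-mapping properties "
                   "cannot be set at the same time");
        return false;
    }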
[PATCH v2 2/3] virtio: Re-enable notifications after drain
During drain, we do not care about virtqueue notifications, which is why we remove the handlers on it. When removing those handlers, whether vq notifications are enabled or not depends on whether we were in polling mode or not; if not, they are enabled (by default); if so, they have been disabled by the io_poll_start callback.

Because we do not care about those notifications after removing the handlers, this is fine. However, we have to explicitly ensure they are enabled when re-attaching the handlers, so we will resume receiving notifications. We do this in virtio_queue_aio_attach_host_notifier*(). If such a function is called while we are in a polling section, attaching the notifiers will then invoke the io_poll_start callback, re-disabling notifications.

Because we will always miss virtqueue updates in the drained section, we also need to poll the virtqueue once after attaching the notifiers.

Buglink: https://issues.redhat.com/browse/RHEL-3934
Signed-off-by: Hanna Czenczek
---
 include/block/aio.h |  7 ++-
 hw/virtio/virtio.c  | 42 ++
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 5d0a114988..8378553eb9 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -480,9 +480,14 @@ void aio_set_event_notifier(AioContext *ctx,
                             AioPollFn *io_poll,
                             EventNotifierHandler *io_poll_ready);
 
-/* Set polling begin/end callbacks for an event notifier that has already been
+/*
+ * Set polling begin/end callbacks for an event notifier that has already been
  * registered with aio_set_event_notifier. Do nothing if the event notifier is
  * not registered.
+ *
+ * Note that if the io_poll_end() callback (or the entire notifier) is removed
+ * during polling, it will not be called, so an io_poll_begin() is not
+ * necessarily always followed by an io_poll_end().
  */
 void aio_set_event_notifier_poll(AioContext *ctx,
                                  EventNotifier *notifier,
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 7549094154..d229755eae 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3556,6 +3556,17 @@ static void virtio_queue_host_notifier_aio_poll_end(EventNotifier *n)
 
 void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
+    /*
+     * virtio_queue_aio_detach_host_notifier() can leave notifications disabled.
+     * Re-enable them. (And if detach has not been used before, notifications
+     * being enabled is still the default state while a notifier is attached;
+     * see virtio_queue_host_notifier_aio_poll_end(), which will always leave
+     * notifications enabled once the polling section is left.)
+     */
+    if (!virtio_queue_get_notification(vq)) {
+        virtio_queue_set_notification(vq, 1);
+    }
+
     aio_set_event_notifier(ctx, &vq->host_notifier,
                            virtio_queue_host_notifier_read,
                            virtio_queue_host_notifier_aio_poll,
@@ -3563,6 +3574,13 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
     aio_set_event_notifier_poll(ctx, &vq->host_notifier,
                                 virtio_queue_host_notifier_aio_poll_begin,
                                 virtio_queue_host_notifier_aio_poll_end);
+
+    /*
+     * We will have ignored notifications about new requests from the guest
+     * while no notifiers were attached, so "kick" the virt queue to process
+     * those requests now.
+     */
+    event_notifier_set(&vq->host_notifier);
 }
 
 /*
@@ -3573,14 +3591,38 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
  */
 void virtio_queue_aio_attach_host_notifier_no_poll(VirtQueue *vq, AioContext *ctx)
 {
+    /* See virtio_queue_aio_attach_host_notifier() */
+    if (!virtio_queue_get_notification(vq)) {
+        virtio_queue_set_notification(vq, 1);
+    }
+
     aio_set_event_notifier(ctx, &vq->host_notifier,
                            virtio_queue_host_notifier_read,
                            NULL, NULL);
+
+    /*
+     * See virtio_queue_aio_attach_host_notifier().
+     * Note that this may be unnecessary for the type of virtqueues this
+     * function is used for. Still, it will not hurt to have a quick look into
+     * whether we can/should process any of the virtqueue elements.
+     */
+    event_notifier_set(&vq->host_notifier);
 }
 
 void virtio_queue_aio_detach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
     aio_set_event_notifier(ctx, &vq->host_notifier, NULL, NULL, NULL);
+
+    /*
+     * aio_set_event_notifier_poll() does not guarantee whether io_poll_end()
+     * will run after io_poll_begin(), so by removing the notifier, we do not
+     * know whether virtio_queue_host_notifier_aio_poll_end() has run after a
+     * previous virtio_queue_host_notifier_aio_poll_begin(), i.e. whether
+     * notifications are enabled or disabled.
[PATCH v2 1/3] virtio-scsi: Attach event vq notifier with no_poll
As of commit 38738f7dbbda90fbc161757b7f4be35b52205552 ("virtio-scsi: don't waste CPU polling the event virtqueue"), we only attach an io_read notifier for the virtio-scsi event virtqueue instead, and no polling notifiers. During operation, the event virtqueue is typically non-empty, but none of the buffers are intended to be used immediately. Instead, they only get used when certain events occur. Therefore, it makes no sense to continuously poll it when non-empty, because it is supposed to be and stay non-empty.

We do this by using virtio_queue_aio_attach_host_notifier_no_poll() instead of virtio_queue_aio_attach_host_notifier() for the event virtqueue.

Commit 766aa2de0f29b657148e04599320d771c36fd126 ("virtio-scsi: implement BlockDevOps->drained_begin()") however has virtio_scsi_drained_end() use virtio_queue_aio_attach_host_notifier() for all virtqueues, including the event virtqueue. This can lead to it being polled again, undoing the benefit of commit 38738f7dbbda90fbc161757b7f4be35b52205552.

Fix it by using virtio_queue_aio_attach_host_notifier_no_poll() for the event virtqueue.

Reported-by: Fiona Ebner
Fixes: 766aa2de0f29b657148e04599320d771c36fd126 ("virtio-scsi: implement BlockDevOps->drained_begin()")
Reviewed-by: Stefan Hajnoczi
Tested-by: Fiona Ebner
Reviewed-by: Fiona Ebner
Signed-off-by: Hanna Czenczek
---
 hw/scsi/virtio-scsi.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 690aceec45..9f02ceea09 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -1149,6 +1149,7 @@ static void virtio_scsi_drained_begin(SCSIBus *bus)
 static void virtio_scsi_drained_end(SCSIBus *bus)
 {
     VirtIOSCSI *s = container_of(bus, VirtIOSCSI, bus);
+    VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
     VirtIODevice *vdev = VIRTIO_DEVICE(s);
     uint32_t total_queues = VIRTIO_SCSI_VQ_NUM_FIXED +
                             s->parent_obj.conf.num_queues;
@@ -1166,7 +1167,11 @@ static void virtio_scsi_drained_end(SCSIBus *bus)
 
     for (uint32_t i = 0; i < total_queues; i++) {
         VirtQueue *vq = virtio_get_queue(vdev, i);
-        virtio_queue_aio_attach_host_notifier(vq, s->ctx);
+        if (vq == vs->event_vq) {
+            virtio_queue_aio_attach_host_notifier_no_poll(vq, s->ctx);
+        } else {
+            virtio_queue_aio_attach_host_notifier(vq, s->ctx);
+        }
     }
 }

-- 
2.43.0
[PATCH v2 3/3] virtio-blk: Use ioeventfd_attach in start_ioeventfd
Commit d3f6f294aeadd5f88caf0155e4360808c95b3146 ("virtio-blk: always set ioeventfd during startup") has made virtio_blk_start_ioeventfd() always kick the virtqueue (set the ioeventfd), regardless of whether the BB is drained. That is no longer necessary, because attaching the host notifier will now set the ioeventfd, too; this happens either immediately right here in virtio_blk_start_ioeventfd(), or later when the drain ends, in virtio_blk_ioeventfd_attach().

With event_notifier_set() removed, the code becomes the same as the one in virtio_blk_ioeventfd_attach(), so we can reuse that function.

Signed-off-by: Hanna Czenczek
---
 hw/block/virtio-blk.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 227d83569f..22b8eef69b 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -37,6 +37,8 @@
 #include "hw/virtio/virtio-blk-common.h"
 #include "qemu/coroutine.h"
 
+static void virtio_blk_ioeventfd_attach(VirtIOBlock *s);
+
 static void virtio_blk_init_request(VirtIOBlock *s, VirtQueue *vq,
                                     VirtIOBlockReq *req)
 {
@@ -1808,17 +1810,14 @@ static int virtio_blk_start_ioeventfd(VirtIODevice *vdev)
     s->ioeventfd_started = true;
     smp_wmb(); /* paired with aio_notify_accept() on the read side */
 
-    /* Get this show started by hooking up our callbacks */
-    for (i = 0; i < nvqs; i++) {
-        VirtQueue *vq = virtio_get_queue(vdev, i);
-        AioContext *ctx = s->vq_aio_context[i];
-
-        /* Kick right away to begin processing requests already in vring */
-        event_notifier_set(virtio_queue_get_host_notifier(vq));
-
-        if (!blk_in_drain(s->conf.conf.blk)) {
-            virtio_queue_aio_attach_host_notifier(vq, ctx);
-        }
+    /*
+     * Get this show started by hooking up our callbacks. If drained now,
+     * virtio_blk_drained_end() will do this later.
+     * Attaching the notifier also kicks the virtqueues, processing any requests
+     * they may already have.
+     */
+    if (!blk_in_drain(s->conf.conf.blk)) {
+        virtio_blk_ioeventfd_attach(s);
     }
 
     return 0;

-- 
2.43.0
[PATCH v2 0/3] virtio: Re-enable notifications after drain
v1: https://lists.nongnu.org/archive/html/qemu-block/2024-01/msg00336.html

Hi,

This is basically the same series as v1: When using aio_set_event_notifier_poll(), the io_poll_end() callback is only invoked when polling ends, not when the notifier is being removed while in a polling section. This can leave the virtqueue notifier disabled during drained sections, which however is not a bad thing. We just need to ensure they are re-enabled after the drain, and kick the virtqueue once to pick up all the requests that came in during the drained section.

Patch 1 is a technically unrelated fix, but addresses a problem that became visible with patch 2 applied. Patch 3 is a small (optional) clean-up patch.

v2:
- Changed the title of this series and patch 2 (was: "Keep notifications disabled during drain"): Keeping the notifier disabled was something the initial RFC did, this version (v1 too) just ensures the notifier is enabled after the drain, regardless of its state before.
- Use event_notifier_set() instead of virtio_queue_notify() in patch 2
- Added patch 3

Hanna Czenczek (3):
  virtio-scsi: Attach event vq notifier with no_poll
  virtio: Re-enable notifications after drain
  virtio-blk: Use ioeventfd_attach in start_ioeventfd

 include/block/aio.h   |  7 ++-
 hw/block/virtio-blk.c | 21 ++---
 hw/scsi/virtio-scsi.c |  7 ++-
 hw/virtio/virtio.c    | 42 ++
 4 files changed, 64 insertions(+), 13 deletions(-)

-- 
2.43.0
[PATCH 2/2] scsi: Await request purging
scsi_device_for_each_req_async() currently does not provide any way to be awaited. One of its callers is scsi_device_purge_requests(), which therefore currently does not guarantee that all requests are fully settled when it returns.

We want all requests to be settled, because scsi_device_purge_requests() is called through the unrealize path, including the one invoked by virtio_scsi_hotunplug() through qdev_simple_device_unplug_cb(), which most likely assumes that all SCSI requests are done then.

In fact, scsi_device_purge_requests() already contains a blk_drain(), but this will not fully await scsi_device_for_each_req_async(), only the I/O requests it potentially cancels (not the non-I/O requests). However, we can have scsi_device_for_each_req_async() increment the BB in-flight counter, and have scsi_device_for_each_req_async_bh() decrement it when it is done. This way, the blk_drain() will fully await all SCSI requests to be purged.

This also removes the need for scsi_device_for_each_req_async_bh() to double-check the current context and potentially re-schedule itself, should it now differ from the BB's context: Changing a BB's AioContext with a root node is done through bdrv_try_change_aio_context(), which creates a drained section. With this patch, we keep the BB in-flight counter elevated throughout, so we know the BB's context cannot change.

Signed-off-by: Hanna Czenczek
---
 hw/scsi/scsi-bus.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 0a2eb11c56..230313022c 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -120,17 +120,13 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
     SCSIRequest *next;
 
     /*
-     * If the AioContext changed before this BH was called then reschedule into
-     * the new AioContext before accessing ->requests. This can happen when
-     * scsi_device_for_each_req_async() is called and then the AioContext is
-     * changed before BHs are run.
+     * The BB cannot have changed contexts between this BH being scheduled and
+     * now: BBs' AioContexts, when they have a node attached, can only be
+     * changed via bdrv_try_change_aio_context(), in a drained section. While
+     * we have the in-flight counter incremented, that drain must block.
      */
     ctx = blk_get_aio_context(s->conf.blk);
-    if (ctx != qemu_get_current_aio_context()) {
-        aio_bh_schedule_oneshot(ctx, scsi_device_for_each_req_async_bh,
-                                g_steal_pointer(&data));
-        return;
-    }
+    assert(ctx == qemu_get_current_aio_context());
 
     QTAILQ_FOREACH_SAFE(req, &s->requests, next, next) {
         data->fn(req, data->fn_opaque);
@@ -138,11 +134,16 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
 
     /* Drop the reference taken by scsi_device_for_each_req_async() */
     object_unref(OBJECT(s));
+
+    /* Paired with blk_inc_in_flight() in scsi_device_for_each_req_async() */
+    blk_dec_in_flight(s->conf.blk);
 }
 
 /*
  * Schedule @fn() to be invoked for each enqueued request in device @s. @fn()
  * runs in the AioContext that is executing the request.
+ * Keeps the BlockBackend's in-flight counter incremented until everything is
+ * done, so draining it will settle all scheduled @fn() calls.
  */
 static void scsi_device_for_each_req_async(SCSIDevice *s,
                                            void (*fn)(SCSIRequest *, void *),
@@ -163,6 +164,8 @@ static void scsi_device_for_each_req_async(SCSIDevice *s,
      */
     object_ref(OBJECT(s));
 
+    /* Paired with blk_dec_in_flight() in scsi_device_for_each_req_async_bh() */
+    blk_inc_in_flight(s->conf.blk);
     aio_bh_schedule_oneshot(blk_get_aio_context(s->conf.blk),
                             scsi_device_for_each_req_async_bh,
                             data);
@@ -1728,11 +1731,20 @@ static void scsi_device_purge_one_req(SCSIRequest *req, void *opaque)
     scsi_req_cancel_async(req, NULL);
 }
 
+/**
+ * Cancel all requests, and block until they are deleted.
+ */
 void scsi_device_purge_requests(SCSIDevice *sdev, SCSISense sense)
 {
     scsi_device_for_each_req_async(sdev, scsi_device_purge_one_req, NULL);
 
+    /*
+     * Await all the scsi_device_purge_one_req() calls scheduled by
+     * scsi_device_for_each_req_async(), and all I/O requests that were
+     * cancelled this way, but may still take a bit of time to settle.
+     */
     blk_drain(sdev->conf.blk);
+
     scsi_device_set_ua(sdev, sense);
 }

-- 
2.43.0
[PATCH 1/2] block-backend: Allow concurrent context changes
Since AioContext locks have been removed, a BlockBackend's AioContext may really change at any time (only exception is that it is often confined to a drained section, as noted in this patch). Therefore, blk_get_aio_context() cannot rely on its root node's context always matching that of the BlockBackend.

In practice, whether they match does not matter anymore anyway: Requests can be sent to BDSs from any context, so anyone who requests the BB's context should have no reason to require the root node to have the same context. Therefore, we can and should remove the assertion to that effect.

In addition, because the context can be set and queried from different threads concurrently, it has to be accessed with atomic operations.

Buglink: https://issues.redhat.com/browse/RHEL-19381
Suggested-by: Kevin Wolf
Signed-off-by: Hanna Czenczek
---
 block/block-backend.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 209eb07528..9c4de79e6b 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -44,7 +44,7 @@ struct BlockBackend {
     char *name;
     int refcnt;
     BdrvChild *root;
-    AioContext *ctx;
+    AioContext *ctx; /* access with atomic operations only */
     DriveInfo *legacy_dinfo;    /* null unless created by drive_new() */
     QTAILQ_ENTRY(BlockBackend) link;         /* for block_backends */
     QTAILQ_ENTRY(BlockBackend) monitor_link; /* for monitor_block_backends */
@@ -2414,22 +2414,22 @@ void blk_op_unblock_all(BlockBackend *blk, Error *reason)
     }
 }
 
+/**
+ * Return BB's current AioContext. Note that this context may change
+ * concurrently at any time, with one exception: If the BB has a root node
+ * attached, its context will only change through bdrv_try_change_aio_context(),
+ * which creates a drained section. Therefore, incrementing such a BB's
+ * in-flight counter will prevent its context from changing.
+ */
 AioContext *blk_get_aio_context(BlockBackend *blk)
 {
-    BlockDriverState *bs;
     IO_CODE();
 
     if (!blk) {
         return qemu_get_aio_context();
     }
 
-    bs = blk_bs(blk);
-    if (bs) {
-        AioContext *ctx = bdrv_get_aio_context(blk_bs(blk));
-        assert(ctx == blk->ctx);
-    }
-
-    return blk->ctx;
+    return qatomic_read(&blk->ctx);
 }
 
 int blk_set_aio_context(BlockBackend *blk, AioContext *new_context,
@@ -2442,7 +2442,7 @@ int blk_set_aio_context(BlockBackend *blk, AioContext *new_context,
     GLOBAL_STATE_CODE();
 
     if (!bs) {
-        blk->ctx = new_context;
+        qatomic_set(&blk->ctx, new_context);
         return 0;
     }
 
@@ -2471,7 +2471,7 @@ static void blk_root_set_aio_ctx_commit(void *opaque)
     AioContext *new_context = s->new_ctx;
     ThrottleGroupMember *tgm = &blk->public.throttle_group_member;
 
-    blk->ctx = new_context;
+    qatomic_set(&blk->ctx, new_context);
 
     if (tgm->throttle_state) {
         throttle_group_detach_aio_context(tgm);
         throttle_group_attach_aio_context(tgm, new_context);

-- 
2.43.0
[PATCH 0/2] block: Allow concurrent BB context changes
Hi,

Without the AioContext lock, a BB's context may kind of change at any time (unless it has a root node, and I/O requests are pending). That also means that its own context (BlockBackend.ctx) and that of its root node can differ sometimes (while the context is being changed).

blk_get_aio_context() doesn't know this yet and asserts that both are always equal (if there is a root node). Because it's no longer true, and because callers don't seem to really care about the root node's context, we can and should remove the assertion and just return the BB's context.

Beyond that, the question is whether the callers of blk_get_aio_context() are OK with the context potentially changing concurrently. Honestly, it isn't entirely clear to me; most look OK, except for the virtio-scsi code, which operates under the general assumption that the BB's context is always equal to that of the virtio-scsi device. I doubt that this assumption always holds (it is definitely not obvious to me that it would), but then again, this series will not make matters worse in that regard, and that is what counts for me now.

One clear point of contention is scsi_device_for_each_req_async(), which is addressed by patch 2. Right now, it schedules a BH in the BB context, then the BH double-checks whether the context still fits, and if not, re-schedules itself. Because virtio-scsi's context is fixed, this seems to indicate to me that it wants to be able to deal with a case where BB and virtio-scsi context differ, which seems to break that aforementioned general virtio-scsi assumption.

Unfortunately, I definitely have to touch that code, because accepting concurrent changes of AioContexts breaks the double-check (just because the BB has the right context in that place does not mean it stays in that context); instead, we must prevent any concurrent change until the BH is done. Because changing contexts generally involves a drained section, we can prevent it by keeping the BB in-flight counter elevated.

Question is, how to reason for that. I’d really rather not include the need to follow the BB context in my argument, because I find that part a bit fishy. Luckily, there’s a second, completely different reason for having scsi_device_for_each_req_async() increment the in-flight counter: Specifically, scsi_device_purge_requests() probably wants to await full completion of scsi_device_for_each_req_async(), and we can do that most easily in the very same way by incrementing the in-flight counter. This way, the blk_drain() in scsi_device_purge_requests() will not only await all (cancelled) I/O requests, but also the non-I/O requests. The fact that this prevents the BB AioContext from changing while the BH is scheduled/running then is just a nice side effect.

Hanna Czenczek (2):
  block-backend: Allow concurrent context changes
  scsi: Await request purging

 block/block-backend.c | 22 +++---
 hw/scsi/scsi-bus.c    | 30 +-
 2 files changed, 32 insertions(+), 20 deletions(-)

-- 
2.43.0
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 01.02.24 16:25, Hanna Czenczek wrote: On 01.02.24 15:28, Stefan Hajnoczi wrote: [...] Did you find a scenario where the virtio-scsi AioContext is different from the scsi-hd BB's AioContext? Technically, that’s the reason for this thread, specifically that virtio_scsi_hotunplug() switches the BB back to the main context while scsi_device_for_each_req_async_bh() is running. Yes, we can fix that specific case via the in-flight counter, but I’m wondering whether there’s really any merit in requiring the BB to always be in virtio-scsi’s context, or whether it would make more sense to schedule everything in virtio-scsi’s context. Now that BBs/BDSs can receive requests from any context, that is. Now that I know that wouldn’t be easy, let me turn this around: As far as I understand, scsi_device_for_each_req_async_bh() should still run in virtio-scsi’s context, but that’s hard, so we take the BB’s context, which we therefore require to be the same one. Further, (again AFAIU,) virtio-scsi’s context cannot change (only set in virtio_scsi_dataplane_setup(), which is run in virtio_scsi_device_realize()). Therefore, why does the scsi_device_for_each_req_async() code accommodate BB context changes? Hanna
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 01.02.24 16:25, Hanna Czenczek wrote: [...] It just seems simpler to me to not rely on the BB's context at all. Hm, I now see the problem is that the processing (and scheduling) is largely done in generic SCSI code, which doesn’t have access to virtio-scsi’s context, only to that of the BB. That makes my idea quite impossible. :/
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 01.02.24 15:28, Stefan Hajnoczi wrote: On Thu, Feb 01, 2024 at 03:10:12PM +0100, Hanna Czenczek wrote: On 31.01.24 21:35, Stefan Hajnoczi wrote: On Fri, Jan 26, 2024 at 04:24:49PM +0100, Hanna Czenczek wrote: On 26.01.24 14:18, Kevin Wolf wrote: Am 25.01.2024 um 18:32 hat Hanna Czenczek geschrieben: On 23.01.24 18:10, Kevin Wolf wrote: Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben: On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID: <20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd: Note: We (on issues.redhat.com) have a separate report that seems to be concerning this very problem: https://issues.redhat.com/browse/RHEL-19381 {"execute":"device_del","arguments":{"id":"stg0"}} {"return": {}} qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed. [...] I don’t know anything about the problem yet, but as usual, I like speculation and discovering how wrong I was later on, so one thing I came across that’s funny about virtio-scsi is that requests can happen even while a disk is being attached or detached. That is, Linux seems to probe all LUNs when a new virtio-scsi device is being attached, and it won’t stop just because a disk is being attached or removed. So maybe that’s part of the problem, that we get a request while the BB is being detached, and temporarily in an inconsistent state (BDS context differs from BB context). I don't know anything about the problem either, but since you already speculated about the cause, let me speculate about the solution: Can we hold the graph writer lock for the tran_commit() call in bdrv_try_change_aio_context()? And of course take the reader lock for blk_get_aio_context(), but that should be completely unproblematic. Actually, now that completely unproblematic part is giving me trouble. I wanted to just put a graph lock into blk_get_aio_context() (making it a coroutine with a wrapper) Which is the first thing I neglected and already not great. We have calls of blk_get_aio_context() in the SCSI I/O path, and creating a coroutine and doing at least two context switches simply for this call is a lot of overhead... but callers of blk_get_aio_context() generally assume the context is going to stay the BB’s context for as long as their AioContext * variable is in scope. I'm not so sure about that. And taking another step back, I'm actually also not sure how much it still matters now that they can submit I/O from any thread. That’s my impression, too, but “not sure” doesn’t feel great.
:) scsi_device_for_each_req_async_bh() specifically double-checks whether it’s still in the right context before invoking the specified function, so it seems there was some intention to continue to run in the context associated with the BB. (Not judging whether that intent makes sense or not, yet.) Maybe the correct solution is to remove the assertion from blk_get_aio_context() and just always return blk->ctx. If it's in the middle of a change, you'll either get the old one or the new one. Either one is fine to submit I/O from, and if you care about changes for other reasons (like SCSI does), then you need explicit code to protect it anyway (which SCSI apparently has, but it doesn't work). I think most callers do just assume the BB stays in the context they got (without any proof, admittedly), but I agree that under re-evaluation, it probably doesn’t actually matter to them, really. And yes, basically, if the caller doesn’t need to take a lock because it doesn’t really matter whether blk->ctx changes while it’s still using the old value, blk_get_aio_context() in turn doesn’t need to double-check blk->ctx against the root node’s context either, and nobody needs a lock. So I agree, it’s on the caller to protect against a potentially changing context, blk_get_aio_context() should just return blk->ctx and not check against the root node.
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 31.01.24 21:35, Stefan Hajnoczi wrote: On Fri, Jan 26, 2024 at 04:24:49PM +0100, Hanna Czenczek wrote: On 26.01.24 14:18, Kevin Wolf wrote: Am 25.01.2024 um 18:32 hat Hanna Czenczek geschrieben: On 23.01.24 18:10, Kevin Wolf wrote: Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben: On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID: <20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd: Note: We (on issues.redhat.com) have a separate report that seems to be concerning this very problem: https://issues.redhat.com/browse/RHEL-19381 {"execute":"device_del","arguments":{"id":"stg0"}} {"return": {}} qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed. [...] I don’t know anything about the problem yet, but as usual, I like speculation and discovering how wrong I was later on, so one thing I came across that’s funny about virtio-scsi is that requests can happen even while a disk is being attached or detached. That is, Linux seems to probe all LUNs when a new virtio-scsi device is being attached, and it won’t stop just because a disk is being attached or removed. So maybe that’s part of the problem, that we get a request while the BB is being detached, and temporarily in an inconsistent state (BDS context differs from BB context). I don't know anything about the problem either, but since you already speculated about the cause, let me speculate about the solution: Can we hold the graph writer lock for the tran_commit() call in bdrv_try_change_aio_context()? And of course take the reader lock for blk_get_aio_context(), but that should be completely unproblematic. Actually, now that completely unproblematic part is giving me trouble. I wanted to just put a graph lock into blk_get_aio_context() (making it a coroutine with a wrapper) Which is the first thing I neglected and already not great. We have calls of blk_get_aio_context() in the SCSI I/O path, and creating a coroutine and doing at least two context switches simply for this call is a lot of overhead... but callers of blk_get_aio_context() generally assume the context is going to stay the BB’s context for as long as their AioContext * variable is in scope. I'm not so sure about that. And taking another step back, I'm actually also not sure how much it still matters now that they can submit I/O from any thread. That’s my impression, too, but “not sure” doesn’t feel great.
:) scsi_device_for_each_req_async_bh() specifically double-checks whether it’s still in the right context before invoking the specified function, so it seems there was some intention to continue to run in the context associated with the BB. (Not judging whether that intent makes sense or not, yet.) Maybe the correct solution is to remove the assertion from blk_get_aio_context() and just always return blk->ctx. If it's in the middle of a change, you'll either get the old one or the new one. Either one is fine to submit I/O from, and if you care about changes for other reasons (like SCSI does), then you need explicit code to protect it anyway (which SCSI apparently has, but it doesn't work). I think most callers do just assume the BB stays in the context they got (without any proof, admittedly), but I agree that under re-evaluation, it probably doesn’t actually matter to them, really. And yes, basically, if the caller doesn’t need to take a lock because it doesn’t really matter whether blk->ctx changes while it’s still using the old value, blk_get_aio_context() in turn doesn’t need to double-check blk->ctx against the root node’s context either, and nobody needs a lock. So I agree, it’s on the caller to protect against a potentially changing context, blk_get_aio_context() should just return blk->ctx and not check against the root node. (On a tangent: blk_drain() is a caller of blk_get_aio_context(), and it polls that context. Why does it need to
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 01.02.24 11:21, Kevin Wolf wrote: Am 01.02.2024 um 10:43 hat Hanna Czenczek geschrieben: On 31.01.24 11:17, Kevin Wolf wrote: Am 29.01.2024 um 17:30 hat Hanna Czenczek geschrieben: I don’t like using drain as a form of lock specifically against AioContext changes, but maybe Stefan is right, and we should use it in this specific case to get just the single problem fixed. (Though it’s not quite trivial either. We’d probably still want to remove the assertion from blk_get_aio_context(), so we don’t have to require all of its callers to hold a count in the in-flight counter.) Okay, fair, maybe fixing the specific problem is more important than solving the more generic blk_get_aio_context() race. In this case, wouldn't it be enough to increase the in-flight counter so that the drain before switching AioContexts would run the BH before anything bad can happen? Does the following work? Yes, that’s what I had in mind (Stefan, too, I think), and in testing, it looks good. Oh, sorry, I completely misunderstood then. I thought you were talking about adding a new drained section somewhere and that sounded a bit more complicated. :-) If it works, let's do this. Would you like to pick this up and send it as a formal patch (possibly in a more polished form), or should I do that? Sure, I can do it. Hanna
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 31.01.24 11:17, Kevin Wolf wrote: Am 29.01.2024 um 17:30 hat Hanna Czenczek geschrieben: I don’t like using drain as a form of lock specifically against AioContext changes, but maybe Stefan is right, and we should use it in this specific case to get just the single problem fixed. (Though it’s not quite trivial either. We’d probably still want to remove the assertion from blk_get_aio_context(), so we don’t have to require all of its callers to hold a count in the in-flight counter.) Okay, fair, maybe fixing the specific problem is more important than solving the more generic blk_get_aio_context() race. In this case, wouldn't it be enough to increase the in-flight counter so that the drain before switching AioContexts would run the BH before anything bad can happen? Does the following work? Yes, that’s what I had in mind (Stefan, too, I think), and in testing, it looks good. Hanna Kevin

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 0a2eb11c56..dc09eb8024 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -120,17 +120,11 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
     SCSIRequest *next;
 
     /*
-     * If the AioContext changed before this BH was called then reschedule into
-     * the new AioContext before accessing ->requests. This can happen when
-     * scsi_device_for_each_req_async() is called and then the AioContext is
-     * changed before BHs are run.
+     * The AioContext can't have changed because we increased the in-flight
+     * counter for s->conf.blk.
      */
     ctx = blk_get_aio_context(s->conf.blk);
-    if (ctx != qemu_get_current_aio_context()) {
-        aio_bh_schedule_oneshot(ctx, scsi_device_for_each_req_async_bh,
-                                g_steal_pointer(&data));
-        return;
-    }
+    assert(ctx == qemu_get_current_aio_context());
 
     QTAILQ_FOREACH_SAFE(req, &s->requests, next, next) {
         data->fn(req, data->fn_opaque);
@@ -138,6 +132,7 @@ static void scsi_device_for_each_req_async_bh(void *opaque)
 
     /* Drop the reference taken by scsi_device_for_each_req_async() */
     object_unref(OBJECT(s));
+    blk_dec_in_flight(s->conf.blk);
 }
 
 /*
@@ -163,6 +158,7 @@ static void scsi_device_for_each_req_async(SCSIDevice *s,
      */
     object_ref(OBJECT(s));
 
+    blk_inc_in_flight(s->conf.blk);
     aio_bh_schedule_oneshot(blk_get_aio_context(s->conf.blk),
                             scsi_device_for_each_req_async_bh,
                             data);
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 23.01.24 18:10, Kevin Wolf wrote: Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben: On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID: <20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd:

{"execute":"device_del","arguments":{"id":"stg0"}}
{"return": {}}
qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed.

(gdb) bt
#0  0x7f32a668d83c in () at /usr/lib/libc.so.6
#1  0x7f32a663d668 in raise () at /usr/lib/libc.so.6
#2  0x7f32a66254b8 in abort () at /usr/lib/libc.so.6
#3  0x7f32a66253dc in () at /usr/lib/libc.so.6
#4  0x7f32a6635d26 in () at /usr/lib/libc.so.6
#5  0x556e6b4880a4 in blk_get_aio_context (blk=0x556e6e89ccf0) at ../block/block-backend.c:2429
#6  blk_get_aio_context (blk=0x556e6e89ccf0) at ../block/block-backend.c:2417
#7  0x556e6b112d87 in scsi_device_for_each_req_async_bh (opaque=0x556e6e2c6d10) at ../hw/scsi/scsi-bus.c:128
#8  0x556e6b5d1966 in aio_bh_poll (ctx=ctx@entry=0x556e6d8aa290) at ../util/async.c:218
#9  0x556e6b5bb16a in aio_poll (ctx=0x556e6d8aa290, blocking=blocking@entry=true) at ../util/aio-posix.c:722
#10 0x556e6b4564b6 in iothread_run (opaque=opaque@entry=0x556e6d89d920) at ../iothread.c:63
#11 0x556e6b5bde58 in qemu_thread_start (args=0x556e6d8aa9b0) at ../util/qemu-thread-posix.c:541
#12 0x7f32a668b9eb in () at /usr/lib/libc.so.6
#13 0x7f32a670f7cc in () at /usr/lib/libc.so.6

I don’t know anything about the problem yet, but as usual, I like speculation and discovering how wrong I was later on, so one thing I came across that’s funny about virtio-scsi is that requests can happen even while a disk is being attached or detached. That is, Linux seems to probe all LUNs when a new virtio-scsi device is being attached, and it won’t stop just because a disk is being attached or removed. So maybe that’s part of the problem, that we get a request while the BB is being detached, and temporarily in an inconsistent state (BDS context differs from BB context). I don't know anything about the problem either, but since you already speculated about the cause, let me speculate about the solution: Can we hold the graph writer lock for the tran_commit() call in bdrv_try_change_aio_context()? I tried removing the drain so that all of bdrv_try_change_aio_context() could require GRAPH_WRLOCK, but that broke tests/unit/test-block-iothread, because without draining, block jobs would need to switch AioContexts while running, and job_set_aio_context() doesn’t like that.
Similarly to blk_get_aio_context(), I assume we can in theory just drop the assertion there and change the context while the job is running, because then the job can just change AioContexts on the next pause point (and in the meantime send requests from the old context, which is fine), but this does get quite murky. (One rather theoretical (but present) problem is that test-block-iothread itself contains some assertions in the job that its AioContext is actually the one it's running in, but this assertion would no longer necessarily hold true.) I don’t like using drain as a form of lock specifically against AioContext changes, but maybe Stefan is right, and we should use it in this specific case to get just the single problem fixed. (Though it’s not quite trivial either. We’d probably still want to remove the assertion from blk_get_aio_context(), so we don’t have to require all of its callers to hold a count in the in-flight counter.) Hanna
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 26.01.24 14:18, Kevin Wolf wrote: Am 25.01.2024 um 18:32 hat Hanna Czenczek geschrieben: On 23.01.24 18:10, Kevin Wolf wrote: Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben: On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID: <20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd: Note: We (on issues.redhat.com) have a separate report that seems to be concerning this very problem: https://issues.redhat.com/browse/RHEL-19381 {"execute":"device_del","arguments":{"id":"stg0"}} {"return": {}} qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed. [...] I don’t know anything about the problem yet, but as usual, I like speculation and discovering how wrong I was later on, so one thing I came across that’s funny about virtio-scsi is that requests can happen even while a disk is being attached or detached. That is, Linux seems to probe all LUNs when a new virtio-scsi device is being attached, and it won’t stop just because a disk is being attached or removed. So maybe that’s part of the problem, that we get a request while the BB is being detached, and temporarily in an inconsistent state (BDS context differs from BB context). I don't know anything about the problem either, but since you already speculated about the cause, let me speculate about the solution: Can we hold the graph writer lock for the tran_commit() call in bdrv_try_change_aio_context()? And of course take the reader lock for blk_get_aio_context(), but that should be completely unproblematic. Actually, now that completely unproblematic part is giving me trouble. I wanted to just put a graph lock into blk_get_aio_context() (making it a coroutine with a wrapper) Which is the first thing I neglected and already not great. We have calls of blk_get_aio_context() in the SCSI I/O path, and creating a coroutine and doing at least two context switches simply for this call is a lot of overhead... but callers of blk_get_aio_context() generally assume the context is going to stay the BB’s context for as long as their AioContext * variable is in scope. I'm not so sure about that. And taking another step back, I'm actually also not sure how much it still matters now that they can submit I/O from any thread. That’s my impression, too, but “not sure” doesn’t feel great.
:) scsi_device_for_each_req_async_bh() specifically double-checks whether it’s still in the right context before invoking the specified function, so it seems there was some intention to continue to run in the context associated with the BB. (Not judging whether that intent makes sense or not, yet.) Maybe the correct solution is to remove the assertion from blk_get_aio_context() and just always return blk->ctx. If it's in the middle of a change, you'll either get the old one or the new one. Either one is fine to submit I/O from, and if you care about changes for other reasons (like SCSI does), then you need explicit code to protect it anyway (which SCSI apparently has, but it doesn't work). I think most callers do just assume the BB stays in the context they got (without any proof, admittedly), but I agree that under re-evaluation, it probably doesn’t actually matter to them, really. And yes, basically, if the caller doesn’t need to take a lock because it doesn’t really matter whether blk->ctx changes while it’s still using the old value, blk_get_aio_context() in turn doesn’t need to double-check blk->ctx against the root node’s context either, and nobody needs a lock. So I agree, it’s on the caller to protect against a potentially changing context, blk_get_aio_context() should just return blk->ctx and not check against the root node. (On a tangent: blk_drain() is a caller of blk_get_aio_context(), and it polls that context. Why does it need to
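Concretely, the accessor being proposed here would shrink to something like this (a sketch, assuming the root-node comparison and the assertion are dropped entirely):

AioContext *blk_get_aio_context(BlockBackend *blk)
{
    IO_CODE();

    /*
     * No assertion against the root node's context: while a context change
     * is in flight, callers may see either the old or the new context, and
     * either one is fine to submit I/O from.
     */
    return blk->ctx;
}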
Re: [PATCH 2/2] virtio: Keep notifications disabled during drain
On 25.01.24 19:18, Hanna Czenczek wrote: On 25.01.24 19:03, Stefan Hajnoczi wrote: On Wed, Jan 24, 2024 at 06:38:30PM +0100, Hanna Czenczek wrote: [...] @@ -3563,6 +3574,13 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx) aio_set_event_notifier_poll(ctx, &vq->host_notifier, virtio_queue_host_notifier_aio_poll_begin, virtio_queue_host_notifier_aio_poll_end); + + /* + * We will have ignored notifications about new requests from the guest + * during the drain, so "kick" the virt queue to process those requests + * now. + */ + virtio_queue_notify(vq->vdev, vq->queue_index); event_notifier_set(&vq->host_notifier) is easier to understand because it doesn't contain a non-host_notifier code path that we must not take. Is there a reason why you used virtio_queue_notify() instead? Not a good one anyway! virtio_queue_notify() is just what seemed obvious to me (i.e. to notify the virtqueue). Before removal of the AioContext lock, calling handle_output seemed safe. But, yes, there was the discussion on the RFC that it really isn’t. I didn’t consider that this means we must rely on virtio_queue_notify() calling event_notifier_set(), so we may as well call it explicitly here. I’ll fix it, thanks for pointing it out! (I think together with this change, I’ll also remove the event_notifier_set() call from virtio_blk_data_plane_start(). It’d obviously be a duplicate, and removing it shows why virtio_queue_aio_attach_host_notifier() should always kick the queue.)
Re: [PATCH 2/2] virtio: Keep notifications disabled during drain
On 25.01.24 19:03, Stefan Hajnoczi wrote: On Wed, Jan 24, 2024 at 06:38:30PM +0100, Hanna Czenczek wrote: During drain, we do not care about virtqueue notifications, which is why we remove the handlers on it. When removing those handlers, whether vq notifications are enabled or not depends on whether we were in polling mode or not; if not, they are enabled (by default); if so, they have been disabled by the io_poll_start callback. Because we do not care about those notifications after removing the handlers, this is fine. However, we have to explicitly ensure they are enabled when re-attaching the handlers, so we will resume receiving notifications. We do this in virtio_queue_aio_attach_host_notifier*(). If such a function is called while we are in a polling section, attaching the notifiers will then invoke the io_poll_start callback, re-disabling notifications. Because we will always miss virtqueue updates in the drained section, we also need to poll the virtqueue once after attaching the notifiers. Buglink: https://issues.redhat.com/browse/RHEL-3934 Signed-off-by: Hanna Czenczek --- include/block/aio.h | 7 ++- hw/virtio/virtio.c | 42 ++ 2 files changed, 48 insertions(+), 1 deletion(-) diff --git a/include/block/aio.h b/include/block/aio.h index 5d0a114988..8378553eb9 100644 --- a/include/block/aio.h +++ b/include/block/aio.h @@ -480,9 +480,14 @@ void aio_set_event_notifier(AioContext *ctx, AioPollFn *io_poll, EventNotifierHandler *io_poll_ready); -/* Set polling begin/end callbacks for an event notifier that has already been +/* + * Set polling begin/end callbacks for an event notifier that has already been * registered with aio_set_event_notifier. Do nothing if the event notifier is * not registered. + * + * Note that if the io_poll_end() callback (or the entire notifier) is removed + * during polling, it will not be called, so an io_poll_begin() is not + * necessarily always followed by an io_poll_end(). */ void aio_set_event_notifier_poll(AioContext *ctx, EventNotifier *notifier, diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index 7549094154..4166da9e97 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -3556,6 +3556,17 @@ static void virtio_queue_host_notifier_aio_poll_end(EventNotifier *n) void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx) { +/* + * virtio_queue_aio_detach_host_notifier() can leave notifications disabled. + * Re-enable them. (And if detach has not been used before, notifications + * being enabled is still the default state while a notifier is attached; + * see virtio_queue_host_notifier_aio_poll_end(), which will always leave + * notifications enabled once the polling section is left.) + */ +if (!virtio_queue_get_notification(vq)) { +virtio_queue_set_notification(vq, 1); +} + aio_set_event_notifier(ctx, &vq->host_notifier, virtio_queue_host_notifier_read, virtio_queue_host_notifier_aio_poll, @@ -3563,6 +3574,13 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx) aio_set_event_notifier_poll(ctx, &vq->host_notifier, virtio_queue_host_notifier_aio_poll_begin, virtio_queue_host_notifier_aio_poll_end); + +/* + * We will have ignored notifications about new requests from the guest + * during the drain, so "kick" the virt queue to process those requests + * now. + */ +virtio_queue_notify(vq->vdev, vq->queue_index); event_notifier_set(&vq->host_notifier) is easier to understand because it doesn't contain a non-host_notifier code path that we must not take. Is there a reason why you used virtio_queue_notify() instead? 
Not a good one anyway! virtio_queue_notify() is just what seemed obvious to me (i.e. to notify the virtqueue). Before removal of the AioContext lock, calling handle_output seemed safe. But, yes, there was the discussion on the RFC that it really isn’t. I didn’t consider that this means we must rely on virtio_queue_notify() calling event_notifier_set(), so we may as well call it explicitly here. I’ll fix it, thanks for pointing it out! Otherwise: Reviewed-by: Stefan Hajnoczi Thanks! Hanna
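The change agreed on here would therefore end virtio_queue_aio_attach_host_notifier() with a direct event_notifier_set() call, roughly like this (a sketch of the outcome of this exchange, not the final committed hunk):

    /*
     * We will have ignored notifications about new requests from the guest
     * during the drain, so "kick" the virtqueue to process those requests
     * now.  Use event_notifier_set() rather than virtio_queue_notify() so
     * that the vq->handle_output() fallback path, which must not be taken
     * here, cannot be taken.
     */
    event_notifier_set(&vq->host_notifier);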
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 23.01.24 18:10, Kevin Wolf wrote: Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben: On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID:<20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd: {"execute":"device_del","arguments":{"id":"stg0"}} {"return": {}} qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed. [...] I don’t know anything about the problem yet, but as usual, I like speculation and discovering how wrong I was later on, so one thing I came across that’s funny about virtio-scsi is that requests can happen even while a disk is being attached or detached. That is, Linux seems to probe all LUNs when a new virtio-scsi device is being attached, and it won’t stop just because a disk is being attached or removed. So maybe that’s part of the problem, that we get a request while the BB is being detached, and temporarily in an inconsistent state (BDS context differs from BB context). I don't know anything about the problem either, but since you already speculated about the cause, let me speculate about the solution: Can we hold the graph writer lock for the tran_commit() call in bdrv_try_change_aio_context()? And of course take the reader lock for blk_get_aio_context(), but that should be completely unproblematic. Actually, now that completely unproblematic part is giving me trouble. I wanted to just put a graph lock into blk_get_aio_context() (making it a coroutine with a wrapper), but callers of blk_get_aio_context() generally assume the context is going to stay the BB’s context for as long as their AioContext * variable is in scope. I was tempted to think callers know what happens to the BB they pass to blk_get_aio_context(), and it won’t change contexts so easily, but then I remembered this is exactly what happens in this case; we run scsi_device_for_each_req_async_bh() in one thread (which calls blk_get_aio_context()), and in the other, we change the BB’s context. It seems like there are very few blk_* functions right now that require taking a graph lock around them, so I’m hesitant to go that route. But if we’re protecting a BB’s context via the graph write lock, I can’t think of a way around having to take a read lock whenever reading a BB’s context, and holding it for as long as we assume that context to remain the BB’s context. 
It’s also hard to figure out how long that is, case by case; for example, dma_blk_read() schedules an AIO function in the BB context; but we probably don’t care that this context remains the BB’s context until the request is done. In the case of scsi_device_for_each_req_async_bh(), we already take care to re-schedule it when it turns out the context is outdated, so it does seem quite important here, and we probably want to keep the lock until after the QTAILQ_FOREACH_SAFE() loop. On a tangent, this TOCTTOU problem makes me wary of other blk_* functions that query information. For example, fuse_read() (in block/export/fuse.c) truncates requests to the BB length. But what if the BB length changes concurrently between blk_getlength() and blk_pread()? While we can justify using the graph lock for a BB’s AioContext, we can’t use it for other metadata like its length. Hanna
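To illustrate that tangent, the pattern in question looks roughly like this (heavily simplified from block/export/fuse.c; the reply handling and error paths are omitted, and the exact field names are assumptions):

static void fuse_read(fuse_req_t req, fuse_ino_t inode,
                      size_t size, off_t offset, struct fuse_file_info *fi)
{
    FuseExport *exp = fuse_req_userdata(req);
    int64_t length = blk_getlength(exp->common.blk);
    char *buf = g_malloc(size);

    /*
     * TOCTTOU window: the image may be truncated between blk_getlength()
     * and blk_pread(), so this clamp may be based on a stale length.
     */
    if (length >= 0 && offset + size > length) {
        size = length - offset;
    }

    if (blk_pread(exp->common.blk, offset, size, buf, 0) >= 0) {
        fuse_reply_buf(req, buf, size);
    }
    g_free(buf);
}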
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 24.01.24 22:53, Stefan Hajnoczi wrote: On Wed, Jan 24, 2024 at 01:12:47PM +0100, Hanna Czenczek wrote: On 23.01.24 18:10, Kevin Wolf wrote: Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben: On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID:<20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd: {"execute":"device_del","arguments":{"id":"stg0"}} {"return": {}} qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed. [...] I don't know anything about the problem either, but since you already speculated about the cause, let me speculate about the solution: Can we hold the graph writer lock for the tran_commit() call in bdrv_try_change_aio_context()? And of course take the reader lock for blk_get_aio_context(), but that should be completely unproblematic. I tried this, and it’s not easy taking the lock just for tran_commit(), because some callers of bdrv_try_change_aio_context() already hold the write lock (specifically bdrv_attach_child_common(), bdrv_attach_child_common_abort(), and bdrv_root_unref_child()[1]), and qmp_x_blockdev_set_iothread() holds the read lock. Other callers don’t hold any lock[2]. So I’m not sure whether we should mark all of bdrv_try_change_aio_context() as GRAPH_WRLOCK and then make all callers take the lock, or really only take it for tran_commit(), and have callers release the lock around bdrv_try_change_aio_context(). Former sounds better to naïve me. (In any case, FWIW, having blk_set_aio_context() take the write lock, and scsi_device_for_each_req_async_bh() take the read lock[3], does fix the assertion failure.) I wonder if a simpler solution is blk_inc_in_flight() in scsi_device_for_each_req_async() and blk_dec_in_flight() in scsi_device_for_each_req_async_bh() so that drain waits for the BH. There is a drain around the AioContext change, so as long as scsi_device_for_each_req_async() was called before blk_set_aio_context() and not _during_ aio_poll(), we would prevent inconsistent BB vs BDS aio_contexts. Actually, Kevin has suggested on IRC that we drop the whole drain. :) Dropping the write lock in or outside of bdrv_try_change_aio_context() for callers that have already taken it seems unsafe, so the only option would be to make the whole function write-lock-able. The drained section can cause problems with that if it ends up wanting to reorganize the graph, so AFAIU drain should never be done while under a write lock. 
This is already a problem now because there are three callers that do call bdrv_try_change_aio_context() while under a write lock, so it seems like we shouldn’t keep the drain as-is. So, Kevin suggested just dropping that drain, because I/O requests are no longer supposed to care about a BDS’s native AioContext anyway, so it seems like the need for the drain has gone away with multiqueue. Then we could make the whole function GRAPH_WRLOCK. Hanna
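In terms of the calling convention, that plan amounts to roughly the following (a sketch only; the exact annotation placement and the per-caller adjustments would still need to be worked out, and example_caller is a made-up name):

/* Declaration: the whole function becomes a graph-writer function */
int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
                                BdrvChild *ignore_child, Error **errp)
    GRAPH_WRLOCK;

/*
 * Call sites that do not hold the writer lock yet (e.g.
 * blk_set_aio_context()) would take it around the call, with no drained
 * section around it anymore:
 */
void example_caller(BlockDriverState *bs, AioContext *new_ctx, Error **errp)
{
    bdrv_graph_wrlock();
    bdrv_try_change_aio_context(bs, new_ctx, NULL, errp);
    bdrv_graph_wrunlock();
}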
[PATCH 2/2] virtio: Keep notifications disabled during drain
During drain, we do not care about virtqueue notifications, which is why we remove the handlers on it. When removing those handlers, whether vq notifications are enabled or not depends on whether we were in polling mode or not; if not, they are enabled (by default); if so, they have been disabled by the io_poll_start callback. Because we do not care about those notifications after removing the handlers, this is fine. However, we have to explicitly ensure they are enabled when re-attaching the handlers, so we will resume receiving notifications. We do this in virtio_queue_aio_attach_host_notifier*(). If such a function is called while we are in a polling section, attaching the notifiers will then invoke the io_poll_start callback, re-disabling notifications. Because we will always miss virtqueue updates in the drained section, we also need to poll the virtqueue once after attaching the notifiers. Buglink: https://issues.redhat.com/browse/RHEL-3934 Signed-off-by: Hanna Czenczek --- include/block/aio.h | 7 ++- hw/virtio/virtio.c | 42 ++ 2 files changed, 48 insertions(+), 1 deletion(-) diff --git a/include/block/aio.h b/include/block/aio.h index 5d0a114988..8378553eb9 100644 --- a/include/block/aio.h +++ b/include/block/aio.h @@ -480,9 +480,14 @@ void aio_set_event_notifier(AioContext *ctx, AioPollFn *io_poll, EventNotifierHandler *io_poll_ready); -/* Set polling begin/end callbacks for an event notifier that has already been +/* + * Set polling begin/end callbacks for an event notifier that has already been * registered with aio_set_event_notifier. Do nothing if the event notifier is * not registered. + * + * Note that if the io_poll_end() callback (or the entire notifier) is removed + * during polling, it will not be called, so an io_poll_begin() is not + * necessarily always followed by an io_poll_end(). */ void aio_set_event_notifier_poll(AioContext *ctx, EventNotifier *notifier, diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index 7549094154..4166da9e97 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -3556,6 +3556,17 @@ static void virtio_queue_host_notifier_aio_poll_end(EventNotifier *n) void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx) { +/* + * virtio_queue_aio_detach_host_notifier() can leave notifications disabled. + * Re-enable them. (And if detach has not been used before, notifications + * being enabled is still the default state while a notifier is attached; + * see virtio_queue_host_notifier_aio_poll_end(), which will always leave + * notifications enabled once the polling section is left.) + */ +if (!virtio_queue_get_notification(vq)) { +virtio_queue_set_notification(vq, 1); +} + aio_set_event_notifier(ctx, &vq->host_notifier, virtio_queue_host_notifier_read, virtio_queue_host_notifier_aio_poll, @@ -3563,6 +3574,13 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx) aio_set_event_notifier_poll(ctx, &vq->host_notifier, virtio_queue_host_notifier_aio_poll_begin, virtio_queue_host_notifier_aio_poll_end); + +/* + * We will have ignored notifications about new requests from the guest + * during the drain, so "kick" the virt queue to process those requests + * now. 
+ */ +virtio_queue_notify(vq->vdev, vq->queue_index); } /* @@ -3573,14 +3591,38 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx) */ void virtio_queue_aio_attach_host_notifier_no_poll(VirtQueue *vq, AioContext *ctx) { +/* See virtio_queue_aio_attach_host_notifier() */ +if (!virtio_queue_get_notification(vq)) { +virtio_queue_set_notification(vq, 1); +} + aio_set_event_notifier(ctx, &vq->host_notifier, virtio_queue_host_notifier_read, NULL, NULL); + +/* + * See virtio_queue_aio_attach_host_notifier(). + * Note that this may be unnecessary for the type of virtqueues this + * function is used for. Still, it will not hurt to have a quick look into + * whether we can/should process any of the virtqueue elements. + */ +virtio_queue_notify(vq->vdev, vq->queue_index); } void virtio_queue_aio_detach_host_notifier(VirtQueue *vq, AioContext *ctx) { aio_set_event_notifier(ctx, &vq->host_notifier, NULL, NULL, NULL); + +/* + * aio_set_event_notifier_poll() does not guarantee whether io_poll_end() + * will run after io_poll_begin(), so by removing the notifier, we do not + * know whether virtio_queue_host_notifier_aio_poll_end() has run after a + * previous virtio_queue_host_notifier_aio_poll_begin(), i.e. whether + * notifications are enabled or disabled. + */ }
[PATCH 1/2] virtio-scsi: Attach event vq notifier with no_poll
As of commit 38738f7dbbda90fbc161757b7f4be35b52205552 ("virtio-scsi: don't waste CPU polling the event virtqueue"), we only attach an io_read notifier for the virtio-scsi event virtqueue instead, and no polling notifiers. During operation, the event virtqueue is typically non-empty, but none of the buffers are intended to be used immediately. Instead, they only get used when certain events occur. Therefore, it makes no sense to continuously poll it when non-empty, because it is supposed to be and stay non-empty. We do this by using virtio_queue_aio_attach_host_notifier_no_poll() instead of virtio_queue_aio_attach_host_notifier() for the event virtqueue. Commit 766aa2de0f29b657148e04599320d771c36fd126 ("virtio-scsi: implement BlockDevOps->drained_begin()") however has virtio_scsi_drained_end() use virtio_queue_aio_attach_host_notifier() for all virtqueues, including the event virtqueue. This can lead to it being polled again, undoing the benefit of commit 38738f7dbbda90fbc161757b7f4be35b52205552. Fix it by using virtio_queue_aio_attach_host_notifier_no_poll() for the event virtqueue. Reported-by: Fiona Ebner Fixes: 766aa2de0f29b657148e04599320d771c36fd126 ("virtio-scsi: implement BlockDevOps->drained_begin()") Signed-off-by: Hanna Czenczek --- hw/scsi/virtio-scsi.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c index 690aceec45..9f02ceea09 100644 --- a/hw/scsi/virtio-scsi.c +++ b/hw/scsi/virtio-scsi.c @@ -1149,6 +1149,7 @@ static void virtio_scsi_drained_begin(SCSIBus *bus) static void virtio_scsi_drained_end(SCSIBus *bus) { VirtIOSCSI *s = container_of(bus, VirtIOSCSI, bus); +VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s); VirtIODevice *vdev = VIRTIO_DEVICE(s); uint32_t total_queues = VIRTIO_SCSI_VQ_NUM_FIXED + s->parent_obj.conf.num_queues; @@ -1166,7 +1167,11 @@ static void virtio_scsi_drained_end(SCSIBus *bus) for (uint32_t i = 0; i < total_queues; i++) { VirtQueue *vq = virtio_get_queue(vdev, i); -virtio_queue_aio_attach_host_notifier(vq, s->ctx); +if (vq == vs->event_vq) { +virtio_queue_aio_attach_host_notifier_no_poll(vq, s->ctx); +} else { +virtio_queue_aio_attach_host_notifier(vq, s->ctx); +} } } -- 2.43.0
[PATCH 0/2] virtio: Keep notifications disabled during drain
Hi, When registering callbacks via aio_set_event_notifier_poll(), the io_poll_end() callback is only invoked when polling actually ends. If the notifiers are removed while in a polling section, it is not called. Therefore, io_poll_start() is not necessarily followed up by io_poll_end(). It is not entirely clear whether this is good or bad behavior. On one hand, it may be unexpected to callers. On the other, it may be counterproductive to call io_poll_end() when the polling section has not ended yet. Right now, there is only one user of aio_set_event_notifier_poll(), which is virtio_queue_aio_attach_host_notifier(). It does not expect this behavior, which leads to virtqueue notifiers remaining disabled if virtio_queue_aio_detach_host_notifier() is called while polling. That can happen e.g. through virtio_scsi_drained_begin() or virtio_blk_drained_begin() (through virtio_blk_data_plane_detach()). In such a case, the virtqueue may not be processed for a while, letting the guest driver hang. This can be reproduced by repeatedly hot-plugging and -unplugging a virtio-scsi device with a scsi-hd disk, because the guest will try to enumerate the virtio-scsi device while we’re attaching the scsi-hd disk, which causes a drain, which can cause the virtio-scsi virtqueue to stall as described. Stefan has suggested ensuring we always follow up io_poll_start() by io_poll_end(): https://lists.nongnu.org/archive/html/qemu-block/2023-12/msg00163.html I prefer changing the caller instead, because I don’t think we actually want the virtqueue notifier to be force-enabled when removing our AIO notifiers. So I believe we actually only want to take care to force-enable it when we re-attach the AIO notifiers, and to kick virtqueue processing once, in case we missed any events while the AIO notifiers were not attached. That is done by patch 2. We have already discussed a prior version of it here: https://lists.nongnu.org/archive/html/qemu-block/2024-01/msg1.html And compared to that, based on the discussion, there are some changes: 1. Used virtio_queue_notify() instead of virtio_queue_notify_vq(), as suggested by Paolo, because it’s thread-safe 2. Moved virtio_queue_notify() into virtio_queue_aio_attach_host_notifier*(), because we always want it 3. Dropped virtio_queue_set_notification(vq, 0) from virtio_queue_aio_detach_host_notifier(): Paolo wasn’t sure whether that was safe to do from any context. We don’t really need to call it anyway, so I just dropped it. 4. Added patch 1: Patch 1 fixes virtio_scsi_drained_end() so it won’t attach polling notifiers for the event virtqueue. That didn’t turn out to be an issue so far, but with patch 2, Fiona saw virtqueue processing spin in a loop as described in 38738f7dbbda90fbc161757b7f4be35b52205552 ("virtio-scsi: don't waste CPU polling the event virtqueue"). Note that as of eaad0fe26050c227dc5dad63205835bac4912a51 ("scsi: only access SCSIDevice->requests from one thread") there’s a different problem when trying to reproduce the bug via hot-plugging and -unplugging a virtio-scsi device, specifically, when unplugging, qemu may crash with an assertion failure[1]. 
I don’t have a full fix for that yet, but in case you need a work-around for the specific case of virtio-scsi hot-plugging and -unplugging, you can use this patch: https://czenczek.de/0001-DONTMERGE-Fix-crash-on-scsi-unplug.patch [1] https://lists.nongnu.org/archive/html/qemu-block/2024-01/msg00317.html Hanna Czenczek (2): virtio-scsi: Attach event vq notifier with no_poll virtio: Keep notifications disabled during drain include/block/aio.h | 7 ++- hw/scsi/virtio-scsi.c | 7 ++- hw/virtio/virtio.c | 42 ++ 3 files changed, 54 insertions(+), 2 deletions(-) -- 2.43.0
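For reference, the hot-plug/unplug reproducer described in this cover letter can be driven over QMP roughly as follows (illustrative only; the device IDs, the iothread, and the drive are made up and must exist in the actual configuration):

{"execute":"device_add","arguments":{"driver":"virtio-scsi-pci","id":"vscsi0","iothread":"iot0"}}
{"execute":"device_add","arguments":{"driver":"scsi-hd","id":"stg0","bus":"vscsi0.0","drive":"drv0"}}
{"execute":"device_del","arguments":{"id":"stg0"}}
{"execute":"device_del","arguments":{"id":"vscsi0"}}

Repeating this loop while the guest is still enumerating the virtio-scsi device is what triggers the drain and, with it, the stall described above.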
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 23.01.24 18:10, Kevin Wolf wrote: Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben: On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID:<20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd: {"execute":"device_del","arguments":{"id":"stg0"}} {"return": {}} qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed. [...] I don't know anything about the problem either, but since you already speculated about the cause, let me speculate about the solution: Can we hold the graph writer lock for the tran_commit() call in bdrv_try_change_aio_context()? And of course take the reader lock for blk_get_aio_context(), but that should be completely unproblematic. I tried this, and it’s not easy taking the lock just for tran_commit(), because some callers of bdrv_try_change_aio_context() already hold the write lock (specifically bdrv_attach_child_common(), bdrv_attach_child_common_abort(), and bdrv_root_unref_child()[1]), and qmp_x_blockdev_set_iothread() holds the read lock. Other callers don’t hold any lock[2]. So I’m not sure whether we should mark all of bdrv_try_change_aio_context() as GRAPH_WRLOCK and then make all callers take the lock, or really only take it for tran_commit(), and have callers release the lock around bdrv_try_change_aio_context(). Former sounds better to naïve me. (In any case, FWIW, having blk_set_aio_context() take the write lock, and scsi_device_for_each_req_async_bh() take the read lock[3], does fix the assertion failure.) Hanna [1] bdrv_root_unref_child() is not marked as GRAPH_WRLOCK, but its callers generally seem to ensure that the lock is taken when calling it. [2] blk_set_aio_context() (evidently), blk_exp_add(), external_snapshot_abort(), {blockdev,drive}_backup_action(), qmp_{blockdev,drive}_mirror() [3] I’ve made the _bh a coroutine (for bdrv_graph_co_rdlock()) and replaced the aio_bh_schedule_oneshot() by aio_co_enter() – hope that’s right.
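Spelled out, [3] would look roughly like this (the shape is assumed, not the patch that was actually tested; scsi_device_for_each_req_async_co is a made-up name, and the context re-check is elided):

/* Scheduling side: create a coroutine instead of a bare BH */
    Coroutine *co = qemu_coroutine_create(scsi_device_for_each_req_async_co,
                                          data);
    aio_co_enter(blk_get_aio_context(s->conf.blk), co);

/* The former BH body becomes a coroutine that can take the reader lock: */
static void coroutine_fn scsi_device_for_each_req_async_co(void *opaque)
{
    SCSIDeviceForEachReqAsyncData *data = opaque;
    SCSIDevice *s = data->s;
    SCSIRequest *req, *next_req;

    /* Hold the graph reader lock for as long as blk->ctx is relied upon */
    bdrv_graph_co_rdlock();
    QTAILQ_FOREACH_SAFE(req, &s->requests, next, next_req) {
        data->fn(req, data->fn_opaque);
    }
    bdrv_graph_co_rdunlock();

    g_free(data);
}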
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 23.01.24 18:10, Kevin Wolf wrote: Am 23.01.2024 um 17:40 hat Hanna Czenczek geschrieben: On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID:<20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd: {"execute":"device_del","arguments":{"id":"stg0"}} {"return": {}} qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed. (gdb) bt #0 0x7f32a668d83c in () at /usr/lib/libc.so.6 #1 0x7f32a663d668 in raise () at /usr/lib/libc.so.6 #2 0x7f32a66254b8 in abort () at /usr/lib/libc.so.6 #3 0x7f32a66253dc in () at /usr/lib/libc.so.6 #4 0x7f32a6635d26 in () at /usr/lib/libc.so.6 #5 0x556e6b4880a4 in blk_get_aio_context (blk=0x556e6e89ccf0) at ../block/block-backend.c:2429 #6 blk_get_aio_context (blk=0x556e6e89ccf0) at ../block/block-backend.c:2417 #7 0x556e6b112d87 in scsi_device_for_each_req_async_bh (opaque=0x556e6e2c6d10) at ../hw/scsi/scsi-bus.c:128 #8 0x556e6b5d1966 in aio_bh_poll (ctx=ctx@entry=0x556e6d8aa290) at ../util/async.c:218 #9 0x556e6b5bb16a in aio_poll (ctx=0x556e6d8aa290, blocking=blocking@entry=true) at ../util/aio-posix.c:722 #10 0x556e6b4564b6 in iothread_run (opaque=opaque@entry=0x556e6d89d920) at ../iothread.c:63 #11 0x556e6b5bde58 in qemu_thread_start (args=0x556e6d8aa9b0) at ../util/qemu-thread-posix.c:541 #12 0x7f32a668b9eb in () at /usr/lib/libc.so.6 #13 0x7f32a670f7cc in () at /usr/lib/libc.so.6 I don’t know anything about the problem yet, but as usual, I like speculation and discovering how wrong I was later on, so one thing I came across that’s funny about virtio-scsi is that requests can happen even while a disk is being attached or detached. That is, Linux seems to probe all LUNs when a new virtio-scsi device is being attached, and it won’t stop just because a disk is being attached or removed. So maybe that’s part of the problem, that we get a request while the BB is being detached, and temporarily in an inconsistent state (BDS context differs from BB context). I don't know anything about the problem either, but since you already speculated about the cause, let me speculate about the solution: Can we hold the graph writer lock for the tran_commit() call in bdrv_try_change_aio_context()? And of course take the reader lock for blk_get_aio_context(), but that should be completely unproblematic. At the first sight I don't see a reason why this would break something, but I've learnt not to trust my first impression with the graph locking work... 
Of course, I also didn't check if there are more things inside of the device emulation that need additional locking in this case, too. But even if so, blk_get_aio_context() should never see an inconsistent state. Ah, sorry, saw your reply only now that I hit “send”. I forgot that we do have that big lock that I rather wanted to avoid :) Sounds good and very reasonable to me. Changing the contexts in the graph sounds like a graph change operation, and reading and comparing contexts in the graph sounds like reading the graph. Hanna
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 23.01.24 17:40, Hanna Czenczek wrote: On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID:<20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) [...] I don’t know anything about the problem yet, but as usual, I like speculation and discovering how wrong I was later on, so one thing I came across that’s funny about virtio-scsi is that requests can happen even while a disk is being attached or detached. That is, Linux seems to probe all LUNs when a new virtio-scsi device is being attached, and it won’t stop just because a disk is being attached or removed. So maybe that’s part of the problem, that we get a request while the BB is being detached, and temporarily in an inconsistent state (BDS context differs from BB context). I’ll look more into it. What I think happens is that scsi_device_purge_requests() runs (perhaps through virtio_scsi_hotunplug() -> qdev_simple_device_unplug_cb() -> scsi_qdev_unrealize()?), which schedules scsi_device_for_each_req_async_bh() to run, but doesn’t await it. We go on, begin to move the BB and its BDS back to the main context (via blk_set_aio_context() in virtio_scsi_hotunplug()), but scsi_device_for_each_req_async_bh() still runs in the I/O thread, it calls blk_get_aio_context() while the contexts are inconsistent, and we get the crash. There is a comment above blk_get_aio_context() in scsi_device_for_each_req_async_bh() about the BB potentially being moved to a different context prior to the BH running, but it doesn’t consider the possibility that that move may occur *concurrently*. I don’t know how to fix this, though. The whole idea of anything happening to a BB while it is being moved to a different context seems so wrong to me that I’d want to slap a big lock on it, but I have the feeling that that isn’t what we want. Hanna
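Laid out as a timeline, the interleaving suspected in this description is:

main loop thread                          IOThread (BB's old context)
----------------                          ---------------------------
scsi_qdev_unrealize()
  -> scsi_device_purge_requests()
     -> schedules scsi_device_for_each_req_async_bh(),
        does not await it
blk_set_aio_context(blk, main context)
  -> BDS context and blk->ctx are briefly
     inconsistent while the change is
     being committed
                                          scsi_device_for_each_req_async_bh()
                                            -> blk_get_aio_context(blk)
                                            -> assert(ctx == blk->ctx) fails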
Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
On 21.12.23 22:23, Kevin Wolf wrote: From: Stefan Hajnoczi Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time: - When the VM is running only the BlockBackend's AioContext may access the requests list. - When the VM is stopped only the main loop may access the requests list. These constraints protect the requests list without the need for locking in the I/O code path. Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Message-ID:<20231204164259.1515217-2-stefa...@redhat.com> Signed-off-by: Kevin Wolf --- include/hw/scsi/scsi.h | 7 +- hw/scsi/scsi-bus.c | 181 - 2 files changed, 131 insertions(+), 57 deletions(-) My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd: {"execute":"device_del","arguments":{"id":"stg0"}} {"return": {}} qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed. (gdb) bt #0 0x7f32a668d83c in () at /usr/lib/libc.so.6 #1 0x7f32a663d668 in raise () at /usr/lib/libc.so.6 #2 0x7f32a66254b8 in abort () at /usr/lib/libc.so.6 #3 0x7f32a66253dc in () at /usr/lib/libc.so.6 #4 0x7f32a6635d26 in () at /usr/lib/libc.so.6 #5 0x556e6b4880a4 in blk_get_aio_context (blk=0x556e6e89ccf0) at ../block/block-backend.c:2429 #6 blk_get_aio_context (blk=0x556e6e89ccf0) at ../block/block-backend.c:2417 #7 0x556e6b112d87 in scsi_device_for_each_req_async_bh (opaque=0x556e6e2c6d10) at ../hw/scsi/scsi-bus.c:128 #8 0x556e6b5d1966 in aio_bh_poll (ctx=ctx@entry=0x556e6d8aa290) at ../util/async.c:218 #9 0x556e6b5bb16a in aio_poll (ctx=0x556e6d8aa290, blocking=blocking@entry=true) at ../util/aio-posix.c:722 #10 0x556e6b4564b6 in iothread_run (opaque=opaque@entry=0x556e6d89d920) at ../iothread.c:63 #11 0x556e6b5bde58 in qemu_thread_start (args=0x556e6d8aa9b0) at ../util/qemu-thread-posix.c:541 #12 0x7f32a668b9eb in () at /usr/lib/libc.so.6 #13 0x7f32a670f7cc in () at /usr/lib/libc.so.6 I don’t know anything about the problem yet, but as usual, I like speculation and discovering how wrong I was later on, so one thing I came across that’s funny about virtio-scsi is that requests can happen even while a disk is being attached or detached. That is, Linux seems to probe all LUNs when a new virtio-scsi device is being attached, and it won’t stop just because a disk is being attached or removed. So maybe that’s part of the problem, that we get a request while the BB is being detached, and temporarily in an inconsistent state (BDS context differs from BB context). I’ll look more into it. Hanna
Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler
On 02.01.24 16:24, Hanna Czenczek wrote: [...] I’ve attached the preliminary patch that I didn’t get to send (or test much) last year. Not sure if it has the same CPU-usage-spike issue Fiona was seeing, the only functional difference is that I notify the vq after attaching the notifiers instead of before. It’ll take me some more time to send a patch mail to that effect, because now there’s an assertion failure on hotunplug, which bisects back to eaad0fe26050c227dc5dad63205835bac4912a51 (“scsi: only access SCSIDevice->requests from one thread”): {"execute":"device_del","arguments":{"id":"stg0"}} {"return": {}} qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed. (gdb) bt #0 0x7f32a668d83c in () at /usr/lib/libc.so.6 #1 0x7f32a663d668 in raise () at /usr/lib/libc.so.6 #2 0x7f32a66254b8 in abort () at /usr/lib/libc.so.6 #3 0x7f32a66253dc in () at /usr/lib/libc.so.6 #4 0x7f32a6635d26 in () at /usr/lib/libc.so.6 #5 0x556e6b4880a4 in blk_get_aio_context (blk=0x556e6e89ccf0) at ../block/block-backend.c:2429 #6 blk_get_aio_context (blk=0x556e6e89ccf0) at ../block/block-backend.c:2417 #7 0x556e6b112d87 in scsi_device_for_each_req_async_bh (opaque=0x556e6e2c6d10) at ../hw/scsi/scsi-bus.c:128 #8 0x556e6b5d1966 in aio_bh_poll (ctx=ctx@entry=0x556e6d8aa290) at ../util/async.c:218 #9 0x556e6b5bb16a in aio_poll (ctx=0x556e6d8aa290, blocking=blocking@entry=true) at ../util/aio-posix.c:722 #10 0x556e6b4564b6 in iothread_run (opaque=opaque@entry=0x556e6d89d920) at ../iothread.c:63 #11 0x556e6b5bde58 in qemu_thread_start (args=0x556e6d8aa9b0) at ../util/qemu-thread-posix.c:541 #12 0x7f32a668b9eb in () at /usr/lib/libc.so.6 #13 0x7f32a670f7cc in () at /usr/lib/libc.so.6
Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler
On 23.01.24 12:12, Fiona Ebner wrote: [...] I noticed poll_set_started() is not called, because ctx->fdmon_ops->need_wait(ctx) was true, i.e. ctx->poll_disable_cnt was positive (I'm using fdmon_poll). I then found this is because of the notifier for the event vq, being attached with virtio_queue_aio_attach_host_notifier_no_poll(vs->event_vq, s->ctx); in virtio_scsi_dataplane_start(). But in virtio_scsi_drained_end() it is attached with virtio_queue_aio_attach_host_notifier() instead of the _no_poll() variant. So that might be the actual issue here? From a quick test, I cannot see the CPU-usage-spike issue with the following either: diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c index 690aceec45..ba1ab8e410 100644 --- a/hw/scsi/virtio-scsi.c +++ b/hw/scsi/virtio-scsi.c @@ -1166,7 +1166,15 @@ static void virtio_scsi_drained_end(SCSIBus *bus) for (uint32_t i = 0; i < total_queues; i++) { VirtQueue *vq = virtio_get_queue(vdev, i); -virtio_queue_aio_attach_host_notifier(vq, s->ctx); +if (!virtio_queue_get_notification(vq)) { +virtio_queue_set_notification(vq, true); +} +if (vq == VIRTIO_SCSI_COMMON(s)->event_vq) { +virtio_queue_aio_attach_host_notifier_no_poll(vq, s->ctx); +} else { +virtio_queue_aio_attach_host_notifier(vq, s->ctx); +} +virtio_queue_notify(vdev, i); } } Perfect, so we agree on trying it that way. :) Hanna
Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler
On 22.01.24 18:52, Hanna Czenczek wrote: On 22.01.24 18:41, Hanna Czenczek wrote: On 05.01.24 15:30, Fiona Ebner wrote: Am 05.01.24 um 14:43 schrieb Fiona Ebner: Am 03.01.24 um 14:35 schrieb Paolo Bonzini: On 1/3/24 12:40, Fiona Ebner wrote: I'm happy to report that I cannot reproduce the CPU-usage-spike issue with the patch, but I did run into an assertion failure when trying to verify that it fixes my original stuck-guest-IO issue. See below for the backtrace [0]. Hanna wrote in https://issues.redhat.com/browse/RHEL-3934 I think it’s sufficient to simply call virtio_queue_notify_vq(vq) after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, because both virtio-scsi’s and virtio-blk’s .handle_output() implementations acquire the device’s context, so this should be directly callable from any context. I guess this is not true anymore now that the AioContext locking was removed? Good point and, in fact, even before it was much safer to use virtio_queue_notify() instead. Not only does it use the event notifier handler, but it also calls it in the right thread/AioContext just by doing event_notifier_set(). But with virtio_queue_notify() using the event notifier, the CPU-usage-spike issue is present: Back to the CPU-usage-spike issue: I experimented around and it doesn't seem to matter whether I notify the virt queue before or after attaching the notifiers. But there's another functional difference. My patch called virtio_queue_notify() which contains this block: if (vq->host_notifier_enabled) { event_notifier_set(&vq->host_notifier); } else if (vq->handle_output) { vq->handle_output(vdev, vq); In my testing, the first branch was taken, calling event_notifier_set(). Hanna's patch uses virtio_queue_notify_vq() and there, vq->handle_output() will be called. That seems to be the relevant difference regarding the CPU-usage-spike issue. I should mention that this is with a VirtIO SCSI disk. I also attempted to reproduce the CPU-usage-spike issue with a VirtIO block disk, but didn't manage yet. What I noticed is that in virtio_queue_host_notifier_aio_poll(), one of the queues (but only one) will always show as nonempty. And then, run_poll_handlers_once() will always detect progress which explains the CPU usage. The following shows 1. vq address 2. number of times vq was passed to virtio_queue_host_notifier_aio_poll() 3. number of times the result of virtio_queue_host_notifier_aio_poll() was true for the vq 0x555fd93f9c80 17162000 0 0x555fd93f9e48 17162000 6 0x555fd93f9ee0 17162000 0 0x555fd93f9d18 17162000 17162000 0x555fd93f9db0 17162000 0 0x555fd93f9f78 17162000 0 And for the problematic one, the reason it is seen as nonempty is: 0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0 vring_avail_idx(vq) also gives 8 here. This is the vs->event_vq and s->events_dropped is false in my testing, so virtio_scsi_handle_event_vq() doesn't do anything. Those values stay like this while the call counts above increase. So something going wrong with the indices when the event notifier is set from QEMU side (in the main thread)? The guest is Debian 12 with a 6.1 kernel. So, trying to figure out a new RFC version: About the stack trace you, Fiona, posted: As far as I understand, that happens because virtio_blk_drained_end() calling virtio_queue_notify_vq() wasn’t safe after all, and instead we need to use virtio_queue_notify(). Right? However, you say using virtio_queue_notify() instead causes busy loops of doing nothing in virtio-scsi (what you describe above). I mean, better than a crash, but, you know. 
:) I don’t have any prior knowledge about the event handling, unfortunately. The fact that 8 buffers are available but we don’t use any sounds OK to me; as far as I understand, we only use those buffers if we have any events to push into them, so as long as we don’t, we won’t. Question is, should we not have its poll handler return false if we don’t have any events (i.e. events_dropped is false)? Would that solve it? Or actually, maybe we could just skip the virtio_queue_notify() call for the event vq? I.e. have it be `if (vq != VIRTIO_SCSI_COMMON(s)->event_vq) { virtio_queue_notify(vdev, i); }`? I wouldn’t like that very much, (a) because this would make it slightly cumbersome to put that into virtio_queue_aio_attach_host_notifier*(), and (b) in case that does fix it, I do kind of feel like the real problem is that we use virtio_queue_host_notifier_aio_poll() for the event vq, which tells the polling code to poll whenever the vq is non-empty, but we (AFAIU) expect the event vq to be non-empty all the time. Turns out there was commit 38738f7dbbda90fbc161757b7f4be35b52205552 (“virtio-scsi: don't waste CPU polling the event virtqueue”) by Stefan, which specifically intended to not use virtio_queue_host_notifier_aio_poll() for the event vq. I think the problem is that virtio_scsi_drained
Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler
On 22.01.24 18:41, Hanna Czenczek wrote: On 05.01.24 15:30, Fiona Ebner wrote: Am 05.01.24 um 14:43 schrieb Fiona Ebner: Am 03.01.24 um 14:35 schrieb Paolo Bonzini: On 1/3/24 12:40, Fiona Ebner wrote: I'm happy to report that I cannot reproduce the CPU-usage-spike issue with the patch, but I did run into an assertion failure when trying to verify that it fixes my original stuck-guest-IO issue. See below for the backtrace [0]. Hanna wrote in https://issues.redhat.com/browse/RHEL-3934 I think it’s sufficient to simply call virtio_queue_notify_vq(vq) after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, because both virtio-scsi’s and virtio-blk’s .handle_output() implementations acquire the device’s context, so this should be directly callable from any context. I guess this is not true anymore now that the AioContext locking was removed? Good point and, in fact, even before it was much safer to use virtio_queue_notify() instead. Not only does it use the event notifier handler, but it also calls it in the right thread/AioContext just by doing event_notifier_set(). But with virtio_queue_notify() using the event notifier, the CPU-usage-spike issue is present: Back to the CPU-usage-spike issue: I experimented around and it doesn't seem to matter whether I notify the virt queue before or after attaching the notifiers. But there's another functional difference. My patch called virtio_queue_notify() which contains this block: if (vq->host_notifier_enabled) { event_notifier_set(&vq->host_notifier); } else if (vq->handle_output) { vq->handle_output(vdev, vq); In my testing, the first branch was taken, calling event_notifier_set(). Hanna's patch uses virtio_queue_notify_vq() and there, vq->handle_output() will be called. That seems to be the relevant difference regarding the CPU-usage-spike issue. I should mention that this is with a VirtIO SCSI disk. I also attempted to reproduce the CPU-usage-spike issue with a VirtIO block disk, but didn't manage yet. What I noticed is that in virtio_queue_host_notifier_aio_poll(), one of the queues (but only one) will always show as nonempty. And then, run_poll_handlers_once() will always detect progress which explains the CPU usage. The following shows 1. vq address 2. number of times vq was passed to virtio_queue_host_notifier_aio_poll() 3. number of times the result of virtio_queue_host_notifier_aio_poll() was true for the vq 0x555fd93f9c80 17162000 0 0x555fd93f9e48 17162000 6 0x555fd93f9ee0 17162000 0 0x555fd93f9d18 17162000 17162000 0x555fd93f9db0 17162000 0 0x555fd93f9f78 17162000 0 And for the problematic one, the reason it is seen as nonempty is: 0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0 vring_avail_idx(vq) also gives 8 here. This is the vs->event_vq and s->events_dropped is false in my testing, so virtio_scsi_handle_event_vq() doesn't do anything. Those values stay like this while the call counts above increase. So something going wrong with the indices when the event notifier is set from QEMU side (in the main thread)? The guest is Debian 12 with a 6.1 kernel. So, trying to figure out a new RFC version: About the stack trace you, Fiona, posted: As far as I understand, that happens because virtio_blk_drained_end() calling virtio_queue_notify_vq() wasn’t safe after all, and instead we need to use virtio_queue_notify(). Right? However, you say using virtio_queue_notify() instead causes busy loops of doing nothing in virtio-scsi (what you describe above). I mean, better than a crash, but, you know. 
:) I don’t have any prior knowledge about the event handling, unfortunately. The fact that 8 buffers are available but we don’t use any sounds OK to me; as far as I understand, we only use those buffers if we have any events to push into them, so as long as we don’t, we won’t. Question is, should we not have its poll handler return false if we don’t have any events (i.e. events_dropped is false)? Would that solve it? Or actually, maybe we could just skip the virtio_queue_notify() call for the event vq? I.e. have it be `if (vq != VIRTIO_SCSI_COMMON(s)->event_vq) { virtio_queue_notify(vdev, i); }`? I wouldn’t like that very much, (a) because this would make it slightly cumbersome to put that into virtio_queue_aio_attach_host_notifier*(), and (b) in case that does fix it, I do kind of feel like the real problem is that we use virtio_queue_host_notifier_aio_poll() for the event vq, which tells the polling code to poll whenever the vq is non-empty, but we (AFAIU) expect the event vq to be non-empty all the time. Hanna
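For context, the polling callback in question boils down to a non-emptiness check, along these lines (reproduced from hw/virtio/virtio.c from memory, so details may differ):

static bool virtio_queue_host_notifier_aio_poll(void *opaque)
{
    EventNotifier *n = opaque;
    VirtQueue *vq = container_of(n, VirtQueue, host_notifier);

    /*
     * Reports progress whenever the vq is non-empty, which for the event
     * vq is the expected steady state.
     */
    return vq->vring.desc && !virtio_queue_empty(vq);
}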
Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler
On 05.01.24 15:30, Fiona Ebner wrote: Am 05.01.24 um 14:43 schrieb Fiona Ebner: Am 03.01.24 um 14:35 schrieb Paolo Bonzini: On 1/3/24 12:40, Fiona Ebner wrote: I'm happy to report that I cannot reproduce the CPU-usage-spike issue with the patch, but I did run into an assertion failure when trying to verify that it fixes my original stuck-guest-IO issue. See below for the backtrace [0]. Hanna wrote in https://issues.redhat.com/browse/RHEL-3934 I think it’s sufficient to simply call virtio_queue_notify_vq(vq) after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, because both virtio-scsi’s and virtio-blk’s .handle_output() implementations acquire the device’s context, so this should be directly callable from any context. I guess this is not true anymore now that the AioContext locking was removed? Good point and, in fact, even before it was much safer to use virtio_queue_notify() instead. Not only does it use the event notifier handler, but it also calls it in the right thread/AioContext just by doing event_notifier_set(). But with virtio_queue_notify() using the event notifier, the CPU-usage-spike issue is present: Back to the CPU-usage-spike issue: I experimented around and it doesn't seem to matter whether I notify the virt queue before or after attaching the notifiers. But there's another functional difference. My patch called virtio_queue_notify() which contains this block: if (vq->host_notifier_enabled) { event_notifier_set(&vq->host_notifier); } else if (vq->handle_output) { vq->handle_output(vdev, vq); In my testing, the first branch was taken, calling event_notifier_set(). Hanna's patch uses virtio_queue_notify_vq() and there, vq->handle_output() will be called. That seems to be the relevant difference regarding the CPU-usage-spike issue. I should mention that this is with a VirtIO SCSI disk. I also attempted to reproduce the CPU-usage-spike issue with a VirtIO block disk, but didn't manage yet. What I noticed is that in virtio_queue_host_notifier_aio_poll(), one of the queues (but only one) will always show as nonempty. And then, run_poll_handlers_once() will always detect progress which explains the CPU usage. The following shows 1. vq address 2. number of times vq was passed to virtio_queue_host_notifier_aio_poll() 3. number of times the result of virtio_queue_host_notifier_aio_poll() was true for the vq 0x555fd93f9c80 17162000 0 0x555fd93f9e48 17162000 6 0x555fd93f9ee0 17162000 0 0x555fd93f9d18 17162000 17162000 0x555fd93f9db0 17162000 0 0x555fd93f9f78 17162000 0 And for the problematic one, the reason it is seen as nonempty is: 0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0 vring_avail_idx(vq) also gives 8 here. This is the vs->event_vq and s->events_dropped is false in my testing, so virtio_scsi_handle_event_vq() doesn't do anything. Those values stay like this while the call counts above increase. So something going wrong with the indices when the event notifier is set from QEMU side (in the main thread)? The guest is Debian 12 with a 6.1 kernel. So, trying to figure out a new RFC version: About the stack trace you, Fiona, posted: As far as I understand, that happens because virtio_blk_drained_end() calling virtio_queue_notify_vq() wasn’t safe after all, and instead we need to use virtio_queue_notify(). Right? However, you say using virtio_queue_notify() instead causes busy loops of doing nothing in virtio-scsi (what you describe above). I mean, better than a crash, but, you know. :) I don’t have any prior knowledge about the event handling, unfortunately. 
The fact that 8 buffers are available but we don’t use any sounds OK to me; as far as I understand, we only use those buffers if we have any events to push into them, so as long as we don’t, we won’t. Question is, should we not have its poll handler return false if we don’t have any events (i.e. events_dropped is false)? Would that solve it? Hanna
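That idea might look something like the following. This is purely hypothetical: virtio_scsi_event_vq_aio_poll is a made-up name, VirtQueue's internals are private to virtio.c (so a real version would need a helper there), and the thread ultimately went a different way, attaching the event vq without polling handlers at all.

static bool virtio_scsi_event_vq_aio_poll(void *opaque)
{
    EventNotifier *n = opaque;
    VirtQueue *vq = container_of(n, VirtQueue, host_notifier);
    VirtIOSCSI *s = VIRTIO_SCSI(vq->vdev);

    /* Only report poll progress when there are events to deliver */
    if (!s->events_dropped) {
        return false;
    }
    return virtio_queue_host_notifier_aio_poll(opaque);
}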
Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler
On 02.01.24 16:53, Paolo Bonzini wrote: On Tue, Jan 2, 2024 at 4:24 PM Hanna Czenczek wrote: I’ve attached the preliminary patch that I didn’t get to send (or test much) last year. Not sure if it has the same CPU-usage-spike issue Fiona was seeing, the only functional difference is that I notify the vq after attaching the notifiers instead of before. I think the patch makes sense and cleaning up the logic of aio_poll (which is one of those functions that grew and grew without much clarity into who did what) can be done on top. Just one small thing, the virtio_queue_notify_vq() call is required because the virtqueue interrupt and eventfd are edge-triggered rather than level-triggered; so perhaps it should be placed in the function(s) that establish the handlers, virtio_queue_aio_attach_host_notifier() and virtio_queue_aio_attach_host_notifier_no_poll()? Neither virtio_blk_drained_end() nor virtio_scsi_drained_end() are particularly special, and the comment applies just as well: /* * We will have ignored notifications about new requests from the guest * while handlers were not attached, so "kick" the virt queue to process * those requests now. */ I wasn’t entirely sure whether we want to call notify_vq() if we have instated the handlers for the first time (in which case someone ought to look for existing unprocessed requests anyway), so I decided to limit it to drained_end. But considering that it must be safe to call notify_vq() right after instating the handlers (virtio_queue_host_notifier_read() may then be called after all), we might as well do so every time, yes. Hanna
Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler
On 13.12.23 22:15, Stefan Hajnoczi wrote: Hanna and Fiona encountered a bug in aio_set_fd_handler(): there is no matching io_poll_end() call upon removing an AioHandler when io_poll_begin() was previously called. The missing io_poll_end() call leaves virtqueue notifications disabled and the virtqueue's ioeventfd will never become readable anymore. The details of how virtio-scsi devices using IOThreads can hang after hotplug/unplug are covered here: https://issues.redhat.com/browse/RHEL-3934 Hanna is currently away over the December holidays. I'm sending these RFC patches in the meantime. They demonstrate running aio_set_fd_handler() in the AioContext home thread and adding the missing io_poll_end() call. I agree with Paolo that if the handlers are removed, the caller probably isn’t interested in notifications: In our specific case, we remove the handlers because the device is to be drained, so we won’t poll the virtqueue anyway. Therefore, if io_poll_end() is to be called to complete the begin-end pair, it shouldn’t be done when the handlers are removed, but instead when they are reinstated. I believe that’s quite infeasible to do in generic code: We’d have to basically remember that we haven’t called a specific io_poll_end callback yet, and so once it is reinstated while we’re not in a polling phase, we would need to call it then. That in itself is already hairy, but in addition, the callback consists of a function pointer and an opaque pointer, and it seems impossible to reliably establish identity between two opaque pointers to know whether a newly instated io_poll_end callback is the same one as one that had been removed before (pointer identity may work, but also may not). That’s why I think the responsibility should fall on the caller. I believe both virtio_queue_aio_attach_host_notifier*() functions should enable vq notifications before instating the handler – if we’re in polling mode, io_poll_begin() will then immediately be called and disable vq notifications again. Having them enabled briefly shouldn’t hurt anything but performance (which I don’t think is terrible in the drain case). For completeness’ sake, we may even have virtio_queue_aio_detach_host_notifier() disable vq notifications, because otherwise it’s unknown whether notifications are enabled or disabled after removing the notifiers. That isn’t terrible, but I think (A) if any of the two, we want them to be disabled, because we won’t check for requests anyway, and (B) this gives us a clearer state machine, where removing the notifiers will always leave vq notifications disabled, so when they are reinstated (i.e. when calling virtio_queue_aio_attach_host_notifier*()), it’s clear that we must poll once to check for new requests. I’ve attached the preliminary patch that I didn’t get to send (or test much) last year. Not sure if it has the same CPU-usage-spike issue Fiona was seeing, the only functional difference is that I notify the vq after attaching the notifiers instead of before. Hanna
From 451aae74fc19a6ea5cd6381247cd9202571651e8 Mon Sep 17 00:00:00 2001 From: Hanna Czenczek Date: Wed, 6 Dec 2023 18:24:55 +0100 Subject: [PATCH] Keep notifications disabled during drain Preliminary patch with a preliminary description: During drain, we do not care about virtqueue notifications, which is why we remove the handlers on it.
When removing those handlers, whether vq notifications are enabled or not depends on whether we were in polling mode or not; if not, they are enabled (by default); if so, they have been disabled by the io_poll_begin callback. Ideally, we would rather have the vq notifications be disabled, because we will not process requests during a drained section anyway. Therefore, have virtio_queue_aio_detach_host_notifier() explicitly disable them, and virtio_queue_aio_attach_host_notifier*() re-enable them (if we are in a polling section, attaching them will invoke the io_poll_begin callback, which will re-disable them). Because we will miss virtqueue updates in the drained section, we also need to poll the virtqueue once in the drained_end functions after attaching the notifiers. --- include/block/aio.h | 7 ++- include/hw/virtio/virtio.h | 1 + hw/block/virtio-blk.c | 10 ++ hw/scsi/virtio-scsi.c | 10 ++ hw/virtio/virtio.c | 30 +- 5 files changed, 56 insertions(+), 2 deletions(-) diff --git a/include/block/aio.h b/include/block/aio.h index f08b358077..795a375ff2 100644 --- a/include/block/aio.h +++ b/include/block/aio.h @@ -497,9 +497,14 @@ void aio_set_event_notifier(AioContext *ctx, AioPollFn *io_poll, EventNotifierHandler *io_poll_ready); -/* Set polling begin/end callbacks for an event notifier that has already been +/* + * Set polling begin/end callbacks for an event notifier that has already been * registered
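A compact standalone model of the state machine this patch aims for (local names, not the actual QEMU functions): detaching always leaves notifications disabled, attaching re-enables them, and in a polling section the begin callback immediately disables them again:

#include <stdbool.h>
#include <stdio.h>

struct vq {
    bool notifications_enabled;
    bool polling;
};

static void detach_host_notifier(struct vq *vq)
{
    vq->notifications_enabled = false; /* known state after detach */
}

static void attach_host_notifier(struct vq *vq)
{
    vq->notifications_enabled = true;
    if (vq->polling) {
        /* what the io_poll_begin callback would do */
        vq->notifications_enabled = false;
    }
}

int main(void)
{
    struct vq vq = { .notifications_enabled = true, .polling = false };
    detach_host_notifier(&vq); /* drained_begin */
    attach_host_notifier(&vq); /* drained_end; caller then polls once */
    printf("notifications enabled: %d\n", vq.notifications_enabled);
    return 0;
}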
[PULL 0/3] Block patches
The following changes since commit 3e01f1147a16ca566694b97eafc941d62fa1e8d8: Merge tag 'pull-sp-20231105' of https://gitlab.com/rth7680/qemu into staging (2023-11-06 09:34:22 +0800) are available in the Git repository at: https://gitlab.com/hreitz/qemu.git tags/pull-block-2023-11-06 for you to fetch changes up to ad4feaca61d76fecad784e6d5e7bae40d0411c46: file-posix: fix over-writing of returning zone_append offset (2023-11-06 16:15:07 +0100) Block patches: - One patch to make qcow2's discard-no-unref option better do what it is supposed to do (i.e. prevent fragmentation) - Two fixes for zoned requests Jean-Louis Dupond (1): qcow2: keep reference on zeroize with discard-no-unref enabled Naohiro Aota (1): file-posix: fix over-writing of returning zone_append offset Sam Li (1): block/file-posix: fix update_zones_wp() caller qapi/block-core.json | 24 ++-- block/file-posix.c | 23 --- block/qcow2-cluster.c | 22 ++ qemu-options.hx | 10 +++--- 4 files changed, 51 insertions(+), 28 deletions(-) -- 2.41.0
[PULL 2/3] block/file-posix: fix update_zones_wp() caller
From: Sam Li When a zoned request fails, it needs to update only the wp of the target zones, so as not to disrupt in-flight writes on other zones. The wp is updated successfully after the request completes. Fix the callers to pass the right offset and nr_zones. Signed-off-by: Sam Li Message-Id: <20230825040556.4217-1-faithilike...@gmail.com> Reviewed-by: Stefan Hajnoczi [hreitz: Rebased and fixed comment spelling] Signed-off-by: Hanna Czenczek --- block/file-posix.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index 50e2b20d5c..3c0ce9c258 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2523,7 +2523,10 @@ out: } } } else { -update_zones_wp(bs, s->fd, 0, 1); +/* + * write and append write are not allowed to cross zone boundaries + */ +update_zones_wp(bs, s->fd, offset, 1); } qemu_co_mutex_unlock(&wps->colock); @@ -3470,7 +3473,7 @@ static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op, len >> BDRV_SECTOR_BITS); ret = raw_thread_pool_submit(handle_aiocb_zone_mgmt, &acb); if (ret != 0) { -update_zones_wp(bs, s->fd, offset, i); +update_zones_wp(bs, s->fd, offset, nrz); error_report("ioctl %s failed %d", op_name, ret); return ret; } -- 2.41.0
[PULL 1/3] qcow2: keep reference on zeroize with discard-no-unref enabled
From: Jean-Louis Dupond When the discard-no-unref flag is enabled, we keep the reference for normal discard requests. But when a discard is executed on a snapshot/qcow2 image with backing, the discards are saved as zero clusters in the snapshot image. When committing the snapshot to the backing file, it is not discard_in_l2_slice() that is called but zero_in_l2_slice(), which did not have any logic to keep the reference when discard-no-unref is enabled. Therefore, we add logic to the zero_in_l2_slice() call to keep the reference on commit. Fixes: https://gitlab.com/qemu-project/qemu/-/issues/1621 Signed-off-by: Jean-Louis Dupond Message-Id: <20231003125236.216473-2-jean-lo...@dupond.be> [hreitz: Made the documentation change more verbose, as discussed on-list] Signed-off-by: Hanna Czenczek --- qapi/block-core.json | 24 ++-- block/qcow2-cluster.c | 22 ++ qemu-options.hx | 10 +++--- 3 files changed, 39 insertions(+), 17 deletions(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index 99961256f2..ca390c5700 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -3528,16 +3528,20 @@ # @pass-discard-other: whether discard requests for the data source # should be issued on other occasions where a cluster gets freed # -# @discard-no-unref: when enabled, discards from the guest will not -# cause cluster allocations to be relinquished. This prevents -# qcow2 fragmentation that would be caused by such discards. -# Besides potential performance degradation, such fragmentation -# can lead to increased allocation of clusters past the end of the -# image file, resulting in image files whose file length can grow -# much larger than their guest disk size would suggest. If image -# file length is of concern (e.g. when storing qcow2 images -# directly on block devices), you should consider enabling this -# option. (since 8.1) +# @discard-no-unref: when enabled, data clusters will remain +# preallocated when they are no longer used, e.g. because they are +# discarded or converted to zero clusters. As usual, whether the +# old data is discarded or kept on the protocol level (i.e. in the +# image file) depends on the setting of the pass-discard-request +# option. Keeping the clusters preallocated prevents qcow2 +# fragmentation that would otherwise be caused by freeing and +# re-allocating them later. Besides potential performance +# degradation, such fragmentation can lead to increased allocation +# of clusters past the end of the image file, resulting in image +# files whose file length can grow much larger than their guest disk +# size would suggest. If image file length is of concern (e.g. when +# storing qcow2 images directly on block devices), you should +# consider enabling this option.
(since 8.1) # # @overlap-check: which overlap checks to perform for writes to the # image, defaults to 'cached' (since 2.2) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 904f00d1b3..5af439bd11 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -1983,7 +1983,7 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t offset, uint64_t nb_clusters, /* If we keep the reference, pass on the discard still */ bdrv_pdiscard(s->data_file, old_l2_entry & L2E_OFFSET_MASK, s->cluster_size); - } +} } qcow2_cache_put(s->l2_table_cache, (void **) &l2_slice); @@ -2061,9 +2061,15 @@ zero_in_l2_slice(BlockDriverState *bs, uint64_t offset, QCow2ClusterType type = qcow2_get_cluster_type(bs, old_l2_entry); bool unmap = (type == QCOW2_CLUSTER_COMPRESSED) || ((flags & BDRV_REQ_MAY_UNMAP) && qcow2_cluster_is_allocated(type)); -uint64_t new_l2_entry = unmap ? 0 : old_l2_entry; +bool keep_reference = +(s->discard_no_unref && type != QCOW2_CLUSTER_COMPRESSED); +uint64_t new_l2_entry = old_l2_entry; uint64_t new_l2_bitmap = old_l2_bitmap; +if (unmap && !keep_reference) { +new_l2_entry = 0; +} + if (has_subclusters(s)) { new_l2_bitmap = QCOW_L2_BITMAP_ALL_ZEROES; } else { @@ -2081,9 +2087,17 @@ zero_in_l2_slice(BlockDriverState *bs, uint64_t offset, set_l2_bitmap(s, l2_slice, l2_index + i, new_l2_bitmap); } -/* Then decrease the refcount */ if (unmap) { -qcow2_free_any_cluster(bs, old_l2_entry, QCOW2_DISCARD_REQUEST); +if (!keep_reference) { +/* Then decrease the refcount */ +qcow2_free_any_cluster(bs, old_l2_entry, QCOW2_DISCARD_REQUEST); +} else if (s->discard_passthrough[QCOW2_DISCARD_REQUEST] && + (type == QCOW2_CLUSTER_NORMAL
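The decision the zero_in_l2_slice() hunk above adds can be modeled in isolation like this (a standalone sketch; the enum and helper are local stand-ins for qcow2’s types):

#include <stdbool.h>
#include <stdio.h>

enum ctype { NORMAL, COMPRESSED, UNALLOCATED };

/* Whether a zeroed cluster keeps its allocation: only with
 * discard-no-unref enabled, and never for compressed clusters. */
static bool keep_reference(bool discard_no_unref, enum ctype type)
{
    return discard_no_unref && type != COMPRESSED;
}

int main(void)
{
    printf("%d\n", keep_reference(true, NORMAL));     /* 1: stays allocated */
    printf("%d\n", keep_reference(true, COMPRESSED)); /* 0: freed */
    printf("%d\n", keep_reference(false, NORMAL));    /* 0: freed */
    return 0;
}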
[PULL 3/3] file-posix: fix over-writing of returning zone_append offset
From: Naohiro Aota raw_co_zone_append() sets "s->offset" where "BDRVRawState *s". This pointer is used later at raw_co_prw() to save the block address where the data is written. When multiple IOs are on-going at the same time, a later IO's raw_co_zone_append() call over-writes a former IO's offset address before raw_co_prw() completes. As a result, the former zone append IO returns the initial value (= the start address of the writing zone), instead of the proper address. Fix the issue by passing the offset pointer to raw_co_prw() instead of passing it through s->offset. Also, remove "offset" from BDRVRawState as there is no usage anymore. Fixes: 4751d09adcc3 ("block: introduce zone append write for zoned devices") Signed-off-by: Naohiro Aota Message-Id: <20231030073853.2601162-1-naohiro.a...@wdc.com> Reviewed-by: Sam Li Reviewed-by: Stefan Hajnoczi Signed-off-by: Hanna Czenczek --- block/file-posix.c | 16 +++- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index 3c0ce9c258..b862406c71 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -160,7 +160,6 @@ typedef struct BDRVRawState { bool has_write_zeroes:1; bool use_linux_aio:1; bool use_linux_io_uring:1; -int64_t *offset; /* offset of zone append operation */ int page_cache_inconsistent; /* errno from fdatasync failure */ bool has_fallocate; bool needs_alignment; @@ -2445,12 +2444,13 @@ static bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov) return true; } -static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset, +static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr, uint64_t bytes, QEMUIOVector *qiov, int type) { BDRVRawState *s = bs->opaque; RawPosixAIOData acb; int ret; +uint64_t offset = *offset_ptr; if (fd_open(bs) < 0) return -EIO; @@ -2513,8 +2513,8 @@ out: uint64_t *wp = &wps->wp[offset / bs->bl.zone_size]; if (!BDRV_ZT_IS_CONV(*wp)) { if (type & QEMU_AIO_ZONE_APPEND) { -*s->offset = *wp; -trace_zbd_zone_append_complete(bs, *s->offset +*offset_ptr = *wp; +trace_zbd_zone_append_complete(bs, *offset_ptr >> BDRV_SECTOR_BITS); } /* Advance the wp if needed */ @@ -2539,14 +2539,14 @@ static int coroutine_fn raw_co_preadv(BlockDriverState *bs, int64_t offset, int64_t bytes, QEMUIOVector *qiov, BdrvRequestFlags flags) { -return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_READ); +return raw_co_prw(bs, &offset, bytes, qiov, QEMU_AIO_READ); } static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, int64_t offset, int64_t bytes, QEMUIOVector *qiov, BdrvRequestFlags flags) { -return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_WRITE); +return raw_co_prw(bs, &offset, bytes, qiov, QEMU_AIO_WRITE); } static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs) @@ -3509,8 +3509,6 @@ static int coroutine_fn raw_co_zone_append(BlockDriverState *bs, int64_t zone_size_mask = bs->bl.zone_size - 1; int64_t iov_len = 0; int64_t len = 0; -BDRVRawState *s = bs->opaque; -s->offset = offset; if (*offset & zone_size_mask) { error_report("sector offset %" PRId64 " is not aligned to zone size " @@ -3531,7 +3529,7 @@ static int coroutine_fn raw_co_zone_append(BlockDriverState *bs, } trace_zbd_zone_append(bs, *offset >> BDRV_SECTOR_BITS); -return raw_co_prw(bs, *offset, len, qiov, QEMU_AIO_ZONE_APPEND); +return raw_co_prw(bs, offset, len, qiov, QEMU_AIO_ZONE_APPEND); } #endif -- 2.41.0
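The essence of this fix is a classic pattern: per-request results must travel through per-request storage. A standalone sketch with illustrative names (not the actual file-posix code):

#include <inttypes.h>
#include <stdio.h>

struct shared_state {
    int64_t offset; /* shared by every in-flight request: racy */
};

static void complete_shared(struct shared_state *s, int64_t wp)
{
    s->offset = wp; /* a later request can overwrite an earlier result */
}

static void complete_per_request(int64_t *offset_ptr, int64_t wp)
{
    *offset_ptr = wp; /* each request owns its own result slot */
}

int main(void)
{
    struct shared_state s;
    int64_t a, b;

    complete_shared(&s, 100); /* request A's zone append completes */
    complete_shared(&s, 200); /* request B overwrites A's result */
    printf("A reads back %" PRId64 " (wrong)\n", s.offset);

    complete_per_request(&a, 100);
    complete_per_request(&b, 200);
    printf("A reads back %" PRId64 ", B reads back %" PRId64 "\n", a, b);
    return 0;
}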
Re: [PATCH] file-posix: fix over-writing of returning zone_append offset
On 30.10.23 08:38, Naohiro Aota wrote: raw_co_zone_append() sets "s->offset" where "BDRVRawState *s". This pointer is used later at raw_co_prw() to save the block address where the data is written. When multiple IOs are on-going at the same time, a later IO's raw_co_zone_append() call over-writes a former IO's offset address before raw_co_prw() completes. As a result, the former zone append IO returns the initial value (= the start address of the writing zone), instead of the proper address. Fix the issue by passing the offset pointer to raw_co_prw() instead of passing it through s->offset. Also, remove "offset" from BDRVRawState as there is no usage anymore. Fixes: 4751d09adcc3 ("block: introduce zone append write for zoned devices") Signed-off-by: Naohiro Aota --- block/file-posix.c | 16 +++- 1 file changed, 7 insertions(+), 9 deletions(-) Thanks, applied to my block branch: https://gitlab.com/hreitz/qemu/-/commits/block Hanna
Re: [PATCH v2 09/10] block: Convert qmp_query_block() to coroutine_fn
On 09.06.23 22:19, Fabiano Rosas wrote: This is another caller of bdrv_get_allocated_file_size() that needs to be converted to a coroutine because that function will be made asynchronous when called (indirectly) from the QMP dispatcher. This QMP command is a candidate because it calls bdrv_do_query_node_info(), which in turn calls bdrv_get_allocated_file_size(). We've determined bdrv_do_query_node_info() to be coroutine-safe (see previous commits), so we can just put this QMP command in a coroutine. Since qmp_query_block() now expects to run in a coroutine, its callers need to be converted as well. Convert hmp_info_block(), which calls only coroutine-safe code, including qmp_query_named_block_nodes() which has been converted to coroutine in the previous patches. Now that all callers of bdrv_[co_]block_device_info() are using the coroutine version, a few things happen: - we can return to using bdrv_block_device_info() without a wrapper; - bdrv_get_allocated_file_size() can stop being mixed; - bdrv_co_get_allocated_file_size() needs to be put under the graph lock because it is being called without the wrapper; - bdrv_do_query_node_info() doesn't need to acquire the AioContext because it doesn't call aio_poll anymore; Signed-off-by: Fabiano Rosas --- block.c | 2 +- block/monitor/block-hmp-cmds.c | 2 +- block/qapi.c | 18 +- hmp-commands-info.hx | 1 + include/block/block-hmp-cmds.h | 2 +- include/block/block-io.h | 2 +- include/block/qapi.h | 12 qapi/block-core.json | 2 +- 8 files changed, 19 insertions(+), 22 deletions(-) After this series has been sent, we got some usages of GRAPH_RDLOCK_GUARD_MAINLOOP() that no longer fit with this patch – I’ve also mentioned one case on patch 7, not yet realizing that this was a new thing. Those must now be fixed (e.g. in qmp_query_block(), or in bdrv_snapshot_list()), or they’ll crash. Hanna
Re: [PATCH v2 10/10] block: Add a thread-pool version of fstat
On 09.06.23 22:19, Fabiano Rosas wrote: From: João Silva The fstat call can take a long time to finish when running over NFS. Add a version of it that runs in the thread pool. Adapt one of its users, raw_co_get_allocated_file_size(), to use the new version. That function is called via QMP under the qemu_global_mutex, so it has a large chance of blocking VCPU threads in case it takes too long to finish. Reviewed-by: Claudio Fontana Signed-off-by: João Silva Signed-off-by: Fabiano Rosas --- block/file-posix.c | 40 +--- include/block/raw-aio.h | 4 +++- 2 files changed, 40 insertions(+), 4 deletions(-) Reviewed-by: Hanna Czenczek
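The idea of the patch, modeled standalone with plain pthreads rather than QEMU’s thread pool (names and plumbing here are stand-ins, not the raw-aio API): run the potentially slow fstat() on a worker thread so the caller is not blocked on NFS:

#include <pthread.h>
#include <stdio.h>
#include <sys/stat.h>

struct fstat_job {
    int fd;
    struct stat st;
    int ret;
};

static void *fstat_worker(void *opaque)
{
    struct fstat_job *job = opaque;
    job->ret = fstat(job->fd, &job->st); /* may block for a long time */
    return NULL;
}

int main(void)
{
    struct fstat_job job = { .fd = 0 };
    pthread_t thread;

    pthread_create(&thread, NULL, fstat_worker, &job);
    /* ... the calling thread could keep doing useful work here ... */
    pthread_join(&thread, NULL);

    printf("fstat ret=%d size=%lld\n", job.ret, (long long)job.st.st_size);
    return 0;
}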
Re: [PATCH v2 05/10] block: Convert bdrv_query_block_graph_info to coroutine
On 09.06.23 22:19, Fabiano Rosas wrote: We're converting callers of bdrv_get_allocated_file_size() to run in coroutines because that function will be made asynchronous when called (indirectly) from the QMP dispatcher. This function is a candidate because it calls bdrv_do_query_node_info(), which in turn calls bdrv_get_allocated_file_size(). All the functions called from bdrv_do_query_node_info() onwards are coroutine-safe, either have a coroutine version themselves[1] or are mostly simple code/string manipulation[2]. 1) bdrv_getlength(), bdrv_get_allocated_file_size(), bdrv_get_info(), bdrv_get_specific_info(); 2) bdrv_refresh_filename(), bdrv_get_format_name(), bdrv_get_full_backing_filename(), bdrv_query_snapshot_info_list(); Signed-off-by: Fabiano Rosas --- block/qapi.c | 12 +++- include/block/qapi.h | 6 +- qemu-img.c | 2 -- 3 files changed, 12 insertions(+), 8 deletions(-) Reviewed-by: Hanna Czenczek
Re: [PATCH v2 04/10] block: Temporarily mark bdrv_co_get_allocated_file_size as mixed
On 09.06.23 22:19, Fabiano Rosas wrote: Some callers of this function are about to be converted to run in coroutines, so allow it to be executed both inside and outside a coroutine while we convert all the callers. This will be reverted once all callers of bdrv_do_query_node_info run in a coroutine. Signed-off-by: Fabiano Rosas Reviewed-by: Eric Blake --- include/block/block-io.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: Hanna Czenczek
Re: [PATCH v2 03/10] block: Allow the wrapper script to see functions declared in qapi.h
On 09.06.23 22:19, Fabiano Rosas wrote: The following patches will add co_wrapper annotations to functions declared in qapi.h. Add that header to the set of files used by block-coroutine-wrapper.py. Signed-off-by: Fabiano Rosas --- block/meson.build | 1 + scripts/block-coroutine-wrapper.py | 1 + 2 files changed, 2 insertions(+) Reviewed-by: Hanna Czenczek
Re: [PATCH v2 09/10] block: Convert qmp_query_block() to coroutine_fn
On 09.06.23 22:19, Fabiano Rosas wrote: This is another caller of bdrv_get_allocated_file_size() that needs to be converted to a coroutine because that function will be made asynchronous when called (indirectly) from the QMP dispatcher. This QMP command is a candidate because it calls bdrv_do_query_node_info(), which in turn calls bdrv_get_allocated_file_size(). We've determined bdrv_do_query_node_info() to be coroutine-safe (see previous commits), so we can just put this QMP command in a coroutine. Since qmp_query_block() now expects to run in a coroutine, its callers need to be converted as well. Convert hmp_info_block(), which calls only coroutine-safe code, including qmp_query_named_block_nodes() which has been converted to coroutine in the previous patches. Now that all callers of bdrv_[co_]block_device_info() are using the coroutine version, a few things happen: - we can return to using bdrv_block_device_info() without a wrapper; - bdrv_get_allocated_file_size() can stop being mixed; - bdrv_co_get_allocated_file_size() needs to be put under the graph lock because it is being called without the wrapper; But bdrv_do_query_node_info() is marked GRAPH_RDLOCK, so the whole function must not be called without holding the lock. I don’t understand why we need to explicitly take it another time. - bdrv_do_query_node_info() doesn't need to acquire the AioContext because it doesn't call aio_poll anymore; In the past (very likely outdated, but still mentioning it) you’d need to take the AioContext just in general when operating on a block device that might be processed in a different AioContext than the main one, and the current code runs in the main context, which is the situation we have here. Speaking of contexts, I wonder how the threading is actually supposed to work. I assume QMP coroutines run in the main thread, so now we run bdrv_co_get_allocated_file_size() in the main thread – is that correct, or do we need to use bdrv_co_enter() like qmp_block_resize() does? And so, if we run it in the main thread, is it OK not to acquire the AioContext around it to prevent interference from a potential I/O thread? Signed-off-by: Fabiano Rosas --- block.c | 2 +- block/monitor/block-hmp-cmds.c | 2 +- block/qapi.c | 18 +- hmp-commands-info.hx | 1 + include/block/block-hmp-cmds.h | 2 +- include/block/block-io.h | 2 +- include/block/qapi.h | 12 qapi/block-core.json | 2 +- 8 files changed, 19 insertions(+), 22 deletions(-) This patch implicitly assumes that quite a lot of functions (at least bdrv_query_info(), bdrv_query_image_info(), bdrv_do_query_node_info()) are now run in coroutine context. This assumption must be formalized by annotating them all with coroutine_fn, and ideally adding a _co_ into their name. Also, those functions should be checked as to whether they call coroutine wrappers, and made to use the native coroutine version now if so. (At least I’d find that nicer, FWIW.) I’ve seen at least bdrv_getlength() in bdrv_do_query_node_info(), which could be a bdrv_co_getlength().
diff --git a/block.c b/block.c index abed744b60..f94cee8930 100644 --- a/block.c +++ b/block.c @@ -6148,7 +6148,7 @@ BlockDeviceInfoList *bdrv_named_nodes_list(bool flat, list = NULL; QTAILQ_FOREACH(bs, &graph_bdrv_states, node_list) { -BlockDeviceInfo *info = bdrv_co_block_device_info(NULL, bs, flat, errp); +BlockDeviceInfo *info = bdrv_block_device_info(NULL, bs, flat, errp); if (!info) { qapi_free_BlockDeviceInfoList(list); return NULL; diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c index 26116fe831..1049f9b006 100644 --- a/block/monitor/block-hmp-cmds.c +++ b/block/monitor/block-hmp-cmds.c @@ -742,7 +742,7 @@ static void print_block_info(Monitor *mon, BlockInfo *info, } } -void hmp_info_block(Monitor *mon, const QDict *qdict) +void coroutine_fn hmp_info_block(Monitor *mon, const QDict *qdict) { BlockInfoList *block_list, *info; BlockDeviceInfoList *blockdev_list, *blockdev; diff --git a/block/qapi.c b/block/qapi.c index 20660e15d6..3b4bc0b782 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -41,10 +41,10 @@ #include "qemu/qemu-print.h" #include "sysemu/block-backend.h" -BlockDeviceInfo *coroutine_fn bdrv_co_block_device_info(BlockBackend *blk, -BlockDriverState *bs, -bool flat, -Error **errp) +BlockDeviceInfo *coroutine_fn bdrv_block_device_info(BlockBackend *blk, + BlockDriverState *bs, + bool flat, +
Re: [PATCH v2 08/10] block: Don't query all block devices at hmp_nbd_server_start
On 09.06.23 22:19, Fabiano Rosas wrote: We're currently doing a full query-block just to enumerate the devices for qmp_nbd_server_add and then discarding the BlockInfoList afterwards. Alter hmp_nbd_server_start to instead iterate explicitly over the block_backends list. This allows the removal of the dependency on qmp_query_block from hmp_nbd_server_start. This is desirable because we're about to move qmp_query_block into a coroutine and don't need to change the NBD code at the same time. Signed-off-by: Fabiano Rosas --- block/monitor/block-hmp-cmds.c | 20 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c index ca2599de44..26116fe831 100644 --- a/block/monitor/block-hmp-cmds.c +++ b/block/monitor/block-hmp-cmds.c @@ -394,7 +394,7 @@ void hmp_nbd_server_start(Monitor *mon, const QDict *qdict) bool writable = qdict_get_try_bool(qdict, "writable", false); bool all = qdict_get_try_bool(qdict, "all", false); Error *local_err = NULL; -BlockInfoList *block_list, *info; +BlockBackend *blk; SocketAddress *addr; NbdServerAddOptions export; @@ -419,18 +419,24 @@ void hmp_nbd_server_start(Monitor *mon, const QDict *qdict) return; } -/* Then try adding all block devices. If one fails, close all and +/* + * Then try adding all block devices. If one fails, close all and * exit. */ -block_list = qmp_query_block(NULL); +for (blk = blk_all_next(NULL); blk; blk = blk_all_next(blk)) { +BlockDriverState *bs = blk_bs(blk); -for (info = block_list; info; info = info->next) { -if (!info->value->inserted) { +if (!*blk_name(blk) && !blk_get_attached_dev(blk)) { I’d like a comment here that historically, we’ve used qmp_query_block() here, and this is the same condition that it uses. (Otherwise, it’s hard to see why it matters whether a device is attached or not.) +continue; +} + +bs = bdrv_skip_implicit_filters(bs); +if (!bs || !bs->drv) { Same here. Just checking blk_is_inserted() would make more sense in this place, but if you want to absolutely keep behavior unchanged, then there should be a comment here why this check is done (because bdrv_query_info() does it to determine whether info->inserted should be set, which was historically used to determine whether this BlockBackend can be exported). continue; } export = (NbdServerAddOptions) { -.device = info->value->device, +.device = g_strdup(blk_name(blk)), Do we need to g_strdup() here? We didn’t before, so I think this will leak export.device. I know bdrv_query_info() uses g_strdup(), but that was released by the qapi_free_BlockInfoList() below, which is now removed without replacement. (On that note, it also looks like qmp_nbd_server_add() can leak arg->name (i.e. device.name) if it is not set by the caller. It also uses g_strdup() there, but never frees it. It does free the export_opts it creates, and `arg` is put into it, but as a deep copy, so anything in `arg` is leaked.) Hanna .has_writable = true, .writable = writable, }; @@ -443,8 +449,6 @@ void hmp_nbd_server_start(Monitor *mon, const QDict *qdict) } } -qapi_free_BlockInfoList(block_list); - exit: hmp_handle_error(mon, local_err); }
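On the g_strdup() ownership question, a standalone sketch of the leak pattern under discussion (illustrative struct, not the real NbdServerAddOptions):

#include <glib.h>
#include <stdio.h>

typedef struct Export {
    char *device;
} Export;

int main(void)
{
    const char *name = "drive0";

    /* Filling the field with a copy means something must free it later;
     * borrowing the original string (no g_strdup()) avoids the question. */
    Export e = { .device = g_strdup(name) };
    printf("%s\n", e.device);
    g_free(e.device); /* the counterpart the review finds missing */
    return 0;
}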
Re: [PATCH v2 06/10] block: Convert bdrv_block_device_info into co_wrapper
On 09.06.23 22:19, Fabiano Rosas wrote: We're converting callers of bdrv_get_allocated_file_size() to run in coroutines because that function will be made asynchronous when called (indirectly) from the QMP dispatcher. This function is a candidate because it calls bdrv_query_image_info() -> bdrv_do_query_node_info() -> bdrv_get_allocated_file_size(). It is safe to turn this is a coroutine because the code it calls is made up of either simple accessors and string manipulation functions [1] or it has already been determined to be safe [2]. 1) bdrv_refresh_filename(), bdrv_is_read_only(), blk_enable_write_cache(), bdrv_cow_bs(), blk_get_public(), throttle_group_get_name(), bdrv_write_threshold_get(), bdrv_query_dirty_bitmaps(), throttle_group_get_config(), bdrv_filter_or_cow_bs(), bdrv_skip_implicit_filters() 2) bdrv_do_query_node_info() (see previous commit); Signed-off-by: Fabiano Rosas --- block/qapi.c | 8 include/block/qapi.h | 12 2 files changed, 12 insertions(+), 8 deletions(-) diff --git a/block/qapi.c b/block/qapi.c index a2e71edaff..20660e15d6 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -41,10 +41,10 @@ #include "qemu/qemu-print.h" #include "sysemu/block-backend.h" -BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk, -BlockDriverState *bs, -bool flat, -Error **errp) +BlockDeviceInfo *coroutine_fn bdrv_co_block_device_info(BlockBackend *blk, +BlockDriverState *bs, +bool flat, +Error **errp) { ImageInfo **p_image_info; ImageInfo *backing_info; diff --git a/include/block/qapi.h b/include/block/qapi.h index 7035bcd1ae..5cb0202791 100644 --- a/include/block/qapi.h +++ b/include/block/qapi.h @@ -30,10 +30,14 @@ #include "block/snapshot.h" #include "qapi/qapi-types-block-core.h" -BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk, -BlockDriverState *bs, -bool flat, -Error **errp); +BlockDeviceInfo *coroutine_fn bdrv_co_block_device_info(BlockBackend *blk, +BlockDriverState *bs, +bool flat, +Error **errp); +BlockDeviceInfo *co_wrapper bdrv_block_device_info(BlockBackend *blk, + BlockDriverState *bs, + bool flat, + Error **errp); bdrv_co_block_device_info() is now marked as GRAPH_RDLOCK, so should this use a co_wrapper_bdrv_rdlock instead? Hanna int bdrv_query_snapshot_info_list(BlockDriverState *bs, SnapshotInfoList **p_list, Error **errp);
Re: [PATCH v2 07/10] block: Convert qmp_query_named_block_nodes to coroutine
On 09.06.23 22:19, Fabiano Rosas wrote: From: Lin Ma We're converting callers of bdrv_get_allocated_file_size() to run in coroutines because that function will be made asynchronous when called (indirectly) from the QMP dispatcher. This QMP command is a candidate because it indirectly calls bdrv_get_allocated_file_size() through bdrv_block_device_info() -> bdrv_query_image_info() -> bdrv_do_query_node_info(). The previous patches have determined that bdrv_query_image_info() and bdrv_do_query_node_info() are coroutine-safe so we can just make the QMP command run in a coroutine. Signed-off-by: Lin Ma Signed-off-by: Fabiano Rosas --- block.c | 2 +- blockdev.c | 6 +++--- qapi/block-core.json | 3 ++- 3 files changed, 6 insertions(+), 5 deletions(-) (I see patch 9 does something with HMP code, but) hmp_info_block() calls qmp_query_named_block_nodes(), and I don’t think it may call such a coroutine_fn directly. diff --git a/block.c b/block.c index f94cee8930..abed744b60 100644 --- a/block.c +++ b/block.c @@ -6148,7 +6148,7 @@ BlockDeviceInfoList *bdrv_named_nodes_list(bool flat, list = NULL; QTAILQ_FOREACH(bs, &graph_bdrv_states, node_list) { -BlockDeviceInfo *info = bdrv_block_device_info(NULL, bs, flat, errp); +BlockDeviceInfo *info = bdrv_co_block_device_info(NULL, bs, flat, errp); As far as I understand, only functions marked as coroutine_fn may call other coroutine_fn. Regardless of whether bdrv_named_nodes_list() is only called by another coroutine_fn, we still have to mark it as coroutine_fn, too (and probably rename it to bdrv_co_named_nodes_list()). Also, this function (bdrv_named_nodes_list()) uses GRAPH_RDLOCK_GUARD_MAINLOOP(). Is that the correct thing to use in a coroutine context? Hanna if (!info) { qapi_free_BlockDeviceInfoList(list); return NULL; diff --git a/blockdev.c b/blockdev.c index e6eba61484..8b5f7d06c8 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2818,9 +2818,9 @@ void qmp_drive_backup(DriveBackup *backup, Error **errp) blockdev_do_action(&action, errp); } -BlockDeviceInfoList *qmp_query_named_block_nodes(bool has_flat, - bool flat, - Error **errp) +BlockDeviceInfoList *coroutine_fn qmp_query_named_block_nodes(bool has_flat, + bool flat, + Error **errp) { bool return_flat = has_flat && flat; diff --git a/qapi/block-core.json b/qapi/block-core.json index 5dd5f7e4b0..9d4c92f2c9 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -1972,7 +1972,8 @@ { 'command': 'query-named-block-nodes', 'returns': [ 'BlockDeviceInfo' ], 'data': { '*flat': 'bool' }, - 'allow-preconfig': true } + 'allow-preconfig': true, + 'coroutine': true} ## # @XDbgBlockGraphNodeType:
Re: [PATCH 7/7] iotests/271: check disk usage on subcluster-based discard/unmap
On 03.11.23 16:51, Hanna Czenczek wrote: On 20.10.23 23:56, Andrey Drobyshev wrote: [...] @@ -528,6 +543,14 @@ for use_backing_file in yes no; do else _make_test_img -o extended_l2=on 1M fi + # Write cluster #0 and discard its subclusters #0-#3 + $QEMU_IO -c 'write -q 0 64k' "$TEST_IMG" + before=$(disk_usage "$TEST_IMG") + $QEMU_IO -c 'discard -q 0 8k' "$TEST_IMG" + after=$(disk_usage "$TEST_IMG") + _verify_du_delta $before $after 8192 + alloc="$(seq 4 31)"; zero="$(seq 0 3)" + _verify_l2_bitmap 0 # Write clusters #0-#2 and then discard them $QEMU_IO -c 'write -q 0 128k' "$TEST_IMG" $QEMU_IO -c 'discard -q 0 128k' "$TEST_IMG" Similarly to above, I think it would be good if we combined this following case with the one you added, i.e. to write 128k from the beginning, drop the write here, and change the discard to be “discard -q 8k 120k”, i.e. skip the subclusters we have already discarded, to see that this is still combined to discard the whole first cluster. ...Ah, see, and when I try this, the following assertion fails: qemu-io: ../block/qcow2-cache.c:156: qcow2_cache_destroy: Assertion `c->entries[i].ref == 0' failed. ./common.rc: line 220: 128894 Aborted (core dumped) ( VALGRIND_QEMU="${VALGRIND_QEMU_IO}" _qemu_proc_exec "${VALGRIND_LOGFILE}" "$QEMU_IO_PROG" $QEMU_IO_ARGS "$@" ) Looks like an L2 table is leaked somewhere. That’s why SCRI should be a g_auto()-able type. Forgot to add: This single test case here is the only place where we test the added functionality. I think there should be more cases. It doesn’t really make sense now that 271 has so many cases for writing zeroes, but so few for discarding, now that discarding works on subclusters. Most of them should at least be considered whether we can run them for discard as well. I didn’t want to push for such an extensive set of tests, but, well, now it turned out I overlooked a bug in patch 4, and only found it because I thought “this place might also make a nice test case for this series”. Hanna
Re: [PATCH 4/7] qcow2: make subclusters discardable
On 20.10.23 23:56, Andrey Drobyshev wrote: This commit makes the discard operation work on the subcluster level rather than the cluster level. It introduces the discard_l2_subclusters() function and makes use of it in the qcow2 discard implementation, much like it's done with zero_in_l2_slice() / zero_l2_subclusters(). It also changes the qcow2 driver pdiscard_alignment to subcluster_size. That way subcluster-aligned discards lead to an actual fallocate(PUNCH_HOLE) operation and free host disk space. This feature will let us gain additional disk space on guest TRIM/discard requests, especially when using large enough clusters (1M, 2M) with subclusters enabled. Signed-off-by: Andrey Drobyshev --- block/qcow2-cluster.c | 100 -- block/qcow2.c | 8 ++-- 2 files changed, 101 insertions(+), 7 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 7c6fa5524c..cf40f2dc12 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -2042,6 +2042,74 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t offset, uint64_t nb_clusters, return nb_clusters; } +static int coroutine_fn GRAPH_RDLOCK +discard_l2_subclusters(BlockDriverState *bs, uint64_t offset, + uint64_t nb_subclusters, + enum qcow2_discard_type type, + bool full_discard, + SubClusterRangeInfo *pscri) +{ +BDRVQcow2State *s = bs->opaque; +uint64_t new_l2_bitmap, l2_bitmap_mask; +int ret, sc = offset_to_sc_index(s, offset); +SubClusterRangeInfo scri = { 0 }; + +if (!pscri) { +ret = get_sc_range_info(bs, offset, nb_subclusters, &scri); +if (ret < 0) { +goto out; +} +} else { +scri = *pscri; +} + +l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters); +new_l2_bitmap = scri.l2_bitmap; +new_l2_bitmap &= ~l2_bitmap_mask; + +/* + * If there're no allocated subclusters left, we might as well discard + * the entire cluster. That way we'd also update the refcount table. + */ +if (!(new_l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC)) { +return discard_in_l2_slice(bs, + QEMU_ALIGN_DOWN(offset, s->cluster_size), + 1, type, full_discard); +} scri.l2_slice is leaked here. Hanna
Re: [PATCH 7/7] iotests/271: check disk usage on subcluster-based discard/unmap
On 20.10.23 23:56, Andrey Drobyshev wrote: Add _verify_du_delta() checker which is used to check that real disk usage delta meets the expectations. For now we use it for checking that subcluster-based discard/unmap operations lead to actual disk usage decrease (i.e. PUNCH_HOLE operation is performed). I’m not too happy about checking the disk usage because that relies on the underlying filesystem actually accepting and executing the unmap. Why is it not enough to check the L2 bitmap? …Coming back later (I had to fix the missing `ret = ` I mentioned in patch 2, or this test would hang, so I couldn’t run it at first), I note that checking the disk usage in fact doesn’t work on tmpfs. I usually run the iotests in tmpfs, so that’s not great. Also add separate test case for discarding particular subcluster within one cluster. Signed-off-by: Andrey Drobyshev --- tests/qemu-iotests/271 | 25 - tests/qemu-iotests/271.out | 2 ++ 2 files changed, 26 insertions(+), 1 deletion(-) diff --git a/tests/qemu-iotests/271 b/tests/qemu-iotests/271 index c7c2cadda0..5fcb209f5f 100755 --- a/tests/qemu-iotests/271 +++ b/tests/qemu-iotests/271 @@ -81,6 +81,15 @@ _verify_l2_bitmap() fi } +# Check disk usage delta after a discard/unmap operation +# _verify_du_delta $before $after $expected_delta +_verify_du_delta() +{ +if [ $(($1 - $2)) -ne $3 ]; then +printf "ERROR: unexpected delta: $1 - $2 = $(($1 - $2)) != $3\n" +fi +} + # This should be called as _run_test c=XXX sc=XXX off=XXX len=XXX cmd=XXX # c: cluster number (0 if unset) # sc: subcluster number inside cluster @c (0 if unset) @@ -198,9 +207,12 @@ for use_backing_file in yes no; do alloc="$(seq 0 31)"; zero="" _run_test sc=0 len=64k -### Zero and unmap half of cluster #0 (this won't unmap it) +### Zero and unmap half of cluster #0 (this will unmap it) I think “it” refers to the cluster, and it is not unmapped. This test case does not use a discard, but write -z instead, so it worked before. (The L2 bitmap shown in the output doesn’t change, so functionally, this patch series didn’t change this case.) alloc="$(seq 16 31)"; zero="$(seq 0 15)" +before=$(disk_usage "$TEST_IMG") _run_test sc=0 len=32k cmd=unmap +after=$(disk_usage "$TEST_IMG") +_verify_du_delta $before $after 32768 ### Zero and unmap cluster #0 alloc=""; zero="$(seq 0 31)" For this following case shown truncated here, why don’t we try “_run_test sc=16 len=32k cmd=unmap” instead of “sc=0 len=64k”? I.e. unmap only the second half, which, thanks to patch 3, should still unmap the whole cluster, because the first half is already unmapped. 
@@ -447,7 +459,10 @@ for use_backing_file in yes no; do # Subcluster-aligned request from clusters #12 to #14 alloc="$(seq 0 15)"; zero="$(seq 16 31)" +before=$(disk_usage "$TEST_IMG") _run_test c=12 sc=16 len=128k cmd=unmap +after=$(disk_usage "$TEST_IMG") +_verify_du_delta $before $after $((128 * 1024)) alloc=""; zero="$(seq 0 31)" _verify_l2_bitmap 13 alloc="$(seq 16 31)"; zero="$(seq 0 15)" @@ -528,6 +543,14 @@ for use_backing_file in yes no; do else _make_test_img -o extended_l2=on 1M fi +# Write cluster #0 and discard its subclusters #0-#3 +$QEMU_IO -c 'write -q 0 64k' "$TEST_IMG" +before=$(disk_usage "$TEST_IMG") +$QEMU_IO -c 'discard -q 0 8k' "$TEST_IMG" +after=$(disk_usage "$TEST_IMG") +_verify_du_delta $before $after 8192 +alloc="$(seq 4 31)"; zero="$(seq 0 3)" +_verify_l2_bitmap 0 # Write clusters #0-#2 and then discard them $QEMU_IO -c 'write -q 0 128k' "$TEST_IMG" $QEMU_IO -c 'discard -q 0 128k' "$TEST_IMG" Similarly to above, I think it would be good if we combined this following case with the one you added, i.e. to write 128k from the beginning, drop the write here, and change the discard to be “discard -q 8k 120k”, i.e. skip the subclusters we have already discarded, to see that this is still combined to discard the whole first cluster. ...Ah, see, and when I try this, the following assertion fails: qemu-io: ../block/qcow2-cache.c:156: qcow2_cache_destroy: Assertion `c->entries[i].ref == 0' failed. ./common.rc: line 220: 128894 Aborted (core dumped) ( VALGRIND_QEMU="${VALGRIND_QEMU_IO}" _qemu_proc_exec "${VALGRIND_LOGFILE}" "$QEMU_IO_PROG" $QEMU_IO_ARGS "$@" ) Looks like an L2 table is leaked somewhere. That’s why SCRI should be a g_auto()-able type. Hanna diff --git a/tests/qemu-iotests/271.out b/tests/qemu-iotests/271.out index 5be780de76..0da8d72cde 100644 --- a/tests/qemu-iotests/271.out +++ b/tests/qemu-iotests/271.out @@ -426,6 +426,7 @@ L2 entry #29: 0x ### Discarding clusters with non-zero bitmaps (backing file: yes) ### Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
Re: [PATCH 6/7] iotests/common.rc: add disk_usage function
On 20.10.23 23:56, Andrey Drobyshev wrote: Move the definition from iotests/250 to common.rc. This is used to detect real disk usage of sparse files. In particular, we want to use it for checking subclusters-based discards. Signed-off-by: Andrey Drobyshev --- tests/qemu-iotests/250 | 5 - tests/qemu-iotests/common.rc | 6 ++ 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/tests/qemu-iotests/250 b/tests/qemu-iotests/250 index af48f83aba..c0a0dbc0ff 100755 --- a/tests/qemu-iotests/250 +++ b/tests/qemu-iotests/250 @@ -52,11 +52,6 @@ _unsupported_imgopts data_file # bdrv_co_truncate(bs->file) call in qcow2_co_truncate(), which might succeed # anyway. -disk_usage() -{ -du --block-size=1 $1 | awk '{print $1}' -} - size=2100M _make_test_img -o "cluster_size=1M,preallocation=metadata" $size diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc index 95c12577dd..5d2ea26c7f 100644 --- a/tests/qemu-iotests/common.rc +++ b/tests/qemu-iotests/common.rc @@ -140,6 +140,12 @@ _optstr_add() fi } +# report real disk usage for sparse files +disk_usage() +{ +du --block-size=1 $1 | awk '{print $1}' Pre-existing, but since you’re touching this now: Can you please change the $1 to "$1"? Hanna +} + # Set the variables to the empty string to turn Valgrind off # for specific processes, e.g. # $ VALGRIND_QEMU_IO= ./check -qcow2 -valgrind 015
Re: [PATCH 5/7] qcow2: zero_l2_subclusters: fall through to discard operation when requested
On 20.10.23 23:56, Andrey Drobyshev wrote: When zeroizing subclusters within a single cluster, detect usage of the BDRV_REQ_MAY_UNMAP flag and fall through to the subcluster-based discard operation, much like it's done with the cluster-based discards. That way subcluster-aligned operations "qemu-io -c 'write -z -u ...'" will lead to an actual unmap. Ever since the introduction of discard-no-unref, I wonder whether that is actually an advantage or not. I can’t think of a reason why it’d be advantageous to drop the allocation. On the other hand, zero_in_l2_slice() does it indeed, so consistency is a reasonable argument. Signed-off-by: Andrey Drobyshev --- block/qcow2-cluster.c | 17 ++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index cf40f2dc12..040251f2c3 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -2242,7 +2242,7 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset, unsigned nb_subclusters, int flags) { BDRVQcow2State *s = bs->opaque; -uint64_t new_l2_bitmap; +uint64_t new_l2_bitmap, l2_bitmap_mask; int ret, sc = offset_to_sc_index(s, offset); SubClusterRangeInfo scri = { 0 }; @@ -2251,9 +2251,10 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset, goto out; } +l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters); “l2_bitmap_mask” already wasn’t a great name in patch 4 because it doesn’t say what mask it is, but in patch 4, there was at least only one relevant mask. Here, we have two (ALLOC_RANGE and ZERO_RANGE), so the name should reflect what kind of mask it is. new_l2_bitmap = scri.l2_bitmap; -new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters); -new_l2_bitmap &= ~QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters); +new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters); +new_l2_bitmap &= ~l2_bitmap_mask; /* * If there're no non-zero subclusters left, we might as well zeroize @@ -2266,6 +2267,16 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset, 1, flags); } +/* + * If the request allows discarding subclusters and they're actually + * allocated, we go down the discard path since after the discard + * operation the subclusters are going to be read as zeroes anyway. + */ +if ((flags & BDRV_REQ_MAY_UNMAP) && (scri.l2_bitmap & l2_bitmap_mask)) { +return discard_l2_subclusters(bs, offset, nb_subclusters, + QCOW2_DISCARD_REQUEST, false, &scri); +} I wonder why it matters whether any subclusters are allocated, i.e. why can’t we just immediately use discard_l2_subclusters() whenever BDRV_REQ_MAY_UNMAP is set, without even fetching SCRI (that would also save us passing SCRI here, which I’ve already said on patch 4, I don’t find great). Doesn’t discard_l2_subclusters() guarantee the clusters read as zero when full_discard=false? There is this case where it won’t mark the subclusters as zero if there is no backing file, and none of the subclusters is allocated. But still, (A) those subclusters will read as zero, and (B) if there is a problem there, why don’t we just always mark those subclusters as zero? This optimization only has effect when the guest discards/zeroes subclusters (not whole clusters) that were already discarded, so sounds minuscule. Hanna + if (new_l2_bitmap != scri.l2_bitmap) { set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap); qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice);
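For reference, the subcluster bitmap arithmetic in the hunks above can be modeled standalone like this (local helpers that mirror the semantics of QCOW_OFLAG_SUB_ALLOC_RANGE()/QCOW_OFLAG_SUB_ZERO_RANGE(), not the actual macros):

#include <stdint.h>
#include <stdio.h>

/* Low 32 bits: "allocated" flags; high 32 bits: "reads as zero" flags. */
static uint64_t sub_alloc_range(int from, int to)
{
    return ((1ULL << (to - from)) - 1) << from;
}

static uint64_t sub_zero_range(int from, int to)
{
    return sub_alloc_range(from, to) << 32;
}

int main(void)
{
    uint64_t bitmap = sub_alloc_range(0, 32); /* fully allocated cluster */
    int sc = 0, n = 4;                        /* zeroize subclusters 0-3 */

    bitmap |= sub_zero_range(sc, sc + n);     /* set the zero bits */
    bitmap &= ~sub_alloc_range(sc, sc + n);   /* clear the alloc bits */
    printf("bitmap: %016llx\n", (unsigned long long)bitmap);
    return 0;
}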
Re: [PATCH] block-jobs: add final flush
On 01.11.23 20:53, Vladimir Sementsov-Ogievskiy wrote: On 31.10.23 17:05, Hanna Czenczek wrote: On 04.10.23 15:56, Vladimir Sementsov-Ogievskiy wrote: From: Vladimir Sementsov-Ogievskiy Actually, a block job is not complete without the final flush. It's rather unexpected to end up with a broken target when the job completed successfully long ago, and we then fail to flush, or the process crashed or was killed. The mirror job already has mirror_flush() for this, so it's OK. Add similar handling for the other jobs: backup, stream, commit. Note that stream has a (documented) different treatment of the IGNORE action: it doesn't retry the operation, but continues execution and reports the error at the end. We keep that behavior for the final flush too. Signed-off-by: Vladimir Sementsov-Ogievskiy --- Was: [PATCH v4] block-jobs: flush target at the end of .run() But now rewritten. Supersedes: <20230725174008.1147467-1-vsement...@yandex-team.ru> block/backup.c | 2 +- block/block-copy.c | 7 +++ block/commit.c | 16 block/stream.c | 21 + include/block/block-copy.h | 1 + 5 files changed, 38 insertions(+), 9 deletions(-) [...] diff --git a/block/commit.c b/block/commit.c index aa45beb0f0..5205c77ec9 100644 --- a/block/commit.c +++ b/block/commit.c [...] @@ -187,7 +187,15 @@ static int coroutine_fn commit_run(Job *job, Error **errp) } } - return 0; + do { + ret = blk_co_flush(s->base); + if (ret < 0) { + action = block_job_error_action(&s->common, s->on_error, + false, -ret); + } + } while (ret < 0 && action != BLOCK_ERROR_ACTION_REPORT); Do we need to yield in this loop somewhere so that BLOCK_ERROR_ACTION_STOP can pause the job? block_job_error_action() calls job_pause_locked() itself in this case. But that doesn’t really pause the job, does it? As far as I understand, it increases job->pause_count, then enters the job, and the job is then supposed to yield at some point so job_pause_point_locked() is called, which sees the increased job->pause_count and will actually pause the job. Hanna
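To make the pause-point question concrete, a standalone model (local types, not QEMU’s job layer) of why a STOP error action needs the loop to contain an explicit pause point:

#include <stdbool.h>
#include <stdio.h>

struct job {
    int pause_count; /* incremented by a pause request */
    bool paused;
};

/* Only here does a pause request take effect; in the real code this is
 * where the job coroutine would yield. */
static void job_pause_point(struct job *job)
{
    if (job->pause_count > 0) {
        job->paused = true;
    }
}

int main(void)
{
    struct job job = { 0, false };

    job.pause_count++; /* what a STOP error action effectively does */
    /* Without a pause point inside the retry loop, job.paused would
     * never become true, no matter how often the loop iterates. */
    job_pause_point(&job);
    printf("paused: %d\n", job.paused);
    return 0;
}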
Re: [PATCH 4/7] qcow2: make subclusters discardable
(Sorry, opened another reply window, forgot I already had one open...) On 20.10.23 23:56, Andrey Drobyshev wrote: This commit makes the discard operation work on the subcluster level rather than the cluster level. It introduces the discard_l2_subclusters() function and makes use of it in the qcow2 discard implementation, much like it's done with zero_in_l2_slice() / zero_l2_subclusters(). It also changes the qcow2 driver pdiscard_alignment to subcluster_size. That way subcluster-aligned discards lead to an actual fallocate(PUNCH_HOLE) operation and free host disk space. This feature will let us gain additional disk space on guest TRIM/discard requests, especially when using large enough clusters (1M, 2M) with subclusters enabled. Signed-off-by: Andrey Drobyshev --- block/qcow2-cluster.c | 100 -- block/qcow2.c | 8 ++-- 2 files changed, 101 insertions(+), 7 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 7c6fa5524c..cf40f2dc12 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -2042,6 +2042,74 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t offset, uint64_t nb_clusters, return nb_clusters; } +static int coroutine_fn GRAPH_RDLOCK +discard_l2_subclusters(BlockDriverState *bs, uint64_t offset, + uint64_t nb_subclusters, + enum qcow2_discard_type type, + bool full_discard, + SubClusterRangeInfo *pscri) +{ +BDRVQcow2State *s = bs->opaque; +uint64_t new_l2_bitmap, l2_bitmap_mask; +int ret, sc = offset_to_sc_index(s, offset); +SubClusterRangeInfo scri = { 0 }; + +if (!pscri) { +ret = get_sc_range_info(bs, offset, nb_subclusters, &scri); +if (ret < 0) { +goto out; +} +} else { +scri = *pscri; Allowing this to be taken from the caller sounds dangerous, considering we need to track who takes care of freeing scri.l2_slice. +} + +l2_bitmap_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters); +new_l2_bitmap = scri.l2_bitmap; +new_l2_bitmap &= ~l2_bitmap_mask; + +/* + * If there're no allocated subclusters left, we might as well discard + * the entire cluster. That way we'd also update the refcount table. + */ +if (!(new_l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC)) { What if there are subclusters in the cluster that are marked as zero, outside of the discarded range? It sounds wrong to apply a discard with either full_discard set or cleared to them. +return discard_in_l2_slice(bs, + QEMU_ALIGN_DOWN(offset, s->cluster_size), + 1, type, full_discard); +} + +/* + * Full discard means we fall through to the backing file, thus we only + * need to mark the subclusters as deallocated. I think it also means we need to clear the zero bits. Hanna + * + * Non-full discard means subclusters should be explicitly marked as + * zeroes. In this case QCOW2 specification requires the corresponding + * allocation status bits to be unset as well. If the subclusters are + * deallocated in the first place and there's no backing, the operation + * can be skipped. + */ +if (!full_discard && +(bs->backing || scri.l2_bitmap & l2_bitmap_mask)) { +new_l2_bitmap |= QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters); +}
Re: [PATCH 4/7] qcow2: make subclusters discardable
On 20.10.23 23:56, Andrey Drobyshev wrote: This commit makes the discard operation work on the subcluster level rather than the cluster level. It introduces the discard_l2_subclusters() function and makes use of it in the qcow2 discard implementation, much like it's done with zero_in_l2_slice() / zero_l2_subclusters(). It also changes the qcow2 driver pdiscard_alignment to subcluster_size. That way subcluster-aligned discards lead to an actual fallocate(PUNCH_HOLE) operation and free host disk space. This feature will let us gain additional disk space on guest TRIM/discard requests, especially when using large enough clusters (1M, 2M) with subclusters enabled. Signed-off-by: Andrey Drobyshev --- block/qcow2-cluster.c | 100 -- block/qcow2.c | 8 ++-- 2 files changed, 101 insertions(+), 7 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 7c6fa5524c..cf40f2dc12 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c [...] +if (scri.l2_bitmap != new_l2_bitmap) { +set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap); +qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice); +} + +if (s->discard_passthrough[type]) { +qcow2_queue_discard(bs, (scri.l2_entry & L2E_OFFSET_MASK) + +offset_into_cluster(s, offset), +nb_subclusters * s->subcluster_size); Are we sure that the cluster is allocated, i.e. that scri.l2_entry & L2E_OFFSET_MASK != 0? As a side note, I guess discard_in_l2_slice() should also use qcow2_queue_discard() instead of bdrv_pdiscard() then. +} + +ret = 0; +out: +qcow2_cache_put(s->l2_table_cache, (void **) &scri.l2_slice); + +return ret; +} + int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset, uint64_t bytes, enum qcow2_discard_type type, bool full_discard) @@ -2049,19 +2117,36 @@ int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset, BDRVQcow2State *s = bs->opaque; uint64_t end_offset = offset + bytes; uint64_t nb_clusters; +unsigned head, tail; int64_t cleared; int ret; /* Caller must pass aligned values, except at image end */ -assert(QEMU_IS_ALIGNED(offset, s->cluster_size)); -assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size) || +assert(QEMU_IS_ALIGNED(offset, s->subcluster_size)); +assert(QEMU_IS_ALIGNED(end_offset, s->subcluster_size) || end_offset == bs->total_sectors << BDRV_SECTOR_BITS); -nb_clusters = size_to_clusters(s, bytes); +head = MIN(end_offset, ROUND_UP(offset, s->cluster_size)) - offset; +offset += head; + +tail = (end_offset >= bs->total_sectors << BDRV_SECTOR_BITS) ? 0 : + end_offset - MAX(offset, start_of_cluster(s, end_offset)); +end_offset -= tail; s->cache_discards = true; +if (head) { +ret = discard_l2_subclusters(bs, offset - head, + size_to_subclusters(s, head), type, + full_discard, NULL); +if (ret < 0) { +goto fail; +} +} + /* Each L2 slice is handled by its own loop iteration */ +nb_clusters = size_to_clusters(s, end_offset - offset); + while (nb_clusters > 0) { I think the comment should stay attached to the `while`. Hanna cleared = discard_in_l2_slice(bs, offset, nb_clusters, type, full_discard);
Re: [PATCH 3/7] qcow2: zeroize the entire cluster when there're no non-zero subclusters
On 20.10.23 23:56, Andrey Drobyshev wrote: When zeroizing the last non-zero subclusters within single cluster, it makes sense to go zeroize the entire cluster and go down zero_in_l2_slice() path right away. That way we'd also update the corresponding refcount table. Signed-off-by: Andrey Drobyshev --- block/qcow2-cluster.c | 18 +++--- 1 file changed, 15 insertions(+), 3 deletions(-) Reviewed-by: Hanna Czenczek
Re: [PATCH 2/7] qcow2: add get_sc_range_info() helper for working with subcluster ranges
On 20.10.23 23:56, Andrey Drobyshev wrote: This helper simply obtains the l2 table parameters of the cluster which contains the given subclusters range. Right now this info is being obtained and used by zero_l2_subclusters(). As we're about to introduce the subclusters discard operation, this helper would let us avoid code duplication. Also introduce struct SubClusterRangeInfo, which would contain all the needed params. Signed-off-by: Andrey Drobyshev --- block/qcow2-cluster.c | 90 +-- 1 file changed, 62 insertions(+), 28 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 904f00d1b3..8801856b93 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -32,6 +32,13 @@ #include "qemu/memalign.h" #include "trace.h" +typedef struct SubClusterRangeInfo { +uint64_t *l2_slice; We should document that this is a strong reference that must be returned via qcow2_cache_put(). Maybe you could also define a clean-up function using G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC() that does this, allowing this type to be used with g_auto() (see the sketch after this quoted patch). +int l2_index; +uint64_t l2_entry; +uint64_t l2_bitmap; +} SubClusterRangeInfo; + int coroutine_fn qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t exact_size) { @@ -1892,6 +1899,50 @@ again: return 0; } +static int get_sc_range_info(BlockDriverState *bs, uint64_t offset, + unsigned nb_subclusters, + SubClusterRangeInfo *scri) It would be good to have documentation for this function, for example that it only works on a single cluster, i.e. that the range denoted by @offset and @nb_subclusters must not cross a cluster boundary, and that @offset must be aligned to subclusters. In general, it is unclear to me at this point what this function does. OK, it gets the SCRI for all subclusters in the cluster at @offset (this is what its name implies), but then it also has this loop that checks whether there are compressed clusters among the @nb_subclusters. It has a comment about being unable to zero/discard subclusters in compressed clusters, but the function name says nothing about this scope of zeroing/discarding. +{ +BDRVQcow2State *s = bs->opaque; +int ret, sc_cleared = 0, sc_index = offset_to_sc_index(s, offset); +QCow2SubclusterType sctype; + +/* Here we only work with the subclusters within single cluster. */ +assert(nb_subclusters > 0 && nb_subclusters < s->subclusters_per_cluster); +assert(sc_index + nb_subclusters <= s->subclusters_per_cluster); +assert(offset_into_subcluster(s, offset) == 0); + +ret = get_cluster_table(bs, offset, &scri->l2_slice, &scri->l2_index); +if (ret < 0) { +return ret; +} + +scri->l2_entry = get_l2_entry(s, scri->l2_slice, scri->l2_index); +scri->l2_bitmap = get_l2_bitmap(s, scri->l2_slice, scri->l2_index); + +do { +qcow2_get_subcluster_range_type(bs, scri->l2_entry, scri->l2_bitmap, +sc_index, &sctype); I think there’s a `ret = ` missing here. +if (ret < 0) { +return ret; +} + +switch (sctype) { +case QCOW2_SUBCLUSTER_COMPRESSED: +/* We cannot partially zeroize/discard compressed clusters. */ +return -ENOTSUP; +case QCOW2_SUBCLUSTER_INVALID: +return -EINVAL; +default: +break; +} + +sc_cleared += ret; +} while (sc_cleared < nb_subclusters); + +return 0; +} + /* * This discards as many clusters of nb_clusters as possible at once (i.e.
* all clusters in the same L2 slice) and returns the number of discarded @@ -2097,44 +2148,27 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset, unsigned nb_subclusters) { BDRVQcow2State *s = bs->opaque; -uint64_t *l2_slice; -uint64_t old_l2_bitmap, l2_bitmap; -int l2_index, ret, sc = offset_to_sc_index(s, offset); +uint64_t new_l2_bitmap; +int ret, sc = offset_to_sc_index(s, offset); +SubClusterRangeInfo scri = { 0 }; -/* For full clusters use zero_in_l2_slice() instead */ -assert(nb_subclusters > 0 && nb_subclusters < s->subclusters_per_cluster); -assert(sc + nb_subclusters <= s->subclusters_per_cluster); -assert(offset_into_subcluster(s, offset) == 0); - -ret = get_cluster_table(bs, offset, &l2_slice, &l2_index); +ret = get_sc_range_info(bs, offset, nb_subclusters, &scri); if (ret < 0) { -return ret; -} - -switch (qcow2_get_cluster_type(bs, get_l2_entry(s, l2_slice, l2_index))) { -case QCOW2_CLUSTER_COMPRESSED: -ret = -ENOTSUP; /* We cannot partially zeroize compressed clusters */ goto out; -case QCOW2_CLUSTER_NORMAL: -case QCOW2_CLUSTER_UNALLOCATED: -break; -default: -g_assert_not_reached(); } -
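To illustrate the g_auto() suggestion above: a minimal sketch, assuming SubClusterRangeInfo additionally carries the Qcow2Cache pointer so that the clean-up function knows which cache the slice has to be returned to (subcluster_range_info_clear() is a hypothetical name, not part of the patch):

```c
#include <glib.h>

typedef struct SubClusterRangeInfo {
    Qcow2Cache *l2_cache;   /* assumed addition, needed by the cleanup */
    uint64_t *l2_slice;     /* strong reference, owned by this struct */
    int l2_index;
    uint64_t l2_entry;
    uint64_t l2_bitmap;
} SubClusterRangeInfo;

static void subcluster_range_info_clear(SubClusterRangeInfo *scri)
{
    if (scri->l2_slice) {
        /* Returns the strong reference taken by get_cluster_table() */
        qcow2_cache_put(scri->l2_cache, (void **)&scri->l2_slice);
    }
}

G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC(SubClusterRangeInfo,
                                 subcluster_range_info_clear)
```

Callers could then declare `g_auto(SubClusterRangeInfo) scri = { 0 };` and drop the explicit qcow2_cache_put() on every exit path.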
Re: [PATCH 1/7] qcow2: make function update_refcount_discard() global
On 20.10.23 23:56, Andrey Drobyshev wrote: We are going to need it for discarding separate subclusters. The function itself doesn't do anything with the refcount tables, I think the idea behind the naming was that updating refcounts can lead to clusters being discarded, i.e. update_refcount_discard() did the “discard” part of “update_refcount”. it simply adds a discard request to the queue, so rename it to qcow2_queue_discard(). But that’s fine, too. Signed-off-by: Andrey Drobyshev --- block/qcow2-refcount.c | 8 block/qcow2.h | 2 ++ 2 files changed, 6 insertions(+), 4 deletions(-) Reviewed-by: Hanna Czenczek
Re: [PATCH v2 1/1] qemu-img: do not erase destination file in qemu-img dd command
On 01.10.23 22:46, Denis V. Lunev wrote: Can you please not top-post. This makes the discussion complex. This approach is followed on this mailing list and on other similar lists like LKML. On 10/1/23 19:08, Mike Maslenkin wrote: I thought about "conv=notrunc", but my main concern is changed virtual disk metadata. It depends on how qemu-img is used. Maybe I followed a wrong pattern, but the pros and cons of adding a "conv" parameter were not on my mind in the scope of the first patch version. I see 4 obvious ways of using `qemu-img dd`: 1. Copy virtual disk data between images of the same format. I think disk geometry must be preserved in this case. 2. Copy virtual disk data between different formats. Is that a valid pattern? Maybe `qemu-img convert` should be used instead? 3. Merge snapshots into a specified disk image, i.e. read the current state and write it to a new disk image. 4. Copy virtual disk data to a raw binary file. Actually this patch breaks 'dd' behavior for this case when the source image is smaller (in terms of logical blocks) than the existing raw binary file. Maybe for this case the condition can be improved to something like if (strcmp(fmt, "raw") || !g_file_test(out.filename, G_FILE_TEST_EXISTS)). And the parameter "conv=notrunc" may be implemented additionally for this case. My personal opinion is that qemu-img dd is for when you need to extract some data from the original image and process it further; thus I use it to copy some data into a raw binary file. My next goal here would be to add the ability to write data to stdout, which would be beneficial for that. Though this is out of the equation at the moment. Speaking about the approach, though, I would say that the patch changes current behavior, which is not exactly buggy, just a matter of this or that taste. It should be noted that we are in the Linux world here, not in the Mac world, where we would be in a position to avoid options and selections. Thus my opinion is that the original behavior is to be preserved, as somebody is relying on it. The option you are proposing seems valuable to me as well, and thus the switch is to be added. The switch is well-defined in the original 'dd' world, so either conv= option would be good: nocreat or notrunc. For me 'nocreat' seems more natural. Anyway, the last word here belongs to either Hanna or Kevin ;) Personally, and honestly, I see no actual use for qemu-img dd at all, because we're trying to mimic a subset of an interface of a rather complex program that has been designed to do what it does. We can only fail at that. Personally, whenever I need dd functionality, I use qemu-storage-daemon's fuse export, and then use the actual dd program on top. Alternatively, qemu-img convert is our native interface; unfortunately, its feature set is lacking when compared to qemu-img dd, but I think it would be better to improve that rather than working on qemu-img dd. I tend to agree with you, Denis, though, that this patch changes behavior, and users may be relying on the current behavior. Comparing to "what would dd do" is difficult because dd just has no concept of file formats. (That's another reason why I think it's bad to provide a dd-like interface, and users should rather use dd itself, or qemu-img convert.) But in any case, adding conv=notrunc to keep the existing target file would seem fair to me. I understand conv=nocreat would advise dd to never create the target file, which is something slightly different (i.e. 
conv=notrunc will still create a new target file if it doesn’t exist yet, but won’t create/truncate one if it already exists; while conv=nocreat would disable creating a new target file if it doesn’t exist yet, but still truncate it if it does exist; using both together would ensure that the target file is never created/truncated). Summary: If we do this under a new conv=notrunc, fine with me. I just don’t think qemu-img dd is something that should be used at all. Hanna
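As a concrete illustration of the FUSE-export workflow described above (a hedged sketch; image and file names are made up):

```
# Export a qcow2 image through qemu-storage-daemon's FUSE export, then
# run the real dd on the exported file.  The mountpoint must be an
# existing regular file.
touch mnt-file
qemu-storage-daemon \
    --blockdev file,filename=disk.qcow2,node-name=proto \
    --blockdev qcow2,file=proto,node-name=fmt \
    --export type=fuse,id=exp0,node-name=fmt,mountpoint=mnt-file &

dd if=mnt-file of=out.raw bs=1M
```

This gets dd's full option set (skip=, seek=, conv=, ...) without qemu-img having to reimplement any of it.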
Re: [PATCH] block-jobs: add final flush
On 04.10.23 15:56, Vladimir Sementsov-Ogievskiy wrote: From: Vladimir Sementsov-Ogievskiy Actually, a block job is not complete without the final flush. It's rather unexpected to end up with a broken target when the job completed successfully long ago, but we then failed to flush, or the process crashed or was killed. Mirror job already has mirror_flush() for this, so it's OK. Add similar things for the other jobs: backup, stream, commit. Note that stream has a (documented) different treatment of the IGNORE action: it doesn't retry the operation, but continues execution and reports the error at the end. We keep that for the final flush, too. Signed-off-by: Vladimir Sementsov-Ogievskiy --- Was: [PATCH v4] block-jobs: flush target at the end of .run() But now rewritten. Supersedes: <20230725174008.1147467-1-vsement...@yandex-team.ru> block/backup.c | 2 +- block/block-copy.c | 7 +++ block/commit.c | 16 block/stream.c | 21 + include/block/block-copy.h | 1 + 5 files changed, 38 insertions(+), 9 deletions(-) [...] diff --git a/block/commit.c b/block/commit.c index aa45beb0f0..5205c77ec9 100644 --- a/block/commit.c +++ b/block/commit.c [...] @@ -187,7 +187,15 @@ static int coroutine_fn commit_run(Job *job, Error **errp) } } -return 0; +do { +ret = blk_co_flush(s->base); +if (ret < 0) { +action = block_job_error_action(&s->common, s->on_error, +false, -ret); +} +} while (ret < 0 && action != BLOCK_ERROR_ACTION_REPORT); Do we need to yield in this loop somewhere so that BLOCK_ERROR_ACTION_STOP can pause the job? Hanna
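One way such a yield could look: a hedged sketch using the existing job_pause_point() helper (whether this is the right place for it is exactly what the question above is about):

```c
do {
    ret = blk_co_flush(s->base);
    if (ret < 0) {
        action = block_job_error_action(&s->common, s->on_error,
                                        false, -ret);
        if (action == BLOCK_ERROR_ACTION_STOP) {
            /* Gives the job a chance to actually enter the paused
             * state before the flush is retried */
            job_pause_point(&s->common.job);
        }
    }
} while (ret < 0 && action != BLOCK_ERROR_ACTION_REPORT);
```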
Re: [PATCH 2/2] iotests: Test media change with iothreads
On 13.10.23 17:33, Kevin Wolf wrote: iotests case 118 already tests all relevant operations for media change with multiple devices, however never with iothreads. This changes the test so that the virtio-scsi tests run with an iothread. Signed-off-by: Kevin Wolf --- tests/qemu-iotests/118 | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) Reviewed-by: Hanna Czenczek
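For reference, the configuration the updated test exercises corresponds roughly to this command line (a sketch; IDs and file names are made up): a virtio-scsi controller whose I/O runs in a separate iothread, with a removable scsi-cd attached to it.

```
qemu-system-x86_64 \
    -object iothread,id=iothread0 \
    -device virtio-scsi-pci,id=scsi0,iothread=iothread0 \
    -blockdev file,filename=cd.iso,node-name=cd-file \
    -device scsi-cd,bus=scsi0.0,drive=cd-file,id=cd0
```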
Re: [PATCH 1/2] block: Fix locking in media change monitor commands
On 13.10.23 17:33, Kevin Wolf wrote: blk_insert_bs() requires that the caller holds the AioContext lock for the node to be inserted. Since commit c066e808e11, neglecting to do so causes a crash when the child has to be moved to a different AioContext to attach it to the BlockBackend. This fixes qmp_blockdev_insert_anon_medium(), which is called for the QMP commands 'blockdev-insert-medium' and 'blockdev-change-medium', to correctly take the lock. Cc: qemu-sta...@nongnu.org Fixes: https://issues.redhat.com/browse/RHEL-3922 Fixes: c066e808e11a5c181b625537b6c78e0de27a4801 Signed-off-by: Kevin Wolf --- block/qapi-sysemu.c | 5 + 1 file changed, 5 insertions(+) Do we need to take the lock for the dev_ops tray callbacks, too? I suppose not, and it also wouldn’t really matter in light of the lock being supposed to go away anyway, but still thought I should ask. In any case, this change here is necessary, so: Reviewed-by: Hanna Czenczek
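In code terms, the fix boils down to the usual pattern of holding the node's AioContext lock around blk_insert_bs(); roughly (a sketch of the pattern, not the verbatim patch):

```c
AioContext *ctx = bdrv_get_aio_context(bs);

aio_context_acquire(ctx);
ret = blk_insert_bs(blk, bs, errp);
aio_context_release(ctx);
```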
Re: [PATCH v2] block/file-posix: fix update_zones_wp() caller
On 25.08.23 06:05, Sam Li wrote: When a zoned request fails, it needs to update only the wp of the target zones, so as not to disrupt in-flight writes on these other zones. The wp is updated successfully after the request completes. Fix the callers to pass the right offset and nr_zones. Signed-off-by: Sam Li --- block/file-posix.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) Thanks, applied to my block branch: https://gitlab.com/hreitz/qemu/-/commits/block (Rebased on master, and I’ve also fixed the comment to read “boundaries” instead of “bounaries”. Hope that’s OK!) Hanna
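Concretely, the fix presumably amounts to passing the failed request's offset instead of a hard-coded zone 0 in raw_co_prw()'s error path, along these lines (a hedged sketch, not the verbatim patch):

```c
/* Refresh the write pointer of the zone the failed request actually
 * targeted, rather than always zone 0 */
if (ret < 0 && (type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND))) {
    update_zones_wp(bs, s->fd, offset, 1);
}
```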
Re: [PATCH v8 1/5] qemu-iotests: Filter warnings about block migration being deprecated
On 18.10.23 13:55, Juan Quintela wrote: Create a new filter that removes the two warnings for test 183. Signed-off-by: Juan Quintela --- tests/qemu-iotests/183 | 2 +- tests/qemu-iotests/common.filter | 7 +++ 2 files changed, 8 insertions(+), 1 deletion(-) Reviewed-by: Hanna Czenczek
Re: [PATCH v2 8/8] iotests: add tests for "qemu-img rebase" with compression
On 15.09.23 18:20, Andrey Drobyshev wrote: The test cases considered so far: 314 (new test suite): 1. Check that compression mode isn't compatible with "-f raw" (raw format doesn't support compression). 2. Check that rebasing an image onto no backing file preserves the data and writes the copied clusters actually compressed. 3. Same as 2, but with a raw backing file (i.e. the clusters copied from the backing are originally uncompressed -- we check they end up compressed after being merged). 4. Remove a single delta from a backing chain, perform the same checks as in 2. 5. Check that even when backing and overlay are initially uncompressed, copied clusters end up compressed when rebase with compression is performed. 271: 1. Check that when target image has subclusters, rebase with compression will make an entire cluster containing the written subcluster compressed. Signed-off-by: Andrey Drobyshev --- tests/qemu-iotests/271 | 65 +++ tests/qemu-iotests/271.out | 40 + tests/qemu-iotests/314 | 165 + tests/qemu-iotests/314.out | 75 + 4 files changed, 345 insertions(+) create mode 100755 tests/qemu-iotests/314 create mode 100644 tests/qemu-iotests/314.out Reviewed-by: Hanna Czenczek
Re: [PATCH v2 6/8] iotests/{024, 271}: add testcases for qemu-img rebase
On 15.09.23 18:20, Andrey Drobyshev wrote: As the previous commit changes the logic of "qemu-img rebase" (it's using write alignment now), let's add a couple more test cases to ensure it works correctly. In particular, the following scenarios: 024: add a test case for rebase within one backing chain when the overlay cluster size is larger than the backing's cluster size; 271: add a test case for rebasing images that contain subclusters. Check that no extra allocations are being made. Signed-off-by: Andrey Drobyshev --- tests/qemu-iotests/024 | 60 ++ tests/qemu-iotests/024.out | 43 + tests/qemu-iotests/271 | 66 ++ tests/qemu-iotests/271.out | 42 4 files changed, 211 insertions(+) Interestingly, the new case in 024 hangs before this series. So it isn’t just an optimization, but a fix, actually! Reviewed-by: Hanna Czenczek
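For illustration, the 024 scenario (overlay cluster size larger than the backing's) can be reproduced by hand roughly like this (a sketch with made-up file names and sizes):

```
qemu-img create -f qcow2 -o cluster_size=4k base.qcow2 1M
qemu-img create -f qcow2 -o cluster_size=4k -b base.qcow2 -F qcow2 mid.qcow2 1M
qemu-img create -f qcow2 -o cluster_size=64k -b mid.qcow2 -F qcow2 top.qcow2 1M
# Rebase top.qcow2 from mid.qcow2 directly onto base.qcow2:
qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 top.qcow2
```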
Re: Deadlock with SATA CD I/O and eject
On 19.09.23 12:54, Kevin Wolf wrote: Am 18.09.2023 um 19:28 hat John Levon geschrieben: Observed with base of qemu 6.2.0, but from code inspection it looks to me like it's still current in upstream master. Apologies if I have missed a fix in this area. Symptom: run a UEFI-booted SATA CD Windows installer. When it hits the "Loading files.." screen, run an eject, e.g. virsh qemu-monitor-command 64c6e190-ea7f-49e2-b2d5-6ba1814b00ae '{"execute":"eject", "arguments": { "id": "sata0-0-0" } }' qemu will get stuck like so: (gdb) bt #0 0x7f8ba4b16036 in ppoll () from /lib64/libc.so.6 #1 0x561813c48ed5 in ppoll (__ss=0x0, __timeout=0x7ffcbd981a70, __nfds=, __fds=) at /usr/include/bits/poll2.h:62 #2 qemu_poll_ns (fds=, nfds=, timeout=timeout@entry=999896128) at ../util/qemu-timer.c:348 #3 0x561813c29be9 in fdmon_poll_wait (ctx=0x56181516e070, ready_list=0x7ffcbd981af0, timeout=999896128) at ../util/fdmon-poll.c:80 #4 0x561813c297e1 in aio_poll (ctx=ctx@entry=0x56181516e070, blocking=blocking@entry=true) at ../util/aio-posix.c:607 #5 0x561813ae2fad in bdrv_do_drained_begin (poll=true, ignore_bds_parents=false, parent=0x0, recursive=false, bs=0x56181533fcc0) at ../block/io.c:483 #6 bdrv_do_drained_begin (bs=0x56181533fcc0, recursive=, parent=0x0, ignore_bds_parents=, poll=) at ../block/io.c:446 #7 0x561813ad9982 in blk_drain (blk=0x5618161c1f10) at ../block/block-backend.c:1741 #8 0x561813ad9b8c in blk_remove_bs (blk=blk@entry=0x5618161c1f10) at ../block/block-backend.c:852 #9 0x56181382b8ab in blockdev_remove_medium (has_device=, device=, has_id=, id=, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:232 #10 0x56181382bfb1 in qmp_eject (has_device=, device=0x0, has_id=, id=0x561815e6efe0 "sata0-0-0", has_force=, force=, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:45 We are stuck forever here: 351 static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent, 352 bool poll) ... 380 if (poll) { 381 BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, parent)); 382 } Because the blk root's ->in_flight is 1, as tested by the condition blk_root_drained_poll(). Our blk->in_flight user is stuck here: 1298 static void coroutine_fn blk_wait_while_drained(BlockBackend *blk) ... 1310 blk_dec_in_flight(blk); 1311 qemu_co_queue_wait(&blk->queued_requests, &blk->queued_requests_lock); 1312 blk_inc_in_flight(blk); Note that before entering this stanza, blk->in_flight was 2. This turns out to be due to the ide atapi code. In particular, the guest VM is generating lots of read I/O. The "first IO" arrives into blk via: cd_read_sector()->ide_buffered_readv()->blk_aio_preadv() This initial IO completes: blk_aio_read_entry()->blk_aio_complete() 1560 static void blk_aio_complete(BlkAioEmAIOCB *acb) 1561 { 1562 if (acb->has_returned) { 1563 acb->common.cb(acb->common.opaque, acb->rwco.ret); 1564 blk_dec_in_flight(acb->rwco.blk); 1565 qemu_aio_unref(acb); 1566 } 1567 } Line 1564 is what we need to move blk->in_flight down to zero, but that is never reached! This is because of what happens at :1563 acb->common.cb()->cd_read_sector_cb()->ide_atapi_cmd_reply_end()->cd_read_sector_sync()->blk_pread() That is, the IO callback in the atapi code itself triggers more - synchronous - IO. In the meantime, we start processing the blk_drain() code, so by the time this blk_pread() actually gets handled, quiesce is set, and we get stuck in blk_wait_while_drained(). I don't know the qemu block stack well enough to propose an actual fix. 
Experimentally, waiting for ->in_flight to drop to zero *before* we quiesce in blk_remove_bs() via an AIO_WAIT_WHILE() avoids the symptom, but I'm pretty sure that's just a band-aid instead of fixing the deadlock. Any suggestions/clues/thoughts? Related discussion: https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00284.html Actually, it seems we never fixed that problem either? We didn’t, from my POV mainly because everyone had different suggestions and none of them looked nice enough that it was clear which one to pursue. I.e., there are also https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00709.html and https://lists.gnu.org/archive/html/qemu-block/2023-04/msg00118.html . Back then I suggested that blk_drain*() should disable request queuing because its callers don't want to quiesce the BlockBackend, but rather get their own requests completed. I think this approach would solve this one as well. Your experiment with moving the queuing later is basically what Paolo suggested, though I'd still argue it's not the right thing to do because while it seems to mostly work with both use cases of drain (give me exclusive access vs. wait for my requests to complete), it's not optimal for either one. Kevin
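To make Kevin's suggestion concrete, a hedged sketch (blk_set_disable_request_queuing() is the existing helper; whether blk_drain() may simply toggle it like this is exactly the open design question, so this is a possible direction, not the agreed fix):

```c
void blk_drain(BlockBackend *blk)
{
    BlockDriverState *bs = blk_bs(blk);

    /* We want our own in-flight requests to complete, not to be
     * queued until the drained section ends. */
    blk_set_disable_request_queuing(blk, true);

    if (bs) {
        bdrv_ref(bs);
        bdrv_drained_begin(bs);
    }

    /* Wait for the BlockBackend's own requests to finish */
    AIO_WAIT_WHILE(blk_get_aio_context(blk),
                   qatomic_read(&blk->in_flight) > 0);

    if (bs) {
        bdrv_drained_end(bs);
        bdrv_unref(bs);
    }

    blk_set_disable_request_queuing(blk, false);
}
```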
Re: [PATCH v2 5/8] qemu-img: rebase: avoid unnecessary COW operations
On 15.09.23 18:20, Andrey Drobyshev wrote: When rebasing an image from one backing file to another, we need to compare data from old and new backings. If the diff between that data happens to be unaligned to the target cluster size, we might end up doing partial writes, which would lead to copy-on-write and additional IO. Consider the following simple case (virtual_size == cluster_size == 64K): base <-- inc1 <-- inc2 qemu-io -c "write -P 0xaa 0 32K" base.qcow2 qemu-io -c "write -P 0xcc 32K 32K" base.qcow2 qemu-io -c "write -P 0xbb 0 32K" inc1.qcow2 qemu-io -c "write -P 0xcc 32K 32K" inc1.qcow2 qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 inc2.qcow2 While doing rebase, we'll write a half of the cluster to inc2, and block layer will have to read the 2nd half of the same cluster from the base image inc1 while doing this write operation, although the whole cluster is already read earlier to perform data comparison. In order to avoid these unnecessary IO cycles, let's make sure every write request is aligned to the overlay subcluster boundaries. Using subcluster size is universal as for the images which don't have them this size equals to the cluster size, so in any case we end up aligning to the smallest unit of allocation. Signed-off-by: Andrey Drobyshev --- qemu-img.c | 76 -- 1 file changed, 56 insertions(+), 20 deletions(-) Looks good, I like the changes from v1! Two minor things: diff --git a/qemu-img.c b/qemu-img.c index fcd31d7b5b..83950af42b 100644 --- a/qemu-img.c +++ b/qemu-img.c [...] @@ -3844,33 +3861,48 @@ static int img_rebase(int argc, char **argv) } } +/* + * At this point we know that the region [offset; offset + n) + * is unallocated within the target image. This region might be + * unaligned to the target image's (sub)cluster boundaries, as + * old backing may have smaller clusters (or have subclusters). + * We extend it to the aligned boundaries to avoid CoW on + * partial writes in blk_pwrite(), + */ +n += offset - QEMU_ALIGN_DOWN(offset, write_align); +offset = QEMU_ALIGN_DOWN(offset, write_align); +n += QEMU_ALIGN_UP(offset + n, write_align) - (offset + n); +n = MIN(n, size - offset); +assert(!bdrv_is_allocated(unfiltered_bs, offset, n, &n_alloc) && + n_alloc == n); + +/* + * Much like the with the target image, we'll try to read as much s/the with the/with the/ + * of the old and new backings as we can. +
 */ +n_old = MIN(n, MAX(0, old_backing_size - (int64_t) offset)); +if (blk_new_backing) { +n_new = MIN(n, MAX(0, new_backing_size - (int64_t) offset)); +} If we don’t have a check for blk_old_backing (old_backing_size is 0 if blk_old_backing is NULL), why do we have a check for blk_new_backing (new_backing_size is 0 if blk_new_backing is NULL)? (Perhaps because the previous check was `offset >= new_backing_size || !blk_new_backing`, i.e. included exactly such a check – but I don’t think it’s necessary, new_backing_size will be 0 if blk_new_backing is NULL.) + /* * Read old and new backing file and take into consideration that * backing files may be smaller than the COW image. 
*/ -if (offset >= old_backing_size) { -memset(buf_old, 0, n); -buf_old_is_zero = true; +memset(buf_old + n_old, 0, n - n_old); +if (!n_old) { +old_backing_eof = true; } else { -if (offset + n > old_backing_size) { -n = old_backing_size - offset; -} - -ret = blk_pread(blk_old_backing, offset, n, buf_old, 0); +ret = blk_pread(blk_old_backing, offset, n_old, buf_old, 0); if (ret < 0) { error_report("error while reading from old backing file"); goto out; } } -if (offset >= new_backing_size || !blk_new_backing) { -memset(buf_new, 0, n); -} else { -if (offset + n > new_backing_size) { -n = new_backing_size - offset; -} - -ret = blk_pread(blk_new_backing, offset, n, buf_new, 0); +memset(buf_new + n_new, 0, n - n_new); +if (blk_new_backing && n_new) { Same as above, I think we can drop the blk_new_backing check, just so that both blocks (for old and new) look the same. (Also, the memset() already has to trust that n_new is 0 if blk_new_backing is NULL, so the check doesn’t make much sense from that perspective either, and makes the memset() look wrong.) +
Re: [PATCH v2 4/8] qemu-img: add chunk size parameter to compare_buffers()
On 15.09.23 18:20, Andrey Drobyshev wrote: Add @chsize param to the function which, if non-zero, would represent the chunk size to be used for comparison. If it's zero, then BDRV_SECTOR_SIZE is used as default chunk size, which is the previous behaviour. In particular, we're going to use this param in img_rebase() to make the write requests aligned to a predefined alignment value. Signed-off-by: Andrey Drobyshev --- qemu-img.c | 22 ++ 1 file changed, 14 insertions(+), 8 deletions(-) With the comment fix Eric has suggested: Reviewed-by: Hanna Czenczek
Re: [PATCH v2 3/8] qemu-img: rebase: use backing files' BlockBackend for buffer alignment
On 15.09.23 18:20, Andrey Drobyshev wrote: Since commit bb1c05973cf ("qemu-img: Use qemu_blockalign"), buffers for the data read from the old and new backing files are aligned using BlockDriverState (or BlockBackend later on) referring to the target image. However, this isn't quite right, because buf_new is only being used for reading from the new backing, while buf_old is being used for both reading from the old backing and writing to the target. Let's take that into account and use more appropriate values as alignments. Signed-off-by: Andrey Drobyshev --- qemu-img.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/qemu-img.c b/qemu-img.c index 50660ba920..d12e4a4753 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -3750,8 +3750,13 @@ static int img_rebase(int argc, char **argv) int64_t n; float local_progress = 0; -buf_old = blk_blockalign(blk, IO_BUF_SIZE); -buf_new = blk_blockalign(blk, IO_BUF_SIZE); +if (blk_old_backing && bdrv_opt_mem_align(blk_bs(blk)) > +bdrv_opt_mem_align(blk_bs(blk_old_backing))) { +buf_old = blk_blockalign(blk, IO_BUF_SIZE); +} else { +buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE); +} As I read this, if blk_old_backing is NULL, we will go to the blk_blockalign(blk_old_backing, IO_BUF_SIZE) path. I think if it is NULL, we should align on blk instead. Hanna +buf_new = blk_blockalign(blk_new_backing, IO_BUF_SIZE); size = blk_getlength(blk); if (size < 0) {
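In code, the NULL-safe variant Hanna suggests could read like this (a sketch of the suggested change, not a committed patch):

```c
/* If there is no old backing file, simply align on blk */
if (!blk_old_backing ||
    bdrv_opt_mem_align(blk_bs(blk)) >
        bdrv_opt_mem_align(blk_bs(blk_old_backing))) {
    buf_old = blk_blockalign(blk, IO_BUF_SIZE);
} else {
    buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE);
}
```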
Re: [PULL 00/14] Block patches
On 07.09.23 13:21, Hanna Czenczek wrote: On 06.09.23 15:18, Stefan Hajnoczi wrote: On Fri, 1 Sept 2023 at 04:18, Hanna Czenczek wrote: The following changes since commit f5fe7c17ac4e309e47e78f0f9761aebc8d2f2c81: Merge tag 'pull-tcg-20230823-2' of https://gitlab.com/rth7680/qemu into staging (2023-08-28 16:07:04 -0400) are available in the Git repository at: https://gitlab.com/hreitz/qemu.git tags/pull-block-2023-09-01 Hi Hanna, Please push a signed tag (git tag -s). Thanks! By the way, I meant to imply that I’m not going to push a new tag at this time. I have generated this pull request as I always do, using a script that automatically signs the tag. gpg asked me to enter my key password, indicating something had been signed, and the methods I know of tell me that the tag is indeed signed. So as far as I can tell, the tag is signed as usual. If I were to create a new pull request, I would do it the exact same way, which I expect would yield the same result, so we’d have the same problem again. To get a different result, I need to know how you determine the tag not to be signed, so I can reproduce it and potentially fix the problem in my workflow. Hanna Is it not signed? I don’t think gitlab has support to show that, but github shows it as verified: https://github.com/XanClic/qemu/releases/tag/pull-block-2023-09-01 And when I clone it: ``` $ git clone https://gitlab.com/hreitz/qemu -b pull-block-2023-09-01 --depth=1 [...] $ cd qemu $ git tag -v pull-block-2023-09-01 LANG=C git tag -v pull-block-2023-09-01 object 380448464dd89291cf7fd7434be6c225482a334d type commit tag pull-block-2023-09-01 tagger Hanna Reitz 1693555853 +0200 Block patches - Fix for file-posix's zoning code crashing on I/O errors - Throttling refactoring gpg: Signature made Fri Sep 1 10:11:46 2023 CEST gpg: using RSA key CB62D7A0EE3829E45F004D34A1FA40D098019CDF gpg: issuer "hre...@redhat.com" gpg: Good signature from "Hanna Reitz " [ultimate] Primary key fingerprint: CB62 D7A0 EE38 29E4 5F00 4D34 A1FA 40D0 9801 9CDF ``` Hanna
Re: [PULL 00/14] Block patches
On 06.09.23 15:18, Stefan Hajnoczi wrote: On Fri, 1 Sept 2023 at 04:18, Hanna Czenczek wrote: The following changes since commit f5fe7c17ac4e309e47e78f0f9761aebc8d2f2c81: Merge tag 'pull-tcg-20230823-2' of https://gitlab.com/rth7680/qemu into staging (2023-08-28 16:07:04 -0400) are available in the Git repository at: https://gitlab.com/hreitz/qemu.git tags/pull-block-2023-09-01 Hi Hanna, Please push a signed tag (git tag -s). Thanks! Is it not signed? I don’t think gitlab has support to show that, but github shows it as verified: https://github.com/XanClic/qemu/releases/tag/pull-block-2023-09-01 And when I clone it: ``` $ git clone https://gitlab.com/hreitz/qemu -b pull-block-2023-09-01 --depth=1 [...] $ cd qemu $ git tag -v pull-block-2023-09-01 LANG=C git tag -v pull-block-2023-09-01 object 380448464dd89291cf7fd7434be6c225482a334d type commit tag pull-block-2023-09-01 tagger Hanna Reitz 1693555853 +0200 Block patches - Fix for file-posix's zoning code crashing on I/O errors - Throttling refactoring gpg: Signature made Fri Sep 1 10:11:46 2023 CEST gpg: using RSA key CB62D7A0EE3829E45F004D34A1FA40D098019CDF gpg: issuer "hre...@redhat.com" gpg: Good signature from "Hanna Reitz " [ultimate] Primary key fingerprint: CB62 D7A0 EE38 29E4 5F00 4D34 A1FA 40D0 9801 9CDF ``` Hanna
[PULL 09/14] block/throttle-groups: Use ThrottleDirection instread of bool is_write
From: zhenwei pi 'bool is_write' style is obsolete from throttle framework, adapt block throttle groups to the new style: - use ThrottleDirection instead of 'bool is_write'. Ex, schedule_next_request(ThrottleGroupMember *tgm, bool is_write) -> schedule_next_request(ThrottleGroupMember *tgm, ThrottleDirection direction) - use THROTTLE_MAX instead of hard code. Ex, ThrottleGroupMember *tokens[2] -> ThrottleGroupMember *tokens[THROTTLE_MAX] - use ThrottleDirection instead of hard code on iteration. Ex, (i = 0; i < 2; i++) -> for (dir = THROTTLE_READ; dir < THROTTLE_MAX; dir++) Use a simple python script to test the new style: #!/usr/bin/python3 import subprocess import random import time commands = ['virsh blkdeviotune jammy vda --write-bytes-sec ', \ 'virsh blkdeviotune jammy vda --write-iops-sec ', \ 'virsh blkdeviotune jammy vda --read-bytes-sec ', \ 'virsh blkdeviotune jammy vda --read-iops-sec '] for loop in range(1, 1000): time.sleep(random.randrange(3, 5)) command = commands[random.randrange(0, 3)] + str(random.randrange(0, 100)) subprocess.run(command, shell=True, check=True) This works fine. Signed-off-by: zhenwei pi Message-Id: <20230728022006.1098509-10-pizhen...@bytedance.com> Reviewed-by: Hanna Czenczek Signed-off-by: Hanna Czenczek --- include/block/throttle-groups.h | 6 +- block/block-backend.c | 4 +- block/throttle-groups.c | 161 block/throttle.c| 8 +- 4 files changed, 90 insertions(+), 89 deletions(-) diff --git a/include/block/throttle-groups.h b/include/block/throttle-groups.h index ff282fc0f8..2355e8d9de 100644 --- a/include/block/throttle-groups.h +++ b/include/block/throttle-groups.h @@ -37,7 +37,7 @@ typedef struct ThrottleGroupMember { AioContext *aio_context; /* throttled_reqs_lock protects the CoQueues for throttled requests. */ CoMutex throttled_reqs_lock; -CoQueue throttled_reqs[2]; +CoQueue throttled_reqs[THROTTLE_MAX]; /* Nonzero if the I/O limits are currently being ignored; generally * it is zero. Accessed with atomic operations. @@ -54,7 +54,7 @@ typedef struct ThrottleGroupMember { * throttle_state tells us if I/O limits are configured. 
*/ ThrottleState *throttle_state; ThrottleTimers throttle_timers; -unsigned pending_reqs[2]; +unsigned pending_reqs[THROTTLE_MAX]; QLIST_ENTRY(ThrottleGroupMember) round_robin; } ThrottleGroupMember; @@ -78,7 +78,7 @@ void throttle_group_restart_tgm(ThrottleGroupMember *tgm); void coroutine_fn throttle_group_co_io_limits_intercept(ThrottleGroupMember *tgm, int64_t bytes, -bool is_write); +ThrottleDirection direction); void throttle_group_attach_aio_context(ThrottleGroupMember *tgm, AioContext *new_context); void throttle_group_detach_aio_context(ThrottleGroupMember *tgm); diff --git a/block/block-backend.c b/block/block-backend.c index 4009ed5fed..47d360c97a 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -1341,7 +1341,7 @@ blk_co_do_preadv_part(BlockBackend *blk, int64_t offset, int64_t bytes, /* throttling disk I/O */ if (blk->public.throttle_group_member.throttle_state) { throttle_group_co_io_limits_intercept(&blk->public.throttle_group_member, -bytes, false); +bytes, THROTTLE_READ); } ret = bdrv_co_preadv_part(blk->root, offset, bytes, qiov, qiov_offset, @@ -1415,7 +1415,7 @@ blk_co_do_pwritev_part(BlockBackend *blk, int64_t offset, int64_t bytes, /* throttling disk I/O */ if (blk->public.throttle_group_member.throttle_state) { throttle_group_co_io_limits_intercept(&blk->public.throttle_group_member, -bytes, true); +bytes, THROTTLE_WRITE); } if (!blk->enable_write_cache) { diff --git a/block/throttle-groups.c b/block/throttle-groups.c index 3847d27801..3eda4c4e3d 100644 --- a/block/throttle-groups.c +++ b/block/throttle-groups.c @@ -37,7 +37,7 @@ static void throttle_group_obj_init(Object *obj); static void throttle_group_obj_complete(UserCreatable *obj, Error **errp); -static void timer_cb(ThrottleGroupMember *tgm, bool is_write); +static void timer_cb(ThrottleGroupMember *tgm, ThrottleDirection direction); /* The ThrottleGroup structure (with its ThrottleState) is shared * among different ThrottleGroupMembers and it's independent from @@ -73,8 +73,8 @@ struct ThrottleGroup { QemuMutex lock; /* This lock protects the following four fields */ ThrottleState ts; QLIST_HEAD(, ThrottleGroupMember) head; -ThrottleGroupMember *tokens[2]; -bool any_timer_armed[2]; +T
[PULL 13/14] file-posix: Simplify raw_co_prw's 'out' zone code
We duplicate the same condition three times here, pull it out to the top level. Signed-off-by: Hanna Czenczek Message-Id: <20230824155345.109765-5-hre...@redhat.com> Reviewed-by: Sam Li --- block/file-posix.c | 18 +- 1 file changed, 5 insertions(+), 13 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index a050682e97..aa89789737 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2506,11 +2506,10 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset, out: #if defined(CONFIG_BLKZONED) -{ -BlockZoneWps *wps = bs->wps; -if (ret == 0) { -if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && -bs->bl.zoned != BLK_Z_NONE) { +if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && +bs->bl.zoned != BLK_Z_NONE) { +BlockZoneWps *wps = bs->wps; +if (ret == 0) { uint64_t *wp = &wps->wp[offset / bs->bl.zone_size]; if (!BDRV_ZT_IS_CONV(*wp)) { if (type & QEMU_AIO_ZONE_APPEND) { @@ -2523,19 +2522,12 @@ out: *wp = offset + bytes; } } -} -} else { -if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && -bs->bl.zoned != BLK_Z_NONE) { +} else { update_zones_wp(bs, s->fd, 0, 1); } -} -if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && -bs->bl.zoned != BLK_Z_NONE) { qemu_co_mutex_unlock(&wps->colock); } -} #endif return ret; } -- 2.41.0
[PULL 07/14] throttle: use THROTTLE_MAX/ARRAY_SIZE for hard code
From: zhenwei pi The first dimension of both to_check and bucket_types_size/bucket_types_units is used as throttle direction, use THROTTLE_MAX instead of hard coded number. Also use ARRAY_SIZE() to avoid hard coded number for the second dimension. Hanna noticed that the two array should be static. Yes, turn them into static variables. Reviewed-by: Hanna Czenczek Signed-off-by: zhenwei pi Message-Id: <20230728022006.1098509-8-pizhen...@bytedance.com> Signed-off-by: Hanna Czenczek --- util/throttle.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/util/throttle.c b/util/throttle.c index 7d3eb6032f..9582899da3 100644 --- a/util/throttle.c +++ b/util/throttle.c @@ -142,7 +142,8 @@ int64_t throttle_compute_wait(LeakyBucket *bkt) static int64_t throttle_compute_wait_for(ThrottleState *ts, ThrottleDirection direction) { -BucketType to_check[2][4] = { {THROTTLE_BPS_TOTAL, +static const BucketType to_check[THROTTLE_MAX][4] = { + {THROTTLE_BPS_TOTAL, THROTTLE_OPS_TOTAL, THROTTLE_BPS_READ, THROTTLE_OPS_READ}, @@ -153,7 +154,7 @@ static int64_t throttle_compute_wait_for(ThrottleState *ts, int64_t wait, max_wait = 0; int i; -for (i = 0; i < 4; i++) { +for (i = 0; i < ARRAY_SIZE(to_check[THROTTLE_READ]); i++) { BucketType index = to_check[direction][i]; wait = throttle_compute_wait(&ts->cfg.buckets[index]); if (wait > max_wait) { @@ -469,11 +470,11 @@ bool throttle_schedule_timer(ThrottleState *ts, void throttle_account(ThrottleState *ts, ThrottleDirection direction, uint64_t size) { -const BucketType bucket_types_size[2][2] = { +static const BucketType bucket_types_size[THROTTLE_MAX][2] = { { THROTTLE_BPS_TOTAL, THROTTLE_BPS_READ }, { THROTTLE_BPS_TOTAL, THROTTLE_BPS_WRITE } }; -const BucketType bucket_types_units[2][2] = { +static const BucketType bucket_types_units[THROTTLE_MAX][2] = { { THROTTLE_OPS_TOTAL, THROTTLE_OPS_READ }, { THROTTLE_OPS_TOTAL, THROTTLE_OPS_WRITE } }; @@ -486,7 +487,7 @@ void throttle_account(ThrottleState *ts, ThrottleDirection direction, units = (double) size / ts->cfg.op_size; } -for (i = 0; i < 2; i++) { +for (i = 0; i < ARRAY_SIZE(bucket_types_size[THROTTLE_READ]); i++) { LeakyBucket *bkt; bkt = &ts->cfg.buckets[bucket_types_size[direction][i]]; -- 2.41.0
[PULL 10/14] file-posix: Clear bs->bl.zoned on error
bs->bl.zoned is what indicates whether the zone information is present and valid; it is the only thing that raw_refresh_zoned_limits() sets if CONFIG_BLKZONED is not defined, and it is also the only thing that it sets if CONFIG_BLKZONED is defined, but there are no zones. Make sure that it is always set to BLK_Z_NONE if there is an error anywhere in raw_refresh_zoned_limits() so that we do not accidentally announce zones while our information is incomplete or invalid. This also fixes a memory leak in the last error path in raw_refresh_zoned_limits(). Signed-off-by: Hanna Czenczek Message-Id: <20230824155345.109765-2-hre...@redhat.com> Reviewed-by: Sam Li --- block/file-posix.c | 21 - 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index b16e9c21a1..2b88b9eefa 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -1412,11 +1412,9 @@ static void raw_refresh_zoned_limits(BlockDriverState *bs, struct stat *st, BlockZoneModel zoned; int ret; -bs->bl.zoned = BLK_Z_NONE; - ret = get_sysfs_zoned_model(st, &zoned); if (ret < 0 || zoned == BLK_Z_NONE) { -return; +goto no_zoned; } bs->bl.zoned = zoned; @@ -1437,10 +1435,10 @@ static void raw_refresh_zoned_limits(BlockDriverState *bs, struct stat *st, if (ret < 0) { error_setg_errno(errp, -ret, "Unable to read chunk_sectors " "sysfs attribute"); -return; +goto no_zoned; } else if (!ret) { error_setg(errp, "Read 0 from chunk_sectors sysfs attribute"); -return; +goto no_zoned; } bs->bl.zone_size = ret << BDRV_SECTOR_BITS; @@ -1448,10 +1446,10 @@ static void raw_refresh_zoned_limits(BlockDriverState *bs, struct stat *st, if (ret < 0) { error_setg_errno(errp, -ret, "Unable to read nr_zones " "sysfs attribute"); -return; +goto no_zoned; } else if (!ret) { error_setg(errp, "Read 0 from nr_zones sysfs attribute"); -return; +goto no_zoned; } bs->bl.nr_zones = ret; @@ -1472,10 +1470,15 @@ static void raw_refresh_zoned_limits(BlockDriverState *bs, struct stat *st, ret = get_zones_wp(bs, s->fd, 0, bs->bl.nr_zones, 0); if (ret < 0) { error_setg_errno(errp, -ret, "report wps failed"); -bs->wps = NULL; -return; +goto no_zoned; } qemu_co_mutex_init(&bs->wps->colock); +return; + +no_zoned: +bs->bl.zoned = BLK_Z_NONE; +g_free(bs->wps); +bs->wps = NULL; } #else /* !defined(CONFIG_BLKZONED) */ static void raw_refresh_zoned_limits(BlockDriverState *bs, struct stat *st, -- 2.41.0
[PULL 06/14] throttle: use enum ThrottleDirection instead of bool is_write
From: zhenwei pi enum ThrottleDirection is already there, use ThrottleDirection instead of 'bool is_write' for throttle API, also modify related codes from block, fsdev, cryptodev and tests. Reviewed-by: Hanna Czenczek Signed-off-by: zhenwei pi Message-Id: <20230728022006.1098509-7-pizhen...@bytedance.com> Signed-off-by: Hanna Czenczek --- include/qemu/throttle.h | 5 +++-- backends/cryptodev.c| 9 + block/throttle-groups.c | 6 -- fsdev/qemu-fsdev-throttle.c | 8 +--- tests/unit/test-throttle.c | 4 ++-- util/throttle.c | 31 +-- 6 files changed, 36 insertions(+), 27 deletions(-) diff --git a/include/qemu/throttle.h b/include/qemu/throttle.h index 9ca5ab8197..181245d29b 100644 --- a/include/qemu/throttle.h +++ b/include/qemu/throttle.h @@ -154,9 +154,10 @@ void throttle_config_init(ThrottleConfig *cfg); /* usage */ bool throttle_schedule_timer(ThrottleState *ts, ThrottleTimers *tt, - bool is_write); + ThrottleDirection direction); -void throttle_account(ThrottleState *ts, bool is_write, uint64_t size); +void throttle_account(ThrottleState *ts, ThrottleDirection direction, + uint64_t size); void throttle_limits_to_config(ThrottleLimits *arg, ThrottleConfig *cfg, Error **errp); void throttle_config_to_limits(ThrottleConfig *cfg, ThrottleLimits *var); diff --git a/backends/cryptodev.c b/backends/cryptodev.c index c2356550c8..e5006bd215 100644 --- a/backends/cryptodev.c +++ b/backends/cryptodev.c @@ -252,10 +252,11 @@ static void cryptodev_backend_throttle_timer_cb(void *opaque) continue; } -throttle_account(&backend->ts, true, ret); +throttle_account(&backend->ts, THROTTLE_WRITE, ret); cryptodev_backend_operation(backend, op_info); if (throttle_enabled(&backend->tc) && -throttle_schedule_timer(&backend->ts, &backend->tt, true)) { +throttle_schedule_timer(&backend->ts, &backend->tt, +THROTTLE_WRITE)) { break; } } @@ -271,7 +272,7 @@ int cryptodev_backend_crypto_operation( goto do_account; } -if (throttle_schedule_timer(&backend->ts, &backend->tt, true) || +if (throttle_schedule_timer(&backend->ts, &backend->tt, THROTTLE_WRITE) || !QTAILQ_EMPTY(&backend->opinfos)) { QTAILQ_INSERT_TAIL(&backend->opinfos, op_info, next); return 0; @@ -283,7 +284,7 @@ do_account: return ret; } -throttle_account(&backend->ts, true, ret); +throttle_account(&backend->ts, THROTTLE_WRITE, ret); return cryptodev_backend_operation(backend, op_info); } diff --git a/block/throttle-groups.c b/block/throttle-groups.c index fb203c3ced..3847d27801 100644 --- a/block/throttle-groups.c +++ b/block/throttle-groups.c @@ -270,6 +270,7 @@ static bool throttle_group_schedule_timer(ThrottleGroupMember *tgm, ThrottleState *ts = tgm->throttle_state; ThrottleGroup *tg = container_of(ts, ThrottleGroup, ts); ThrottleTimers *tt = &tgm->throttle_timers; +ThrottleDirection direction = is_write ? THROTTLE_WRITE : THROTTLE_READ; bool must_wait; if (qatomic_read(&tgm->io_limits_disabled)) { @@ -281,7 +282,7 @@ static bool throttle_group_schedule_timer(ThrottleGroupMember *tgm, return true; } -must_wait = throttle_schedule_timer(ts, tt, is_write); +must_wait = throttle_schedule_timer(ts, tt, direction); /* If a timer just got armed, set tgm as the current token */ if (must_wait) { @@ -364,6 +365,7 @@ void coroutine_fn throttle_group_co_io_limits_intercept(ThrottleGroupMember *tgm bool must_wait; ThrottleGroupMember *token; ThrottleGroup *tg = container_of(tgm->throttle_state, ThrottleGroup, ts); +ThrottleDirection direction = is_write ? THROTTLE_WRITE : THROTTLE_READ; assert(bytes >= 0); @@ -386,7 +388,7 @@ void coroutine_fn throttle_group_co_io_limits_intercept(ThrottleGroupMember *tgm } /* The I/O will be executed, so do the accounting */ -throttle_account(tgm->throttle_state, is_write, bytes); +throttle_account(tgm->throttle_state, direction, bytes); /* Schedule the next request */ schedule_next_request(tgm, is_write); diff --git a/fsdev/qemu-fsdev-throttle.c b/fsdev/qemu-fsdev-throttle.c index 5c83a1cc09..1c137d6f0f 100644 --- a/fsdev/qemu-fsdev-throttle.c +++ b/fsdev/qemu-fsdev-throttle.c @@ -97,16 +97,18 @@ void fsdev_throttle_init(FsThrottle *fst) void coroutine_fn fsdev_co_throttle_request(FsThrottle *fst, bool is_write, struct iovec *iov, int iovcnt) { +ThrottleDirection direction = is_write ? THROTTLE_WRITE : THROTTLE_READ; + if (throttle_enabled(&fst->cfg)) { -if (throttle_schedule_timer(&fst->ts, &fst->tt, is_write) || +if (throttle_schedule
[PULL 11/14] file-posix: Check bs->bl.zoned for zone info
Instead of checking bs->wps or bs->bl.zone_size for whether zone information is present, check bs->bl.zoned. That is the flag that raw_refresh_zoned_limits() reliably sets to indicate zone support. If it is set to something other than BLK_Z_NONE, other values and objects like bs->wps and bs->bl.zone_size must be non-null/zero and valid; if it is not, we cannot rely on their validity. Signed-off-by: Hanna Czenczek Message-Id: <20230824155345.109765-3-hre...@redhat.com> Reviewed-by: Sam Li --- block/file-posix.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index 2b88b9eefa..46e22403fe 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2455,9 +2455,10 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset, if (fd_open(bs) < 0) return -EIO; #if defined(CONFIG_BLKZONED) -if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && bs->wps) { +if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && +bs->bl.zoned != BLK_Z_NONE) { qemu_co_mutex_lock(&bs->wps->colock); -if (type & QEMU_AIO_ZONE_APPEND && bs->bl.zone_size) { +if (type & QEMU_AIO_ZONE_APPEND) { int index = offset / bs->bl.zone_size; offset = bs->wps->wp[index]; } @@ -2508,8 +2509,8 @@ out: { BlockZoneWps *wps = bs->wps; if (ret == 0) { -if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) -&& wps && bs->bl.zone_size) { +if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && +bs->bl.zoned != BLK_Z_NONE) { uint64_t *wp = &wps->wp[offset / bs->bl.zone_size]; if (!BDRV_ZT_IS_CONV(*wp)) { if (type & QEMU_AIO_ZONE_APPEND) { @@ -2529,7 +2530,8 @@ out: } } -if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && wps) { +if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && +bs->bl.zoned != BLK_Z_NONE) { qemu_co_mutex_unlock(&wps->colock); } } -- 2.41.0
[PULL 14/14] tests/file-io-error: New test
This is a regression test for https://bugzilla.redhat.com/show_bug.cgi?id=2234374. All this test needs to do is trigger an I/O error inside of file-posix (specifically raw_co_prw()). One reliable way to do this without requiring special privileges is to use a FUSE export, which allows us to inject any error that we want, e.g. via blkdebug. Signed-off-by: Hanna Czenczek Message-Id: <20230824155345.109765-6-hre...@redhat.com> [hreitz: Fixed test to be skipped when there is no FUSE support, to suppress fusermount's allow_other warning, and to be skipped with $IMGOPTSSYNTAX enabled] Signed-off-by: Hanna Czenczek --- tests/qemu-iotests/tests/file-io-error | 119 + tests/qemu-iotests/tests/file-io-error.out | 33 ++ 2 files changed, 152 insertions(+) create mode 100755 tests/qemu-iotests/tests/file-io-error create mode 100644 tests/qemu-iotests/tests/file-io-error.out diff --git a/tests/qemu-iotests/tests/file-io-error b/tests/qemu-iotests/tests/file-io-error new file mode 100755 index 00..88ee5f670c --- /dev/null +++ b/tests/qemu-iotests/tests/file-io-error @@ -0,0 +1,119 @@ +#!/usr/bin/env bash +# group: rw +# +# Produce an I/O error in file-posix, and hope that it is not catastrophic. +# Regression test for: https://bugzilla.redhat.com/show_bug.cgi?id=2234374 +# +# Copyright (C) 2023 Red Hat, Inc. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# + +seq=$(basename "$0") +echo "QA output created by $seq" + +status=1 # failure is the default! + +_cleanup() +{ +_cleanup_qemu +rm -f "$TEST_DIR/fuse-export" +} +trap "_cleanup; exit \$status" 0 1 2 3 15 + +# get standard environment, filters and checks +. ../common.rc +. ../common.filter +. ../common.qemu + +# Format-agnostic (we do not use any), but we do test the file protocol +_supported_proto file +_require_drivers blkdebug null-co + +if [ "$IMGOPTSSYNTAX" = "true" ]; then +# We need `$QEMU_IO -f file` to work; IMGOPTSSYNTAX uses --image-opts, +# breaking -f. +_unsupported_fmt $IMGFMT +fi + +# This is a regression test of a bug in which file-posix would access zone +# information in case of an I/O error even when there is no zone information, +# resulting in a division by zero. +# To reproduce the problem, we need to trigger an I/O error inside of +# file-posix, which can be done (rootless) by providing a FUSE export that +# presents only errors when accessed. + +_launch_qemu +_send_qemu_cmd $QEMU_HANDLE \ +"{'execute': 'qmp_capabilities'}" \ +'return' + +_send_qemu_cmd $QEMU_HANDLE \ +"{'execute': 'blockdev-add', + 'arguments': { + 'driver': 'blkdebug', + 'node-name': 'node0', + 'inject-error': [{'event': 'none'}], + 'image': { + 'driver': 'null-co' + } + }}" \ +'return' + +# FUSE mountpoint must exist and be a regular file +touch "$TEST_DIR/fuse-export" + +# The grep -v to filter fusermount's (benign) error when /etc/fuse.conf does +# not contain user_allow_other and the subsequent check for missing FUSE support +# have both been taken from iotest 308. 
+output=$(_send_qemu_cmd $QEMU_HANDLE \ +"{'execute': 'block-export-add', + 'arguments': { + 'id': 'exp0', + 'type': 'fuse', + 'node-name': 'node0', + 'mountpoint': '$TEST_DIR/fuse-export', + 'writable': true + }}" \ +'return' \ +| grep -v 'option allow_other only allowed if') + +if echo "$output" | grep -q "Parameter 'type' does not accept value 'fuse'"; then +_notrun 'No FUSE support' +fi +echo "$output" + +echo +# This should fail, but gracefully, i.e. just print an I/O error, not crash. +$QEMU_IO -f file -c 'write 0 64M' "$TEST_DIR/fuse-export" | _filter_qemu_io +echo + +_send_qemu_cmd $QEMU_HANDLE \ +"{'execute': 'block-export-del', + 'arguments': {'id': 'exp0'}}" \ +'return' + +_send_qemu_cmd $QEMU_HANDLE \ +'' \ +'BLOCK_EXPORT_DELETED' + +_send_qemu_cmd $QEMU_HANDLE \ +"{'execute': 'blockdev-del', + 'arguments': {'node-name': 'node0'}}" \ +'return' + +# success, all done +echo "*** done" +rm -f $seq.full +status=0 diff --git a/tests/qemu-iotests/tests/fi
[PULL 12/14] file-posix: Fix zone update in I/O error path
We must check that zone information is present before running update_zones_wp(). Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2234374 Fixes: Coverity CID 1512459 Signed-off-by: Hanna Czenczek Message-Id: <20230824155345.109765-4-hre...@redhat.com> Reviewed-by: Sam Li --- block/file-posix.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/block/file-posix.c b/block/file-posix.c index 46e22403fe..a050682e97 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2525,7 +2525,8 @@ out: } } } else { -if (type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) { +if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && +bs->bl.zoned != BLK_Z_NONE) { update_zones_wp(bs, s->fd, 0, 1); } } -- 2.41.0
[PULL 08/14] fsdev: Use ThrottleDirection instread of bool is_write
From: zhenwei pi 'bool is_write' style is obsolete from throttle framework, adapt fsdev to the new style. Cc: Greg Kurz Reviewed-by: Hanna Czenczek Signed-off-by: zhenwei pi Message-Id: <20230728022006.1098509-9-pizhen...@bytedance.com> Reviewed-by: Greg Kurz Signed-off-by: Hanna Czenczek --- fsdev/qemu-fsdev-throttle.h | 4 ++-- fsdev/qemu-fsdev-throttle.c | 14 +++--- hw/9pfs/cofile.c| 4 ++-- 3 files changed, 11 insertions(+), 11 deletions(-) diff --git a/fsdev/qemu-fsdev-throttle.h b/fsdev/qemu-fsdev-throttle.h index a21aecddc7..daa8ca2494 100644 --- a/fsdev/qemu-fsdev-throttle.h +++ b/fsdev/qemu-fsdev-throttle.h @@ -23,14 +23,14 @@ typedef struct FsThrottle { ThrottleState ts; ThrottleTimers tt; ThrottleConfig cfg; -CoQueue throttled_reqs[2]; +CoQueue throttled_reqs[THROTTLE_MAX]; } FsThrottle; int fsdev_throttle_parse_opts(QemuOpts *, FsThrottle *, Error **); void fsdev_throttle_init(FsThrottle *); -void coroutine_fn fsdev_co_throttle_request(FsThrottle *, bool , +void coroutine_fn fsdev_co_throttle_request(FsThrottle *, ThrottleDirection , struct iovec *, int); void fsdev_throttle_cleanup(FsThrottle *); diff --git a/fsdev/qemu-fsdev-throttle.c b/fsdev/qemu-fsdev-throttle.c index 1c137d6f0f..d912da906d 100644 --- a/fsdev/qemu-fsdev-throttle.c +++ b/fsdev/qemu-fsdev-throttle.c @@ -94,22 +94,22 @@ void fsdev_throttle_init(FsThrottle *fst) } } -void coroutine_fn fsdev_co_throttle_request(FsThrottle *fst, bool is_write, +void coroutine_fn fsdev_co_throttle_request(FsThrottle *fst, +ThrottleDirection direction, struct iovec *iov, int iovcnt) { -ThrottleDirection direction = is_write ? THROTTLE_WRITE : THROTTLE_READ; - +assert(direction < THROTTLE_MAX); if (throttle_enabled(&fst->cfg)) { if (throttle_schedule_timer(&fst->ts, &fst->tt, direction) || -!qemu_co_queue_empty(&fst->throttled_reqs[is_write])) { -qemu_co_queue_wait(&fst->throttled_reqs[is_write], NULL); +!qemu_co_queue_empty(&fst->throttled_reqs[direction])) { +qemu_co_queue_wait(&fst->throttled_reqs[direction], NULL); } throttle_account(&fst->ts, direction, iov_size(iov, iovcnt)); -if (!qemu_co_queue_empty(&fst->throttled_reqs[is_write]) && +if (!qemu_co_queue_empty(&fst->throttled_reqs[direction]) && !throttle_schedule_timer(&fst->ts, &fst->tt, direction)) { -qemu_co_queue_next(&fst->throttled_reqs[is_write]); +qemu_co_queue_next(&fst->throttled_reqs[direction]); } } } diff --git a/hw/9pfs/cofile.c b/hw/9pfs/cofile.c index 9c5344039e..71174c3e4a 100644 --- a/hw/9pfs/cofile.c +++ b/hw/9pfs/cofile.c @@ -252,7 +252,7 @@ int coroutine_fn v9fs_co_pwritev(V9fsPDU *pdu, V9fsFidState *fidp, if (v9fs_request_cancelled(pdu)) { return -EINTR; } -fsdev_co_throttle_request(s->ctx.fst, true, iov, iovcnt); +fsdev_co_throttle_request(s->ctx.fst, THROTTLE_WRITE, iov, iovcnt); v9fs_co_run_in_worker( { err = s->ops->pwritev(&s->ctx, &fidp->fs, iov, iovcnt, offset); @@ -272,7 +272,7 @@ int coroutine_fn v9fs_co_preadv(V9fsPDU *pdu, V9fsFidState *fidp, if (v9fs_request_cancelled(pdu)) { return -EINTR; } -fsdev_co_throttle_request(s->ctx.fst, false, iov, iovcnt); +fsdev_co_throttle_request(s->ctx.fst, THROTTLE_READ, iov, iovcnt); v9fs_co_run_in_worker( { err = s->ops->preadv(&s->ctx, &fidp->fs, iov, iovcnt, offset); -- 2.41.0
[PULL 05/14] cryptodev: use NULL throttle timer cb for read direction
From: zhenwei pi Operations on a cryptodev are considered as *write* only, the callback of read direction is never invoked. Use NULL instead of an unreachable path(cryptodev_backend_throttle_timer_cb on read direction). The dummy read timer(never invoked) is already removed here, it means that the 'FIXME' tag is no longer needed. Reviewed-by: Alberto Garcia Reviewed-by: Hanna Czenczek Signed-off-by: zhenwei pi Message-Id: <20230728022006.1098509-6-pizhen...@bytedance.com> Signed-off-by: Hanna Czenczek --- backends/cryptodev.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/backends/cryptodev.c b/backends/cryptodev.c index 4d183f7237..c2356550c8 100644 --- a/backends/cryptodev.c +++ b/backends/cryptodev.c @@ -341,8 +341,7 @@ static void cryptodev_backend_set_throttle(CryptoDevBackend *backend, int field, if (!enabled) { throttle_init(&backend->ts); throttle_timers_init(&backend->tt, qemu_get_aio_context(), - QEMU_CLOCK_REALTIME, - cryptodev_backend_throttle_timer_cb, /* FIXME */ + QEMU_CLOCK_REALTIME, NULL, cryptodev_backend_throttle_timer_cb, backend); } -- 2.41.0
[PULL 01/14] throttle: introduce enum ThrottleDirection
From: zhenwei pi Use enum ThrottleDirection instead of number index. Reviewed-by: Alberto Garcia Reviewed-by: Hanna Czenczek Signed-off-by: zhenwei pi Message-Id: <20230728022006.1098509-2-pizhen...@bytedance.com> Signed-off-by: Hanna Czenczek --- include/qemu/throttle.h | 11 --- util/throttle.c | 16 +--- 2 files changed, 17 insertions(+), 10 deletions(-) diff --git a/include/qemu/throttle.h b/include/qemu/throttle.h index 05f6346137..9ca5ab8197 100644 --- a/include/qemu/throttle.h +++ b/include/qemu/throttle.h @@ -99,13 +99,18 @@ typedef struct ThrottleState { int64_t previous_leak;/* timestamp of the last leak done */ } ThrottleState; +typedef enum { +THROTTLE_READ = 0, +THROTTLE_WRITE, +THROTTLE_MAX +} ThrottleDirection; + typedef struct ThrottleTimers { -QEMUTimer *timers[2]; /* timers used to do the throttling */ +QEMUTimer *timers[THROTTLE_MAX];/* timers used to do the throttling */ QEMUClockType clock_type; /* the clock used */ /* Callbacks */ -QEMUTimerCB *read_timer_cb; -QEMUTimerCB *write_timer_cb; +QEMUTimerCB *timer_cb[THROTTLE_MAX]; void *timer_opaque; } ThrottleTimers; diff --git a/util/throttle.c b/util/throttle.c index 81f247a8d1..5642e61763 100644 --- a/util/throttle.c +++ b/util/throttle.c @@ -199,10 +199,12 @@ static bool throttle_compute_timer(ThrottleState *ts, void throttle_timers_attach_aio_context(ThrottleTimers *tt, AioContext *new_context) { -tt->timers[0] = aio_timer_new(new_context, tt->clock_type, SCALE_NS, - tt->read_timer_cb, tt->timer_opaque); -tt->timers[1] = aio_timer_new(new_context, tt->clock_type, SCALE_NS, - tt->write_timer_cb, tt->timer_opaque); +tt->timers[THROTTLE_READ] = +aio_timer_new(new_context, tt->clock_type, SCALE_NS, + tt->timer_cb[THROTTLE_READ], tt->timer_opaque); +tt->timers[THROTTLE_WRITE] = +aio_timer_new(new_context, tt->clock_type, SCALE_NS, + tt->timer_cb[THROTTLE_WRITE], tt->timer_opaque); } /* @@ -236,8 +238,8 @@ void throttle_timers_init(ThrottleTimers *tt, memset(tt, 0, sizeof(ThrottleTimers)); tt->clock_type = clock_type; -tt->read_timer_cb = read_timer_cb; -tt->write_timer_cb = write_timer_cb; +tt->timer_cb[THROTTLE_READ] = read_timer_cb; +tt->timer_cb[THROTTLE_WRITE] = write_timer_cb; tt->timer_opaque = timer_opaque; throttle_timers_attach_aio_context(tt, aio_context); } @@ -256,7 +258,7 @@ void throttle_timers_detach_aio_context(ThrottleTimers *tt) { int i; -for (i = 0; i < 2; i++) { +for (i = 0; i < THROTTLE_MAX; i++) { throttle_timer_destroy(&tt->timers[i]); } } -- 2.41.0
[PULL 04/14] test-throttle: test read only and write only
From: zhenwei pi Reviewed-by: Alberto Garcia Reviewed-by: Hanna Czenczek Signed-off-by: zhenwei pi Message-Id: <20230728022006.1098509-5-pizhen...@bytedance.com> Signed-off-by: Hanna Czenczek --- tests/unit/test-throttle.c | 66 ++ 1 file changed, 66 insertions(+) diff --git a/tests/unit/test-throttle.c b/tests/unit/test-throttle.c index a60b5fe22e..5547837a58 100644 --- a/tests/unit/test-throttle.c +++ b/tests/unit/test-throttle.c @@ -184,6 +184,70 @@ static void test_init(void) throttle_timers_destroy(tt); } +static void test_init_readonly(void) +{ +int i; + +tt = &tgm.throttle_timers; + +/* fill the structures with crap */ +memset(&ts, 1, sizeof(ts)); +memset(tt, 1, sizeof(*tt)); + +/* init structures */ +throttle_init(&ts); +throttle_timers_init(tt, ctx, QEMU_CLOCK_VIRTUAL, + read_timer_cb, NULL, &ts); + +/* check initialized fields */ +g_assert(tt->clock_type == QEMU_CLOCK_VIRTUAL); +g_assert(tt->timers[THROTTLE_READ]); +g_assert(!tt->timers[THROTTLE_WRITE]); + +/* check other fields where cleared */ +g_assert(!ts.previous_leak); +g_assert(!ts.cfg.op_size); +for (i = 0; i < BUCKETS_COUNT; i++) { +g_assert(!ts.cfg.buckets[i].avg); +g_assert(!ts.cfg.buckets[i].max); +g_assert(!ts.cfg.buckets[i].level); +} + +throttle_timers_destroy(tt); +} + +static void test_init_writeonly(void) +{ +int i; + +tt = &tgm.throttle_timers; + +/* fill the structures with crap */ +memset(&ts, 1, sizeof(ts)); +memset(tt, 1, sizeof(*tt)); + +/* init structures */ +throttle_init(&ts); +throttle_timers_init(tt, ctx, QEMU_CLOCK_VIRTUAL, + NULL, write_timer_cb, &ts); + +/* check initialized fields */ +g_assert(tt->clock_type == QEMU_CLOCK_VIRTUAL); +g_assert(!tt->timers[THROTTLE_READ]); +g_assert(tt->timers[THROTTLE_WRITE]); + +/* check other fields where cleared */ +g_assert(!ts.previous_leak); +g_assert(!ts.cfg.op_size); +for (i = 0; i < BUCKETS_COUNT; i++) { +g_assert(!ts.cfg.buckets[i].avg); +g_assert(!ts.cfg.buckets[i].max); +g_assert(!ts.cfg.buckets[i].level); +} + +throttle_timers_destroy(tt); +} + static void test_destroy(void) { int i; @@ -752,6 +816,8 @@ int main(int argc, char **argv) g_test_add_func("/throttle/leak_bucket",test_leak_bucket); g_test_add_func("/throttle/compute_wait", test_compute_wait); g_test_add_func("/throttle/init", test_init); +g_test_add_func("/throttle/init_readonly", test_init_readonly); +g_test_add_func("/throttle/init_writeonly", test_init_writeonly); g_test_add_func("/throttle/destroy",test_destroy); g_test_add_func("/throttle/have_timer", test_have_timer); g_test_add_func("/throttle/detach_attach", test_detach_attach); -- 2.41.0
[PULL 02/14] test-throttle: use enum ThrottleDirection
From: zhenwei pi

Use enum ThrottleDirection in the throttle test code as well.

Reviewed-by: Alberto Garcia
Reviewed-by: Hanna Czenczek
Signed-off-by: zhenwei pi
Message-Id: <20230728022006.1098509-3-pizhen...@bytedance.com>
Signed-off-by: Hanna Czenczek
---
 tests/unit/test-throttle.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tests/unit/test-throttle.c b/tests/unit/test-throttle.c
index 7adb5e6652..a60b5fe22e 100644
--- a/tests/unit/test-throttle.c
+++ b/tests/unit/test-throttle.c
@@ -169,8 +169,8 @@ static void test_init(void)
 
     /* check initialized fields */
     g_assert(tt->clock_type == QEMU_CLOCK_VIRTUAL);
-    g_assert(tt->timers[0]);
-    g_assert(tt->timers[1]);
+    g_assert(tt->timers[THROTTLE_READ]);
+    g_assert(tt->timers[THROTTLE_WRITE]);
 
     /* check other fields where cleared */
     g_assert(!ts.previous_leak);
@@ -191,7 +191,7 @@ static void test_destroy(void)
     throttle_timers_init(tt, ctx, QEMU_CLOCK_VIRTUAL,
                          read_timer_cb, write_timer_cb, &ts);
     throttle_timers_destroy(tt);
-    for (i = 0; i < 2; i++) {
+    for (i = 0; i < THROTTLE_MAX; i++) {
         g_assert(!tt->timers[i]);
     }
 }
--
2.41.0
[PULL 00/14] Block patches
The following changes since commit f5fe7c17ac4e309e47e78f0f9761aebc8d2f2c81:

  Merge tag 'pull-tcg-20230823-2' of https://gitlab.com/rth7680/qemu into staging (2023-08-28 16:07:04 -0400)

are available in the Git repository at:

  https://gitlab.com/hreitz/qemu.git tags/pull-block-2023-09-01

for you to fetch changes up to 380448464dd89291cf7fd7434be6c225482a334d:

  tests/file-io-error: New test (2023-08-29 13:01:24 +0200)

----------------------------------------------------------------
Block patches

- Fix for file-posix's zoning code crashing on I/O errors
- Throttling refactoring

----------------------------------------------------------------
Hanna Czenczek (5):
  file-posix: Clear bs->bl.zoned on error
  file-posix: Check bs->bl.zoned for zone info
  file-posix: Fix zone update in I/O error path
  file-posix: Simplify raw_co_prw's 'out' zone code
  tests/file-io-error: New test

Zhenwei Pi (9):
  throttle: introduce enum ThrottleDirection
  test-throttle: use enum ThrottleDirection
  throttle: support read-only and write-only
  test-throttle: test read only and write only
  cryptodev: use NULL throttle timer cb for read direction
  throttle: use enum ThrottleDirection instead of bool is_write
  throttle: use THROTTLE_MAX/ARRAY_SIZE for hard code
  fsdev: Use ThrottleDirection instread of bool is_write
  block/throttle-groups: Use ThrottleDirection instread of bool is_write

 fsdev/qemu-fsdev-throttle.h                |   4 +-
 include/block/throttle-groups.h            |   6 +-
 include/qemu/throttle.h                    |  16 +-
 backends/cryptodev.c                       |  12 +-
 block/block-backend.c                      |   4 +-
 block/file-posix.c                         |  42 +++---
 block/throttle-groups.c                    | 163 +++++++++++++--------
 block/throttle.c                           |   8 +-
 fsdev/qemu-fsdev-throttle.c                |  18 ++-
 hw/9pfs/cofile.c                           |   4 +-
 tests/unit/test-throttle.c                 |  76 +++++++++-
 util/throttle.c                            |  84 ++++++-----
 tests/qemu-iotests/tests/file-io-error     | 119 ++++++++++++++++
 tests/qemu-iotests/tests/file-io-error.out |  33 +++++
 14 files changed, 418 insertions(+), 171 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/file-io-error
 create mode 100644 tests/qemu-iotests/tests/file-io-error.out
--
2.41.0
[PULL 03/14] throttle: support read-only and write-only
From: zhenwei pi

Only one direction is necessary in several scenarios:
- a read-only disk
- operations on a device that are considered *write* only. For example,
  the encrypt/decrypt/sign/verify operations on a cryptodev use a single
  *write* timer (a read timer callback is defined, but never invoked).

Allow a single direction in throttle. This reduces memory, and the upper
layer no longer needs a dummy callback.

Reviewed-by: Alberto Garcia
Reviewed-by: Hanna Czenczek
Signed-off-by: zhenwei pi
Message-Id: <20230728022006.1098509-4-pizhen...@bytedance.com>
Signed-off-by: Hanna Czenczek
---
 util/throttle.c | 42 ++++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

diff --git a/util/throttle.c b/util/throttle.c
index 5642e61763..0439028d21 100644
--- a/util/throttle.c
+++ b/util/throttle.c
@@ -199,12 +199,15 @@ static bool throttle_compute_timer(ThrottleState *ts,
 void throttle_timers_attach_aio_context(ThrottleTimers *tt,
                                         AioContext *new_context)
 {
-    tt->timers[THROTTLE_READ] =
-        aio_timer_new(new_context, tt->clock_type, SCALE_NS,
-                      tt->timer_cb[THROTTLE_READ], tt->timer_opaque);
-    tt->timers[THROTTLE_WRITE] =
-        aio_timer_new(new_context, tt->clock_type, SCALE_NS,
-                      tt->timer_cb[THROTTLE_WRITE], tt->timer_opaque);
+    ThrottleDirection dir;
+
+    for (dir = THROTTLE_READ; dir < THROTTLE_MAX; dir++) {
+        if (tt->timer_cb[dir]) {
+            tt->timers[dir] =
+                aio_timer_new(new_context, tt->clock_type, SCALE_NS,
+                              tt->timer_cb[dir], tt->timer_opaque);
+        }
+    }
 }
 
 /*
@@ -235,6 +238,7 @@ void throttle_timers_init(ThrottleTimers *tt,
                           QEMUTimerCB *write_timer_cb,
                           void *timer_opaque)
 {
+    assert(read_timer_cb || write_timer_cb);
     memset(tt, 0, sizeof(ThrottleTimers));
 
     tt->clock_type = clock_type;
@@ -247,7 +251,9 @@ void throttle_timers_init(ThrottleTimers *tt,
 /* destroy a timer */
 static void throttle_timer_destroy(QEMUTimer **timer)
 {
-    assert(*timer != NULL);
+    if (*timer == NULL) {
+        return;
+    }
 
     timer_free(*timer);
     *timer = NULL;
@@ -256,10 +262,10 @@ static void throttle_timer_destroy(QEMUTimer **timer)
 /* Remove timers from event loop */
 void throttle_timers_detach_aio_context(ThrottleTimers *tt)
 {
-    int i;
+    ThrottleDirection dir;
 
-    for (i = 0; i < THROTTLE_MAX; i++) {
-        throttle_timer_destroy(&tt->timers[i]);
+    for (dir = THROTTLE_READ; dir < THROTTLE_MAX; dir++) {
+        throttle_timer_destroy(&tt->timers[dir]);
     }
 }
 
@@ -272,8 +278,12 @@ void throttle_timers_destroy(ThrottleTimers *tt)
 /* is any throttling timer configured */
 bool throttle_timers_are_initialized(ThrottleTimers *tt)
 {
-    if (tt->timers[0]) {
-        return true;
+    ThrottleDirection dir;
+
+    for (dir = THROTTLE_READ; dir < THROTTLE_MAX; dir++) {
+        if (tt->timers[dir]) {
+            return true;
+        }
     }
 
     return false;
@@ -424,8 +434,12 @@ bool throttle_schedule_timer(ThrottleState *ts,
 {
     int64_t now = qemu_clock_get_ns(tt->clock_type);
     int64_t next_timestamp;
+    QEMUTimer *timer;
     bool must_wait;
 
+    timer = is_write ? tt->timers[THROTTLE_WRITE] : tt->timers[THROTTLE_READ];
+    assert(timer);
+
     must_wait = throttle_compute_timer(ts,
                                        is_write,
                                        now,
@@ -437,12 +451,12 @@ bool throttle_schedule_timer(ThrottleState *ts,
     }
 
     /* request throttled and timer pending -> do nothing */
-    if (timer_pending(tt->timers[is_write])) {
+    if (timer_pending(timer)) {
         return true;
     }
 
     /* request throttled and timer not pending -> arm timer */
-    timer_mod(tt->timers[is_write], next_timestamp);
+    timer_mod(timer, next_timestamp);
     return true;
 }
--
2.41.0
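As a usage sketch (not taken from this series; the callback name is invented), a write-only user such as the cryptodev case described above would now simply pass NULL for the read callback:

    #include "qemu/osdep.h"
    #include "qemu/throttle.h"

    static void my_write_timer_cb(void *opaque)
    {
        /* restart throttled write requests here */
    }

    /* With this patch, only the write timer is created; the read slot
     * stays NULL, and no dummy read callback is needed any more. */
    static void init_write_only(ThrottleTimers *tt, ThrottleState *ts,
                                AioContext *ctx, void *opaque)
    {
        throttle_init(ts);
        throttle_timers_init(tt, ctx, QEMU_CLOCK_VIRTUAL,
                             NULL, my_write_timer_cb, opaque);
    }

The assert() added to throttle_schedule_timer() is the flip side of this: a caller that registered only a write callback must never schedule the read timer.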
Re: [PATCH v2 2/3] qemu-img: map: report compressed data blocks
On 29.08.23 08:44, Andrey Drobyshev wrote:
> On 8/25/23 17:14, Hanna Czenczek wrote:
>> On 06.07.23 18:30, Andrey Drobyshev wrote:
>>> Right now "qemu-img map" reports compressed blocks as containing data
>>> but having no host offset. This is not very informative. Instead,
>>> let's add another boolean field named "compressed" in case JSON output
>>> mode is specified. This is achieved by utilizing the new allocation
>>> status flag BDRV_BLOCK_COMPRESSED for bdrv_block_status().
>>>
>>> Signed-off-by: Andrey Drobyshev
>>> ---
>>>  qapi/block-core.json |  7 +++++--
>>>  qemu-img.c           | 16 +++++++++++---
>>>  2 files changed, 18 insertions(+), 5 deletions(-)
>>
>> Patch 3 must be merged into this patch. Every test must pass on every
>> commit so we don't break bisecting.
>
> Agreed, should've figured that myself.
>
>>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>>> index 5dd5f7e4b0..b263d2cd30 100644
>>> --- a/qapi/block-core.json
>>> +++ b/qapi/block-core.json
>>> @@ -409,6 +409,9 @@
>>>  #
>>>  # @zero: whether the virtual blocks read as zeroes
>>>  #
>>> +# @compressed: true indicates that data is stored compressed. Optional,
>>> +#     only valid for the formats whith support compression
>>
>> This is missing information for when this field was introduced (i.e. a
>> "(since 8.2)").
>
> Noted.
>
>> I also wonder why this field is optional. We have compression
>> information even for formats that don't support compression,
>> specifically, nothing is compressed. I would just make this field
>> mandatory and print it always.
>>
>> (A technical reason to do so is that this patch uses
>> block_driver_can_compress() to figure out whether there is compression
>> support; but that function only tells whether the driver can write
>> compressed data. Even if it cannot do that, the format may still
>> support compression, and the driver may be able to read compressed
>> data, just not write it.)
>
> I figured that for the formats which surely can't support compression,
> such as "raw", this information would simply be redundant. AFAICT right
> now the only drivers which can read compressed data are exactly the
> ones which can write it: vmdk, qcow and qcow2. But if we assume that
> this might change, and that we'd better show the field value no matter
> what (as we do with the "zero" field) -- I'm OK with it.

It is indeed redundant, but as this is intended to be the machine-readable output, I don’t mind the output becoming arguably needlessly verbose.
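For illustration, with the field made mandatory as discussed, "qemu-img map --output=json" output would look roughly like this (offsets, sizes, and the image name are invented):

    $ qemu-img map --output=json image.qcow2
    [{ "start": 0, "length": 65536, "depth": 0, "present": true,
       "zero": false, "data": true, "compressed": true},
     { "start": 65536, "length": 65536, "depth": 0, "present": true,
       "zero": false, "data": true, "compressed": false, "offset": 327680}]

Note that the compressed extent carries no "offset" key, which is exactly the behavior the commit message calls "not very informative" on its own: the new "compressed" flag makes the reason for the missing offset explicit.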
Re: [PATCH 3/6] qemu-img: rebase: use backing files' BlockBackend for buffer alignment
On 29.08.23 09:06, Andrey Drobyshev wrote:
> On 8/25/23 17:29, Hanna Czenczek wrote:
>> On 01.06.23 21:28, Andrey Drobyshev via wrote:
>>> Since commit bb1c05973cf ("qemu-img: Use qemu_blockalign"), buffers
>>> for the data read from the old and new backing files are aligned
>>> using the BlockDriverState (or BlockBackend later on) referring to
>>> the target image. However, this isn't quite right, because the target
>>> image is only being written to and has nothing to do with those
>>> buffers. Let's fix that.
>>
>> I don't understand. The write to the target image does use one of
>> those buffers (buf_old, specifically). This change is correct for
>> buf_new/blk_new_backing, but for buf_old, in theory, we need a buffer
>> that fulfills both the alignment requirements of blk and
>> blk_old_backing. (Not that this patch really makes the situation worse
>> for buf_old.)
>>
>> Hanna
>
> Hmm, you're right. In which case the right thing to do would probably
> be something like:
>
>      float local_progress = 0;
>
> -    buf_old = blk_blockalign(blk, IO_BUF_SIZE);
> -    buf_new = blk_blockalign(blk, IO_BUF_SIZE);
> +    if (bdrv_opt_mem_align(blk_bs(blk)) >
> +        bdrv_opt_mem_align(blk_bs(blk_old_backing))) {
> +        buf_old = blk_blockalign(blk, IO_BUF_SIZE);
> +    } else {
> +        buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE);
> +    }
> +    buf_new = blk_blockalign(blk_new_backing, IO_BUF_SIZE);
>
>      size = blk_getlength(blk);
>
> I'll include this in v2 if you don't have any objections.

Looks good to me, thanks!
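For readers following along: the constraint Hanna describes (buf_old must satisfy both backends' alignment requirements) can also be written explicitly by taking the stricter of the two values. A sketch of that alternative (the helper name is invented; this is not the code proposed for v2):

    #include "qemu/osdep.h"
    #include "qemu/memalign.h"
    #include "block/block.h"
    #include "sysemu/block-backend.h"

    /* Sketch: allocate a buffer aligned for both the target (blk) and
     * the old backing file, since buf_old is read from one and written
     * to the other. */
    static void *blockalign_for_both(BlockBackend *blk,
                                     BlockBackend *blk_old_backing,
                                     size_t size)
    {
        size_t align = MAX(bdrv_opt_mem_align(blk_bs(blk)),
                           bdrv_opt_mem_align(blk_bs(blk_old_backing)));

        return qemu_memalign(align, size);
    }

Since memory alignment requirements are powers of two in practice, the larger of the two values satisfies both, which is what the if/else in the quoted snippet relies on as well.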