Re: [PULL 0/3] Memory/SDHCI/ParallelFlash patches for v7.2.0-rc0

2022-11-08 Thread Philippe Mathieu-Daudé

On 8/11/22 21:57, Stefan Hajnoczi wrote:

I've dropped the SDHCI CVE fix due to the CI failure.

The rest of the commits are still in the staging tree and I plan to
include them in v7.2.0-rc0.


Thank you Stefan, sorry for not catching that failure sooner.



Re: [PATCH v3 4/4] hw/nvme: add polling support

2022-11-08 Thread Jinhao Fan
at 10:11 PM, Klaus Jensen  wrote:

> On Nov  8 12:39, John Levon wrote:
>> On Fri, Nov 04, 2022 at 07:32:12AM +0100, Klaus Jensen wrote:
>> 
>>> On Nov  3 21:19, Jinhao Fan wrote:
 On 11/3/2022 8:10 PM, Klaus Jensen wrote:
> I agree that the spec is a little unclear on this point. In any case, in
> Linux, when the driver has decided that the sq tail must be updated,
> it will use this check:
> 
>   (new_idx - event_idx - 1) < (new_idx - old)
 
 When eventidx is already behind, it's like:
 
 0
 1 <- event_idx
 2 <- old
 3 <- new_idx
 4
 .
 .
 .
 
 In this case, (new_idx - event_idx - 1) = 3-1-1 = 1 >= (new_idx - old) =
 3-2=1, so the host won't update sq tail. Where am I wrong in this example?
>>> 
>>> That becomes 1 >= 1, i.e. "true". So this will result in the driver
>>> doing an mmio doorbell write.
>> 
>> The code is:
>> 
>> static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)
>> {
>>     return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);
>> }
>> 
>> 
>> which per the above is "return 1 < 1;", or false. So the above case does
>> *not* do an mmio write. No?
> 
> Whelp.
> 
> Looks like I'm in the wrong here, apologies!

So disabling eventidx updates during polling has the potential of reducing
doorbell writes. But as Klaus observed, removing this function does not cause a
performance difference. So I guess only one command is processed during each
polling iteration.
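
For reference, plugging the example values into the helper quoted above
reproduces John's result; a standalone sketch (u16 is assumed to be uint16_t,
as in the kernel):

#include <stdio.h>
#include <stdint.h>

typedef uint16_t u16;

/* Same check as the Linux driver helper quoted above. */
static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)
{
    return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);
}

int main(void)
{
    /* event_idx = 1, old = 2, new_idx = 3, as in the example above. */
    printf("%d\n", nvme_dbbuf_need_event(1, 3, 2)); /* prints 0: no doorbell write */
    return 0;
}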



[PATCH 8/8] virtio-blk: minimize virtio_blk_reset() AioContext lock region

2022-11-08 Thread Stefan Hajnoczi
blk_drain() needs the lock because it calls AIO_WAIT_WHILE().

The s->rq loop doesn't need the lock because dataplane has been stopped
when virtio_blk_reset() is called.

Signed-off-by: Stefan Hajnoczi 
---
 hw/block/virtio-blk.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 771d87cfbe..0b411b3065 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -898,6 +898,7 @@ static void virtio_blk_reset(VirtIODevice *vdev)
 ctx = blk_get_aio_context(s->blk);
 aio_context_acquire(ctx);
 blk_drain(s->blk);
+aio_context_release(ctx);
 
 /* We drop queued requests after blk_drain() because blk_drain() itself can
  * produce them. */
@@ -908,8 +909,6 @@ static void virtio_blk_reset(VirtIODevice *vdev)
 virtio_blk_free_request(req);
 }
 
-aio_context_release(ctx);
-
 assert(!s->dataplane_started);
 blk_set_enable_write_cache(s->blk, s->original_wce);
 }
-- 
2.38.1




[PATCH 7/8] virtio-blk: don't acquire AioContext in virtio_blk_handle_vq()

2022-11-08 Thread Stefan Hajnoczi
There is no need to acquire AioContext in virtio_blk_handle_vq() because
no APIs used in the function require it and nothing else in the
virtio-blk code requires mutual exclusion anymore.

Signed-off-by: Stefan Hajnoczi 
---
 hw/block/virtio-blk.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index faea045178..771d87cfbe 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -784,7 +784,6 @@ void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
 
 IO_CODE();
 
-aio_context_acquire(blk_get_aio_context(s->blk));
 blk_io_plug(s->blk);
 
 do {
@@ -810,7 +809,6 @@ void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
 }
 
 blk_io_unplug(s->blk);
-aio_context_release(blk_get_aio_context(s->blk));
 }
 
 static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
-- 
2.38.1




[PATCH 1/8] virtio_queue_aio_attach_host_notifier: remove AioContext lock

2022-11-08 Thread Stefan Hajnoczi
From: Emanuele Giuseppe Esposito 

virtio_queue_aio_attach_host_notifier() and
virtio_queue_aio_attach_host_notifier_nopoll() always run in the
main loop, so there is no need to protect them with the AioContext
lock.

On the other hand, virtio_queue_aio_detach_host_notifier() runs
in a BH in the iothread context, but it is always scheduled
(and thus serialized) by the main loop. Therefore removing the
AioContext lock is safe.

In order to remove the AioContext lock it is necessary to switch
aio_wait_bh_oneshot() to AIO_WAIT_WHILE_UNLOCKED(). virtio-blk and
virtio-scsi are the only users of aio_wait_bh_oneshot() so it is
possible to make this change.

For now bdrv_set_aio_context() still needs the AioContext lock.

Signed-off-by: Emanuele Giuseppe Esposito 
Signed-off-by: Stefan Hajnoczi 
Message-Id: <20220609143727.1151816-2-eespo...@redhat.com>
---
 include/block/aio-wait.h|  4 ++--
 hw/block/dataplane/virtio-blk.c | 10 ++
 hw/block/virtio-blk.c   |  2 ++
 hw/scsi/virtio-scsi-dataplane.c | 10 --
 util/aio-wait.c |  2 +-
 5 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/include/block/aio-wait.h b/include/block/aio-wait.h
index dd9a7f6461..fce6bfee3a 100644
--- a/include/block/aio-wait.h
+++ b/include/block/aio-wait.h
@@ -131,8 +131,8 @@ void aio_wait_kick(void);
  *
  * Run a BH in @ctx and wait for it to complete.
  *
- * Must be called from the main loop thread with @ctx acquired exactly once.
- * Note that main loop event processing may occur.
+ * Must be called from the main loop thread. @ctx must not be acquired by the
+ * caller. Note that main loop event processing may occur.
  */
 void aio_wait_bh_oneshot(AioContext *ctx, QEMUBHFunc *cb, void *opaque);
 
diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index b28d81737e..975f5ca8c4 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -167,6 +167,8 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
 Error *local_err = NULL;
 int r;
 
+GLOBAL_STATE_CODE();
+
 if (vblk->dataplane_started || s->starting) {
 return 0;
 }
@@ -245,13 +247,11 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
 }
 
 /* Get this show started by hooking up our callbacks */
-aio_context_acquire(s->ctx);
 for (i = 0; i < nvqs; i++) {
 VirtQueue *vq = virtio_get_queue(s->vdev, i);
 
 virtio_queue_aio_attach_host_notifier(vq, s->ctx);
 }
-aio_context_release(s->ctx);
 return 0;
 
   fail_aio_context:
@@ -301,6 +301,8 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
 unsigned i;
 unsigned nvqs = s->conf->num_queues;
 
+GLOBAL_STATE_CODE();
+
 if (!vblk->dataplane_started || s->stopping) {
 return;
 }
@@ -314,9 +316,10 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
 s->stopping = true;
 trace_virtio_blk_data_plane_stop(s);
 
-aio_context_acquire(s->ctx);
 aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
 
+aio_context_acquire(s->ctx);
+
 /* Wait for virtio_blk_dma_restart_bh() and in flight I/O to complete */
 blk_drain(s->conf->conf.blk);
 
@@ -325,7 +328,6 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
  * BlockBackend in the iothread, that's ok
  */
 blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context(), NULL);
-
 aio_context_release(s->ctx);
 
 /*
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 1762517878..cdc6fd5979 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -100,6 +100,8 @@ static void virtio_blk_rw_complete(void *opaque, int ret)
 VirtIOBlock *s = next->dev;
 VirtIODevice *vdev = VIRTIO_DEVICE(s);
 
+IO_CODE();
+
 aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
 while (next) {
 VirtIOBlockReq *req = next;
diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index 20bb91766e..f6f55d4511 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -91,6 +91,8 @@ int virtio_scsi_dataplane_start(VirtIODevice *vdev)
 VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(vdev);
 VirtIOSCSI *s = VIRTIO_SCSI(vdev);
 
+GLOBAL_STATE_CODE();
+
 if (s->dataplane_started ||
 s->dataplane_starting ||
 s->dataplane_fenced) {
@@ -138,20 +140,18 @@ int virtio_scsi_dataplane_start(VirtIODevice *vdev)
 
 /*
  * These fields are visible to the IOThread so we rely on implicit barriers
- * in aio_context_acquire() on the write side and aio_notify_accept() on
- * the read side.
+ * in virtio_queue_aio_attach_host_notifier() on the write side and
+ * aio_notify_accept() on the read side.
  */
 s->dataplane_starting = false;
 s->dataplane_started = true;
 
-aio_context_acquire(s->ctx);
 virtio_queue_aio_attach_host_notifier(vs->ctrl_vq, s->ctx);
 

[PATCH 5/8] virtio-blk: mark IO_CODE functions

2022-11-08 Thread Stefan Hajnoczi
From: Emanuele Giuseppe Esposito 

Just as done in the block API, mark functions in virtio-blk
that are also called from iothread(s).

We know such functions are IO because many of them are blk_* callbacks,
which always run in the device iothread, and the rest are propagated
from the leaf IO functions (if a function calls an IO_CODE function,
it is categorized as IO_CODE too).

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Stefan Hajnoczi 
Message-Id: <20220609143727.1151816-7-eespo...@redhat.com>
---
 hw/block/dataplane/virtio-blk.c |  4 +++
 hw/block/virtio-blk.c   | 45 -
 2 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 728c9cd86c..3593ac0e7b 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -63,6 +63,8 @@ static void notify_guest_bh(void *opaque)
 unsigned long bitmap[BITS_TO_LONGS(nvqs)];
 unsigned j;
 
+IO_CODE();
+
 memcpy(bitmap, s->batch_notify_vqs, sizeof(bitmap));
 memset(s->batch_notify_vqs, 0, sizeof(bitmap));
 
@@ -288,6 +290,8 @@ static void virtio_blk_data_plane_stop_bh(void *opaque)
 VirtIOBlockDataPlane *s = opaque;
 unsigned i;
 
+IO_CODE();
+
 for (i = 0; i < s->conf->num_queues; i++) {
 VirtQueue *vq = virtio_get_queue(s->vdev, i);
 
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 02b213a140..f8fcf25292 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -39,6 +39,8 @@
 static void virtio_blk_init_request(VirtIOBlock *s, VirtQueue *vq,
 VirtIOBlockReq *req)
 {
+IO_CODE();
+
 req->dev = s;
 req->vq = vq;
 req->qiov.size = 0;
@@ -57,6 +59,8 @@ static void virtio_blk_req_complete(VirtIOBlockReq *req, 
unsigned char status)
 VirtIOBlock *s = req->dev;
 VirtIODevice *vdev = VIRTIO_DEVICE(s);
 
+IO_CODE();
+
 trace_virtio_blk_req_complete(vdev, req, status);
 
 stb_p(&req->in->status, status);
@@ -76,6 +80,8 @@ static int virtio_blk_handle_rw_error(VirtIOBlockReq *req, 
int error,
 VirtIOBlock *s = req->dev;
 BlockErrorAction action = blk_get_error_action(s->blk, is_read, error);
 
+IO_CODE();
+
 if (action == BLOCK_ERROR_ACTION_STOP) {
 /* Break the link as the next request is going to be parsed from the
  * ring again. Otherwise we may end up doing a double completion! */
@@ -143,7 +149,9 @@ static void virtio_blk_flush_complete(void *opaque, int ret)
 VirtIOBlockReq *req = opaque;
 VirtIOBlock *s = req->dev;
 
+IO_CODE();
 aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
+
 if (ret) {
 if (virtio_blk_handle_rw_error(req, -ret, 0, true)) {
 goto out;
@@ -165,7 +173,9 @@ static void virtio_blk_discard_write_zeroes_complete(void 
*opaque, int ret)
 bool is_write_zeroes = (virtio_ldl_p(VIRTIO_DEVICE(s), &req->out.type) &
 ~VIRTIO_BLK_T_BARRIER) == 
VIRTIO_BLK_T_WRITE_ZEROES;
 
+IO_CODE();
 aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
+
 if (ret) {
 if (virtio_blk_handle_rw_error(req, -ret, false, is_write_zeroes)) {
 goto out;
@@ -198,6 +208,8 @@ static void virtio_blk_ioctl_complete(void *opaque, int 
status)
 struct virtio_scsi_inhdr *scsi;
 struct sg_io_hdr *hdr;
 
+IO_CODE();
+
 scsi = (void *)req->elem.in_sg[req->elem.in_num - 2].iov_base;
 
 if (status) {
@@ -239,6 +251,8 @@ static VirtIOBlockReq *virtio_blk_get_request(VirtIOBlock 
*s, VirtQueue *vq)
 {
 VirtIOBlockReq *req = virtqueue_pop(vq, sizeof(VirtIOBlockReq));
 
+IO_CODE();
+
 if (req) {
 virtio_blk_init_request(s, vq, req);
 }
@@ -259,6 +273,8 @@ static int virtio_blk_handle_scsi_req(VirtIOBlockReq *req)
 BlockAIOCB *acb;
 #endif
 
+IO_CODE();
+
 /*
  * We require at least one output segment each for the virtio_blk_outhdr
  * and the SCSI command block.
@@ -357,6 +373,7 @@ fail:
 static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
 {
 int status;
+IO_CODE();
 
 status = virtio_blk_handle_scsi_req(req);
 if (status != -EINPROGRESS) {
@@ -374,6 +391,8 @@ static inline void submit_requests(VirtIOBlock *s, 
MultiReqBuffer *mrb,
 bool is_write = mrb->is_write;
 BdrvRequestFlags flags = 0;
 
+IO_CODE();
+
 if (num_reqs > 1) {
 int i;
 struct iovec *tmp_iov = qiov->iov;
@@ -423,6 +442,8 @@ static int multireq_compare(const void *a, const void *b)
 const VirtIOBlockReq *req1 = *(VirtIOBlockReq **)a,
  *req2 = *(VirtIOBlockReq **)b;
 
+IO_CODE();
+
 /*
  * Note that we can't simply subtract sector_num1 from sector_num2
  * here as that could overflow the return value.
@@ -442,6 +463,8 @@ static void virtio_blk_submit_multireq(VirtIOBlock *s, 
MultiReqBuffer *mrb)
 uint32_t 

[PATCH 6/8] virtio-blk: remove unnecessary AioContext lock from function already safe

2022-11-08 Thread Stefan Hajnoczi
From: Emanuele Giuseppe Esposito 

AioContext lock was introduced in b9e413dd375 and in this instance
it is used to protect these 3 functions:
- virtio_blk_handle_rw_error
- virtio_blk_req_complete
- block_acct_done

Now that all three of the above functions are protected with
their own locks, we can get rid of the AioContext lock.

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Stefan Hajnoczi 
Message-Id: <20220609143727.1151816-9-eespo...@redhat.com>
---
 hw/block/virtio-blk.c | 19 ++-
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index f8fcf25292..faea045178 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -108,7 +108,6 @@ static void virtio_blk_rw_complete(void *opaque, int ret)
 
 IO_CODE();
 
-aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
 while (next) {
 VirtIOBlockReq *req = next;
 next = req->mr_next;
@@ -141,7 +140,6 @@ static void virtio_blk_rw_complete(void *opaque, int ret)
 block_acct_done(blk_get_stats(s->blk), &req->acct);
 virtio_blk_free_request(req);
 }
-aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }
 
 static void virtio_blk_flush_complete(void *opaque, int ret)
@@ -150,20 +148,16 @@ static void virtio_blk_flush_complete(void *opaque, int 
ret)
 VirtIOBlock *s = req->dev;
 
 IO_CODE();
-aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
 
 if (ret) {
 if (virtio_blk_handle_rw_error(req, -ret, 0, true)) {
-goto out;
+return;
 }
 }
 
 virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
 block_acct_done(blk_get_stats(s->blk), &req->acct);
 virtio_blk_free_request(req);
-
-out:
-aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }
 
 static void virtio_blk_discard_write_zeroes_complete(void *opaque, int ret)
@@ -174,11 +168,10 @@ static void virtio_blk_discard_write_zeroes_complete(void 
*opaque, int ret)
 ~VIRTIO_BLK_T_BARRIER) == 
VIRTIO_BLK_T_WRITE_ZEROES;
 
 IO_CODE();
-aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
 
 if (ret) {
 if (virtio_blk_handle_rw_error(req, -ret, false, is_write_zeroes)) {
-goto out;
+return;
 }
 }
 
@@ -187,9 +180,6 @@ static void virtio_blk_discard_write_zeroes_complete(void 
*opaque, int ret)
 block_acct_done(blk_get_stats(s->blk), &req->acct);
 }
 virtio_blk_free_request(req);
-
-out:
-aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }
 
 #ifdef __linux__
@@ -238,10 +228,8 @@ static void virtio_blk_ioctl_complete(void *opaque, int 
status)
 virtio_stl_p(vdev, &scsi->data_len, hdr->dxfer_len);
 
 out:
-aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
 virtio_blk_req_complete(req, status);
 virtio_blk_free_request(req);
-aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 g_free(ioctl_req);
 }
 
@@ -852,7 +840,6 @@ static void virtio_blk_dma_restart_bh(void *opaque)
 
 s->rq = NULL;
 
-aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
 while (req) {
 VirtIOBlockReq *next = req->next;
 if (virtio_blk_handle_request(req, &mrb)) {
@@ -876,8 +863,6 @@ static void virtio_blk_dma_restart_bh(void *opaque)
 
 /* Paired with inc in virtio_blk_dma_restart_cb() */
 blk_dec_in_flight(s->conf.conf.blk);
-
-aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }
 
 /*
-- 
2.38.1




[PATCH 4/8] virtio-blk: mark GLOBAL_STATE_CODE functions

2022-11-08 Thread Stefan Hajnoczi
From: Emanuele Giuseppe Esposito 

Just as done in the block API, mark functions in virtio-blk
that are always called in the main loop with BQL held.

We know such functions are GS because they are all callbacks
from the virtio.c API, which has already classified them as GS.

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Stefan Hajnoczi 
Message-Id: <20220609143727.1151816-6-eespo...@redhat.com>
---
 hw/block/dataplane/virtio-blk.c |  4 
 hw/block/virtio-blk.c   | 27 +++
 2 files changed, 31 insertions(+)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 975f5ca8c4..728c9cd86c 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -89,6 +89,8 @@ bool virtio_blk_data_plane_create(VirtIODevice *vdev, 
VirtIOBlkConf *conf,
 BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
 VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
 
+GLOBAL_STATE_CODE();
+
 *dataplane = NULL;
 
 if (conf->iothread) {
@@ -140,6 +142,8 @@ void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s)
 {
 VirtIOBlock *vblk;
 
+GLOBAL_STATE_CODE();
+
 if (!s) {
 return;
 }
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 96bc11d2fe..02b213a140 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -845,11 +845,17 @@ static void virtio_blk_dma_restart_bh(void *opaque)
 aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }
 
+/*
+ * Only called when VM is started or stopped in cpus.c.
+ * No iothread runs in parallel
+ */
 static void virtio_blk_dma_restart_cb(void *opaque, bool running,
   RunState state)
 {
 VirtIOBlock *s = opaque;
 
+GLOBAL_STATE_CODE();
+
 if (!running) {
 return;
 }
@@ -867,8 +873,14 @@ static void virtio_blk_reset(VirtIODevice *vdev)
 AioContext *ctx;
 VirtIOBlockReq *req;
 
+GLOBAL_STATE_CODE();
+
 ctx = blk_get_aio_context(s->blk);
 aio_context_acquire(ctx);
+/*
+ * This drain together with ->stop_ioeventfd() in virtio_pci_reset()
+ * stops all Iothreads.
+ */
 blk_drain(s->blk);
 
 /* We drop queued requests after blk_drain() because blk_drain() itself can
@@ -1037,11 +1049,17 @@ static void virtio_blk_set_status(VirtIODevice *vdev, 
uint8_t status)
 }
 }
 
+/*
+ * VM is stopped while doing migration, so iothread has
+ * no requests to process.
+ */
 static void virtio_blk_save_device(VirtIODevice *vdev, QEMUFile *f)
 {
 VirtIOBlock *s = VIRTIO_BLK(vdev);
 VirtIOBlockReq *req = s->rq;
 
+GLOBAL_STATE_CODE();
+
 while (req) {
 qemu_put_sbyte(f, 1);
 
@@ -1055,11 +1073,17 @@ static void virtio_blk_save_device(VirtIODevice *vdev, 
QEMUFile *f)
 qemu_put_sbyte(f, 0);
 }
 
+/*
+ * VM is stopped while doing migration, so iothread has
+ * no requests to process.
+ */
 static int virtio_blk_load_device(VirtIODevice *vdev, QEMUFile *f,
   int version_id)
 {
 VirtIOBlock *s = VIRTIO_BLK(vdev);
 
+GLOBAL_STATE_CODE();
+
 while (qemu_get_sbyte(f)) {
 unsigned nvqs = s->conf.num_queues;
 unsigned vq_idx = 0;
@@ -1108,6 +1132,7 @@ static const BlockDevOps virtio_block_ops = {
 .resize_cb = virtio_blk_resize,
 };
 
+/* Iothread is not yet created */
 static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
 {
 VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -1116,6 +1141,8 @@ static void virtio_blk_device_realize(DeviceState *dev, 
Error **errp)
 Error *err = NULL;
 unsigned i;
 
+GLOBAL_STATE_CODE();
+
 if (!conf->conf.blk) {
 error_setg(errp, "drive property not set");
 return;
-- 
2.38.1
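
For readers unfamiliar with these annotations, here is a rough sketch of what
the two markers used in the patches above amount to; this is illustrative only
and not the exact QEMU macro bodies:

/* Sketch only: GLOBAL_STATE_CODE() asserts that the caller runs in the main
 * loop thread (under the BQL), while IO_CODE() is essentially a documentation
 * marker for functions that may also run in an iothread. */
#define GLOBAL_STATE_CODE()  assert(qemu_in_main_thread())
#define IO_CODE()            do { /* no runtime check */ } while (0)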




[PATCH 2/8] block-backend: enable_write_cache should be atomic

2022-11-08 Thread Stefan Hajnoczi
From: Emanuele Giuseppe Esposito 

It is read from IO_CODE and written with the BQL held,
so making the field atomic should be enough.

Also remove the AioContext lock that was sporadically
taken around the setter.

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Stefan Hajnoczi 
Message-Id: <20220609143727.1151816-3-eespo...@redhat.com>
---
 block/block-backend.c | 6 +++---
 hw/block/virtio-blk.c | 4 
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index c0c7d56c8d..949418cad4 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -60,7 +60,7 @@ struct BlockBackend {
  * can be used to restore those options in the new BDS on insert) */
 BlockBackendRootState root_state;
 
-bool enable_write_cache;
+bool enable_write_cache; /* Atomic */
 
 /* I/O stats (display with "info blockstats"). */
 BlockAcctStats stats;
@@ -1939,13 +1939,13 @@ bool blk_is_sg(BlockBackend *blk)
 bool blk_enable_write_cache(BlockBackend *blk)
 {
 IO_CODE();
-return blk->enable_write_cache;
+return qatomic_read(&blk->enable_write_cache);
 }
 
 void blk_set_enable_write_cache(BlockBackend *blk, bool wce)
 {
 IO_CODE();
-blk->enable_write_cache = wce;
+qatomic_set(&blk->enable_write_cache, wce);
 }
 
 void blk_activate(BlockBackend *blk, Error **errp)
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index cdc6fd5979..96d00103a4 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -961,9 +961,7 @@ static void virtio_blk_set_config(VirtIODevice *vdev, const 
uint8_t *config)
 
 memcpy(&blkcfg, config, s->config_size);
 
-aio_context_acquire(blk_get_aio_context(s->blk));
 blk_set_enable_write_cache(s->blk, blkcfg.wce != 0);
-aio_context_release(blk_get_aio_context(s->blk));
 }
 
 static uint64_t virtio_blk_get_features(VirtIODevice *vdev, uint64_t features,
@@ -1031,11 +1029,9 @@ static void virtio_blk_set_status(VirtIODevice *vdev, 
uint8_t status)
  * s->blk would erroneously be placed in writethrough mode.
  */
 if (!virtio_vdev_has_feature(vdev, VIRTIO_BLK_F_CONFIG_WCE)) {
-aio_context_acquire(blk_get_aio_context(s->blk));
 blk_set_enable_write_cache(s->blk,
virtio_vdev_has_feature(vdev,
VIRTIO_BLK_F_WCE));
-aio_context_release(blk_get_aio_context(s->blk));
 }
 }
 
-- 
2.38.1




[PATCH 0/8] virtio-blk: remove AioContext lock

2022-11-08 Thread Stefan Hajnoczi
This is a continuation of Emanuele Esposito's work to remove the AioContext
lock in virtio-blk. In the past it was necessary to acquire the AioContext lock
in order to do I/O. Paolo Bonzini and Emanuele have removed the need for the
AioContext lock in the core block layer code, with a few exceptions like blk_drain()
and blk_set_aio_context().

This patch series annotates virtio-blk functions with
IO_CODE()/GLOBAL_STATE_CODE() so it's clear in which context they are called.
It also removes unnecessary AioContext locks.

The end result is that virtio-blk rarely takes the AioContext lock. Future
patch series will add support for multiple IOThreads so that true multi-queue
can be achieved, but right now virtio-blk still has unprotected shared state
like s->rq, so that needs to wait for another day.

Based-on: <20221102182337.252202-1-stefa...@redhat.com>

Emanuele Giuseppe Esposito (6):
  virtio_queue_aio_attach_host_notifier: remove AioContext lock
  block-backend: enable_write_cache should be atomic
  virtio: categorize callbacks in GS
  virtio-blk: mark GLOBAL_STATE_CODE functions
  virtio-blk: mark IO_CODE functions
  virtio-blk: remove unnecessary AioContext lock from function already
safe

Stefan Hajnoczi (2):
  virtio-blk: don't acquire AioContext in virtio_blk_handle_vq()
  virtio-blk: minimize virtio_blk_reset() AioContext lock region

 include/block/aio-wait.h|  4 +-
 block/block-backend.c   |  6 +--
 hw/block/dataplane/virtio-blk.c | 18 +--
 hw/block/virtio-blk.c   | 92 -
 hw/scsi/virtio-scsi-dataplane.c | 10 ++--
 hw/virtio/virtio-bus.c  |  5 ++
 hw/virtio/virtio-pci.c  |  2 +
 hw/virtio/virtio.c  |  8 +++
 util/aio-wait.c |  2 +-
 9 files changed, 106 insertions(+), 41 deletions(-)

-- 
2.38.1




[PATCH 3/8] virtio: categorize callbacks in GS

2022-11-08 Thread Stefan Hajnoczi
From: Emanuele Giuseppe Esposito 

All the callbacks below are always running in the main loop.

The callbacks are the following:
- start/stop_ioeventfd: these are the callbacks where
  blk_set_aio_context(iothread) is done, so they are called in the main
  loop.

- save and load: called during migration, when VM is stopped from the
  main loop.

- reset: before calling this callback, stop_ioeventfd is invoked, so
  it can only run in the main loop.

- set_status: going through all the callers we can see it is called
  from a MemoryRegionOps callback, which always runs in a vcpu thread and
  holds the BQL.

- realize: iothread is not even created yet.

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Acked-by: Michael S. Tsirkin 
Signed-off-by: Stefan Hajnoczi 
Message-Id: <20220609143727.1151816-5-eespo...@redhat.com>
---
 hw/block/virtio-blk.c  | 2 ++
 hw/virtio/virtio-bus.c | 5 +
 hw/virtio/virtio-pci.c | 2 ++
 hw/virtio/virtio.c | 8 
 4 files changed, 17 insertions(+)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 96d00103a4..96bc11d2fe 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1005,6 +1005,8 @@ static void virtio_blk_set_status(VirtIODevice *vdev, 
uint8_t status)
 {
 VirtIOBlock *s = VIRTIO_BLK(vdev);
 
+GLOBAL_STATE_CODE();
+
 if (!(status & (VIRTIO_CONFIG_S_DRIVER | VIRTIO_CONFIG_S_DRIVER_OK))) {
 assert(!s->dataplane_started);
 }
diff --git a/hw/virtio/virtio-bus.c b/hw/virtio/virtio-bus.c
index 896feb37a1..74cdf4bd27 100644
--- a/hw/virtio/virtio-bus.c
+++ b/hw/virtio/virtio-bus.c
@@ -23,6 +23,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/main-loop.h"
 #include "qemu/error-report.h"
 #include "qemu/module.h"
 #include "qapi/error.h"
@@ -224,6 +225,8 @@ int virtio_bus_start_ioeventfd(VirtioBusState *bus)
 VirtioDeviceClass *vdc = VIRTIO_DEVICE_GET_CLASS(vdev);
 int r;
 
+GLOBAL_STATE_CODE();
+
 if (!k->ioeventfd_assign || !k->ioeventfd_enabled(proxy)) {
 return -ENOSYS;
 }
@@ -248,6 +251,8 @@ void virtio_bus_stop_ioeventfd(VirtioBusState *bus)
 VirtIODevice *vdev;
 VirtioDeviceClass *vdc;
 
+GLOBAL_STATE_CODE();
+
 if (!bus->ioeventfd_started) {
 return;
 }
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index a1c9dfa7bb..4f9a94f61b 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -313,6 +313,8 @@ static void virtio_ioport_write(void *opaque, uint32_t 
addr, uint32_t val)
 uint16_t vector;
 hwaddr pa;
 
+GLOBAL_STATE_CODE();
+
 switch (addr) {
 case VIRTIO_PCI_GUEST_FEATURES:
 /* Guest does not negotiate properly?  We have to assume nothing. */
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 9683b2e158..468e8f5ad0 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -2422,6 +2422,8 @@ int virtio_set_status(VirtIODevice *vdev, uint8_t val)
 VirtioDeviceClass *k = VIRTIO_DEVICE_GET_CLASS(vdev);
 trace_virtio_set_status(vdev, val);
 
+GLOBAL_STATE_CODE();
+
 if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
 if (!(vdev->status & VIRTIO_CONFIG_S_FEATURES_OK) &&
 val & VIRTIO_CONFIG_S_FEATURES_OK) {
@@ -2515,6 +2517,8 @@ void virtio_reset(void *opaque)
 VirtioDeviceClass *k = VIRTIO_DEVICE_GET_CLASS(vdev);
 int i;
 
+GLOBAL_STATE_CODE();
+
 virtio_set_status(vdev, 0);
 if (current_cpu) {
 /* Guest initiated reset */
@@ -3357,6 +3361,8 @@ int virtio_save(VirtIODevice *vdev, QEMUFile *f)
 uint32_t guest_features_lo = (vdev->guest_features & 0xffffffff);
 int i;
 
+GLOBAL_STATE_CODE();
+
 if (k->save_config) {
 k->save_config(qbus->parent, f);
 }
@@ -3508,6 +3514,8 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f, int 
version_id)
 VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
 VirtioDeviceClass *vdc = VIRTIO_DEVICE_GET_CLASS(vdev);
 
+GLOBAL_STATE_CODE();
+
 /*
  * We poison the endianness to ensure it does not get used before
  * subsections have been loaded.
-- 
2.38.1




Re: [PULL 0/3] Memory/SDHCI/ParallelFlash patches for v7.2.0-rc0

2022-11-08 Thread Stefan Hajnoczi
I've dropped the SDHCI CVE fix due to the CI failure.

The rest of the commits are still in the staging tree and I plan to
include them in v7.2.0-rc0.

Stefan



Re: [PULL 0/3] Memory/SDHCI/ParallelFlash patches for v7.2.0-rc0

2022-11-08 Thread Stefan Hajnoczi
On Tue, 8 Nov 2022 at 13:35, Philippe Mathieu-Daudé  wrote:
>
> The following changes since commit ade760a2f63804b7ab1839fbc3e5ddbf30538718:
>
>   Merge tag 'pull-request-2022-11-08' of https://gitlab.com/thuth/qemu into 
> staging (2022-11-08 11:34:06 -0500)
>
> are available in the Git repository at:
>
>   https://github.com/philmd/qemu.git tags/memflash-20221108
>
> for you to fetch changes up to cf9b3efd816518f9f210f50a0fa3e46a00b33c27:
>
>   Revert "hw/block/pflash_cfi: Error out if dev length isn't power of 2" 
> (2022-11-08 19:29:25 +0100)
>
> 
> Memory/SDHCI/ParallelFlash patches queue
>
> - Fix wrong end address dump in 'info mtree' (Zhenzhong Duan)
> - Fix in SDHCI for CVE-2022-3872 (myself)

There is a CI failure:

>>> G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh 
>>> MALLOC_PERTURB_=127 QTEST_QEMU_BINARY=./qemu-system-arm 
>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon 
>>> QTEST_QEMU_IMG=./qemu-img 
>>> /builds/qemu-project/qemu/build/tests/qtest/npcm7xx_sdhci-test --tap -k
― ✀ ―
stderr:
** Message: 19:27:52.411: /tmp/sdhci_ZD2EV1
**
ERROR:../tests/qtest/npcm7xx_sdhci-test.c:101:sdwrite_read: assertion
failed: (!memcmp(rmsg, msg, len))

https://gitlab.com/qemu-project/qemu/-/jobs/3292896670

Stefan



Re: [PATCH v3 08/17] migration/qemu-file: Add qemu_file_get_to_fd()

2022-11-08 Thread Vladimir Sementsov-Ogievskiy

On 11/3/22 19:16, Avihai Horon wrote:

Add new function qemu_file_get_to_fd() that allows reading data from
QEMUFile and writing it straight into a given fd.

This will be used later in VFIO migration code.

Signed-off-by: Avihai Horon


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir




Re: [PATCH-for-7.2 1/2] hw/sd/sdhci: Do not set Buf Wr Ena before writing block (CVE-2022-3872)

2022-11-08 Thread Alexander Bulekov
On 221108 1225, Alexander Bulekov wrote:
> On 221107 2312, Philippe Mathieu-Daudé wrote:
> > When sdhci_write_block_to_card() is called to transfer data from
> > the FIFO to the SD bus, the data is already present in the buffer
> > and we have to consume it directly.
> > 
> > See the description of the 'Buffer Write Enable' bit from the
> > 'Present State' register (prnsts::SDHC_SPACE_AVAILABLE) in Table
> > 2.14 from the SDHCI spec v2:
> > 
> >   Buffer Write Enable
> > 
> >   This status is used for non-DMA write transfers.
> > 
> >   The Host Controller can implement multiple buffers to transfer
> >   data efficiently. This read only flag indicates if space is
> >   available for write data. If this bit is 1, data can be written
> >   to the buffer. A change of this bit from 1 to 0 occurs when all
> >   the block data is written to the buffer. A change of this bit
> >   from 0 to 1 occurs when top of block data can be written to the
> >   buffer and generates the Buffer Write Ready interrupt.
> > 
> > In our case, we do not want to overwrite the buffer, so we want
> > this bit to be 0, then set it to 1 once the data is written onto
> > the bus.
> > 
> > This is probably a copy/paste error from commit d7dfca0807
> > ("hw/sdhci: introduce standard SD host controller").
> > 
> > Reproducer:
> > https://lore.kernel.org/qemu-devel/caa8xkjxrms0fkr28akvnnpyatm0y0b+5fichpsrhd+mugnu...@mail.gmail.com/
> > 
> > Fixes: CVE-2022-3872
> > Reported-by: RivenDell 
> > Reported-by: Siqi Chen 
> > Reported-by: ningqiang 
> > Signed-off-by: Philippe Mathieu-Daudé 
> 
> Seems like OSS-Fuzz also found this, not sure why it never made it into
> a gitlab issue:
> https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=45986#c4
> 
> Slightly shorter reproducer:
> 
> cat << EOF | ./qemu-system-i386 -display none -machine accel=qtest, -m \
> 512M -nodefaults -device sdhci-pci -device sd-card,drive=mydrive -drive \
> if=none,index=0,file=null-co://,format=raw,id=mydrive -nographic -qtest \
> stdio
> outl 0xcf8 0x80001010
> outl 0xcfc 0xe000
> outl 0xcf8 0x80001001
> outl 0xcfc 0x0600
> write 0xe058 0x1 0x6e
> write 0xe059 0x1 0x5a
> write 0xe028 0x1 0x10
> write 0xe02c 0x1 0x05
> write 0x5a6e 0x1 0x21
> write 0x5a75 0x1 0x20
> write 0xe005 0x1 0x02
> write 0xe00c 0x1 0x01
> write 0xe00e 0x1 0x20
> write 0xe00f 0x1 0x00
> write 0xe00c 0x1 0x00
> write 0xe020 0x1 0x00
> EOF

Hi Philippe,
I ran the fuzzer with this series applied and it found:

cat << EOF | ./qemu-system-i386 -display none -machine accel=qtest, -m \
512M -nodefaults -device sdhci-pci -device sd-card,drive=mydrive -drive \
if=none,index=0,file=null-co://,format=raw,id=mydrive -nographic -qtest \
stdio
outl 0xcf8 0x80001010
outl 0xcfc 0xe000
outl 0xcf8 0x80001004
outw 0xcfc 0x06
write 0xe028 0x1 0x08
write 0xe02c 0x1 0x05
write 0xe005 0x1 0x02
write 0x0 0x1 0x21
write 0x3 0x1 0x20
write 0xe00c 0x1 0x01
write 0xe00e 0x1 0x20
write 0xe00f 0x1 0x00
write 0xe00c 0x1 0x20
write 0xe020 0x1 0x00
EOF

The crash seems very similar, but it looks like instead of
SDHC_TRNS_READ this reproducer sets SDHC_TRNS_MULTI
-Alex



Re: [PATCH v3 06/17] vfio/migration: Fix NULL pointer dereference bug

2022-11-08 Thread Vladimir Sementsov-Ogievskiy

On 11/3/22 19:16, Avihai Horon wrote:

As part of its error flow, vfio_vmstate_change() accesses
MigrationState->to_dst_file without any checks. This can cause a NULL
pointer dereference if the error flow is taken and
MigrationState->to_dst_file is not set.

For example, this can happen if VM is started or stopped not during
migration and vfio_vmstate_change() error flow is taken, as
MigrationState->to_dst_file is not set at that time.

Fix it by checking that MigrationState->to_dst_file is set before using
it.

Fixes: 02a7e71b1e5b ("vfio: Add VM state change handler to know state of VM")
Signed-off-by: Avihai Horon
Reviewed-by: Juan Quintela



Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir
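
The patch body itself is not quoted in this archive; the following is only a
minimal sketch of the guard described in the commit message, assuming the error
is propagated with qemu_file_set_error():

static void vfio_vmstate_change_report_error(int ret)
{
    MigrationState *ms = migrate_get_current();

    /* Only touch to_dst_file when a migration stream actually exists. */
    if (ms->to_dst_file) {
        qemu_file_set_error(ms->to_dst_file, ret);
    }
}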




Re: [PATCH v3 05/17] vfio/migration: Fix wrong enum usage

2022-11-08 Thread Vladimir Sementsov-Ogievskiy

On 11/3/22 19:16, Avihai Horon wrote:

vfio_migration_init() initializes VFIOMigration->device_state using enum
of VFIO migration protocol v2. Current implemented protocol is v1 so v1
enum should be used. Fix it.

Fixes: 429c72800654 ("vfio/migration: Fix incorrect initialization value for 
parameters in VFIOMigration")
Signed-off-by: Avihai Horon
Reviewed-by: Zhang Chen


The commit is already in the master branch.

--
Best regards,
Vladimir




Re: [PATCH v3 04/17] migration: Simplify migration_iteration_run()

2022-11-08 Thread Vladimir Sementsov-Ogievskiy

On 11/3/22 19:16, Avihai Horon wrote:

From: Juan Quintela 

Signed-off-by: Juan Quintela 
Signed-off-by: Avihai Horon 
---
  migration/migration.c | 25 +
  1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index ffe868b86f..59cc3c309b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3743,23 +3743,24 @@ static MigIterateState 
migration_iteration_run(MigrationState *s)
  
  trace_migrate_pending(pending_size, s->threshold_size, pend_pre, pend_post);
  
-if (pending_size && pending_size >= s->threshold_size) {

-/* Still a significant amount to transfer */
-if (!in_postcopy && pend_pre <= s->threshold_size &&
-qatomic_read(&s->start_postcopy)) {
-if (postcopy_start(s)) {
-error_report("%s: postcopy failed to start", __func__);
-}
-return MIG_ITERATE_SKIP;
-}
-/* Just another iteration step */
-qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
-} else {
+
+if (pending_size < s->threshold_size) {


Is corner case "pending_size == s->threshold_size == 0" theoretically possible 
here? In this case prepatch we go to completion. Afterpatch we go to next iteration..
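
To spell that corner case out, a standalone sketch (not QEMU code) of the two
predicates:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t pending_size = 0, threshold_size = 0; /* the corner case above */

    /* Pre-patch condition: anything *not* matching it goes to completion. */
    bool completes_before = !(pending_size && pending_size >= threshold_size);
    /* Post-patch condition: completion only when pending < threshold. */
    bool completes_after = pending_size < threshold_size;

    printf("%d %d\n", completes_before, completes_after); /* prints "1 0" */
    return 0;
}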


  trace_migration_thread_low_pending(pending_size);
  migration_completion(s);
  return MIG_ITERATE_BREAK;
  }
  
+/* Still a significant amount to transfer */

+if (!in_postcopy && pend_pre <= s->threshold_size &&
+qatomic_read(&s->start_postcopy)) {
+if (postcopy_start(s)) {
+error_report("%s: postcopy failed to start", __func__);
+}
+return MIG_ITERATE_SKIP;
+}
+
+/* Just another iteration step */
+qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
  return MIG_ITERATE_RESUME;
  }
  


--
Best regards,
Vladimir




Re: [PATCH v3 03/17] migration: Block migration comment or code is wrong

2022-11-08 Thread Vladimir Sementsov-Ogievskiy

On 11/8/22 21:36, Vladimir Sementsov-Ogievskiy wrote:

On 11/3/22 19:16, Avihai Horon wrote:

From: Juan Quintela 

And it appears that what is wrong is the code. During bulk stage we
need to make sure that some block is dirty, but no games with
max_size at all.


:) That made me curious why we need this one block, so I decided to search
through the history.

And what do I see? Haha, that was my commit 04636dc410b163c "migration/block:
fix pending() return value" [1], which you actually revert with this patch.

So, at the very least, we should note that this is a revert of [1].

Still, this will reintroduce the bug fixed by [1].

As I understand it, the problem is (was) that in block_save_complete() we
finalize only dirty blocks, but don't finalize the bulk phase if it isn't
finished yet. So we could fix block_save_complete() to finalize the bulk phase
instead of hacking with pending in [1].

Interesting: why do we need this one block, described in the comment you refer
to? Was it an incomplete workaround for the same problem described in [1]? If
so, we can fix block_save_complete() and remove this if(), together with the
comment, from block_save_pending().



PS: Don't we want to deprecate block migration? Is it really used in
production? block-mirror is the recommended way to migrate block devices.

--
Best regards,
Vladimir




Re: [PATCH v3 03/17] migration: Block migration comment or code is wrong

2022-11-08 Thread Vladimir Sementsov-Ogievskiy

On 11/3/22 19:16, Avihai Horon wrote:

From: Juan Quintela 

And it appears that what is wrong is the code. During bulk stage we
need to make sure that some block is dirty, but no games with
max_size at all.


:) That made me curious why we need this one block, so I decided to search
through the history.

And what do I see? Haha, that was my commit 04636dc410b163c "migration/block:
fix pending() return value" [1], which you actually revert with this patch.

So, at the very least, we should note that this is a revert of [1].

Still, this will reintroduce the bug fixed by [1].

As I understand it, the problem is (was) that in block_save_complete() we
finalize only dirty blocks, but don't finalize the bulk phase if it isn't
finished yet. So we could fix block_save_complete() to finalize the bulk phase
instead of hacking with pending in [1].

Interesting: why do we need this one block, described in the comment you refer
to? Was it an incomplete workaround for the same problem described in [1]? If
so, we can fix block_save_complete() and remove this if(), together with the
comment, from block_save_pending().



Signed-off-by: Juan Quintela 
Reviewed-by: Stefan Hajnoczi 
---
  migration/block.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/migration/block.c b/migration/block.c
index b3d680af75..39ce4003c6 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -879,8 +879,8 @@ static void block_save_pending(void *opaque, uint64_t 
max_size,
  blk_mig_unlock();
  
  /* Report at least one block pending during bulk phase */

-if (pending <= max_size && !block_mig_state.bulk_completed) {
-pending = max_size + BLK_MIG_BLOCK_SIZE;
+if (!pending && !block_mig_state.bulk_completed) {
+pending = BLK_MIG_BLOCK_SIZE;
  }
  
  trace_migration_block_save_pending(pending);


--
Best regards,
Vladimir




[PULL 3/3] Revert "hw/block/pflash_cfi: Error out if dev length isn't power of 2"

2022-11-08 Thread Philippe Mathieu-Daudé
From: Daniel Henrique Barboza 

Commit 334c388f25 ("pflash_cfi: Error out if device length
isn't a power of two") aimed to finish the effort started by
commit 06f1521795 ("pflash: Require backend size to match device,
improve errors"), but unfortunately we are not quite there since
various machines are still ready to accept incomplete / oversized
pflash backend images, and now fail, i.e. on Debian bullseye:

 $ qemu-system-x86_64 \
   -drive \
   if=pflash,format=raw,unit=0,readonly=on,file=/usr/share/OVMF/OVMF_CODE.fd
 qemu-system-x86_64: Device size must be a power of two.

where OVMF_CODE.fd comes from the ovmf package, which doesn't
pad the firmware images to the flash size:

 $ ls -lh /usr/share/OVMF/
 -rw-r--r-- 1 root root 3.5M Aug 19  2021 OVMF_CODE_4M.fd
 -rw-r--r-- 1 root root 1.9M Aug 19  2021 OVMF_CODE.fd
 -rw-r--r-- 1 root root 128K Aug 19  2021 OVMF_VARS.fd

Since we entered the freeze period to prepare the v7.2.0 release,
the safest is to revert commit 334c388f25707a234c4a0dea05b9df08d.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1294
Signed-off-by: Philippe Mathieu-Daudé 
Message-Id: <20221108175755.95141-1-phi...@linaro.org>
Signed-off-by: Daniel Henrique Barboza 
Message-Id: <20221108172633.860700-1-danielhb...@gmail.com>
---
 hw/block/pflash_cfi01.c | 8 ++--
 hw/block/pflash_cfi02.c | 5 -
 2 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
index 9c235bf66e..0cbc2fb4cb 100644
--- a/hw/block/pflash_cfi01.c
+++ b/hw/block/pflash_cfi01.c
@@ -690,7 +690,7 @@ static const MemoryRegionOps pflash_cfi01_ops = {
 .endianness = DEVICE_NATIVE_ENDIAN,
 };
 
-static void pflash_cfi01_fill_cfi_table(PFlashCFI01 *pfl, Error **errp)
+static void pflash_cfi01_fill_cfi_table(PFlashCFI01 *pfl)
 {
 uint64_t blocks_per_device, sector_len_per_device, device_len;
 int num_devices;
@@ -708,10 +708,6 @@ static void pflash_cfi01_fill_cfi_table(PFlashCFI01 *pfl, 
Error **errp)
 sector_len_per_device = pfl->sector_len / num_devices;
 }
 device_len = sector_len_per_device * blocks_per_device;
-if (!is_power_of_2(device_len)) {
-error_setg(errp, "Device size must be a power of two.");
-return;
-}
 
 /* Hardcoded CFI table */
 /* Standard "QRY" string */
@@ -869,7 +865,7 @@ static void pflash_cfi01_realize(DeviceState *dev, Error 
**errp)
  */
 pfl->cmd = 0x00;
 pfl->status = 0x80; /* WSM ready */
-pflash_cfi01_fill_cfi_table(pfl, errp);
+pflash_cfi01_fill_cfi_table(pfl);
 }
 
 static void pflash_cfi01_system_reset(DeviceState *dev)
diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c
index ff2fe154c1..2a99b286b0 100644
--- a/hw/block/pflash_cfi02.c
+++ b/hw/block/pflash_cfi02.c
@@ -880,11 +880,6 @@ static void pflash_cfi02_realize(DeviceState *dev, Error 
**errp)
 return;
 }
 
-if (!is_power_of_2(pfl->chip_len)) {
-error_setg(errp, "Device size must be a power of two.");
-return;
-}
-
 memory_region_init_rom_device(&pfl->orig_mem, OBJECT(pfl),
                               &pflash_cfi02_ops, pfl, pfl->name,
                               pfl->chip_len, errp);
-- 
2.38.1




[PULL 2/3] hw/sd/sdhci: Do not set Buf Wr Ena before writing block (CVE-2022-3872)

2022-11-08 Thread Philippe Mathieu-Daudé
When sdhci_write_block_to_card() is called to transfer data from
the FIFO to the SD bus, the data is already present in the buffer
and we have to consume it directly.

See the description of the 'Buffer Write Enable' bit from the
'Present State' register (prnsts::SDHC_SPACE_AVAILABLE) in Table
2.14 from the SDHCI spec v2:

  Buffer Write Enable

  This status is used for non-DMA write transfers.

  The Host Controller can implement multiple buffers to transfer
  data efficiently. This read only flag indicates if space is
  available for write data. If this bit is 1, data can be written
  to the buffer. A change of this bit from 1 to 0 occurs when all
  the block data is written to the buffer. A change of this bit
  from 0 to 1 occurs when top of block data can be written to the
  buffer and generates the Buffer Write Ready interrupt.

In our case, we do not want to overwrite the buffer, so we want
this bit to be 0, then set it to 1 once the data is written onto
the bus.

This is probably a copy/paste error from commit d7dfca0807
("hw/sdhci: introduce standard SD host controller").

OSS-Fuzz Report: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=45986#c4

Reproducers:

  $ cat << EOF | \
 qemu-system-x86_64 -nodefaults -display none -machine accel=qtest \
   -m 512M  -device sdhci-pci -device sd-card,drive=mydrive \
   -drive if=none,index=0,file=null-co://,format=raw,id=mydrive \
   -nographic -qtest stdio
  outl 0xcf8 0x80001010
  outl 0xcfc 0xe000
  outl 0xcf8 0x80001001
  outl 0xcfc 0x0600
  write 0xe058 0x1 0x6e
  write 0xe059 0x1 0x5a
  write 0xe028 0x1 0x10
  write 0xe02c 0x1 0x05
  write 0x5a6e 0x1 0x21
  write 0x5a75 0x1 0x20
  write 0xe005 0x1 0x02
  write 0xe00c 0x1 0x01
  write 0xe00e 0x1 0x20
  write 0xe00f 0x1 0x00
  write 0xe00c 0x1 0x00
  write 0xe020 0x1 0x00
  EOF

or 
https://lore.kernel.org/qemu-devel/caa8xkjxrms0fkr28akvnnpyatm0y0b+5fichpsrhd+mugnu...@mail.gmail.com/

Fixes: CVE-2022-3872
Reported-by: RivenDell 
Reported-by: Siqi Chen 
Reported-by: ningqiang 
Reported-by: ClusterFuzz
Signed-off-by: Philippe Mathieu-Daudé 
Tested-by: Mauro Matteo Cascella 
Message-Id: <20221107221236.47841-2-phi...@linaro.org>
---
 hw/sd/sdhci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/sd/sdhci.c b/hw/sd/sdhci.c
index 306070c872..f230e7475f 100644
--- a/hw/sd/sdhci.c
+++ b/hw/sd/sdhci.c
@@ -954,7 +954,7 @@ static void sdhci_data_transfer(void *opaque)
 sdhci_read_block_from_card(s);
 } else {
 s->prnsts |= SDHC_DOING_WRITE | SDHC_DAT_LINE_ACTIVE |
-SDHC_SPACE_AVAILABLE | SDHC_DATA_INHIBIT;
+   SDHC_DATA_INHIBIT;
 sdhci_write_block_to_card(s);
 }
 }
-- 
2.38.1




[PULL 0/3] Memory/SDHCI/ParallelFlash patches for v7.2.0-rc0

2022-11-08 Thread Philippe Mathieu-Daudé
The following changes since commit ade760a2f63804b7ab1839fbc3e5ddbf30538718:

  Merge tag 'pull-request-2022-11-08' of https://gitlab.com/thuth/qemu into 
staging (2022-11-08 11:34:06 -0500)

are available in the Git repository at:

  https://github.com/philmd/qemu.git tags/memflash-20221108

for you to fetch changes up to cf9b3efd816518f9f210f50a0fa3e46a00b33c27:

  Revert "hw/block/pflash_cfi: Error out if dev length isn't power of 2" 
(2022-11-08 19:29:25 +0100)


Memory/SDHCI/ParallelFlash patches queue

- Fix wrong end address dump in 'info mtree' (Zhenzhong Duan)
- Fix in SDHCI for CVE-2022-3872 (myself)
- Revert latest pflash check of underlying block size (Daniel
  Henrique Barboza & myself)



Daniel Henrique Barboza (1):
  Revert "hw/block/pflash_cfi: Error out if dev length isn't power of 2"

Philippe Mathieu-Daudé (1):
  hw/sd/sdhci: Do not set Buf Wr Ena before writing block
(CVE-2022-3872)

Zhenzhong Duan (1):
  memory: Fix wrong end address dump

 hw/block/pflash_cfi01.c | 8 ++--
 hw/block/pflash_cfi02.c | 5 -
 hw/sd/sdhci.c   | 2 +-
 softmmu/physmem.c   | 2 +-
 4 files changed, 4 insertions(+), 13 deletions(-)

-- 
2.38.1




[PULL 1/3] memory: Fix wrong end address dump

2022-11-08 Thread Philippe Mathieu-Daudé
From: Zhenzhong Duan 

The end address of memory region section isn't correctly calculated
which leads to overflowed mtree dump:

  Dispatch
Physical sections
  ..
  #70 @2000..00011fff io [ROOT]
  #71 @5000..5fff (noname)
  #72 @5000..00014fff io [ROOT]
  #73 @5658..5658 vmport
  #74 @5659..00015658 io [ROOT]
  #75 @6000..00015fff io [ROOT]

After fix:
  #70 @2000..4fff io [ROOT]
  #71 @5000..5fff (noname)
  #72 @5000..5657 io [ROOT]
  #73 @5658..5658 vmport
  #74 @5659..5fff io [ROOT]
  #75 @6000.. io [ROOT]

Fixes: 5e8fd947e2670 ("memory: Rework "info mtree" to print flat views and 
dispatch trees")
Signed-off-by: Zhenzhong Duan 
Reviewed-by: David Hildenbrand 
Reviewed-by: Peter Xu 
Reviewed-by: Philippe Mathieu-Daudé 
Message-Id: <20220622095912.3430583-1-zhenzhong.d...@intel.com>
Signed-off-by: Philippe Mathieu-Daudé 
---
 softmmu/physmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index d9578ccfd4..1b606a3002 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -3712,7 +3712,7 @@ void mtree_print_dispatch(AddressSpaceDispatch *d, 
MemoryRegion *root)
 " %s%s%s%s%s",
 i,
 s->offset_within_address_space,
-s->offset_within_address_space + MR_SIZE(s->mr->size),
+s->offset_within_address_space + MR_SIZE(s->size),
 s->mr->name ? s->mr->name : "(noname)",
 i < ARRAY_SIZE(names) ? names[i] : "",
 s->mr == root ? " [ROOT]" : "",
-- 
2.38.1
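
To make the arithmetic behind the one-line fix concrete, here is a standalone
sketch; the sizes are assumed from the #70 entry in the dumps above, and
MR_SIZE() is simplified to "size - 1" for a non-empty range:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define MR_SIZE(size) ((size) - 1) /* simplified: end offset of a non-empty range */

int main(void)
{
    uint64_t offset   = 0x2000;  /* s->offset_within_address_space (assumed)  */
    uint64_t mr_size  = 0x10000; /* whole MemoryRegion, s->mr->size (assumed) */
    uint64_t sec_size = 0x3000;  /* this section only, s->size (assumed)      */

    printf("before: %" PRIx64 "\n", offset + MR_SIZE(mr_size));  /* 11fff: spills past the section */
    printf("after:  %" PRIx64 "\n", offset + MR_SIZE(sec_size)); /* 4fff:  matches the fixed dump  */
    return 0;
}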




Re: [PATCH-for-7.2 0/2] hw/sd/sdhci: Do not set Buf Wr Ena before writing block (CVE-2022-3872)

2022-11-08 Thread Stefan Hajnoczi
Applied to the staging tree. Thanks!

Stefan



Re: [PATCH-for-7.2] Revert "hw/block/pflash_cfi: Error out if dev length isn't power of 2"

2022-11-08 Thread Stefan Hajnoczi
Applied to staging. Thanks!

Stefan



Re: [PATCH v3 02/17] migration: No save_live_pending() method uses the QEMUFile parameter

2022-11-08 Thread Vladimir Sementsov-Ogievskiy

On 11/3/22 19:16, Avihai Horon wrote:

From: Juan Quintela

So remove it everywhere.

Signed-off-by: Juan Quintela


Reviewed-by: Vladimir Sementsov-Ogievskiy 


--
Best regards,
Vladimir




[PATCH-for-7.2] Revert "hw/block/pflash_cfi: Error out if dev length isn't power of 2"

2022-11-08 Thread Philippe Mathieu-Daudé
Commit 334c388f25 ("pflash_cfi: Error out if device length
isn't a power of two") aimed to finish the effort started by
commit 06f1521795 ("pflash: Require backend size to match device,
improve errors"), but unfortunately we are not quite there since
various machines are still ready to accept incomplete / oversized
pflash backend images, and now fail, i.e. on Debian bullseye:

 $ qemu-system-x86_64 \
   -drive \
   if=pflash,format=raw,unit=0,readonly=on,file=/usr/share/OVMF/OVMF_CODE.fd
 qemu-system-x86_64: Device size must be a power of two.

where OVMF_CODE.fd comes from the ovmf package, which doesn't
pad the firmware images to the flash size:

 $ ls -lh /usr/share/OVMF/
 -rw-r--r-- 1 root root 3.5M Aug 19  2021 OVMF_CODE_4M.fd
 -rw-r--r-- 1 root root 1.9M Aug 19  2021 OVMF_CODE.fd
 -rw-r--r-- 1 root root 128K Aug 19  2021 OVMF_VARS.fd

Since we entered the freeze period to prepare the v7.2.0 release,
the safest is to revert commit 334c388f25707a234c4a0dea05b9df08d.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1294
Signed-off-by: Philippe Mathieu-Daudé 
---
Cc: Sunil V L 
Cc: Daniel Henrique Barboza 
Cc: Markus Armbruster 
Cc: Bernhard Beschow 
---
 hw/block/pflash_cfi01.c | 8 ++--
 hw/block/pflash_cfi02.c | 5 -
 2 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
index 9c235bf66e..0cbc2fb4cb 100644
--- a/hw/block/pflash_cfi01.c
+++ b/hw/block/pflash_cfi01.c
@@ -690,7 +690,7 @@ static const MemoryRegionOps pflash_cfi01_ops = {
 .endianness = DEVICE_NATIVE_ENDIAN,
 };
 
-static void pflash_cfi01_fill_cfi_table(PFlashCFI01 *pfl, Error **errp)
+static void pflash_cfi01_fill_cfi_table(PFlashCFI01 *pfl)
 {
 uint64_t blocks_per_device, sector_len_per_device, device_len;
 int num_devices;
@@ -708,10 +708,6 @@ static void pflash_cfi01_fill_cfi_table(PFlashCFI01 *pfl, 
Error **errp)
 sector_len_per_device = pfl->sector_len / num_devices;
 }
 device_len = sector_len_per_device * blocks_per_device;
-if (!is_power_of_2(device_len)) {
-error_setg(errp, "Device size must be a power of two.");
-return;
-}
 
 /* Hardcoded CFI table */
 /* Standard "QRY" string */
@@ -869,7 +865,7 @@ static void pflash_cfi01_realize(DeviceState *dev, Error 
**errp)
  */
 pfl->cmd = 0x00;
 pfl->status = 0x80; /* WSM ready */
-pflash_cfi01_fill_cfi_table(pfl, errp);
+pflash_cfi01_fill_cfi_table(pfl);
 }
 
 static void pflash_cfi01_system_reset(DeviceState *dev)
diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c
index ff2fe154c1..2a99b286b0 100644
--- a/hw/block/pflash_cfi02.c
+++ b/hw/block/pflash_cfi02.c
@@ -880,11 +880,6 @@ static void pflash_cfi02_realize(DeviceState *dev, Error 
**errp)
 return;
 }
 
-if (!is_power_of_2(pfl->chip_len)) {
-error_setg(errp, "Device size must be a power of two.");
-return;
-}
-
 memory_region_init_rom_device(&pfl->orig_mem, OBJECT(pfl),
                               &pflash_cfi02_ops, pfl, pfl->name,
                               pfl->chip_len, errp);
-- 
2.38.1




Re: [PATCH v3 01/17] migration: Remove res_compatible parameter

2022-11-08 Thread Vladimir Sementsov-Ogievskiy

On 11/3/22 19:16, Avihai Horon wrote:

From: Juan Quintela 

It was only used for RAM, and in that case, it means that this amount
of data was sent for memory. 


It's not clear to me what "this amount of data was sent for memory" means...
That amount of data has not actually been sent yet.


Just delete the field in all callers.

Signed-off-by: Juan Quintela 
---
  hw/s390x/s390-stattrib.c   |  6 ++
  hw/vfio/migration.c| 10 --
  hw/vfio/trace-events   |  2 +-
  include/migration/register.h   | 20 ++--
  migration/block-dirty-bitmap.c |  7 +++
  migration/block.c  |  7 +++
  migration/migration.c  |  9 -
  migration/ram.c|  8 +++-
  migration/savevm.c | 14 +-
  migration/savevm.h |  4 +---
  migration/trace-events |  2 +-
  11 files changed, 37 insertions(+), 52 deletions(-)



[..]


diff --git a/include/migration/register.h b/include/migration/register.h
index c1dcff0f90..1950fee6a8 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -48,18 +48,18 @@ typedef struct SaveVMHandlers {
  int (*save_setup)(QEMUFile *f, void *opaque);
  void (*save_live_pending)(QEMUFile *f, void *opaque,
uint64_t threshold_size,
-  uint64_t *res_precopy_only,
-  uint64_t *res_compatible,
-  uint64_t *res_postcopy_only);
+  uint64_t *rest_precopy,
+  uint64_t *rest_postcopy);
  /* Note for save_live_pending:
- * - res_precopy_only is for data which must be migrated in precopy phase
- * or in stopped state, in other words - before target vm start
- * - res_compatible is for data which may be migrated in any phase
- * - res_postcopy_only is for data which must be migrated in postcopy phase
- * or in stopped state, in other words - after source vm stop
+ * - res_precopy is for data which must be migrated in precopy
+ * phase or in stopped state, in other words - before target
+ * vm start
+ * - res_postcopy is for data which must be migrated in postcopy
+ * phase or in stopped state, in other words - after source vm
+ * stop
   *
- * Sum of res_postcopy_only, res_compatible and res_postcopy_only is the
- * whole amount of pending data.
+ * Sum of res_precopy and res_postcopy is the whole amount of
+ * pending data.
   */
  
  


[..]


diff --git a/migration/ram.c b/migration/ram.c
index dc1de9ddbc..20167e1102 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3435,9 +3435,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
  }
  
  static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,

- uint64_t *res_precopy_only,
- uint64_t *res_compatible,
- uint64_t *res_postcopy_only)
+ uint64_t *res_precopy, uint64_t *res_postcopy)
  {
  RAMState **temp = opaque;
  RAMState *rs = *temp;
@@ -3457,9 +3455,9 @@ static void ram_save_pending(QEMUFile *f, void *opaque, 
uint64_t max_size,
  
  if (migrate_postcopy_ram()) {

  /* We can do postcopy, and all the data is postcopiable */
-*res_compatible += remaining_size;
+*res_postcopy += remaining_size;


That seems not quite correct.

res_postcopy is defined as "data which must be migrated in postcopy", but 
that's not true here, as RAM can be migrated both in precopy and postcopy.

Still, we really can fold "compat" into "postcopy", simply because the logic of
migration_iteration_run() doesn't actually distinguish "compat" and "post"; it
only depends on "total" and "pre".

So, if we want to combine "compat" into "post", we should redefine "post" in 
the comment in include/migration/register.h, something like this:

- res_precopy is for data which MUST be migrated in precopy
  phase or in stopped state, in other words - before target
  vm start

- res_postcopy is for all data except for declared in res_precopy.
  res_postcopy data CAN be migrated in postcopy, i.e. after target
  vm start.



  } else {
-*res_precopy_only += remaining_size;
+*res_precopy += remaining_size;
  }
  }
  



--
Best regards,
Vladimir




[PATCH] block: m25p80: fix dummy byte count read from spansion cfg register

2022-11-08 Thread Ramon Aerne
Spansion NOR flash stores the dummy read count in bits in a config register.
This is currently read and used as the byte count, which is wrong.

This patch fixes the bit-to-byte conversion, without warning about
unsupported configurations (such as bits % 8 != 0).

Signed-off-by: Ramon Aerne 
---
 hw/block/m25p80.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c
index 02adc87..cf10b11 100644
--- a/hw/block/m25p80.c
+++ b/hw/block/m25p80.c
@@ -981,7 +981,7 @@ static void decode_fast_read_cmd(Flash *s)
 s->needed_bytes += extract32(s->spansion_cr2v,
 SPANSION_DUMMY_CLK_POS,
 SPANSION_DUMMY_CLK_LEN
-);
+) / 8;
 break;
 case MAN_ISSI:
 /*
@@ -1017,7 +1017,7 @@ static void decode_dio_read_cmd(Flash *s)
 s->needed_bytes += extract32(s->spansion_cr2v,
 SPANSION_DUMMY_CLK_POS,
 SPANSION_DUMMY_CLK_LEN
-);
+) / 8;
 break;
 case MAN_NUMONYX:
 s->needed_bytes += numonyx_extract_cfg_num_dummies(s);
@@ -1067,7 +1067,7 @@ static void decode_qio_read_cmd(Flash *s)
 s->needed_bytes += extract32(s->spansion_cr2v,
 SPANSION_DUMMY_CLK_POS,
 SPANSION_DUMMY_CLK_LEN
-);
+) / 8;
 break;
 case MAN_NUMONYX:
 s->needed_bytes += numonyx_extract_cfg_num_dummies(s);
-- 
2.25.1




Re: [PATCH-for-7.2 1/2] hw/sd/sdhci: Do not set Buf Wr Ena before writing block (CVE-2022-3872)

2022-11-08 Thread Alexander Bulekov
On 221107 2312, Philippe Mathieu-Daudé wrote:
> When sdhci_write_block_to_card() is called to transfer data from
> the FIFO to the SD bus, the data is already present in the buffer
> and we have to consume it directly.
> 
> See the description of the 'Buffer Write Enable' bit from the
> 'Present State' register (prnsts::SDHC_SPACE_AVAILABLE) in Table
> 2.14 from the SDHCI spec v2:
> 
>   Buffer Write Enable
> 
>   This status is used for non-DMA write transfers.
> 
>   The Host Controller can implement multiple buffers to transfer
>   data efficiently. This read only flag indicates if space is
>   available for write data. If this bit is 1, data can be written
>   to the buffer. A change of this bit from 1 to 0 occurs when all
>   the block data is written to the buffer. A change of this bit
>   from 0 to 1 occurs when top of block data can be written to the
>   buffer and generates the Buffer Write Ready interrupt.
> 
> In our case, we do not want to overwrite the buffer, so we want
> this bit to be 0, then set it to 1 once the data is written onto
> the bus.
> 
> This is probably a copy/paste error from commit d7dfca0807
> ("hw/sdhci: introduce standard SD host controller").
> 
> Reproducer:
> https://lore.kernel.org/qemu-devel/caa8xkjxrms0fkr28akvnnpyatm0y0b+5fichpsrhd+mugnu...@mail.gmail.com/
> 
> Fixes: CVE-2022-3872
> Reported-by: RivenDell 
> Reported-by: Siqi Chen 
> Reported-by: ningqiang 
> Signed-off-by: Philippe Mathieu-Daudé 

Seems like OSS-Fuzz also found this, not sure why it never made it into
a gitlab issue:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=45986#c4

Slightly shorter reproducer:

cat << EOF | ./qemu-system-i386 -display none -machine accel=qtest, -m \
512M -nodefaults -device sdhci-pci -device sd-card,drive=mydrive -drive \
if=none,index=0,file=null-co://,format=raw,id=mydrive -nographic -qtest \
stdio
outl 0xcf8 0x80001010
outl 0xcfc 0xe000
outl 0xcf8 0x80001001
outl 0xcfc 0x0600
write 0xe058 0x1 0x6e
write 0xe059 0x1 0x5a
write 0xe028 0x1 0x10
write 0xe02c 0x1 0x05
write 0x5a6e 0x1 0x21
write 0x5a75 0x1 0x20
write 0xe005 0x1 0x02
write 0xe00c 0x1 0x01
write 0xe00e 0x1 0x20
write 0xe00f 0x1 0x00
write 0xe00c 0x1 0x00
write 0xe020 0x1 0x00
EOF





Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-08 Thread Alex Bennée


"Michael S. Tsirkin"  writes:

> On Tue, Nov 08, 2022 at 11:21:26AM +, Alex Bennée wrote:
>> 
>> "Michael S. Tsirkin"  writes:
>> 
>> > On Tue, Nov 08, 2022 at 10:23:15AM +, Alex Bennée wrote:
>> >> 
>> >> "Michael S. Tsirkin"  writes:
>> >> 
>> >> > On Tue, Nov 08, 2022 at 09:23:04AM +, Alex Bennée wrote:
>> >> >> The previous fix to virtio_device_started revealed a problem in its
>> >> >> use by both the core and the device code. The core code should be able
>> >> >> to handle the device "starting" while the VM isn't running to handle
>> >> >> the restoration of migration state. To solve this dual use introduce a
>> >> >> new helper for use by the vhost-user backends who all use it to feed a
>> >> >> should_start variable.
>> >> >> 
>> >> >> We can also pick up a change to vhost_user_blk_set_status while we are at
>> >> >> it which follows the same pattern.
>> >> >> 
>> >> >> Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to 
>> >> >> virtio_device_started)
>> >> >> Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio 
>> >> >> device)
>> >> >> Signed-off-by: Alex Bennée 
>> >> >> Cc: "Michael S. Tsirkin" 
>> >> >
>> >> > why is this in this patchset?
>> >> 
>> >> As per my cover letter:
>> >> 
>> >>   Most of these patches have been posted before as single patch RFCs. A
>> >>   couple are already scheduled through other trees so will drop out in
>> >>   due course
>> >> 
>> >> but I keep them in my tree until they are merged so I can continue to
>> >> soak test them (and have a stable base for my other WIP trees).
>> >
>> > That's fine just pls don't double-post them on list, certainly
>> > not as part of a patchset.
>> 
>> Why not? Is this breaking some tooling?
>
> Yes patchset breaks git am if you try to apply part of it.
>
> Reposting creates work for reviewers - why should they have to read the same
> patch twice?  In this case it also made me scratch my head trying to
> figure out what to do about it.
>
> But, if you are careful and maintain an ordered changelog after "---"
> and there it says 
>   changes since rfc:
>   no changes, subject changed 
>
> then this second part is less of a problem

Ahh yes, I should have updated it to point out that I added the extra Fixes line
as per the review. I guess you added that in your PR? Anyway, it's
dropped now that your PR has gone in.

-- 
Alex Bennée



Re: [PATCH v2 2/9] block-copy: add missing coroutine_fn annotations

2022-11-08 Thread Vladimir Sementsov-Ogievskiy

On 11/8/22 19:19, Vladimir Sementsov-Ogievskiy wrote:

This is a lot better than our "coroutine_fn" sign, which actually does no check
(and can't). Don't you plan to swap the "coroutine_fn" no-op marker with a more
meaningful IN_COROUTINE(); (or something like this, which just does
assert(qemu_in_coroutine())) at the start of the function? It would be a lot safer.



Moreover, we can introduce two macros:

IN_COROUTINE() and NOT_COROUTINE(), mark functions correspondingly and drop the
coroutine_fn mark. Then the picture would be very deterministic:

IN_COROUTINE - function is called only from coroutine context
NOT_COROUTINE - function is never called from coroutine context
(no mark) - function may be called from both coroutine and non-coroutine context
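
Just to illustrate the idea, a minimal sketch (nothing like this exists yet, the
names are of course up for discussion):

/* Sketch only: turn the documentation-only marker into a runtime check. */
#define IN_COROUTINE()  assert(qemu_in_coroutine())
#define NOT_COROUTINE() assert(!qemu_in_coroutine())

static int do_something_that_may_yield(void)
{
    IN_COROUTINE();   /* may only be called from coroutine context */
    /* ... body that may call qemu_coroutine_yield() ... */
    return 0;
}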

--
Best regards,
Vladimir




Re: [PATCH v2 2/9] block-copy: add missing coroutine_fn annotations

2022-11-08 Thread Vladimir Sementsov-Ogievskiy

[add Stefan]


On 11/8/22 18:09, Emanuele Giuseppe Esposito wrote:



Am 08/11/2022 um 15:48 schrieb Vladimir Sementsov-Ogievskiy:

On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote:

These functions end up calling bdrv_common_block_status_above(), a
generated_co_wrapper function.


generated_co_wrapper is not a coroutine_fn. Conversely, it's a function that
does a classic coroutine wrapping - it starts a coroutine and does a POLL to
wait for the coroutine to finish.


In addition, they also happen to be always called in coroutine context,
meaning all callers are coroutine_fn.


That's also not a reason for marking them coroutine_fn. "coroutine_fn"
means that the function can be called only from coroutine context.


This means that the g_c_w function will enter the qemu_in_coroutine()
case and eventually suspend (or in other words call
qemu_coroutine_yield()).
Therefore we need to mark such functions coroutine_fn too.


I don't think so. Moreover, this breaks the concept, as your new
coroutine_fn functions will call generated_co_wrapper functions which
are not marked coroutine_fn and never were.


Theoretically not, 


Agree, I was wrong in this point


because marking it coroutine_fn we know that we are
going into the if (qemu_in_coroutine()) branch of the g_c_w, so we could
directly replace it with the actual function. Because it's a pain to do
it by hand, and at some point I/someone should use Alberto's static
analyzer to get rid of that, I decided to leave g_c_w there.

As I understand it, it seems that you and Paolo have a different
understanding of what coroutine_fn means and where it should be used.


Looks so...

But we have a documentation in coroutine.h, let's check:

 * Mark a function that executes in coroutine context
 *
 * Functions that execute in coroutine context cannot be called directly from
 * normal functions.  In the future it would be nice to enable compiler or
 * static checker support for catching such errors.  This annotation might make
 * it possible and in the meantime it serves as documentation.

Not very clear, but still it says:

 coroutine_fn = "function that executes in coroutine context"
 "functions that execute in coroutine context"  =  "cannot be called directly from 
normal functions"

So, IMHO that corresponds to my point of view: we shouldn't mark functions that
can be called from both coroutine and non-coroutine context with coroutine_fn.

If we want to change the concept, we should start with rewriting this 
documentation.


Honestly I don't understand your point, as you said


"coroutine_fn"
means that the function can be called only from coroutine context.


which is the case for these functions. So could you please clarify?

What I do know is that it's extremely confusing to understand if a
function that is *not* marked as coroutine_fn is actually being called
also from coroutines or not.

Which complicates A LOT whatever change or operation I want to perform
on the BlockDriver callbacks or any other function in the block layer,
because in the current approach for the AioContext lock removal (which
you are not aware of, I understand) we need to distinguish between
functions running in coroutine context and not, and throwing g_c_w
functions everywhere makes my work much harder than it already is.


OK, I understand the problem.

Hmm. Formally, marking a function that theoretically can be called from any
context with "coroutine_fn" doesn't break things. We just say that from that
moment on we don't allow calling this function from non-coroutine context.

OK, I tend to agree that you are on the right way, sorry for my skepticism)

PS: you recently introduced a lot of IO_CODE() / GLOBAL_STATE_CODE() marks, 
which (will be/already) turned into assertions.

This is a lot better than our "coroutine_fn" sign, which actually does no check
(and can't). Don't you plan to swap the "coroutine_fn" no-op marker with a more
meaningful IN_COROUTINE(); (or something like this, which just does
assert(qemu_in_coroutine())) at the start of the function? It would be a lot safer.




Thank you,
Emanuele





Signed-off-by: Emanuele Giuseppe Esposito 
---
   block/block-copy.c | 15 +--
   1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/block/block-copy.c b/block/block-copy.c
index bb947afdda..f33ab1d0b6 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -577,8 +577,9 @@ static coroutine_fn int
block_copy_task_entry(AioTask *task)
   return ret;
   }
   -static int block_copy_block_status(BlockCopyState *s, int64_t offset,
-   int64_t bytes, int64_t *pnum)
+static coroutine_fn int block_copy_block_status(BlockCopyState *s,
+    int64_t offset,
+    int64_t bytes,
int64_t *pnum)
   {
   int64_t num;
   BlockDriverState *base;
@@ -613,8 +614,9 @@ static int block_copy_block_status(BlockCopyState
*s, int64_t offset,
    * Check if the cluster starting at offset 

Re: [PATCH v2 1/9] block: call bdrv_co_drain_begin in a coroutine

2022-11-08 Thread Emanuele Giuseppe Esposito



Am 08/11/2022 um 16:52 schrieb Vladimir Sementsov-Ogievskiy:
> On 11/8/22 18:13, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 08/11/2022 um 15:33 schrieb Vladimir Sementsov-Ogievskiy:
>>> On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote:
 It seems that bdrv_open_driver() forgot to create a coroutine
 where to call bs->drv->bdrv_co_drain_begin(), a callback
 marked as coroutine_fn.

 Because there is no active I/O at this point, the coroutine
 should end right after entering, so the caller does not need
 to poll until it is finished.
>>>
>>> Hmm. I see your point. But isn't it better to go the generic way and use
>>> a generated coroutine wrapper? Nothing guarantees that
>>> .bdrv_co_drain_begin() handlers will never do any yield point even on
>>> driver open...
>>>
>>> Look for example at bdrv_co_check(). It has a generated wrapper
>>> bdrv_check(), declared in include/block/block-io.h
>>>
>>> So you just need to declare the wrapper, and use it in
>>> bdrv_open_driver(), the code would be clearer too.
>>
>> I think (and this is a repetition from my previous email) it confuses
>> the code. It is clear, but then you don't know if we are in a coroutine
>> context or not.
> 
> Hmm. But same thing is true for all callers of coroutine wrappers.
> 
> I agree that "coroutine wrapper" is a workaround for the design problem.
> Really, the fact that in many places we don't know whether we are in a
> coroutine or not is very uncomfortable.

And the only way to figure out if it's a coroutine or not is by adding
assertions and praying that the iotests don't fail *and* cover all cases.

> 
> But still, I'm not sure that it's good to avoid this workaround in one
> place and continue to use it in all other places. I think following
> common design is better. Or rework it deeply by getting rid of the
> wrappers somehow.

Well, one step at a time :) it's already difficult to verify that such a
replacement covers and is correct for all cases :)

> 
>>
>> I am very well aware of what you did with your script, and I am working
>> on extending your g_c_w class with g_c_w_simple, where we drop the
>> if(qemu_in_coroutine()) case and leave just the coroutine creation.
>> Therefore, in coroutine functions we use only the _co_ function; otherwise
>> we use g_c_w_simple.
>> In this way the code is clear, as you point out, but there is no confusion.
> 
> Hmm sounds good, I missed it. Why not use g_c_w_simple here then?

Because I came up with it this morning.

Thank you,
Emanuele

> 
>>
>> Thank you,
>> Emanuele
>>>

 Signed-off-by: Emanuele Giuseppe Esposito 
 Reviewed-by: Paolo Bonzini 
 ---
    block.c | 36 +++-
    1 file changed, 31 insertions(+), 5 deletions(-)

 diff --git a/block.c b/block.c
 index 5311b21f8e..d2b2800039 100644
 --- a/block.c
 +++ b/block.c
 @@ -1637,12 +1637,34 @@ out:
    g_free(gen_node_name);
    }
    +typedef struct DrainCo {
 +    BlockDriverState *bs;
 +    int ret;
 +} DrainCo;
 +
 +static void coroutine_fn bdrv_co_drain_begin(void *opaque)
 +{
 +    int i;
 +    DrainCo *co = opaque;
 +    BlockDriverState *bs = co->bs;
 +
 +    for (i = 0; i < bs->quiesce_counter; i++) {
 +    bs->drv->bdrv_co_drain_begin(bs);
 +    }
 +    co->ret = 0;
 +}
 +
    static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv,
    const char *node_name, QDict *options,
    int open_flags, Error **errp)
    {
    Error *local_err = NULL;
 -    int i, ret;
 +    int ret;
 +    Coroutine *co;
 +    DrainCo dco = {
 +    .bs = bs,
 +    .ret = NOT_DONE,
 +    };
    GLOBAL_STATE_CODE();
      bdrv_assign_node_name(bs, node_name, &local_err);
 @@ -1690,10 +1712,14 @@ static int bdrv_open_driver(BlockDriverState
 *bs, BlockDriver *drv,
    assert(bdrv_min_mem_align(bs) != 0);
    assert(is_power_of_2(bs->bl.request_alignment));
    -    for (i = 0; i < bs->quiesce_counter; i++) {
 -    if (drv->bdrv_co_drain_begin) {
 -    drv->bdrv_co_drain_begin(bs);
 -    }
 +    if (drv->bdrv_co_drain_begin) {
 +    co = qemu_coroutine_create(bdrv_co_drain_begin, &dco);
 +    qemu_coroutine_enter(co);
 +    /*
 + * There should be no reason for drv->bdrv_co_drain_begin to
 wait at
 + * this point, because the device does not have any active
 I/O.
 + */
 +    assert(dco.ret != NOT_DONE);
    }
      return 0;
>>>
>>
> 




Re: [PATCH v2 1/9] block: call bdrv_co_drain_begin in a coroutine

2022-11-08 Thread Vladimir Sementsov-Ogievskiy

On 11/8/22 18:13, Emanuele Giuseppe Esposito wrote:



Am 08/11/2022 um 15:33 schrieb Vladimir Sementsov-Ogievskiy:

On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote:

It seems that bdrv_open_driver() forgot to create a coroutine
where to call bs->drv->bdrv_co_drain_begin(), a callback
marked as coroutine_fn.

Because there is no active I/O at this point, the coroutine
should end right after entering, so the caller does not need
to poll until it is finished.


Hmm. I see your point. But isn't it better to go the generic way and use
a generated coroutine wrapper? Nothing guarantees that
.bdrv_co_drain_begin() handlers will never do any yield point even on
driver open...

Look for example at bdrv_co_check(). It has a generated wrapper
bdrv_check(), declared in include/block/block-io.h

So you just need to declare the wrapper, and use it in
bdrv_open_driver(), the code would be clearer too.


I think (and this is a repetition from my previous email) it confuses
the code. It is clear, but then you don't know if we are in a coroutine
context or not.


Hmm. But same thing is true for all callers of coroutine wrappers.

I agree that "coroutine wrapper" is a workaround for the design problem. 
Really, the fact that in many places we don't know whether we are in a coroutine or not is very 
uncomfortable.

But still, I'm not sure that it's good to avoid this workaround in one place 
and continue to use it in all other places. I think following common design is 
better. Or rework it deeply by getting rid of the wrappers somehow.



I am very well aware of what you did with your script, and I am working
on extending your g_c_w class with g_c_w_simple, where we drop the
if(qemu_in_coroutine()) case and leave just the coroutine creation.
Therefore, in coroutine functions we use only the _co_ function; otherwise
we use g_c_w_simple.
In this way the code is clear, as you point out, but there is no confusion.


Hmm sounds good, I missed it. Why not use g_c_w_simple here then?



Thank you,
Emanuele




Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Paolo Bonzini 
---
   block.c | 36 +++-
   1 file changed, 31 insertions(+), 5 deletions(-)

diff --git a/block.c b/block.c
index 5311b21f8e..d2b2800039 100644
--- a/block.c
+++ b/block.c
@@ -1637,12 +1637,34 @@ out:
   g_free(gen_node_name);
   }
   +typedef struct DrainCo {
+    BlockDriverState *bs;
+    int ret;
+} DrainCo;
+
+static void coroutine_fn bdrv_co_drain_begin(void *opaque)
+{
+    int i;
+    DrainCo *co = opaque;
+    BlockDriverState *bs = co->bs;
+
+    for (i = 0; i < bs->quiesce_counter; i++) {
+    bs->drv->bdrv_co_drain_begin(bs);
+    }
+    co->ret = 0;
+}
+
   static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv,
   const char *node_name, QDict *options,
   int open_flags, Error **errp)
   {
   Error *local_err = NULL;
-    int i, ret;
+    int ret;
+    Coroutine *co;
+    DrainCo dco = {
+    .bs = bs,
+    .ret = NOT_DONE,
+    };
   GLOBAL_STATE_CODE();
      bdrv_assign_node_name(bs, node_name, &local_err);
@@ -1690,10 +1712,14 @@ static int bdrv_open_driver(BlockDriverState
*bs, BlockDriver *drv,
   assert(bdrv_min_mem_align(bs) != 0);
   assert(is_power_of_2(bs->bl.request_alignment));
   -    for (i = 0; i < bs->quiesce_counter; i++) {
-    if (drv->bdrv_co_drain_begin) {
-    drv->bdrv_co_drain_begin(bs);
-    }
+    if (drv->bdrv_co_drain_begin) {
+    co = qemu_coroutine_create(bdrv_co_drain_begin, &dco);
+    qemu_coroutine_enter(co);
+    /*
+ * There should be no reason for drv->bdrv_co_drain_begin to
wait at
+ * this point, because the device does not have any active I/O.
+ */
+    assert(dco.ret != NOT_DONE);
   }
     return 0;






--
Best regards,
Vladimir




Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-08 Thread Michael S. Tsirkin
On Tue, Nov 08, 2022 at 11:21:26AM +, Alex Bennée wrote:
> 
> "Michael S. Tsirkin"  writes:
> 
> > On Tue, Nov 08, 2022 at 10:23:15AM +, Alex Bennée wrote:
> >> 
> >> "Michael S. Tsirkin"  writes:
> >> 
> >> > On Tue, Nov 08, 2022 at 09:23:04AM +, Alex Bennée wrote:
> >> >> The previous fix to virtio_device_started revealed a problem in its
> >> >> use by both the core and the device code. The core code should be able
> >> >> to handle the device "starting" while the VM isn't running to handle
> >> >> the restoration of migration state. To solve this dual use introduce a
> >> >> new helper for use by the vhost-user backends who all use it to feed a
> >> >> should_start variable.
> >> >> 
> >> >> We can also pick up a change to vhost_user_blk_set_status while we are at
> >> >> it which follows the same pattern.
> >> >> 
> >> >> Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to 
> >> >> virtio_device_started)
> >> >> Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio 
> >> >> device)
> >> >> Signed-off-by: Alex Bennée 
> >> >> Cc: "Michael S. Tsirkin" 
> >> >
> >> > why is this in this patchset?
> >> 
> >> As per my cover letter:
> >> 
> >>   Most of these patches have been posted before as single patch RFCs. A
> >>   couple are already scheduled through other trees so will drop out in
> >>   due course
> >> 
> >> but I keep them in my tree until they are merged so I can continue to
> >> soak test them (and have a stable base for my other WIP trees).
> >
> > That's fine just pls don't double-post them on list, certainly
> > not as part of a patchset.
> 
> Why not? Is this breaking some tooling?

Yes patchset breaks git am if you try to apply part of it.

Reposting creates work for reviewers - why should they have to read the same
patch twice?  In this case it also made me scratch my head trying to
figure out what to do about it.

But, if you are careful and maintain an ordered changelog after "---"
and there it says 
changes since rfc:
no changes, subject changed 

then this second part is less of a problem

> -- 
> Alex Bennée




Re: [PATCH v2 8/9] block: bdrv_create is never called in non-coroutine context

2022-11-08 Thread Kevin Wolf
Am 04.11.2022 um 10:56 hat Emanuele Giuseppe Esposito geschrieben:
> Delete the if case and make sure it won't be called again
> in coroutines.
> 
> Signed-off-by: Emanuele Giuseppe Esposito 
> Reviewed-by: Paolo Bonzini 

In the subject line, it should be "never called in coroutine context"
rather than "non-coroutine context", right?

Kevin




Re: [PATCH v2 1/9] block: call bdrv_co_drain_begin in a coroutine

2022-11-08 Thread Kevin Wolf
Am 08.11.2022 um 15:33 hat Vladimir Sementsov-Ogievskiy geschrieben:
> On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote:
> > It seems that bdrv_open_driver() forgot to create a coroutine
> > where to call bs->drv->bdrv_co_drain_begin(), a callback
> > marked as coroutine_fn.
> > 
> > Because there is no active I/O at this point, the coroutine
> > should end right after entering, so the caller does not need
> > to poll until it is finished.
> 
> Hmm. I see your point. But isn't it better to go the generic way and
> use a generated coroutine wrapper? Nothing guarantees that
> .bdrv_co_drain_begin() handlers will never do any yield point even on
> driver open...
> 
> Look for example at bdrv_co_check(). It has a generated wrapper
> bdrv_check(), declared in include/block/block-io.h
> 
> So you just need to declare the wrapper, and use it in
> bdrv_open_driver(), the code would be clearer too.

Note that if we apply the drain simplification series I sent today up to
at least patch 3 ('block: Revert .bdrv_drained_begin/end to
non-coroutine_fn') [1], then this patch isn't actually needed any more.

Kevin

[1] https://lists.gnu.org/archive/html/qemu-block/2022-11/msg00206.html




Re: [PATCH v2 1/9] block: call bdrv_co_drain_begin in a coroutine

2022-11-08 Thread Emanuele Giuseppe Esposito



Am 08/11/2022 um 15:33 schrieb Vladimir Sementsov-Ogievskiy:
> On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote:
>> It seems that bdrv_open_driver() forgot to create a coroutine
>> where to call bs->drv->bdrv_co_drain_begin(), a callback
>> marked as coroutine_fn.
>>
>> Because there is no active I/O at this point, the coroutine
>> should end right after entering, so the caller does not need
>> to poll until it is finished.
> 
> Hmm. I see your point. But isn't it better to go the generic way and use
> a generated coroutine wrapper? Nothing guarantees that
> .bdrv_co_drain_begin() handlers will never do any yield point even on
> driver open...
> 
> Look for example at bdrv_co_check(). It has a generated wrapper
> bdrv_check(), declared in include/block/block-io.h
> 
> So you just need to declare the wrapper, and use it in
> bdrv_open_driver(), the code would be clearer too.

I think (and this is a repetition from my previous email) it confuses
the code. It is clear, but then you don't know if we are in a coroutine
context or not.

I am very well aware of what you did with your script, and I am working
on extending your g_c_w class with g_c_w_simple, where we drop the
if(qemu_in_coroutine()) case and leave just the coroutine creation.
Therefore, in coroutine functions we use only the _co_ function; otherwise
we use g_c_w_simple.
In this way the code is clear, as you point out, but there is no confusion.

Thank you,
Emanuele
> 
>>
>> Signed-off-by: Emanuele Giuseppe Esposito 
>> Reviewed-by: Paolo Bonzini 
>> ---
>>   block.c | 36 +++-
>>   1 file changed, 31 insertions(+), 5 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index 5311b21f8e..d2b2800039 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -1637,12 +1637,34 @@ out:
>>   g_free(gen_node_name);
>>   }
>>   +typedef struct DrainCo {
>> +    BlockDriverState *bs;
>> +    int ret;
>> +} DrainCo;
>> +
>> +static void coroutine_fn bdrv_co_drain_begin(void *opaque)
>> +{
>> +    int i;
>> +    DrainCo *co = opaque;
>> +    BlockDriverState *bs = co->bs;
>> +
>> +    for (i = 0; i < bs->quiesce_counter; i++) {
>> +    bs->drv->bdrv_co_drain_begin(bs);
>> +    }
>> +    co->ret = 0;
>> +}
>> +
>>   static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv,
>>   const char *node_name, QDict *options,
>>   int open_flags, Error **errp)
>>   {
>>   Error *local_err = NULL;
>> -    int i, ret;
>> +    int ret;
>> +    Coroutine *co;
>> +    DrainCo dco = {
>> +    .bs = bs,
>> +    .ret = NOT_DONE,
>> +    };
>>   GLOBAL_STATE_CODE();
>>     bdrv_assign_node_name(bs, node_name, &local_err);
>> @@ -1690,10 +1712,14 @@ static int bdrv_open_driver(BlockDriverState
>> *bs, BlockDriver *drv,
>>   assert(bdrv_min_mem_align(bs) != 0);
>>   assert(is_power_of_2(bs->bl.request_alignment));
>>   -    for (i = 0; i < bs->quiesce_counter; i++) {
>> -    if (drv->bdrv_co_drain_begin) {
>> -    drv->bdrv_co_drain_begin(bs);
>> -    }
>> +    if (drv->bdrv_co_drain_begin) {
>> +    co = qemu_coroutine_create(bdrv_co_drain_begin, &dco);
>> +    qemu_coroutine_enter(co);
>> +    /*
>> + * There should be no reason for drv->bdrv_co_drain_begin to
>> wait at
>> + * this point, because the device does not have any active I/O.
>> + */
>> +    assert(dco.ret != NOT_DONE);
>>   }
>>     return 0;
> 




Re: [PATCH v2 2/9] block-copy: add missing coroutine_fn annotations

2022-11-08 Thread Emanuele Giuseppe Esposito



Am 08/11/2022 um 15:48 schrieb Vladimir Sementsov-Ogievskiy:
> On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote:
>> These functions end up calling bdrv_common_block_status_above(), a
>> generated_co_wrapper function.
> 
> generated_co_wrapper is not a coroutine_fn. Conversely, it's a function that
> does a classic coroutine wrapping - it starts a coroutine and does a POLL to
> wait for the coroutine to finish.
> 
>> In addition, they also happen to be always called in coroutine context,
>> meaning all callers are coroutine_fn.
> 
> That's also not a reason for marking them coroutine_fn. "coroutine_fn"
> means that the function can be called only from coroutine context.
> 
>> This means that the g_c_w function will enter the qemu_in_coroutine()
>> case and eventually suspend (or in other words call
>> qemu_coroutine_yield()).
>> Therefore we need to mark such functions coroutine_fn too.
> 
> I don't think so. Moreover, this breaks the concept, as your new
> coroutine_fn functions will call generated_co_wrapper functions which
> are not marked coroutine_fn and never were.

Theoretically not, because marking it coroutine_fn we know that we are
going into the if (qemu_in_coroutine()) branch of the g_c_w, so we could
directly replace it with the actual function. Because it's a pain to do
it by hand, and at some point I/someone should use Alberto's static
analyzer to get rid of that, I decided to leave g_c_w there.

As I understand it, it seems that you and Paolo have a different
> understanding of what coroutine_fn means and where it should be used.
Honestly I don't understand your point, as you said

> "coroutine_fn"
> means that the function can be called only from coroutine context.

which is the case for these functions. So could you please clarify?

What I do know is that it's extremely confusing to understand if a
function that is *not* marked as coroutine_fn is actually being called
also from coroutines or not.

Which complicates A LOT whatever change or operation I want to perform
on the BlockDriver callbacks or any other function in the block layer,
because in the current approach for the AioContext lock removal (which
you are not aware of, I understand) we need to distinguish between
functions running in coroutine context and not, and throwing g_c_w
functions everywhere makes my work much harder than it already is.

Thank you,
Emanuele

> 
>>
>> Signed-off-by: Emanuele Giuseppe Esposito 
>> ---
>>   block/block-copy.c | 15 +--
>>   1 file changed, 9 insertions(+), 6 deletions(-)
>>
>> diff --git a/block/block-copy.c b/block/block-copy.c
>> index bb947afdda..f33ab1d0b6 100644
>> --- a/block/block-copy.c
>> +++ b/block/block-copy.c
>> @@ -577,8 +577,9 @@ static coroutine_fn int
>> block_copy_task_entry(AioTask *task)
>>   return ret;
>>   }
>>   -static int block_copy_block_status(BlockCopyState *s, int64_t offset,
>> -   int64_t bytes, int64_t *pnum)
>> +static coroutine_fn int block_copy_block_status(BlockCopyState *s,
>> +    int64_t offset,
>> +    int64_t bytes,
>> int64_t *pnum)
>>   {
>>   int64_t num;
>>   BlockDriverState *base;
>> @@ -613,8 +614,9 @@ static int block_copy_block_status(BlockCopyState
>> *s, int64_t offset,
>>    * Check if the cluster starting at offset is allocated or not.
>>    * return via pnum the number of contiguous clusters sharing this
>> allocation.
>>    */
>> -static int block_copy_is_cluster_allocated(BlockCopyState *s, int64_t
>> offset,
>> -   int64_t *pnum)
>> +static int coroutine_fn
>> block_copy_is_cluster_allocated(BlockCopyState *s,
>> +    int64_t offset,
>> +    int64_t *pnum)
>>   {
>>   BlockDriverState *bs = s->source->bs;
>>   int64_t count, total_count = 0;
>> @@ -669,8 +671,9 @@ void block_copy_reset(BlockCopyState *s, int64_t
>> offset, int64_t bytes)
>>    * @return 0 when the cluster at @offset was unallocated,
>>    * 1 otherwise, and -ret on error.
>>    */
>> -int64_t block_copy_reset_unallocated(BlockCopyState *s,
>> - int64_t offset, int64_t *count)
>> +int64_t coroutine_fn block_copy_reset_unallocated(BlockCopyState *s,
>> +  int64_t offset,
>> +  int64_t *count)
>>   {
>>   int ret;
>>   int64_t clusters, bytes;
> 




Re: [PATCH v2 5/9] block: distinguish between bdrv_create running in coroutine and not

2022-11-08 Thread Vladimir Sementsov-Ogievskiy

On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote:

Call two different functions depending on whether bdrv_create
is in coroutine or not, following the same pattern as
generated_co_wrapper functions.

This also allows calling the coroutine function directly,
without using CreateCo or relying on bdrv_create().



Can we move to auto-generation of bdrv_create(), like for bdrv_check() and 
friends?


Signed-off-by: Emanuele Giuseppe Esposito 
---
  block.c | 74 -
  1 file changed, 36 insertions(+), 38 deletions(-)

diff --git a/block.c b/block.c
index d2b2800039..0823563e4d 100644
--- a/block.c
+++ b/block.c
@@ -522,66 +522,64 @@ typedef struct CreateCo {
  Error *err;
  } CreateCo;
  
-static void coroutine_fn bdrv_create_co_entry(void *opaque)

+static int coroutine_fn bdrv_co_create(BlockDriver *drv, const char *filename,
+   QemuOpts *opts, Error **errp)
  {
-Error *local_err = NULL;
  int ret;
+GLOBAL_STATE_CODE();
+assert(*errp == NULL);
+
+if (!drv->bdrv_co_create_opts) {
+error_setg(errp, "Driver '%s' does not support image creation",
+   drv->format_name);
+return -ENOTSUP;
+}
+
+ret = drv->bdrv_co_create_opts(drv, filename, opts, errp);
  
+if (ret < 0 && !*errp) {

+error_setg_errno(errp, -ret, "Could not create image");
+}
+
+return ret;
+}
+
+static void coroutine_fn bdrv_create_co_entry(void *opaque)
+{
  CreateCo *cco = opaque;
-assert(cco->drv);
  GLOBAL_STATE_CODE();
+assert(cco->drv);
  
-ret = cco->drv->bdrv_co_create_opts(cco->drv,

-cco->filename, cco->opts, _err);
-error_propagate(&cco->err, local_err);
-cco->ret = ret;
+cco->ret = bdrv_co_create(cco->drv, cco->filename, cco->opts, &cco->err);
  }
  
  int bdrv_create(BlockDriver *drv, const char* filename,

  QemuOpts *opts, Error **errp)
  {
-int ret;
-
  GLOBAL_STATE_CODE();
  
-Coroutine *co;

-CreateCo cco = {
-.drv = drv,
-.filename = g_strdup(filename),
-.opts = opts,
-.ret = NOT_DONE,
-.err = NULL,
-};
-
-if (!drv->bdrv_co_create_opts) {
-error_setg(errp, "Driver '%s' does not support image creation", 
drv->format_name);
-ret = -ENOTSUP;
-goto out;
-}
-
  if (qemu_in_coroutine()) {
  /* Fast-path if already in coroutine context */
-bdrv_create_co_entry();
+return bdrv_co_create(drv, filename, opts, errp);
  } else {
+Coroutine *co;
+CreateCo cco = {
+.drv = drv,
+.filename = g_strdup(filename),
+.opts = opts,
+.ret = NOT_DONE,
+.err = NULL,
+};
+
  co = qemu_coroutine_create(bdrv_create_co_entry, &cco);
  qemu_coroutine_enter(co);
  while (cco.ret == NOT_DONE) {
  aio_poll(qemu_get_aio_context(), true);
  }
+error_propagate(errp, cco.err);
+g_free(cco.filename);
+return cco.ret;
  }
-
-ret = cco.ret;
-if (ret < 0) {
-if (cco.err) {
-error_propagate(errp, cco.err);
-} else {
-error_setg_errno(errp, -ret, "Could not create image");
-}
-}
-
-out:
-g_free(cco.filename);
-return ret;
  }
  
  /**


--
Best regards,
Vladimir




Re: [PATCH v3 4/4] hw/nvme: add polling support

2022-11-08 Thread John Levon
On Fri, Nov 04, 2022 at 07:32:12AM +0100, Klaus Jensen wrote:

> On Nov  3 21:19, Jinhao Fan wrote:
> > On 11/3/2022 8:10 PM, Klaus Jensen wrote:
> > > I agree that the spec is a little unclear on this point. In any case, in
> > > Linux, when the driver has decided that the sq tail must be updated,
> > > it will use this check:
> > > 
> > >(new_idx - event_idx - 1) < (new_idx - old)
> > 
> > When eventidx is already behind, it's like:
> > 
> >  0
> >  1 <- event_idx
> >  2 <- old
> >  3 <- new_idx
> >  4
> >  .
> >  .
> >  .
> > 
> > In this case, (new_idx - event_idx - 1) = 3-1-1 = 1 >= (new_idx - old) =
> > 3-2=1, so the host won't update sq tail. Where am I wrong in this example?
> 
> That becomes 1 >= 1, i.e. "true". So this will result in the driver
> doing an mmio doorbell write.

The code is:

static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)
{
    return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);
}

which per the above is "return 1 < 1;", or false. So the above case does *not*
do an mmio write. No?
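
Plugging the numbers from the example into a standalone program (same values as
above, nothing driver-specific):

#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint16_t event_idx = 1, old = 2, new_idx = 3;

    /* (u16)(3 - 1 - 1) == 1 and (u16)(3 - 2) == 1, so 1 < 1 is false */
    int need_event = (uint16_t)(new_idx - event_idx - 1) <
                     (uint16_t)(new_idx - old);

    assert(!need_event);   /* i.e. no doorbell (mmio) write for this case */
    return 0;
}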

regards
john



Re: [PATCH v2 3/9] nbd/server.c: add missing coroutine_fn annotations

2022-11-08 Thread Vladimir Sementsov-Ogievskiy

On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote:

These functions end up calling bdrv_*() implemented as generated_co_wrapper
functions.


Same here. Sorry that I joined only on v3.

In the past we had a lot of "coroutine wrappers": each IO function in block/io.c
and many in block.c had two variants:

coroutine_fn bdrv_co_foo(...)

and a wrapper

bdrv_foo(...)

And that wrapper is not a coroutine_fn; it's for calling from any context:
coroutine or not. Now many of these wrappers are auto-generated; you may find
them in build/block/block-gen.c after a successful make.

"generated_co_wrapper" is a sign for code generation script to generate the 
wrapper code.
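
Roughly, the shape of such a wrapper is the following (a hand-written sketch of
the generated code, using the made-up bdrv_foo()/bdrv_co_foo() pair from above):

typedef struct BdrvFooCo {
    BlockDriverState *bs;
    int ret;
} BdrvFooCo;

static void coroutine_fn bdrv_foo_entry(void *opaque)
{
    BdrvFooCo *s = opaque;
    s->ret = bdrv_co_foo(s->bs);   /* the real coroutine_fn implementation */
}

int bdrv_foo(BlockDriverState *bs)
{
    if (qemu_in_coroutine()) {
        /* already in coroutine context: call the coroutine_fn directly */
        return bdrv_co_foo(bs);
    } else {
        /* otherwise start a coroutine and poll until it has finished */
        BdrvFooCo s = { .bs = bs, .ret = -EINPROGRESS };
        Coroutine *co = qemu_coroutine_create(bdrv_foo_entry, &s);
        bdrv_coroutine_enter(bs, co);
        BDRV_POLL_WHILE(bs, s.ret == -EINPROGRESS);
        return s.ret;
    }
}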


In addition, they also happen to be always called in coroutine context,
meaning all callers are coroutine_fn.
This means that the g_c_w function will enter the qemu_in_coroutine()
case and eventually suspend (or in other words call qemu_coroutine_yield()).
Therefore we need to mark such functions coroutine_fn too.

Signed-off-by: Emanuele Giuseppe Esposito 
---
  nbd/server.c | 21 -
  1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index ada16089f3..e2eec0cf46 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2141,8 +2141,9 @@ static int nbd_extent_array_add(NBDExtentArray *ea,
  return 0;
  }
  
-static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset,

-  uint64_t bytes, NBDExtentArray *ea)
+static int coroutine_fn blockstatus_to_extents(BlockDriverState *bs,
+   uint64_t offset, uint64_t bytes,
+   NBDExtentArray *ea)
  {
  while (bytes) {
  uint32_t flags;
@@ -2168,8 +2169,9 @@ static int blockstatus_to_extents(BlockDriverState *bs, 
uint64_t offset,
  return 0;
  }
  
-static int blockalloc_to_extents(BlockDriverState *bs, uint64_t offset,

- uint64_t bytes, NBDExtentArray *ea)
+static int coroutine_fn blockalloc_to_extents(BlockDriverState *bs,
+  uint64_t offset, uint64_t bytes,
+  NBDExtentArray *ea)
  {
  while (bytes) {
  int64_t num;
@@ -2220,11 +,12 @@ static int nbd_co_send_extents(NBDClient *client, 
uint64_t handle,
  }
  
  /* Get block status from the exported device and send it to the client */

-static int nbd_co_send_block_status(NBDClient *client, uint64_t handle,
-BlockDriverState *bs, uint64_t offset,
-uint32_t length, bool dont_fragment,
-bool last, uint32_t context_id,
-Error **errp)
+static int
+coroutine_fn nbd_co_send_block_status(NBDClient *client, uint64_t handle,
+  BlockDriverState *bs, uint64_t offset,
+  uint32_t length, bool dont_fragment,
+  bool last, uint32_t context_id,
+  Error **errp)
  {
  int ret;
  unsigned int nb_extents = dont_fragment ? 1 : 
NBD_MAX_BLOCK_STATUS_EXTENTS;


--
Best regards,
Vladimir




Re: [PATCH v2 2/9] block-copy: add missing coroutine_fn annotations

2022-11-08 Thread Vladimir Sementsov-Ogievskiy

On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote:

These functions end up calling bdrv_common_block_status_above(), a
generated_co_wrapper function.


generated_co_wrapper is not a coroutine_fn. Conversely, it's a function that does
a classic coroutine wrapping - it starts a coroutine and does a POLL to wait for
the coroutine to finish.


In addition, they also happen to be always called in coroutine context,
meaning all callers are coroutine_fn.


That's also not a reason for marking them coroutine_fn. "coroutine_fn" means 
that the function can be called only from coroutine context.


This means that the g_c_w function will enter the qemu_in_coroutine()
case and eventually suspend (or in other words call qemu_coroutine_yield()).
Therefore we need to mark such functions coroutine_fn too.


I don't think so. Moreover, this breaks the concept, as your new coroutine_fn 
functions will call generated_co_wrapper functions which are not marked 
coroutine_fn and never were.



Signed-off-by: Emanuele Giuseppe Esposito 
---
  block/block-copy.c | 15 +--
  1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/block/block-copy.c b/block/block-copy.c
index bb947afdda..f33ab1d0b6 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -577,8 +577,9 @@ static coroutine_fn int block_copy_task_entry(AioTask *task)
  return ret;
  }
  
-static int block_copy_block_status(BlockCopyState *s, int64_t offset,

-   int64_t bytes, int64_t *pnum)
+static coroutine_fn int block_copy_block_status(BlockCopyState *s,
+int64_t offset,
+int64_t bytes, int64_t *pnum)
  {
  int64_t num;
  BlockDriverState *base;
@@ -613,8 +614,9 @@ static int block_copy_block_status(BlockCopyState *s, 
int64_t offset,
   * Check if the cluster starting at offset is allocated or not.
   * return via pnum the number of contiguous clusters sharing this allocation.
   */
-static int block_copy_is_cluster_allocated(BlockCopyState *s, int64_t offset,
-   int64_t *pnum)
+static int coroutine_fn block_copy_is_cluster_allocated(BlockCopyState *s,
+int64_t offset,
+int64_t *pnum)
  {
  BlockDriverState *bs = s->source->bs;
  int64_t count, total_count = 0;
@@ -669,8 +671,9 @@ void block_copy_reset(BlockCopyState *s, int64_t offset, 
int64_t bytes)
   * @return 0 when the cluster at @offset was unallocated,
   * 1 otherwise, and -ret on error.
   */
-int64_t block_copy_reset_unallocated(BlockCopyState *s,
- int64_t offset, int64_t *count)
+int64_t coroutine_fn block_copy_reset_unallocated(BlockCopyState *s,
+  int64_t offset,
+  int64_t *count)
  {
  int ret;
  int64_t clusters, bytes;


--
Best regards,
Vladimir




[PATCH for-7.2] block/blkio: Set BlockDriver::has_variable_length to false

2022-11-08 Thread Alberto Faria
Setting it to true can cause the device size to be queried from libblkio
in otherwise fast paths, degrading performance. Set it to false and
require users to refresh the device size explicitly instead.

Fixes: 4c8f4fda0504 ("block/blkio: Tolerate device size changes")
Suggested-by: Kevin Wolf 
Signed-off-by: Alberto Faria 
---
 block/blkio.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/blkio.c b/block/blkio.c
index 620fab28a7..5eae3adfaf 100644
--- a/block/blkio.c
+++ b/block/blkio.c
@@ -993,7 +993,6 @@ static void blkio_refresh_limits(BlockDriverState *bs, 
Error **errp)
 { \
 .format_name = name, \
 .protocol_name   = name, \
-.has_variable_length = true, \
 .instance_size   = sizeof(BDRVBlkioState), \
 .bdrv_file_open  = blkio_file_open, \
 .bdrv_close  = blkio_close, \
-- 
2.38.1




Re: [PATCH v2 1/9] block: call bdrv_co_drain_begin in a coroutine

2022-11-08 Thread Vladimir Sementsov-Ogievskiy

On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote:

It seems that bdrv_open_driver() forgot to create a coroutine
where to call bs->drv->bdrv_co_drain_begin(), a callback
marked as coroutine_fn.

Because there is no active I/O at this point, the coroutine
should end right after entering, so the caller does not need
to poll until it is finished.


Hmm. I see your point. But isn't it better to go the generic way and use a 
generated coroutine wrapper? Nothing guarantees that .bdrv_co_drain_begin() 
handlers will never do any yield point even on driver open...

Look for example at bdrv_co_check(). It has a generated wrapper bdrv_check(), 
declared in include/block/block-io.h

So you just need to declare the wrapper, and use it in bdrv_open_driver(), the 
code would be clearer too.



Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Paolo Bonzini 
---
  block.c | 36 +++-
  1 file changed, 31 insertions(+), 5 deletions(-)

diff --git a/block.c b/block.c
index 5311b21f8e..d2b2800039 100644
--- a/block.c
+++ b/block.c
@@ -1637,12 +1637,34 @@ out:
  g_free(gen_node_name);
  }
  
+typedef struct DrainCo {

+BlockDriverState *bs;
+int ret;
+} DrainCo;
+
+static void coroutine_fn bdrv_co_drain_begin(void *opaque)
+{
+int i;
+DrainCo *co = opaque;
+BlockDriverState *bs = co->bs;
+
+for (i = 0; i < bs->quiesce_counter; i++) {
+bs->drv->bdrv_co_drain_begin(bs);
+}
+co->ret = 0;
+}
+
  static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv,
  const char *node_name, QDict *options,
  int open_flags, Error **errp)
  {
  Error *local_err = NULL;
-int i, ret;
+int ret;
+Coroutine *co;
+DrainCo dco = {
+.bs = bs,
+.ret = NOT_DONE,
+};
  GLOBAL_STATE_CODE();
  
  bdrv_assign_node_name(bs, node_name, &local_err);

@@ -1690,10 +1712,14 @@ static int bdrv_open_driver(BlockDriverState *bs, 
BlockDriver *drv,
  assert(bdrv_min_mem_align(bs) != 0);
  assert(is_power_of_2(bs->bl.request_alignment));
  
-for (i = 0; i < bs->quiesce_counter; i++) {

-if (drv->bdrv_co_drain_begin) {
-drv->bdrv_co_drain_begin(bs);
-}
+if (drv->bdrv_co_drain_begin) {
+co = qemu_coroutine_create(bdrv_co_drain_begin, &dco);
+qemu_coroutine_enter(co);
+/*
+ * There should be no reason for drv->bdrv_co_drain_begin to wait at
+ * this point, because the device does not have any active I/O.
+ */
+assert(dco.ret != NOT_DONE);
  }
  
  return 0;


--
Best regards,
Vladimir




[PATCH] qapi/block-core: Fix BlockdevOptionsNvmeIoUring @path description

2022-11-08 Thread Alberto Faria
The nvme-io_uring BlockDriver's path option must point at the character
device of an NVMe namespace, not at an image file.

Fixes: fd66dbd424f5 ("blkio: add libblkio block driver")
Suggested-by: Stefano Garzarella 
Signed-off-by: Alberto Faria 
---
 qapi/block-core.json | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 6d904004f8..95ac4fa634 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3704,7 +3704,7 @@
 #
 # Driver specific block device options for the nvme-io_uring backend.
 #
-# @path: path to the image file
+# @path: path to the NVMe namespace's character device (e.g. /dev/ng0n1).
 #
 # Since: 7.2
 ##
-- 
2.38.1




Re: [PATCH v2 0/3] block: Start/end drain on correct AioContext

2022-11-08 Thread Kevin Wolf
Am 08.11.2022 um 15:13 hat Kevin Wolf geschrieben:
> Am 07.11.2022 um 16:13 hat Hanna Reitz geschrieben:
> > Hi,
> > 
> > v1 cover letter:
> > https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00389.html
> > 
> > bdrv_replace_child_noperm() drains the child via
> > bdrv_parent_drained_{begin,end}_single().  When it removes a child, the
> > bdrv_parent_drained_end_single() at its end will be called on an empty
> > child, making the BDRV_POLL_WHILE() in it poll the main AioContext
> > (because c->bs is NULL).
> > 
> > That’s wrong, though, because it’s supposed to operate on the parent.
> > bdrv_parent_drained_end_single_no_poll() will have scheduled any BHs in
> > the parents’ AioContext, which may be anything, not necessarily the main
> > context.  Therefore, we must poll the parent’s context.
> > 
> > Patch 3 does this for both bdrv_parent_drained_{begin,end}_single().
> > Patch 1 ensures that we can legally call
> > bdrv_child_get_parent_aio_context() from those I/O context functions,
> > and patch 2 fixes blk_do_set_aio_context() to not cause an assertion
> failure if beginning a drain can end up in blk_get_aio_context()
> > before blk->ctx has been updated.
> 
> Hmm, I may have unintentionally made this series obsolete with the drain
> series I sent today. The poll instances that you're fixing simply don't
> exist any more after it.
> 
> Can you check if the bug is gone with my series? I would hope so, but if
> not, we would probably need to apply a fix in a different place.

Actually, on second thoughts, we'd probably still apply your patches as
a fix for 7.2 and then have my patches which would get rid of the
polling only in block-next. Patches 1 and 2 of this series would stay in
the tree, and that seems to make sense to me, too. So obsolete was not
the right word.

But it would still be interesting to see if my series fixes the bug,
too, because otherwise it might introduce a regression later.

By the way, is the bug hard to test in a test case? If this series had
one, I could just have run it myself against my tree.

Kevin




Re: [PATCH v3 09/30] nbd/server: Clean up abuse of BlockExportOptionsNbd member @arg

2022-11-08 Thread Vladimir Sementsov-Ogievskiy

On 11/4/22 19:06, Markus Armbruster wrote:

block-export-add argument @name defaults to the value of argument
@node-name.

nbd_export_create() implements this by copying @node_name to @name.
It leaves @has_node_name false, violating the "has_node_name ==
!!node_name" invariant.  Unclean.  Falls apart when we elide
@has_node_name (next commit): then QAPI frees the same value twice,
once for @node_name and once for @name.  iotest 307 duly explodes.

Goes back to commit c62d24e906 "blockdev-nbd: Boxed argument type for
nbd-server-add" (v5.0.0).  Got moved from qmp_nbd_server_add() to
nbd_export_create() (commit 56ee86261e), then copied back (commit
b6076afcab).  Commit 8675cbd68b "nbd: Utilize QAPI_CLONE for type
conversion" (v5.2.0) cleaned up the copy in qmp_nbd_server_add()
noting

 Second, our assignment to arg->name is fishy: the generated QAPI code
 for qapi_free_NbdServerAddOptions does not visit arg->name if
 arg->has_name is false, but if it DID visit it, we would have
 introduced a double-free situation when arg is finally freed.

Exactly.  However, the copy in nbd_export_create() remained dirty.

Clean it up.  Since the value stored in member @name is not actually
used outside this function, use a local variable instead of modifying
the QAPI object.

Signed-off-by: Markus Armbruster
Cc: Eric Blake
Cc: Vladimir Sementsov-Ogievskiy
Cc:qemu-block@nongnu.org



Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir




Re: [PATCH v2 0/3] block: Start/end drain on correct AioContext

2022-11-08 Thread Kevin Wolf
Am 07.11.2022 um 16:13 hat Hanna Reitz geschrieben:
> Hi,
> 
> v1 cover letter:
> https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00389.html
> 
> bdrv_replace_child_noperm() drains the child via
> bdrv_parent_drained_{begin,end}_single().  When it removes a child, the
> bdrv_parent_drained_end_single() at its end will be called on an empty
> child, making the BDRV_POLL_WHILE() in it poll the main AioContext
> (because c->bs is NULL).
> 
> That’s wrong, though, because it’s supposed to operate on the parent.
> bdrv_parent_drained_end_single_no_poll() will have scheduled any BHs in
> the parents’ AioContext, which may be anything, not necessarily the main
> context.  Therefore, we must poll the parent’s context.
> 
> Patch 3 does this for both bdrv_parent_drained_{begin,end}_single().
> Patch 1 ensures that we can legally call
> bdrv_child_get_parent_aio_context() from those I/O context functions,
> and patch 2 fixes blk_do_set_aio_context() to not cause an assertion
> failure if beginning a drain can end up in blk_get_aio_context()
> before blk->ctx has been updated.

Hmm, I may have unintentionally made this series obsolete with the drain
series I sent today. The poll instances that you're fixing simply don't
exist any more after it.

Can you check if the bug is gone with my series? I would hope so, but if
not, we would probably need to apply a fix in a different place.

Kevin




Re: [PATCH v3 4/4] hw/nvme: add polling support

2022-11-08 Thread Klaus Jensen
On Nov  8 12:39, John Levon wrote:
> On Fri, Nov 04, 2022 at 07:32:12AM +0100, Klaus Jensen wrote:
> 
> > On Nov  3 21:19, Jinhao Fan wrote:
> > > On 11/3/2022 8:10 PM, Klaus Jensen wrote:
> > > > I agree that the spec is a little unclear on this point. In any case, in
> > > > Linux, when the driver has decided that the sq tail must be updated,
> > > > it will use this check:
> > > > 
> > > >(new_idx - event_idx - 1) < (new_idx - old)
> > > 
> > > When eventidx is already behind, it's like:
> > > 
> > >  0
> > >  1 <- event_idx
> > >  2 <- old
> > >  3 <- new_idx
> > >  4
> > >  .
> > >  .
> > >  .
> > > 
> > > In this case, (new_idx - event_idx - 1) = 3-1-1 = 1 >= (new_idx - old) =
> > > 3-2=1, so the host won't update sq tail. Where am I wrong in this example?
> > 
> > That becomes 1 >= 1, i.e. "true". So this will result in the driver
> > doing an mmio doorbell write.
> 
> The code is:
> 
> static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)
> {
>     return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);
> }
> 
> which per the above is "return 1 < 1;", or false. So the above case does *not*
> do an mmio write. No?
> 

Whelp.

Looks like I'm in the wrong here, apologies!




[PATCH 05/13] block: Inline bdrv_drain_invoke()

2022-11-08 Thread Kevin Wolf
bdrv_drain_invoke() now has two entirely separate cases that share no
code any more and are selected depending on a bool parameter. Each case
has only one caller. Just inline the function.

Signed-off-by: Kevin Wolf 
---
 block/io.c | 23 ++-
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/block/io.c b/block/io.c
index 41e6121c31..c520183fb7 100644
--- a/block/io.c
+++ b/block/io.c
@@ -241,21 +241,6 @@ typedef struct {
 bool ignore_bds_parents;
 } BdrvCoDrainData;
 
-/* Recursively call BlockDriver.bdrv_drain_begin/end callbacks */
-static void bdrv_drain_invoke(BlockDriverState *bs, bool begin)
-{
-if (!bs->drv || (begin && !bs->drv->bdrv_drain_begin) ||
-(!begin && !bs->drv->bdrv_drain_end)) {
-return;
-}
-
-if (begin) {
-bs->drv->bdrv_drain_begin(bs);
-} else {
-bs->drv->bdrv_drain_end(bs);
-}
-}
-
 /* Returns true if BDRV_POLL_WHILE() should go into a blocking aio_poll() */
 bool bdrv_drain_poll(BlockDriverState *bs, bool recursive,
  BdrvChild *ignore_parent, bool ignore_bds_parents)
@@ -389,7 +374,9 @@ void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
 }
 
 bdrv_parent_drained_begin(bs, parent, ignore_bds_parents);
-bdrv_drain_invoke(bs, true);
+if (bs->drv && bs->drv->bdrv_drain_begin) {
+bs->drv->bdrv_drain_begin(bs);
+}
 }
 
 static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
@@ -460,7 +447,9 @@ static void bdrv_do_drained_end(BlockDriverState *bs, bool 
recursive,
 assert(bs->quiesce_counter > 0);
 
 /* Re-enable things in child-to-parent order */
-bdrv_drain_invoke(bs, false);
+if (bs->drv && bs->drv->bdrv_drain_end) {
+bs->drv->bdrv_drain_end(bs);
+}
 bdrv_parent_drained_end(bs, parent, ignore_bds_parents);
 
 old_quiesce_counter = qatomic_fetch_dec(&bs->quiesce_counter);
-- 
2.38.1




[PATCH 12/13] block: Don't poll in bdrv_replace_child_noperm()

2022-11-08 Thread Kevin Wolf
In order to make sure that bdrv_replace_child_noperm() doesn't have to
poll any more, get rid of the bdrv_parent_drained_begin_single() call.

This is possible now because we can require that the child is already
drained when the function is called (it better be, having in-flight
requests while modifying the graph isn't going to end well!) and we
don't call the parent drain callbacks more than once.

The additional drain calls needed in callers cause the test case to run
its code in the drain handler too early (bdrv_attach_child() drains
now), so modify it to only enable the code after the test setup has
completed.

Signed-off-by: Kevin Wolf 
---
 include/block/block-io.h |  8 
 block.c  | 72 +---
 block/io.c   |  2 +-
 tests/unit/test-bdrv-drain.c | 10 +
 4 files changed, 70 insertions(+), 22 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index 5b54ed4672..ddce8550a9 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -290,6 +290,14 @@ bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector 
*qiov, int64_t pos);
  */
 void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll);
 
+/**
+ * bdrv_parent_drained_poll_single:
+ *
+ * Returns true if there is any pending activity to cease before @c can be
+ * called quiesced, false otherwise.
+ */
+bool bdrv_parent_drained_poll_single(BdrvChild *c);
+
 /**
  * bdrv_parent_drained_end_single:
  *
diff --git a/block.c b/block.c
index 5f5f79cd16..12039e9b8a 100644
--- a/block.c
+++ b/block.c
@@ -2399,6 +2399,20 @@ static void bdrv_replace_child_abort(void *opaque)
 
 GLOBAL_STATE_CODE();
 /* old_bs reference is transparently moved from @s to @s->child */
+if (!s->child->bs) {
+/*
+ * The parents were undrained when removing old_bs from the child. New
+ * requests can't have been made, though, because the child was empty.
+ *
+ * TODO Make bdrv_replace_child_noperm() transactionable to avoid
+ * undraining the parent in the first place. Once this is done, having
+ * new_bs drained when calling bdrv_replace_child_tran() is not a
+ * requirement any more.
+ */
+bdrv_parent_drained_begin_single(s->child, false);
+assert(!bdrv_parent_drained_poll_single(s->child));
+}
+assert(s->child->parent_quiesce_counter);
 bdrv_replace_child_noperm(s->child, s->old_bs);
 bdrv_unref(new_bs);
 }
@@ -2414,12 +2428,20 @@ static TransactionActionDrv bdrv_replace_child_drv = {
  *
  * Note: real unref of old_bs is done only on commit.
  *
+ * Both child and new_bs (if non-NULL) must be drained. new_bs must be kept
+ * drained until the transaction is completed (this automatically implies that
+ * child remains drained, too).
+ *
  * The function doesn't update permissions, caller is responsible for this.
  */
 static void bdrv_replace_child_tran(BdrvChild *child, BlockDriverState *new_bs,
 Transaction *tran)
 {
 BdrvReplaceChildState *s = g_new(BdrvReplaceChildState, 1);
+
+assert(child->parent_quiesce_counter);
+assert(!new_bs || new_bs->quiesce_counter);
+
 *s = (BdrvReplaceChildState) {
 .child = child,
 .old_bs = child->bs,
@@ -2818,6 +2840,12 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
 int new_bs_quiesce_counter;
 
 assert(!child->frozen);
+/*
+ * When removing the child, it's the caller's responsibility to make sure
+ * that no requests are in flight any more. Usually the parent is drained,
+ * but not through child->parent_quiesce_counter.
+ */
+assert(!new_bs || child->parent_quiesce_counter);
 assert(old_bs != new_bs);
 GLOBAL_STATE_CODE();
 
@@ -2825,16 +2853,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
 assert(bdrv_get_aio_context(old_bs) == bdrv_get_aio_context(new_bs));
 }
 
-new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
-
-/*
- * If the new child node is drained but the old one was not, flush
- * all outstanding requests to the old child node.
- */
-if (new_bs_quiesce_counter && !child->parent_quiesce_counter) {
-bdrv_parent_drained_begin_single(child, true);
-}
-
 if (old_bs) {
 if (child->klass->detach) {
 child->klass->detach(child);
@@ -2849,14 +2867,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
 assert_bdrv_graph_writable(new_bs);
 QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
 
-/*
- * Polling in bdrv_parent_drained_begin_single() may have led to the new
- * node's quiesce_counter having been decreased.  Not a problem, we just
- * need to recognize this here and then invoke drained_end appropriately
- * more often.
- */
-assert(new_bs->quiesce_counter <= new_bs_quiesce_counter);
-
 if 

[PATCH 09/13] block: Remove subtree drains

2022-11-08 Thread Kevin Wolf
Subtree drains are not used any more. Remove them.

After this, BdrvChildClass.attach/detach() don't poll any more.

Signed-off-by: Kevin Wolf 
---
 include/block/block-io.h |  18 +--
 include/block/block_int-common.h |   1 -
 include/block/block_int-io.h |  12 --
 block.c  |  20 +--
 block/io.c   | 121 +++---
 tests/unit/test-bdrv-drain.c | 261 ++-
 6 files changed, 44 insertions(+), 389 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index 97e9ae8bee..c35cb1e53f 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -303,8 +303,7 @@ void bdrv_parent_drained_end_single(BdrvChild *c);
 /**
  * bdrv_drain_poll:
  *
- * Poll for pending requests in @bs, its parents (except for @ignore_parent),
- * and if @recursive is true its children as well (used for subtree drain).
+ * Poll for pending requests in @bs and its parents (except for @ignore_parent).
  *
  * If @ignore_bds_parents is true, parents that are BlockDriverStates must
  * ignore the drain request because they will be drained separately (used for
@@ -312,8 +311,8 @@ void bdrv_parent_drained_end_single(BdrvChild *c);
  *
  * This is part of bdrv_drained_begin.
  */
-bool bdrv_drain_poll(BlockDriverState *bs, bool recursive,
- BdrvChild *ignore_parent, bool ignore_bds_parents);
+bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent,
+ bool ignore_bds_parents);
 
 /**
  * bdrv_drained_begin:
@@ -334,12 +333,6 @@ void bdrv_drained_begin(BlockDriverState *bs);
 void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
BdrvChild *parent, bool ignore_bds_parents);
 
-/**
- * Like bdrv_drained_begin, but recursively begins a quiesced section for
- * exclusive access to all child nodes as well.
- */
-void bdrv_subtree_drained_begin(BlockDriverState *bs);
-
 /**
  * bdrv_drained_end:
  *
@@ -353,9 +346,4 @@ void bdrv_subtree_drained_begin(BlockDriverState *bs);
  */
 void bdrv_drained_end(BlockDriverState *bs);
 
-/**
- * End a quiescent section started by bdrv_subtree_drained_begin().
- */
-void bdrv_subtree_drained_end(BlockDriverState *bs);
-
 #endif /* BLOCK_IO_H */
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 6504db4fd9..65ee5fcbec 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -1184,7 +1184,6 @@ struct BlockDriverState {
 
 /* Accessed with atomic ops.  */
 int quiesce_counter;
-int recursive_quiesce_counter;
 
 unsigned int write_gen;   /* Current data generation */
 
diff --git a/include/block/block_int-io.h b/include/block/block_int-io.h
index 4b0b3e17ef..8bc061ebb8 100644
--- a/include/block/block_int-io.h
+++ b/include/block/block_int-io.h
@@ -179,16 +179,4 @@ void bdrv_bsc_invalidate_range(BlockDriverState *bs,
  */
 void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes);
 
-
-/*
- * "I/O or GS" API functions. These functions can run without
- * the BQL, but only in one specific iothread/main loop.
- *
- * See include/block/block-io.h for more information about
- * the "I/O or GS" API.
- */
-
-void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent);
-void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent);
-
 #endif /* BLOCK_INT_IO_H */
diff --git a/block.c b/block.c
index 43b893dd6c..9d082631d9 100644
--- a/block.c
+++ b/block.c
@@ -1224,7 +1224,7 @@ static void bdrv_child_cb_drained_begin(BdrvChild *child)
 static bool bdrv_child_cb_drained_poll(BdrvChild *child)
 {
 BlockDriverState *bs = child->opaque;
-return bdrv_drain_poll(bs, false, NULL, false);
+return bdrv_drain_poll(bs, NULL, false);
 }
 
 static void bdrv_child_cb_drained_end(BdrvChild *child)
@@ -1474,8 +1474,6 @@ static void bdrv_child_cb_attach(BdrvChild *child)
 assert(!bs->file);
 bs->file = child;
 }
-
-bdrv_apply_subtree_drain(child, bs);
 }
 
 static void bdrv_child_cb_detach(BdrvChild *child)
@@ -1486,8 +1484,6 @@ static void bdrv_child_cb_detach(BdrvChild *child)
 bdrv_backing_detach(child);
 }
 
-bdrv_unapply_subtree_drain(child, bs);
-
 assert_bdrv_graph_writable(bs);
 QLIST_REMOVE(child, next);
 if (child == bs->backing) {
@@ -2843,9 +2839,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
 }
 
 if (old_bs) {
-/* Detach first so that the recursive drain sections coming from @child
- * are already gone and we only end the drain sections that came from
- * elsewhere. */
 if (child->klass->detach) {
 child->klass->detach(child);
 }
@@ -2860,17 +2853,14 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
 QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
 
 /*
- * Detaching the old node may have led to the 

[PATCH 07/13] block: Don't use subtree drains in bdrv_drop_intermediate()

2022-11-08 Thread Kevin Wolf
Instead of using a subtree drain from the top node (which also drains
child nodes of base that we're not even interested in), use a normal
drain for base, which automatically drains all of the parents, too.
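
Not part of the patch, just to make the substitution concrete (a sketch of
the before/after calls, mirroring the diff below):

    /* before: quiesce top, all of its children (including base's subtree),
     * and all of their parents */
    bdrv_subtree_drained_begin(top);
    /* ... */
    bdrv_subtree_drained_end(top);

    /* after: quiescing base already quiesces all of base's parents, which
     * is the only part bdrv_drop_intermediate() actually relies on */
    bdrv_drained_begin(base);
    /* ... */
    bdrv_drained_end(base);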

Signed-off-by: Kevin Wolf 
---
 block.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index 5828b970e4..2f6b25875f 100644
--- a/block.c
+++ b/block.c
@@ -5581,7 +5581,7 @@ int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
 GLOBAL_STATE_CODE();
 
 bdrv_ref(top);
-bdrv_subtree_drained_begin(top);
+bdrv_drained_begin(base);
 
 if (!top->drv || !base->drv) {
 goto exit;
@@ -5654,7 +5654,7 @@ int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
 
 ret = 0;
 exit:
-bdrv_subtree_drained_end(top);
+bdrv_drained_end(base);
 bdrv_unref(top);
 return ret;
 }
-- 
2.38.1




[PATCH 10/13] block: Call drain callbacks only once

2022-11-08 Thread Kevin Wolf
We only need to call both the BlockDriver's callback and the parent
callbacks when going from undrained to drained or vice versa. A second
drain section doesn't make a difference for the driver or the parent,
they weren't supposed to send new requests before and after the second
drain.

One thing that gets in the way is the 'ignore_bds_parents' parameter in
bdrv_do_drained_begin_quiesce() and bdrv_do_drained_end(): If it is true
for the first drain, bs->quiesce_counter will be non-zero, but the
parent callbacks still haven't been called, so a second drain where it
is false would still have to call them.

Instead of keeping track of this, let's just get rid of the parameter.
It was introduced in commit 6cd5c9d7b2d as an optimisation so that
during bdrv_drain_all(), we wouldn't recursively drain all parents up to
the root for each node, resulting in quadratic complexity. As it happens,
calling the callbacks only once solves the same problem, so as of this
patch, we'll still have O(n) complexity and ignore_bds_parents is not
needed any more.

This patch only ignores the 'ignore_bds_parents' parameter. It will be
removed in a separate patch.
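
Conceptually (this is a sketch of the intended per-child semantics, not a
hunk from the patch), the parent bracketing becomes a pure 0 -> 1 / 1 -> 0
transition:

    /* only the first drain of this child invokes the parent callback ... */
    if (child->parent_quiesce_counter++ == 0 && child->klass->drained_begin) {
        child->klass->drained_begin(child);
    }

    /* ... and only the last drained_end invokes the end callback */
    if (--child->parent_quiesce_counter == 0 && child->klass->drained_end) {
        child->klass->drained_end(child);
    }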

Signed-off-by: Kevin Wolf 
---
 block.c  | 13 ++---
 block/io.c   | 24 +---
 tests/unit/test-bdrv-drain.c | 16 ++--
 3 files changed, 29 insertions(+), 24 deletions(-)

diff --git a/block.c b/block.c
index 9d082631d9..8878586f6e 100644
--- a/block.c
+++ b/block.c
@@ -2816,7 +2816,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
 {
 BlockDriverState *old_bs = child->bs;
 int new_bs_quiesce_counter;
-int drain_saldo;
 
 assert(!child->frozen);
 assert(old_bs != new_bs);
@@ -2827,15 +2826,13 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
 }
 
 new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
-drain_saldo = new_bs_quiesce_counter - child->parent_quiesce_counter;
 
 /*
  * If the new child node is drained but the old one was not, flush
  * all outstanding requests to the old child node.
  */
-while (drain_saldo > 0 && child->klass->drained_begin) {
+if (new_bs_quiesce_counter && !child->parent_quiesce_counter) {
 bdrv_parent_drained_begin_single(child, true);
-drain_saldo--;
 }
 
 if (old_bs) {
@@ -2859,7 +2856,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
  * more often.
  */
 assert(new_bs->quiesce_counter <= new_bs_quiesce_counter);
-drain_saldo += new_bs->quiesce_counter - new_bs_quiesce_counter;
 
 if (child->klass->attach) {
 child->klass->attach(child);
@@ -2869,10 +2865,13 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
 /*
  * If the old child node was drained but the new one is not, allow
  * requests to come in only after the new node has been attached.
+ *
+ * Update new_bs_quiesce_counter because bdrv_parent_drained_begin_single()
+ * polls, which could have changed the value.
  */
-while (drain_saldo < 0 && child->klass->drained_end) {
+new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0);
+if (!new_bs_quiesce_counter && child->parent_quiesce_counter) {
 bdrv_parent_drained_end_single(child);
-drain_saldo++;
 }
 }
 
diff --git a/block/io.c b/block/io.c
index 870a25d7a5..87c7a92f15 100644
--- a/block/io.c
+++ b/block/io.c
@@ -62,7 +62,7 @@ void bdrv_parent_drained_end_single(BdrvChild *c)
 {
 IO_OR_GS_CODE();
 
-assert(c->parent_quiesce_counter > 0);
+assert(c->parent_quiesce_counter == 1);
 c->parent_quiesce_counter--;
 if (c->klass->drained_end) {
 c->klass->drained_end(c);
@@ -109,6 +109,7 @@ static bool bdrv_parent_drained_poll(BlockDriverState *bs, BdrvChild *ignore,
 void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll)
 {
 IO_OR_GS_CODE();
+assert(c->parent_quiesce_counter == 0);
 c->parent_quiesce_counter++;
 if (c->klass->drained_begin) {
 c->klass->drained_begin(c);
@@ -352,16 +353,16 @@ void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
BdrvChild *parent, bool ignore_bds_parents)
 {
 IO_OR_GS_CODE();
-assert(!qemu_in_coroutine());
 
 /* Stop things in parent-to-child order */
 if (qatomic_fetch_inc(&bs->quiesce_counter) == 0) {
 aio_disable_external(bdrv_get_aio_context(bs));
-}
 
-bdrv_parent_drained_begin(bs, parent, ignore_bds_parents);
-if (bs->drv && bs->drv->bdrv_drain_begin) {
-bs->drv->bdrv_drain_begin(bs);
+/* TODO Remove ignore_bds_parents, we don't consider it any more */
+bdrv_parent_drained_begin(bs, parent, false);
+if (bs->drv && bs->drv->bdrv_drain_begin) {
+bs->drv->bdrv_drain_begin(bs);
+}
 }
 }
 
@@ -412,13 +413,14 @@ static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent,
 

[PATCH 11/13] block: Remove ignore_bds_parents parameter from drain functions

2022-11-08 Thread Kevin Wolf
ignore_bds_parents is now ignored, so we can just remove it.

Signed-off-by: Kevin Wolf 
---
 include/block/block-io.h | 10 ++
 block.c  |  4 +--
 block/io.c   | 78 +++-
 3 files changed, 32 insertions(+), 60 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index c35cb1e53f..5b54ed4672 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -305,14 +305,9 @@ void bdrv_parent_drained_end_single(BdrvChild *c);
  *
- * Poll for pending requests in @bs and its parents (except for @ignore_parent).
  *
- * If @ignore_bds_parents is true, parents that are BlockDriverStates must
- * ignore the drain request because they will be drained separately (used for
- * drain_all).
- *
  * This is part of bdrv_drained_begin.
  */
-bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent,
- bool ignore_bds_parents);
+bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent);
 
 /**
  * bdrv_drained_begin:
@@ -330,8 +325,7 @@ void bdrv_drained_begin(BlockDriverState *bs);
  * Quiesces a BDS like bdrv_drained_begin(), but does not wait for already
  * running requests to complete.
  */
-void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
-   BdrvChild *parent, bool ignore_bds_parents);
+void bdrv_do_drained_begin_quiesce(BlockDriverState *bs, BdrvChild *parent);
 
 /**
  * bdrv_drained_end:
diff --git a/block.c b/block.c
index 8878586f6e..5f5f79cd16 100644
--- a/block.c
+++ b/block.c
@@ -1218,13 +1218,13 @@ static char *bdrv_child_get_parent_desc(BdrvChild *c)
 static void bdrv_child_cb_drained_begin(BdrvChild *child)
 {
 BlockDriverState *bs = child->opaque;
-bdrv_do_drained_begin_quiesce(bs, NULL, false);
+bdrv_do_drained_begin_quiesce(bs, NULL);
 }
 
 static bool bdrv_child_cb_drained_poll(BdrvChild *child)
 {
 BlockDriverState *bs = child->opaque;
-return bdrv_drain_poll(bs, NULL, false);
+return bdrv_drain_poll(bs, NULL);
 }
 
 static void bdrv_child_cb_drained_end(BdrvChild *child)
diff --git a/block/io.c b/block/io.c
index 87c7a92f15..4a83359a8f 100644
--- a/block/io.c
+++ b/block/io.c
@@ -45,13 +45,12 @@ static void bdrv_parent_cb_resize(BlockDriverState *bs);
 static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
 int64_t offset, int64_t bytes, BdrvRequestFlags flags);
 
-static void bdrv_parent_drained_begin(BlockDriverState *bs, BdrvChild *ignore,
-  bool ignore_bds_parents)
+static void bdrv_parent_drained_begin(BlockDriverState *bs, BdrvChild *ignore)
 {
 BdrvChild *c, *next;
 
 QLIST_FOREACH_SAFE(c, &bs->parents, next_parent, next) {
-if (c == ignore || (ignore_bds_parents && c->klass->parent_is_bds)) {
+if (c == ignore) {
 continue;
 }
 bdrv_parent_drained_begin_single(c, false);
@@ -69,13 +68,12 @@ void bdrv_parent_drained_end_single(BdrvChild *c)
 }
 }
 
-static void bdrv_parent_drained_end(BlockDriverState *bs, BdrvChild *ignore,
-bool ignore_bds_parents)
+static void bdrv_parent_drained_end(BlockDriverState *bs, BdrvChild *ignore)
 {
 BdrvChild *c;
 
 QLIST_FOREACH(c, &bs->parents, next_parent) {
-if (c == ignore || (ignore_bds_parents && c->klass->parent_is_bds)) {
+if (c == ignore) {
 continue;
 }
 bdrv_parent_drained_end_single(c);
@@ -90,14 +88,13 @@ static bool bdrv_parent_drained_poll_single(BdrvChild *c)
 return false;
 }
 
-static bool bdrv_parent_drained_poll(BlockDriverState *bs, BdrvChild *ignore,
- bool ignore_bds_parents)
+static bool bdrv_parent_drained_poll(BlockDriverState *bs, BdrvChild *ignore)
 {
 BdrvChild *c, *next;
 bool busy = false;
 
 QLIST_FOREACH_SAFE(c, &bs->parents, next_parent, next) {
-if (c == ignore || (ignore_bds_parents && c->klass->parent_is_bds)) {
+if (c == ignore) {
 continue;
 }
 busy |= bdrv_parent_drained_poll_single(c);
@@ -238,16 +235,14 @@ typedef struct {
 bool begin;
 bool poll;
 BdrvChild *parent;
-bool ignore_bds_parents;
 } BdrvCoDrainData;
 
 /* Returns true if BDRV_POLL_WHILE() should go into a blocking aio_poll() */
-bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent,
- bool ignore_bds_parents)
+bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent)
 {
 IO_OR_GS_CODE();
 
-if (bdrv_parent_drained_poll(bs, ignore_parent, ignore_bds_parents)) {
+if (bdrv_parent_drained_poll(bs, ignore_parent)) {
 return true;
 }
 
@@ -258,16 +253,9 @@ bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent,
 return false;
 }
 
-static bool bdrv_drain_poll_top_level(BlockDriverState *bs,
-  BdrvChild *ignore_parent)
-{
-return 

[PATCH 02/13] test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end()

2022-11-08 Thread Kevin Wolf
We want to change .bdrv_co_drained_begin/end() back to be non-coroutine
callbacks, so in preparation, avoid yielding in their implementation.

This does almost the same as the existing logic in bdrv_drain_invoke(),
by creating and entering coroutines internally. However, since the test
case is by far the heaviest user of coroutine code in drain callbacks,
it is preferable to have the complexity in the test case rather than the
drain core, which is already complicated enough without this.

The behaviour for bdrv_drain_begin() is unchanged because we increase
bs->in_flight and this is still polled. However, bdrv_drain_end()
doesn't wait for the spawned coroutine to complete any more. This is
fine, we don't rely on bdrv_drain_end() restarting all operations
immediately before the next aio_poll().
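
The replacement pattern (visible in the diff below) is roughly the
following; background_work is a placeholder name for whatever used to run
inline and yield:

    static void coroutine_fn background_work(void *opaque)
    {
        BlockDriverState *bs = opaque;
        /* ... the work that previously ran directly in the callback ... */
        bdrv_dec_in_flight(bs);    /* lets the drain poll loop terminate */
    }

    /* in the drain callback itself: */
    Coroutine *co = qemu_coroutine_create(background_work, bs);
    bdrv_inc_in_flight(bs);        /* keeps bdrv_drained_begin() polling */
    aio_co_enter(bdrv_get_aio_context(bs), co);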

Signed-off-by: Kevin Wolf 
---
 tests/unit/test-bdrv-drain.c | 64 ++--
 1 file changed, 46 insertions(+), 18 deletions(-)

diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index 09dc4a4891..24f34e24ad 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -38,12 +38,22 @@ typedef struct BDRVTestState {
 bool sleep_in_drain_begin;
 } BDRVTestState;
 
+static void coroutine_fn sleep_in_drain_begin(void *opaque)
+{
+BlockDriverState *bs = opaque;
+
+qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 10);
+bdrv_dec_in_flight(bs);
+}
+
 static void coroutine_fn bdrv_test_co_drain_begin(BlockDriverState *bs)
 {
 BDRVTestState *s = bs->opaque;
 s->drain_count++;
 if (s->sleep_in_drain_begin) {
-qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 10);
+Coroutine *co = qemu_coroutine_create(sleep_in_drain_begin, bs);
+bdrv_inc_in_flight(bs);
+aio_co_enter(bdrv_get_aio_context(bs), co);
 }
 }
 
@@ -1916,6 +1926,21 @@ static int coroutine_fn bdrv_replace_test_co_preadv(BlockDriverState *bs,
 return 0;
 }
 
+static void coroutine_fn bdrv_replace_test_drain_co(void *opaque)
+{
+BlockDriverState *bs = opaque;
+BDRVReplaceTestState *s = bs->opaque;
+
+/* Keep waking io_co up until it is done */
+while (s->io_co) {
+aio_co_wake(s->io_co);
+s->io_co = NULL;
+qemu_coroutine_yield();
+}
+s->drain_co = NULL;
+bdrv_dec_in_flight(bs);
+}
+
 /**
  * If .drain_count is 0, wake up .io_co if there is one; and set
  * .was_drained.
@@ -1926,20 +1951,27 @@ static void coroutine_fn bdrv_replace_test_co_drain_begin(BlockDriverState *bs)
 BDRVReplaceTestState *s = bs->opaque;
 
 if (!s->drain_count) {
-/* Keep waking io_co up until it is done */
-s->drain_co = qemu_coroutine_self();
-while (s->io_co) {
-aio_co_wake(s->io_co);
-s->io_co = NULL;
-qemu_coroutine_yield();
-}
-s->drain_co = NULL;
-
+s->drain_co = qemu_coroutine_create(bdrv_replace_test_drain_co, bs);
+bdrv_inc_in_flight(bs);
+aio_co_enter(bdrv_get_aio_context(bs), s->drain_co);
 s->was_drained = true;
 }
 s->drain_count++;
 }
 
+static void coroutine_fn bdrv_replace_test_read_entry(void *opaque)
+{
+BlockDriverState *bs = opaque;
+char data;
+QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, &data, 1);
+int ret;
+
+/* Queue a read request post-drain */
+ret = bdrv_replace_test_co_preadv(bs, 0, 1, &qiov, 0);
+g_assert(ret >= 0);
+bdrv_dec_in_flight(bs);
+}
+
 /**
  * Reduce .drain_count, set .was_undrained once it reaches 0.
  * If .drain_count reaches 0 and the node has a backing file, issue a
@@ -1951,17 +1983,13 @@ static void coroutine_fn bdrv_replace_test_co_drain_end(BlockDriverState *bs)
 
 g_assert(s->drain_count > 0);
 if (!--s->drain_count) {
-int ret;
-
 s->was_undrained = true;
 
 if (bs->backing) {
-char data;
-QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, &data, 1);
-
-/* Queue a read request post-drain */
-ret = bdrv_replace_test_co_preadv(bs, 0, 1, &qiov, 0);
-g_assert(ret >= 0);
+Coroutine *co = qemu_coroutine_create(bdrv_replace_test_read_entry,
+  bs);
+bdrv_inc_in_flight(bs);
+aio_co_enter(bdrv_get_aio_context(bs), co);
 }
 }
 }
-- 
2.38.1




[PATCH 13/13] block: Remove poll parameter from bdrv_parent_drained_begin_single()

2022-11-08 Thread Kevin Wolf
All callers of bdrv_parent_drained_begin_single() pass poll=false now,
so we don't need the parameter any more.

Signed-off-by: Kevin Wolf 
---
 include/block/block-io.h | 5 ++---
 block.c  | 4 ++--
 block/io.c   | 7 ++-
 3 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index ddce8550a9..35669f0e62 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -285,10 +285,9 @@ bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
 /**
  * bdrv_parent_drained_begin_single:
  *
- * Begin a quiesced section for the parent of @c. If @poll is true, wait for
- * any pending activity to cease.
+ * Begin a quiesced section for the parent of @c.
  */
-void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll);
+void bdrv_parent_drained_begin_single(BdrvChild *c);
 
 /**
  * bdrv_parent_drained_poll_single:
diff --git a/block.c b/block.c
index 12039e9b8a..c200f7afa0 100644
--- a/block.c
+++ b/block.c
@@ -2409,7 +2409,7 @@ static void bdrv_replace_child_abort(void *opaque)
  * new_bs drained when calling bdrv_replace_child_tran() is not a
  * requirement any more.
  */
-bdrv_parent_drained_begin_single(s->child, false);
+bdrv_parent_drained_begin_single(s->child);
 assert(!bdrv_parent_drained_poll_single(s->child));
 }
 assert(s->child->parent_quiesce_counter);
@@ -3016,7 +3016,7 @@ static BdrvChild *bdrv_attach_child_common(BlockDriverState *child_bs,
  * bdrv_replace_child_noperm() will undrain it if the child node is not
  * drained. The child was only just created, so polling is not necessary.
  */
-bdrv_parent_drained_begin_single(new_child, false);
+bdrv_parent_drained_begin_single(new_child);
 bdrv_replace_child_noperm(new_child, child_bs);
 
 BdrvAttachChildCommonState *s = g_new(BdrvAttachChildCommonState, 1);
diff --git a/block/io.c b/block/io.c
index d0f641926f..9bcb19e5ee 100644
--- a/block/io.c
+++ b/block/io.c
@@ -53,7 +53,7 @@ static void bdrv_parent_drained_begin(BlockDriverState *bs, BdrvChild *ignore)
 if (c == ignore) {
 continue;
 }
-bdrv_parent_drained_begin_single(c, false);
+bdrv_parent_drained_begin_single(c);
 }
 }
 
@@ -103,7 +103,7 @@ static bool bdrv_parent_drained_poll(BlockDriverState *bs, BdrvChild *ignore)
 return busy;
 }
 
-void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll)
+void bdrv_parent_drained_begin_single(BdrvChild *c)
 {
 IO_OR_GS_CODE();
 assert(c->parent_quiesce_counter == 0);
@@ -111,9 +111,6 @@ void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll)
 if (c->klass->drained_begin) {
 c->klass->drained_begin(c);
 }
-if (poll) {
-BDRV_POLL_WHILE(c->bs, bdrv_parent_drained_poll_single(c));
-}
 }
 
 static void bdrv_merge_limits(BlockLimits *dst, const BlockLimits *src)
-- 
2.38.1




[PATCH 08/13] stream: Replace subtree drain with a single node drain

2022-11-08 Thread Kevin Wolf
The subtree drain was introduced in commit b1e1af394d9 as a way to avoid
graph changes between finding the base node and changing the block graph
as necessary on completion of the image streaming job.

The block graph could change between these two points because
bdrv_set_backing_hd() first drains the parent node, which involved
polling and can do anything.

Subtree draining was an imperfect way to make this less likely (because
with it, fewer callbacks are called during this window). Everyone agreed
that it's not really the right solution, and it was only committed as a
stopgap solution.

This replaces the subtree drain with a solution that simply drains the
parent node before we try to find the base node, and then call a version
of bdrv_set_backing_hd() that doesn't drain, but just asserts that the
parent node is already drained.

This way, any graph changes caused by draining happen before we start
looking at the graph and things stay consistent between finding the base
node and changing the graph.
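
In outline (the complete change is in the diff below), stream_prepare()
now brackets the whole graph inspection and manipulation in one drained
section of the node it actually modifies:

    bdrv_drained_begin(unfiltered_bs);            /* may itself change the graph */
    base = bdrv_filter_or_cow_bs(s->above_base);  /* looked up only afterwards */
    /* ... */
    bdrv_set_backing_hd_drained(unfiltered_bs, base, &local_err);
    /* ... */
    bdrv_drained_end(unfiltered_bs);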

Signed-off-by: Kevin Wolf 
---
 include/block/block-global-state.h |  3 +++
 block.c| 17 ++---
 block/stream.c | 20 ++--
 3 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
index bb42ed9559..7923415d4e 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -82,6 +82,9 @@ int bdrv_open_file_child(const char *filename,
 BlockDriverState *bdrv_open_blockdev_ref(BlockdevRef *ref, Error **errp);
 int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
 Error **errp);
+int bdrv_set_backing_hd_drained(BlockDriverState *bs,
+BlockDriverState *backing_hd,
+Error **errp);
 int bdrv_open_backing_file(BlockDriverState *bs, QDict *parent_options,
const char *bdref_key, Error **errp);
 BlockDriverState *bdrv_open(const char *filename, const char *reference,
diff --git a/block.c b/block.c
index 2f6b25875f..43b893dd6c 100644
--- a/block.c
+++ b/block.c
@@ -3395,14 +3395,15 @@ static int bdrv_set_backing_noperm(BlockDriverState *bs,
 return bdrv_set_file_or_backing_noperm(bs, backing_hd, true, tran, errp);
 }
 
-int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
-Error **errp)
+int bdrv_set_backing_hd_drained(BlockDriverState *bs,
+BlockDriverState *backing_hd,
+Error **errp)
 {
 int ret;
 Transaction *tran = tran_new();
 
 GLOBAL_STATE_CODE();
-bdrv_drained_begin(bs);
+assert(bs->quiesce_counter > 0);
 
 ret = bdrv_set_backing_noperm(bs, backing_hd, tran, errp);
 if (ret < 0) {
@@ -3412,7 +3413,17 @@ int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
 ret = bdrv_refresh_perms(bs, errp);
 out:
 tran_finalize(tran, ret);
+return ret;
+}
 
+int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
+Error **errp)
+{
+int ret;
+GLOBAL_STATE_CODE();
+
+bdrv_drained_begin(bs);
+ret = bdrv_set_backing_hd_drained(bs, backing_hd, errp);
 bdrv_drained_end(bs);
 
 return ret;
diff --git a/block/stream.c b/block/stream.c
index 694709bd25..81dcf5a417 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -64,13 +64,16 @@ static int stream_prepare(Job *job)
 bdrv_cor_filter_drop(s->cor_filter_bs);
 s->cor_filter_bs = NULL;
 
-bdrv_subtree_drained_begin(s->above_base);
+/*
+ * bdrv_set_backing_hd() requires that unfiltered_bs is drained. Drain
+ * already here and use bdrv_set_backing_hd_drained() instead because
+ * the polling during drained_begin() might change the graph, and if we do
+ * this only later, we may end up working with the wrong base node (or it
+ * might even have gone away by the time we want to use it).
+ */
+bdrv_drained_begin(unfiltered_bs);
 
 base = bdrv_filter_or_cow_bs(s->above_base);
-if (base) {
-bdrv_ref(base);
-}
-
 unfiltered_base = bdrv_skip_filters(base);
 
 if (bdrv_cow_child(unfiltered_bs)) {
@@ -82,7 +85,7 @@ static int stream_prepare(Job *job)
 }
 }
 
-bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
+bdrv_set_backing_hd_drained(unfiltered_bs, base, &local_err);
 ret = bdrv_change_backing_file(unfiltered_bs, base_id, base_fmt, false);
 if (local_err) {
 error_report_err(local_err);
@@ -92,10 +95,7 @@ static int stream_prepare(Job *job)
 }
 
 out:
-if (base) {
-bdrv_unref(base);
-}
-bdrv_subtree_drained_end(s->above_base);
+bdrv_drained_end(unfiltered_bs);
 return ret;
 }
 
-- 
2.38.1




[PATCH 00/13] block: Simplify drain

2022-11-08 Thread Kevin Wolf
I'm aware that exactly nobody has been looking forward to a series with
this title, but it has to be. The way drain works means that we need to
poll in bdrv_replace_child_noperm() and that makes things rather messy
with Emanuele's multiqueue work because you must not poll while you hold
the graph lock.

The other reason why it has to be is that drain is way too complex and
there are too many different cases. Some simplification like this will
hopefully make it considerably more maintainable. The diffstat probably
tells something, too.

There are roughly speaking three parts in this series:

1. Make BlockDriver.bdrv_drained_begin/end() non-coroutine_fn again,
   which allows us to not poll on bdrv_drained_end() any more.

2. Remove subtree drains. They are a considerable complication in the
   whole drain machinery (in particular, they require polling in the
   BdrvChildClass.attach/detach() callbacks that are called during
   bdrv_replace_child_noperm()) and none of their users actually has a
   good reason to use them.

3. Finally get rid of polling in bdrv_replace_child_noperm() by
   requiring that the child is already drained by the caller and calling
   callbacks only once and not again for every nested drain section.

If necessary, a prefix of this series can be merged that covers only the
first or the first two parts and it would still make sense.

Kevin Wolf (13):
  qed: Don't yield in bdrv_qed_co_drain_begin()
  test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end()
  block: Revert .bdrv_drained_begin/end to non-coroutine_fn
  block: Remove drained_end_counter
  block: Inline bdrv_drain_invoke()
  block: Drain individual nodes during reopen
  block: Don't use subtree drains in bdrv_drop_intermediate()
  stream: Replace subtree drain with a single node drain
  block: Remove subtree drains
  block: Call drain callbacks only once
  block: Remove ignore_bds_parents parameter from drain functions
  block: Don't poll in bdrv_replace_child_noperm()
  block: Remove poll parameter from bdrv_parent_drained_begin_single()

 include/block/block-global-state.h |   3 +
 include/block/block-io.h   |  52 +---
 include/block/block_int-common.h   |  17 +-
 include/block/block_int-io.h   |  12 -
 block.c| 132 ++-
 block/block-backend.c  |   4 +-
 block/io.c | 281 --
 block/qed.c|  24 +-
 block/replication.c|   6 -
 block/stream.c |  20 +-
 block/throttle.c   |   6 +-
 blockdev.c |  13 -
 blockjob.c |   2 +-
 tests/unit/test-bdrv-drain.c   | 369 +++--
 14 files changed, 270 insertions(+), 671 deletions(-)

-- 
2.38.1




[PATCH 04/13] block: Remove drained_end_counter

2022-11-08 Thread Kevin Wolf
drained_end_counter is unused now, nobody changes its value any more. It
can be removed.

In cases where we had two almost identical functions that only differed
in whether the caller passes drained_end_counter, or whether they would
poll for a local drained_end_counter to reach 0, these become a single
function.
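
For illustration, the visible effect on the callback interface is simply
(see the header diff below):

    /* before */
    void (*drained_end)(BdrvChild *child, int *drained_end_counter);

    /* after */
    void (*drained_end)(BdrvChild *child);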

Signed-off-by: Kevin Wolf 
---
 include/block/block-io.h | 15 -
 include/block/block_int-common.h |  6 +-
 block.c  |  5 +-
 block/block-backend.c|  4 +-
 block/io.c   | 97 
 blockjob.c   |  2 +-
 6 files changed, 30 insertions(+), 99 deletions(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index 770ddeb7c8..97e9ae8bee 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -235,21 +235,6 @@ int coroutine_fn bdrv_co_copy_range(BdrvChild *src, int64_t src_offset,
 int64_t bytes, BdrvRequestFlags read_flags,
 BdrvRequestFlags write_flags);
 
-/**
- * bdrv_drained_end_no_poll:
- *
- * Same as bdrv_drained_end(), but do not poll for the subgraph to
- * actually become unquiesced.  Therefore, no graph changes will occur
- * with this function.
- *
- * *drained_end_counter is incremented for every background operation
- * that is scheduled, and will be decremented for every operation once
- * it settles.  The caller must poll until it reaches 0.  The counter
- * should be accessed using atomic operations only.
- */
-void bdrv_drained_end_no_poll(BlockDriverState *bs, int *drained_end_counter);
-
-
 /*
  * "I/O or GS" API functions. These functions can run without
  * the BQL, but only in one specific iothread/main loop.
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 0956acbb60..6504db4fd9 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -939,15 +939,11 @@ struct BdrvChildClass {
  * These functions must not change the graph (and therefore also must not
  * call aio_poll(), which could change the graph indirectly).
  *
- * If drained_end() schedules background operations, it must atomically
- * increment *drained_end_counter for each such operation and atomically
- * decrement it once the operation has settled.
- *
  * Note that this can be nested. If drained_begin() was called twice, new
  * I/O is allowed only after drained_end() was called twice, too.
  */
 void (*drained_begin)(BdrvChild *child);
-void (*drained_end)(BdrvChild *child, int *drained_end_counter);
+void (*drained_end)(BdrvChild *child);
 
 /*
  * Returns whether the parent has pending requests for the child. This
diff --git a/block.c b/block.c
index fed8077993..7a24bd4c36 100644
--- a/block.c
+++ b/block.c
@@ -1227,11 +1227,10 @@ static bool bdrv_child_cb_drained_poll(BdrvChild *child)
 return bdrv_drain_poll(bs, false, NULL, false);
 }
 
-static void bdrv_child_cb_drained_end(BdrvChild *child,
-  int *drained_end_counter)
+static void bdrv_child_cb_drained_end(BdrvChild *child)
 {
 BlockDriverState *bs = child->opaque;
-bdrv_drained_end_no_poll(bs, drained_end_counter);
+bdrv_drained_end(bs);
 }
 
 static int bdrv_child_cb_inactivate(BdrvChild *child)
diff --git a/block/block-backend.c b/block/block-backend.c
index c0c7d56c8d..ecdfeb49bb 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -129,7 +129,7 @@ static void blk_root_inherit_options(BdrvChildRole role, bool parent_is_format,
 }
 static void blk_root_drained_begin(BdrvChild *child);
 static bool blk_root_drained_poll(BdrvChild *child);
-static void blk_root_drained_end(BdrvChild *child, int *drained_end_counter);
+static void blk_root_drained_end(BdrvChild *child);
 
 static void blk_root_change_media(BdrvChild *child, bool load);
 static void blk_root_resize(BdrvChild *child);
@@ -2549,7 +2549,7 @@ static bool blk_root_drained_poll(BdrvChild *child)
 return busy || !!blk->in_flight;
 }
 
-static void blk_root_drained_end(BdrvChild *child, int *drained_end_counter)
+static void blk_root_drained_end(BdrvChild *child)
 {
 BlockBackend *blk = child->opaque;
 assert(blk->quiesce_counter);
diff --git a/block/io.c b/block/io.c
index 183b407f5b..41e6121c31 100644
--- a/block/io.c
+++ b/block/io.c
@@ -58,27 +58,19 @@ static void bdrv_parent_drained_begin(BlockDriverState *bs, BdrvChild *ignore,
 }
 }
 
-static void bdrv_parent_drained_end_single_no_poll(BdrvChild *c,
-   int *drained_end_counter)
+void bdrv_parent_drained_end_single(BdrvChild *c)
 {
+IO_OR_GS_CODE();
+
 assert(c->parent_quiesce_counter > 0);
 c->parent_quiesce_counter--;
 if (c->klass->drained_end) {
-c->klass->drained_end(c, drained_end_counter);
+c->klass->drained_end(c);
 }
 }
 
-void 

[PATCH 06/13] block: Drain individual nodes during reopen

2022-11-08 Thread Kevin Wolf
bdrv_reopen() and friends use subtree drains as a lazy way of covering
all the nodes they touch. Turns out that this lazy way is a lot more
complicated than just draining the nodes individually, even not
accounting for the additional complexity in the drain mechanism itself.

Simplify the code by switching to draining the individual nodes that are
already managed in the BlockReopenQueue anyway.
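
Conceptually (sketch only, not a hunk from this patch), the drained
sections now follow the lifetime of the reopen queue itself:

    queue = bdrv_reopen_queue(queue, bs, opts, false); /* drains each bs added */
    /* ... bdrv_reopen_multiple(queue, errp) runs with the nodes drained ... */
    bdrv_reopen_queue_free(queue);                     /* undrains them again */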

Signed-off-by: Kevin Wolf 
---
 block.c | 11 ---
 block/replication.c |  6 --
 blockdev.c  | 13 -
 3 files changed, 4 insertions(+), 26 deletions(-)

diff --git a/block.c b/block.c
index 7a24bd4c36..5828b970e4 100644
--- a/block.c
+++ b/block.c
@@ -4142,7 +4142,7 @@ static bool bdrv_recurse_has_child(BlockDriverState *bs,
  * returns a pointer to bs_queue, which is either the newly allocated
  * bs_queue, or the existing bs_queue being used.
  *
- * bs must be drained between bdrv_reopen_queue() and bdrv_reopen_multiple().
+ * bs is drained here and undrained by bdrv_reopen_queue_free().
  */
 static BlockReopenQueue *bdrv_reopen_queue_child(BlockReopenQueue *bs_queue,
  BlockDriverState *bs,
@@ -4162,12 +4162,10 @@ static BlockReopenQueue *bdrv_reopen_queue_child(BlockReopenQueue *bs_queue,
 int flags;
 QemuOpts *opts;
 
-/* Make sure that the caller remembered to use a drained section. This is
- * important to avoid graph changes between the recursive queuing here and
- * bdrv_reopen_multiple(). */
-assert(bs->quiesce_counter > 0);
 GLOBAL_STATE_CODE();
 
+bdrv_drained_begin(bs);
+
 if (bs_queue == NULL) {
 bs_queue = g_new0(BlockReopenQueue, 1);
 QTAILQ_INIT(bs_queue);
@@ -4317,6 +4315,7 @@ void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue)
 if (bs_queue) {
 BlockReopenQueueEntry *bs_entry, *next;
 QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
+bdrv_drained_end(bs_entry->state.bs);
 qobject_unref(bs_entry->state.explicit_options);
 qobject_unref(bs_entry->state.options);
 g_free(bs_entry);
@@ -4464,7 +4463,6 @@ int bdrv_reopen(BlockDriverState *bs, QDict *opts, bool keep_old_opts,
 
 GLOBAL_STATE_CODE();
 
-bdrv_subtree_drained_begin(bs);
 if (ctx != qemu_get_aio_context()) {
 aio_context_release(ctx);
 }
@@ -4475,7 +4473,6 @@ int bdrv_reopen(BlockDriverState *bs, QDict *opts, bool keep_old_opts,
 if (ctx != qemu_get_aio_context()) {
 aio_context_acquire(ctx);
 }
-bdrv_subtree_drained_end(bs);
 
 return ret;
 }
diff --git a/block/replication.c b/block/replication.c
index f1eed25e43..c62f48a874 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -374,9 +374,6 @@ static void reopen_backing_file(BlockDriverState *bs, bool writable,
 s->orig_secondary_read_only = bdrv_is_read_only(secondary_disk->bs);
 }
 
-bdrv_subtree_drained_begin(hidden_disk->bs);
-bdrv_subtree_drained_begin(secondary_disk->bs);
-
 if (s->orig_hidden_read_only) {
 QDict *opts = qdict_new();
 qdict_put_bool(opts, BDRV_OPT_READ_ONLY, !writable);
@@ -401,9 +398,6 @@ static void reopen_backing_file(BlockDriverState *bs, bool writable,
 aio_context_acquire(ctx);
 }
 }
-
-bdrv_subtree_drained_end(hidden_disk->bs);
-bdrv_subtree_drained_end(secondary_disk->bs);
 }
 
 static void backup_job_cleanup(BlockDriverState *bs)
diff --git a/blockdev.c b/blockdev.c
index 3f1dec6242..8ffb3d9537 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3547,8 +3547,6 @@ fail:
 void qmp_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp)
 {
 BlockReopenQueue *queue = NULL;
-GSList *drained = NULL;
-GSList *p;
 
 /* Add each one of the BDS that we want to reopen to the queue */
 for (; reopen_list != NULL; reopen_list = reopen_list->next) {
@@ -3585,9 +3583,7 @@ void qmp_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp)
 ctx = bdrv_get_aio_context(bs);
 aio_context_acquire(ctx);
 
-bdrv_subtree_drained_begin(bs);
 queue = bdrv_reopen_queue(queue, bs, qdict, false);
-drained = g_slist_prepend(drained, bs);
 
 aio_context_release(ctx);
 }
@@ -3598,15 +3594,6 @@ void qmp_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp)
 
 fail:
 bdrv_reopen_queue_free(queue);
-for (p = drained; p; p = p->next) {
-BlockDriverState *bs = p->data;
-AioContext *ctx = bdrv_get_aio_context(bs);
-
-aio_context_acquire(ctx);
-bdrv_subtree_drained_end(bs);
-aio_context_release(ctx);
-}
-g_slist_free(drained);
 }
 
 void qmp_blockdev_del(const char *node_name, Error **errp)
-- 
2.38.1




[PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin()

2022-11-08 Thread Kevin Wolf
We want to change .bdrv_co_drained_begin() back to be a non-coroutine
callback, so in preparation, avoid yielding in its implementation.

Because we increase bs->in_flight and bdrv_drained_begin() polls, the
behaviour is unchanged.

Signed-off-by: Kevin Wolf 
---
 block/qed.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index 2f36ad342c..013f826c44 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -282,9 +282,8 @@ static void coroutine_fn qed_unplug_allocating_write_reqs(BDRVQEDState *s)
 qemu_co_mutex_unlock(&s->table_lock);
 }
 
-static void coroutine_fn qed_need_check_timer_entry(void *opaque)
+static void coroutine_fn qed_need_check_timer(BDRVQEDState *s)
 {
-BDRVQEDState *s = opaque;
 int ret;
 
 trace_qed_need_check_timer_cb(s);
@@ -310,9 +309,20 @@ static void coroutine_fn qed_need_check_timer_entry(void *opaque)
 (void) ret;
 }
 
+static void coroutine_fn qed_need_check_timer_entry(void *opaque)
+{
+BDRVQEDState *s = opaque;
+
+qed_need_check_timer(opaque);
+bdrv_dec_in_flight(s->bs);
+}
+
 static void qed_need_check_timer_cb(void *opaque)
 {
+BDRVQEDState *s = opaque;
 Coroutine *co = qemu_coroutine_create(qed_need_check_timer_entry, opaque);
+
+bdrv_inc_in_flight(s->bs);
 qemu_coroutine_enter(co);
 }
 
@@ -363,8 +373,12 @@ static void coroutine_fn bdrv_qed_co_drain_begin(BlockDriverState *bs)
  * header is flushed.
  */
 if (s->need_check_timer && timer_pending(s->need_check_timer)) {
+Coroutine *co;
+
 qed_cancel_need_check_timer(s);
-qed_need_check_timer_entry(s);
+co = qemu_coroutine_create(qed_need_check_timer_entry, s);
+bdrv_inc_in_flight(bs);
+aio_co_enter(bdrv_get_aio_context(bs), co);
 }
 }
 
-- 
2.38.1




[PATCH 03/13] block: Revert .bdrv_drained_begin/end to non-coroutine_fn

2022-11-08 Thread Kevin Wolf
Polling during bdrv_drained_end() can be problematic (and in the future,
we may get cases for bdrv_drained_begin() where polling is forbidden,
and we don't care about already in-flight requests, but just want to
prevent new requests from arriving).

The .bdrv_drained_begin/end callbacks running in a coroutine is the only
reason why we have to do this polling, so make them non-coroutine
callbacks again. None of the callers actually yield any more.

This means that bdrv_drained_end() effectively doesn't poll any more,
even if AIO_WAIT_WHILE() loops are still there (their condition is false
from the beginning). This is generally not a problem, but in
test-bdrv-drain, some additional explicit aio_poll() calls need to be
added because the test case wants to verify the final state after BHs
have executed.
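
For a block driver this means the callbacks are now declared and
implemented as plain functions (sketch, matching the header change below):

    /* BlockDriver, after this patch: not coroutine_fn, must not aio_poll() */
    void (*bdrv_drain_begin)(BlockDriverState *bs);
    void (*bdrv_drain_end)(BlockDriverState *bs);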

Signed-off-by: Kevin Wolf 
---
 include/block/block_int-common.h | 10 ---
 block.c  |  4 +--
 block/io.c   | 49 +---
 block/qed.c  |  4 +--
 block/throttle.c |  6 ++--
 tests/unit/test-bdrv-drain.c | 18 ++--
 6 files changed, 30 insertions(+), 61 deletions(-)

diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 5a2cc077a0..0956acbb60 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -735,17 +735,19 @@ struct BlockDriver {
 void (*bdrv_io_unplug)(BlockDriverState *bs);
 
 /**
- * bdrv_co_drain_begin is called if implemented in the beginning of a
+ * bdrv_drain_begin is called if implemented in the beginning of a
  * drain operation to drain and stop any internal sources of requests in
  * the driver.
- * bdrv_co_drain_end is called if implemented at the end of the drain.
+ * bdrv_drain_end is called if implemented at the end of the drain.
  *
  * They should be used by the driver to e.g. manage scheduled I/O
  * requests, or toggle an internal state. After the end of the drain new
  * requests will continue normally.
+ *
+ * Implementations of both functions must not call aio_poll().
  */
-void coroutine_fn (*bdrv_co_drain_begin)(BlockDriverState *bs);
-void coroutine_fn (*bdrv_co_drain_end)(BlockDriverState *bs);
+void (*bdrv_drain_begin)(BlockDriverState *bs);
+void (*bdrv_drain_end)(BlockDriverState *bs);
 
 bool (*bdrv_supports_persistent_dirty_bitmap)(BlockDriverState *bs);
 bool coroutine_fn (*bdrv_co_can_store_new_dirty_bitmap)(
diff --git a/block.c b/block.c
index 3bd594eb2a..fed8077993 100644
--- a/block.c
+++ b/block.c
@@ -1705,8 +1705,8 @@ static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv,
 assert(is_power_of_2(bs->bl.request_alignment));
 
 for (i = 0; i < bs->quiesce_counter; i++) {
-if (drv->bdrv_co_drain_begin) {
-drv->bdrv_co_drain_begin(bs);
+if (drv->bdrv_drain_begin) {
+drv->bdrv_drain_begin(bs);
 }
 }
 
diff --git a/block/io.c b/block/io.c
index 34b30e304e..183b407f5b 100644
--- a/block/io.c
+++ b/block/io.c
@@ -250,55 +250,20 @@ typedef struct {
 int *drained_end_counter;
 } BdrvCoDrainData;
 
-static void coroutine_fn bdrv_drain_invoke_entry(void *opaque)
-{
-BdrvCoDrainData *data = opaque;
-BlockDriverState *bs = data->bs;
-
-if (data->begin) {
-bs->drv->bdrv_co_drain_begin(bs);
-} else {
-bs->drv->bdrv_co_drain_end(bs);
-}
-
-/* Set data->done and decrement drained_end_counter before bdrv_wakeup() */
-qatomic_mb_set(&data->done, true);
-if (!data->begin) {
-qatomic_dec(data->drained_end_counter);
-}
-bdrv_dec_in_flight(bs);
-
-g_free(data);
-}
-
-/* Recursively call BlockDriver.bdrv_co_drain_begin/end callbacks */
+/* Recursively call BlockDriver.bdrv_drain_begin/end callbacks */
 static void bdrv_drain_invoke(BlockDriverState *bs, bool begin,
   int *drained_end_counter)
 {
-BdrvCoDrainData *data;
-
-if (!bs->drv || (begin && !bs->drv->bdrv_co_drain_begin) ||
-(!begin && !bs->drv->bdrv_co_drain_end)) {
+if (!bs->drv || (begin && !bs->drv->bdrv_drain_begin) ||
+(!begin && !bs->drv->bdrv_drain_end)) {
 return;
 }
 
-data = g_new(BdrvCoDrainData, 1);
-*data = (BdrvCoDrainData) {
-.bs = bs,
-.done = false,
-.begin = begin,
-.drained_end_counter = drained_end_counter,
-};
-
-if (!begin) {
-qatomic_inc(drained_end_counter);
+if (begin) {
+bs->drv->bdrv_drain_begin(bs);
+} else {
+bs->drv->bdrv_drain_end(bs);
 }
-
-/* Make sure the driver callback completes during the polling phase for
- * drain_begin. */
-bdrv_inc_in_flight(bs);
-data->co = qemu_coroutine_create(bdrv_drain_invoke_entry, data);
-aio_co_schedule(bdrv_get_aio_context(bs), data->co);
 }
 
 /* Returns true if BDRV_POLL_WHILE() should go 

Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-08 Thread Alex Bennée


"Michael S. Tsirkin"  writes:

> On Tue, Nov 08, 2022 at 10:23:15AM +, Alex Bennée wrote:
>> 
>> "Michael S. Tsirkin"  writes:
>> 
>> > On Tue, Nov 08, 2022 at 09:23:04AM +, Alex Bennée wrote:
>> >> The previous fix to virtio_device_started revealed a problem in its
>> >> use by both the core and the device code. The core code should be able
>> >> to handle the device "starting" while the VM isn't running to handle
>> >> the restoration of migration state. To solve this dual use introduce a
>> >> new helper for use by the vhost-user backends who all use it to feed a
>> >> should_start variable.
>> >> 
>> >> We can also pick up a change vhost_user_blk_set_status while we are at
>> >> it which follows the same pattern.
>> >> 
>> >> Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to 
>> >> virtio_device_started)
>> >> Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device)
>> >> Signed-off-by: Alex Bennée 
>> >> Cc: "Michael S. Tsirkin" 
>> >
>> > why is this in this patchset?
>> 
>> As per my cover letter:
>> 
>>   Most of these patches have been posted before as single patch RFCs. A
>>   couple are already scheduled through other trees so will drop out in
>>   due course
>> 
>> but I keep them in my tree until they are merged so I can continue to
>> soak test them (and have a stable base for my other WIP trees).
>
> That's fine just pls don't double-post them on list, certainly
> not as part of a patchset.

Why not? Is this breaking some tooling?

-- 
Alex Bennée



Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-08 Thread Michael S. Tsirkin
On Tue, Nov 08, 2022 at 10:23:15AM +, Alex Bennée wrote:
> 
> "Michael S. Tsirkin"  writes:
> 
> > On Tue, Nov 08, 2022 at 09:23:04AM +, Alex Bennée wrote:
> >> The previous fix to virtio_device_started revealed a problem in its
> >> use by both the core and the device code. The core code should be able
> >> to handle the device "starting" while the VM isn't running to handle
> >> the restoration of migration state. To solve this dual use introduce a
> >> new helper for use by the vhost-user backends who all use it to feed a
> >> should_start variable.
> >> 
> >> We can also pick up a change vhost_user_blk_set_status while we are at
> >> it which follows the same pattern.
> >> 
> >> Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to 
> >> virtio_device_started)
> >> Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device)
> >> Signed-off-by: Alex Bennée 
> >> Cc: "Michael S. Tsirkin" 
> >
> > why is this in this patchset?
> 
> As per my cover letter:
> 
>   Most of these patches have been posted before as single patch RFCs. A
>   couple are already scheduled through other trees so will drop out in
>   due course
> 
> but I keep them in my tree until they are merged so I can continue to
> soak test them (and have a stable base for my other WIP trees).

That's fine just pls don't double-post them on list, certainly
not as part of a patchset.

> -- 
> Alex Bennée




Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-08 Thread Alex Bennée


"Michael S. Tsirkin"  writes:

> On Tue, Nov 08, 2022 at 09:23:04AM +, Alex Bennée wrote:
>> The previous fix to virtio_device_started revealed a problem in its
>> use by both the core and the device code. The core code should be able
>> to handle the device "starting" while the VM isn't running to handle
>> the restoration of migration state. To solve this dual use introduce a
>> new helper for use by the vhost-user backends who all use it to feed a
>> should_start variable.
>> 
>> We can also pick up a change vhost_user_blk_set_status while we are at
>> it which follows the same pattern.
>> 
>> Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to virtio_device_started)
>> Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device)
>> Signed-off-by: Alex Bennée 
>> Cc: "Michael S. Tsirkin" 
>
> why is this in this patchset?

As per my cover letter:

  Most of these patches have been posted before as single patch RFCs. A
  couple are already scheduled through other trees so will drop out in
  due course

but I keep them in my tree until they are merged so I can continue to
soak test them (and have a stable base for my other WIP trees).

-- 
Alex Bennée



Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-08 Thread Michael S. Tsirkin
On Tue, Nov 08, 2022 at 09:23:04AM +, Alex Bennée wrote:
> The previous fix to virtio_device_started revealed a problem in its
> use by both the core and the device code. The core code should be able
> to handle the device "starting" while the VM isn't running to handle
> the restoration of migration state. To solve this dual use introduce a
> new helper for use by the vhost-user backends who all use it to feed a
> should_start variable.
> 
> We can also pick up a change vhost_user_blk_set_status while we are at
> it which follows the same pattern.
> 
> Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to virtio_device_started)
> Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device)
> Signed-off-by: Alex Bennée 
> Cc: "Michael S. Tsirkin" 

is this the same as the RFC?

> ---
>  include/hw/virtio/virtio.h   | 18 ++
>  hw/block/vhost-user-blk.c|  6 +-
>  hw/virtio/vhost-user-fs.c|  2 +-
>  hw/virtio/vhost-user-gpio.c  |  2 +-
>  hw/virtio/vhost-user-i2c.c   |  2 +-
>  hw/virtio/vhost-user-rng.c   |  2 +-
>  hw/virtio/vhost-user-vsock.c |  2 +-
>  hw/virtio/vhost-vsock.c  |  2 +-
>  8 files changed, 25 insertions(+), 11 deletions(-)
> 
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index f41b4a7e64..3191c618f3 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -389,6 +389,24 @@ static inline bool virtio_device_started(VirtIODevice 
> *vdev, uint8_t status)
>  return vdev->started;
>  }
>  
> +return status & VIRTIO_CONFIG_S_DRIVER_OK;
> +}
> +
> +/**
> + * virtio_device_should_start() - check if device startable
> + * @vdev - the VirtIO device
> + * @status - the devices status bits
> + *
> + * This is similar to virtio_device_started() but also encapsulates a
> + * check on the VM status which would prevent a device starting
> + * anyway.
> + */
> +static inline bool virtio_device_should_start(VirtIODevice *vdev, uint8_t 
> status)
> +{
> +if (vdev->use_started) {
> +return vdev->started;
> +}
> +
>  if (!vdev->vm_running) {
>  return false;
>  }
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index 13bf5cc47a..8feaf12e4e 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -222,14 +222,10 @@ static void vhost_user_blk_stop(VirtIODevice *vdev)
>  static void vhost_user_blk_set_status(VirtIODevice *vdev, uint8_t status)
>  {
>  VHostUserBlk *s = VHOST_USER_BLK(vdev);
> -bool should_start = virtio_device_started(vdev, status);
> +bool should_start = virtio_device_should_start(vdev, status);
>  Error *local_err = NULL;
>  int ret;
>  
> -if (!vdev->vm_running) {
> -should_start = false;
> -}
> -
>  if (!s->connected) {
>  return;
>  }
> diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> index ad0f91c607..1c40f42045 100644
> --- a/hw/virtio/vhost-user-fs.c
> +++ b/hw/virtio/vhost-user-fs.c
> @@ -123,7 +123,7 @@ static void vuf_stop(VirtIODevice *vdev)
>  static void vuf_set_status(VirtIODevice *vdev, uint8_t status)
>  {
>  VHostUserFS *fs = VHOST_USER_FS(vdev);
> -bool should_start = virtio_device_started(vdev, status);
> +bool should_start = virtio_device_should_start(vdev, status);
>  
>  if (vhost_dev_is_started(&fs->vhost_dev) == should_start) {
>  return;
> diff --git a/hw/virtio/vhost-user-gpio.c b/hw/virtio/vhost-user-gpio.c
> index 8b40fe450c..677d1c7730 100644
> --- a/hw/virtio/vhost-user-gpio.c
> +++ b/hw/virtio/vhost-user-gpio.c
> @@ -152,7 +152,7 @@ static void vu_gpio_stop(VirtIODevice *vdev)
>  static void vu_gpio_set_status(VirtIODevice *vdev, uint8_t status)
>  {
>  VHostUserGPIO *gpio = VHOST_USER_GPIO(vdev);
> -bool should_start = virtio_device_started(vdev, status);
> +bool should_start = virtio_device_should_start(vdev, status);
>  
>  trace_virtio_gpio_set_status(status);
>  
> diff --git a/hw/virtio/vhost-user-i2c.c b/hw/virtio/vhost-user-i2c.c
> index bc58b6c0d1..864eba695e 100644
> --- a/hw/virtio/vhost-user-i2c.c
> +++ b/hw/virtio/vhost-user-i2c.c
> @@ -93,7 +93,7 @@ static void vu_i2c_stop(VirtIODevice *vdev)
>  static void vu_i2c_set_status(VirtIODevice *vdev, uint8_t status)
>  {
>  VHostUserI2C *i2c = VHOST_USER_I2C(vdev);
> -bool should_start = virtio_device_started(vdev, status);
> +bool should_start = virtio_device_should_start(vdev, status);
>  
>  if (vhost_dev_is_started(&i2c->vhost_dev) == should_start) {
>  return;
> diff --git a/hw/virtio/vhost-user-rng.c b/hw/virtio/vhost-user-rng.c
> index bc1f36c5ac..8b47287875 100644
> --- a/hw/virtio/vhost-user-rng.c
> +++ b/hw/virtio/vhost-user-rng.c
> @@ -90,7 +90,7 @@ static void vu_rng_stop(VirtIODevice *vdev)
>  static void vu_rng_set_status(VirtIODevice *vdev, uint8_t status)
>  {
>  VHostUserRNG *rng = VHOST_USER_RNG(vdev);
> -bool should_start = virtio_device_started(vdev, status);
> +  

Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-08 Thread Michael S. Tsirkin
On Tue, Nov 08, 2022 at 09:23:04AM +, Alex Bennée wrote:
> The previous fix to virtio_device_started revealed a problem in its
> use by both the core and the device code. The core code should be able
> to handle the device "starting" while the VM isn't running to handle
> the restoration of migration state. To solve this dual use introduce a
> new helper for use by the vhost-user backends who all use it to feed a
> should_start variable.
> 
> We can also pick up a change vhost_user_blk_set_status while we are at
> it which follows the same pattern.
> 
> Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to virtio_device_started)
> Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device)
> Signed-off-by: Alex Bennée 
> Cc: "Michael S. Tsirkin" 

why is this in this patchset?

> ---
>  include/hw/virtio/virtio.h   | 18 ++
>  hw/block/vhost-user-blk.c|  6 +-
>  hw/virtio/vhost-user-fs.c|  2 +-
>  hw/virtio/vhost-user-gpio.c  |  2 +-
>  hw/virtio/vhost-user-i2c.c   |  2 +-
>  hw/virtio/vhost-user-rng.c   |  2 +-
>  hw/virtio/vhost-user-vsock.c |  2 +-
>  hw/virtio/vhost-vsock.c  |  2 +-
>  8 files changed, 25 insertions(+), 11 deletions(-)
> 
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index f41b4a7e64..3191c618f3 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -389,6 +389,24 @@ static inline bool virtio_device_started(VirtIODevice *vdev, uint8_t status)
>  return vdev->started;
>  }
>  
> +return status & VIRTIO_CONFIG_S_DRIVER_OK;
> +}
> +
> +/**
> + * virtio_device_should_start() - check if device startable
> + * @vdev - the VirtIO device
> + * @status - the devices status bits
> + *
> + * This is similar to virtio_device_started() but also encapsulates a
> + * check on the VM status which would prevent a device starting
> + * anyway.
> + */
> +static inline bool virtio_device_should_start(VirtIODevice *vdev, uint8_t status)
> +{
> +if (vdev->use_started) {
> +return vdev->started;
> +}
> +
>  if (!vdev->vm_running) {
>  return false;
>  }
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index 13bf5cc47a..8feaf12e4e 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -222,14 +222,10 @@ static void vhost_user_blk_stop(VirtIODevice *vdev)
>  static void vhost_user_blk_set_status(VirtIODevice *vdev, uint8_t status)
>  {
>  VHostUserBlk *s = VHOST_USER_BLK(vdev);
> -bool should_start = virtio_device_started(vdev, status);
> +bool should_start = virtio_device_should_start(vdev, status);
>  Error *local_err = NULL;
>  int ret;
>  
> -if (!vdev->vm_running) {
> -should_start = false;
> -}
> -
>  if (!s->connected) {
>  return;
>  }
> diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> index ad0f91c607..1c40f42045 100644
> --- a/hw/virtio/vhost-user-fs.c
> +++ b/hw/virtio/vhost-user-fs.c
> @@ -123,7 +123,7 @@ static void vuf_stop(VirtIODevice *vdev)
>  static void vuf_set_status(VirtIODevice *vdev, uint8_t status)
>  {
>  VHostUserFS *fs = VHOST_USER_FS(vdev);
> -bool should_start = virtio_device_started(vdev, status);
> +bool should_start = virtio_device_should_start(vdev, status);
>  
>  if (vhost_dev_is_started(&fs->vhost_dev) == should_start) {
>  return;
> diff --git a/hw/virtio/vhost-user-gpio.c b/hw/virtio/vhost-user-gpio.c
> index 8b40fe450c..677d1c7730 100644
> --- a/hw/virtio/vhost-user-gpio.c
> +++ b/hw/virtio/vhost-user-gpio.c
> @@ -152,7 +152,7 @@ static void vu_gpio_stop(VirtIODevice *vdev)
>  static void vu_gpio_set_status(VirtIODevice *vdev, uint8_t status)
>  {
>  VHostUserGPIO *gpio = VHOST_USER_GPIO(vdev);
> -bool should_start = virtio_device_started(vdev, status);
> +bool should_start = virtio_device_should_start(vdev, status);
>  
>  trace_virtio_gpio_set_status(status);
>  
> diff --git a/hw/virtio/vhost-user-i2c.c b/hw/virtio/vhost-user-i2c.c
> index bc58b6c0d1..864eba695e 100644
> --- a/hw/virtio/vhost-user-i2c.c
> +++ b/hw/virtio/vhost-user-i2c.c
> @@ -93,7 +93,7 @@ static void vu_i2c_stop(VirtIODevice *vdev)
>  static void vu_i2c_set_status(VirtIODevice *vdev, uint8_t status)
>  {
>  VHostUserI2C *i2c = VHOST_USER_I2C(vdev);
> -bool should_start = virtio_device_started(vdev, status);
> +bool should_start = virtio_device_should_start(vdev, status);
>  
>  if (vhost_dev_is_started(&i2c->vhost_dev) == should_start) {
>  return;
> diff --git a/hw/virtio/vhost-user-rng.c b/hw/virtio/vhost-user-rng.c
> index bc1f36c5ac..8b47287875 100644
> --- a/hw/virtio/vhost-user-rng.c
> +++ b/hw/virtio/vhost-user-rng.c
> @@ -90,7 +90,7 @@ static void vu_rng_stop(VirtIODevice *vdev)
>  static void vu_rng_set_status(VirtIODevice *vdev, uint8_t status)
>  {
>  VHostUserRNG *rng = VHOST_USER_RNG(vdev);
> -bool should_start = virtio_device_started(vdev, status);
> + 

[PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start

2022-11-08 Thread Alex Bennée
The previous fix to virtio_device_started revealed a problem in its
use by both the core and the device code. The core code should be able
to handle the device "starting" while the VM isn't running, so that
migration state can be restored. To solve this dual use, introduce a
new helper for the vhost-user backends, which all use it to feed a
should_start variable.

We can also pick up a change to vhost_user_blk_set_status while we are
at it, since it follows the same pattern.

Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to virtio_device_started)
Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device)
Signed-off-by: Alex Bennée 
Cc: "Michael S. Tsirkin" 
---
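A condensed sketch of the shape every converted set_status handler takes
after this change, using the vhost-user-fs names from the hunks below (the
start/stop tail sits outside the quoted diff context, so that part is
reconstructed from the existing code and is only illustrative):

static void vuf_set_status(VirtIODevice *vdev, uint8_t status)
{
    VHostUserFS *fs = VHOST_USER_FS(vdev);
    /* The new helper also returns false while the VM is not running. */
    bool should_start = virtio_device_should_start(vdev, status);

    /* Nothing to do if the backend is already in the requested state. */
    if (vhost_dev_is_started(&fs->vhost_dev) == should_start) {
        return;
    }

    if (should_start) {
        vuf_start(vdev);
    } else {
        vuf_stop(vdev);
    }
}

The core virtio code keeps calling virtio_device_started(), which after this
patch is a pure status check and therefore still works while migration state
is being restored with the VM not yet running; only the backends that
actually start a vhost device switch to the new helper.
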
 include/hw/virtio/virtio.h   | 18 ++
 hw/block/vhost-user-blk.c|  6 +-
 hw/virtio/vhost-user-fs.c|  2 +-
 hw/virtio/vhost-user-gpio.c  |  2 +-
 hw/virtio/vhost-user-i2c.c   |  2 +-
 hw/virtio/vhost-user-rng.c   |  2 +-
 hw/virtio/vhost-user-vsock.c |  2 +-
 hw/virtio/vhost-vsock.c  |  2 +-
 8 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index f41b4a7e64..3191c618f3 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -389,6 +389,24 @@ static inline bool virtio_device_started(VirtIODevice *vdev, uint8_t status)
 return vdev->started;
 }
 
+return status & VIRTIO_CONFIG_S_DRIVER_OK;
+}
+
+/**
+ * virtio_device_should_start() - check if device startable
+ * @vdev - the VirtIO device
+ * @status - the devices status bits
+ *
+ * This is similar to virtio_device_started() but also encapsulates a
+ * check on the VM status which would prevent a device starting
+ * anyway.
+ */
+static inline bool virtio_device_should_start(VirtIODevice *vdev, uint8_t status)
+{
+if (vdev->use_started) {
+return vdev->started;
+}
+
 if (!vdev->vm_running) {
 return false;
 }
diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 13bf5cc47a..8feaf12e4e 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -222,14 +222,10 @@ static void vhost_user_blk_stop(VirtIODevice *vdev)
 static void vhost_user_blk_set_status(VirtIODevice *vdev, uint8_t status)
 {
 VHostUserBlk *s = VHOST_USER_BLK(vdev);
-bool should_start = virtio_device_started(vdev, status);
+bool should_start = virtio_device_should_start(vdev, status);
 Error *local_err = NULL;
 int ret;
 
-if (!vdev->vm_running) {
-should_start = false;
-}
-
 if (!s->connected) {
 return;
 }
diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index ad0f91c607..1c40f42045 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -123,7 +123,7 @@ static void vuf_stop(VirtIODevice *vdev)
 static void vuf_set_status(VirtIODevice *vdev, uint8_t status)
 {
 VHostUserFS *fs = VHOST_USER_FS(vdev);
-bool should_start = virtio_device_started(vdev, status);
+bool should_start = virtio_device_should_start(vdev, status);
 
 if (vhost_dev_is_started(&fs->vhost_dev) == should_start) {
 return;
diff --git a/hw/virtio/vhost-user-gpio.c b/hw/virtio/vhost-user-gpio.c
index 8b40fe450c..677d1c7730 100644
--- a/hw/virtio/vhost-user-gpio.c
+++ b/hw/virtio/vhost-user-gpio.c
@@ -152,7 +152,7 @@ static void vu_gpio_stop(VirtIODevice *vdev)
 static void vu_gpio_set_status(VirtIODevice *vdev, uint8_t status)
 {
 VHostUserGPIO *gpio = VHOST_USER_GPIO(vdev);
-bool should_start = virtio_device_started(vdev, status);
+bool should_start = virtio_device_should_start(vdev, status);
 
 trace_virtio_gpio_set_status(status);
 
diff --git a/hw/virtio/vhost-user-i2c.c b/hw/virtio/vhost-user-i2c.c
index bc58b6c0d1..864eba695e 100644
--- a/hw/virtio/vhost-user-i2c.c
+++ b/hw/virtio/vhost-user-i2c.c
@@ -93,7 +93,7 @@ static void vu_i2c_stop(VirtIODevice *vdev)
 static void vu_i2c_set_status(VirtIODevice *vdev, uint8_t status)
 {
 VHostUserI2C *i2c = VHOST_USER_I2C(vdev);
-bool should_start = virtio_device_started(vdev, status);
+bool should_start = virtio_device_should_start(vdev, status);
 
 if (vhost_dev_is_started(&i2c->vhost_dev) == should_start) {
 return;
diff --git a/hw/virtio/vhost-user-rng.c b/hw/virtio/vhost-user-rng.c
index bc1f36c5ac..8b47287875 100644
--- a/hw/virtio/vhost-user-rng.c
+++ b/hw/virtio/vhost-user-rng.c
@@ -90,7 +90,7 @@ static void vu_rng_stop(VirtIODevice *vdev)
 static void vu_rng_set_status(VirtIODevice *vdev, uint8_t status)
 {
 VHostUserRNG *rng = VHOST_USER_RNG(vdev);
-bool should_start = virtio_device_started(vdev, status);
+bool should_start = virtio_device_should_start(vdev, status);
 
 if (vhost_dev_is_started(&rng->vhost_dev) == should_start) {
 return;
diff --git a/hw/virtio/vhost-user-vsock.c b/hw/virtio/vhost-user-vsock.c
index 7b67e29d83..9431b9792c 100644
--- a/hw/virtio/vhost-user-vsock.c
+++ b/hw/virtio/vhost-user-vsock.c
@@ -55,7 +55,7 @@ 

Re: [PATCH-for-7.2 1/2] hw/sd/sdhci: Do not set Buf Wr Ena before writing block (CVE-2022-3872)

2022-11-08 Thread Mauro Matteo Cascella
On Mon, Nov 7, 2022 at 11:12 PM Philippe Mathieu-Daudé wrote:
>
> When sdhci_write_block_to_card() is called to transfer data from
> the FIFO to the SD bus, the data is already present in the buffer
> and we have to consume it directly.
>
> See the description of the 'Buffer Write Enable' bit from the
> 'Present State' register (prnsts::SDHC_SPACE_AVAILABLE) in Table
> 2.14 from the SDHCI spec v2:
>
>   Buffer Write Enable
>
>   This status is used for non-DMA write transfers.
>
>   The Host Controller can implement multiple buffers to transfer
>   data efficiently. This read only flag indicates if space is
>   available for write data. If this bit is 1, data can be written
>   to the buffer. A change of this bit from 1 to 0 occurs when all
>   the block data is written to the buffer. A change of this bit
>   from 0 to 1 occurs when top of block data can be written to the
>   buffer and generates the Buffer Write Ready interrupt.
>
> In our case, we do not want to overwrite the buffer, so we want
> this bit to be 0, then set it to 1 once the data is written onto
> the bus.
>
> This is probably a copy/paste error from commit d7dfca0807
> ("hw/sdhci: introduce standard SD host controller").
>
> Reproducer:
> https://lore.kernel.org/qemu-devel/caa8xkjxrms0fkr28akvnnpyatm0y0b+5fichpsrhd+mugnu...@mail.gmail.com/
>
> Fixes: CVE-2022-3872
> Reported-by: RivenDell 
> Reported-by: Siqi Chen 
> Reported-by: ningqiang 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  hw/sd/sdhci.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/sd/sdhci.c b/hw/sd/sdhci.c
> index 306070c872..f230e7475f 100644
> --- a/hw/sd/sdhci.c
> +++ b/hw/sd/sdhci.c
> @@ -954,7 +954,7 @@ static void sdhci_data_transfer(void *opaque)
>  sdhci_read_block_from_card(s);
>  } else {
>  s->prnsts |= SDHC_DOING_WRITE | SDHC_DAT_LINE_ACTIVE |
> -SDHC_SPACE_AVAILABLE | SDHC_DATA_INHIBIT;
> +   SDHC_DATA_INHIBIT;
>  sdhci_write_block_to_card(s);
>  }
>  }
> --
> 2.38.1
>

Tested-by: Mauro Matteo Cascella 

Thank you,

--
Mauro Matteo Cascella
Red Hat Product Security
PGP-Key ID: BB3410B0
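
To spell out the reasoning in Philippe's commit message as code, here is a
condensed sketch of the fixed non-DMA write branch of sdhci_data_transfer()
(bit names as in hw/sd/sdhci.c; the comments are an interpretation of the
commit message and spec text above, and the diff remains authoritative):

    } else {
        /* The buffer already holds the block and is about to be drained
         * onto the SD bus, so Buffer Write Enable must read back as 0:
         * SDHC_SPACE_AVAILABLE is deliberately not set here. */
        s->prnsts |= SDHC_DOING_WRITE | SDHC_DAT_LINE_ACTIVE |
                     SDHC_DATA_INHIBIT;
        sdhci_write_block_to_card(s);
        /* Only once the block data has been written onto the bus is
         * SDHC_SPACE_AVAILABLE raised again, together with the Buffer
         * Write Ready interrupt, signalling room for the next block. */
    }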