Re: [PULL 0/3] Memory/SDHCI/ParallelFlash patches for v7.2.0-rc0
On 8/11/22 21:57, Stefan Hajnoczi wrote:
> I've dropped the SDHCI CVE fix due to the CI failure. The rest of the
> commits are still in the staging tree and I plan to include them in
> v7.2.0-rc0.

Thank you Stefan, sorry for not catching that failure sooner.
Re: [PATCH v3 4/4] hw/nvme: add polling support
at 10:11 PM, Klaus Jensen wrote:
> On Nov 8 12:39, John Levon wrote:
>> On Fri, Nov 04, 2022 at 07:32:12AM +0100, Klaus Jensen wrote:
>>> On Nov 3 21:19, Jinhao Fan wrote:
>>>> On 11/3/2022 8:10 PM, Klaus Jensen wrote:
>>>>> I agree that the spec is a little unclear on this point. In any case,
>>>>> in Linux, when the driver has decided that the sq tail must be
>>>>> updated, it will use this check:
>>>>>
>>>>>   (new_idx - event_idx - 1) < (new_idx - old)
>>>>
>>>> When eventidx is already behind, it's like:
>>>>
>>>>   0
>>>>   1 <- event_idx
>>>>   2 <- old
>>>>   3 <- new_idx
>>>>   4
>>>>   .
>>>>   .
>>>>   .
>>>>
>>>> In this case, (new_idx - event_idx - 1) = 3-1-1 = 1 >=
>>>> (new_idx - old) = 3-2 = 1, so the host won't update sq tail. Where am
>>>> I wrong in this example?
>>>
>>> That becomes 1 >= 1, i.e. "true". So this will result in the driver
>>> doing an mmio doorbell write.
>>
>> The code is:
>>
>> static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)
>> {
>>     return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);
>> }
>>
>> which per the above is "return 1 < 1;", or false. So the above case does
>> *not* do an mmio write. No?
>
> Whelp.
>
> Looks like I'm in the wrong here, apologies!

So disabling eventidx updates during polling has the potential of reducing
doorbell writes. But as Klaus observed, removing this function does not cause
a performance difference. So I guess only one command is processed during
each polling iteration.
[PATCH 8/8] virtio-blk: minimize virtio_blk_reset() AioContext lock region
blk_drain() needs the lock because it calls AIO_WAIT_WHILE(). The s->rq loop
doesn't need the lock because dataplane has been stopped when
virtio_blk_reset() is called.

Signed-off-by: Stefan Hajnoczi
---
 hw/block/virtio-blk.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 771d87cfbe..0b411b3065 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -898,6 +898,7 @@ static void virtio_blk_reset(VirtIODevice *vdev)
     ctx = blk_get_aio_context(s->blk);
     aio_context_acquire(ctx);
     blk_drain(s->blk);
+    aio_context_release(ctx);
 
     /* We drop queued requests after blk_drain() because blk_drain() itself can
      * produce them. */
@@ -908,8 +909,6 @@ static void virtio_blk_reset(VirtIODevice *vdev)
         virtio_blk_free_request(req);
     }
 
-    aio_context_release(ctx);
-
     assert(!s->dataplane_started);
     blk_set_enable_write_cache(s->blk, s->original_wce);
 }
-- 
2.38.1
[PATCH 7/8] virtio-blk: don't acquire AioContext in virtio_blk_handle_vq()
There is no need to acquire the AioContext in virtio_blk_handle_vq() because
no APIs used in the function require it and nothing else in the virtio-blk
code requires mutual exclusion anymore.

Signed-off-by: Stefan Hajnoczi
---
 hw/block/virtio-blk.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index faea045178..771d87cfbe 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -784,7 +784,6 @@ void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
 
     IO_CODE();
 
-    aio_context_acquire(blk_get_aio_context(s->blk));
     blk_io_plug(s->blk);
 
     do {
@@ -810,7 +809,6 @@ void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
     }
 
     blk_io_unplug(s->blk);
-    aio_context_release(blk_get_aio_context(s->blk));
 }
 
 static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
-- 
2.38.1
[PATCH 1/8] virtio_queue_aio_attach_host_notifier: remove AioContext lock
From: Emanuele Giuseppe Esposito

virtio_queue_aio_attach_host_notifier() and
virtio_queue_aio_attach_host_notifier_nopoll() always run in the main loop,
so there is no need to protect them with the AioContext lock. On the other
side, virtio_queue_aio_detach_host_notifier() runs in a bh in the iothread
context, but it is always scheduled (thus serialized) by the main loop.
Therefore removing the AioContext lock is safe.

In order to remove the AioContext lock it is necessary to switch
aio_wait_bh_oneshot() to AIO_WAIT_WHILE_UNLOCKED(). virtio-blk and
virtio-scsi are the only users of aio_wait_bh_oneshot() so it is possible to
make this change.

For now bdrv_set_aio_context() still needs the AioContext lock.

Signed-off-by: Emanuele Giuseppe Esposito
Signed-off-by: Stefan Hajnoczi
Message-Id: <20220609143727.1151816-2-eespo...@redhat.com>
---
 include/block/aio-wait.h        |  4 ++--
 hw/block/dataplane/virtio-blk.c | 10 ++++++----
 hw/block/virtio-blk.c           |  2 ++
 hw/scsi/virtio-scsi-dataplane.c | 10 ++++------
 util/aio-wait.c                 |  2 +-
 5 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/include/block/aio-wait.h b/include/block/aio-wait.h
index dd9a7f6461..fce6bfee3a 100644
--- a/include/block/aio-wait.h
+++ b/include/block/aio-wait.h
@@ -131,8 +131,8 @@ void aio_wait_kick(void);
  *
  * Run a BH in @ctx and wait for it to complete.
  *
- * Must be called from the main loop thread with @ctx acquired exactly once.
- * Note that main loop event processing may occur.
+ * Must be called from the main loop thread. @ctx must not be acquired by the
+ * caller. Note that main loop event processing may occur.
  */
 void aio_wait_bh_oneshot(AioContext *ctx, QEMUBHFunc *cb, void *opaque);
 
diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index b28d81737e..975f5ca8c4 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -167,6 +167,8 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
     Error *local_err = NULL;
     int r;
 
+    GLOBAL_STATE_CODE();
+
     if (vblk->dataplane_started || s->starting) {
         return 0;
     }
@@ -245,13 +247,11 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
     }
 
     /* Get this show started by hooking up our callbacks */
-    aio_context_acquire(s->ctx);
     for (i = 0; i < nvqs; i++) {
         VirtQueue *vq = virtio_get_queue(s->vdev, i);
 
         virtio_queue_aio_attach_host_notifier(vq, s->ctx);
     }
-    aio_context_release(s->ctx);
 
     return 0;
 
 fail_aio_context:
@@ -301,6 +301,8 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
     unsigned i;
     unsigned nvqs = s->conf->num_queues;
 
+    GLOBAL_STATE_CODE();
+
     if (!vblk->dataplane_started || s->stopping) {
         return;
     }
@@ -314,9 +316,10 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
     s->stopping = true;
     trace_virtio_blk_data_plane_stop(s);
 
-    aio_context_acquire(s->ctx);
     aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
 
+    aio_context_acquire(s->ctx);
+
     /* Wait for virtio_blk_dma_restart_bh() and in flight I/O to complete */
     blk_drain(s->conf->conf.blk);
 
@@ -325,7 +328,6 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
      * BlockBackend in the iothread, that's ok
      */
     blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context(), NULL);
-
     aio_context_release(s->ctx);
 
     /*
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 1762517878..cdc6fd5979 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -100,6 +100,8 @@ static void virtio_blk_rw_complete(void *opaque, int ret)
     VirtIOBlock *s = next->dev;
     VirtIODevice *vdev = VIRTIO_DEVICE(s);
 
+    IO_CODE();
+
     aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
     while (next) {
         VirtIOBlockReq *req = next;
diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index 20bb91766e..f6f55d4511 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -91,6 +91,8 @@ int virtio_scsi_dataplane_start(VirtIODevice *vdev)
     VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(vdev);
     VirtIOSCSI *s = VIRTIO_SCSI(vdev);
 
+    GLOBAL_STATE_CODE();
+
     if (s->dataplane_started ||
         s->dataplane_starting ||
         s->dataplane_fenced) {
@@ -138,20 +140,18 @@ int virtio_scsi_dataplane_start(VirtIODevice *vdev)
 
     /*
      * These fields are visible to the IOThread so we rely on implicit barriers
-     * in aio_context_acquire() on the write side and aio_notify_accept() on
-     * the read side.
+     * in virtio_queue_aio_attach_host_notifier() on the write side and
+     * aio_notify_accept() on the read side.
      */
     s->dataplane_starting = false;
     s->dataplane_started = true;
 
-    aio_context_acquire(s->ctx);
     virtio_queue_aio_attach_host_notifier(vs->ctrl_vq, s->ctx);
[PATCH 5/8] virtio-blk: mark IO_CODE functions
From: Emanuele Giuseppe Esposito

Just as done in the block API, mark functions in virtio-blk that are also
called from iothread(s). We know such functions are IO because many are blk_*
callbacks, running always in the device iothread, and the remaining ones are
propagated from the leaf IO functions (if a function calls an IO_CODE
function, it is categorized as IO_CODE too).

Signed-off-by: Emanuele Giuseppe Esposito
Reviewed-by: Stefan Hajnoczi
Signed-off-by: Stefan Hajnoczi
Message-Id: <20220609143727.1151816-7-eespo...@redhat.com>
---
 hw/block/dataplane/virtio-blk.c |  4 ++++
 hw/block/virtio-blk.c           | 45 +++++++++++++++++++++++++++++----
 2 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 728c9cd86c..3593ac0e7b 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -63,6 +63,8 @@ static void notify_guest_bh(void *opaque)
     unsigned long bitmap[BITS_TO_LONGS(nvqs)];
     unsigned j;
 
+    IO_CODE();
+
     memcpy(bitmap, s->batch_notify_vqs, sizeof(bitmap));
     memset(s->batch_notify_vqs, 0, sizeof(bitmap));
 
@@ -288,6 +290,8 @@ static void virtio_blk_data_plane_stop_bh(void *opaque)
     VirtIOBlockDataPlane *s = opaque;
     unsigned i;
 
+    IO_CODE();
+
     for (i = 0; i < s->conf->num_queues; i++) {
         VirtQueue *vq = virtio_get_queue(s->vdev, i);
 
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 02b213a140..f8fcf25292 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -39,6 +39,8 @@ static void virtio_blk_init_request(VirtIOBlock *s, VirtQueue *vq,
                                     VirtIOBlockReq *req)
 {
+    IO_CODE();
+
     req->dev = s;
     req->vq = vq;
     req->qiov.size = 0;
@@ -57,6 +59,8 @@ static void virtio_blk_req_complete(VirtIOBlockReq *req, unsigned char status)
     VirtIOBlock *s = req->dev;
     VirtIODevice *vdev = VIRTIO_DEVICE(s);
 
+    IO_CODE();
+
     trace_virtio_blk_req_complete(vdev, req, status);
 
     stb_p(&req->in->status, status);
@@ -76,6 +80,8 @@ static int virtio_blk_handle_rw_error(VirtIOBlockReq *req, int error,
     VirtIOBlock *s = req->dev;
     BlockErrorAction action = blk_get_error_action(s->blk, is_read, error);
 
+    IO_CODE();
+
     if (action == BLOCK_ERROR_ACTION_STOP) {
         /* Break the link as the next request is going to be parsed from the
          * ring again. Otherwise we may end up doing a double completion! */
@@ -143,7 +149,9 @@ static void virtio_blk_flush_complete(void *opaque, int ret)
     VirtIOBlockReq *req = opaque;
     VirtIOBlock *s = req->dev;
 
+    IO_CODE();
     aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
+
     if (ret) {
         if (virtio_blk_handle_rw_error(req, -ret, 0, true)) {
             goto out;
@@ -165,7 +173,9 @@ static void virtio_blk_discard_write_zeroes_complete(void *opaque, int ret)
     bool is_write_zeroes = (virtio_ldl_p(VIRTIO_DEVICE(s), &req->out.type) &
                             ~VIRTIO_BLK_T_BARRIER) == VIRTIO_BLK_T_WRITE_ZEROES;
 
+    IO_CODE();
     aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
+
     if (ret) {
         if (virtio_blk_handle_rw_error(req, -ret, false, is_write_zeroes)) {
             goto out;
@@ -198,6 +208,8 @@ static void virtio_blk_ioctl_complete(void *opaque, int status)
     struct virtio_scsi_inhdr *scsi;
     struct sg_io_hdr *hdr;
 
+    IO_CODE();
+
     scsi = (void *)req->elem.in_sg[req->elem.in_num - 2].iov_base;
 
     if (status) {
@@ -239,6 +251,8 @@ static VirtIOBlockReq *virtio_blk_get_request(VirtIOBlock *s, VirtQueue *vq)
 {
     VirtIOBlockReq *req = virtqueue_pop(vq, sizeof(VirtIOBlockReq));
 
+    IO_CODE();
+
     if (req) {
         virtio_blk_init_request(s, vq, req);
     }
@@ -259,6 +273,8 @@ static int virtio_blk_handle_scsi_req(VirtIOBlockReq *req)
     BlockAIOCB *acb;
 #endif
 
+    IO_CODE();
+
     /*
      * We require at least one output segment each for the virtio_blk_outhdr
      * and the SCSI command block.
@@ -357,6 +373,7 @@ fail:
 static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
 {
     int status;
+    IO_CODE();
 
     status = virtio_blk_handle_scsi_req(req);
     if (status != -EINPROGRESS) {
@@ -374,6 +391,8 @@ static inline void submit_requests(VirtIOBlock *s, MultiReqBuffer *mrb,
     bool is_write = mrb->is_write;
     BdrvRequestFlags flags = 0;
 
+    IO_CODE();
+
     if (num_reqs > 1) {
         int i;
         struct iovec *tmp_iov = qiov->iov;
@@ -423,6 +442,8 @@ static int multireq_compare(const void *a, const void *b)
     const VirtIOBlockReq *req1 = *(VirtIOBlockReq **)a,
                          *req2 = *(VirtIOBlockReq **)b;
 
+    IO_CODE();
+
     /*
      * Note that we can't simply subtract sector_num1 from sector_num2
      * here as that could overflow the return value.
@@ -442,6 +463,8 @@ static void virtio_blk_submit_multireq(VirtIOBlock *s, MultiReqBuffer *mrb)
     uint32_t
[PATCH 6/8] virtio-blk: remove unnecessary AioContext lock from function already safe
From: Emanuele Giuseppe Esposito

The AioContext lock was introduced in b9e413dd375 and in this instance it is
used to protect these 3 functions:
- virtio_blk_handle_rw_error
- virtio_blk_req_complete
- block_acct_done

Now that all three of the above functions are protected with their own locks,
we can get rid of the AioContext lock.

Signed-off-by: Emanuele Giuseppe Esposito
Reviewed-by: Stefan Hajnoczi
Signed-off-by: Stefan Hajnoczi
Message-Id: <20220609143727.1151816-9-eespo...@redhat.com>
---
 hw/block/virtio-blk.c | 19 ++-----------------
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index f8fcf25292..faea045178 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -108,7 +108,6 @@ static void virtio_blk_rw_complete(void *opaque, int ret)
 
     IO_CODE();
 
-    aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
     while (next) {
         VirtIOBlockReq *req = next;
         next = req->mr_next;
@@ -141,7 +140,6 @@ static void virtio_blk_rw_complete(void *opaque, int ret)
         block_acct_done(blk_get_stats(s->blk), &req->acct);
         virtio_blk_free_request(req);
     }
-    aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }
 
 static void virtio_blk_flush_complete(void *opaque, int ret)
@@ -150,20 +148,16 @@ static void virtio_blk_flush_complete(void *opaque, int ret)
     VirtIOBlock *s = req->dev;
 
     IO_CODE();
-    aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
 
     if (ret) {
         if (virtio_blk_handle_rw_error(req, -ret, 0, true)) {
-            goto out;
+            return;
         }
     }
 
     virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
     block_acct_done(blk_get_stats(s->blk), &req->acct);
     virtio_blk_free_request(req);
-
-out:
-    aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }
 
 static void virtio_blk_discard_write_zeroes_complete(void *opaque, int ret)
@@ -174,11 +168,10 @@ static void virtio_blk_discard_write_zeroes_complete(void *opaque, int ret)
                             ~VIRTIO_BLK_T_BARRIER) == VIRTIO_BLK_T_WRITE_ZEROES;
 
     IO_CODE();
-    aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
 
     if (ret) {
         if (virtio_blk_handle_rw_error(req, -ret, false, is_write_zeroes)) {
-            goto out;
+            return;
         }
     }
 
@@ -187,9 +180,6 @@ static void virtio_blk_discard_write_zeroes_complete(void *opaque, int ret)
         block_acct_done(blk_get_stats(s->blk), &req->acct);
     }
     virtio_blk_free_request(req);
-
-out:
-    aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }
 
 #ifdef __linux__
@@ -238,10 +228,8 @@ static void virtio_blk_ioctl_complete(void *opaque, int status)
     virtio_stl_p(vdev, &scsi->data_len, hdr->dxfer_len);
 
 out:
-    aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
     virtio_blk_req_complete(req, status);
     virtio_blk_free_request(req);
-    aio_context_release(blk_get_aio_context(s->conf.conf.blk));
     g_free(ioctl_req);
 }
 
@@ -852,7 +840,6 @@ static void virtio_blk_dma_restart_bh(void *opaque)
 
     s->rq = NULL;
 
-    aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
     while (req) {
         VirtIOBlockReq *next = req->next;
         if (virtio_blk_handle_request(req, &mrb)) {
@@ -876,8 +863,6 @@ static void virtio_blk_dma_restart_bh(void *opaque)
 
     /* Paired with inc in virtio_blk_dma_restart_cb() */
     blk_dec_in_flight(s->conf.conf.blk);
-
-    aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }
 
 /*
-- 
2.38.1
[PATCH 4/8] virtio-blk: mark GLOBAL_STATE_CODE functions
From: Emanuele Giuseppe Esposito

Just as done in the block API, mark functions in virtio-blk that are always
called in the main loop with the BQL held. We know such functions are GS
because they all are callbacks from the virtio.c API that has already
classified them as GS.

Signed-off-by: Emanuele Giuseppe Esposito
Reviewed-by: Stefan Hajnoczi
Signed-off-by: Stefan Hajnoczi
Message-Id: <20220609143727.1151816-6-eespo...@redhat.com>
---
 hw/block/dataplane/virtio-blk.c |  4 ++++
 hw/block/virtio-blk.c           | 27 +++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 975f5ca8c4..728c9cd86c 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -89,6 +89,8 @@ bool virtio_blk_data_plane_create(VirtIODevice *vdev, VirtIOBlkConf *conf,
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
 
+    GLOBAL_STATE_CODE();
+
     *dataplane = NULL;
 
     if (conf->iothread) {
@@ -140,6 +142,8 @@ void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s)
 {
     VirtIOBlock *vblk;
 
+    GLOBAL_STATE_CODE();
+
     if (!s) {
         return;
     }
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 96bc11d2fe..02b213a140 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -845,11 +845,17 @@ static void virtio_blk_dma_restart_bh(void *opaque)
     aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }
 
+/*
+ * Only called when VM is started or stopped in cpus.c.
+ * No iothread runs in parallel
+ */
 static void virtio_blk_dma_restart_cb(void *opaque, bool running,
                                       RunState state)
 {
     VirtIOBlock *s = opaque;
 
+    GLOBAL_STATE_CODE();
+
     if (!running) {
         return;
     }
@@ -867,8 +873,14 @@ static void virtio_blk_reset(VirtIODevice *vdev)
     AioContext *ctx;
     VirtIOBlockReq *req;
 
+    GLOBAL_STATE_CODE();
+
     ctx = blk_get_aio_context(s->blk);
     aio_context_acquire(ctx);
+    /*
+     * This drain together with ->stop_ioeventfd() in virtio_pci_reset()
+     * stops all Iothreads.
+     */
     blk_drain(s->blk);
 
     /* We drop queued requests after blk_drain() because blk_drain() itself can
@@ -1037,11 +1049,17 @@ static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t status)
     }
 }
 
+/*
+ * VM is stopped while doing migration, so iothread has
+ * no requests to process.
+ */
 static void virtio_blk_save_device(VirtIODevice *vdev, QEMUFile *f)
 {
     VirtIOBlock *s = VIRTIO_BLK(vdev);
     VirtIOBlockReq *req = s->rq;
 
+    GLOBAL_STATE_CODE();
+
     while (req) {
         qemu_put_sbyte(f, 1);
@@ -1055,11 +1073,17 @@ static void virtio_blk_save_device(VirtIODevice *vdev, QEMUFile *f)
     qemu_put_sbyte(f, 0);
 }
 
+/*
+ * VM is stopped while doing migration, so iothread has
+ * no requests to process.
+ */
 static int virtio_blk_load_device(VirtIODevice *vdev, QEMUFile *f,
                                   int version_id)
 {
     VirtIOBlock *s = VIRTIO_BLK(vdev);
 
+    GLOBAL_STATE_CODE();
+
     while (qemu_get_sbyte(f)) {
         unsigned nvqs = s->conf.num_queues;
         unsigned vq_idx = 0;
@@ -1108,6 +1132,7 @@ static const BlockDevOps virtio_block_ops = {
     .resize_cb = virtio_blk_resize,
 };
 
+/* Iothread is not yet created */
 static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -1116,6 +1141,8 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
     Error *err = NULL;
     unsigned i;
 
+    GLOBAL_STATE_CODE();
+
     if (!conf->conf.blk) {
         error_setg(errp, "drive property not set");
         return;
-- 
2.38.1
[PATCH 2/8] block-backend: enable_write_cache should be atomic
From: Emanuele Giuseppe Esposito

It is read from IO_CODE and written with the BQL held, so setting it as
atomic should be enough. Also remove the AioContext lock that was
sporadically taken around the set.

Signed-off-by: Emanuele Giuseppe Esposito
Reviewed-by: Stefan Hajnoczi
Signed-off-by: Stefan Hajnoczi
Message-Id: <20220609143727.1151816-3-eespo...@redhat.com>
---
 block/block-backend.c | 6 +++---
 hw/block/virtio-blk.c | 4 ----
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index c0c7d56c8d..949418cad4 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -60,7 +60,7 @@ struct BlockBackend {
      * can be used to restore those options in the new BDS on insert) */
     BlockBackendRootState root_state;
 
-    bool enable_write_cache;
+    bool enable_write_cache; /* Atomic */
 
     /* I/O stats (display with "info blockstats"). */
     BlockAcctStats stats;
@@ -1939,13 +1939,13 @@ bool blk_is_sg(BlockBackend *blk)
 bool blk_enable_write_cache(BlockBackend *blk)
 {
     IO_CODE();
-    return blk->enable_write_cache;
+    return qatomic_read(&blk->enable_write_cache);
 }
 
 void blk_set_enable_write_cache(BlockBackend *blk, bool wce)
 {
     IO_CODE();
-    blk->enable_write_cache = wce;
+    qatomic_set(&blk->enable_write_cache, wce);
 }
 
 void blk_activate(BlockBackend *blk, Error **errp)
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index cdc6fd5979..96d00103a4 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -961,9 +961,7 @@ static void virtio_blk_set_config(VirtIODevice *vdev, const uint8_t *config)
 
     memcpy(&blkcfg, config, s->config_size);
 
-    aio_context_acquire(blk_get_aio_context(s->blk));
     blk_set_enable_write_cache(s->blk, blkcfg.wce != 0);
-    aio_context_release(blk_get_aio_context(s->blk));
 }
 
 static uint64_t virtio_blk_get_features(VirtIODevice *vdev, uint64_t features,
@@ -1031,11 +1029,9 @@ static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t status)
      * s->blk would erroneously be placed in writethrough mode.
      */
     if (!virtio_vdev_has_feature(vdev, VIRTIO_BLK_F_CONFIG_WCE)) {
-        aio_context_acquire(blk_get_aio_context(s->blk));
         blk_set_enable_write_cache(s->blk,
                                    virtio_vdev_has_feature(vdev,
                                                            VIRTIO_BLK_F_WCE));
-        aio_context_release(blk_get_aio_context(s->blk));
     }
 }
-- 
2.38.1
[PATCH 0/8] virtio-blk: remove AioContext lock
This is a continuation of Emanuele Esposito's work to remove the AioContext
lock in virtio-blk. In the past it was necessary to acquire the AioContext
lock in order to do I/O. Paolo Bonzini and Emanuele have removed the need for
the AioContext in the core block layer code, with a few exceptions like
blk_drain() and blk_set_aio_context().

This patch series annotates virtio-blk functions with
IO_CODE()/GLOBAL_STATE_CODE() so it's clear in which context they are called.
It also removes unnecessary AioContext locks. The end result is that
virtio-blk rarely takes the AioContext lock.

Future patch series will add support for multiple IOThreads so that true
multi-queue can be achieved, but right now virtio-blk still has unprotected
shared state like s->rq, so that needs to wait for another day.

Based-on: <20221102182337.252202-1-stefa...@redhat.com>

Emanuele Giuseppe Esposito (6):
  virtio_queue_aio_attach_host_notifier: remove AioContext lock
  block-backend: enable_write_cache should be atomic
  virtio: categorize callbacks in GS
  virtio-blk: mark GLOBAL_STATE_CODE functions
  virtio-blk: mark IO_CODE functions
  virtio-blk: remove unnecessary AioContext lock from function already safe

Stefan Hajnoczi (2):
  virtio-blk: don't acquire AioContext in virtio_blk_handle_vq()
  virtio-blk: minimize virtio_blk_reset() AioContext lock region

 include/block/aio-wait.h        |  4 +-
 block/block-backend.c           |  6 +--
 hw/block/dataplane/virtio-blk.c | 18 ++++---
 hw/block/virtio-blk.c           | 92 ++++++++++++++++++++++-----------
 hw/scsi/virtio-scsi-dataplane.c | 10 ++--
 hw/virtio/virtio-bus.c          |  5 ++
 hw/virtio/virtio-pci.c          |  2 +
 hw/virtio/virtio.c              |  8 +++
 util/aio-wait.c                 |  2 +-
 9 files changed, 106 insertions(+), 41 deletions(-)

-- 
2.38.1
[PATCH 3/8] virtio: categorize callbacks in GS
From: Emanuele Giuseppe Esposito

All the callbacks below always run in the main loop. The callbacks are the
following:

- start/stop_ioeventfd: these are the callbacks where
  blk_set_aio_context(iothread) is done, so they are called in the main loop.

- save and load: called during migration, when the VM is stopped from the
  main loop.

- reset: before calling this callback, stop_ioeventfd is invoked, so it can
  only run in the main loop.

- set_status: going through all the callers we can see it is called from a
  MemoryRegionOps callback, which always runs in a vcpu thread and holds the
  BQL.

- realize: the iothread is not even created yet.

Signed-off-by: Emanuele Giuseppe Esposito
Reviewed-by: Stefan Hajnoczi
Acked-by: Michael S. Tsirkin
Signed-off-by: Stefan Hajnoczi
Message-Id: <20220609143727.1151816-5-eespo...@redhat.com>
---
 hw/block/virtio-blk.c  | 2 ++
 hw/virtio/virtio-bus.c | 5 +++++
 hw/virtio/virtio-pci.c | 2 ++
 hw/virtio/virtio.c     | 8 ++++++++
 4 files changed, 17 insertions(+)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 96d00103a4..96bc11d2fe 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1005,6 +1005,8 @@ static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t status)
 {
     VirtIOBlock *s = VIRTIO_BLK(vdev);
 
+    GLOBAL_STATE_CODE();
+
     if (!(status & (VIRTIO_CONFIG_S_DRIVER | VIRTIO_CONFIG_S_DRIVER_OK))) {
         assert(!s->dataplane_started);
     }
diff --git a/hw/virtio/virtio-bus.c b/hw/virtio/virtio-bus.c
index 896feb37a1..74cdf4bd27 100644
--- a/hw/virtio/virtio-bus.c
+++ b/hw/virtio/virtio-bus.c
@@ -23,6 +23,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/main-loop.h"
 #include "qemu/error-report.h"
 #include "qemu/module.h"
 #include "qapi/error.h"
@@ -224,6 +225,8 @@ int virtio_bus_start_ioeventfd(VirtioBusState *bus)
     VirtioDeviceClass *vdc = VIRTIO_DEVICE_GET_CLASS(vdev);
     int r;
 
+    GLOBAL_STATE_CODE();
+
     if (!k->ioeventfd_assign || !k->ioeventfd_enabled(proxy)) {
         return -ENOSYS;
     }
@@ -248,6 +251,8 @@ void virtio_bus_stop_ioeventfd(VirtioBusState *bus)
     VirtIODevice *vdev;
     VirtioDeviceClass *vdc;
 
+    GLOBAL_STATE_CODE();
+
     if (!bus->ioeventfd_started) {
         return;
     }
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index a1c9dfa7bb..4f9a94f61b 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -313,6 +313,8 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
     uint16_t vector;
     hwaddr pa;
 
+    GLOBAL_STATE_CODE();
+
     switch (addr) {
     case VIRTIO_PCI_GUEST_FEATURES:
         /* Guest does not negotiate properly?  We have to assume nothing. */
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 9683b2e158..468e8f5ad0 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -2422,6 +2422,8 @@ int virtio_set_status(VirtIODevice *vdev, uint8_t val)
     VirtioDeviceClass *k = VIRTIO_DEVICE_GET_CLASS(vdev);
     trace_virtio_set_status(vdev, val);
 
+    GLOBAL_STATE_CODE();
+
     if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
         if (!(vdev->status & VIRTIO_CONFIG_S_FEATURES_OK) &&
             val & VIRTIO_CONFIG_S_FEATURES_OK) {
@@ -2515,6 +2517,8 @@ void virtio_reset(void *opaque)
     VirtioDeviceClass *k = VIRTIO_DEVICE_GET_CLASS(vdev);
     int i;
 
+    GLOBAL_STATE_CODE();
+
     virtio_set_status(vdev, 0);
     if (current_cpu) {
         /* Guest initiated reset */
@@ -3357,6 +3361,8 @@ int virtio_save(VirtIODevice *vdev, QEMUFile *f)
     uint32_t guest_features_lo = (vdev->guest_features & 0xffffffff);
     int i;
 
+    GLOBAL_STATE_CODE();
+
     if (k->save_config) {
         k->save_config(qbus->parent, f);
     }
@@ -3508,6 +3514,8 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
     VirtioDeviceClass *vdc = VIRTIO_DEVICE_GET_CLASS(vdev);
 
+    GLOBAL_STATE_CODE();
+
     /*
      * We poison the endianness to ensure it does not get used before
      * subsections have been loaded.
-- 
2.38.1
Re: [PULL 0/3] Memory/SDHCI/ParallelFlash patches for v7.2.0-rc0
I've dropped the SDHCI CVE fix due to the CI failure. The rest of the commits are still in the staging tree and I plan to include them in v7.2.0-rc0. Stefan
Re: [PULL 0/3] Memory/SDHCI/ParallelFlash patches for v7.2.0-rc0
On Tue, 8 Nov 2022 at 13:35, Philippe Mathieu-Daudé wrote:
>
> The following changes since commit ade760a2f63804b7ab1839fbc3e5ddbf30538718:
>
>   Merge tag 'pull-request-2022-11-08' of https://gitlab.com/thuth/qemu
>   into staging (2022-11-08 11:34:06 -0500)
>
> are available in the Git repository at:
>
>   https://github.com/philmd/qemu.git tags/memflash-20221108
>
> for you to fetch changes up to cf9b3efd816518f9f210f50a0fa3e46a00b33c27:
>
>   Revert "hw/block/pflash_cfi: Error out if dev length isn't power of 2"
>   (2022-11-08 19:29:25 +0100)
>
> ----------------------------------------------------------------
> Memory/SDHCI/ParallelFlash patches queue
>
> - Fix wrong end address dump in 'info mtree' (Zhenzhong Duan)
> - Fix in SDHCI for CVE-2022-3872 (myself)

There is a CI failure:

>>> G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh
>>> MALLOC_PERTURB_=127 QTEST_QEMU_BINARY=./qemu-system-arm
>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon
>>> QTEST_QEMU_IMG=./qemu-img
>>> /builds/qemu-project/qemu/build/tests/qtest/npcm7xx_sdhci-test --tap -k

――――――――――――――――――― ✀ ―――――――――――――――――――
stderr:
** Message: 19:27:52.411: /tmp/sdhci_ZD2EV1
**
ERROR:../tests/qtest/npcm7xx_sdhci-test.c:101:sdwrite_read: assertion failed: (!memcmp(rmsg, msg, len))

https://gitlab.com/qemu-project/qemu/-/jobs/3292896670

Stefan
Re: [PATCH v3 08/17] migration/qemu-file: Add qemu_file_get_to_fd()
On 11/3/22 19:16, Avihai Horon wrote:
> Add new function qemu_file_get_to_fd() that allows reading data from
> QEMUFile and writing it straight into a given fd.
>
> This will be used later in VFIO migration code.
>
> Signed-off-by: Avihai Horon

Reviewed-by: Vladimir Sementsov-Ogievskiy

-- 
Best regards,
Vladimir
Re: [PATCH-for-7.2 1/2] hw/sd/sdhci: Do not set Buf Wr Ena before writing block (CVE-2022-3872)
On 221108 1225, Alexander Bulekov wrote:
> On 221107 2312, Philippe Mathieu-Daudé wrote:
> > When sdhci_write_block_to_card() is called to transfer data from
> > the FIFO to the SD bus, the data is already present in the buffer
> > and we have to consume it directly.
> >
> > See the description of the 'Buffer Write Enable' bit from the
> > 'Present State' register (prnsts::SDHC_SPACE_AVAILABLE) in Table
> > 2.14 from the SDHCI spec v2:
> >
> >   Buffer Write Enable
> >
> >   This status is used for non-DMA write transfers.
> >
> >   The Host Controller can implement multiple buffers to transfer
> >   data efficiently. This read only flag indicates if space is
> >   available for write data. If this bit is 1, data can be written
> >   to the buffer. A change of this bit from 1 to 0 occurs when all
> >   the block data is written to the buffer. A change of this bit
> >   from 0 to 1 occurs when top of block data can be written to the
> >   buffer and generates the Buffer Write Ready interrupt.
> >
> > In our case, we do not want to overwrite the buffer, so we want
> > this bit to be 0, then set it to 1 once the data is written onto
> > the bus.
> >
> > This is probably a copy/paste error from commit d7dfca0807
> > ("hw/sdhci: introduce standard SD host controller").
> >
> > Reproducer:
> > https://lore.kernel.org/qemu-devel/caa8xkjxrms0fkr28akvnnpyatm0y0b+5fichpsrhd+mugnu...@mail.gmail.com/
> >
> > Fixes: CVE-2022-3872
> > Reported-by: RivenDell
> > Reported-by: Siqi Chen
> > Reported-by: ningqiang
> > Signed-off-by: Philippe Mathieu-Daudé
>
> Seems like OSS-Fuzz also found this, not sure why it never made it into
> a gitlab issue:
> https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=45986#c4
>
> Slightly shorter reproducer:
>
> cat << EOF | ./qemu-system-i386 -display none -machine accel=qtest, -m \
> 512M -nodefaults -device sdhci-pci -device sd-card,drive=mydrive -drive \
> if=none,index=0,file=null-co://,format=raw,id=mydrive -nographic -qtest \
> stdio
> outl 0xcf8 0x80001010
> outl 0xcfc 0xe000
> outl 0xcf8 0x80001001
> outl 0xcfc 0x0600
> write 0xe058 0x1 0x6e
> write 0xe059 0x1 0x5a
> write 0xe028 0x1 0x10
> write 0xe02c 0x1 0x05
> write 0x5a6e 0x1 0x21
> write 0x5a75 0x1 0x20
> write 0xe005 0x1 0x02
> write 0xe00c 0x1 0x01
> write 0xe00e 0x1 0x20
> write 0xe00f 0x1 0x00
> write 0xe00c 0x1 0x00
> write 0xe020 0x1 0x00
> EOF

Hi Philippe,
I ran the fuzzer with this series applied and it found:

cat << EOF | ./qemu-system-i386 -display none -machine accel=qtest, -m \
512M -nodefaults -device sdhci-pci -device sd-card,drive=mydrive -drive \
if=none,index=0,file=null-co://,format=raw,id=mydrive -nographic -qtest \
stdio
outl 0xcf8 0x80001010
outl 0xcfc 0xe000
outl 0xcf8 0x80001004
outw 0xcfc 0x06
write 0xe028 0x1 0x08
write 0xe02c 0x1 0x05
write 0xe005 0x1 0x02
write 0x0 0x1 0x21
write 0x3 0x1 0x20
write 0xe00c 0x1 0x01
write 0xe00e 0x1 0x20
write 0xe00f 0x1 0x00
write 0xe00c 0x1 0x20
write 0xe020 0x1 0x00
EOF

The crash seems very similar, but it looks like instead of SDHC_TRNS_READ,
this reproducer sets SDHC_TRNS_MULTI.
-Alex
Re: [PATCH v3 06/17] vfio/migration: Fix NULL pointer dereference bug
On 11/3/22 19:16, Avihai Horon wrote: As part of its error flow, vfio_vmstate_change() accesses MigrationState->to_dst_file without any checks. This can cause a NULL pointer dereference if the error flow is taken and MigrationState->to_dst_file is not set. For example, this can happen if VM is started or stopped not during migration and vfio_vmstate_change() error flow is taken, as MigrationState->to_dst_file is not set at that time. Fix it by checking that MigrationState->to_dst_file is set before using it. Fixes: 02a7e71b1e5b ("vfio: Add VM state change handler to know state of VM") Signed-off-by: Avihai Horon Reviewed-by: Juan Quintela Reviewed-by: Vladimir Sementsov-Ogievskiy -- Best regards, Vladimir
Re: [PATCH v3 05/17] vfio/migration: Fix wrong enum usage
On 11/3/22 19:16, Avihai Horon wrote: vfio_migration_init() initializes VFIOMigration->device_state using enum of VFIO migration protocol v2. Current implemented protocol is v1 so v1 enum should be used. Fix it. Fixes: 429c72800654 ("vfio/migration: Fix incorrect initialization value for parameters in VFIOMigration") Signed-off-by: Avihai Horon Reviewed-by: Zhang Chen the commit is already in master branch -- Best regards, Vladimir
Re: [PATCH v3 04/17] migration: Simplify migration_iteration_run()
On 11/3/22 19:16, Avihai Horon wrote: From: Juan Quintela Signed-off-by: Juan Quintela Signed-off-by: Avihai Horon --- migration/migration.c | 25 +++++++++++++------------ 1 file changed, 13 insertions(+), 12 deletions(-) diff --git a/migration/migration.c b/migration/migration.c index ffe868b86f..59cc3c309b 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -3743,23 +3743,24 @@ static MigIterateState migration_iteration_run(MigrationState *s) trace_migrate_pending(pending_size, s->threshold_size, pend_pre, pend_post); -if (pending_size && pending_size >= s->threshold_size) { -/* Still a significant amount to transfer */ -if (!in_postcopy && pend_pre <= s->threshold_size && -qatomic_read(&s->start_postcopy)) { -if (postcopy_start(s)) { -error_report("%s: postcopy failed to start", __func__); -} -return MIG_ITERATE_SKIP; -} -/* Just another iteration step */ -qemu_savevm_state_iterate(s->to_dst_file, in_postcopy); -} else { + +if (pending_size < s->threshold_size) { Is the corner case "pending_size == s->threshold_size == 0" theoretically possible here? In this case pre-patch we go to completion; after the patch we go to the next iteration. trace_migration_thread_low_pending(pending_size); migration_completion(s); return MIG_ITERATE_BREAK; } +/* Still a significant amount to transfer */ +if (!in_postcopy && pend_pre <= s->threshold_size && +qatomic_read(&s->start_postcopy)) { +if (postcopy_start(s)) { +error_report("%s: postcopy failed to start", __func__); +} +return MIG_ITERATE_SKIP; +} + +/* Just another iteration step */ +qemu_savevm_state_iterate(s->to_dst_file, in_postcopy); return MIG_ITERATE_RESUME; } -- Best regards, Vladimir
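The corner case Vladimir asks about can be checked in isolation by modeling the two guard conditions as pure value comparisons (a hypothetical reduction for illustration only; the real code also involves postcopy state):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pre-patch guard: complete unless there is "still a significant
 * amount to transfer". */
static bool completes_before(uint64_t pending_size, uint64_t threshold_size)
{
    return !(pending_size && pending_size >= threshold_size);
}

/* Post-patch guard: complete when pending drops below the threshold. */
static bool completes_after(uint64_t pending_size, uint64_t threshold_size)
{
    return pending_size < threshold_size;
}
```

With pending_size == threshold_size == 0 the first form completes (`0 && ...` is false) while the second keeps iterating (`0 < 0` is false), which is exactly the behavioral divergence raised in the review.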
Re: [PATCH v3 03/17] migration: Block migration comment or code is wrong
On 11/8/22 21:36, Vladimir Sementsov-Ogievskiy wrote: On 11/3/22 19:16, Avihai Horon wrote: From: Juan Quintela And it appears that what is wrong is the code. During bulk stage we need to make sure that some block is dirty, but no games with max_size at all. :) That made me interested in, why we need this one block, so I decided to search through the history. And what I see? Haha, that was my commit 04636dc410b163c "migration/block: fix pending() return value" [1], which you actually revert with this patch. So, at least we should note, that it's a revert of [1]. Still that this will reintroduce the bug fixed by [1]. As I understand the problem is (was) that in block_save_complete() we finalize only dirty blocks, but don't finalize the bulk phase if it's not finalized yet. So, we can fix block_save_complete() to finalize the bulk phase, instead of hacking with pending in [1]. Interesting, why we need this one block, described in the comment you refer to? Was it an incomplete workaround for the same problem, described in [1]? If so, we can fix block_save_complete() and remove this if() together with the comment from block_save_pending(). PS: Don't we want to deprecate block migration? Is it really used in production? block-mirror is a recommended way to migrate block devices. -- Best regards, Vladimir
Re: [PATCH v3 03/17] migration: Block migration comment or code is wrong
On 11/3/22 19:16, Avihai Horon wrote: From: Juan Quintela And it appears that what is wrong is the code. During bulk stage we need to make sure that some block is dirty, but no games with max_size at all. :) That made me interested in, why we need this one block, so I decided to search through the history. And what I see? Haha, that was my commit 04636dc410b163c "migration/block: fix pending() return value" [1], which you actually revert with this patch. So, at least we should note, that it's a revert of [1]. Still that this will reintroduce the bug fixed by [1]. As I understand the problem is (was) that in block_save_complete() we finalize only dirty blocks, but don't finalize the bulk phase if it's not finalized yet. So, we can fix block_save_complete() to finalize the bulk phase, instead of hacking with pending in [1]. Interesting, why we need this one block, described in the comment you refer to? Was it an incomplete workaround for the same problem, described in [1]? If so, we can fix block_save_complete() and remove this if() together with the comment from block_save_pending(). Signed-off-by: Juan Quintela Reviewed-by: Stefan Hajnoczi --- migration/block.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/migration/block.c b/migration/block.c index b3d680af75..39ce4003c6 100644 --- a/migration/block.c +++ b/migration/block.c @@ -879,8 +879,8 @@ static void block_save_pending(void *opaque, uint64_t max_size, blk_mig_unlock(); /* Report at least one block pending during bulk phase */ -if (pending <= max_size && !block_mig_state.bulk_completed) { -pending = max_size + BLK_MIG_BLOCK_SIZE; +if (!pending && !block_mig_state.bulk_completed) { +pending = BLK_MIG_BLOCK_SIZE; } trace_migration_block_save_pending(pending); -- Best regards, Vladimir
[PULL 3/3] Revert "hw/block/pflash_cfi: Error out if dev length isn't power of 2"
From: Daniel Henrique Barboza Commit 334c388f25 ("pflash_cfi: Error out if device length isn't a power of two") aimed to finish the effort started by commit 06f1521795 ("pflash: Require backend size to match device, improve errors"), but unfortunately we are not quite there since various machines are still ready to accept incomplete / oversized pflash backend images, and now fail, i.e. on Debian bullseye: $ qemu-system-x86_64 \ -drive \ if=pflash,format=raw,unit=0,readonly=on,file=/usr/share/OVMF/OVMF_CODE.fd qemu-system-x86_64: Device size must be a power of two. where OVMF_CODE.fd comes from the ovmf package, which doesn't pad the firmware images to the flash size: $ ls -lh /usr/share/OVMF/ -rw-r--r-- 1 root root 3.5M Aug 19 2021 OVMF_CODE_4M.fd -rw-r--r-- 1 root root 1.9M Aug 19 2021 OVMF_CODE.fd -rw-r--r-- 1 root root 128K Aug 19 2021 OVMF_VARS.fd Since we entered the freeze period to prepare the v7.2.0 release, the safest is to revert commit 334c388f25707a234c4a0dea05b9df08d. Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1294 Signed-off-by: Philippe Mathieu-Daudé Message-Id: <20221108175755.95141-1-phi...@linaro.org> Signed-off-by: Daniel Henrique Barboza Message-Id: <20221108172633.860700-1-danielhb...@gmail.com> --- hw/block/pflash_cfi01.c | 8 ++-- hw/block/pflash_cfi02.c | 5 - 2 files changed, 2 insertions(+), 11 deletions(-) diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c index 9c235bf66e..0cbc2fb4cb 100644 --- a/hw/block/pflash_cfi01.c +++ b/hw/block/pflash_cfi01.c @@ -690,7 +690,7 @@ static const MemoryRegionOps pflash_cfi01_ops = { .endianness = DEVICE_NATIVE_ENDIAN, }; -static void pflash_cfi01_fill_cfi_table(PFlashCFI01 *pfl, Error **errp) +static void pflash_cfi01_fill_cfi_table(PFlashCFI01 *pfl) { uint64_t blocks_per_device, sector_len_per_device, device_len; int num_devices; @@ -708,10 +708,6 @@ static void pflash_cfi01_fill_cfi_table(PFlashCFI01 *pfl, Error **errp) sector_len_per_device = pfl->sector_len / num_devices; } 
device_len = sector_len_per_device * blocks_per_device; -if (!is_power_of_2(device_len)) { -error_setg(errp, "Device size must be a power of two."); -return; -} /* Hardcoded CFI table */ /* Standard "QRY" string */ @@ -869,7 +865,7 @@ static void pflash_cfi01_realize(DeviceState *dev, Error **errp) */ pfl->cmd = 0x00; pfl->status = 0x80; /* WSM ready */ -pflash_cfi01_fill_cfi_table(pfl, errp); +pflash_cfi01_fill_cfi_table(pfl); } static void pflash_cfi01_system_reset(DeviceState *dev) diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c index ff2fe154c1..2a99b286b0 100644 --- a/hw/block/pflash_cfi02.c +++ b/hw/block/pflash_cfi02.c @@ -880,11 +880,6 @@ static void pflash_cfi02_realize(DeviceState *dev, Error **errp) return; } -if (!is_power_of_2(pfl->chip_len)) { -error_setg(errp, "Device size must be a power of two."); -return; -} - memory_region_init_rom_device(&pfl->orig_mem, OBJECT(pfl), &pflash_cfi02_ops, pfl, pfl->name, pfl->chip_len, errp); -- 2.38.1
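For reference, the power-of-two test being reverted is the standard single-bit trick (a stand-in sketch, not QEMU's actual helper):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a power-of-two check: a power of two has exactly one
 * bit set, so value & (value - 1) clears that bit and yields zero. */
static bool is_power_of_2_u64(uint64_t value)
{
    return value != 0 && (value & (value - 1)) == 0;
}
```

Debian's unpadded 1.9M OVMF_CODE.fd is not a power of two, so the reverted check rejected it, while an image padded to 2M would have passed.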
[PULL 2/3] hw/sd/sdhci: Do not set Buf Wr Ena before writing block (CVE-2022-3872)
When sdhci_write_block_to_card() is called to transfer data from the FIFO to the SD bus, the data is already present in the buffer and we have to consume it directly. See the description of the 'Buffer Write Enable' bit from the 'Present State' register (prnsts::SDHC_SPACE_AVAILABLE) in Table 2.14 from the SDHCI spec v2: Buffer Write Enable This status is used for non-DMA write transfers. The Host Controller can implement multiple buffers to transfer data efficiently. This read only flag indicates if space is available for write data. If this bit is 1, data can be written to the buffer. A change of this bit from 1 to 0 occurs when all the block data is written to the buffer. A change of this bit from 0 to 1 occurs when top of block data can be written to the buffer and generates the Buffer Write Ready interrupt. In our case, we do not want to overwrite the buffer, so we want this bit to be 0, then set it to 1 once the data is written onto the bus. This is probably a copy/paste error from commit d7dfca0807 ("hw/sdhci: introduce standard SD host controller"). 
OSS-Fuzz Report: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=45986#c4 Reproducers: $ cat << EOF | \ qemu-system-x86_64 -nodefaults -display none -machine accel=qtest \ -m 512M -device sdhci-pci -device sd-card,drive=mydrive \ -drive if=none,index=0,file=null-co://,format=raw,id=mydrive \ -nographic -qtest stdio outl 0xcf8 0x80001010 outl 0xcfc 0xe000 outl 0xcf8 0x80001001 outl 0xcfc 0x0600 write 0xe058 0x1 0x6e write 0xe059 0x1 0x5a write 0xe028 0x1 0x10 write 0xe02c 0x1 0x05 write 0x5a6e 0x1 0x21 write 0x5a75 0x1 0x20 write 0xe005 0x1 0x02 write 0xe00c 0x1 0x01 write 0xe00e 0x1 0x20 write 0xe00f 0x1 0x00 write 0xe00c 0x1 0x00 write 0xe020 0x1 0x00 EOF or https://lore.kernel.org/qemu-devel/caa8xkjxrms0fkr28akvnnpyatm0y0b+5fichpsrhd+mugnu...@mail.gmail.com/ Fixes: CVE-2022-3872 Reported-by: RivenDell Reported-by: Siqi Chen Reported-by: ningqiang Reported-by: ClusterFuzz Signed-off-by: Philippe Mathieu-Daudé Tested-by: Mauro Matteo Cascella Message-Id: <20221107221236.47841-2-phi...@linaro.org> --- hw/sd/sdhci.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/sd/sdhci.c b/hw/sd/sdhci.c index 306070c872..f230e7475f 100644 --- a/hw/sd/sdhci.c +++ b/hw/sd/sdhci.c @@ -954,7 +954,7 @@ static void sdhci_data_transfer(void *opaque) sdhci_read_block_from_card(s); } else { s->prnsts |= SDHC_DOING_WRITE | SDHC_DAT_LINE_ACTIVE | -SDHC_SPACE_AVAILABLE | SDHC_DATA_INHIBIT; + SDHC_DATA_INHIBIT; sdhci_write_block_to_card(s); } } -- 2.38.1
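The effect of the one-line fix on the Present State register can be sketched as a pure bit operation (the flag values below follow the SDHCI Present State bit layout but are assumptions of this sketch, not copied from QEMU's headers):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed Present State register bits (SDHCI bit positions). */
enum {
    SDHC_DATA_INHIBIT    = 1u << 1,   /* Command Inhibit (DAT) */
    SDHC_DAT_LINE_ACTIVE = 1u << 2,
    SDHC_DOING_WRITE     = 1u << 8,   /* Write Transfer Active */
    SDHC_SPACE_AVAILABLE = 1u << 10,  /* Buffer Write Enable */
};

/* Fixed behavior: when starting to drain the FIFO onto the SD bus,
 * Buffer Write Enable stays 0 so the guest cannot overwrite the block
 * being consumed; it is raised later, once the data is on the bus. */
static uint32_t start_write_transfer(uint32_t prnsts)
{
    return prnsts | SDHC_DOING_WRITE | SDHC_DAT_LINE_ACTIVE |
           SDHC_DATA_INHIBIT;
}
```

The CVE was precisely that the pre-fix code also OR'ed in SDHC_SPACE_AVAILABLE at this point.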
[PULL 0/3] Memory/SDHCI/ParallelFlash patches for v7.2.0-rc0
The following changes since commit ade760a2f63804b7ab1839fbc3e5ddbf30538718: Merge tag 'pull-request-2022-11-08' of https://gitlab.com/thuth/qemu into staging (2022-11-08 11:34:06 -0500) are available in the Git repository at: https://github.com/philmd/qemu.git tags/memflash-20221108 for you to fetch changes up to cf9b3efd816518f9f210f50a0fa3e46a00b33c27: Revert "hw/block/pflash_cfi: Error out if dev length isn't power of 2" (2022-11-08 19:29:25 +0100) Memory/SDHCI/ParallelFlash patches queue - Fix wrong end address dump in 'info mtree' (Zhenzhong Duan) - Fix in SDHCI for CVE-2022-3872 (myself) - Revert latest pflash check of underlying block size (Daniel Henrique Barboza & myself) Daniel Henrique Barboza (1): Revert "hw/block/pflash_cfi: Error out if dev length isn't power of 2" Philippe Mathieu-Daudé (1): hw/sd/sdhci: Do not set Buf Wr Ena before writing block (CVE-2022-3872) Zhenzhong Duan (1): memory: Fix wrong end address dump hw/block/pflash_cfi01.c | 8 ++-- hw/block/pflash_cfi02.c | 5 - hw/sd/sdhci.c | 2 +- softmmu/physmem.c | 2 +- 4 files changed, 4 insertions(+), 13 deletions(-) -- 2.38.1
[PULL 1/3] memory: Fix wrong end address dump
From: Zhenzhong Duan The end address of memory region section isn't correctly calculated which leads to overflowed mtree dump: Dispatch Physical sections .. #70 @2000..00011fff io [ROOT] #71 @5000..5fff (noname) #72 @5000..00014fff io [ROOT] #73 @5658..5658 vmport #74 @5659..00015658 io [ROOT] #75 @6000..00015fff io [ROOT] After fix: #70 @2000..4fff io [ROOT] #71 @5000..5fff (noname) #72 @5000..5657 io [ROOT] #73 @5658..5658 vmport #74 @5659..5fff io [ROOT] #75 @6000.. io [ROOT] Fixes: 5e8fd947e2670 ("memory: Rework "info mtree" to print flat views and dispatch trees") Signed-off-by: Zhenzhong Duan Reviewed-by: David Hildenbrand Reviewed-by: Peter Xu Reviewed-by: Philippe Mathieu-Daudé Message-Id: <20220622095912.3430583-1-zhenzhong.d...@intel.com> Signed-off-by: Philippe Mathieu-Daudé --- softmmu/physmem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/softmmu/physmem.c b/softmmu/physmem.c index d9578ccfd4..1b606a3002 100644 --- a/softmmu/physmem.c +++ b/softmmu/physmem.c @@ -3712,7 +3712,7 @@ void mtree_print_dispatch(AddressSpaceDispatch *d, MemoryRegion *root) " %s%s%s%s%s", i, s->offset_within_address_space, -s->offset_within_address_space + MR_SIZE(s->mr->size), +s->offset_within_address_space + MR_SIZE(s->size), s->mr->name ? s->mr->name : "(noname)", i < ARRAY_SIZE(names) ? names[i] : "", s->mr == root ? " [ROOT]" : "", -- 2.38.1
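The arithmetic behind the fix: the dump prints an inclusive end address, so it must add the *section's* size, not the size of the whole memory region the section maps a slice of (simplified sketch; the real MR_SIZE macro also handles Int128 sizes):

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive end of a flatview section: start + size - 1. The bug used
 * the full memory region size here, which overflows past the section
 * whenever the section covers only part of the region. */
static uint64_t section_end(uint64_t offset_within_as, uint64_t section_size)
{
    return offset_within_as + section_size - 1;
}
```

For example, the io section in front of vmport starts at 0x5000 and is 0x658 bytes long, giving the corrected `@5000..5657` line in the dump above.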
Re: [PATCH-for-7.2 0/2] hw/sd/sdhci: Do not set Buf Wr Ena before writing block (CVE-2022-3872)
Applied to the staging tree. Thanks! Stefan
Re: [PATCH-for-7.2] Revert "hw/block/pflash_cfi: Error out if dev length isn't power of 2"
Applied to staging. Thanks! Stefan
Re: [PATCH v3 02/17] migration: No save_live_pending() method uses the QEMUFile parameter
On 11/3/22 19:16, Avihai Horon wrote: From: Juan Quintela So remove it everywhere. Signed-off-by: Juan Quintela Reviewed-by: Vladimir Sementsov-Ogievskiy -- Best regards, Vladimir
[PATCH-for-7.2] Revert "hw/block/pflash_cfi: Error out if dev length isn't power of 2"
Commit 334c388f25 ("pflash_cfi: Error out if device length isn't a power of two") aimed to finish the effort started by commit 06f1521795 ("pflash: Require backend size to match device, improve errors"), but unfortunately we are not quite there since various machines are still ready to accept incomplete / oversized pflash backend images, and now fail, i.e. on Debian bullseye: $ qemu-system-x86_64 \ -drive \ if=pflash,format=raw,unit=0,readonly=on,file=/usr/share/OVMF/OVMF_CODE.fd qemu-system-x86_64: Device size must be a power of two. where OVMF_CODE.fd comes from the ovmf package, which doesn't pad the firmware images to the flash size: $ ls -lh /usr/share/OVMF/ -rw-r--r-- 1 root root 3.5M Aug 19 2021 OVMF_CODE_4M.fd -rw-r--r-- 1 root root 1.9M Aug 19 2021 OVMF_CODE.fd -rw-r--r-- 1 root root 128K Aug 19 2021 OVMF_VARS.fd Since we entered the freeze period to prepare the v7.2.0 release, the safest is to revert commit 334c388f25707a234c4a0dea05b9df08d. Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1294 Signed-off-by: Philippe Mathieu-Daudé --- Cc: Sunil V L Cc: Daniel Henrique Barboza Cc: Markus Armbruster Cc: Bernhard Beschow --- hw/block/pflash_cfi01.c | 8 ++-- hw/block/pflash_cfi02.c | 5 - 2 files changed, 2 insertions(+), 11 deletions(-) diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c index 9c235bf66e..0cbc2fb4cb 100644 --- a/hw/block/pflash_cfi01.c +++ b/hw/block/pflash_cfi01.c @@ -690,7 +690,7 @@ static const MemoryRegionOps pflash_cfi01_ops = { .endianness = DEVICE_NATIVE_ENDIAN, }; -static void pflash_cfi01_fill_cfi_table(PFlashCFI01 *pfl, Error **errp) +static void pflash_cfi01_fill_cfi_table(PFlashCFI01 *pfl) { uint64_t blocks_per_device, sector_len_per_device, device_len; int num_devices; @@ -708,10 +708,6 @@ static void pflash_cfi01_fill_cfi_table(PFlashCFI01 *pfl, Error **errp) sector_len_per_device = pfl->sector_len / num_devices; } device_len = sector_len_per_device * blocks_per_device; -if (!is_power_of_2(device_len)) { 
-error_setg(errp, "Device size must be a power of two."); -return; -} /* Hardcoded CFI table */ /* Standard "QRY" string */ @@ -869,7 +865,7 @@ static void pflash_cfi01_realize(DeviceState *dev, Error **errp) */ pfl->cmd = 0x00; pfl->status = 0x80; /* WSM ready */ -pflash_cfi01_fill_cfi_table(pfl, errp); +pflash_cfi01_fill_cfi_table(pfl); } static void pflash_cfi01_system_reset(DeviceState *dev) diff --git a/hw/block/pflash_cfi02.c b/hw/block/pflash_cfi02.c index ff2fe154c1..2a99b286b0 100644 --- a/hw/block/pflash_cfi02.c +++ b/hw/block/pflash_cfi02.c @@ -880,11 +880,6 @@ static void pflash_cfi02_realize(DeviceState *dev, Error **errp) return; } -if (!is_power_of_2(pfl->chip_len)) { -error_setg(errp, "Device size must be a power of two."); -return; -} - memory_region_init_rom_device(&pfl->orig_mem, OBJECT(pfl), &pflash_cfi02_ops, pfl, pfl->name, pfl->chip_len, errp); -- 2.38.1
Re: [PATCH v3 01/17] migration: Remove res_compatible parameter
On 11/3/22 19:16, Avihai Horon wrote: From: Juan Quintela It was only used for RAM, and in that case, it means that this amount of data was sent for memory. Not clear for me, what means "this amount of data was sent for memory"... That amount of data was not yet sent, actually. Just delete the field in all callers. Signed-off-by: Juan Quintela --- hw/s390x/s390-stattrib.c | 6 ++ hw/vfio/migration.c| 10 -- hw/vfio/trace-events | 2 +- include/migration/register.h | 20 ++-- migration/block-dirty-bitmap.c | 7 +++ migration/block.c | 7 +++ migration/migration.c | 9 - migration/ram.c| 8 +++- migration/savevm.c | 14 +- migration/savevm.h | 4 +--- migration/trace-events | 2 +- 11 files changed, 37 insertions(+), 52 deletions(-) [..] diff --git a/include/migration/register.h b/include/migration/register.h index c1dcff0f90..1950fee6a8 100644 --- a/include/migration/register.h +++ b/include/migration/register.h @@ -48,18 +48,18 @@ typedef struct SaveVMHandlers { int (*save_setup)(QEMUFile *f, void *opaque); void (*save_live_pending)(QEMUFile *f, void *opaque, uint64_t threshold_size, - uint64_t *res_precopy_only, - uint64_t *res_compatible, - uint64_t *res_postcopy_only); + uint64_t *rest_precopy, + uint64_t *rest_postcopy); /* Note for save_live_pending: - * - res_precopy_only is for data which must be migrated in precopy phase - * or in stopped state, in other words - before target vm start - * - res_compatible is for data which may be migrated in any phase - * - res_postcopy_only is for data which must be migrated in postcopy phase - * or in stopped state, in other words - after source vm stop + * - res_precopy is for data which must be migrated in precopy + * phase or in stopped state, in other words - before target + * vm start + * - res_postcopy is for data which must be migrated in postcopy + * phase or in stopped state, in other words - after source vm + * stop * - * Sum of res_postcopy_only, res_compatible and res_postcopy_only is the - * whole amount of pending 
data. + * Sum of res_precopy and res_postcopy is the whole amount of + * pending data. */ [..] diff --git a/migration/ram.c b/migration/ram.c index dc1de9ddbc..20167e1102 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -3435,9 +3435,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque) } static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size, - uint64_t *res_precopy_only, - uint64_t *res_compatible, - uint64_t *res_postcopy_only) + uint64_t *res_precopy, uint64_t *res_postcopy) { RAMState **temp = opaque; RAMState *rs = *temp; @@ -3457,9 +3455,9 @@ static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size, if (migrate_postcopy_ram()) { /* We can do postcopy, and all the data is postcopiable */ -*res_compatible += remaining_size; +*res_postcopy += remaining_size; That's seems to be not quite correct. res_postcopy is defined as "data which must be migrated in postcopy", but that's not true here, as RAM can be migrated both in precopy and postcopy. Still we really can include "compat" into "postcopy" just because in the logic of migration_iteration_run() we don't actually distinguish "compat" and "post". The logic only depends on "total" and "pre". So, if we want to combine "compat" into "post", we should redefine "post" in the comment in include/migration/register.h, something like this: - res_precopy is for data which MUST be migrated in precopy phase or in stopped state, in other words - before target vm start - res_postcopy is for all data except for declared in res_precopy. res_postcopy data CAN be migrated in postcopy, i.e. after target vm start. } else { -*res_precopy_only += remaining_size; +*res_precopy += remaining_size; } } -- Best regards, Vladimir
[PATCH] block: m25p80: fix dummy byte count read from spansion cfg register
Spansion nor-flash stores the dummy read count in bits in a config register. This is currently read and used as the byte count which is wrong. This patch fixes this bit to byte conversion without warning about unsupported configurations (such as bits % 8 != 0) Signed-off-by: Ramon Aerne --- hw/block/m25p80.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/hw/block/m25p80.c b/hw/block/m25p80.c index 02adc87..cf10b11 100644 --- a/hw/block/m25p80.c +++ b/hw/block/m25p80.c @@ -981,7 +981,7 @@ static void decode_fast_read_cmd(Flash *s) s->needed_bytes += extract32(s->spansion_cr2v, SPANSION_DUMMY_CLK_POS, SPANSION_DUMMY_CLK_LEN -); +) / 8; break; case MAN_ISSI: /* @@ -1017,7 +1017,7 @@ static void decode_dio_read_cmd(Flash *s) s->needed_bytes += extract32(s->spansion_cr2v, SPANSION_DUMMY_CLK_POS, SPANSION_DUMMY_CLK_LEN -); +) / 8; break; case MAN_NUMONYX: s->needed_bytes += numonyx_extract_cfg_num_dummies(s); @@ -1067,7 +1067,7 @@ static void decode_qio_read_cmd(Flash *s) s->needed_bytes += extract32(s->spansion_cr2v, SPANSION_DUMMY_CLK_POS, SPANSION_DUMMY_CLK_LEN -); +) / 8; break; case MAN_NUMONYX: s->needed_bytes += numonyx_extract_cfg_num_dummies(s); -- 2.25.1
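The conversion the patch applies in all three fast-read decode paths can be illustrated in isolation (hypothetical helper; in the patch the division is applied inline to the value extracted from the Spansion CR2V register):

```c
#include <assert.h>

/* The Spansion config register encodes dummy *clock cycles*; the flash
 * model accumulates needed *bytes*, so the patch divides by 8. As the
 * commit message notes, a cycle count that is not a multiple of 8 is
 * silently truncated rather than warned about. */
static unsigned dummy_clocks_to_bytes(unsigned dummy_clocks)
{
    return dummy_clocks / 8;
}
```

Before the fix the raw cycle count was used as a byte count directly, making the device wait eight times too many dummy bytes.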
Re: [PATCH-for-7.2 1/2] hw/sd/sdhci: Do not set Buf Wr Ena before writing block (CVE-2022-3872)
On 221107 2312, Philippe Mathieu-Daudé wrote: > When sdhci_write_block_to_card() is called to transfer data from > the FIFO to the SD bus, the data is already present in the buffer > and we have to consume it directly. > > See the description of the 'Buffer Write Enable' bit from the > 'Present State' register (prnsts::SDHC_SPACE_AVAILABLE) in Table > 2.14 from the SDHCI spec v2: > > Buffer Write Enable > > This status is used for non-DMA write transfers. > > The Host Controller can implement multiple buffers to transfer > data efficiently. This read only flag indicates if space is > available for write data. If this bit is 1, data can be written > to the buffer. A change of this bit from 1 to 0 occurs when all > the block data is written to the buffer. A change of this bit > from 0 to 1 occurs when top of block data can be written to the > buffer and generates the Buffer Write Ready interrupt. > > In our case, we do not want to overwrite the buffer, so we want > this bit to be 0, then set it to 1 once the data is written onto > the bus. > > This is probably a copy/paste error from commit d7dfca0807 > ("hw/sdhci: introduce standard SD host controller"). 
> > Reproducer: > https://lore.kernel.org/qemu-devel/caa8xkjxrms0fkr28akvnnpyatm0y0b+5fichpsrhd+mugnu...@mail.gmail.com/ > > Fixes: CVE-2022-3872 > Reported-by: RivenDell > Reported-by: Siqi Chen > Reported-by: ningqiang > Signed-off-by: Philippe Mathieu-Daudé Seems like OSS-Fuzz also found this, not sure why it never made it into a gitlab issue: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=45986#c4 Slightly shorter reproducer: cat << EOF | ./qemu-system-i386 -display none -machine accel=qtest, -m \ 512M -nodefaults -device sdhci-pci -device sd-card,drive=mydrive -drive \ if=none,index=0,file=null-co://,format=raw,id=mydrive -nographic -qtest \ stdio outl 0xcf8 0x80001010 outl 0xcfc 0xe000 outl 0xcf8 0x80001001 outl 0xcfc 0x0600 write 0xe058 0x1 0x6e write 0xe059 0x1 0x5a write 0xe028 0x1 0x10 write 0xe02c 0x1 0x05 write 0x5a6e 0x1 0x21 write 0x5a75 0x1 0x20 write 0xe005 0x1 0x02 write 0xe00c 0x1 0x01 write 0xe00e 0x1 0x20 write 0xe00f 0x1 0x00 write 0xe00c 0x1 0x00 write 0xe020 0x1 0x00 EOF
Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start
"Michael S. Tsirkin" writes: > On Tue, Nov 08, 2022 at 11:21:26AM +, Alex Bennée wrote: >> >> "Michael S. Tsirkin" writes: >> >> > On Tue, Nov 08, 2022 at 10:23:15AM +, Alex Bennée wrote: >> >> >> >> "Michael S. Tsirkin" writes: >> >> >> >> > On Tue, Nov 08, 2022 at 09:23:04AM +, Alex Bennée wrote: >> >> >> The previous fix to virtio_device_started revealed a problem in its >> >> >> use by both the core and the device code. The core code should be able >> >> >> to handle the device "starting" while the VM isn't running to handle >> >> >> the restoration of migration state. To solve this dual use introduce a >> >> >> new helper for use by the vhost-user backends who all use it to feed a >> >> >> should_start variable. >> >> >> >> >> >> We can also pick up a change vhost_user_blk_set_status while we are at >> >> >> it which follows the same pattern. >> >> >> >> >> >> Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to >> >> >> virtio_device_started) >> >> >> Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio >> >> >> device) >> >> >> Signed-off-by: Alex Bennée >> >> >> Cc: "Michael S. Tsirkin" >> >> > >> >> > why is this in this patchset? >> >> >> >> As per my cover letter: >> >> >> >> Most of these patches have been posted before as single patch RFCs. A >> >> couple are already scheduled through other trees so will drop out in >> >> due course >> >> >> >> but I keep them in my tree until they are merged so I can continue to >> >> soak test them (and have a stable base for my other WIP trees). >> > >> > That's fine just pls don't double-post them on list, certainly >> > not as part of a patchset. >> >> Why not? Is this breaking some tooling? > > Yes patchset breaks git am if you try to apply part of it. > > Reposting creates work for reviewers - why should they have to read the same > patch twice? In this case it also made me scratch my head trying to > figure out what to do about it. 
> > But, if you are careful and maintain an ordered changelog after "---" > and there it says > changes since rfc: > no changes, subject changed > > then this second part is less of a problem Ahh yes, I should have updated to point out I added the extra Fixes line as per the review. I guess you added that in your PR? Anyway it's dropped now your PR has gone in. -- Alex Bennée
Re: [PATCH v2 2/9] block-copy: add missing coroutine_fn annotations
On 11/8/22 19:19, Vladimir Sementsov-Ogievskiy wrote: This is a lot better than our "coroutine_fn" sign, which actually does no check (and can't). Don't you plan to swap the "coroutine_fn" no-op marker for a more meaningful IN_COROUTINE(); (or something like this, which just does assert(qemu_in_coroutine())) at the start of the function? It would be a lot safer. Moreover, we can introduce two macros: IN_COROUTINE() and NOT_COROUTINE(), mark functions correspondingly and drop the coroutine_fn mark. Then the picture would be very deterministic: IN_COROUTINE - function is called only from coroutine context NOT_COROUTINE - function is never called from coroutine context (no marker) - function may be called from both coroutine and non-coroutine context -- Best regards, Vladimir
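A minimal sketch of the proposed macros (hypothetical; qemu_in_coroutine() is stubbed with a plain flag here, whereas in QEMU it inspects the current thread's coroutine state):

```c
#include <assert.h>
#include <stdbool.h>

static bool fake_coroutine_context;   /* test stub for the real lookup */
static bool qemu_in_coroutine(void) { return fake_coroutine_context; }

/* Runtime replacements for the no-op coroutine_fn annotation: instead
 * of documentation only, the context assumption is asserted on entry. */
#define IN_COROUTINE()  assert(qemu_in_coroutine())
#define NOT_COROUTINE() assert(!qemu_in_coroutine())

static int coroutine_only_helper(void)
{
    IN_COROUTINE();   /* aborts fast if called from the wrong context */
    return 42;
}
```

Unlike the annotation, a misplaced call now fails deterministically at the first execution rather than surfacing as a hang or corruption later.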
Re: [PATCH v2 2/9] block-copy: add missing coroutine_fn annotations
[add Stefan] On 11/8/22 18:09, Emanuele Giuseppe Esposito wrote: Am 08/11/2022 um 15:48 schrieb Vladimir Sementsov-Ogievskiy: On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote: These functions end up calling bdrv_common_block_status_above(), a generated_co_wrapper function. generated_co_wrapper is not a coroutine_fn. Conversely, it's a function that does a classic coroutine wrapping - it starts a coroutine and does POLL to wait for the coroutine to finish. In addition, they also happen to be always called in coroutine context, meaning all callers are coroutine_fn. That's also not a reason for marking them coroutine_fn. "coroutine_fn" means that the function can be called only from coroutine context. This means that the g_c_w function will enter the qemu_in_coroutine() case and eventually suspend (or in other words call qemu_coroutine_yield()). Therefore we need to mark such functions coroutine_fn too. I don't think so. Moreover, this breaks the concept, as your new coroutine_fn functions will call generated_co_wrapper functions which are not marked coroutine_fn and never were. Theoretically not, Agree, I was wrong on this point, because marking it coroutine_fn we know that we are going into the if (qemu_in_coroutine()) branch of the g_c_w, so we could directly replace it with the actual function. Because it's a pain to do it by hand, and at some point I/someone should use Alberto's static analyzer to get rid of that, I decided to leave g_c_w there. As I understand it, it seems that you and Paolo have a different understanding of what coroutine_fn means and where it should be used. Looks so... But we have documentation in coroutine.h, let's check: * Mark a function that executes in coroutine context * * Functions that execute in coroutine context cannot be called directly from * normal functions. In the future it would be nice to enable compiler or * static checker support for catching such errors. 
This annotation might make * it possible and in the meantime it serves as documentation. Not very clear, but still it say: coroutine_fn = "function that executes in coroutine context" "functions that execute in coroutine context" = "cannot be called directly from normal functions" So, IMHO that corresponds to my point of view: we shouldn't mark by coroutine_fn functions that can be called from both coroutine context and not. If we want to change the concept, we should start with rewriting this documentation. Honestly I don't understand your point, as you said "coroutine_fn" means that the function can be called only from coroutine context. which is the case for these functions. So could you please clarify? What I do know is that it's extremely confusing to understand if a function that is *not* marked as coroutine_fn is actually being called also from coroutines or not. Which complicates A LOT whatever change or operation I want to perform on the BlockDriver callbacks or any other function in the block layer, because in the current approach for the AioContext lock removal (which you are not aware of, I understand) we need to distinguish between functions running in coroutine context and not, and throwing g_c_w functions everywhere makes my work much harder that it already is. OK, I understand the problem. Hmm. Formally marking by "coroutine_fn" a function that theoretically can be called from any context doesn't break things. We just say that since that moment we don't allow to call this function from non-coroutine context. OK, I tend to agree that you are on the right way, sorry for my skepticism) PS: you recently introduced a lot of IO_CODE() / GLOBAL_STATE_CODE() marks, which (will be/already) turned into assertions. This is a lot better than our "coroutine_fn" sign, which actually do no check (and can't do). 
Don't you plan to swap the "coroutine_fn" noop marker with a more meaningful IN_COROUTINE(); (or something like this, which just does assert(qemu_in_coroutine())) at the start of the function? It would be a lot safer. Thank you, Emanuele Signed-off-by: Emanuele Giuseppe Esposito --- block/block-copy.c | 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/block/block-copy.c b/block/block-copy.c index bb947afdda..f33ab1d0b6 100644 --- a/block/block-copy.c +++ b/block/block-copy.c @@ -577,8 +577,9 @@ static coroutine_fn int block_copy_task_entry(AioTask *task) return ret; } -static int block_copy_block_status(BlockCopyState *s, int64_t offset, - int64_t bytes, int64_t *pnum) +static coroutine_fn int block_copy_block_status(BlockCopyState *s, + int64_t offset, + int64_t bytes, int64_t *pnum) { int64_t num; BlockDriverState *base; @@ -613,8 +614,9 @@ static int block_copy_block_status(BlockCopyState *s, int64_t offset, * Check if the cluster starting at offset
Re: [PATCH v2 1/9] block: call bdrv_co_drain_begin in a coroutine
Am 08/11/2022 um 16:52 schrieb Vladimir Sementsov-Ogievskiy: > On 11/8/22 18:13, Emanuele Giuseppe Esposito wrote: >> >> >> Am 08/11/2022 um 15:33 schrieb Vladimir Sementsov-Ogievskiy: >>> On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote: It seems that bdrv_open_driver() forgot to create a coroutine where to call bs->drv->bdrv_co_drain_begin(), a callback marked as coroutine_fn. Because there is no active I/O at this point, the coroutine should end right after entering, so the caller does not need to poll until it is finished. >>> >>> Hmm. I see your point. But isn't it better to go the generic way and use >>> a generated coroutine wrapper? Nothing guarantees that >>> .bdrv_co_drain_begin() handlers will never do any yield point even on >>> driver open... >>> >>> Look for example at bdrv_co_check(). It has a generated wrapper >>> bdrv_check(), declared in include/block/block-io.h >>> >>> So you just need to declare the wrapper, and use it in >>> bdrv_open_driver(), the code would be clearer too. >> >> I think (and this is a repetition from my previous email) it confuses >> the code. It is clear, but then you don't know if we are in a coroutine >> context or not. > > Hmm. But the same thing is true for all callers of coroutine wrappers. > > I agree that "coroutine wrapper" is a workaround for the design problem. > Really, the fact that in many places we don't know whether we are in coroutine > context or not is very uncomfortable. And the only way to figure out if it's a coroutine or not is by adding assertions and praying that the iotests don't fail *and* cover all cases. > > But still, I'm not sure that it's good to avoid this workaround in one > place and continue to use it in all other places. I think following a > common design is better. Or rework it deeply by getting rid of the > wrappers somehow. 
Well, one step at a time :) it's already difficult to verify that such a replacement covers and is correct for all cases :) > >> >> I am very well aware of what you did with your script, and I am working >> on extending your g_c_w class with g_c_w_simple, where we drop the >> if(qemu_in_coroutine()) case and leave just the coroutine creation. >> Therefore, for coroutine functions we use only the _co_ function, otherwise >> we use g_c_w_simple. >> In this way the code is clear as you point out, but there is no confusion. > > Hmm sounds good, I missed it. Why not use g_c_w_simple here then? Because I came up with it this morning. Thank you, Emanuele > >> >> Thank you, >> Emanuele >>> Signed-off-by: Emanuele Giuseppe Esposito Reviewed-by: Paolo Bonzini --- block.c | 36 +++- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/block.c b/block.c index 5311b21f8e..d2b2800039 100644 --- a/block.c +++ b/block.c @@ -1637,12 +1637,34 @@ out: g_free(gen_node_name); } +typedef struct DrainCo { + BlockDriverState *bs; + int ret; +} DrainCo; + +static void coroutine_fn bdrv_co_drain_begin(void *opaque) +{ + int i; + DrainCo *co = opaque; + BlockDriverState *bs = co->bs; + + for (i = 0; i < bs->quiesce_counter; i++) { + bs->drv->bdrv_co_drain_begin(bs); + } + co->ret = 0; +} + static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv, const char *node_name, QDict *options, int open_flags, Error **errp) { Error *local_err = NULL; - int i, ret; + int ret; + Coroutine *co; + DrainCo dco = { + .bs = bs, + .ret = NOT_DONE, + }; GLOBAL_STATE_CODE(); bdrv_assign_node_name(bs, node_name, &local_err); @@ -1690,10 +1712,14 @@ static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv, assert(bdrv_min_mem_align(bs) != 0); assert(is_power_of_2(bs->bl.request_alignment)); - for (i = 0; i < bs->quiesce_counter; i++) { - if (drv->bdrv_co_drain_begin) { - drv->bdrv_co_drain_begin(bs); - } + if (drv->bdrv_co_drain_begin) { + co = qemu_coroutine_create(bdrv_co_drain_begin, &dco); + 
qemu_coroutine_enter(co); + /* + * There should be no reason for drv->bdrv_co_drain_begin to wait at + * this point, because the device does not have any active I/O. + */ + assert(dco.ret != NOT_DONE); } return 0; >>> >> >
Re: [PATCH v2 1/9] block: call bdrv_co_drain_begin in a coroutine
On 11/8/22 18:13, Emanuele Giuseppe Esposito wrote: Am 08/11/2022 um 15:33 schrieb Vladimir Sementsov-Ogievskiy: On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote: It seems that bdrv_open_driver() forgot to create a coroutine where to call bs->drv->bdrv_co_drain_begin(), a callback marked as coroutine_fn. Because there is no active I/O at this point, the coroutine should end right after entering, so the caller does not need to poll until it is finished. Hmm. I see your point. But isn't it better to go the generic way and use a generated coroutine wrapper? Nothing guarantees that .bdrv_co_drain_begin() handlers will never do any yield point even on driver open... Look for example at bdrv_co_check(). It has a generated wrapper bdrv_check(), declared in include/block/block-io.h So you just need to declare the wrapper, and use it in bdrv_open_driver(), the code would be clearer too. I think (and this is a repetition from my previous email) it confuses the code. It is clear, but then you don't know if we are in a coroutine context or not. Hmm. But the same thing is true for all callers of coroutine wrappers. I agree that "coroutine wrapper" is a workaround for the design problem. Really, the fact that in many places we don't know whether we are in coroutine context or not is very uncomfortable. But still, I'm not sure that it's good to avoid this workaround in one place and continue to use it in all other places. I think following a common design is better. Or rework it deeply by getting rid of the wrappers somehow. I am very well aware of what you did with your script, and I am working on extending your g_c_w class with g_c_w_simple, where we drop the if(qemu_in_coroutine()) case and leave just the coroutine creation. Therefore, for coroutine functions we use only the _co_ function, otherwise we use g_c_w_simple. In this way the code is clear as you point out, but there is no confusion. Hmm sounds good, I missed it. Why not use g_c_w_simple here then? 
Thank you, Emanuele Signed-off-by: Emanuele Giuseppe Esposito Reviewed-by: Paolo Bonzini --- block.c | 36 +++- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/block.c b/block.c index 5311b21f8e..d2b2800039 100644 --- a/block.c +++ b/block.c @@ -1637,12 +1637,34 @@ out: g_free(gen_node_name); } +typedef struct DrainCo { + BlockDriverState *bs; + int ret; +} DrainCo; + +static void coroutine_fn bdrv_co_drain_begin(void *opaque) +{ + int i; + DrainCo *co = opaque; + BlockDriverState *bs = co->bs; + + for (i = 0; i < bs->quiesce_counter; i++) { + bs->drv->bdrv_co_drain_begin(bs); + } + co->ret = 0; +} + static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv, const char *node_name, QDict *options, int open_flags, Error **errp) { Error *local_err = NULL; - int i, ret; + int ret; + Coroutine *co; + DrainCo dco = { + .bs = bs, + .ret = NOT_DONE, + }; GLOBAL_STATE_CODE(); bdrv_assign_node_name(bs, node_name, &local_err); @@ -1690,10 +1712,14 @@ static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv, assert(bdrv_min_mem_align(bs) != 0); assert(is_power_of_2(bs->bl.request_alignment)); - for (i = 0; i < bs->quiesce_counter; i++) { - if (drv->bdrv_co_drain_begin) { - drv->bdrv_co_drain_begin(bs); - } + if (drv->bdrv_co_drain_begin) { + co = qemu_coroutine_create(bdrv_co_drain_begin, &dco); + qemu_coroutine_enter(co); + /* + * There should be no reason for drv->bdrv_co_drain_begin to wait at + * this point, because the device does not have any active I/O. + */ + assert(dco.ret != NOT_DONE); } return 0; -- Best regards, Vladimir
Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start
On Tue, Nov 08, 2022 at 11:21:26AM +, Alex Bennée wrote: > > "Michael S. Tsirkin" writes: > > > On Tue, Nov 08, 2022 at 10:23:15AM +, Alex Bennée wrote: > >> > >> "Michael S. Tsirkin" writes: > >> > >> > On Tue, Nov 08, 2022 at 09:23:04AM +, Alex Bennée wrote: > >> >> The previous fix to virtio_device_started revealed a problem in its > >> >> use by both the core and the device code. The core code should be able > >> >> to handle the device "starting" while the VM isn't running to handle > >> >> the restoration of migration state. To solve this dual use introduce a > >> >> new helper for use by the vhost-user backends who all use it to feed a > >> >> should_start variable. > >> >> > >> >> We can also pick up a change vhost_user_blk_set_status while we are at > >> >> it which follows the same pattern. > >> >> > >> >> Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to > >> >> virtio_device_started) > >> >> Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio > >> >> device) > >> >> Signed-off-by: Alex Bennée > >> >> Cc: "Michael S. Tsirkin" > >> > > >> > why is this in this patchset? > >> > >> As per my cover letter: > >> > >> Most of these patches have been posted before as single patch RFCs. A > >> couple are already scheduled through other trees so will drop out in > >> due course > >> > >> but I keep them in my tree until they are merged so I can continue to > >> soak test them (and have a stable base for my other WIP trees). > > > > That's fine just pls don't double-post them on list, certainly > > not as part of a patchset. > > Why not? Is this breaking some tooling? Yes patchset breaks git am if you try to apply part of it. Reposting creates work for reviewers - why should they have to read the same patch twice? In this case it also made me scratch my head trying to figure out what to do about it. 
But, if you are careful and maintain an ordered changelog after "---" and there it says changes since rfc: no changes, subject changed then this second part is less of a problem > -- > Alex Bennée
Re: [PATCH v2 8/9] block: bdrv_create is never called in non-coroutine context
Am 04.11.2022 um 10:56 hat Emanuele Giuseppe Esposito geschrieben: > Delete the if case and make sure it won't be called again > in coroutines. > > Signed-off-by: Emanuele Giuseppe Esposito > Reviewed-by: Paolo Bonzini In the subject line, it should be "never called in coroutine context" rather than "non-coroutine context", right? Kevin
Re: [PATCH v2 1/9] block: call bdrv_co_drain_begin in a coroutine
Am 08.11.2022 um 15:33 hat Vladimir Sementsov-Ogievskiy geschrieben: > On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote: > > It seems that bdrv_open_driver() forgot to create a coroutine > > where to call bs->drv->bdrv_co_drain_begin(), a callback > > marked as coroutine_fn. > > > > Because there is no active I/O at this point, the coroutine > > should end right after entering, so the caller does not need > > to poll until it is finished. > > Hmm. I see your point. But isn't it better to go the generic way and > use a generated coroutine wrapper? Nothing guarantees that > .bdrv_co_drain_begin() handlers will never do any yield point even on > driver open... > > Look for example at bdrv_co_check(). It has a generated wrapper > bdrv_check(), declared in include/block/block-io.h > > So you just need to declare the wrapper, and use it in > bdrv_open_driver(), the code would be clearer too. Note that if we apply the drain simplification series I sent today up to at least patch 3 ('block: Revert .bdrv_drained_begin/end to non-coroutine_fn') [1], then this patch isn't actually needed any more. Kevin [1] https://lists.gnu.org/archive/html/qemu-block/2022-11/msg00206.html
Re: [PATCH v2 1/9] block: call bdrv_co_drain_begin in a coroutine
Am 08/11/2022 um 15:33 schrieb Vladimir Sementsov-Ogievskiy: > On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote: >> It seems that bdrv_open_driver() forgot to create a coroutine >> where to call bs->drv->bdrv_co_drain_begin(), a callback >> marked as coroutine_fn. >> >> Because there is no active I/O at this point, the coroutine >> should end right after entering, so the caller does not need >> to poll until it is finished. > > Hmm. I see your point. But isn't it better to go the generic way and use > a generated coroutine wrapper? Nothing guarantees that > .bdrv_co_drain_begin() handlers will never do any yield point even on > driver open... > > Look for example at bdrv_co_check(). It has a generated wrapper > bdrv_check(), declared in include/block/block-io.h > > So you just need to declare the wrapper, and use it in > bdrv_open_driver(), the code would be clearer too. I think (and this is a repetition from my previous email) it confuses the code. It is clear, but then you don't know if we are in a coroutine context or not. I am very well aware of what you did with your script, and I am working on extending your g_c_w class with g_c_w_simple, where we drop the if(qemu_in_coroutine()) case and leave just the coroutine creation. Therefore, for coroutine functions we use only the _co_ function, otherwise we use g_c_w_simple. In this way the code is clear as you point out, but there is no confusion. 
Thank you, Emanuele > >> >> Signed-off-by: Emanuele Giuseppe Esposito >> Reviewed-by: Paolo Bonzini >> --- >> block.c | 36 +++- >> 1 file changed, 31 insertions(+), 5 deletions(-) >> >> diff --git a/block.c b/block.c >> index 5311b21f8e..d2b2800039 100644 >> --- a/block.c >> +++ b/block.c >> @@ -1637,12 +1637,34 @@ out: >> g_free(gen_node_name); >> } >> +typedef struct DrainCo { >> + BlockDriverState *bs; >> + int ret; >> +} DrainCo; >> + >> +static void coroutine_fn bdrv_co_drain_begin(void *opaque) >> +{ >> + int i; >> + DrainCo *co = opaque; >> + BlockDriverState *bs = co->bs; >> + >> + for (i = 0; i < bs->quiesce_counter; i++) { >> + bs->drv->bdrv_co_drain_begin(bs); >> + } >> + co->ret = 0; >> +} >> + >> static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv, >> const char *node_name, QDict *options, >> int open_flags, Error **errp) >> { >> Error *local_err = NULL; >> - int i, ret; >> + int ret; >> + Coroutine *co; >> + DrainCo dco = { >> + .bs = bs, >> + .ret = NOT_DONE, >> + }; >> GLOBAL_STATE_CODE(); >> bdrv_assign_node_name(bs, node_name, &local_err); >> @@ -1690,10 +1712,14 @@ static int bdrv_open_driver(BlockDriverState >> *bs, BlockDriver *drv, >> assert(bdrv_min_mem_align(bs) != 0); >> assert(is_power_of_2(bs->bl.request_alignment)); >> - for (i = 0; i < bs->quiesce_counter; i++) { >> - if (drv->bdrv_co_drain_begin) { >> - drv->bdrv_co_drain_begin(bs); >> - } >> + if (drv->bdrv_co_drain_begin) { >> + co = qemu_coroutine_create(bdrv_co_drain_begin, &dco); >> + qemu_coroutine_enter(co); >> + /* >> + * There should be no reason for drv->bdrv_co_drain_begin to >> wait at >> + * this point, because the device does not have any active I/O. >> + */ >> + assert(dco.ret != NOT_DONE); >> } >> return 0; >
Re: [PATCH v2 2/9] block-copy: add missing coroutine_fn annotations
Am 08/11/2022 um 15:48 schrieb Vladimir Sementsov-Ogievskiy: > On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote: >> These functions end up calling bdrv_common_block_status_above(), a >> generated_co_wrapper function. > > generated_co_wrapper is not a coroutine_fn. Conversely, it's a function > that does a classic coroutine wrapping: start a coroutine and poll to > wait for the coroutine to finish. > >> In addition, they also happen to be always called in coroutine context, >> meaning all callers are coroutine_fn. > > That's also not a reason for marking them coroutine_fn. "coroutine_fn" > means that the function can be called only from coroutine context. > >> This means that the g_c_w function will enter the qemu_in_coroutine() >> case and eventually suspend (or in other words call >> qemu_coroutine_yield()). >> Therefore we need to mark such functions coroutine_fn too. > > I don't think so. Moreover, this breaks the concept, as your new > coroutine_fn functions will call generated_co_wrapper functions which > are not marked coroutine_fn and never were. Theoretically not, because marking it coroutine_fn we know that we are going into the if(qemu_in_coroutine()) branch of the g_c_w, so we could directly replace it with the actual function. Because it's a pain to do by hand, and at some point I/someone should use Alberto's static analyzer to get rid of that, I decided to leave g_c_w there. As I understand it, it seems that you and Paolo have a different understanding of what coroutine_fn means and where it should be used. Honestly I don't understand your point, as you said > "coroutine_fn" > means that the function can be called only from coroutine context. which is the case for these functions. So could you please clarify? What I do know is that it's extremely confusing to understand if a function that is *not* marked as coroutine_fn is actually being called also from coroutines or not. 
Which complicates A LOT whatever change or operation I want to perform on the BlockDriver callbacks or any other function in the block layer, because in the current approach for the AioContext lock removal (which you are not aware of, I understand) we need to distinguish between functions running in coroutine context and not, and throwing g_c_w functions everywhere makes my work much harder than it already is. Thank you, Emanuele > >> >> Signed-off-by: Emanuele Giuseppe Esposito >> --- >> block/block-copy.c | 15 +-- >> 1 file changed, 9 insertions(+), 6 deletions(-) >> >> diff --git a/block/block-copy.c b/block/block-copy.c >> index bb947afdda..f33ab1d0b6 100644 >> --- a/block/block-copy.c >> +++ b/block/block-copy.c >> @@ -577,8 +577,9 @@ static coroutine_fn int >> block_copy_task_entry(AioTask *task) >> return ret; >> } >> -static int block_copy_block_status(BlockCopyState *s, int64_t offset, >> - int64_t bytes, int64_t *pnum) >> +static coroutine_fn int block_copy_block_status(BlockCopyState *s, >> + int64_t offset, >> + int64_t bytes, >> int64_t *pnum) >> { >> int64_t num; >> BlockDriverState *base; >> @@ -613,8 +614,9 @@ static int block_copy_block_status(BlockCopyState >> *s, int64_t offset, >> * Check if the cluster starting at offset is allocated or not. >> * return via pnum the number of contiguous clusters sharing this >> allocation. 
>> */ >> -int64_t block_copy_reset_unallocated(BlockCopyState *s, >> - int64_t offset, int64_t *count) >> +int64_t coroutine_fn block_copy_reset_unallocated(BlockCopyState *s, >> + int64_t offset, >> + int64_t *count) >> { >> int ret; >> int64_t clusters, bytes; >
Re: [PATCH v2 5/9] block: distinguish between bdrv_create running in coroutine and not
On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote: Call two different functions depending on whether bdrv_create is in coroutine or not, following the same pattern as generated_co_wrapper functions. This also allows calling the coroutine function directly, without using CreateCo or relying on bdrv_create(). Can we move to auto-generation of bdrv_create(), like for bdrv_check() and friends? Signed-off-by: Emanuele Giuseppe Esposito --- block.c | 74 - 1 file changed, 36 insertions(+), 38 deletions(-) diff --git a/block.c b/block.c index d2b2800039..0823563e4d 100644 --- a/block.c +++ b/block.c @@ -522,66 +522,64 @@ typedef struct CreateCo { Error *err; } CreateCo; -static void coroutine_fn bdrv_create_co_entry(void *opaque) +static int coroutine_fn bdrv_co_create(BlockDriver *drv, const char *filename, + QemuOpts *opts, Error **errp) { -Error *local_err = NULL; int ret; +GLOBAL_STATE_CODE(); +assert(*errp == NULL); + +if (!drv->bdrv_co_create_opts) { +error_setg(errp, "Driver '%s' does not support image creation", + drv->format_name); +return -ENOTSUP; +} + +ret = drv->bdrv_co_create_opts(drv, filename, opts, errp); +if (ret < 0 && !*errp) { +error_setg_errno(errp, -ret, "Could not create image"); +} + +return ret; +} + +static void coroutine_fn bdrv_create_co_entry(void *opaque) +{ CreateCo *cco = opaque; -assert(cco->drv); GLOBAL_STATE_CODE(); +assert(cco->drv); -ret = cco->drv->bdrv_co_create_opts(cco->drv, -cco->filename, cco->opts, &local_err); -error_propagate(&cco->err, local_err); -cco->ret = ret; +cco->ret = bdrv_co_create(cco->drv, cco->filename, cco->opts, &cco->err); } int bdrv_create(BlockDriver *drv, const char* filename, QemuOpts *opts, Error **errp) { -int ret; - GLOBAL_STATE_CODE(); -Coroutine *co; -CreateCo cco = { -.drv = drv, -.filename = g_strdup(filename), -.opts = opts, -.ret = NOT_DONE, -.err = NULL, -}; - -if (!drv->bdrv_co_create_opts) { -error_setg(errp, "Driver '%s' does not support image creation", drv->format_name); -ret = -ENOTSUP; -goto out; -} - if 
(qemu_in_coroutine()) { /* Fast-path if already in coroutine context */ -bdrv_create_co_entry(&cco); +return bdrv_co_create(drv, filename, opts, errp); } else { +Coroutine *co; +CreateCo cco = { +.drv = drv, +.filename = g_strdup(filename), +.opts = opts, +.ret = NOT_DONE, +.err = NULL, +}; + co = qemu_coroutine_create(bdrv_create_co_entry, &cco); qemu_coroutine_enter(co); while (cco.ret == NOT_DONE) { aio_poll(qemu_get_aio_context(), true); } +error_propagate(errp, cco.err); +g_free(cco.filename); +return cco.ret; } - -ret = cco.ret; -if (ret < 0) { -if (cco.err) { -error_propagate(errp, cco.err); -} else { -error_setg_errno(errp, -ret, "Could not create image"); -} -} - -out: -g_free(cco.filename); -return ret; } /** -- Best regards, Vladimir
Re: [PATCH v3 4/4] hw/nvme: add polling support
On Fri, Nov 04, 2022 at 07:32:12AM +0100, Klaus Jensen wrote: > On Nov 3 21:19, Jinhao Fan wrote: > > On 11/3/2022 8:10 PM, Klaus Jensen wrote: > > > I agree that the spec is a little unclear on this point. In any case, in > > > Linux, when the driver has decided that the sq tail must be updated, > > > it will use this check: > > > > > >(new_idx - event_idx - 1) < (new_idx - old) > > > > When eventidx is already behind, it's like: > > > > 0 > > 1 <- event_idx > > 2 <- old > > 3 <- new_idx > > 4 > > . > > . > > . > > > > In this case, (new_idx - event_idx - 1) = 3-1-1 = 1 >= (new_idx - old) = > > 3-2=1, so the host won't update sq tail. Where am I wrong in this example? > > That becomes 1 >= 1, i.e. "true". So this will result in the driver > doing an mmio doorbell write. The code is: static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old) { return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old); } which per the above is "return 1 < 1;", or false. So the above case does *not* do an mmio write. No? regards john
Re: [PATCH v2 3/9] nbd/server.c: add missing coroutine_fn annotations
On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote: These functions end up calling bdrv_*() implemented as generated_co_wrapper functions. Same here. Sorry that I joined only on v3. In the past we had a lot of "coroutine wrappers": each IO function in block/io.c and many in block.c had two variants: coroutine_fn bdrv_co_foo(...) and a wrapper bdrv_foo(...) And that wrapper is not a coroutine_fn, it's for calling from any context: coroutine or not. Now many of these wrappers are auto-generated, you may find them in build/block/block-gen.c after a successful make. "generated_co_wrapper" is a sign for the code generation script to generate the wrapper code. In addition, they also happen to be always called in coroutine context, meaning all callers are coroutine_fn. This means that the g_c_w function will enter the qemu_in_coroutine() case and eventually suspend (or in other words call qemu_coroutine_yield()). Therefore we need to mark such functions coroutine_fn too. Signed-off-by: Emanuele Giuseppe Esposito --- nbd/server.c | 21 - 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/nbd/server.c b/nbd/server.c index ada16089f3..e2eec0cf46 100644 --- a/nbd/server.c +++ b/nbd/server.c @@ -2141,8 +2141,9 @@ static int nbd_extent_array_add(NBDExtentArray *ea, return 0; } -static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset, - uint64_t bytes, NBDExtentArray *ea) +static int coroutine_fn blockstatus_to_extents(BlockDriverState *bs, + uint64_t offset, uint64_t bytes, + NBDExtentArray *ea) { while (bytes) { uint32_t flags; @@ -2168,8 +2169,9 @@ static int blockstatus_to_extents(BlockDriverState *bs, uint64_t offset, return 0; } -static int blockalloc_to_extents(BlockDriverState *bs, uint64_t offset, - uint64_t bytes, NBDExtentArray *ea) +static int coroutine_fn blockalloc_to_extents(BlockDriverState *bs, + uint64_t offset, uint64_t bytes, + NBDExtentArray *ea) { while (bytes) { int64_t num; @@ -2220,11 +,12 @@ static int nbd_co_send_extents(NBDClient 
*client, uint64_t handle, } /* Get block status from the exported device and send it to the client */ -static int nbd_co_send_block_status(NBDClient *client, uint64_t handle, -BlockDriverState *bs, uint64_t offset, -uint32_t length, bool dont_fragment, -bool last, uint32_t context_id, -Error **errp) +static int +coroutine_fn nbd_co_send_block_status(NBDClient *client, uint64_t handle, + BlockDriverState *bs, uint64_t offset, + uint32_t length, bool dont_fragment, + bool last, uint32_t context_id, + Error **errp) { int ret; unsigned int nb_extents = dont_fragment ? 1 : NBD_MAX_BLOCK_STATUS_EXTENTS; -- Best regards, Vladimir
Re: [PATCH v2 2/9] block-copy: add missing coroutine_fn annotations
On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote: These functions end up calling bdrv_common_block_status_above(), a generated_co_wrapper function. generated_co_wrapper is not a coroutine_fn. Conversely, it's a function that does a classic coroutine wrapping: start a coroutine and poll to wait for the coroutine to finish. In addition, they also happen to be always called in coroutine context, meaning all callers are coroutine_fn. That's also not a reason for marking them coroutine_fn. "coroutine_fn" means that the function can be called only from coroutine context. This means that the g_c_w function will enter the qemu_in_coroutine() case and eventually suspend (or in other words call qemu_coroutine_yield()). Therefore we need to mark such functions coroutine_fn too. I don't think so. Moreover, this breaks the concept, as your new coroutine_fn functions will call generated_co_wrapper functions which are not marked coroutine_fn and never were. Signed-off-by: Emanuele Giuseppe Esposito --- block/block-copy.c | 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/block/block-copy.c b/block/block-copy.c index bb947afdda..f33ab1d0b6 100644 --- a/block/block-copy.c +++ b/block/block-copy.c @@ -577,8 +577,9 @@ static coroutine_fn int block_copy_task_entry(AioTask *task) return ret; } -static int block_copy_block_status(BlockCopyState *s, int64_t offset, - int64_t bytes, int64_t *pnum) +static coroutine_fn int block_copy_block_status(BlockCopyState *s, +int64_t offset, +int64_t bytes, int64_t *pnum) { int64_t num; BlockDriverState *base; @@ -613,8 +614,9 @@ static int block_copy_block_status(BlockCopyState *s, int64_t offset, * Check if the cluster starting at offset is allocated or not. * return via pnum the number of contiguous clusters sharing this allocation. 
*/ -static int block_copy_is_cluster_allocated(BlockCopyState *s, int64_t offset, - int64_t *pnum) +static int coroutine_fn block_copy_is_cluster_allocated(BlockCopyState *s, +int64_t offset, +int64_t *pnum) { BlockDriverState *bs = s->source->bs; int64_t count, total_count = 0; @@ -669,8 +671,9 @@ void block_copy_reset(BlockCopyState *s, int64_t offset, int64_t bytes) * @return 0 when the cluster at @offset was unallocated, * 1 otherwise, and -ret on error. */ -int64_t block_copy_reset_unallocated(BlockCopyState *s, - int64_t offset, int64_t *count) +int64_t coroutine_fn block_copy_reset_unallocated(BlockCopyState *s, + int64_t offset, + int64_t *count) { int ret; int64_t clusters, bytes; -- Best regards, Vladimir
[PATCH for-7.2] block/blkio: Set BlockDriver::has_variable_length to false
Setting it to true can cause the device size to be queried from libblkio in otherwise fast paths, degrading performance. Set it to false and require users to refresh the device size explicitly instead. Fixes: 4c8f4fda0504 ("block/blkio: Tolerate device size changes") Suggested-by: Kevin Wolf Signed-off-by: Alberto Faria --- block/blkio.c | 1 - 1 file changed, 1 deletion(-) diff --git a/block/blkio.c b/block/blkio.c index 620fab28a7..5eae3adfaf 100644 --- a/block/blkio.c +++ b/block/blkio.c @@ -993,7 +993,6 @@ static void blkio_refresh_limits(BlockDriverState *bs, Error **errp) { \ .format_name = name, \ .protocol_name = name, \ -.has_variable_length = true, \ .instance_size = sizeof(BDRVBlkioState), \ .bdrv_file_open = blkio_file_open, \ .bdrv_close = blkio_close, \ -- 2.38.1
Re: [PATCH v2 1/9] block: call bdrv_co_drain_begin in a coroutine
On 11/4/22 12:56, Emanuele Giuseppe Esposito wrote: It seems that bdrv_open_driver() forgot to create a coroutine where to call bs->drv->bdrv_co_drain_begin(), a callback marked as coroutine_fn. Because there is no active I/O at this point, the coroutine should end right after entering, so the caller does not need to poll until it is finished. Hmm. I see your point. But isn't it better to go the generic way and use a generated coroutine wrapper? Nothing guarantees that .bdrv_co_drain_begin() handlers will never do any yield point even on driver open... Look for example at bdrv_co_check(). It has a generated wrapper bdrv_check(), declared in include/block/block-io.h So you just need to declare the wrapper, and use it in bdrv_open_driver(), the code would be clearer too. Signed-off-by: Emanuele Giuseppe Esposito Reviewed-by: Paolo Bonzini --- block.c | 36 +++- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/block.c b/block.c index 5311b21f8e..d2b2800039 100644 --- a/block.c +++ b/block.c @@ -1637,12 +1637,34 @@ out: g_free(gen_node_name); } +typedef struct DrainCo { +BlockDriverState *bs; +int ret; +} DrainCo; + +static void coroutine_fn bdrv_co_drain_begin(void *opaque) +{ +int i; +DrainCo *co = opaque; +BlockDriverState *bs = co->bs; + +for (i = 0; i < bs->quiesce_counter; i++) { +bs->drv->bdrv_co_drain_begin(bs); +} +co->ret = 0; +} + static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv, const char *node_name, QDict *options, int open_flags, Error **errp) { Error *local_err = NULL; -int i, ret; +int ret; +Coroutine *co; +DrainCo dco = { +.bs = bs, +.ret = NOT_DONE, +}; GLOBAL_STATE_CODE(); bdrv_assign_node_name(bs, node_name, &local_err); @@ -1690,10 +1712,14 @@ static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv, assert(bdrv_min_mem_align(bs) != 0); assert(is_power_of_2(bs->bl.request_alignment)); -for (i = 0; i < bs->quiesce_counter; i++) { -if (drv->bdrv_co_drain_begin) { -drv->bdrv_co_drain_begin(bs); -} +if 
(drv->bdrv_co_drain_begin) { +co = qemu_coroutine_create(bdrv_co_drain_begin, &dco); +qemu_coroutine_enter(co); +/* + * There should be no reason for drv->bdrv_co_drain_begin to wait at + * this point, because the device does not have any active I/O. + */ +assert(dco.ret != NOT_DONE); } return 0; -- Best regards, Vladimir
[PATCH] qapi/block-core: Fix BlockdevOptionsNvmeIoUring @path description
The nvme-io_uring BlockDriver's path option must point at the character device of an NVMe namespace, not at an image file. Fixes: fd66dbd424f5 ("blkio: add libblkio block driver") Suggested-by: Stefano Garzarella Signed-off-by: Alberto Faria --- qapi/block-core.json | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index 6d904004f8..95ac4fa634 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -3704,7 +3704,7 @@ # # Driver specific block device options for the nvme-io_uring backend. # -# @path: path to the image file +# @path: path to the NVMe namespace's character device (e.g. /dev/ng0n1). # # Since: 7.2 ## -- 2.38.1
Re: [PATCH v2 0/3] block: Start/end drain on correct AioContext
Am 08.11.2022 um 15:13 hat Kevin Wolf geschrieben: > Am 07.11.2022 um 16:13 hat Hanna Reitz geschrieben: > > Hi, > > > > v1 cover letter: > > https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00389.html > > > > bdrv_replace_child_noperm() drains the child via > > bdrv_parent_drained_{begin,end}_single(). When it removes a child, the > > bdrv_parent_drained_end_single() at its end will be called on an empty > > child, making the BDRV_POLL_WHILE() in it poll the main AioContext > > (because c->bs is NULL). > > > > That’s wrong, though, because it’s supposed to operate on the parent. > > bdrv_parent_drained_end_single_no_poll() will have scheduled any BHs in > > the parents’ AioContext, which may be anything, not necessarily the main > > context. Therefore, we must poll the parent’s context. > > > > Patch 3 does this for both bdrv_parent_drained_{begin,end}_single(). > > Patch 1 ensures that we can legally call > > bdrv_child_get_parent_aio_context() from those I/O context functions, > > and patch 2 fixes blk_do_set_aio_context() to not cause an assertion > > failure if it beginning a drain can end up in blk_get_aio_context() > > before blk->ctx has been updated. > > Hmm, I may have unintentionally made this series obsolete with the drain > series I sent today. The poll instances that you're fixing simply don't > exist any more after it. > > Can you check if the bug is gone with my series? I would hope so, but if > not, we would probably need to apply a fix in a different place. Actually, on second thoughts, we'd probably still apply your patches as a fix for 7.2 and then have my patches which would get rid of the polling only in block-next. Patch 1 and 2 of this series would stay in the tree, and that seems to make sense to me, too. So obsolete was not the right word. But it would still be interesting to see if my series fixes the bug, too, because otherwise it might introduce a regression later. By the way, is the bug hard to test in a test case? 
If this series had one, I could just have run it myself against my tree. Kevin
Re: [PATCH v3 09/30] nbd/server: Clean up abuse of BlockExportOptionsNbd member @arg
On 11/4/22 19:06, Markus Armbruster wrote: block-export-add argument @name defaults to the value of argument @node-name. nbd_export_create() implements this by copying @node_name to @name. It leaves @has_node_name false, violating the "has_node_name == !!node_name" invariant. Unclean. Falls apart when we elide @has_node_name (next commit): then QAPI frees the same value twice, once for @node_name and once for @name. iotest 307 duly explodes. Goes back to commit c62d24e906 "blockdev-nbd: Boxed argument type for nbd-server-add" (v5.0.0). Got moved from qmp_nbd_server_add() to nbd_export_create() (commit 56ee86261e), then copied back (commit b6076afcab). Commit 8675cbd68b "nbd: Utilize QAPI_CLONE for type conversion" (v5.2.0) cleaned up the copy in qmp_nbd_server_add() noting Second, our assignment to arg->name is fishy: the generated QAPI code for qapi_free_NbdServerAddOptions does not visit arg->name if arg->has_name is false, but if it DID visit it, we would have introduced a double-free situation when arg is finally freed. Exactly. However, the copy in nbd_export_create() remained dirty. Clean it up. Since the value stored in member @name is not actually used outside this function, use a local variable instead of modifying the QAPI object. Signed-off-by: Markus Armbruster Cc: Eric Blake Cc: Vladimir Sementsov-Ogievskiy Cc: qemu-block@nongnu.org Reviewed-by: Vladimir Sementsov-Ogievskiy -- Best regards, Vladimir
Re: [PATCH v2 0/3] block: Start/end drain on correct AioContext
Am 07.11.2022 um 16:13 hat Hanna Reitz geschrieben: > Hi, > > v1 cover letter: > https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00389.html > > bdrv_replace_child_noperm() drains the child via > bdrv_parent_drained_{begin,end}_single(). When it removes a child, the > bdrv_parent_drained_end_single() at its end will be called on an empty > child, making the BDRV_POLL_WHILE() in it poll the main AioContext > (because c->bs is NULL). > > That’s wrong, though, because it’s supposed to operate on the parent. > bdrv_parent_drained_end_single_no_poll() will have scheduled any BHs in > the parents’ AioContext, which may be anything, not necessarily the main > context. Therefore, we must poll the parent’s context. > > Patch 3 does this for both bdrv_parent_drained_{begin,end}_single(). > Patch 1 ensures that we can legally call > bdrv_child_get_parent_aio_context() from those I/O context functions, > and patch 2 fixes blk_do_set_aio_context() to not cause an assertion > failure if it beginning a drain can end up in blk_get_aio_context() > before blk->ctx has been updated. Hmm, I may have unintentionally made this series obsolete with the drain series I sent today. The poll instances that you're fixing simply don't exist any more after it. Can you check if the bug is gone with my series? I would hope so, but if not, we would probably need to apply a fix in a different place. Kevin
Re: [PATCH v3 4/4] hw/nvme: add polling support
On Nov 8 12:39, John Levon wrote: > On Fri, Nov 04, 2022 at 07:32:12AM +0100, Klaus Jensen wrote: > > > On Nov 3 21:19, Jinhao Fan wrote: > > > On 11/3/2022 8:10 PM, Klaus Jensen wrote: > > > > I agree that the spec is a little unclear on this point. In any case, in > > > > Linux, when the driver has decided that the sq tail must be updated, > > > > it will use this check: > > > > > > > >(new_idx - event_idx - 1) < (new_idx - old) > > > > > > When eventidx is already behind, it's like: > > > > > > 0 > > > 1 <- event_idx > > > 2 <- old > > > 3 <- new_idx > > > 4 > > > . > > > . > > > . > > > > > > In this case, (new_idx - event_idx - 1) = 3-1-1 = 1 >= (new_idx - old) = > > > 3-2=1, so the host won't update sq tail. Where am I wrong in this example? > > > > That becomes 1 >= 1, i.e. "true". So this will result in the driver > > doing an mmio doorbell write. > > The code is: > > static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old) > > { > > return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old); > > } > > > which per the above is "return 1 < 1;", or false. So the above case does *not* > do an mmio write. No? > Whelp. Looks like I'm in the wrong here, apologies!
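The disputed arithmetic is easy to check by compiling the helper quoted above (here with stdint.h types standing in for the kernel's u16):

```c
#include <stdint.h>

/* Same check as the Linux driver's nvme_dbbuf_need_event(): returns non-zero
 * iff event_idx sits in the half-open u16 window (old, new_idx], i.e. the
 * device asked to be notified for one of the entries just written. */
static int nvme_dbbuf_need_event(uint16_t event_idx, uint16_t new_idx,
                                 uint16_t old)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}
```

With event_idx = 1, old = 2, new_idx = 3 (eventidx already behind), both sides evaluate to 1, so the comparison is false and no doorbell write is issued, which matches John's reading.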
[PATCH 05/13] block: Inline bdrv_drain_invoke()
bdrv_drain_invoke() now has two entirely separate cases that share no code any more and are selected depending on a bool parameter. Each case has only one caller. Just inline the function. Signed-off-by: Kevin Wolf --- block/io.c | 23 ++- 1 file changed, 6 insertions(+), 17 deletions(-) diff --git a/block/io.c b/block/io.c index 41e6121c31..c520183fb7 100644 --- a/block/io.c +++ b/block/io.c @@ -241,21 +241,6 @@ typedef struct { bool ignore_bds_parents; } BdrvCoDrainData; -/* Recursively call BlockDriver.bdrv_drain_begin/end callbacks */ -static void bdrv_drain_invoke(BlockDriverState *bs, bool begin) -{ -if (!bs->drv || (begin && !bs->drv->bdrv_drain_begin) || -(!begin && !bs->drv->bdrv_drain_end)) { -return; -} - -if (begin) { -bs->drv->bdrv_drain_begin(bs); -} else { -bs->drv->bdrv_drain_end(bs); -} -} - /* Returns true if BDRV_POLL_WHILE() should go into a blocking aio_poll() */ bool bdrv_drain_poll(BlockDriverState *bs, bool recursive, BdrvChild *ignore_parent, bool ignore_bds_parents) @@ -389,7 +374,9 @@ void bdrv_do_drained_begin_quiesce(BlockDriverState *bs, } bdrv_parent_drained_begin(bs, parent, ignore_bds_parents); -bdrv_drain_invoke(bs, true); +if (bs->drv && bs->drv->bdrv_drain_begin) { +bs->drv->bdrv_drain_begin(bs); +} } static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive, @@ -460,7 +447,9 @@ static void bdrv_do_drained_end(BlockDriverState *bs, bool recursive, assert(bs->quiesce_counter > 0); /* Re-enable things in child-to-parent order */ -bdrv_drain_invoke(bs, false); +if (bs->drv && bs->drv->bdrv_drain_end) { +bs->drv->bdrv_drain_end(bs); +} bdrv_parent_drained_end(bs, parent, ignore_bds_parents); old_quiesce_counter = qatomic_fetch_dec(&bs->quiesce_counter); -- 2.38.1
[PATCH 12/13] block: Don't poll in bdrv_replace_child_noperm()
In order to make sure that bdrv_replace_child_noperm() doesn't have to poll any more, get rid of the bdrv_parent_drained_begin_single() call. This is possible now because we can require that the child is already drained when the function is called (it better be, having in-flight requests while modifying the graph isn't going to end well!) and we don't call the parent drain callbacks more than once. The additional drain calls needed in callers cause the test case to run its code in the drain handler too early (bdrv_attach_child() drains now), so modify it to only enable the code after the test setup has completed. Signed-off-by: Kevin Wolf --- include/block/block-io.h | 8 block.c | 72 +--- block/io.c | 2 +- tests/unit/test-bdrv-drain.c | 10 + 4 files changed, 70 insertions(+), 22 deletions(-) diff --git a/include/block/block-io.h b/include/block/block-io.h index 5b54ed4672..ddce8550a9 100644 --- a/include/block/block-io.h +++ b/include/block/block-io.h @@ -290,6 +290,14 @@ bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos); */ void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll); +/** + * bdrv_parent_drained_poll_single: + * + * Returns true if there is any pending activity to cease before @c can be + * called quiesced, false otherwise. + */ +bool bdrv_parent_drained_poll_single(BdrvChild *c); + /** * bdrv_parent_drained_end_single: * diff --git a/block.c b/block.c index 5f5f79cd16..12039e9b8a 100644 --- a/block.c +++ b/block.c @@ -2399,6 +2399,20 @@ static void bdrv_replace_child_abort(void *opaque) GLOBAL_STATE_CODE(); /* old_bs reference is transparently moved from @s to @s->child */ +if (!s->child->bs) { +/* + * The parents were undrained when removing old_bs from the child. New + * requests can't have been made, though, because the child was empty. + * + * TODO Make bdrv_replace_child_noperm() transactionable to avoid + * undraining the parent in the first place. 
Once this is done, having + * new_bs drained when calling bdrv_replace_child_tran() is not a + * requirement any more. + */ +bdrv_parent_drained_begin_single(s->child, false); +assert(!bdrv_parent_drained_poll_single(s->child)); +} +assert(s->child->parent_quiesce_counter); bdrv_replace_child_noperm(s->child, s->old_bs); bdrv_unref(new_bs); } @@ -2414,12 +2428,20 @@ static TransactionActionDrv bdrv_replace_child_drv = { * * Note: real unref of old_bs is done only on commit. * + * Both child and new_bs (if non-NULL) must be drained. new_bs must be kept + * drained until the transaction is completed (this automatically implies that + * child remains drained, too). + * * The function doesn't update permissions, caller is responsible for this. */ static void bdrv_replace_child_tran(BdrvChild *child, BlockDriverState *new_bs, Transaction *tran) { BdrvReplaceChildState *s = g_new(BdrvReplaceChildState, 1); + +assert(child->parent_quiesce_counter); +assert(!new_bs || new_bs->quiesce_counter); + *s = (BdrvReplaceChildState) { .child = child, .old_bs = child->bs, @@ -2818,6 +2840,12 @@ static void bdrv_replace_child_noperm(BdrvChild *child, int new_bs_quiesce_counter; assert(!child->frozen); +/* + * When removing the child, it's the callers responsibility to make sure + * that no requests are in flight any more. Usually the parent is drained, + * but not through child->parent_quiesce_counter. + */ +assert(!new_bs || child->parent_quiesce_counter); assert(old_bs != new_bs); GLOBAL_STATE_CODE(); @@ -2825,16 +2853,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child, assert(bdrv_get_aio_context(old_bs) == bdrv_get_aio_context(new_bs)); } -new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0); - -/* - * If the new child node is drained but the old one was not, flush - * all outstanding requests to the old child node. 
- */ -if (new_bs_quiesce_counter && !child->parent_quiesce_counter) { -bdrv_parent_drained_begin_single(child, true); -} - if (old_bs) { if (child->klass->detach) { child->klass->detach(child); } @@ -2849,14 +2867,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child, assert_bdrv_graph_writable(new_bs); QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent); -/* - * Polling in bdrv_parent_drained_begin_single() may have led to the new - * node's quiesce_counter having been decreased. Not a problem, we just - * need to recognize this here and then invoke drained_end appropriately - * more often. - */ -assert(new_bs->quiesce_counter <= new_bs_quiesce_counter); - if
[PATCH 09/13] block: Remove subtree drains
Subtree drains are not used any more. Remove them. After this, BdrvChildClass.attach/detach() don't poll any more. Signed-off-by: Kevin Wolf --- include/block/block-io.h | 18 +-- include/block/block_int-common.h | 1 - include/block/block_int-io.h | 12 -- block.c | 20 +-- block/io.c | 121 +++--- tests/unit/test-bdrv-drain.c | 261 ++- 6 files changed, 44 insertions(+), 389 deletions(-) diff --git a/include/block/block-io.h b/include/block/block-io.h index 97e9ae8bee..c35cb1e53f 100644 --- a/include/block/block-io.h +++ b/include/block/block-io.h @@ -303,8 +303,7 @@ void bdrv_parent_drained_end_single(BdrvChild *c); /** * bdrv_drain_poll: * - * Poll for pending requests in @bs, its parents (except for @ignore_parent), - * and if @recursive is true its children as well (used for subtree drain). + * Poll for pending requests in @bs and its parents (except for @ignore_parent). * * If @ignore_bds_parents is true, parents that are BlockDriverStates must * ignore the drain request because they will be drained separately (used for @@ -312,8 +311,8 @@ void bdrv_parent_drained_end_single(BdrvChild *c); * * This is part of bdrv_drained_begin. */ -bool bdrv_drain_poll(BlockDriverState *bs, bool recursive, - BdrvChild *ignore_parent, bool ignore_bds_parents); +bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent, + bool ignore_bds_parents); /** * bdrv_drained_begin: @@ -334,12 +333,6 @@ void bdrv_drained_begin(BlockDriverState *bs); void bdrv_do_drained_begin_quiesce(BlockDriverState *bs, BdrvChild *parent, bool ignore_bds_parents); -/** - * Like bdrv_drained_begin, but recursively begins a quiesced section for - * exclusive access to all child nodes as well. - */ -void bdrv_subtree_drained_begin(BlockDriverState *bs); - /** * bdrv_drained_end: * @@ -353,9 +346,4 @@ void bdrv_subtree_drained_begin(BlockDriverState *bs); */ void bdrv_drained_end(BlockDriverState *bs); -/** - * End a quiescent section started by bdrv_subtree_drained_begin(). 
- */ -void bdrv_subtree_drained_end(BlockDriverState *bs); - #endif /* BLOCK_IO_H */ diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h index 6504db4fd9..65ee5fcbec 100644 --- a/include/block/block_int-common.h +++ b/include/block/block_int-common.h @@ -1184,7 +1184,6 @@ struct BlockDriverState { /* Accessed with atomic ops. */ int quiesce_counter; -int recursive_quiesce_counter; unsigned int write_gen; /* Current data generation */ diff --git a/include/block/block_int-io.h b/include/block/block_int-io.h index 4b0b3e17ef..8bc061ebb8 100644 --- a/include/block/block_int-io.h +++ b/include/block/block_int-io.h @@ -179,16 +179,4 @@ void bdrv_bsc_invalidate_range(BlockDriverState *bs, */ void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes); - -/* - * "I/O or GS" API functions. These functions can run without - * the BQL, but only in one specific iothread/main loop. - * - * See include/block/block-io.h for more information about - * the "I/O or GS" API. 
- */ - -void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent); -void bdrv_unapply_subtree_drain(BdrvChild *child, BlockDriverState *old_parent); - #endif /* BLOCK_INT_IO_H */ diff --git a/block.c b/block.c index 43b893dd6c..9d082631d9 100644 --- a/block.c +++ b/block.c @@ -1224,7 +1224,7 @@ static void bdrv_child_cb_drained_begin(BdrvChild *child) static bool bdrv_child_cb_drained_poll(BdrvChild *child) { BlockDriverState *bs = child->opaque; -return bdrv_drain_poll(bs, false, NULL, false); +return bdrv_drain_poll(bs, NULL, false); } static void bdrv_child_cb_drained_end(BdrvChild *child) @@ -1474,8 +1474,6 @@ static void bdrv_child_cb_attach(BdrvChild *child) assert(!bs->file); bs->file = child; } - -bdrv_apply_subtree_drain(child, bs); } static void bdrv_child_cb_detach(BdrvChild *child) @@ -1486,8 +1484,6 @@ static void bdrv_child_cb_detach(BdrvChild *child) bdrv_backing_detach(child); } -bdrv_unapply_subtree_drain(child, bs); - assert_bdrv_graph_writable(bs); QLIST_REMOVE(child, next); if (child == bs->backing) { @@ -2843,9 +2839,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child, } if (old_bs) { -/* Detach first so that the recursive drain sections coming from @child - * are already gone and we only end the drain sections that came from - * elsewhere. */ if (child->klass->detach) { child->klass->detach(child); } @@ -2860,17 +2853,14 @@ static void bdrv_replace_child_noperm(BdrvChild *child, QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent); /* - * Detaching the old node may have led to the
[PATCH 07/13] block: Don't use subtree drains in bdrv_drop_intermediate()
Instead of using a subtree drain from the top node (which also drains child nodes of base that we're not even interested in), use a normal drain for base, which automatically drains all of the parents, too. Signed-off-by: Kevin Wolf --- block.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/block.c b/block.c index 5828b970e4..2f6b25875f 100644 --- a/block.c +++ b/block.c @@ -5581,7 +5581,7 @@ int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base, GLOBAL_STATE_CODE(); bdrv_ref(top); -bdrv_subtree_drained_begin(top); +bdrv_drained_begin(base); if (!top->drv || !base->drv) { goto exit; @@ -5654,7 +5654,7 @@ int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base, ret = 0; exit: -bdrv_subtree_drained_end(top); +bdrv_drained_end(base); bdrv_unref(top); return ret; } -- 2.38.1
[PATCH 10/13] block: Call drain callbacks only once
We only need to call both the BlockDriver's callback and the parent callbacks when going from undrained to drained or vice versa. A second drain section doesn't make a difference for the driver or the parent, they weren't supposed to send new requests before and after the second drain. One thing that gets in the way is the 'ignore_bds_parents' parameter in bdrv_do_drained_begin_quiesce() and bdrv_do_drained_end(): If it is true for the first drain, bs->quiesce_counter will be non-zero, but the parent callbacks still haven't been called, so a second drain where it is false would still have to call them. Instead of keeping track of this, let's just get rid of the parameter. It was introduced in commit 6cd5c9d7b2d as an optimisation so that during bdrv_drain_all(), we wouldn't recursively drain all parents up to the root for each node, resulting in quadratic complexity. As it happens, calling the callbacks only once solves the same problem, so as of this patch, we'll still have O(n) complexity and ignore_bds_parents is not needed any more. This patch only ignores the 'ignore_bds_parents' parameter. It will be removed in a separate patch. Signed-off-by: Kevin Wolf --- block.c | 13 ++--- block/io.c | 24 +--- tests/unit/test-bdrv-drain.c | 16 ++-- 3 files changed, 29 insertions(+), 24 deletions(-) diff --git a/block.c b/block.c index 9d082631d9..8878586f6e 100644 --- a/block.c +++ b/block.c @@ -2816,7 +2816,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child, { BlockDriverState *old_bs = child->bs; int new_bs_quiesce_counter; -int drain_saldo; assert(!child->frozen); assert(old_bs != new_bs); @@ -2827,15 +2826,13 @@ static void bdrv_replace_child_noperm(BdrvChild *child, } new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0); -drain_saldo = new_bs_quiesce_counter - child->parent_quiesce_counter; /* * If the new child node is drained but the old one was not, flush * all outstanding requests to the old child node. 
*/ -while (drain_saldo > 0 && child->klass->drained_begin) { +if (new_bs_quiesce_counter && !child->parent_quiesce_counter) { bdrv_parent_drained_begin_single(child, true); -drain_saldo--; } if (old_bs) { @@ -2859,7 +2856,6 @@ static void bdrv_replace_child_noperm(BdrvChild *child, * more often. */ assert(new_bs->quiesce_counter <= new_bs_quiesce_counter); -drain_saldo += new_bs->quiesce_counter - new_bs_quiesce_counter; if (child->klass->attach) { child->klass->attach(child); @@ -2869,10 +2865,13 @@ static void bdrv_replace_child_noperm(BdrvChild *child, /* * If the old child node was drained but the new one is not, allow * requests to come in only after the new node has been attached. + * + * Update new_bs_quiesce_counter because bdrv_parent_drained_begin_single() + * polls, which could have changed the value. */ -while (drain_saldo < 0 && child->klass->drained_end) { +new_bs_quiesce_counter = (new_bs ? new_bs->quiesce_counter : 0); +if (!new_bs_quiesce_counter && child->parent_quiesce_counter) { bdrv_parent_drained_end_single(child); -drain_saldo++; } } diff --git a/block/io.c b/block/io.c index 870a25d7a5..87c7a92f15 100644 --- a/block/io.c +++ b/block/io.c @@ -62,7 +62,7 @@ void bdrv_parent_drained_end_single(BdrvChild *c) { IO_OR_GS_CODE(); -assert(c->parent_quiesce_counter > 0); +assert(c->parent_quiesce_counter == 1); c->parent_quiesce_counter--; if (c->klass->drained_end) { c->klass->drained_end(c); } @@ -109,6 +109,7 @@ static bool bdrv_parent_drained_poll(BlockDriverState *bs, BdrvChild *ignore, void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll) { IO_OR_GS_CODE(); +assert(c->parent_quiesce_counter == 0); c->parent_quiesce_counter++; if (c->klass->drained_begin) { c->klass->drained_begin(c); } @@ -352,16 +353,16 @@ void bdrv_do_drained_begin_quiesce(BlockDriverState *bs, BdrvChild *parent, bool ignore_bds_parents) { IO_OR_GS_CODE(); -assert(!qemu_in_coroutine()); /* Stop things in parent-to-child order */ if (qatomic_fetch_inc(&bs->quiesce_counter) == 0) { aio_disable_external(bdrv_get_aio_context(bs)); -} -bdrv_parent_drained_begin(bs, parent, ignore_bds_parents); -if (bs->drv && bs->drv->bdrv_drain_begin) { -bs->drv->bdrv_drain_begin(bs); +/* TODO Remove ignore_bds_parents, we don't consider it any more */ +bdrv_parent_drained_begin(bs, parent, false); +if (bs->drv && bs->drv->bdrv_drain_begin) { +bs->drv->bdrv_drain_begin(bs); +} } } @@ -412,13 +413,14 @@ static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent,
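The transition-only rule this patch implements can be modeled in isolation. The sketch below is a toy, not QEMU code: the (stand-in) callbacks fire only on the 0->1 and 1->0 transitions of the counter, so nested drain sections cost exactly one begin and one end callback in total, which is what makes the quadratic ignore_bds_parents optimisation unnecessary.

```c
#include <assert.h>

typedef struct {
    int quiesce_counter;
    int begin_calls;   /* how often the stand-in drain callbacks ran */
    int end_calls;
} ToyNode;

static void toy_drained_begin(ToyNode *n)
{
    if (n->quiesce_counter++ == 0) {
        n->begin_calls++;          /* driver/parent callbacks only here */
    }
}

static void toy_drained_end(ToyNode *n)
{
    assert(n->quiesce_counter > 0);
    if (--n->quiesce_counter == 0) {
        n->end_calls++;            /* and only here on the way back */
    }
}
```

A nested begin/begin/end/end sequence leaves the callbacks invoked once each, mirroring the new parent_quiesce_counter == 0 / == 1 assertions in the patch.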
[PATCH 11/13] block: Remove ignore_bds_parents parameter from drain functions
ignore_bds_parents is now ignored, so we can just remove it. Signed-off-by: Kevin Wolf --- include/block/block-io.h | 10 ++ block.c | 4 +-- block/io.c | 78 +++- 3 files changed, 32 insertions(+), 60 deletions(-) diff --git a/include/block/block-io.h b/include/block/block-io.h index c35cb1e53f..5b54ed4672 100644 --- a/include/block/block-io.h +++ b/include/block/block-io.h @@ -305,14 +305,9 @@ void bdrv_parent_drained_end_single(BdrvChild *c); * * Poll for pending requests in @bs and its parents (except for @ignore_parent). * - * If @ignore_bds_parents is true, parents that are BlockDriverStates must - * ignore the drain request because they will be drained separately (used for - * drain_all). - * * This is part of bdrv_drained_begin. */ -bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent, - bool ignore_bds_parents); +bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent); /** * bdrv_drained_begin: @@ -330,8 +325,7 @@ void bdrv_drained_begin(BlockDriverState *bs); * Quiesces a BDS like bdrv_drained_begin(), but does not wait for already * running requests to complete. 
*/ -void bdrv_do_drained_begin_quiesce(BlockDriverState *bs, - BdrvChild *parent, bool ignore_bds_parents); +void bdrv_do_drained_begin_quiesce(BlockDriverState *bs, BdrvChild *parent); /** * bdrv_drained_end: diff --git a/block.c b/block.c index 8878586f6e..5f5f79cd16 100644 --- a/block.c +++ b/block.c @@ -1218,13 +1218,13 @@ static char *bdrv_child_get_parent_desc(BdrvChild *c) static void bdrv_child_cb_drained_begin(BdrvChild *child) { BlockDriverState *bs = child->opaque; -bdrv_do_drained_begin_quiesce(bs, NULL, false); +bdrv_do_drained_begin_quiesce(bs, NULL); } static bool bdrv_child_cb_drained_poll(BdrvChild *child) { BlockDriverState *bs = child->opaque; -return bdrv_drain_poll(bs, NULL, false); +return bdrv_drain_poll(bs, NULL); } static void bdrv_child_cb_drained_end(BdrvChild *child) diff --git a/block/io.c b/block/io.c index 87c7a92f15..4a83359a8f 100644 --- a/block/io.c +++ b/block/io.c @@ -45,13 +45,12 @@ static void bdrv_parent_cb_resize(BlockDriverState *bs); static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, int64_t offset, int64_t bytes, BdrvRequestFlags flags); -static void bdrv_parent_drained_begin(BlockDriverState *bs, BdrvChild *ignore, - bool ignore_bds_parents) +static void bdrv_parent_drained_begin(BlockDriverState *bs, BdrvChild *ignore) { BdrvChild *c, *next; QLIST_FOREACH_SAFE(c, >parents, next_parent, next) { -if (c == ignore || (ignore_bds_parents && c->klass->parent_is_bds)) { +if (c == ignore) { continue; } bdrv_parent_drained_begin_single(c, false); @@ -69,13 +68,12 @@ void bdrv_parent_drained_end_single(BdrvChild *c) } } -static void bdrv_parent_drained_end(BlockDriverState *bs, BdrvChild *ignore, -bool ignore_bds_parents) +static void bdrv_parent_drained_end(BlockDriverState *bs, BdrvChild *ignore) { BdrvChild *c; QLIST_FOREACH(c, >parents, next_parent) { -if (c == ignore || (ignore_bds_parents && c->klass->parent_is_bds)) { +if (c == ignore) { continue; } bdrv_parent_drained_end_single(c); @@ -90,14 +88,13 @@ 
static bool bdrv_parent_drained_poll_single(BdrvChild *c) return false; } -static bool bdrv_parent_drained_poll(BlockDriverState *bs, BdrvChild *ignore, - bool ignore_bds_parents) +static bool bdrv_parent_drained_poll(BlockDriverState *bs, BdrvChild *ignore) { BdrvChild *c, *next; bool busy = false; QLIST_FOREACH_SAFE(c, &bs->parents, next_parent, next) { -if (c == ignore || (ignore_bds_parents && c->klass->parent_is_bds)) { +if (c == ignore) { continue; } busy |= bdrv_parent_drained_poll_single(c); @@ -238,16 +235,14 @@ typedef struct { bool begin; bool poll; BdrvChild *parent; -bool ignore_bds_parents; } BdrvCoDrainData; /* Returns true if BDRV_POLL_WHILE() should go into a blocking aio_poll() */ -bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent, - bool ignore_bds_parents) +bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent) { IO_OR_GS_CODE(); -if (bdrv_parent_drained_poll(bs, ignore_parent, ignore_bds_parents)) { +if (bdrv_parent_drained_poll(bs, ignore_parent)) { return true; } @@ -258,16 +253,9 @@ bool bdrv_drain_poll(BlockDriverState *bs, BdrvChild *ignore_parent, return false; } -static bool bdrv_drain_poll_top_level(BlockDriverState *bs, - BdrvChild *ignore_parent) -{ -return
[PATCH 02/13] test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end()
We want to change .bdrv_co_drained_begin/end() back to be non-coroutine callbacks, so in preparation, avoid yielding in their implementation. This does almost the same as the existing logic in bdrv_drain_invoke(), by creating and entering coroutines internally. However, since the test case is by far the heaviest user of coroutine code in drain callbacks, it is preferable to have the complexity in the test case rather than the drain core, which is already complicated enough without this. The behaviour for bdrv_drain_begin() is unchanged because we increase bs->in_flight and this is still polled. However, bdrv_drain_end() doesn't wait for the spawned coroutine to complete any more. This is fine, we don't rely on bdrv_drain_end() restarting all operations immediately before the next aio_poll(). Signed-off-by: Kevin Wolf --- tests/unit/test-bdrv-drain.c | 64 ++-- 1 file changed, 46 insertions(+), 18 deletions(-) diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c index 09dc4a4891..24f34e24ad 100644 --- a/tests/unit/test-bdrv-drain.c +++ b/tests/unit/test-bdrv-drain.c @@ -38,12 +38,22 @@ typedef struct BDRVTestState { bool sleep_in_drain_begin; } BDRVTestState; +static void coroutine_fn sleep_in_drain_begin(void *opaque) +{ +BlockDriverState *bs = opaque; + +qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 10); +bdrv_dec_in_flight(bs); +} + static void coroutine_fn bdrv_test_co_drain_begin(BlockDriverState *bs) { BDRVTestState *s = bs->opaque; s->drain_count++; if (s->sleep_in_drain_begin) { -qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 10); +Coroutine *co = qemu_coroutine_create(sleep_in_drain_begin, bs); +bdrv_inc_in_flight(bs); +aio_co_enter(bdrv_get_aio_context(bs), co); } } @@ -1916,6 +1926,21 @@ static int coroutine_fn bdrv_replace_test_co_preadv(BlockDriverState *bs, return 0; } +static void coroutine_fn bdrv_replace_test_drain_co(void *opaque) +{ +BlockDriverState *bs = opaque; +BDRVReplaceTestState *s = bs->opaque; + +/* Keep waking io_co up until it is done 
*/ +while (s->io_co) { +aio_co_wake(s->io_co); +s->io_co = NULL; +qemu_coroutine_yield(); +} +s->drain_co = NULL; +bdrv_dec_in_flight(bs); +} + /** * If .drain_count is 0, wake up .io_co if there is one; and set * .was_drained. @@ -1926,20 +1951,27 @@ static void coroutine_fn bdrv_replace_test_co_drain_begin(BlockDriverState *bs) BDRVReplaceTestState *s = bs->opaque; if (!s->drain_count) { -/* Keep waking io_co up until it is done */ -s->drain_co = qemu_coroutine_self(); -while (s->io_co) { -aio_co_wake(s->io_co); -s->io_co = NULL; -qemu_coroutine_yield(); -} -s->drain_co = NULL; - +s->drain_co = qemu_coroutine_create(bdrv_replace_test_drain_co, bs); +bdrv_inc_in_flight(bs); +aio_co_enter(bdrv_get_aio_context(bs), s->drain_co); s->was_drained = true; } s->drain_count++; } +static void coroutine_fn bdrv_replace_test_read_entry(void *opaque) +{ +BlockDriverState *bs = opaque; +char data; +QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, &data, 1); +int ret; + +/* Queue a read request post-drain */ +ret = bdrv_replace_test_co_preadv(bs, 0, 1, &qiov, 0); +g_assert(ret >= 0); +bdrv_dec_in_flight(bs); +} + /** * Reduce .drain_count, set .was_undrained once it reaches 0. * If .drain_count reaches 0 and the node has a backing file, issue a @@ -1951,17 +1983,13 @@ static void coroutine_fn bdrv_replace_test_co_drain_end(BlockDriverState *bs) g_assert(s->drain_count > 0); if (!--s->drain_count) { -int ret; - s->was_undrained = true; if (bs->backing) { -char data; -QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, &data, 1); - -/* Queue a read request post-drain */ -ret = bdrv_replace_test_co_preadv(bs, 0, 1, &qiov, 0); -g_assert(ret >= 0); +Coroutine *co = qemu_coroutine_create(bdrv_replace_test_read_entry, + bs); +bdrv_inc_in_flight(bs); +aio_co_enter(bdrv_get_aio_context(bs), co); } } } -- 2.38.1
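The conversion above leans on the in-flight counter: instead of yielding inside the drain callback itself, the callback spawns a background coroutine and bumps the node's in-flight count, so drain keeps polling until the spawned work calls bdrv_dec_in_flight(). A toy model of that contract (illustrative only, not QEMU's API):

```c
#include <assert.h>

typedef struct {
    int in_flight;   /* stand-in for bs->in_flight */
} ToyBds;

/* Stand-ins for bdrv_inc_in_flight()/bdrv_dec_in_flight() bracketing a
 * spawned background task. */
static void toy_task_start(ToyBds *bs)
{
    bs->in_flight++;
}

static void toy_task_finish(ToyBds *bs)
{
    assert(bs->in_flight > 0);
    bs->in_flight--;
}

/* Drain considers the node busy while any spawned task is outstanding. */
static int toy_drain_poll(const ToyBds *bs)
{
    return bs->in_flight > 0;
}
```

This is why bdrv_drain_begin() still waits for the sleeping coroutine (the counter keeps the poll loop spinning) while bdrv_drain_end() can return without waiting.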
[PATCH 13/13] block: Remove poll parameter from bdrv_parent_drained_begin_single()
All callers of bdrv_parent_drained_begin_single() pass poll=false now, so we don't need the parameter any more. Signed-off-by: Kevin Wolf --- include/block/block-io.h | 5 ++--- block.c | 4 ++-- block/io.c | 7 ++- 3 files changed, 6 insertions(+), 10 deletions(-) diff --git a/include/block/block-io.h b/include/block/block-io.h index ddce8550a9..35669f0e62 100644 --- a/include/block/block-io.h +++ b/include/block/block-io.h @@ -285,10 +285,9 @@ bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos); /** * bdrv_parent_drained_begin_single: * - * Begin a quiesced section for the parent of @c. If @poll is true, wait for - * any pending activity to cease. + * Begin a quiesced section for the parent of @c. */ -void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll); +void bdrv_parent_drained_begin_single(BdrvChild *c); /** * bdrv_parent_drained_poll_single: diff --git a/block.c b/block.c index 12039e9b8a..c200f7afa0 100644 --- a/block.c +++ b/block.c @@ -2409,7 +2409,7 @@ static void bdrv_replace_child_abort(void *opaque) * new_bs drained when calling bdrv_replace_child_tran() is not a * requirement any more. */ -bdrv_parent_drained_begin_single(s->child, false); +bdrv_parent_drained_begin_single(s->child); assert(!bdrv_parent_drained_poll_single(s->child)); } assert(s->child->parent_quiesce_counter); @@ -3016,7 +3016,7 @@ static BdrvChild *bdrv_attach_child_common(BlockDriverState *child_bs, * bdrv_replace_child_noperm() will undrain it if the child node is not * drained. The child was only just created, so polling is not necessary. 
*/ -bdrv_parent_drained_begin_single(new_child, false); +bdrv_parent_drained_begin_single(new_child); bdrv_replace_child_noperm(new_child, child_bs); BdrvAttachChildCommonState *s = g_new(BdrvAttachChildCommonState, 1); diff --git a/block/io.c b/block/io.c index d0f641926f..9bcb19e5ee 100644 --- a/block/io.c +++ b/block/io.c @@ -53,7 +53,7 @@ static void bdrv_parent_drained_begin(BlockDriverState *bs, BdrvChild *ignore) if (c == ignore) { continue; } -bdrv_parent_drained_begin_single(c, false); +bdrv_parent_drained_begin_single(c); } } @@ -103,7 +103,7 @@ static bool bdrv_parent_drained_poll(BlockDriverState *bs, BdrvChild *ignore) return busy; } -void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll) +void bdrv_parent_drained_begin_single(BdrvChild *c) { IO_OR_GS_CODE(); assert(c->parent_quiesce_counter == 0); @@ -111,9 +111,6 @@ void bdrv_parent_drained_begin_single(BdrvChild *c, bool poll) if (c->klass->drained_begin) { c->klass->drained_begin(c); } -if (poll) { -BDRV_POLL_WHILE(c->bs, bdrv_parent_drained_poll_single(c)); -} } static void bdrv_merge_limits(BlockLimits *dst, const BlockLimits *src) -- 2.38.1
[PATCH 08/13] stream: Replace subtree drain with a single node drain
The subtree drain was introduced in commit b1e1af394d9 as a way to avoid graph changes between finding the base node and changing the block graph as necessary on completion of the image streaming job.

The block graph could change between these two points because bdrv_set_backing_hd() first drains the parent node, which involves polling and can do anything. Subtree draining was an imperfect way to make this less likely (because with it, fewer callbacks are called during this window). Everyone agreed that it's not really the right solution, and it was only committed as a stopgap.

This replaces the subtree drain with a solution that simply drains the parent node before we try to find the base node, and then calls a version of bdrv_set_backing_hd() that doesn't drain, but just asserts that the parent node is already drained. This way, any graph changes caused by draining happen before we start looking at the graph, and things stay consistent between finding the base node and changing the graph.
Signed-off-by: Kevin Wolf --- include/block/block-global-state.h | 3 +++ block.c| 17 ++--- block/stream.c | 20 ++-- 3 files changed, 27 insertions(+), 13 deletions(-) diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h index bb42ed9559..7923415d4e 100644 --- a/include/block/block-global-state.h +++ b/include/block/block-global-state.h @@ -82,6 +82,9 @@ int bdrv_open_file_child(const char *filename, BlockDriverState *bdrv_open_blockdev_ref(BlockdevRef *ref, Error **errp); int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd, Error **errp); +int bdrv_set_backing_hd_drained(BlockDriverState *bs, +BlockDriverState *backing_hd, +Error **errp); int bdrv_open_backing_file(BlockDriverState *bs, QDict *parent_options, const char *bdref_key, Error **errp); BlockDriverState *bdrv_open(const char *filename, const char *reference, diff --git a/block.c b/block.c index 2f6b25875f..43b893dd6c 100644 --- a/block.c +++ b/block.c @@ -3395,14 +3395,15 @@ static int bdrv_set_backing_noperm(BlockDriverState *bs, return bdrv_set_file_or_backing_noperm(bs, backing_hd, true, tran, errp); } -int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd, -Error **errp) +int bdrv_set_backing_hd_drained(BlockDriverState *bs, +BlockDriverState *backing_hd, +Error **errp) { int ret; Transaction *tran = tran_new(); GLOBAL_STATE_CODE(); -bdrv_drained_begin(bs); +assert(bs->quiesce_counter > 0); ret = bdrv_set_backing_noperm(bs, backing_hd, tran, errp); if (ret < 0) { @@ -3412,7 +3413,17 @@ int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd, ret = bdrv_refresh_perms(bs, errp); out: tran_finalize(tran, ret); +return ret; +} +int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd, +Error **errp) +{ +int ret; +GLOBAL_STATE_CODE(); + +bdrv_drained_begin(bs); +ret = bdrv_set_backing_hd_drained(bs, backing_hd, errp); bdrv_drained_end(bs); return ret; diff --git a/block/stream.c b/block/stream.c 
index 694709bd25..81dcf5a417 100644 --- a/block/stream.c +++ b/block/stream.c @@ -64,13 +64,16 @@ static int stream_prepare(Job *job) bdrv_cor_filter_drop(s->cor_filter_bs); s->cor_filter_bs = NULL; -bdrv_subtree_drained_begin(s->above_base); +/* + * bdrv_set_backing_hd() requires that unfiltered_bs is drained. Drain + * already here and use bdrv_set_backing_hd_drained() instead because + * the polling during drained_begin() might change the graph, and if we do + * this only later, we may end up working with the wrong base node (or it + * might even have gone away by the time we want to use it). + */ +bdrv_drained_begin(unfiltered_bs); base = bdrv_filter_or_cow_bs(s->above_base); -if (base) { -bdrv_ref(base); -} - unfiltered_base = bdrv_skip_filters(base); if (bdrv_cow_child(unfiltered_bs)) { @@ -82,7 +85,7 @@ static int stream_prepare(Job *job) } } -bdrv_set_backing_hd(unfiltered_bs, base, &local_err); +bdrv_set_backing_hd_drained(unfiltered_bs, base, &local_err); ret = bdrv_change_backing_file(unfiltered_bs, base_id, base_fmt, false); if (local_err) { error_report_err(local_err); @@ -92,10 +95,7 @@ static int stream_prepare(Job *job) } out: -if (base) { -bdrv_unref(base); -} -bdrv_subtree_drained_end(s->above_base); +bdrv_drained_end(unfiltered_bs); return ret; } -- 2.38.1
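The ordering bug this patch closes can be shown with a tiny standalone model. The `ToyGraph` type, the `pending_base` mechanism, and all `toy_*` names below are invented for illustration (they are not the QEMU API); the point is only that a lookup done before draining can return a node that the polling inside drained_begin() is about to replace:

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyGraph {
    const char *base;          /* current base node name */
    const char *pending_base;  /* change applied by drain-time polling */
} ToyGraph;

static void toy_drained_begin(ToyGraph *g)
{
    /* draining may poll and complete a pending graph change */
    if (g->pending_base) {
        g->base = g->pending_base;
        g->pending_base = NULL;
    }
}

/* Old order: look up the base, then drain -> may act on a stale node. */
static const char *toy_lookup_then_drain(ToyGraph *g)
{
    const char *base = g->base;
    toy_drained_begin(g);
    return base;
}

/* New order (what stream_prepare() does now): drain, then look up. */
static const char *toy_drain_then_lookup(ToyGraph *g)
{
    toy_drained_begin(g);
    return g->base;
}
```

With the old ordering the caller keeps working on the stale node; with the new ordering any drain-triggered graph change completes before the lookup, matching the commit message above.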
[PATCH 00/13] block: Simplify drain
I'm aware that exactly nobody has been looking forward to a series with this title, but it has to be. The way drain works means that we need to poll in bdrv_replace_child_noperm() and that makes things rather messy with Emanuele's multiqueue work because you must not poll while you hold the graph lock.

The other reason why it has to be is that drain is way too complex and there are too many different cases. Some simplification like this will hopefully make it considerably more maintainable. The diffstat probably tells something, too.

There are roughly speaking three parts in this series:

1. Make BlockDriver.bdrv_drained_begin/end() non-coroutine_fn again, which allows us to not poll on bdrv_drained_end() any more.

2. Remove subtree drains. They are a considerable complication in the whole drain machinery (in particular, they require polling in the BdrvChildClass.attach/detach() callbacks that are called during bdrv_replace_child_noperm()) and none of their users actually has a good reason to use them.

3. Finally get rid of polling in bdrv_replace_child_noperm() by requiring that the child is already drained by the caller and calling callbacks only once and not again for every nested drain section.

If necessary, a prefix of this series can be merged that covers only the first or the first two parts and it would still make sense.
Kevin Wolf (13):
  qed: Don't yield in bdrv_qed_co_drain_begin()
  test-bdrv-drain: Don't yield in .bdrv_co_drained_begin/end()
  block: Revert .bdrv_drained_begin/end to non-coroutine_fn
  block: Remove drained_end_counter
  block: Inline bdrv_drain_invoke()
  block: Drain individual nodes during reopen
  block: Don't use subtree drains in bdrv_drop_intermediate()
  stream: Replace subtree drain with a single node drain
  block: Remove subtree drains
  block: Call drain callbacks only once
  block: Remove ignore_bds_parents parameter from drain functions
  block: Don't poll in bdrv_replace_child_noperm()
  block: Remove poll parameter from bdrv_parent_drained_begin_single()

 include/block/block-global-state.h |   3 +
 include/block/block-io.h           |  52 +---
 include/block/block_int-common.h   |  17 +-
 include/block/block_int-io.h       |  12 -
 block.c                            | 132 ++-
 block/block-backend.c              |   4 +-
 block/io.c                         | 281 --
 block/qed.c                        |  24 +-
 block/replication.c                |   6 -
 block/stream.c                     |  20 +-
 block/throttle.c                   |   6 +-
 blockdev.c                         |  13 -
 blockjob.c                         |   2 +-
 tests/unit/test-bdrv-drain.c       | 369 +++--
 14 files changed, 270 insertions(+), 671 deletions(-)
-- 2.38.1
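The contract that parts 2 and 3 of the cover letter rely on — drain sections nest via a counter, and (after "block: Call drain callbacks only once") the parent callbacks fire only on the outermost begin/end — can be sketched as a self-contained toy model. The `ToyNode` struct and `toy_*` functions are made up for the example and are not the real QEMU interfaces:

```c
/* Toy model of the nested-drain contract: callbacks run only on the
 * 0 -> 1 and 1 -> 0 transitions of the quiesce counter. */
#include <assert.h>

typedef struct ToyNode {
    int quiesce_counter;   /* how many drained sections are active */
    int cb_begin_calls;    /* how often the begin callback ran */
    int cb_end_calls;      /* how often the end callback ran */
} ToyNode;

static void toy_drained_begin(ToyNode *n)
{
    if (n->quiesce_counter++ == 0) {
        n->cb_begin_calls++;   /* callback fires only on the first begin */
    }
}

static void toy_drained_end(ToyNode *n)
{
    assert(n->quiesce_counter > 0);
    if (--n->quiesce_counter == 0) {
        n->cb_end_calls++;     /* callback fires only on the last end */
    }
}
```

Two nested drained sections therefore produce exactly one begin/end callback pair, which is what lets the series drop the per-nesting-level bookkeeping.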
[PATCH 04/13] block: Remove drained_end_counter
drained_end_counter is unused now, nobody changes its value any more. It can be removed. In cases where we had two almost identical functions that only differed in whether the caller passes drained_end_counter, or whether they would poll for a local drained_end_counter to reach 0, these become a single function. Signed-off-by: Kevin Wolf --- include/block/block-io.h | 15 - include/block/block_int-common.h | 6 +- block.c | 5 +- block/block-backend.c| 4 +- block/io.c | 97 blockjob.c | 2 +- 6 files changed, 30 insertions(+), 99 deletions(-) diff --git a/include/block/block-io.h b/include/block/block-io.h index 770ddeb7c8..97e9ae8bee 100644 --- a/include/block/block-io.h +++ b/include/block/block-io.h @@ -235,21 +235,6 @@ int coroutine_fn bdrv_co_copy_range(BdrvChild *src, int64_t src_offset, int64_t bytes, BdrvRequestFlags read_flags, BdrvRequestFlags write_flags); -/** - * bdrv_drained_end_no_poll: - * - * Same as bdrv_drained_end(), but do not poll for the subgraph to - * actually become unquiesced. Therefore, no graph changes will occur - * with this function. - * - * *drained_end_counter is incremented for every background operation - * that is scheduled, and will be decremented for every operation once - * it settles. The caller must poll until it reaches 0. The counter - * should be accessed using atomic operations only. - */ -void bdrv_drained_end_no_poll(BlockDriverState *bs, int *drained_end_counter); - - /* * "I/O or GS" API functions. These functions can run without * the BQL, but only in one specific iothread/main loop. diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h index 0956acbb60..6504db4fd9 100644 --- a/include/block/block_int-common.h +++ b/include/block/block_int-common.h @@ -939,15 +939,11 @@ struct BdrvChildClass { * These functions must not change the graph (and therefore also must not * call aio_poll(), which could change the graph indirectly). 
* - * If drained_end() schedules background operations, it must atomically - * increment *drained_end_counter for each such operation and atomically - * decrement it once the operation has settled. - * * Note that this can be nested. If drained_begin() was called twice, new * I/O is allowed only after drained_end() was called twice, too. */ void (*drained_begin)(BdrvChild *child); -void (*drained_end)(BdrvChild *child, int *drained_end_counter); +void (*drained_end)(BdrvChild *child); /* * Returns whether the parent has pending requests for the child. This diff --git a/block.c b/block.c index fed8077993..7a24bd4c36 100644 --- a/block.c +++ b/block.c @@ -1227,11 +1227,10 @@ static bool bdrv_child_cb_drained_poll(BdrvChild *child) return bdrv_drain_poll(bs, false, NULL, false); } -static void bdrv_child_cb_drained_end(BdrvChild *child, - int *drained_end_counter) +static void bdrv_child_cb_drained_end(BdrvChild *child) { BlockDriverState *bs = child->opaque; -bdrv_drained_end_no_poll(bs, drained_end_counter); +bdrv_drained_end(bs); } static int bdrv_child_cb_inactivate(BdrvChild *child) diff --git a/block/block-backend.c b/block/block-backend.c index c0c7d56c8d..ecdfeb49bb 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -129,7 +129,7 @@ static void blk_root_inherit_options(BdrvChildRole role, bool parent_is_format, } static void blk_root_drained_begin(BdrvChild *child); static bool blk_root_drained_poll(BdrvChild *child); -static void blk_root_drained_end(BdrvChild *child, int *drained_end_counter); +static void blk_root_drained_end(BdrvChild *child); static void blk_root_change_media(BdrvChild *child, bool load); static void blk_root_resize(BdrvChild *child); @@ -2549,7 +2549,7 @@ static bool blk_root_drained_poll(BdrvChild *child) return busy || !!blk->in_flight; } -static void blk_root_drained_end(BdrvChild *child, int *drained_end_counter) +static void blk_root_drained_end(BdrvChild *child) { BlockBackend *blk = child->opaque; 
assert(blk->quiesce_counter); diff --git a/block/io.c b/block/io.c index 183b407f5b..41e6121c31 100644 --- a/block/io.c +++ b/block/io.c @@ -58,27 +58,19 @@ static void bdrv_parent_drained_begin(BlockDriverState *bs, BdrvChild *ignore, } } -static void bdrv_parent_drained_end_single_no_poll(BdrvChild *c, - int *drained_end_counter) +void bdrv_parent_drained_end_single(BdrvChild *c) { +IO_OR_GS_CODE(); + assert(c->parent_quiesce_counter > 0); c->parent_quiesce_counter--; if (c->klass->drained_end) { -c->klass->drained_end(c, drained_end_counter); +c->klass->drained_end(c); } } -void
[PATCH 06/13] block: Drain individual nodes during reopen
bdrv_reopen() and friends use subtree drains as a lazy way of covering all the nodes they touch. Turns out that this lazy way is a lot more complicated than just draining the nodes individually, even not accounting for the additional complexity in the drain mechanism itself. Simplify the code by switching to draining the individual nodes that are already managed in the BlockReopenQueue anyway. Signed-off-by: Kevin Wolf --- block.c | 11 --- block/replication.c | 6 -- blockdev.c | 13 - 3 files changed, 4 insertions(+), 26 deletions(-) diff --git a/block.c b/block.c index 7a24bd4c36..5828b970e4 100644 --- a/block.c +++ b/block.c @@ -4142,7 +4142,7 @@ static bool bdrv_recurse_has_child(BlockDriverState *bs, * returns a pointer to bs_queue, which is either the newly allocated * bs_queue, or the existing bs_queue being used. * - * bs must be drained between bdrv_reopen_queue() and bdrv_reopen_multiple(). + * bs is drained here and undrained by bdrv_reopen_queue_free(). */ static BlockReopenQueue *bdrv_reopen_queue_child(BlockReopenQueue *bs_queue, BlockDriverState *bs, @@ -4162,12 +4162,10 @@ static BlockReopenQueue *bdrv_reopen_queue_child(BlockReopenQueue *bs_queue, int flags; QemuOpts *opts; -/* Make sure that the caller remembered to use a drained section. This is - * important to avoid graph changes between the recursive queuing here and - * bdrv_reopen_multiple(). 
*/ -assert(bs->quiesce_counter > 0); GLOBAL_STATE_CODE(); +bdrv_drained_begin(bs); + if (bs_queue == NULL) { bs_queue = g_new0(BlockReopenQueue, 1); QTAILQ_INIT(bs_queue); @@ -4317,6 +4315,7 @@ void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue) if (bs_queue) { BlockReopenQueueEntry *bs_entry, *next; QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) { +bdrv_drained_end(bs_entry->state.bs); qobject_unref(bs_entry->state.explicit_options); qobject_unref(bs_entry->state.options); g_free(bs_entry); @@ -4464,7 +4463,6 @@ int bdrv_reopen(BlockDriverState *bs, QDict *opts, bool keep_old_opts, GLOBAL_STATE_CODE(); -bdrv_subtree_drained_begin(bs); if (ctx != qemu_get_aio_context()) { aio_context_release(ctx); } @@ -4475,7 +4473,6 @@ int bdrv_reopen(BlockDriverState *bs, QDict *opts, bool keep_old_opts, if (ctx != qemu_get_aio_context()) { aio_context_acquire(ctx); } -bdrv_subtree_drained_end(bs); return ret; } diff --git a/block/replication.c b/block/replication.c index f1eed25e43..c62f48a874 100644 --- a/block/replication.c +++ b/block/replication.c @@ -374,9 +374,6 @@ static void reopen_backing_file(BlockDriverState *bs, bool writable, s->orig_secondary_read_only = bdrv_is_read_only(secondary_disk->bs); } -bdrv_subtree_drained_begin(hidden_disk->bs); -bdrv_subtree_drained_begin(secondary_disk->bs); - if (s->orig_hidden_read_only) { QDict *opts = qdict_new(); qdict_put_bool(opts, BDRV_OPT_READ_ONLY, !writable); @@ -401,9 +398,6 @@ static void reopen_backing_file(BlockDriverState *bs, bool writable, aio_context_acquire(ctx); } } - -bdrv_subtree_drained_end(hidden_disk->bs); -bdrv_subtree_drained_end(secondary_disk->bs); } static void backup_job_cleanup(BlockDriverState *bs) diff --git a/blockdev.c b/blockdev.c index 3f1dec6242..8ffb3d9537 100644 --- a/blockdev.c +++ b/blockdev.c @@ -3547,8 +3547,6 @@ fail: void qmp_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp) { BlockReopenQueue *queue = NULL; -GSList *drained = NULL; -GSList *p; /* Add each one of 
the BDS that we want to reopen to the queue */ for (; reopen_list != NULL; reopen_list = reopen_list->next) { @@ -3585,9 +3583,7 @@ void qmp_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp) ctx = bdrv_get_aio_context(bs); aio_context_acquire(ctx); -bdrv_subtree_drained_begin(bs); queue = bdrv_reopen_queue(queue, bs, qdict, false); -drained = g_slist_prepend(drained, bs); aio_context_release(ctx); } @@ -3598,15 +3594,6 @@ void qmp_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp) fail: bdrv_reopen_queue_free(queue); -for (p = drained; p; p = p->next) { -BlockDriverState *bs = p->data; -AioContext *ctx = bdrv_get_aio_context(bs); - -aio_context_acquire(ctx); -bdrv_subtree_drained_end(bs); -aio_context_release(ctx); -} -g_slist_free(drained); } void qmp_blockdev_del(const char *node_name, Error **errp) -- 2.38.1
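The new lifetime rule from this patch — each node is drained when it enters the reopen queue and undrained when the queue is freed — reduces to a simple balanced-counter invariant. A minimal sketch, with invented `Toy*` names and a fixed-size array standing in for the real QTAILQ:

```c
#include <assert.h>

#define TOY_MAX_NODES 8

typedef struct {
    int quiesce_counter;
} ToyBDS;

typedef struct {
    ToyBDS *entries[TOY_MAX_NODES];
    int n;
} ToyReopenQueue;

/* Models bdrv_reopen_queue_child(): drain the node as it is queued. */
static void toy_reopen_queue_add(ToyReopenQueue *q, ToyBDS *bs)
{
    bs->quiesce_counter++;                 /* bdrv_drained_begin(bs) */
    q->entries[q->n++] = bs;
}

/* Models bdrv_reopen_queue_free(): undrain every queued node. */
static void toy_reopen_queue_free(ToyReopenQueue *q)
{
    for (int i = 0; i < q->n; i++) {
        q->entries[i]->quiesce_counter--;  /* bdrv_drained_end() */
    }
    q->n = 0;
}
```

Because begin and end are tied to queue membership, callers no longer need their own subtree drain sections (or the separate `drained` list removed from qmp_blockdev_reopen() above).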
[PATCH 01/13] qed: Don't yield in bdrv_qed_co_drain_begin()
We want to change .bdrv_co_drained_begin() back to be a non-coroutine callback, so in preparation, avoid yielding in its implementation. Because we increase bs->in_flight and bdrv_drained_begin() polls, the behaviour is unchanged. Signed-off-by: Kevin Wolf --- block/qed.c | 20 +--- 1 file changed, 17 insertions(+), 3 deletions(-) diff --git a/block/qed.c b/block/qed.c index 2f36ad342c..013f826c44 100644 --- a/block/qed.c +++ b/block/qed.c @@ -282,9 +282,8 @@ static void coroutine_fn qed_unplug_allocating_write_reqs(BDRVQEDState *s) qemu_co_mutex_unlock(&s->table_lock); } -static void coroutine_fn qed_need_check_timer_entry(void *opaque) +static void coroutine_fn qed_need_check_timer(BDRVQEDState *s) { -BDRVQEDState *s = opaque; int ret; trace_qed_need_check_timer_cb(s); @@ -310,9 +309,20 @@ static void coroutine_fn qed_need_check_timer_entry(void *opaque) (void) ret; } +static void coroutine_fn qed_need_check_timer_entry(void *opaque) +{ +BDRVQEDState *s = opaque; + +qed_need_check_timer(opaque); +bdrv_dec_in_flight(s->bs); +} + static void qed_need_check_timer_cb(void *opaque) { +BDRVQEDState *s = opaque; Coroutine *co = qemu_coroutine_create(qed_need_check_timer_entry, opaque); + +bdrv_inc_in_flight(s->bs); qemu_coroutine_enter(co); } @@ -363,8 +373,12 @@ static void coroutine_fn bdrv_qed_co_drain_begin(BlockDriverState *bs) * header is flushed. */ if (s->need_check_timer && timer_pending(s->need_check_timer)) { +Coroutine *co; + qed_cancel_need_check_timer(s); -qed_need_check_timer_entry(s); +co = qemu_coroutine_create(qed_need_check_timer_entry, s); +bdrv_inc_in_flight(bs); +aio_co_enter(bdrv_get_aio_context(bs), co); } } -- 2.38.1
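The reason the behaviour is unchanged is the in-flight pattern: the reference is taken before the background coroutine is scheduled, so the polling in bdrv_drained_begin() cannot finish until the work has run. A minimal sketch of that invariant, with invented `Toy*` names and a boolean flag standing in for the scheduled coroutine:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int in_flight;      /* stands in for bs->in_flight */
    bool check_pending; /* one scheduled unit of background work */
    int checks_done;
} ToyQed;

/* qed_need_check_timer_cb() analogue: take the reference, then schedule. */
static void toy_schedule_check(ToyQed *s)
{
    s->in_flight++;         /* bdrv_inc_in_flight() */
    s->check_pending = true;
}

/* qed_need_check_timer_entry() analogue: do the work, drop the reference. */
static void toy_run_pending(ToyQed *s)
{
    if (s->check_pending) {
        s->check_pending = false;
        s->checks_done++;
        s->in_flight--;     /* bdrv_dec_in_flight() */
    }
}

/* BDRV_POLL_WHILE() stand-in: drain may not return while work is in flight. */
static void toy_drain(ToyQed *s)
{
    while (s->in_flight > 0) {
        toy_run_pending(s);
    }
}
```

Because the counter is incremented before the work is scheduled, there is no window in which a drain could return while the check is still outstanding.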
[PATCH 03/13] block: Revert .bdrv_drained_begin/end to non-coroutine_fn
Polling during bdrv_drained_end() can be problematic (and in the future, we may get cases for bdrv_drained_begin() where polling is forbidden, and we don't care about already in-flight requests, but just want to prevent new requests from arriving). The .bdrv_drained_begin/end callbacks running in a coroutine is the only reason why we have to do this polling, so make them non-coroutine callbacks again. None of the callers actually yield any more. This means that bdrv_drained_end() effectively doesn't poll any more, even if AIO_WAIT_WHILE() loops are still there (their condition is false from the beginning). This is generally not a problem, but in test-bdrv-drain, some additional explicit aio_poll() calls need to be added because the test case wants to verify the final state after BHs have executed. Signed-off-by: Kevin Wolf --- include/block/block_int-common.h | 10 --- block.c | 4 +-- block/io.c | 49 +--- block/qed.c | 4 +-- block/throttle.c | 6 ++-- tests/unit/test-bdrv-drain.c | 18 ++-- 6 files changed, 30 insertions(+), 61 deletions(-) diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h index 5a2cc077a0..0956acbb60 100644 --- a/include/block/block_int-common.h +++ b/include/block/block_int-common.h @@ -735,17 +735,19 @@ struct BlockDriver { void (*bdrv_io_unplug)(BlockDriverState *bs); /** - * bdrv_co_drain_begin is called if implemented in the beginning of a + * bdrv_drain_begin is called if implemented in the beginning of a * drain operation to drain and stop any internal sources of requests in * the driver. - * bdrv_co_drain_end is called if implemented at the end of the drain. + * bdrv_drain_end is called if implemented at the end of the drain. * * They should be used by the driver to e.g. manage scheduled I/O * requests, or toggle an internal state. After the end of the drain new * requests will continue normally. + * + * Implementations of both functions must not call aio_poll(). 
*/ -void coroutine_fn (*bdrv_co_drain_begin)(BlockDriverState *bs); -void coroutine_fn (*bdrv_co_drain_end)(BlockDriverState *bs); +void (*bdrv_drain_begin)(BlockDriverState *bs); +void (*bdrv_drain_end)(BlockDriverState *bs); bool (*bdrv_supports_persistent_dirty_bitmap)(BlockDriverState *bs); bool coroutine_fn (*bdrv_co_can_store_new_dirty_bitmap)( diff --git a/block.c b/block.c index 3bd594eb2a..fed8077993 100644 --- a/block.c +++ b/block.c @@ -1705,8 +1705,8 @@ static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv, assert(is_power_of_2(bs->bl.request_alignment)); for (i = 0; i < bs->quiesce_counter; i++) { -if (drv->bdrv_co_drain_begin) { -drv->bdrv_co_drain_begin(bs); +if (drv->bdrv_drain_begin) { +drv->bdrv_drain_begin(bs); } } diff --git a/block/io.c b/block/io.c index 34b30e304e..183b407f5b 100644 --- a/block/io.c +++ b/block/io.c @@ -250,55 +250,20 @@ typedef struct { int *drained_end_counter; } BdrvCoDrainData; -static void coroutine_fn bdrv_drain_invoke_entry(void *opaque) -{ -BdrvCoDrainData *data = opaque; -BlockDriverState *bs = data->bs; - -if (data->begin) { -bs->drv->bdrv_co_drain_begin(bs); -} else { -bs->drv->bdrv_co_drain_end(bs); -} - -/* Set data->done and decrement drained_end_counter before bdrv_wakeup() */ -qatomic_mb_set(&data->done, true); -if (!data->begin) { -qatomic_dec(data->drained_end_counter); -} -bdrv_dec_in_flight(bs); - -g_free(data); -} - -/* Recursively call BlockDriver.bdrv_co_drain_begin/end callbacks */ +/* Recursively call BlockDriver.bdrv_drain_begin/end callbacks */ static void bdrv_drain_invoke(BlockDriverState *bs, bool begin, int *drained_end_counter) { -BdrvCoDrainData *data; - -if (!bs->drv || (begin && !bs->drv->bdrv_co_drain_begin) || -(!begin && !bs->drv->bdrv_co_drain_end)) { +if (!bs->drv || (begin && !bs->drv->bdrv_drain_begin) || -(!begin && !bs->drv->bdrv_co_drain_end)) { return; } -data = g_new(BdrvCoDrainData, 1); -*data = (BdrvCoDrainData) { -.bs = bs, -.done = false, -.begin = begin,
-.drained_end_counter = drained_end_counter, -}; - -if (!begin) { -qatomic_inc(drained_end_counter); +if (begin) { +bs->drv->bdrv_drain_begin(bs); +} else { +bs->drv->bdrv_drain_end(bs); } - -/* Make sure the driver callback completes during the polling phase for - * drain_begin. */ -bdrv_inc_in_flight(bs); -data->co = qemu_coroutine_create(bdrv_drain_invoke_entry, data); -aio_co_schedule(bdrv_get_aio_context(bs), data->co); } /* Returns true if BDRV_POLL_WHILE() should go
Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start
"Michael S. Tsirkin" writes: > On Tue, Nov 08, 2022 at 10:23:15AM +, Alex Bennée wrote: >> >> "Michael S. Tsirkin" writes: >> >> > On Tue, Nov 08, 2022 at 09:23:04AM +, Alex Bennée wrote: >> >> The previous fix to virtio_device_started revealed a problem in its >> >> use by both the core and the device code. The core code should be able >> >> to handle the device "starting" while the VM isn't running to handle >> >> the restoration of migration state. To solve this dual use introduce a >> >> new helper for use by the vhost-user backends who all use it to feed a >> >> should_start variable. >> >> >> >> We can also pick up a change vhost_user_blk_set_status while we are at >> >> it which follows the same pattern. >> >> >> >> Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to >> >> virtio_device_started) >> >> Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device) >> >> Signed-off-by: Alex Bennée >> >> Cc: "Michael S. Tsirkin" >> > >> > why is this in this patchset? >> >> As per my cover letter: >> >> Most of these patches have been posted before as single patch RFCs. A >> couple are already scheduled through other trees so will drop out in >> due course >> >> but I keep them in my tree until they are merged so I can continue to >> soak test them (and have a stable base for my other WIP trees). > > That's fine just pls don't double-post them on list, certainly > not as part of a patchset. Why not? Is this breaking some tooling? -- Alex Bennée
Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start
On Tue, Nov 08, 2022 at 10:23:15AM +, Alex Bennée wrote: > > "Michael S. Tsirkin" writes: > > > On Tue, Nov 08, 2022 at 09:23:04AM +, Alex Bennée wrote: > >> The previous fix to virtio_device_started revealed a problem in its > >> use by both the core and the device code. The core code should be able > >> to handle the device "starting" while the VM isn't running to handle > >> the restoration of migration state. To solve this dual use introduce a > >> new helper for use by the vhost-user backends who all use it to feed a > >> should_start variable. > >> > >> We can also pick up a change vhost_user_blk_set_status while we are at > >> it which follows the same pattern. > >> > >> Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to > >> virtio_device_started) > >> Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device) > >> Signed-off-by: Alex Bennée > >> Cc: "Michael S. Tsirkin" > > > > why is this in this patchset? > > As per my cover letter: > > Most of these patches have been posted before as single patch RFCs. A > couple are already scheduled through other trees so will drop out in > due course > > but I keep them in my tree until they are merged so I can continue to > soak test them (and have a stable base for my other WIP trees). That's fine just pls don't double-post them on list, certainly not as part of a patchset. > -- > Alex Bennée
Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start
"Michael S. Tsirkin" writes: > On Tue, Nov 08, 2022 at 09:23:04AM +, Alex Bennée wrote: >> The previous fix to virtio_device_started revealed a problem in its >> use by both the core and the device code. The core code should be able >> to handle the device "starting" while the VM isn't running to handle >> the restoration of migration state. To solve this dual use introduce a >> new helper for use by the vhost-user backends who all use it to feed a >> should_start variable. >> >> We can also pick up a change vhost_user_blk_set_status while we are at >> it which follows the same pattern. >> >> Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to virtio_device_started) >> Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device) >> Signed-off-by: Alex Bennée >> Cc: "Michael S. Tsirkin" > > why is this in this patchset? As per my cover letter: Most of these patches have been posted before as single patch RFCs. A couple are already scheduled through other trees so will drop out in due course but I keep them in my tree until they are merged so I can continue to soak test them (and have a stable base for my other WIP trees). -- Alex Bennée
Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start
On Tue, Nov 08, 2022 at 09:23:04AM +, Alex Bennée wrote: > The previous fix to virtio_device_started revealed a problem in its > use by both the core and the device code. The core code should be able > to handle the device "starting" while the VM isn't running to handle > the restoration of migration state. To solve this dual use introduce a > new helper for use by the vhost-user backends who all use it to feed a > should_start variable. > > We can also pick up a change vhost_user_blk_set_status while we are at > it which follows the same pattern. > > Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to virtio_device_started) > Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device) > Signed-off-by: Alex Bennée > Cc: "Michael S. Tsirkin" is this the same as the RFC? > --- > include/hw/virtio/virtio.h | 18 ++ > hw/block/vhost-user-blk.c| 6 +- > hw/virtio/vhost-user-fs.c| 2 +- > hw/virtio/vhost-user-gpio.c | 2 +- > hw/virtio/vhost-user-i2c.c | 2 +- > hw/virtio/vhost-user-rng.c | 2 +- > hw/virtio/vhost-user-vsock.c | 2 +- > hw/virtio/vhost-vsock.c | 2 +- > 8 files changed, 25 insertions(+), 11 deletions(-) > > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h > index f41b4a7e64..3191c618f3 100644 > --- a/include/hw/virtio/virtio.h > +++ b/include/hw/virtio/virtio.h > @@ -389,6 +389,24 @@ static inline bool virtio_device_started(VirtIODevice > *vdev, uint8_t status) > return vdev->started; > } > > +return status & VIRTIO_CONFIG_S_DRIVER_OK; > +} > + > +/** > + * virtio_device_should_start() - check if device startable > + * @vdev - the VirtIO device > + * @status - the devices status bits > + * > + * This is similar to virtio_device_started() but also encapsulates a > + * check on the VM status which would prevent a device starting > + * anyway. 
> + */ > +static inline bool virtio_device_should_start(VirtIODevice *vdev, uint8_t > status) > +{ > +if (vdev->use_started) { > +return vdev->started; > +} > + > if (!vdev->vm_running) { > return false; > } > diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c > index 13bf5cc47a..8feaf12e4e 100644 > --- a/hw/block/vhost-user-blk.c > +++ b/hw/block/vhost-user-blk.c > @@ -222,14 +222,10 @@ static void vhost_user_blk_stop(VirtIODevice *vdev) > static void vhost_user_blk_set_status(VirtIODevice *vdev, uint8_t status) > { > VHostUserBlk *s = VHOST_USER_BLK(vdev); > -bool should_start = virtio_device_started(vdev, status); > +bool should_start = virtio_device_should_start(vdev, status); > Error *local_err = NULL; > int ret; > > -if (!vdev->vm_running) { > -should_start = false; > -} > - > if (!s->connected) { > return; > } > diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c > index ad0f91c607..1c40f42045 100644 > --- a/hw/virtio/vhost-user-fs.c > +++ b/hw/virtio/vhost-user-fs.c > @@ -123,7 +123,7 @@ static void vuf_stop(VirtIODevice *vdev) > static void vuf_set_status(VirtIODevice *vdev, uint8_t status) > { > VHostUserFS *fs = VHOST_USER_FS(vdev); > -bool should_start = virtio_device_started(vdev, status); > +bool should_start = virtio_device_should_start(vdev, status); > > if (vhost_dev_is_started(&fs->vhost_dev) == should_start) { > return; > diff --git a/hw/virtio/vhost-user-gpio.c b/hw/virtio/vhost-user-gpio.c > index 8b40fe450c..677d1c7730 100644 > --- a/hw/virtio/vhost-user-gpio.c > +++ b/hw/virtio/vhost-user-gpio.c > @@ -152,7 +152,7 @@ static void vu_gpio_stop(VirtIODevice *vdev) > static void vu_gpio_set_status(VirtIODevice *vdev, uint8_t status) > { > VHostUserGPIO *gpio = VHOST_USER_GPIO(vdev); > -bool should_start = virtio_device_started(vdev, status); > +bool should_start = virtio_device_should_start(vdev, status); > > trace_virtio_gpio_set_status(status); > > diff --git a/hw/virtio/vhost-user-i2c.c b/hw/virtio/vhost-user-i2c.c > 
index bc58b6c0d1..864eba695e 100644 > --- a/hw/virtio/vhost-user-i2c.c > +++ b/hw/virtio/vhost-user-i2c.c > @@ -93,7 +93,7 @@ static void vu_i2c_stop(VirtIODevice *vdev) > static void vu_i2c_set_status(VirtIODevice *vdev, uint8_t status) > { > VHostUserI2C *i2c = VHOST_USER_I2C(vdev); > -bool should_start = virtio_device_started(vdev, status); > +bool should_start = virtio_device_should_start(vdev, status); > > if (vhost_dev_is_started(&i2c->vhost_dev) == should_start) { > return; > diff --git a/hw/virtio/vhost-user-rng.c b/hw/virtio/vhost-user-rng.c > index bc1f36c5ac..8b47287875 100644 > --- a/hw/virtio/vhost-user-rng.c > +++ b/hw/virtio/vhost-user-rng.c > @@ -90,7 +90,7 @@ static void vu_rng_stop(VirtIODevice *vdev) > static void vu_rng_set_status(VirtIODevice *vdev, uint8_t status) > { > VHostUserRNG *rng = VHOST_USER_RNG(vdev); > -bool should_start = virtio_device_started(vdev, status); > +
Re: [PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start
On Tue, Nov 08, 2022 at 09:23:04AM +, Alex Bennée wrote:
> The previous fix to virtio_device_started revealed a problem in its
> use by both the core and the device code. The core code should be able
> to handle the device "starting" while the VM isn't running to handle
> the restoration of migration state. To solve this dual use introduce a
> new helper for use by the vhost-user backends who all use it to feed a
> should_start variable.
>
> We can also pick up a change to vhost_user_blk_set_status while we are
> at it which follows the same pattern.
>
> Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to virtio_device_started)
> Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device)
> Signed-off-by: Alex Bennée
> Cc: "Michael S. Tsirkin"

why is this in this patchset?

> ---
>  include/hw/virtio/virtio.h   | 18 ++
>  hw/block/vhost-user-blk.c    |  6 +-
>  hw/virtio/vhost-user-fs.c    |  2 +-
>  hw/virtio/vhost-user-gpio.c  |  2 +-
>  hw/virtio/vhost-user-i2c.c   |  2 +-
>  hw/virtio/vhost-user-rng.c   |  2 +-
>  hw/virtio/vhost-user-vsock.c |  2 +-
>  hw/virtio/vhost-vsock.c      |  2 +-
>  8 files changed, 25 insertions(+), 11 deletions(-)
>
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index f41b4a7e64..3191c618f3 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -389,6 +389,24 @@ static inline bool virtio_device_started(VirtIODevice *vdev, uint8_t status)
>          return vdev->started;
>      }
>
> +    return status & VIRTIO_CONFIG_S_DRIVER_OK;
> +}
> +
> +/**
> + * virtio_device_should_start() - check if device is startable
> + * @vdev - the VirtIO device
> + * @status - the device's status bits
> + *
> + * This is similar to virtio_device_started() but also encapsulates a
> + * check on the VM status which would prevent a device starting
> + * anyway.
> + */
> +static inline bool virtio_device_should_start(VirtIODevice *vdev, uint8_t status)
> +{
> +    if (vdev->use_started) {
> +        return vdev->started;
> +    }
> +
>      if (!vdev->vm_running) {
>          return false;
>      }
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index 13bf5cc47a..8feaf12e4e 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -222,14 +222,10 @@ static void vhost_user_blk_stop(VirtIODevice *vdev)
>  static void vhost_user_blk_set_status(VirtIODevice *vdev, uint8_t status)
>  {
>      VHostUserBlk *s = VHOST_USER_BLK(vdev);
> -    bool should_start = virtio_device_started(vdev, status);
> +    bool should_start = virtio_device_should_start(vdev, status);
>      Error *local_err = NULL;
>      int ret;
>
> -    if (!vdev->vm_running) {
> -        should_start = false;
> -    }
> -
>      if (!s->connected) {
>          return;
>      }
> diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> index ad0f91c607..1c40f42045 100644
> --- a/hw/virtio/vhost-user-fs.c
> +++ b/hw/virtio/vhost-user-fs.c
> @@ -123,7 +123,7 @@ static void vuf_stop(VirtIODevice *vdev)
>  static void vuf_set_status(VirtIODevice *vdev, uint8_t status)
>  {
>      VHostUserFS *fs = VHOST_USER_FS(vdev);
> -    bool should_start = virtio_device_started(vdev, status);
> +    bool should_start = virtio_device_should_start(vdev, status);
>
>      if (vhost_dev_is_started(&fs->vhost_dev) == should_start) {
>          return;
> diff --git a/hw/virtio/vhost-user-gpio.c b/hw/virtio/vhost-user-gpio.c
> index 8b40fe450c..677d1c7730 100644
> --- a/hw/virtio/vhost-user-gpio.c
> +++ b/hw/virtio/vhost-user-gpio.c
> @@ -152,7 +152,7 @@ static void vu_gpio_stop(VirtIODevice *vdev)
>  static void vu_gpio_set_status(VirtIODevice *vdev, uint8_t status)
>  {
>      VHostUserGPIO *gpio = VHOST_USER_GPIO(vdev);
> -    bool should_start = virtio_device_started(vdev, status);
> +    bool should_start = virtio_device_should_start(vdev, status);
>
>      trace_virtio_gpio_set_status(status);
>
> diff --git a/hw/virtio/vhost-user-i2c.c b/hw/virtio/vhost-user-i2c.c
> index bc58b6c0d1..864eba695e 100644
> --- a/hw/virtio/vhost-user-i2c.c
> +++ b/hw/virtio/vhost-user-i2c.c
> @@ -93,7 +93,7 @@ static void vu_i2c_stop(VirtIODevice *vdev)
>  static void vu_i2c_set_status(VirtIODevice *vdev, uint8_t status)
>  {
>      VHostUserI2C *i2c = VHOST_USER_I2C(vdev);
> -    bool should_start = virtio_device_started(vdev, status);
> +    bool should_start = virtio_device_should_start(vdev, status);
>
>      if (vhost_dev_is_started(&i2c->vhost_dev) == should_start) {
>          return;
> diff --git a/hw/virtio/vhost-user-rng.c b/hw/virtio/vhost-user-rng.c
> index bc1f36c5ac..8b47287875 100644
> --- a/hw/virtio/vhost-user-rng.c
> +++ b/hw/virtio/vhost-user-rng.c
> @@ -90,7 +90,7 @@ static void vu_rng_stop(VirtIODevice *vdev)
>  static void vu_rng_set_status(VirtIODevice *vdev, uint8_t status)
>  {
>      VHostUserRNG *rng = VHOST_USER_RNG(vdev);
> -    bool should_start = virtio_device_started(vdev, status);
> +
[PATCH v1 5/9] hw/virtio: introduce virtio_device_should_start
The previous fix to virtio_device_started revealed a problem in its
use by both the core and the device code. The core code should be able
to handle the device "starting" while the VM isn't running to handle
the restoration of migration state. To solve this dual use introduce a
new helper for use by the vhost-user backends who all use it to feed a
should_start variable.

We can also pick up a change to vhost_user_blk_set_status while we are
at it which follows the same pattern.

Fixes: 9f6bcfd99f (hw/virtio: move vm_running check to virtio_device_started)
Fixes: 27ba7b027f (hw/virtio: add boilerplate for vhost-user-gpio device)
Signed-off-by: Alex Bennée
Cc: "Michael S. Tsirkin"
---
 include/hw/virtio/virtio.h   | 18 ++
 hw/block/vhost-user-blk.c    |  6 +-
 hw/virtio/vhost-user-fs.c    |  2 +-
 hw/virtio/vhost-user-gpio.c  |  2 +-
 hw/virtio/vhost-user-i2c.c   |  2 +-
 hw/virtio/vhost-user-rng.c   |  2 +-
 hw/virtio/vhost-user-vsock.c |  2 +-
 hw/virtio/vhost-vsock.c      |  2 +-
 8 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index f41b4a7e64..3191c618f3 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -389,6 +389,24 @@ static inline bool virtio_device_started(VirtIODevice *vdev, uint8_t status)
         return vdev->started;
     }
 
+    return status & VIRTIO_CONFIG_S_DRIVER_OK;
+}
+
+/**
+ * virtio_device_should_start() - check if device is startable
+ * @vdev - the VirtIO device
+ * @status - the device's status bits
+ *
+ * This is similar to virtio_device_started() but also encapsulates a
+ * check on the VM status which would prevent a device starting
+ * anyway.
+ */
+static inline bool virtio_device_should_start(VirtIODevice *vdev, uint8_t status)
+{
+    if (vdev->use_started) {
+        return vdev->started;
+    }
+
     if (!vdev->vm_running) {
         return false;
     }
diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 13bf5cc47a..8feaf12e4e 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -222,14 +222,10 @@ static void vhost_user_blk_stop(VirtIODevice *vdev)
 static void vhost_user_blk_set_status(VirtIODevice *vdev, uint8_t status)
 {
     VHostUserBlk *s = VHOST_USER_BLK(vdev);
-    bool should_start = virtio_device_started(vdev, status);
+    bool should_start = virtio_device_should_start(vdev, status);
     Error *local_err = NULL;
     int ret;
 
-    if (!vdev->vm_running) {
-        should_start = false;
-    }
-
     if (!s->connected) {
         return;
     }
diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index ad0f91c607..1c40f42045 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -123,7 +123,7 @@ static void vuf_stop(VirtIODevice *vdev)
 static void vuf_set_status(VirtIODevice *vdev, uint8_t status)
 {
     VHostUserFS *fs = VHOST_USER_FS(vdev);
-    bool should_start = virtio_device_started(vdev, status);
+    bool should_start = virtio_device_should_start(vdev, status);
 
     if (vhost_dev_is_started(&fs->vhost_dev) == should_start) {
         return;
diff --git a/hw/virtio/vhost-user-gpio.c b/hw/virtio/vhost-user-gpio.c
index 8b40fe450c..677d1c7730 100644
--- a/hw/virtio/vhost-user-gpio.c
+++ b/hw/virtio/vhost-user-gpio.c
@@ -152,7 +152,7 @@ static void vu_gpio_stop(VirtIODevice *vdev)
 static void vu_gpio_set_status(VirtIODevice *vdev, uint8_t status)
 {
     VHostUserGPIO *gpio = VHOST_USER_GPIO(vdev);
-    bool should_start = virtio_device_started(vdev, status);
+    bool should_start = virtio_device_should_start(vdev, status);
 
     trace_virtio_gpio_set_status(status);
 
diff --git a/hw/virtio/vhost-user-i2c.c b/hw/virtio/vhost-user-i2c.c
index bc58b6c0d1..864eba695e 100644
--- a/hw/virtio/vhost-user-i2c.c
+++ b/hw/virtio/vhost-user-i2c.c
@@ -93,7 +93,7 @@ static void vu_i2c_stop(VirtIODevice *vdev)
 static void vu_i2c_set_status(VirtIODevice *vdev, uint8_t status)
 {
     VHostUserI2C *i2c = VHOST_USER_I2C(vdev);
-    bool should_start = virtio_device_started(vdev, status);
+    bool should_start = virtio_device_should_start(vdev, status);
 
     if (vhost_dev_is_started(&i2c->vhost_dev) == should_start) {
         return;
diff --git a/hw/virtio/vhost-user-rng.c b/hw/virtio/vhost-user-rng.c
index bc1f36c5ac..8b47287875 100644
--- a/hw/virtio/vhost-user-rng.c
+++ b/hw/virtio/vhost-user-rng.c
@@ -90,7 +90,7 @@ static void vu_rng_stop(VirtIODevice *vdev)
 static void vu_rng_set_status(VirtIODevice *vdev, uint8_t status)
 {
     VHostUserRNG *rng = VHOST_USER_RNG(vdev);
-    bool should_start = virtio_device_started(vdev, status);
+    bool should_start = virtio_device_should_start(vdev, status);
 
     if (vhost_dev_is_started(&rng->vhost_dev) == should_start) {
         return;
diff --git a/hw/virtio/vhost-user-vsock.c b/hw/virtio/vhost-user-vsock.c
index 7b67e29d83..9431b9792c 100644
--- a/hw/virtio/vhost-user-vsock.c
+++ b/hw/virtio/vhost-user-vsock.c
@@ -55,7 +55,7 @@
Re: [PATCH-for-7.2 1/2] hw/sd/sdhci: Do not set Buf Wr Ena before writing block (CVE-2022-3872)
On Mon, Nov 7, 2022 at 11:12 PM Philippe Mathieu-Daudé wrote:
>
> When sdhci_write_block_to_card() is called to transfer data from
> the FIFO to the SD bus, the data is already present in the buffer
> and we have to consume it directly.
>
> See the description of the 'Buffer Write Enable' bit from the
> 'Present State' register (prnsts::SDHC_SPACE_AVAILABLE) in Table
> 2.14 from the SDHCI spec v2:
>
>   Buffer Write Enable
>
>   This status is used for non-DMA write transfers.
>
>   The Host Controller can implement multiple buffers to transfer
>   data efficiently. This read only flag indicates if space is
>   available for write data. If this bit is 1, data can be written
>   to the buffer. A change of this bit from 1 to 0 occurs when all
>   the block data is written to the buffer. A change of this bit
>   from 0 to 1 occurs when top of block data can be written to the
>   buffer and generates the Buffer Write Ready interrupt.
>
> In our case, we do not want to overwrite the buffer, so we want
> this bit to be 0, then set it to 1 once the data is written onto
> the bus.
>
> This is probably a copy/paste error from commit d7dfca0807
> ("hw/sdhci: introduce standard SD host controller").
>
> Reproducer:
> https://lore.kernel.org/qemu-devel/caa8xkjxrms0fkr28akvnnpyatm0y0b+5fichpsrhd+mugnu...@mail.gmail.com/
>
> Fixes: CVE-2022-3872
> Reported-by: RivenDell
> Reported-by: Siqi Chen
> Reported-by: ningqiang
> Signed-off-by: Philippe Mathieu-Daudé
> ---
>  hw/sd/sdhci.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/sd/sdhci.c b/hw/sd/sdhci.c
> index 306070c872..f230e7475f 100644
> --- a/hw/sd/sdhci.c
> +++ b/hw/sd/sdhci.c
> @@ -954,7 +954,7 @@ static void sdhci_data_transfer(void *opaque)
>              sdhci_read_block_from_card(s);
>          } else {
>              s->prnsts |= SDHC_DOING_WRITE | SDHC_DAT_LINE_ACTIVE |
> -                         SDHC_SPACE_AVAILABLE | SDHC_DATA_INHIBIT;
> +                         SDHC_DATA_INHIBIT;
>              sdhci_write_block_to_card(s);
>          }
>      }
> --
> 2.38.1
>

Tested-by: Mauro Matteo Cascella

Thank you,

--
Mauro Matteo Cascella
Red Hat Product Security
PGP-Key ID: BB3410B0