Re: [RFC 0/8] virtio,vhost: Add VIRTIO_F_IN_ORDER support
On Mon, Mar 25, 2024 at 5:52 PM Jonah Palmer wrote:
>
> On 3/22/24 7:18 AM, Eugenio Perez Martin wrote:
> > On Thu, Mar 21, 2024 at 4:57 PM Jonah Palmer wrote:
> >>
> >> The goal of these patches is to add support to a variety of virtio and
> >> vhost devices for the VIRTIO_F_IN_ORDER transport feature. This feature
> >> indicates that all buffers are used by the device in the same order in
> >> which they were made available by the driver.
> >>
> >> These patches attempt to implement a generalized, non-device-specific
> >> solution to support this feature.
> >>
> >> The core feature behind this solution is a buffer mechanism in the form
> >> of GLib's GHashTable. The decision behind using a hash table was to
> >> leverage its ability for quick lookup, insertion, and removal
> >> operations. Given that our keys are simply numbers in an ordered
> >> sequence, a hash table seemed like the best choice for a buffer
> >> mechanism.
> >>
> >> -----
> >>
> >> The strategy behind this implementation is as follows:
> >>
> >> We know that buffers that are popped from the available ring and
> >> enqueued for further processing will always be done in the same order
> >> in which they were made available by the driver. Given this, we can
> >> note their order by assigning the resulting VirtQueueElement a key.
> >> This key is a number in a sequence that represents the order in which
> >> they were popped from the available ring, relative to the other
> >> VirtQueueElements.
> >>
> >> For example, given 3 "elements" that were popped from the available
> >> ring, we assign a key value to them which represents their order (elem0
> >> is popped first, then elem1, then lastly elem2):
> >>
> >>    elem2 -- elem1 -- elem0 ---> Enqueue for processing
> >>   (key: 2) (key: 1) (key: 0)
> >>
> >> Then these elements are enqueued for further processing by the host.
> >>
> >> While most devices will return these completed elements in the same
> >> order in which they were enqueued, some devices may not (e.g.
> >> virtio-blk). To guarantee that these elements are put on the used ring
> >> in the same order in which they were enqueued, we can use a buffering
> >> mechanism that keeps track of the next expected sequence number of an
> >> element.
> >>
> >> In other words, if the completed element does not have a key value that
> >> matches the next expected sequence number, then we know this element is
> >> not in-order and we must stash it away in a hash table until an order
> >> can be made. The element's key value is used as the key for placing it
> >> in the hash table.
> >>
> >> If the completed element has a key value that matches the next expected
> >> sequence number, then we know this element is in-order and we can push
> >> it on the used ring. Then we increment the next expected sequence
> >> number and check if the hash table contains an element at this key
> >> location.
> >>
> >> If so, we retrieve this element, push it to the used ring, delete the
> >> key-value pair from the hash table, increment the next expected
> >> sequence number, and check the hash table again for an element at this
> >> new key location. This process is repeated until we're unable to find
> >> an element in the hash table to continue the order.
> >>
> >> So, for example, say the 3 elements we enqueued were completed in the
> >> following order: elem1, elem2, elem0. The next expected sequence number
> >> is 0:
> >>
> >> exp-seq-num = 0:
> >>
> >>  elem1 --> elem1.key == exp-seq-num ? --> No, stash it
> >> (key: 1)                  |
> >>                           v
> >>                    |key: 1 - elem1|
> >>
> >> -----
> >> exp-seq-num = 0:
> >>
> >>  elem2 --> elem2.key == exp-seq-num ? --> No, stash it
> >> (key: 2)                  |
> >>                           v
> >>                    |key: 1 - elem1|
> >>                    |--------------|
> >>                    |key: 2 - elem2|
> >>
> >> -----
> >> exp-seq-num = 0:
> >>
> >>  elem0 --> elem0.key == exp-seq-num ? --> Yes, push to used ring
> >> (key: 0)
> >>
> >> exp-seq-num = 1:
> >>
> >>  lookup(table, exp-seq-num) != NULL ? --> Yes, push to used ring,
> >>                                           remove elem from table
> >>
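The stash-and-replay loop the cover letter describes can be modeled outside QEMU. Below is a minimal Python sketch in which a plain dict stands in for GLib's GHashTable; the function names (`make_reorder_buffer`, `complete`, `push` onto `used_ring`) are hypothetical, not the series' actual API:

```python
def make_reorder_buffer():
    """Model of the VIRTIO_F_IN_ORDER used-ring ordering logic."""
    state = {"exp_seq": 0, "stash": {}}  # 'stash' plays the role of the GHashTable

    def complete(elem_key, used_ring):
        if elem_key != state["exp_seq"]:
            # Out-of-order completion: stash under its sequence-number key.
            state["stash"][elem_key] = elem_key
            return
        # In-order: push it, then drain any consecutive stashed elements.
        used_ring.append(elem_key)
        state["exp_seq"] += 1
        while state["exp_seq"] in state["stash"]:
            used_ring.append(state["stash"].pop(state["exp_seq"]))
            state["exp_seq"] += 1

    return complete

# The cover letter's example: completions arrive as elem1, elem2, elem0.
used = []
complete = make_reorder_buffer()
for key in (1, 2, 0):
    complete(key, used)
print(used)  # [0, 1, 2] -- the order the driver made the buffers available
```

Elem1 and elem2 sit in the dict until elem0 arrives; pushing elem0 then drains both in one pass, matching the diagrammed walk-through.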
Re: [PATCH v3 0/4] fix two edge cases related to stream block jobs
On Fri, Mar 22, 2024 at 10:50:05AM +0100, Fiona Ebner wrote:
> Changes in v3:
> * Also deal with edge case in bdrv_next_cleanup(). Haven't run
>   into an actual issue there, but at least the caller in
>   migration/block.c uses bdrv_nb_sectors() which, while not a
>   coroutine wrapper itself (it's written manually), may call
>   bdrv_refresh_total_sectors(), which is a generated coroutine
>   wrapper, so AFAIU, the block graph can change during that call.
>   And even without that, it's just better to be more consistent
>   with bdrv_next().
>
> Changes in v2:
> * Ran into another issue while writing the IO test Stefan wanted
>   to have (good call :)), so include a fix for that and add the
>   test. I didn't notice during manual testing, because I hadn't
>   used a scripted QMP 'quit', so there was no race.
>
> Fiona Ebner (3):
>   block-backend: fix edge case in bdrv_next() where BDS associated to BB
>     changes
>   block-backend: fix edge case in bdrv_next_cleanup() where BDS
>     associated to BB changes
>   iotests: add test for stream job with an unaligned prefetch read
>
> Stefan Reiter (1):
>   block/io: accept NULL qiov in bdrv_pad_request
>
>  block/block-backend.c                    | 18 ++--
>  block/io.c                               | 31 ---
>  .../tests/stream-unaligned-prefetch      | 86 +++
>  .../tests/stream-unaligned-prefetch.out  |  5 ++
>  4 files changed, 117 insertions(+), 23 deletions(-)
>  create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
>  create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out

Looks good to me. I will wait until Thursday before merging in case
Hanna, Vladimir, or Kevin have comments.

Thanks!

Stefan

signature.asc
Description: PGP signature
Re: [PATCH v3 4/4] iotests: add test for stream job with an unaligned prefetch read
On Fri, Mar 22, 2024 at 10:50:09AM +0100, Fiona Ebner wrote:
> Previously, bdrv_pad_request() could not deal with a NULL qiov when
> a read needed to be aligned. During prefetch, a stream job will pass a
> NULL qiov. Add a test case to cover this scenario.
>
> By accident, also covers a previous race during shutdown, where block
> graph changes during iteration in bdrv_flush_all() could lead to
> unreferencing the wrong block driver state and an assertion failure
> later.
>
> Signed-off-by: Fiona Ebner
> ---
>
> No changes in v3.
> New in v2.
>
>  .../tests/stream-unaligned-prefetch      | 86 +++
>  .../tests/stream-unaligned-prefetch.out  |  5 ++
>  2 files changed, 91 insertions(+)
>  create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
>  create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out

Reviewed-by: Stefan Hajnoczi

signature.asc
Description: PGP signature
Re: [PATCH v3 3/4] block-backend: fix edge case in bdrv_next_cleanup() where BDS associated to BB changes
On Fri, Mar 22, 2024 at 10:50:08AM +0100, Fiona Ebner wrote:
> Same rationale as for commit "block-backend: fix edge case in
> bdrv_next() where BDS associated to BB changes". The block graph might
> change between the bdrv_next() call and the bdrv_next_cleanup() call,
> so it could be that the associated BDS is not the same that was
> referenced previously anymore. Instead, rely on bdrv_next() to set
> it->bs to the BDS it referenced and unreference that one in any case.
>
> Signed-off-by: Fiona Ebner
> ---
>
> New in v3.
>
>  block/block-backend.c | 11 ---
>  1 file changed, 4 insertions(+), 7 deletions(-)

Reviewed-by: Stefan Hajnoczi

signature.asc
Description: PGP signature
Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
On Fri, Mar 22, 2024 at 10:50:07AM +0100, Fiona Ebner wrote:
> The old_bs variable in bdrv_next() is currently determined by looking
> at the old block backend. However, if the block graph changes before
> the next bdrv_next() call, it might be that the associated BDS is not
> the same that was referenced previously. In that case, the wrong BDS
> is unreferenced, leading to an assertion failure later:
>
> > bdrv_unref: Assertion `bs->refcnt > 0' failed.
>
> In particular, this can happen in the context of bdrv_flush_all(),
> when polling for bdrv_co_flush() in the generated co-wrapper leads to
> a graph change (for example with a stream block job [0]).
>
> A racy reproducer:
>
> > #!/bin/bash
> > rm -f /tmp/backing.qcow2
> > rm -f /tmp/top.qcow2
> > ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M
> > ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> > ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> > ./qemu-system-x86_64 --qmp stdio \
> > --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> > <<EOF
> > {"execute": "qmp_capabilities"}
> > {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node0" } }
> > {"execute": "quit"}
> > EOF
>
> [0]:
>
> > #0  bdrv_replace_child_tran (child=..., new_bs=..., tran=...)
> > #1  bdrv_replace_node_noperm (from=..., to=..., auto_skip=..., tran=..., errp=...)
> > #2  bdrv_replace_node_common (from=..., to=..., auto_skip=..., detach_subchain=..., errp=...)
> > #3  bdrv_drop_filter (bs=..., errp=...)
> > #4  bdrv_cor_filter_drop (cor_filter_bs=...)
> > #5  stream_prepare (job=...)
> > #6  job_prepare_locked (job=...)
> > #7  job_txn_apply_locked (fn=..., job=...)
> > #8  job_do_finalize_locked (job=...)
> > #9  job_exit (opaque=...)
> > #10 aio_bh_poll (ctx=...)
> > #11 aio_poll (ctx=..., blocking=...)
> > #12 bdrv_poll_co (s=...)
> > #13 bdrv_flush (bs=...)
> > #14 bdrv_flush_all ()
> > #15 do_vm_stop (state=..., send_stop=...)
> > #16 vm_shutdown ()
>
> Signed-off-by: Fiona Ebner
> ---
>
> No changes in v3.
> New in v2.
>
>  block/block-backend.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)

Reviewed-by: Stefan Hajnoczi

signature.asc
Description: PGP signature
Re: [PATCH v3 1/4] block/io: accept NULL qiov in bdrv_pad_request
On Fri, Mar 22, 2024 at 10:50:06AM +0100, Fiona Ebner wrote:
> From: Stefan Reiter
>
> Some operations, e.g. block-stream, perform reads while discarding the
> results (only copy-on-read matters). In this case, they will pass NULL
> as the target QEMUIOVector, which will however trip bdrv_pad_request,
> since it wants to extend its passed vector. In particular, this is the
> case for the blk_co_preadv() call in stream_populate().
>
> If there is no qiov, no operation can be done with it, but the bytes
> and offset still need to be updated, so the subsequent aligned read
> will actually be aligned and not run into an assertion failure.
>
> In particular, this can happen when the request alignment of the top
> node is larger than the allocated part of the bottom node, in which
> case padding becomes necessary. For example:
>
> > ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> > ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> > ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> > ./qemu-system-x86_64 --qmp stdio \
> > --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> > <<EOF
> > {"execute": "qmp_capabilities"}
> > {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": "node0", "node-name": "node1" } }
> > {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node1" } }
> > EOF
>
> Originally-by: Stefan Reiter
> Signed-off-by: Thomas Lamprecht
> [FE: do update bytes and offset in any case
>      add reproducer to commit message]
> Signed-off-by: Fiona Ebner
> ---
>
> No changes in v3.
> No changes in v2.
>
>  block/io.c | 31 +++
>  1 file changed, 19 insertions(+), 12 deletions(-)

Reviewed-by: Stefan Hajnoczi

signature.asc
Description: PGP signature
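The "update bytes and offset in any case" step described above boils down to alignment arithmetic: widen the byte range to the alignment boundaries of the node. A Python sketch of that arithmetic (the function name is hypothetical; the real bdrv_pad_request is C and also manages the padded iovec, which is exactly what it must skip when qiov is NULL):

```python
def pad_request(offset, nbytes, align):
    """Widen [offset, offset + nbytes) to multiples of 'align', as an
    aligned read requires. Loosely mirrors the offset/bytes update that
    must happen even when there is no caller iovec to extend."""
    head = offset % align                                # padding before the request
    tail = (align - (offset + nbytes) % align) % align   # padding after it
    if head == 0 and tail == 0:
        return offset, nbytes                            # already aligned
    return offset - head, nbytes + head + tail

# A 1-byte read at offset 0 against a 32 KiB request alignment (as in
# the reproducer's cluster_size=32768) must grow to one full cluster:
print(pad_request(0x0, 0x1, 32768))  # -> (0, 32768)
```

Without this widening, the subsequent read would hit the alignment assertion the commit message mentions.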
Re: [RFC 4/8] virtio: Implement in-order handling for virtio devices
On Mon, Mar 25, 2024 at 6:35 PM Jonah Palmer wrote: > > > > On 3/22/24 6:46 AM, Eugenio Perez Martin wrote: > > On Thu, Mar 21, 2024 at 4:57 PM Jonah Palmer > > wrote: > >> > >> Implements in-order handling for most virtio devices using the > >> VIRTIO_F_IN_ORDER transport feature, specifically those who call > >> virtqueue_push to push their used elements onto the used ring. > >> > >> The logic behind this implementation is as follows: > >> > >> 1.) virtqueue_pop always enqueues VirtQueueElements in-order. > >> > >> virtqueue_pop always retrieves one or more buffer descriptors in-order > >> from the available ring and converts them into a VirtQueueElement. This > >> means that the order in which VirtQueueElements are enqueued are > >> in-order by default. > >> > >> By virtue, as VirtQueueElements are created, we can assign a sequential > >> key value to them. This preserves the order of buffers that have been > >> made available to the device by the driver. > >> > >> As VirtQueueElements are assigned a key value, the current sequence > >> number is incremented. > >> > >> 2.) Requests can be completed out-of-order. > >> > >> While most devices complete requests in the same order that they were > >> enqueued by default, some devices don't (e.g. virtio-blk). The goal of > >> this out-of-order handling is to reduce the impact of devices that > >> process elements in-order by default while also guaranteeing compliance > >> with the VIRTIO_F_IN_ORDER feature. > >> > >> Below is the logic behind handling completed requests (which may or may > >> not be in-order). > >> > >> 3.) Does the incoming used VirtQueueElement preserve the correct order? > >> > >> In other words, is the sequence number (key) assigned to the > >> VirtQueueElement the expected number that would preserve the original > >> order? > >> > >> 3a.) > >> If it does... immediately push the used element onto the used ring. 
> >> Then increment the next expected sequence number and check to see if > >> any previous out-of-order VirtQueueElements stored on the hash table > >> has a key that matches this next expected sequence number. > >> > >> For each VirtQueueElement found on the hash table with a matching key: > >> push the element on the used ring, remove the key-value pair from the > >> hash table, and then increment the next expected sequence number. Repeat > >> this process until we're unable to find an element with a matching key. > >> > >> Note that if the device uses batching (e.g. virtio-net), then we skip > >> the virtqueue_flush call and let the device call it themselves. > >> > >> 3b.) > >> If it does not... stash the VirtQueueElement, along with relevant data, > >> as a InOrderVQElement on the hash table. The key used is the order_key > >> that was assigned when the VirtQueueElement was created. > >> > >> Signed-off-by: Jonah Palmer > >> --- > >> hw/virtio/virtio.c | 70 -- > >> include/hw/virtio/virtio.h | 8 + > >> 2 files changed, 76 insertions(+), 2 deletions(-) > >> > >> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c > >> index 40124545d6..40e4377f1e 100644 > >> --- a/hw/virtio/virtio.c > >> +++ b/hw/virtio/virtio.c > >> @@ -992,12 +992,56 @@ void virtqueue_flush(VirtQueue *vq, unsigned int > >> count) > >> } > >> } > >> > >> +void virtqueue_order_element(VirtQueue *vq, const VirtQueueElement *elem, > >> + unsigned int len, unsigned int idx, > >> + unsigned int count) > >> +{ > >> +InOrderVQElement *in_order_elem; > >> + > >> +if (elem->order_key == vq->current_order_idx) { > >> +/* Element is in-order, push to used ring */ > >> +virtqueue_fill(vq, elem, len, idx); > >> + > >> +/* Batching? Don't flush */ > >> +if (count) { > >> +virtqueue_flush(vq, count); > > > > The "count" parameter is the number of heads used, but here you're > > only using one head (elem). Same with the other virtqueue_flush in the > > function. > > > > True. 
This acts more as a flag than an actual count since, unless we're > batching (which in the current setup, the device would explicitly call > virtqueue_flush separately), this value will be either 0 or 1. > If possible, I think it is better to keep consistent with the current API: fill+flush for batching, push for a single shot. > > Also, this function sometimes replaces virtqueue_fill and other > > replaces virtqueue_fill + virtqueue_flush (both examples in patch > > 6/8). I have the impression the series would be simpler if > > virtqueue_order_element is a static function just handling the > > virtio_vdev_has_feature(vq->vdev, VIRTIO_F_IN_ORDER) path of > > virtqueue_fill, so the caller does not need to know if the in_order > > feature is on or off. > > > > Originally I wanted this function to replace virtqueue_fill + > virtqueue_flush but after looking at virtio_net_receive_rcu and > vhost_svq_flush, where multiple
Re: [RFC 1/8] virtio: Define InOrderVQElement
On Mon, Mar 25, 2024 at 6:08 PM Jonah Palmer wrote: > > > > On 3/22/24 5:45 AM, Eugenio Perez Martin wrote: > > On Thu, Mar 21, 2024 at 4:57 PM Jonah Palmer > > wrote: > >> > >> Define the InOrderVQElement structure for the VIRTIO_F_IN_ORDER > >> transport feature implementation. > >> > >> The InOrderVQElement structure is used to encapsulate out-of-order > >> VirtQueueElement data that was processed by the host. This data > >> includes: > >> - The processed VirtQueueElement (elem) > >> - Length of data (len) > >> - VirtQueueElement array index (idx) > >> - Number of processed VirtQueueElements (count) > >> > >> InOrderVQElements will be stored in a buffering mechanism until an > >> order can be achieved. > >> > >> Signed-off-by: Jonah Palmer > >> --- > >> include/hw/virtio/virtio.h | 7 +++ > >> 1 file changed, 7 insertions(+) > >> > >> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h > >> index b3c74a1bca..c8aa435a5e 100644 > >> --- a/include/hw/virtio/virtio.h > >> +++ b/include/hw/virtio/virtio.h > >> @@ -77,6 +77,13 @@ typedef struct VirtQueueElement > >> struct iovec *out_sg; > >> } VirtQueueElement; > >> > >> +typedef struct InOrderVQElement { > >> +const VirtQueueElement *elem; > > > > Some subsystems allocate space for extra elements after > > VirtQueueElement, like VirtIOBlockReq. You can request virtqueue_pop > > to allocate this extra space by its second argument. Would it work for > > this? > > > > I don't see why not. Although this may not be necessary due to me > missing a key aspect mentioned in your comment below. > > >> +unsigned int len; > >> +unsigned int idx; > >> +unsigned int count; > > > > Now I don't get why these fields cannot be obtained from elem->(len, > > index, ndescs) ? > > > > Interesting. I didn't realize that these values are equivalent to a > VirtQueueElement's len, index, and ndescs fields. > > Is this always true? 
Else I would've expected, for example,
> virtqueue_push to not need the 'unsigned int len' parameter if this
> information is already included via the VirtQueueElement being passed in.
>

The code uses "len" to store the written length values of each used
descriptor between virtqueue_fill and virtqueue_flush. But not all
devices use these separately, only the ones that batch: virtio-net and
SVQ.

A smarter / less simple implementation of virtqueue_push could
certainly avoid storing elem->len. But the performance gain is probably
tiny, and the code complexity grows.

> >> +} InOrderVQElement;
> >> +
> >>  #define VIRTIO_QUEUE_MAX 1024
> >>
> >>  #define VIRTIO_NO_VECTOR 0x
> >> --
> >> 2.39.3
Re: [RFC 4/8] virtio: Implement in-order handling for virtio devices
On 3/22/24 6:46 AM, Eugenio Perez Martin wrote:

On Thu, Mar 21, 2024 at 4:57 PM Jonah Palmer wrote:

Implements in-order handling for most virtio devices using the
VIRTIO_F_IN_ORDER transport feature, specifically those who call
virtqueue_push to push their used elements onto the used ring.

The logic behind this implementation is as follows:

1.) virtqueue_pop always enqueues VirtQueueElements in-order.

virtqueue_pop always retrieves one or more buffer descriptors in-order
from the available ring and converts them into a VirtQueueElement. This
means that the order in which VirtQueueElements are enqueued is
in-order by default.

By virtue, as VirtQueueElements are created, we can assign a sequential
key value to them. This preserves the order of buffers that have been
made available to the device by the driver.

As VirtQueueElements are assigned a key value, the current sequence
number is incremented.

2.) Requests can be completed out-of-order.

While most devices complete requests in the same order that they were
enqueued by default, some devices don't (e.g. virtio-blk). The goal of
this out-of-order handling is to reduce the impact on devices that
process elements in-order by default while also guaranteeing compliance
with the VIRTIO_F_IN_ORDER feature.

Below is the logic behind handling completed requests (which may or may
not be in-order).

3.) Does the incoming used VirtQueueElement preserve the correct order?

In other words, is the sequence number (key) assigned to the
VirtQueueElement the expected number that would preserve the original
order?

3a.)
If it does... immediately push the used element onto the used ring.
Then increment the next expected sequence number and check to see if
any previous out-of-order VirtQueueElements stored on the hash table
have a key that matches this next expected sequence number.

For each VirtQueueElement found on the hash table with a matching key:
push the element on the used ring, remove the key-value pair from the
hash table, and then increment the next expected sequence number.
Repeat this process until we're unable to find an element with a
matching key.

Note that if the device uses batching (e.g. virtio-net), then we skip
the virtqueue_flush call and let the device call it themselves.

3b.)
If it does not... stash the VirtQueueElement, along with relevant data,
as an InOrderVQElement on the hash table. The key used is the order_key
that was assigned when the VirtQueueElement was created.

Signed-off-by: Jonah Palmer
---
 hw/virtio/virtio.c         | 70 --
 include/hw/virtio/virtio.h |  8 +
 2 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 40124545d6..40e4377f1e 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -992,12 +992,56 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
     }
 }

+void virtqueue_order_element(VirtQueue *vq, const VirtQueueElement *elem,
+                             unsigned int len, unsigned int idx,
+                             unsigned int count)
+{
+    InOrderVQElement *in_order_elem;
+
+    if (elem->order_key == vq->current_order_idx) {
+        /* Element is in-order, push to used ring */
+        virtqueue_fill(vq, elem, len, idx);
+
+        /* Batching? Don't flush */
+        if (count) {
+            virtqueue_flush(vq, count);

The "count" parameter is the number of heads used, but here you're
only using one head (elem). Same with the other virtqueue_flush in the
function.

True. This acts more as a flag than an actual count since, unless we're
batching (which in the current setup, the device would explicitly call
virtqueue_flush separately), this value will be either 0 or 1.

Also, this function sometimes replaces virtqueue_fill and other times
replaces virtqueue_fill + virtqueue_flush (both examples in patch 6/8).
I have the impression the series would be simpler if
virtqueue_order_element is a static function just handling the
virtio_vdev_has_feature(vq->vdev, VIRTIO_F_IN_ORDER) path of
virtqueue_fill, so the caller does not need to know if the in_order
feature is on or off.

Originally I wanted this function to replace virtqueue_fill +
virtqueue_flush but after looking at virtio_net_receive_rcu and
vhost_svq_flush, where multiple virtqueue_fill's can be called before a
single virtqueue_flush, I added this 'if (count)' conditional to handle
both cases.

I did consider virtqueue_order_element just handling the virtqueue_fill
path but then I wasn't sure how to handle calling virtqueue_flush when
retrieving out-of-order data from the hash table.

For example, devices that call virtqueue_push would call virtqueue_fill
and then virtqueue_flush afterwards. In the scenario where, say, elem1
was found out of order and put into the hash table, and then elem0
comes along. For elem0 we'd call virtqueue_fill and then we should call
virtqueue_flush to keep the order going. Then we'd
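The fill/flush vs push contract being debated above can be modeled abstractly: virtqueue_fill records a used element without making it visible, virtqueue_flush publishes `count` previously filled heads at once, and virtqueue_push is the single-shot fill + flush(1) path. A toy Python model (hypothetical class; the real QEMU functions also update guest ring memory and indices):

```python
class VirtQueueModel:
    """Toy model of the virtqueue_fill/virtqueue_flush/virtqueue_push split."""

    def __init__(self):
        self.pending = []   # filled but not yet visible to the driver
        self.used = []      # published used ring

    def fill(self, elem):
        # Record a used element; the driver cannot see it yet.
        self.pending.append(elem)

    def flush(self, count):
        # Publish 'count' previously filled heads in one go, as a
        # batching device (e.g. virtio-net or SVQ) does.
        self.used.extend(self.pending[:count])
        del self.pending[:count]

    def push(self, elem):
        # Single-shot path: fill + flush(1).
        self.fill(elem)
        self.flush(1)

vq = VirtQueueModel()
vq.fill("a")
vq.fill("b")
vq.flush(2)    # batch: both heads become visible at once
vq.push("c")   # one-shot
print(vq.used)  # ['a', 'b', 'c']
```

This is why "count" in the review above is a number of heads, not a flag: a correct flush after draining the hash table would pass however many elements were filled since the last flush.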
Re: [RFC 1/8] virtio: Define InOrderVQElement
On 3/22/24 5:45 AM, Eugenio Perez Martin wrote:

On Thu, Mar 21, 2024 at 4:57 PM Jonah Palmer wrote:

Define the InOrderVQElement structure for the VIRTIO_F_IN_ORDER
transport feature implementation.

The InOrderVQElement structure is used to encapsulate out-of-order
VirtQueueElement data that was processed by the host. This data
includes:
- The processed VirtQueueElement (elem)
- Length of data (len)
- VirtQueueElement array index (idx)
- Number of processed VirtQueueElements (count)

InOrderVQElements will be stored in a buffering mechanism until an
order can be achieved.

Signed-off-by: Jonah Palmer
---
 include/hw/virtio/virtio.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index b3c74a1bca..c8aa435a5e 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -77,6 +77,13 @@ typedef struct VirtQueueElement
     struct iovec *out_sg;
 } VirtQueueElement;

+typedef struct InOrderVQElement {
+    const VirtQueueElement *elem;

Some subsystems allocate space for extra elements after
VirtQueueElement, like VirtIOBlockReq. You can request virtqueue_pop
to allocate this extra space by its second argument. Would it work for
this?

I don't see why not. Although this may not be necessary due to me
missing a key aspect mentioned in your comment below.

+    unsigned int len;
+    unsigned int idx;
+    unsigned int count;

Now I don't get why these fields cannot be obtained from elem->(len,
index, ndescs) ?

Interesting. I didn't realize that these values are equivalent to a
VirtQueueElement's len, index, and ndescs fields.

Is this always true? Else I would've expected, for example,
virtqueue_push to not need the 'unsigned int len' parameter if this
information is already included via the VirtQueueElement being passed in.

+} InOrderVQElement;
+
 #define VIRTIO_QUEUE_MAX 1024

 #define VIRTIO_NO_VECTOR 0x
--
2.39.3
Re: how do the iotests pick a machine model to run on ?
On 19/01/2024 17.18, Peter Maydell wrote: On Fri, 19 Jan 2024 at 15:26, Peter Maydell wrote: (Also, we should probably put an entry for sh4 in machine_map, because the default board type (shix) is about to be deprecated, and the r2d board type is thus a better choice.) The good news is if we add r2d to the machine_map, then only 3 iotests fail: 191 -- not sure exactly what's going on. QEMU complains "machine type does not support if=ide,bus=0,unit=1". Side note: the test harness seems to throw away the stderr from QEMU with this error message, leaving the test failure log rather uninformative. I had to run everything under strace to get hold of it. 203 -- this wants a machine type that can be migrated; sh4 CPUs don't support migration, so the test fails because the 'migrate' command returns the error {"error": {"class": "GenericError", "desc": "State blocked by non-migratable device 'cpu'"}} 267 -- similarly, wants a machine that supports snapshots, so fails when the loadvm/savevm get the error Error: State blocked by non-migratable device 'cpu' How should a test indicate "I need a machine type that supports migration" ? We could maybe add a flag to the machine_map to indicate whether the machine is capable of migration or not. In the latter case, we could skip all tests that are in the "migration" group ? Thomas
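Thomas's suggestion could look roughly like the sketch below. This is purely hypothetical: the field names (`machine`, `migratable`) and the `skip_test` helper are invented for illustration, and the real machine_map in the iotests harness is a simpler per-architecture mapping:

```python
# Hypothetical extension of the iotests machine_map: pair each default
# machine type with whether its CPUs support migration, so tests in the
# "migration" group could be skipped automatically.
machine_map = {
    "sh4": {"machine": "r2d", "migratable": False},
    # ... entries for other architectures would go here ...
}

def skip_test(arch, test_groups):
    """Return True if the test needs migration but the machine can't do it."""
    entry = machine_map.get(arch)
    if entry is None:
        return False  # no special machine entry: assume nothing to skip
    return "migration" in test_groups and not entry["migratable"]

# sh4/r2d would then skip tests like 203 and 267, which loadvm/savevm
# or migrate and hit "State blocked by non-migratable device 'cpu'".
print(skip_test("sh4", {"migration", "quick"}))  # True
```

A test would still need some way to declare the requirement, e.g. by being tagged in the "migration" group as suggested above.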
Re: [RFC 0/8] virtio,vhost: Add VIRTIO_F_IN_ORDER support
On 3/22/24 7:18 AM, Eugenio Perez Martin wrote:

On Thu, Mar 21, 2024 at 4:57 PM Jonah Palmer wrote:

The goal of these patches is to add support to a variety of virtio and
vhost devices for the VIRTIO_F_IN_ORDER transport feature. This feature
indicates that all buffers are used by the device in the same order in
which they were made available by the driver.

These patches attempt to implement a generalized, non-device-specific
solution to support this feature.

The core feature behind this solution is a buffer mechanism in the form
of GLib's GHashTable. The decision behind using a hash table was to
leverage its ability for quick lookup, insertion, and removal
operations. Given that our keys are simply numbers in an ordered
sequence, a hash table seemed like the best choice for a buffer
mechanism.

-----

The strategy behind this implementation is as follows:

We know that buffers that are popped from the available ring and
enqueued for further processing will always be done in the same order
in which they were made available by the driver. Given this, we can
note their order by assigning the resulting VirtQueueElement a key.
This key is a number in a sequence that represents the order in which
they were popped from the available ring, relative to the other
VirtQueueElements.

For example, given 3 "elements" that were popped from the available
ring, we assign a key value to them which represents their order (elem0
is popped first, then elem1, then lastly elem2):

   elem2 -- elem1 -- elem0 ---> Enqueue for processing
  (key: 2) (key: 1) (key: 0)

Then these elements are enqueued for further processing by the host.

While most devices will return these completed elements in the same
order in which they were enqueued, some devices may not (e.g.
virtio-blk). To guarantee that these elements are put on the used ring
in the same order in which they were enqueued, we can use a buffering
mechanism that keeps track of the next expected sequence number of an
element.

In other words, if the completed element does not have a key value that
matches the next expected sequence number, then we know this element is
not in-order and we must stash it away in a hash table until an order
can be made. The element's key value is used as the key for placing it
in the hash table.

If the completed element has a key value that matches the next expected
sequence number, then we know this element is in-order and we can push
it on the used ring. Then we increment the next expected sequence
number and check if the hash table contains an element at this key
location.

If so, we retrieve this element, push it to the used ring, delete the
key-value pair from the hash table, increment the next expected
sequence number, and check the hash table again for an element at this
new key location. This process is repeated until we're unable to find
an element in the hash table to continue the order.

So, for example, say the 3 elements we enqueued were completed in the
following order: elem1, elem2, elem0. The next expected sequence number
is 0:

exp-seq-num = 0:

 elem1 --> elem1.key == exp-seq-num ? --> No, stash it
(key: 1)                  |
                          v
                   |key: 1 - elem1|

-----
exp-seq-num = 0:

 elem2 --> elem2.key == exp-seq-num ? --> No, stash it
(key: 2)                  |
                          v
                   |key: 1 - elem1|
                   |--------------|
                   |key: 2 - elem2|

-----
exp-seq-num = 0:

 elem0 --> elem0.key == exp-seq-num ? --> Yes, push to used ring
(key: 0)

exp-seq-num = 1:

 lookup(table, exp-seq-num) != NULL ? --> Yes, push to used ring,
                                          remove elem from table
        |
        v
 |key: 2 - elem2|

exp-seq-num = 2:

 lookup(table, exp-seq-num) != NULL ? --> Yes, push to used ring,
                                          remove elem from table
        |
Re: [PATCH] qapi/block-core: improve Qcow2OverlapCheckFlags documentation
On 25.03.24 16:04, Markus Armbruster wrote:

Vladimir Sementsov-Ogievskiy writes:

Most fields have no description at all. Let's fix that. Still, no
reason to place here more detailed descriptions of what these
structures are, as we have the public Qcow2 format specification.

Signed-off-by: Vladimir Sementsov-Ogievskiy
---
 qapi/block-core.json | 25 +
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 1874f880a8..b9fb994a66 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3403,14 +3403,31 @@
 # @Qcow2OverlapCheckFlags:
 #
 # Structure of flags for each metadata structure. Setting a field to
-# 'true' makes qemu guard that structure against unintended
-# overwriting. The default value is chosen according to the template
-# given.
+# 'true' makes qemu guard that Qcow2 format structure against

Mind if I use the occasion to correct the spelling of QEMU?

No problem, thanks for fixing

--
Best regards,
Vladimir
Re: [PATCH v1] coroutine: avoid inserting duplicate coroutine to co_queue_wakeup
I've seen (and agree with) Stefan's reply that a more thorough audit is
needed, but am still providing a preliminary review based on what I see
here.

On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang via wrote:
> If g_main_loop_run()/aio_poll() is called in the coroutine context,
> the pending coroutine may be woken up repeatedly, and the co_queue_wakeup
> may be disordered.

coroutine context should not be doing anything that can block; you may
have uncovered a bigger bug that we need to address.

>
> When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means
> some listened events is completed. Therefore, the completion callback
> function is dispatched.
>
> If this callback function needs to invoke aio_co_enter(), it will only
> wake up the coroutine (because we are already in coroutine context),
> which may cause that the data on this listening event_fd/socket_fd
> is not read/cleared. When the next poll () exits, it will be woken up again
> and inserted into the wakeup queue again.
>
> For example, if TLS is enabled in NBD, the server will call g_main_loop_run()
> in the coroutine, and repeatedly wake up the io_read event on a socket.
> The call stack is as follows:
>
>   aio_co_enter()
>   aio_co_wake()
>   qio_channel_restart_read()
>   aio_dispatch_handler()
>   aio_dispatch_handlers()
>   aio_dispatch()
>   aio_ctx_dispatch()
>   g_main_context_dispatch()
>   g_main_loop_run()
>   nbd_negotiate_handle_starttls()
>   nbd_negotiate_options()
>   nbd_negotiate()
>   nbd_co_client_start()
>   coroutine_trampoline()
>
> Signed-off-by: zhuyangyang
> ---
>  util/async.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/util/async.c b/util/async.c
> index 0467890052..25fc1e6083 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -705,7 +705,18 @@ void aio_co_enter(AioContext *ctx, Coroutine *co)
>      if (qemu_in_coroutine()) {
>          Coroutine *self = qemu_coroutine_self();
>          assert(self != co);
> -        QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
> +        /*
> +         * If the Coroutine *co is already in the co_queue_wakeup, this
> +         * repeated insertion will causes the loss of other queue element

s/causes/cause/

> +         * or infinite loop.
> +         * For examplex:

s/examplex/example/

> +         * Head->a->b->c->NULL, after insert_tail(head, b) => Head->a->b->NULL
> +         * Head->a->b>->NULL, after insert_tail(head, b) => Head->a->b->b...

s/b>->/b->/

> +         */
> +        if (!co->co_queue_next.sqe_next &&
> +            self->co_queue_wakeup.sqh_last != &co->co_queue_next.sqe_next) {
> +            QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
> +        }
>      } else {
>          qemu_aio_coroutine_enter(ctx, co);
>      }

Intuitively, attacking the symptoms (avoiding bogus list insertion when
it is already on the list) makes sense; but I want to make sure we
attack the root cause.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org
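The corruption discussed in the patch comment can be reproduced with a hand-rolled singly linked tail queue that mirrors QSIMPLEQ's first/last-pointer layout, together with the membership test that the patch's condition expresses (the patch inserts only when the negation of this test holds). This is an illustrative sketch under that assumption; `Node`, `Head`, `insert_tail`, and `already_queued` are made-up names, not QEMU's queue.h API.

```c
#include <assert.h>
#include <stddef.h>

/*
 * A singly linked tail queue with the same shape as QSIMPLEQ:
 * 'last' points at the NULL-terminating 'next' slot of the tail node
 * (or at 'first' while the queue is empty).
 */
typedef struct Node {
    struct Node *next;
} Node;

typedef struct Head {
    Node *first;
    Node **last;
} Head;

static void head_init(Head *h)
{
    h->first = NULL;
    h->last = &h->first;
}

static void insert_tail(Head *h, Node *n)
{
    n->next = NULL;
    *h->last = n;        /* link n after the current tail */
    h->last = &n->next;  /* n becomes the new tail */
}

/*
 * A node is already queued if it has a successor, or if it is the tail
 * (the head's 'last' points at its own 'next' slot).  The guard added
 * by the patch is the negation of this test.
 */
static int already_queued(Head *h, Node *n)
{
    return n->next != NULL || h->last == &n->next;
}
```

Inserting the tail node a second time makes `*h->last = n` write into the node's own `next` slot, so the tail loops back to itself; with the guard, the duplicate insert would be skipped and the list would stay Head->a->b->NULL.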
Re: [PATCH] qapi/block-core: improve Qcow2OverlapCheckFlags documentation
Squashing in

diff --git a/qapi/pragma.json b/qapi/pragma.json
index 99e4052ab3..9e28de1721 100644
--- a/qapi/pragma.json
+++ b/qapi/pragma.json
@@ -72,7 +72,6 @@
     'QCryptoAkCipherKeyType',
     'QCryptodevBackendServiceType',
     'QKeyCode',
-    'Qcow2OverlapCheckFlags',
     'RbdAuthMode',
     'RbdImageEncryptionFormat',
     'String',
Re: [PATCH v1] coroutine: avoid inserting duplicate coroutine to co_queue_wakeup
On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang wrote:
> If g_main_loop_run()/aio_poll() is called in the coroutine context,
> the pending coroutine may be woken up repeatedly, and the co_queue_wakeup
> may be disordered.

aio_poll() must not be called from coroutine context:

  bool no_coroutine_fn aio_poll(AioContext *ctx, bool blocking);
       ^^^

Coroutines are not supposed to block. Instead, they should yield.

> When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means
> some listened events is completed. Therefore, the completion callback
> function is dispatched.
>
> If this callback function needs to invoke aio_co_enter(), it will only
> wake up the coroutine (because we are already in coroutine context),
> which may cause that the data on this listening event_fd/socket_fd
> is not read/cleared. When the next poll () exits, it will be woken up again
> and inserted into the wakeup queue again.
>
> For example, if TLS is enabled in NBD, the server will call g_main_loop_run()
> in the coroutine, and repeatedly wake up the io_read event on a socket.
> The call stack is as follows:
>
>   aio_co_enter()
>   aio_co_wake()
>   qio_channel_restart_read()
>   aio_dispatch_handler()
>   aio_dispatch_handlers()
>   aio_dispatch()
>   aio_ctx_dispatch()
>   g_main_context_dispatch()
>   g_main_loop_run()
>   nbd_negotiate_handle_starttls()

This code does not look like it was designed to run in coroutine
context.

Two options:

1. Don't run it in coroutine context (e.g. use a BH instead). This
   avoids blocking the coroutine but calling g_main_loop_run() is still
   ugly, in my opinion.

2. Get rid of data.loop and use coroutine APIs instead:

   while (!data.complete) {
       qemu_coroutine_yield();
   }

   and update nbd_tls_handshake() to call aio_co_wake(data->co) instead
   of g_main_loop_quit(data->loop). This requires auditing the code to
   check whether the event loop might invoke something that interferes
   with nbd_negotiate_handle_starttls(). Typically this means monitor
   commands or fd activity that could change the state of this
   connection while it is yielded. This is where the real work is but
   hopefully it will not be that hard to figure out.

> nbd_negotiate_options()
> nbd_negotiate()
> nbd_co_client_start()
> coroutine_trampoline()
>
> Signed-off-by: zhuyangyang
> ---
>  util/async.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/util/async.c b/util/async.c
> index 0467890052..25fc1e6083 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -705,7 +705,18 @@ void aio_co_enter(AioContext *ctx, Coroutine *co)
>      if (qemu_in_coroutine()) {
>          Coroutine *self = qemu_coroutine_self();
>          assert(self != co);
> -        QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
> +        /*
> +         * If the Coroutine *co is already in the co_queue_wakeup, this
> +         * repeated insertion will causes the loss of other queue element
> +         * or infinite loop.
> +         * For examplex:
> +         * Head->a->b->c->NULL, after insert_tail(head, b) => Head->a->b->NULL
> +         * Head->a->b>->NULL, after insert_tail(head, b) => Head->a->b->b...
> +         */
> +        if (!co->co_queue_next.sqe_next &&
> +            self->co_queue_wakeup.sqh_last != &co->co_queue_next.sqe_next) {
> +            QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
> +        }
>      } else {
>          qemu_aio_coroutine_enter(ctx, co);
>      }
> --
> 2.33.0

signature.asc
Description: PGP signature
[PATCH] tests/qemu-iotests: Test 157 and 227 require virtio-blk
Tests 157 and 227 use the virtio-blk device, so we have to mark these
tests accordingly to be skipped if this device is not available (e.g.
when running the tests with qemu-system-avr only).

Signed-off-by: Thomas Huth
---
 tests/qemu-iotests/157 | 2 ++
 tests/qemu-iotests/227 | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/tests/qemu-iotests/157 b/tests/qemu-iotests/157
index 0dc9ba68d2..aa2ebbfb4b 100755
--- a/tests/qemu-iotests/157
+++ b/tests/qemu-iotests/157
@@ -40,6 +40,8 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _supported_fmt generic
 _supported_proto file

+_require_devices virtio-blk
+
 do_run_qemu()
 {
     (
diff --git a/tests/qemu-iotests/227 b/tests/qemu-iotests/227
index 7e45a47ac6..eddaad679e 100755
--- a/tests/qemu-iotests/227
+++ b/tests/qemu-iotests/227
@@ -40,6 +40,8 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _supported_fmt generic
 _supported_proto file

+_require_devices virtio-blk
+
 do_run_qemu()
 {
     echo Testing: "$@"
--
2.44.0
[PATCH v1] coroutine: avoid inserting duplicate coroutine to co_queue_wakeup
If g_main_loop_run()/aio_poll() is called in the coroutine context, the
pending coroutine may be woken up repeatedly, and the co_queue_wakeup
may be disordered.

When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means
some listened events is completed. Therefore, the completion callback
function is dispatched.

If this callback function needs to invoke aio_co_enter(), it will only
wake up the coroutine (because we are already in coroutine context),
which may cause that the data on this listening event_fd/socket_fd
is not read/cleared. When the next poll () exits, it will be woken up
again and inserted into the wakeup queue again.

For example, if TLS is enabled in NBD, the server will call
g_main_loop_run() in the coroutine, and repeatedly wake up the io_read
event on a socket. The call stack is as follows:

  aio_co_enter()
  aio_co_wake()
  qio_channel_restart_read()
  aio_dispatch_handler()
  aio_dispatch_handlers()
  aio_dispatch()
  aio_ctx_dispatch()
  g_main_context_dispatch()
  g_main_loop_run()
  nbd_negotiate_handle_starttls()
  nbd_negotiate_options()
  nbd_negotiate()
  nbd_co_client_start()
  coroutine_trampoline()

Signed-off-by: zhuyangyang
---
 util/async.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/util/async.c b/util/async.c
index 0467890052..25fc1e6083 100644
--- a/util/async.c
+++ b/util/async.c
@@ -705,7 +705,18 @@ void aio_co_enter(AioContext *ctx, Coroutine *co)
     if (qemu_in_coroutine()) {
         Coroutine *self = qemu_coroutine_self();
         assert(self != co);
-        QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
+        /*
+         * If the Coroutine *co is already in the co_queue_wakeup, this
+         * repeated insertion will causes the loss of other queue element
+         * or infinite loop.
+         * For examplex:
+         * Head->a->b->c->NULL, after insert_tail(head, b) => Head->a->b->NULL
+         * Head->a->b>->NULL, after insert_tail(head, b) => Head->a->b->b...
+         */
+        if (!co->co_queue_next.sqe_next &&
+            self->co_queue_wakeup.sqh_last != &co->co_queue_next.sqe_next) {
+            QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
+        }
     } else {
         qemu_aio_coroutine_enter(ctx, co);
     }
--
2.33.0
Re: [PATCH] qapi/block-core: improve Qcow2OverlapCheckFlags documentation
Vladimir Sementsov-Ogievskiy writes:

> Most of fields have no description at all. Let's fix that. Still, no
> reason to place here more detailed descriptions of what these
> structures are, as we have public Qcow2 format specification.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy
> ---
>  qapi/block-core.json | 25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 1874f880a8..b9fb994a66 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -3403,14 +3403,31 @@
>  # @Qcow2OverlapCheckFlags:
>  #
>  # Structure of flags for each metadata structure. Setting a field to
> -# 'true' makes qemu guard that structure against unintended
> -# overwriting. The default value is chosen according to the template
> -# given.
> +# 'true' makes qemu guard that Qcow2 format structure against

Mind if I use the occasion to correct the spelling of QEMU?

> +# unintended overwriting. See Qcow2 format specification for detailed
> +# information on these structures. The default value is chosen
> +# according to the template given.
>  #
>  # @template: Specifies a template mode which can be adjusted using the
>  #     other flags, defaults to 'cached'
>  #
> -# @bitmap-directory: since 3.0
> +# @main-header: Qcow2 format header
> +#
> +# @active-l1: Qcow2 active L1 table
> +#
> +# @active-l2: Qcow2 active L2 table
> +#
> +# @refcount-table: Qcow2 refcount table
> +#
> +# @refcount-block: Qcow2 refcount blocks
> +#
> +# @snapshot-table: Qcow2 snapshot table
> +#
> +# @inactive-l1: Qcow2 inactive L1 tables
> +#
> +# @inactive-l2: Qcow2 inactive L2 tables
> +#
> +# @bitmap-directory: Qcow2 bitmap directory (since 3.0)
>  #
>  # Since: 2.9
>  ##

Acked-by: Markus Armbruster
[PATCH] qapi/block-core: improve Qcow2OverlapCheckFlags documentation
Most of fields have no description at all. Let's fix that. Still, no
reason to place here more detailed descriptions of what these
structures are, as we have public Qcow2 format specification.

Signed-off-by: Vladimir Sementsov-Ogievskiy
---
 qapi/block-core.json | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 1874f880a8..b9fb994a66 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3403,14 +3403,31 @@
 # @Qcow2OverlapCheckFlags:
 #
 # Structure of flags for each metadata structure. Setting a field to
-# 'true' makes qemu guard that structure against unintended
-# overwriting. The default value is chosen according to the template
-# given.
+# 'true' makes qemu guard that Qcow2 format structure against
+# unintended overwriting. See Qcow2 format specification for detailed
+# information on these structures. The default value is chosen
+# according to the template given.
 #
 # @template: Specifies a template mode which can be adjusted using the
 #     other flags, defaults to 'cached'
 #
-# @bitmap-directory: since 3.0
+# @main-header: Qcow2 format header
+#
+# @active-l1: Qcow2 active L1 table
+#
+# @active-l2: Qcow2 active L2 table
+#
+# @refcount-table: Qcow2 refcount table
+#
+# @refcount-block: Qcow2 refcount blocks
+#
+# @snapshot-table: Qcow2 snapshot table
+#
+# @inactive-l1: Qcow2 inactive L1 tables
+#
+# @inactive-l2: Qcow2 inactive L2 tables
+#
+# @bitmap-directory: Qcow2 bitmap directory (since 3.0)
 #
 # Since: 2.9
 ##
--
2.34.1