Re: [PULL 0/6] Block layer patches

2024-03-26 Thread Peter Maydell
On Tue, 26 Mar 2024 at 13:54, Kevin Wolf  wrote:
>
> The following changes since commit 096ae430a7b5a704af4cd94dca7200d6cb069991:
>
>   Merge tag 'pull-qapi-2024-03-26' of https://repo.or.cz/qemu/armbru into 
> staging (2024-03-26 09:50:21 +)
>
> are available in the Git repository at:
>
>   https://repo.or.cz/qemu/kevin.git tags/for-upstream
>
> for you to fetch changes up to 12d7b3bbdcededd3b695501d8d247239d769:
>
>   iotests: add test for stream job with an unaligned prefetch read 
> (2024-03-26 14:21:26 +0100)
>
> 
> Block layer patches
>
> - Fix crash with unaligned prefetch requests (e.g. in stream jobs)
> - vdpa-dev: Fix initialisation order to restore VDUSE compatibility
> - iotests fixes
>


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/9.0
for any user-visible changes.

-- PMM



Re: [RFC 0/8] virtio,vhost: Add VIRTIO_F_IN_ORDER support

2024-03-26 Thread Jonah Palmer




On 3/26/24 2:34 PM, Eugenio Perez Martin wrote:

On Tue, Mar 26, 2024 at 5:49 PM Jonah Palmer  wrote:




On 3/25/24 4:33 PM, Eugenio Perez Martin wrote:

On Mon, Mar 25, 2024 at 5:52 PM Jonah Palmer  wrote:




On 3/22/24 7:18 AM, Eugenio Perez Martin wrote:

On Thu, Mar 21, 2024 at 4:57 PM Jonah Palmer  wrote:


The goal of these patches is to add support to a variety of virtio and
vhost devices for the VIRTIO_F_IN_ORDER transport feature. This feature
indicates that all buffers are used by the device in the same order in
which they were made available by the driver.

These patches attempt to implement a generalized, non-device-specific
solution to support this feature.

The core feature behind this solution is a buffer mechanism in the form
of GLib's GHashTable. The decision behind using a hash table was to
leverage their ability for quick lookup, insertion, and removal
operations. Given that our keys are simply numbers of an ordered
sequence, a hash table seemed like the best choice for a buffer
mechanism.

-

The strategy behind this implementation is as follows:

We know that buffers that are popped from the available ring and enqueued
for further processing will always be done in the same order in which they
were made available by the driver. Given this, we can note their order
by assigning the resulting VirtQueueElement a key. This key is a number
in a sequence that represents the order in which they were popped from
the available ring, relative to the other VirtQueueElements.

For example, given 3 "elements" that were popped from the available
ring, we assign a key value to them which represents their order (elem0
is popped first, then elem1, then lastly elem2):

     elem2 -- elem1 -- elem0   ---> Enqueue for processing
    (key: 2)  (key: 1)  (key: 0)

Then these elements are enqueued for further processing by the host.

While most devices will return these completed elements in the same
order in which they were enqueued, some devices may not (e.g.
virtio-blk). To guarantee that these elements are put on the used ring
in the same order in which they were enqueued, we can use a buffering
mechanism that keeps track of the next expected sequence number of an
element.

In other words, if the completed element does not have a key value that
matches the next expected sequence number, then we know this element is
not in-order and we must stash it away in a hash table until an order
can be made. The element's key value is used as the key for placing it
in the hash table.

If the completed element has a key value that matches the next expected
sequence number, then we know this element is in-order and we can push
it on the used ring. Then we increment the next expected sequence number
and check if the hash table contains an element at this key location.

If so, we retrieve this element, push it to the used ring, delete the
key-value pair from the hash table, increment the next expected sequence
number, and check the hash table again for an element at this new key
location. This process is repeated until we're unable to find an element
in the hash table to continue the order.

So, for example, say the 3 elements we enqueued were completed in the
following order: elem1, elem2, elem0. The next expected sequence number
is 0:

   exp-seq-num = 0:

     elem1   --> elem1.key == exp-seq-num ? --> No, stash it
    (key: 1)                                        |
                                                    |
                                                    v
                                          +--------------+
                                          |key: 1 - elem1|
                                          +--------------+
   ------------------------------------------------------------
   exp-seq-num = 0:

     elem2   --> elem2.key == exp-seq-num ? --> No, stash it
    (key: 2)                                        |
                                                    |
                                                    v
                                          +--------------+
                                          |key: 1 - elem1|
                                          |--------------|
                                          |key: 2 - elem2|
                                          +--------------+
   ------------------------------------------------------------
   exp-seq-num = 0:

     elem0   --> elem0.key == exp-seq-num ? --> Yes, push to used ring
    (key: 0)

   exp-seq-num = 1:

     lookup(table, exp-seq-num) != NULL ? --> Yes, push to used ring,
                                              remove elem from table
                                                    |
                                                    v
                                          +--------------+
                                          |key: 2 - elem2|
                                          +--------------+

   exp-seq-num = 2:

     lookup(table, exp-seq-num) != NULL ? --> Yes, push to used ring,
                                              remove elem from table
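As a rough illustration, the stash-and-flush logic described above can be sketched as follows. This is a hypothetical Python sketch, not QEMU code: a plain dict stands in for GLib's GHashTable, and the names `flush_in_order` and `push` are made up for the example.

```python
def flush_in_order(completed, push):
    """Sketch of the in-order completion buffering described above.

    `completed` yields (key, elem) pairs in device-completion order;
    `push(elem)` places one element on the used ring.
    """
    stash = {}      # out-of-order elements, keyed by sequence number
    next_seq = 0    # next expected sequence number

    for key, elem in completed:
        if key != next_seq:
            # Not the element we expect next: stash it by its key.
            stash[key] = elem
            continue
        # In order: push it and advance the expected sequence number.
        push(elem)
        next_seq += 1
        # Drain any stashed successors that now continue the order.
        while next_seq in stash:
            push(stash.pop(next_seq))
            next_seq += 1
```

With the completion order from the example (elem1, elem2, elem0), the elements still reach the used ring as elem0, elem1, elem2.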

Re: [RFC 0/8] virtio,vhost: Add VIRTIO_F_IN_ORDER support

2024-03-26 Thread Eugenio Perez Martin

Re: [PATCH-for-9.0? v2 3/4] qtest/libqos: Reduce size_to_prdtl() declaration scope

2024-03-26 Thread Thomas Huth

On 26/03/2024 18.10, Philippe Mathieu-Daudé wrote:

Since size_to_prdtl() is only used within ahci.c,
declare it statically. This removes the last use
of "inlined function with external linkage". See
previous commit and commit 9de9fa5cf2 for rationale.

Suggested-by: Peter Maydell 
Signed-off-by: Philippe Mathieu-Daudé 
---



Reviewed-by: Thomas Huth 




Re: [PATCH-for-9.0? v2 3/4] qtest/libqos: Reduce size_to_prdtl() declaration scope

2024-03-26 Thread Peter Maydell
On Tue, 26 Mar 2024 at 17:10, Philippe Mathieu-Daudé  wrote:
>
> Since size_to_prdtl() is only used within ahci.c,
> declare it statically. This removes the last use
> of "inlined function with external linkage". See
> previous commit and commit 9de9fa5cf2 for rationale.
>
> Suggested-by: Peter Maydell 
> Signed-off-by: Philippe Mathieu-Daudé 

Reviewed-by: Peter Maydell 

thanks
-- PMM



[PATCH-for-9.0? v2 3/4] qtest/libqos: Reduce size_to_prdtl() declaration scope

2024-03-26 Thread Philippe Mathieu-Daudé
Since size_to_prdtl() is only used within ahci.c,
declare it statically. This removes the last use
of "inlined function with external linkage". See
previous commit and commit 9de9fa5cf2 for rationale.

Suggested-by: Peter Maydell 
Signed-off-by: Philippe Mathieu-Daudé 
---
 tests/qtest/libqos/ahci.h | 1 -
 tests/qtest/libqos/ahci.c | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/tests/qtest/libqos/ahci.h b/tests/qtest/libqos/ahci.h
index 48017864bf..a0487a1557 100644
--- a/tests/qtest/libqos/ahci.h
+++ b/tests/qtest/libqos/ahci.h
@@ -599,7 +599,6 @@ void ahci_port_check_cmd_sanity(AHCIQState *ahci, 
AHCICommand *cmd);
 
 /* Misc */
 bool is_atapi(AHCIQState *ahci, uint8_t port);
-unsigned size_to_prdtl(unsigned bytes, unsigned bytes_per_prd);
 
 /* Command: Macro level execution */
 void ahci_guest_io(AHCIQState *ahci, uint8_t port, uint8_t ide_cmd,
diff --git a/tests/qtest/libqos/ahci.c b/tests/qtest/libqos/ahci.c
index a2c94c6e06..6d59c7551a 100644
--- a/tests/qtest/libqos/ahci.c
+++ b/tests/qtest/libqos/ahci.c
@@ -662,7 +662,7 @@ unsigned ahci_pick_cmd(AHCIQState *ahci, uint8_t port)
 g_assert_not_reached();
 }
 
-inline unsigned size_to_prdtl(unsigned bytes, unsigned bytes_per_prd)
+static unsigned size_to_prdtl(unsigned bytes, unsigned bytes_per_prd)
 {
 /* Each PRD can describe up to 4MiB */
 g_assert_cmphex(bytes_per_prd, <=, 4096 * 1024);
-- 
2.41.0




Re: [RFC 0/8] virtio,vhost: Add VIRTIO_F_IN_ORDER support

2024-03-26 Thread Jonah Palmer




Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()

2024-03-26 Thread David Hildenbrand

+mode = 0;
+oflag = O_RDWR | O_CREAT | O_EXCL;
+backend_name = host_memory_backend_get_name(backend);
+
+/*
+ * Some operating systems allow creating anonymous POSIX shared memory
+ * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
+ * defined by POSIX, so let's create a unique name.
+ *
+ * From Linux's shm_open(3) man-page:
+ *   For portable use, a shared memory object should be identified
+ *   by a name of the form /somename;
+ */
+g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(),
+backend_name);


Hm, shouldn't we just let the user specify a name, and if no name was 
specified, generate one ourselves?


I'm also not quite sure if "host_memory_backend_get_name()" should be 
used for the purpose here.


--
Cheers,

David / dhildenb
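For illustration, the generated-name scheme quoted above can be sketched like this. It is a hypothetical helper, not the patch's code: Python's multiprocessing.shared_memory stands in for a raw shm_open() (and handles the leading '/' itself on POSIX systems).

```python
import os
from multiprocessing import shared_memory

def create_backend_shm(backend_name: str, size: int) -> shared_memory.SharedMemory:
    # Build a per-process unique POSIX shm name, mirroring the
    # "qemu-<pid>-shm-<backend>" scheme from the quoted patch.
    name = f"qemu-{os.getpid()}-shm-{backend_name}"
    # O_CREAT | O_EXCL semantics: create=True fails if the name exists.
    return shared_memory.SharedMemory(name=name, create=True, size=size)
```

The resulting fd-backed object can then be mapped and its fd shared with a vhost-user backend; the caller must close() and unlink() it when done.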




Re: [PATCH for-9.1 v2 04/11] vhost-user-server: don't abort if we can't set fd non-blocking

2024-03-26 Thread Eric Blake
On Tue, Mar 26, 2024 at 02:39:29PM +0100, Stefano Garzarella wrote:
> In vhost-user-server we set all fds received from the other peer
> in non-blocking mode. For some of them (e.g. memfd, shm_open, etc.),
> if we fail, it's not really a problem, because we don't use these
> fds for blocking operations, but only to map memory.
> 
> In these cases a failure is not bad, so let's just report a warning
> instead of panicking if we fail to set some fd in non-blocking mode.
> 
> This occurs, for example, on macOS, where setting a shm_open() fd
> to non-blocking fails (errno: 25).

What is errno 25 on MacOS?

> 
> Signed-off-by: Stefano Garzarella 
> ---
>  util/vhost-user-server.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
> index 3bfb1ad3ec..064999f0b7 100644
> --- a/util/vhost-user-server.c
> +++ b/util/vhost-user-server.c
> @@ -66,7 +66,11 @@ static void vmsg_unblock_fds(VhostUserMsg *vmsg)
>  {
>  int i;
>  for (i = 0; i < vmsg->fd_num; i++) {
> -qemu_socket_set_nonblock(vmsg->fds[i]);
> +int ret = qemu_socket_try_set_nonblock(vmsg->fds[i]);
> +if (ret) {

Should this be 'if (ret < 0)'?

> +warn_report("Failed to set fd %d nonblock for request %d: %s",
> +vmsg->fds[i], vmsg->request, strerror(-ret));
> +}

This now ignores all errors even on pre-existing fds where we NEED
non-blocking, rather than just the specific (expected) error we are
seeing on MacOS.  Should this code be a bit more precise about
checking that -ret == EXXX for the expected errno value we are
ignoring for the specific fds where non-blocking is not essential?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org




Re: [PATCH for-9.1 v2 03/11] libvhost-user: mask F_INFLIGHT_SHMFD if memfd is not supported

2024-03-26 Thread Eric Blake
On Tue, Mar 26, 2024 at 02:39:28PM +0100, Stefano Garzarella wrote:
> libvhost-user will panic when receiving VHOST_USER_GET_INFLIGHT_FD
> message if MFD_ALLOW_SEALING is not defined, since it's not able
> to create a memfd.
> 
> VHOST_USER_GET_INFLIGHT_FD is used only if
> VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD is negotiated. So, let's mask
> that feature if the backend is not able to properly handle these
> messages.
> 
> Signed-off-by: Stefano Garzarella 
> ---
>  subprojects/libvhost-user/libvhost-user.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/subprojects/libvhost-user/libvhost-user.c 
> b/subprojects/libvhost-user/libvhost-user.c
> index a11afd1960..1c361ffd51 100644
> --- a/subprojects/libvhost-user/libvhost-user.c
> +++ b/subprojects/libvhost-user/libvhost-user.c
> @@ -1674,6 +1674,16 @@ vu_get_protocol_features_exec(VuDev *dev, VhostUserMsg 
> *vmsg)
>  features |= dev->iface->get_protocol_features(dev);
>  }
>  
> +/*
> + * If MFD_ALLOW_SEALING is not defined, we are not able to handle
> + * VHOST_USER_GET_INFLIGHT_FD messages, since we can't create a memfd.
> + * Those messages are used only if VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD
> + * is negotiated. A device implementation can enable it, so let's mask
> + * it to avoid a runtime panic.
> + */
> +#ifndef MFD_ALLOW_SEALING
> +features &= ~(1ULL << VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD);
> +#endif

Masking the feature out of advertisement is obviously correct. But
should we also fix the code for handling
VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD to return an error to any client
that requests it in error when the feature was not advertised, instead
of panicking?

>  vmsg_set_reply_u64(vmsg, features);
>  return true;
>  }
> -- 
> 2.44.0
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org
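For illustration, the compile-time masking in the quoted hunk amounts to clearing one bit of the advertised protocol-feature set. A hypothetical sketch (the bit number 12 is taken from the vhost-user protocol feature list; verify against the spec before relying on it):

```python
# Assumed bit position of VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD
# in the vhost-user protocol feature bitmap.
VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12

def advertised_features(base: int, have_memfd_sealing: bool) -> int:
    # Mask the inflight-shmfd feature when memfd sealing is unavailable,
    # so the front-end can never negotiate a message we cannot service.
    feats = base
    if not have_memfd_sealing:
        feats &= ~(1 << VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)
    return feats
```

A front-end that respects negotiation will then never send VHOST_USER_GET_INFLIGHT_FD, which is what the patch relies on.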




Re: [PATCH for-9.1 v2 02/11] libvhost-user: fail vu_message_write() if sendmsg() is failing

2024-03-26 Thread David Hildenbrand

On 26.03.24 15:34, Eric Blake wrote:

On Tue, Mar 26, 2024 at 02:39:27PM +0100, Stefano Garzarella wrote:

In vu_message_write() we use sendmsg() to send the message header,
then a write() to send the payload.

If sendmsg() fails we should avoid sending the payload, since we
were unable to send the header.

Discovered before fixing the issue with the previous patch, where
sendmsg() failed on macOS due to wrong parameters, but the frontend
still sent the payload which the backend incorrectly interpreted
as a wrong header.

Signed-off-by: Stefano Garzarella 
---
  subprojects/libvhost-user/libvhost-user.c | 5 +
  1 file changed, 5 insertions(+)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index 22bea0c775..a11afd1960 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -639,6 +639,11 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg 
*vmsg)
  rc = sendmsg(conn_fd, , 0);
  } while (rc < 0 && (errno == EINTR || errno == EAGAIN));
  
+if (rc <= 0) {


Is rejecting a 0 return value correct?  Technically, a 0 return means
a successful write of 0 bytes - but then again, it is unwise to
attempt to send an fd or other auxiliary data without at least one
regular byte, so we should not be attempting a write of 0 bytes.  So I
guess this one is okay, although I might have used < instead of <=.


I was wondering if we could see some partial sendmsg()/write succeeding. 
Meaning, we transferred some bytes but not all, and we'd actually need 
to loop ...


--
Cheers,

David / dhildenb
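For illustration, handling the partial-write case David raises would mean looping until the whole buffer is sent. A hypothetical sketch, not libvhost-user code:

```python
import socket

def send_all(sock: socket.socket, data: bytes) -> None:
    # Loop over short writes: send() may transfer fewer bytes than asked,
    # so keep sending the remainder until everything has gone out.
    sent = 0
    while sent < len(data):
        n = sock.send(data[sent:])
        if n == 0:
            raise ConnectionError("peer closed the connection")
        sent += n
```

Note that for the sendmsg() carrying SCM_RIGHTS fds, only the first call may attach the ancillary data; a real retry loop would have to account for that.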




Re: [PATCH for-9.1 v2 02/11] libvhost-user: fail vu_message_write() if sendmsg() is failing

2024-03-26 Thread Eric Blake
On Tue, Mar 26, 2024 at 02:39:27PM +0100, Stefano Garzarella wrote:
> In vu_message_write() we use sendmsg() to send the message header,
> then a write() to send the payload.
> 
> If sendmsg() fails we should avoid sending the payload, since we
> were unable to send the header.
> 
> Discovered before fixing the issue with the previous patch, where
> sendmsg() failed on macOS due to wrong parameters, but the frontend
> still sent the payload which the backend incorrectly interpreted
> as a wrong header.
> 
> Signed-off-by: Stefano Garzarella 
> ---
>  subprojects/libvhost-user/libvhost-user.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/subprojects/libvhost-user/libvhost-user.c 
> b/subprojects/libvhost-user/libvhost-user.c
> index 22bea0c775..a11afd1960 100644
> --- a/subprojects/libvhost-user/libvhost-user.c
> +++ b/subprojects/libvhost-user/libvhost-user.c
> @@ -639,6 +639,11 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg 
> *vmsg)
>  rc = sendmsg(conn_fd, , 0);
>  } while (rc < 0 && (errno == EINTR || errno == EAGAIN));
>  
> +if (rc <= 0) {

Is rejecting a 0 return value correct?  Technically, a 0 return means
a successful write of 0 bytes - but then again, it is unwise to
attempt to send an fd or other auxiliary data without at least one
regular byte, so we should not be attempting a write of 0 bytes.  So I
guess this one is okay, although I might have used < instead of <=.

> +vu_panic(dev, "Error while writing: %s", strerror(errno));
> +return false;
> +}

At any rate, noticing the error is the correct thing to do.

> +
>  if (vmsg->size) {
>  do {
>  if (vmsg->data) {
> -- 
> 2.44.0
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org




Re: [PATCH for-9.1 v2 01/11] libvhost-user: set msg.msg_control to NULL when it is empty

2024-03-26 Thread David Hildenbrand

On 26.03.24 14:39, Stefano Garzarella wrote:

On some OS (e.g. macOS) sendmsg() returns -1 (errno EINVAL) if
the `struct msghdr` has the field `msg_controllen` set to 0, but
`msg_control` is not NULL.

Signed-off-by: Stefano Garzarella 
---
  subprojects/libvhost-user/libvhost-user.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index a879149fef..22bea0c775 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -632,6 +632,7 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg 
*vmsg)
  memcpy(CMSG_DATA(cmsg), vmsg->fds, fdsize);
  } else {
  msg.msg_controllen = 0;
+msg.msg_control = NULL;
  }
  
  do {


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb
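For illustration, the invariant the patch restores — no ancillary data unless there are fds to pass — can be sketched with Python's socket.sendmsg(), where an empty ancillary list is the analogue of msg_control = NULL. Hypothetical helper, not the library code:

```python
import array
import socket

def send_with_fds(sock: socket.socket, payload: bytes, fds=()) -> int:
    # Attach an SCM_RIGHTS control message only when there are fds to
    # pass; otherwise send with no ancillary data at all (the NULL
    # msg_control / zero msg_controllen case that trips macOS).
    if fds:
        anc = [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                array.array("i", list(fds)))]
    else:
        anc = []
    return sock.sendmsg([payload], anc)
```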




Re: [PATCH for-9.1 v2 01/11] libvhost-user: set msg.msg_control to NULL when it is empty

2024-03-26 Thread Eric Blake
On Tue, Mar 26, 2024 at 02:39:26PM +0100, Stefano Garzarella wrote:
> On some OS (e.g. macOS) sendmsg() returns -1 (errno EINVAL) if
> the `struct msghdr` has the field `msg_controllen` set to 0, but
> `msg_control` is not NULL.
> 
> Signed-off-by: Stefano Garzarella 
> ---
>  subprojects/libvhost-user/libvhost-user.c | 1 +
>  1 file changed, 1 insertion(+)

Reviewed-by: Eric Blake 

> 
> diff --git a/subprojects/libvhost-user/libvhost-user.c 
> b/subprojects/libvhost-user/libvhost-user.c
> index a879149fef..22bea0c775 100644
> --- a/subprojects/libvhost-user/libvhost-user.c
> +++ b/subprojects/libvhost-user/libvhost-user.c
> @@ -632,6 +632,7 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg 
> *vmsg)
>  memcpy(CMSG_DATA(cmsg), vmsg->fds, fdsize);
>  } else {
>  msg.msg_controllen = 0;
> +msg.msg_control = NULL;
>  }
>  
>  do {
> -- 
> 2.44.0
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org




[PATCH for-9.1 v2 04/11] vhost-user-server: don't abort if we can't set fd non-blocking

2024-03-26 Thread Stefano Garzarella
In vhost-user-server we set all fds received from the other peer
in non-blocking mode. For some of them (e.g. memfd, shm_open, etc.),
if we fail, it's not really a problem, because we don't use these
fds for blocking operations, but only to map memory.

In these cases a failure is not bad, so let's just report a warning
instead of panicking if we fail to set some fd in non-blocking mode.

This occurs, for example, on macOS, where setting a shm_open() fd
to non-blocking fails (errno: 25).

Signed-off-by: Stefano Garzarella 
---
 util/vhost-user-server.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
index 3bfb1ad3ec..064999f0b7 100644
--- a/util/vhost-user-server.c
+++ b/util/vhost-user-server.c
@@ -66,7 +66,11 @@ static void vmsg_unblock_fds(VhostUserMsg *vmsg)
 {
 int i;
 for (i = 0; i < vmsg->fd_num; i++) {
-qemu_socket_set_nonblock(vmsg->fds[i]);
+int ret = qemu_socket_try_set_nonblock(vmsg->fds[i]);
+if (ret) {
+warn_report("Failed to set fd %d nonblock for request %d: %s",
+vmsg->fds[i], vmsg->request, strerror(-ret));
+}
 }
 }
 
-- 
2.44.0
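For illustration, the warn-instead-of-abort behaviour described above can be sketched as follows (a hypothetical helper, not the QEMU API):

```python
import fcntl
import os
import warnings

def try_set_nonblock(fd: int) -> bool:
    # Best-effort: emit a warning instead of aborting when O_NONBLOCK
    # cannot be set, mirroring the qemu_socket_try_set_nonblock() usage
    # described in the patch.
    try:
        flags = fcntl.fcntl(fd, fcntl.F_GETFL)
        fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
        return True
    except OSError as e:
        warnings.warn(f"failed to set fd {fd} non-blocking: {e}")
        return False
```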




[PATCH for-9.1 v2 00/11] vhost-user: support any POSIX system (tested on macOS, FreeBSD, OpenBSD)

2024-03-26 Thread Stefano Garzarella
v1: https://patchew.org/QEMU/20240228114759.44758-1-sgarz...@redhat.com/
v2:
  - fixed Author email and little typos in some patches
  - added memory-backend-shm (patches 9, 10, 11) [Daniel, David, Markus]
  - removed changes to memory-backend-file (ex patch 9)
  - used memory-backend-shm in tests/qtest/vhost-user-blk-test (patch 10)
  - added test case in tests/qtest/vhost-user-test (patch 11)

The vhost-user protocol is not really Linux-specific, so let's try to support
QEMU's frontends and backends (including libvhost-user) on any POSIX system
with this series. The main use case is to be able to use virtio devices that
are not built into QEMU (e.g. virtiofsd, vhost-user-vsock, etc.) even
on non-Linux systems.

The first 5 patches are more like fixes discovered at runtime on macOS or
FreeBSD that could go even independently of this series.

Patches 6, 7, and 8 enable building of frontends and backends (including
libvhost-user) with associated code changes to succeed in compilation.

Patch 9 adds `memory-backend-shm` that uses the POSIX shm_open() API to
create shared memory which is identified by an fd that can be shared with
vhost-user backends. This is useful on those systems (like macOS) where
we don't have memfd_create() or special filesystems like "/dev/shm".

Patches 10 and 11 use `memory-backend-shm` in some vhost-user tests.

Maybe the first 5 patches can go separately, but I only discovered those
problems after testing patches 6 - 9, so I have included them in this series
for now. Please let me know if you prefer that I send them separately.

I tested this series using vhost-user-blk and QSD on macOS Sonoma 14.4
(aarch64), FreeBSD 14 (x86_64), OpenBSD 7.4 (x86_64), and Fedora 39 (x86_64)
in this way:

- Start vhost-user-blk or QSD (same commands for all systems)

  vhost-user-blk -s /tmp/vhost.socket \
-b Fedora-Cloud-Base-39-1.5.x86_64.raw

  qemu-storage-daemon \
--blockdev 
file,filename=Fedora-Cloud-Base-39-1.5.x86_64.qcow2,node-name=file \
--blockdev qcow2,file=file,node-name=qcow2 \
--export 
vhost-user-blk,addr.type=unix,addr.path=/tmp/vhost.socket,id=vub,num-queues=1,node-name=qcow2,writable=on

- macOS (aarch64): start QEMU (using hvf accelerator)

  qemu-system-aarch64 -smp 2 -cpu host -M virt,accel=hvf,memory-backend=mem \
-drive 
file=./build/pc-bios/edk2-aarch64-code.fd,if=pflash,format=raw,readonly=on \
-device virtio-net-device,netdev=net0 -netdev user,id=net0 \
-device ramfb -device usb-ehci -device usb-kbd \
-object memory-backend-shm,id=mem,size=512M \
-device vhost-user-blk-pci,num-queues=1,disable-legacy=on,chardev=char0 \
-chardev socket,id=char0,path=/tmp/vhost.socket

- FreeBSD/OpenBSD (x86_64): start QEMU (no accelerators available)

  qemu-system-x86_64 -smp 2 -M q35,memory-backend=mem \
-object memory-backend-shm,id=mem,size="512M" \
-device vhost-user-blk-pci,num-queues=1,chardev=char0 \
-chardev socket,id=char0,path=/tmp/vhost.socket

- Fedora (x86_64): start QEMU (using kvm accelerator)

  qemu-system-x86_64 -smp 2 -M q35,accel=kvm,memory-backend=mem \
-object memory-backend-shm,size="512M" \
-device vhost-user-blk-pci,num-queues=1,chardev=char0 \
-chardev socket,id=char0,path=/tmp/vhost.socket

Branch pushed (and CI started) at 
https://gitlab.com/sgarzarella/qemu/-/tree/macos-vhost-user?ref_type=heads

Thanks,
Stefano

Stefano Garzarella (11):
  libvhost-user: set msg.msg_control to NULL when it is empty
  libvhost-user: fail vu_message_write() if sendmsg() is failing
  libvhost-user: mask F_INFLIGHT_SHMFD if memfd is not supported
  vhost-user-server: don't abort if we can't set fd non-blocking
  contrib/vhost-user-blk: fix bind() using the right size of the address
  vhost-user: enable frontends on any POSIX system
  libvhost-user: enable it on any POSIX system
  contrib/vhost-user-blk: enable it on any POSIX system
  hostmem: add a new memory backend based on POSIX shm_open()
  tests/qtest/vhost-user-blk-test: use memory-backend-shm
  tests/qtest/vhost-user-test: add a test case for memory-backend-shm

 docs/system/devices/vhost-user.rst|   5 +-
 meson.build   |   5 +-
 qapi/qom.json |  17 
 subprojects/libvhost-user/libvhost-user.h |   2 +-
 backends/hostmem-shm.c| 118 ++
 contrib/vhost-user-blk/vhost-user-blk.c   |  23 -
 hw/net/vhost_net.c|   5 +
 subprojects/libvhost-user/libvhost-user.c |  76 +-
 tests/qtest/vhost-user-blk-test.c |   2 +-
 tests/qtest/vhost-user-test.c |  23 +
 util/vhost-user-server.c  |   6 +-
 backends/meson.build  |   1 +
 hw/block/Kconfig  |   2 +-
 qemu-options.hx   |  10 ++
 util/meson.build  |   4 +-
 15 files changed, 280 insertions(+), 19 deletions(-)
 create mode 100644 

[PATCH for-9.1 v2 08/11] contrib/vhost-user-blk: enable it on any POSIX system

2024-03-26 Thread Stefano Garzarella
Let's make the code more portable by using the "qemu/bswap.h" API
and adding defines from block/file-posix.c to support O_DIRECT in
other systems (e.g. macOS).

vhost-user-server.c is a dependency, let's enable it for any POSIX
system.

Signed-off-by: Stefano Garzarella 
---
 meson.build |  2 --
 contrib/vhost-user-blk/vhost-user-blk.c | 19 +--
 util/meson.build|  4 +++-
 3 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/meson.build b/meson.build
index 3197a2f62e..b541e5c875 100644
--- a/meson.build
+++ b/meson.build
@@ -1956,8 +1956,6 @@ has_statx = cc.has_header_symbol('sys/stat.h', 'STATX_BASIC_STATS', prefix: gnu_
 has_statx_mnt_id = cc.has_header_symbol('sys/stat.h', 'STATX_MNT_ID', prefix: gnu_source_prefix)
 
 have_vhost_user_blk_server = get_option('vhost_user_blk_server') \
-  .require(host_os == 'linux',
-   error_message: 'vhost_user_blk_server requires linux') \
   .require(have_vhost_user,
error_message: 'vhost_user_blk_server requires vhost-user support') \
   .disable_auto_if(not have_tools and not have_system) \
diff --git a/contrib/vhost-user-blk/vhost-user-blk.c b/contrib/vhost-user-blk/vhost-user-blk.c
index a8ab9269a2..462e584857 100644
--- a/contrib/vhost-user-blk/vhost-user-blk.c
+++ b/contrib/vhost-user-blk/vhost-user-blk.c
@@ -16,6 +16,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/bswap.h"
 #include "standard-headers/linux/virtio_blk.h"
 #include "libvhost-user-glib.h"
 
@@ -24,6 +25,20 @@
 #include 
 #endif
 
+/* OS X does not have O_DSYNC */
+#ifndef O_DSYNC
+#ifdef O_SYNC
+#define O_DSYNC O_SYNC
+#elif defined(O_FSYNC)
+#define O_DSYNC O_FSYNC
+#endif
+#endif
+
+/* Approximate O_DIRECT with O_DSYNC if O_DIRECT isn't available */
+#ifndef O_DIRECT
+#define O_DIRECT O_DSYNC
+#endif
+
 enum {
 VHOST_USER_BLK_MAX_QUEUES = 8,
 };
@@ -267,13 +282,13 @@ static int vub_virtio_process_req(VubDev *vdev_blk,
 req->in = (struct virtio_blk_inhdr *)elem->in_sg[in_num - 1].iov_base;
 in_num--;
 
-type = le32toh(req->out->type);
+type = le32_to_cpu(req->out->type);
 switch (type & ~VIRTIO_BLK_T_BARRIER) {
 case VIRTIO_BLK_T_IN:
 case VIRTIO_BLK_T_OUT: {
 ssize_t ret = 0;
 bool is_write = type & VIRTIO_BLK_T_OUT;
-req->sector_num = le64toh(req->out->sector);
+req->sector_num = le64_to_cpu(req->out->sector);
 if (is_write) {
ret  = vub_writev(req, &elem->out_sg[1], out_num);
 } else {
diff --git a/util/meson.build b/util/meson.build
index 0ef9886be0..f52682ce96 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -113,10 +113,12 @@ if have_block
 util_ss.add(files('filemonitor-stub.c'))
   endif
   if host_os == 'linux'
-util_ss.add(files('vhost-user-server.c'), vhost_user)
 util_ss.add(files('vfio-helpers.c'))
 util_ss.add(files('chardev_open.c'))
   endif
+  if host_os != 'windows'
+util_ss.add(files('vhost-user-server.c'), vhost_user)
+  endif
 endif
 
 if cpu == 'aarch64'
-- 
2.44.0




[PATCH for-9.1 v2 01/11] libvhost-user: set msg.msg_control to NULL when it is empty

2024-03-26 Thread Stefano Garzarella
On some operating systems (e.g. macOS), sendmsg() returns -1
(errno EINVAL) if the `struct msghdr` has the field `msg_controllen`
set to 0 while `msg_control` is not NULL.

Signed-off-by: Stefano Garzarella 
---
 subprojects/libvhost-user/libvhost-user.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index a879149fef..22bea0c775 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -632,6 +632,7 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
 memcpy(CMSG_DATA(cmsg), vmsg->fds, fdsize);
 } else {
 msg.msg_controllen = 0;
+msg.msg_control = NULL;
 }
 
 do {
-- 
2.44.0




[PATCH for-9.1 v2 11/11] tests/qtest/vhost-user-test: add a test case for memory-backend-shm

2024-03-26 Thread Stefano Garzarella
`memory-backend-shm` can be used with vhost-user devices, so let's
add a new test case for it.

Signed-off-by: Stefano Garzarella 
---
 tests/qtest/vhost-user-test.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/tests/qtest/vhost-user-test.c b/tests/qtest/vhost-user-test.c
index d4e437265f..8c1d903b2a 100644
--- a/tests/qtest/vhost-user-test.c
+++ b/tests/qtest/vhost-user-test.c
@@ -44,6 +44,8 @@
 "mem-path=%s,share=on -numa node,memdev=mem"
#define QEMU_CMD_MEMFD  " -m %d -object memory-backend-memfd,id=mem,size=%dM," \
 " -numa node,memdev=mem"
+#define QEMU_CMD_SHM" -m %d -object memory-backend-shm,id=mem,size=%dM," \
+" -numa node,memdev=mem"
 #define QEMU_CMD_CHR" -chardev socket,id=%s,path=%s%s"
 #define QEMU_CMD_NETDEV " -netdev vhost-user,id=hs0,chardev=%s,vhostforce=on"
 
@@ -195,6 +197,7 @@ enum test_memfd {
 TEST_MEMFD_AUTO,
 TEST_MEMFD_YES,
 TEST_MEMFD_NO,
+TEST_MEMFD_SHM,
 };
 
 static void append_vhost_net_opts(TestServer *s, GString *cmd_line,
@@ -228,6 +231,8 @@ static void append_mem_opts(TestServer *server, GString *cmd_line,
 
 if (memfd == TEST_MEMFD_YES) {
 g_string_append_printf(cmd_line, QEMU_CMD_MEMFD, size, size);
+} else if (memfd == TEST_MEMFD_SHM) {
+g_string_append_printf(cmd_line, QEMU_CMD_SHM, size, size);
 } else {
 const char *root = init_hugepagefs() ? : server->tmpfs;
 
@@ -788,6 +793,19 @@ static void *vhost_user_test_setup_memfd(GString *cmd_line, void *arg)
 return server;
 }
 
+static void *vhost_user_test_setup_shm(GString *cmd_line, void *arg)
+{
+TestServer *server = test_server_new("vhost-user-test", arg);
+test_server_listen(server);
+
+append_mem_opts(server, cmd_line, 256, TEST_MEMFD_SHM);
+server->vu_ops->append_opts(server, cmd_line, "");
+
+g_test_queue_destroy(vhost_user_test_cleanup, server);
+
+return server;
+}
+
 static void test_read_guest_mem(void *obj, void *arg, QGuestAllocator *alloc)
 {
 TestServer *server = arg;
@@ -1081,6 +1099,11 @@ static void register_vhost_user_test(void)
  "virtio-net",
  test_read_guest_mem, );
 
+opts.before = vhost_user_test_setup_shm;
+qos_add_test("vhost-user/read-guest-mem/shm",
+ "virtio-net",
+ test_read_guest_mem, );
+
 if (qemu_memfd_check(MFD_ALLOW_SEALING)) {
 opts.before = vhost_user_test_setup_memfd;
 qos_add_test("vhost-user/read-guest-mem/memfd",
-- 
2.44.0




[PATCH for-9.1 v2 07/11] libvhost-user: enable it on any POSIX system

2024-03-26 Thread Stefano Garzarella
The vhost-user protocol is not really Linux-specific so let's enable
libvhost-user for any POSIX system.

Compiling it on macOS and FreeBSD, some problems came up:
- avoid including linux/vhost.h, which is available only on Linux
  (vhost_types.h contains many of the things we need)
- macOS doesn't provide sys/endian.h, so let's define the conversion
  macros ourselves
  (note: libvhost-user doesn't include qemu's headers, so we can't
   use "qemu/bswap.h")
- define eventfd_[write|read] as write/read wrappers when the system
  doesn't provide them (e.g. macOS)
- copy the SEAL defines from include/qemu/memfd.h to make the code work
  on FreeBSD where MFD_ALLOW_SEALING is defined
- define MAP_NORESERVE if it's not defined (e.g. on FreeBSD)

Signed-off-by: Stefano Garzarella 
---
 meson.build   |  2 +-
 subprojects/libvhost-user/libvhost-user.h |  2 +-
 subprojects/libvhost-user/libvhost-user.c | 60 +--
 3 files changed, 59 insertions(+), 5 deletions(-)

diff --git a/meson.build b/meson.build
index c19d51501a..3197a2f62e 100644
--- a/meson.build
+++ b/meson.build
@@ -3194,7 +3194,7 @@ endif
 config_host_data.set('CONFIG_FDT', fdt.found())
 
 vhost_user = not_found
-if host_os == 'linux' and have_vhost_user
+if have_vhost_user
   libvhost_user = subproject('libvhost-user')
   vhost_user = libvhost_user.get_variable('vhost_user_dep')
 endif
diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h
index deb40e77b3..e13e1d3931 100644
--- a/subprojects/libvhost-user/libvhost-user.h
+++ b/subprojects/libvhost-user/libvhost-user.h
@@ -18,9 +18,9 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include "standard-headers/linux/virtio_ring.h"
+#include "standard-headers/linux/vhost_types.h"
 
 /* Based on qemu/hw/virtio/vhost-user.c */
 #define VHOST_USER_F_PROTOCOL_FEATURES 30
diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index 1c361ffd51..03edb4bf64 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -28,9 +28,7 @@
 #include 
 #include 
 #include 
-#include 
 #include 
-#include 
 
 /* Necessary to provide VIRTIO_F_VERSION_1 on system
  * with older linux headers. Must appear before
@@ -39,8 +37,8 @@
 #include "standard-headers/linux/virtio_config.h"
 
 #if defined(__linux__)
+#include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -52,6 +50,62 @@
 
 #endif
 
+#if defined(__APPLE__) && (__MACH__)
+#include 
+#define htobe16(x) OSSwapHostToBigInt16(x)
+#define htole16(x) OSSwapHostToLittleInt16(x)
+#define be16toh(x) OSSwapBigToHostInt16(x)
+#define le16toh(x) OSSwapLittleToHostInt16(x)
+
+#define htobe32(x) OSSwapHostToBigInt32(x)
+#define htole32(x) OSSwapHostToLittleInt32(x)
+#define be32toh(x) OSSwapBigToHostInt32(x)
+#define le32toh(x) OSSwapLittleToHostInt32(x)
+
+#define htobe64(x) OSSwapHostToBigInt64(x)
+#define htole64(x) OSSwapHostToLittleInt64(x)
+#define be64toh(x) OSSwapBigToHostInt64(x)
+#define le64toh(x) OSSwapLittleToHostInt64(x)
+#endif
+
+#ifdef CONFIG_EVENTFD
+#include 
+#else
+#define eventfd_t uint64_t
+
+int eventfd_write(int fd, eventfd_t value)
+{
+return (write(fd, &value, sizeof(value)) == sizeof(value)) ? 0 : -1;
+}
+
+int eventfd_read(int fd, eventfd_t *value)
+{
+return (read(fd, value, sizeof(*value)) == sizeof(*value)) ? 0 : -1;
+}
+#endif
+
+#ifdef MFD_ALLOW_SEALING
+#include 
+
+#ifndef F_LINUX_SPECIFIC_BASE
+#define F_LINUX_SPECIFIC_BASE 1024
+#endif
+
+#ifndef F_ADD_SEALS
+#define F_ADD_SEALS (F_LINUX_SPECIFIC_BASE + 9)
+#define F_GET_SEALS (F_LINUX_SPECIFIC_BASE + 10)
+
+#define F_SEAL_SEAL 0x0001  /* prevent further seals from being set */
+#define F_SEAL_SHRINK   0x0002  /* prevent file from shrinking */
+#define F_SEAL_GROW 0x0004  /* prevent file from growing */
+#define F_SEAL_WRITE0x0008  /* prevent writes */
+#endif
+#endif
+
+#ifndef MAP_NORESERVE
+#define MAP_NORESERVE 0
+#endif
+
 #include "include/atomic.h"
 
 #include "libvhost-user.h"
-- 
2.44.0




[PATCH for-9.1 v2 10/11] tests/qtest/vhost-user-blk-test: use memory-backend-shm

2024-03-26 Thread Stefano Garzarella
`memory-backend-memfd` is available only on Linux while the new
`memory-backend-shm` can be used on any POSIX-compliant operating
system. Let's use it so we can run the test in multiple environments.

Signed-off-by: Stefano Garzarella 
---
 tests/qtest/vhost-user-blk-test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qtest/vhost-user-blk-test.c b/tests/qtest/vhost-user-blk-test.c
index 117b9acd10..e945f6abf2 100644
--- a/tests/qtest/vhost-user-blk-test.c
+++ b/tests/qtest/vhost-user-blk-test.c
@@ -906,7 +906,7 @@ static void start_vhost_user_blk(GString *cmd_line, int vus_instances,
vhost_user_blk_bin);
 
 g_string_append_printf(cmd_line,
-" -object memory-backend-memfd,id=mem,size=256M,share=on "
+" -object memory-backend-shm,id=mem,size=256M,share=on "
 " -M memory-backend=mem -m 256M ");
 
 for (i = 0; i < vus_instances; i++) {
-- 
2.44.0




[PATCH for-9.1 v2 03/11] libvhost-user: mask F_INFLIGHT_SHMFD if memfd is not supported

2024-03-26 Thread Stefano Garzarella
libvhost-user will panic when receiving a VHOST_USER_GET_INFLIGHT_FD
message if MFD_ALLOW_SEALING is not defined, since it's not able
to create a memfd.

VHOST_USER_GET_INFLIGHT_FD is used only if
VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD is negotiated. So, let's mask
that feature if the backend is not able to properly handle these
messages.

Signed-off-by: Stefano Garzarella 
---
 subprojects/libvhost-user/libvhost-user.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index a11afd1960..1c361ffd51 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -1674,6 +1674,16 @@ vu_get_protocol_features_exec(VuDev *dev, VhostUserMsg *vmsg)
 features |= dev->iface->get_protocol_features(dev);
 }
 
+/*
+ * If MFD_ALLOW_SEALING is not defined, we are not able to handle
+ * VHOST_USER_GET_INFLIGHT_FD messages, since we can't create a memfd.
+ * Those messages are used only if VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD
+ * is negotiated. A device implementation can enable it, so let's mask
+ * it to avoid a runtime panic.
+ */
+#ifndef MFD_ALLOW_SEALING
+features &= ~(1ULL << VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD);
+#endif
 vmsg_set_reply_u64(vmsg, features);
 return true;
 }
-- 
2.44.0




[PATCH for-9.1 v2 02/11] libvhost-user: fail vu_message_write() if sendmsg() is failing

2024-03-26 Thread Stefano Garzarella
In vu_message_write() we use sendmsg() to send the message header,
then a write() to send the payload.

If sendmsg() fails we should avoid sending the payload, since we
were unable to send the header.

This was discovered while fixing the issue addressed by the previous
patch, where sendmsg() failed on macOS due to wrong parameters, but the
frontend still sent the payload, which the backend incorrectly
interpreted as a wrong header.

Signed-off-by: Stefano Garzarella 
---
 subprojects/libvhost-user/libvhost-user.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index 22bea0c775..a11afd1960 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -639,6 +639,11 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
rc = sendmsg(conn_fd, &msg, 0);
 } while (rc < 0 && (errno == EINTR || errno == EAGAIN));
 
+if (rc <= 0) {
+vu_panic(dev, "Error while writing: %s", strerror(errno));
+return false;
+}
+
 if (vmsg->size) {
 do {
 if (vmsg->data) {
-- 
2.44.0




[PULL 6/6] iotests: add test for stream job with an unaligned prefetch read

2024-03-26 Thread Kevin Wolf
From: Fiona Ebner 

Previously, bdrv_pad_request() could not deal with a NULL qiov when
a read needed to be aligned. During prefetch, a stream job will pass a
NULL qiov. Add a test case to cover this scenario.

By accident, also covers a previous race during shutdown, where block
graph changes during iteration in bdrv_flush_all() could lead to
unreferencing the wrong block driver state and an assertion failure
later.

Signed-off-by: Fiona Ebner 
Message-ID: <20240322095009.346989-5-f.eb...@proxmox.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Kevin Wolf 
---
 .../tests/stream-unaligned-prefetch   | 86 +++
 .../tests/stream-unaligned-prefetch.out   |  5 ++
 2 files changed, 91 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
 create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out

diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch b/tests/qemu-iotests/tests/stream-unaligned-prefetch
new file mode 100755
index 00..546db1d369
--- /dev/null
+++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch
@@ -0,0 +1,86 @@
+#!/usr/bin/env python3
+# group: rw quick
+#
+# Test what happens when a stream job does an unaligned prefetch read
+# which requires padding while having a NULL qiov.
+#
+# Copyright (C) Proxmox Server Solutions GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+import os
+import iotests
+from iotests import imgfmt, qemu_img_create, qemu_io, QMPTestCase
+
+image_size = 1 * 1024 * 1024
+cluster_size = 64 * 1024
+base = os.path.join(iotests.test_dir, 'base.img')
+top = os.path.join(iotests.test_dir, 'top.img')
+
+class TestStreamUnalignedPrefetch(QMPTestCase):
+def setUp(self) -> None:
+"""
+Create two images:
+- base image {base} with {cluster_size // 2} bytes allocated
+- top image {top} without any data allocated and coarser
+  cluster size
+
+Attach a compress filter for the top image, because that
+requires that the request alignment is the top image's cluster
+size.
+"""
+qemu_img_create('-f', imgfmt,
+'-o', 'cluster_size={}'.format(cluster_size // 2),
+base, str(image_size))
+qemu_io('-c', f'write 0 {cluster_size // 2}', base)
+qemu_img_create('-f', imgfmt,
+'-o', 'cluster_size={}'.format(cluster_size),
+top, str(image_size))
+
+self.vm = iotests.VM()
+self.vm.add_blockdev(self.vm.qmp_to_opts({
+'driver': imgfmt,
+'node-name': 'base',
+'file': {
+'driver': 'file',
+'filename': base
+}
+}))
+self.vm.add_blockdev(self.vm.qmp_to_opts({
+'driver': 'compress',
+'node-name': 'compress-top',
+'file': {
+'driver': imgfmt,
+'node-name': 'top',
+'file': {
+'driver': 'file',
+'filename': top
+},
+'backing': 'base'
+}
+}))
+self.vm.launch()
+
+def tearDown(self) -> None:
+self.vm.shutdown()
+os.remove(top)
+os.remove(base)
+
+def test_stream_unaligned_prefetch(self) -> None:
+self.vm.cmd('block-stream', job_id='stream', device='compress-top')
+
+
+if __name__ == '__main__':
+iotests.main(supported_fmts=['qcow2'], supported_protocols=['file'])
diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch.out b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out
new file mode 100644
index 00..ae1213e6f8
--- /dev/null
+++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out
@@ -0,0 +1,5 @@
+.
+--
+Ran 1 tests
+
+OK
-- 
2.44.0




[PULL 4/6] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-03-26 Thread Kevin Wolf
From: Fiona Ebner 

The old_bs variable in bdrv_next() is currently determined by looking
at the old block backend. However, if the block graph changes before
the next bdrv_next() call, it might be that the associated BDS is not
the same that was referenced previously. In that case, the wrong BDS
is unreferenced, leading to an assertion failure later:

> bdrv_unref: Assertion `bs->refcnt > 0' failed.

In particular, this can happen in the context of bdrv_flush_all(),
when polling for bdrv_co_flush() in the generated co-wrapper leads to
a graph change (for example with a stream block job [0]).

A racy reproducer:

> #!/bin/bash
> rm -f /tmp/backing.qcow2
> rm -f /tmp/top.qcow2
> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node0" } }
> {"execute": "quit"}
> EOF

[0]:

> #0  bdrv_replace_child_tran (child=..., new_bs=..., tran=...)
> #1  bdrv_replace_node_noperm (from=..., to=..., auto_skip=..., tran=..., 
> errp=...)
> #2  bdrv_replace_node_common (from=..., to=..., auto_skip=..., 
> detach_subchain=..., errp=...)
> #3  bdrv_drop_filter (bs=..., errp=...)
> #4  bdrv_cor_filter_drop (cor_filter_bs=...)
> #5  stream_prepare (job=...)
> #6  job_prepare_locked (job=...)
> #7  job_txn_apply_locked (fn=..., job=...)
> #8  job_do_finalize_locked (job=...)
> #9  job_exit (opaque=...)
> #10 aio_bh_poll (ctx=...)
> #11 aio_poll (ctx=..., blocking=...)
> #12 bdrv_poll_co (s=...)
> #13 bdrv_flush (bs=...)
> #14 bdrv_flush_all ()
> #15 do_vm_stop (state=..., send_stop=...)
> #16 vm_shutdown ()

Signed-off-by: Fiona Ebner 
Message-ID: <20240322095009.346989-3-f.eb...@proxmox.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Kevin Wolf 
---
 block/block-backend.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 9c4de79e6b..28af1eb17a 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -599,14 +599,14 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
 /* Must be called from the main loop */
 assert(qemu_get_current_aio_context() == qemu_get_aio_context());
 
+old_bs = it->bs;
+
 /* First, return all root nodes of BlockBackends. In order to avoid
  * returning a BDS twice when multiple BBs refer to it, we only return it
  * if the BB is the first one in the parent list of the BDS. */
 if (it->phase == BDRV_NEXT_BACKEND_ROOTS) {
 BlockBackend *old_blk = it->blk;
 
-old_bs = old_blk ? blk_bs(old_blk) : NULL;
-
 do {
 it->blk = blk_all_next(it->blk);
 bs = it->blk ? blk_bs(it->blk) : NULL;
@@ -620,11 +620,10 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
 if (bs) {
 bdrv_ref(bs);
 bdrv_unref(old_bs);
+it->bs = bs;
 return bs;
 }
 it->phase = BDRV_NEXT_MONITOR_OWNED;
-} else {
-old_bs = it->bs;
 }
 
 /* Then return the monitor-owned BDSes without a BB attached. Ignore all
-- 
2.44.0




[PULL 3/6] block/io: accept NULL qiov in bdrv_pad_request

2024-03-26 Thread Kevin Wolf
From: Stefan Reiter 

Some operations, e.g. block-stream, perform reads while discarding the
results (only copy-on-read matters). In this case, they will pass NULL
as the target QEMUIOVector, which will however trip bdrv_pad_request,
since it wants to extend its passed vector. In particular, this is the
case for the blk_co_preadv() call in stream_populate().

If there is no qiov, no operation can be done with it, but the bytes
and offset still need to be updated, so the subsequent aligned read
will actually be aligned and not run into an assertion failure.

In particular, this can happen when the request alignment of the top
node is larger than the allocated part of the bottom node, in which
case padding becomes necessary. For example:

> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": "node0", "node-name": "node1" } }
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node1" } }
> EOF

Originally-by: Stefan Reiter 
Signed-off-by: Thomas Lamprecht 
[FE: do update bytes and offset in any case
 add reproducer to commit message]
Signed-off-by: Fiona Ebner 
Message-ID: <20240322095009.346989-2-f.eb...@proxmox.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Kevin Wolf 
---
 block/io.c | 33 -
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/block/io.c b/block/io.c
index 33150c0359..395bea3bac 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs,
 return 0;
 }
 
-sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
-  &sliced_head, &sliced_tail,
-  &sliced_niov);
-
-/* Guaranteed by bdrv_check_request32() */
-assert(*bytes <= SIZE_MAX);
-ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
-  sliced_head, *bytes);
-if (ret < 0) {
-bdrv_padding_finalize(pad);
-return ret;
+/*
+ * For prefetching in stream_populate(), no qiov is passed along, because
+ * only copy-on-read matters.
+ */
+if (qiov && *qiov) {
+sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+  &sliced_head, &sliced_tail,
+  &sliced_niov);
+
+/* Guaranteed by bdrv_check_request32() */
+assert(*bytes <= SIZE_MAX);
+ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+  sliced_head, *bytes);
+if (ret < 0) {
+bdrv_padding_finalize(pad);
+return ret;
+}
+*qiov = &pad->local_qiov;
+*qiov_offset = 0;
 }
+
 *bytes += pad->head + pad->tail;
 *offset -= pad->head;
-*qiov = &pad->local_qiov;
-*qiov_offset = 0;
 if (padded) {
 *padded = true;
 }
-- 
2.44.0




[PULL 5/6] block-backend: fix edge case in bdrv_next_cleanup() where BDS associated to BB changes

2024-03-26 Thread Kevin Wolf
From: Fiona Ebner 

Same rationale as for commit "block-backend: fix edge case in
bdrv_next() where BDS associated to BB changes". The block graph might
change between the bdrv_next() call and the bdrv_next_cleanup() call,
so it could be that the associated BDS is not the same that was
referenced previously anymore. Instead, rely on bdrv_next() to set
it->bs to the BDS it referenced and unreference that one in any case.

Signed-off-by: Fiona Ebner 
Message-ID: <20240322095009.346989-4-f.eb...@proxmox.com>
Reviewed-by: Kevin Wolf 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Kevin Wolf 
---
 block/block-backend.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -663,13 +663,10 @@ void bdrv_next_cleanup(BdrvNextIterator *it)
 /* Must be called from the main loop */
 assert(qemu_get_current_aio_context() == qemu_get_aio_context());
 
-if (it->phase == BDRV_NEXT_BACKEND_ROOTS) {
-if (it->blk) {
-bdrv_unref(blk_bs(it->blk));
-blk_unref(it->blk);
-}
-} else {
-bdrv_unref(it->bs);
+bdrv_unref(it->bs);
+
+if (it->phase == BDRV_NEXT_BACKEND_ROOTS && it->blk) {
+blk_unref(it->blk);
 }
 
 bdrv_next_reset(it);
-- 
2.44.0




[PULL 0/6] Block layer patches

2024-03-26 Thread Kevin Wolf
The following changes since commit 096ae430a7b5a704af4cd94dca7200d6cb069991:

  Merge tag 'pull-qapi-2024-03-26' of https://repo.or.cz/qemu/armbru into staging (2024-03-26 09:50:21 +)

are available in the Git repository at:

  https://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 12d7b3bbdcededd3b695501d8d247239d769:

  iotests: add test for stream job with an unaligned prefetch read (2024-03-26 14:21:26 +0100)


Block layer patches

- Fix crash with unaligned prefetch requests (e.g. in stream jobs)
- vdpa-dev: Fix initialisation order to restore VDUSE compatibility
- iotests fixes


Fiona Ebner (3):
  block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
  block-backend: fix edge case in bdrv_next_cleanup() where BDS associated to BB changes
  iotests: add test for stream job with an unaligned prefetch read

Kevin Wolf (1):
  vdpa-dev: Fix initialisation order to restore VDUSE compatibility

Stefan Reiter (1):
  block/io: accept NULL qiov in bdrv_pad_request

Thomas Huth (1):
  tests/qemu-iotests: Test 157 and 227 require virtio-blk

 block/block-backend.c  | 18 ++---
 block/io.c | 33 +
 hw/net/vhost_net.c | 10 +++
 hw/virtio/vdpa-dev.c   |  5 +-
 hw/virtio/vhost-vdpa.c | 29 +++-
 hw/virtio/vhost.c  |  8 +-
 hw/virtio/trace-events |  2 +-
 tests/qemu-iotests/157 |  2 +
 tests/qemu-iotests/227 |  2 +
 tests/qemu-iotests/tests/stream-unaligned-prefetch | 86 ++
 .../tests/stream-unaligned-prefetch.out|  5 ++
 11 files changed, 167 insertions(+), 33 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
 create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out




[PULL 2/6] vdpa-dev: Fix initialisation order to restore VDUSE compatibility

2024-03-26 Thread Kevin Wolf
VDUSE requires that virtqueues are enabled before the DRIVER_OK
status flag is set; with the current API of the kernel module, it is
impossible to use the opposite order in our block export code because
userspace is not notified when a virtqueue is enabled.

This requirement also matches the normal initialisation order as done by
the generic vhost code in QEMU. However, commit 6c482547 accidentally
changed the order for vdpa-dev and broke access to VDUSE devices as a
result.

This changes vdpa-dev to use the normal order again and use the standard
vhost callback .vhost_set_vring_enable for this. VDUSE devices can be
used with vdpa-dev again after this fix.

vhost_net intentionally avoided enabling the vrings for vdpa and does
this manually later while it does enable them for other vhost backends.
Reflect this in the vhost_net code and return early for vdpa, so that
the behaviour doesn't change for this device.

Cc: qemu-sta...@nongnu.org
Fixes: 6c4825476a43 ('vdpa: move vhost_vdpa_set_vring_ready to the caller')
Signed-off-by: Kevin Wolf 
Message-ID: <20240315155949.86066-1-kw...@redhat.com>
Reviewed-by: Eugenio Pérez 
Reviewed-by: Stefano Garzarella 
Signed-off-by: Kevin Wolf 
---
 hw/net/vhost_net.c | 10 ++
 hw/virtio/vdpa-dev.c   |  5 +
 hw/virtio/vhost-vdpa.c | 29 ++---
 hw/virtio/vhost.c  |  8 +++-
 hw/virtio/trace-events |  2 +-
 5 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index e8e1661646..fd1a93701a 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -541,6 +541,16 @@ int vhost_set_vring_enable(NetClientState *nc, int enable)
 VHostNetState *net = get_vhost_net(nc);
 const VhostOps *vhost_ops = net->dev.vhost_ops;
 
+/*
+ * vhost-vdpa network devices need to enable dataplane virtqueues after
+ * DRIVER_OK, so they can recover device state before starting dataplane.
+ * Because of that, we don't enable virtqueues here and leave it to
+ * net/vhost-vdpa.c.
+ */
+if (nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
+return 0;
+}
+
 nc->vring_enable = enable;
 
 if (vhost_ops && vhost_ops->vhost_set_vring_enable) {
diff --git a/hw/virtio/vdpa-dev.c b/hw/virtio/vdpa-dev.c
index eb9ecea83b..13e87f06f6 100644
--- a/hw/virtio/vdpa-dev.c
+++ b/hw/virtio/vdpa-dev.c
@@ -253,14 +253,11 @@ static int vhost_vdpa_device_start(VirtIODevice *vdev, Error **errp)
 
 s->dev.acked_features = vdev->guest_features;
 
-ret = vhost_dev_start(&s->dev, vdev, false);
+ret = vhost_dev_start(&s->dev, vdev, true);
 if (ret < 0) {
 error_setg_errno(errp, -ret, "Error starting vhost");
 goto err_guest_notifiers;
 }
-for (i = 0; i < s->dev.nvqs; ++i) {
-vhost_vdpa_set_vring_ready(&s->vdpa, i);
-}
 s->started = true;
 
 /*
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 3bcd05cc22..e827b9175f 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -896,19 +896,41 @@ static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, int idx)
 return idx;
 }
 
-int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, unsigned idx)
+static int vhost_vdpa_set_vring_enable_one(struct vhost_vdpa *v, unsigned idx,
+   int enable)
 {
 struct vhost_dev *dev = v->dev;
 struct vhost_vring_state state = {
 .index = idx,
-.num = 1,
+.num = enable,
 };
int r = vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
 
-trace_vhost_vdpa_set_vring_ready(dev, idx, r);
+trace_vhost_vdpa_set_vring_enable_one(dev, idx, enable, r);
 return r;
 }
 
+static int vhost_vdpa_set_vring_enable(struct vhost_dev *dev, int enable)
+{
+struct vhost_vdpa *v = dev->opaque;
+unsigned int i;
+int ret;
+
+for (i = 0; i < dev->nvqs; ++i) {
+ret = vhost_vdpa_set_vring_enable_one(v, i, enable);
+if (ret < 0) {
+return ret;
+}
+}
+
+return 0;
+}
+
+int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, unsigned idx)
+{
+return vhost_vdpa_set_vring_enable_one(v, idx, 1);
+}
+
 static int vhost_vdpa_set_config_call(struct vhost_dev *dev,
int fd)
 {
@@ -1536,6 +1558,7 @@ const VhostOps vdpa_ops = {
 .vhost_set_features = vhost_vdpa_set_features,
 .vhost_reset_device = vhost_vdpa_reset_device,
 .vhost_get_vq_index = vhost_vdpa_get_vq_index,
+.vhost_set_vring_enable = vhost_vdpa_set_vring_enable,
 .vhost_get_config  = vhost_vdpa_get_config,
 .vhost_set_config = vhost_vdpa_set_config,
 .vhost_requires_shm_log = NULL,
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2e4e040db8..f50180e60e 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1984,7 +1984,13 @@ static int vhost_dev_set_vring_enable(struct vhost_dev *hdev, int enable)
 return 

[PULL 1/6] tests/qemu-iotests: Test 157 and 227 require virtio-blk

2024-03-26 Thread Kevin Wolf
From: Thomas Huth 

Tests 157 and 227 use the virtio-blk device, so we have to mark these
tests to be skipped if this device is not available (e.g.
when running the tests with qemu-system-avr only).

Signed-off-by: Thomas Huth 
Message-ID: <20240325154737.1305063-1-th...@redhat.com>
Reviewed-by: Kevin Wolf 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/157 | 2 ++
 tests/qemu-iotests/227 | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/tests/qemu-iotests/157 b/tests/qemu-iotests/157
index 0dc9ba68d2..aa2ebbfb4b 100755
--- a/tests/qemu-iotests/157
+++ b/tests/qemu-iotests/157
@@ -40,6 +40,8 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _supported_fmt generic
 _supported_proto file
 
+_require_devices virtio-blk
+
 do_run_qemu()
 {
 (
diff --git a/tests/qemu-iotests/227 b/tests/qemu-iotests/227
index 7e45a47ac6..eddaad679e 100755
--- a/tests/qemu-iotests/227
+++ b/tests/qemu-iotests/227
@@ -40,6 +40,8 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _supported_fmt generic
 _supported_proto file
 
+_require_devices virtio-blk
+
 do_run_qemu()
 {
 echo Testing: "$@"
-- 
2.44.0




[PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()

2024-03-26 Thread Stefano Garzarella
shm_open() creates and opens a new POSIX shared memory object.
A POSIX shared memory object allows creating memory backend with an
associated file descriptor that can be shared with external processes
(e.g. vhost-user).

The new `memory-backend-shm` can be used as an alternative when
`memory-backend-memfd` is not available (Linux only), since shm_open()
should be provided by any POSIX-compliant operating system.

Signed-off-by: Stefano Garzarella 
---
 docs/system/devices/vhost-user.rst |   5 +-
 qapi/qom.json  |  17 +
 backends/hostmem-shm.c | 118 +
 backends/meson.build   |   1 +
 qemu-options.hx|  10 +++
 5 files changed, 149 insertions(+), 2 deletions(-)
 create mode 100644 backends/hostmem-shm.c

diff --git a/docs/system/devices/vhost-user.rst 
b/docs/system/devices/vhost-user.rst
index 9b2da106ce..35259d8ec7 100644
--- a/docs/system/devices/vhost-user.rst
+++ b/docs/system/devices/vhost-user.rst
@@ -98,8 +98,9 @@ Shared memory object
 
 In order for the daemon to access the VirtIO queues to process the
 requests it needs access to the guest's address space. This is
-achieved via the ``memory-backend-file`` or ``memory-backend-memfd``
-objects. A reference to a file-descriptor which can access this object
+achieved via the ``memory-backend-file``, ``memory-backend-memfd``, or
+``memory-backend-shm`` objects.
+A reference to a file-descriptor which can access this object
 will be passed via the socket as part of the protocol negotiation.
 
 Currently the shared memory object needs to match the size of the main
diff --git a/qapi/qom.json b/qapi/qom.json
index baae3a183f..88136b861d 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -720,6 +720,19 @@
 '*hugetlbsize': 'size',
 '*seal': 'bool' } }
 
+##
+# @MemoryBackendShmProperties:
+#
+# Properties for memory-backend-shm objects.
+#
+# The @share boolean option is true by default with shm.
+#
+# Since: 9.1
+##
+{ 'struct': 'MemoryBackendShmProperties',
+  'base': 'MemoryBackendProperties',
+  'data': { } }
+
 ##
 # @MemoryBackendEpcProperties:
 #
@@ -976,6 +989,8 @@
 { 'name': 'memory-backend-memfd',
   'if': 'CONFIG_LINUX' },
 'memory-backend-ram',
+{ 'name': 'memory-backend-shm',
+  'if': 'CONFIG_POSIX' },
 'pef-guest',
 { 'name': 'pr-manager-helper',
   'if': 'CONFIG_LINUX' },
@@ -1047,6 +1062,8 @@
   'memory-backend-memfd':   { 'type': 'MemoryBackendMemfdProperties',
   'if': 'CONFIG_LINUX' },
   'memory-backend-ram': 'MemoryBackendProperties',
+  'memory-backend-shm': { 'type': 'MemoryBackendShmProperties',
+  'if': 'CONFIG_POSIX' },
   'pr-manager-helper':  { 'type': 'PrManagerHelperProperties',
   'if': 'CONFIG_LINUX' },
   'qtest':  'QtestProperties',
diff --git a/backends/hostmem-shm.c b/backends/hostmem-shm.c
new file mode 100644
index 00..7595204d29
--- /dev/null
+++ b/backends/hostmem-shm.c
@@ -0,0 +1,118 @@
+/*
+ * QEMU host POSIX shared memory object backend
+ *
+ * Copyright (C) 2024 Red Hat Inc
+ *
+ * Authors:
+ *   Stefano Garzarella 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "sysemu/hostmem.h"
+#include "qapi/error.h"
+
+#define TYPE_MEMORY_BACKEND_SHM "memory-backend-shm"
+
+OBJECT_DECLARE_SIMPLE_TYPE(HostMemoryBackendShm, MEMORY_BACKEND_SHM)
+
+struct HostMemoryBackendShm {
+HostMemoryBackend parent_obj;
+};
+
+static bool
+shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
+{
+g_autoptr(GString) shm_name = g_string_new(NULL);
+g_autofree char *backend_name = NULL;
+uint32_t ram_flags;
+int fd, oflag;
+mode_t mode;
+
+if (!backend->size) {
+error_setg(errp, "can't create backend with size 0");
+return false;
+}
+
+/*
+ * Let's use `mode = 0` because we don't want other processes to open our
+ * memory unless we share the file descriptor with them.
+ */
+mode = 0;
+oflag = O_RDWR | O_CREAT | O_EXCL;
+backend_name = host_memory_backend_get_name(backend);
+
+/*
+ * Some operating systems allow creating anonymous POSIX shared memory
+ * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
+ * defined by POSIX, so let's create a unique name.
+ *
+ * From Linux's shm_open(3) man-page:
+ *   For  portable  use,  a shared  memory  object should be identified
+ *   by a name of the form /somename;
+ */
+g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(),
+backend_name);
+
+fd = shm_open(shm_name->str, oflag, mode);
+if (fd < 0) {
+error_setg_errno(errp, errno,
+
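The unique-name scheme used above can be tried standalone. The following sketch (illustrative only, not QEMU code) creates a pid-named POSIX shm object with mode 0, maps it, and cleans up; on older glibc you may need to link with -lrt:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create, map, touch, and destroy a pid-named POSIX shm object.
 * Returns 0 on success. */
static int shm_demo(void)
{
    char name[64];

    /* Per shm_open(3), portable names look like "/somename"; make it
     * unique with the pid, as the backend above does. */
    snprintf(name, sizeof(name), "/demo-%d-shm", (int)getpid());

    /* mode 0: no other process can open it by name; share the fd instead */
    int fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, 4096) != 0) {
        close(fd);
        shm_unlink(name);
        return -1;
    }
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        shm_unlink(name);
        return -1;
    }
    strcpy(p, "hello");
    int ok = (strcmp(p, "hello") == 0);
    munmap(p, 4096);
    close(fd);
    shm_unlink(name);   /* the name can go away; a shared fd stays usable */
    return ok ? 0 : -1;
}
```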

[PATCH for-9.1 v2 06/11] vhost-user: enable frontends on any POSIX system

2024-03-26 Thread Stefano Garzarella
The vhost-user protocol is not really Linux-specific so let's enable
vhost-user frontends for any POSIX system.

In vhost_net.c we use VHOST_FILE_UNBIND which is defined in a Linux
specific header, let's define it for other systems as well.

Signed-off-by: Stefano Garzarella 
---
 meson.build| 1 -
 hw/net/vhost_net.c | 5 +
 hw/block/Kconfig   | 2 +-
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/meson.build b/meson.build
index c9c3217ba4..c19d51501a 100644
--- a/meson.build
+++ b/meson.build
@@ -151,7 +151,6 @@ have_tpm = get_option('tpm') \
 
 # vhost
 have_vhost_user = get_option('vhost_user') \
-  .disable_auto_if(host_os != 'linux') \
   .require(host_os != 'windows',
error_message: 'vhost-user is not available on Windows').allowed()
 have_vhost_vdpa = get_option('vhost_vdpa') \
diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index e8e1661646..b8a59b9e90 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -34,8 +34,13 @@
 #include "standard-headers/linux/virtio_ring.h"
 #include "hw/virtio/vhost.h"
 #include "hw/virtio/virtio-bus.h"
+#if defined(__linux__)
 #include "linux-headers/linux/vhost.h"
+#endif
 
+#ifndef VHOST_FILE_UNBIND
+#define VHOST_FILE_UNBIND -1
+#endif
 
 /* Features supported by host kernel. */
 static const int kernel_feature_bits[] = {
diff --git a/hw/block/Kconfig b/hw/block/Kconfig
index 9e8f28f982..29ee09e434 100644
--- a/hw/block/Kconfig
+++ b/hw/block/Kconfig
@@ -40,7 +40,7 @@ config VHOST_USER_BLK
 bool
 # Only PCI devices are provided for now
 default y if VIRTIO_PCI
-depends on VIRTIO && VHOST_USER && LINUX
+depends on VIRTIO && VHOST_USER
 
 config SWIM
 bool
-- 
2.44.0




[PATCH for-9.1 v2 05/11] contrib/vhost-user-blk: fix bind() using the right size of the address

2024-03-26 Thread Stefano Garzarella
On macOS, when passing the `-s /tmp/vhost.socket` parameter to the
vhost-user-blk application, the bind was done on the `/tmp/vhost.socke`
pathname, missing the last character.

This sounds like one of the portability problems described in the
unix(7) manpage:

   Pathname sockets
   When  binding  a socket to a pathname, a few rules should
   be observed for maximum portability and ease of coding:

   •  The pathname in sun_path should be null-terminated.

   •  The length of the pathname, including the  terminating
  null byte, should not exceed the size of sun_path.

   •  The  addrlen  argument  that  describes  the enclosing
  sockaddr_un structure should have a value of at least:

  offsetof(struct sockaddr_un, sun_path) +
  strlen(addr.sun_path)+1

  or,  more  simply,  addrlen  can   be   specified   as
  sizeof(struct sockaddr_un).

So let's follow the last advice and simplify the code as well.

Signed-off-by: Stefano Garzarella 
---
 contrib/vhost-user-blk/vhost-user-blk.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/contrib/vhost-user-blk/vhost-user-blk.c 
b/contrib/vhost-user-blk/vhost-user-blk.c
index 89e5f11a64..a8ab9269a2 100644
--- a/contrib/vhost-user-blk/vhost-user-blk.c
+++ b/contrib/vhost-user-blk/vhost-user-blk.c
@@ -469,7 +469,6 @@ static int unix_sock_new(char *unix_fn)
 {
 int sock;
 struct sockaddr_un un;
-size_t len;
 
 assert(unix_fn);
 
@@ -481,10 +480,9 @@ static int unix_sock_new(char *unix_fn)
 
 un.sun_family = AF_UNIX;
 (void)snprintf(un.sun_path, sizeof(un.sun_path), "%s", unix_fn);
-len = sizeof(un.sun_family) + strlen(un.sun_path);
 
 (void)unlink(unix_fn);
-if (bind(sock, (struct sockaddr *), len) < 0) {
+if (bind(sock, (struct sockaddr *), sizeof(un)) < 0) {
 perror("bind");
 goto fail;
 }
-- 
2.44.0
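The man-page advice above boils down to: null-terminate sun_path and pass sizeof(struct sockaddr_un) as the address length. A standalone sketch of the fixed pattern (the socket path in the test is a hypothetical /tmp name):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Bind a UNIX stream socket to path, following the unix(7) rules.
 * Returns the socket fd on success, -1 on failure. */
static int bind_unix(const char *path)
{
    struct sockaddr_un un;
    int sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (sock < 0) {
        return -1;
    }
    memset(&un, 0, sizeof(un));
    un.sun_family = AF_UNIX;
    /* snprintf null-terminates and never overflows sun_path */
    snprintf(un.sun_path, sizeof(un.sun_path), "%s", path);
    unlink(path);
    /* sizeof(un) always covers the full, null-terminated path, while
     * sizeof(sun_family) + strlen(path) can fall one byte short on
     * platforms where offsetof(sun_path) != sizeof(sun_family). */
    if (bind(sock, (struct sockaddr *)&un, sizeof(un)) < 0) {
        close(sock);
        return -1;
    }
    return sock;
}
```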




Re: [PATCH v3 0/4] fix two edge cases related to stream block jobs

2024-03-26 Thread Kevin Wolf
Am 25.03.2024 um 21:11 hat Stefan Hajnoczi geschrieben:
> On Fri, Mar 22, 2024 at 10:50:05AM +0100, Fiona Ebner wrote:
> > Changes in v3:
> > * Also deal with edge case in bdrv_next_cleanup(). Haven't run
> >   into an actual issue there, but at least the caller in
> >   migration/block.c uses bdrv_nb_sectors() which, while not a
> >   coroutine wrapper itself (it's written manually), may call
> >   bdrv_refresh_total_sectors(), which is a generated coroutine
> >   wrapper, so AFAIU, the block graph can change during that call.
> >   And even without that, it's just better to be more consistent
> >   with bdrv_next().
> > 
> > Changes in v2:
> > * Ran into another issue while writing the IO test Stefan wanted
> >   to have (good call :)), so include a fix for that and add the
> >   test. I didn't notice during manual testing, because I hadn't
> >   used a scripted QMP 'quit', so there was no race.
> > 
> > Fiona Ebner (3):
> >   block-backend: fix edge case in bdrv_next() where BDS associated to BB
> > changes
> >   block-backend: fix edge case in bdrv_next_cleanup() where BDS
> > associated to BB changes
> >   iotests: add test for stream job with an unaligned prefetch read
> > 
> > Stefan Reiter (1):
> >   block/io: accept NULL qiov in bdrv_pad_request
> > 
> >  block/block-backend.c | 18 ++--
> >  block/io.c| 31 ---
> >  .../tests/stream-unaligned-prefetch   | 86 +++
> >  .../tests/stream-unaligned-prefetch.out   |  5 ++
> >  4 files changed, 117 insertions(+), 23 deletions(-)
> >  create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
> >  create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out
> 
> Looks good to me. I will wait until Thursday before merging in case
> Hanna, Vladimir, or Kevin have comments. Thanks!

Let's not delay it to -rc2. If something turns out to be wrong with it,
we can still revert it, but I think getting fixes in earlier is better
during freeze.

Thanks, applied to the block branch.

Kevin




Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-03-26 Thread Kevin Wolf
Am 22.03.2024 um 10:50 hat Fiona Ebner geschrieben:
> The old_bs variable in bdrv_next() is currently determined by looking
> at the old block backend. However, if the block graph changes before
> the next bdrv_next() call, it might be that the associated BDS is not
> the same that was referenced previously. In that case, the wrong BDS
> is unreferenced, leading to an assertion failure later:
> 
> > bdrv_unref: Assertion `bs->refcnt > 0' failed.

Your change makes sense, but in theory it shouldn't make a difference.
The real bug is in the caller, you can't allow graph modifications while
iterating the list of nodes. Even if it doesn't crash (like after your
patch), you don't have any guarantee that you will have seen every node
that exists by the end - and maybe not even that you don't get the
same node twice.

> In particular, this can happen in the context of bdrv_flush_all(),
> when polling for bdrv_co_flush() in the generated co-wrapper leads to
> a graph change (for example with a stream block job [0]).

The whole locking around this case is a bit tricky and would deserve
some cleanup.

The basic rule for bdrv_next() callers is that they need to hold the
graph reader lock as long as they are iterating the graph, otherwise
it's not safe. This implies that the ref/unref pairs in it should never
make a difference either - which is important, because at least
releasing the last reference is forbidden while holding the graph lock.
I intended to remove the ref/unref for bdrv_next(), but I didn't because
I realised that the callers need to be audited first that they really
obey the rules. You found one that would be problematic.

The thing that bdrv_flush_all() gets wrong is that it promises to follow
the graph lock rules with GRAPH_RDLOCK_GUARD_MAINLOOP(), but then calls
something that polls. The compiler can't catch this because bdrv_flush()
is a co_wrapper_mixed_bdrv_rdlock. The behaviour for these functions is:

- If called outside of coroutine context, they are GRAPH_UNLOCKED
- If called in coroutine context, they are GRAPH_RDLOCK

We should probably try harder to get rid of the mixed functions, because
a synchronous co_wrapper_bdrv_rdlock could actually be marked
GRAPH_UNLOCKED in the code and then the compiler could catch this case.

The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
with a coroutine wrapper so that the graph lock is held for the whole
function. Then calling bdrv_co_flush() while iterating the list is safe
and doesn't allow concurrent graph modifications.

Kevin




Re: [PATCH v1] coroutine: avoid inserting duplicate coroutine to co_queue_wakeup

2024-03-26 Thread zhuyangyang via
On Mon, 25 Mar 2024 11:00:31 -0500 Eric Blake wrote:
> >  util/async.c | 13 -
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> > 
> > diff --git a/util/async.c b/util/async.c
> > index 0467890052..25fc1e6083 100644
> > --- a/util/async.c
> > +++ b/util/async.c
> > @@ -705,7 +705,18 @@ void aio_co_enter(AioContext *ctx, Coroutine *co)
> >  if (qemu_in_coroutine()) {
> >  Coroutine *self = qemu_coroutine_self();
> >  assert(self != co);
> > -QSIMPLEQ_INSERT_TAIL(>co_queue_wakeup, co, co_queue_next);
> > +/*
> > + * If the Coroutine *co is already in the co_queue_wakeup, this
> > + * repeated insertion will causes the loss of other queue element
> 
> s/causes/cause/
> 
> > + * or infinite loop.
> > + * For examplex:
> 
> s/examplex/example/
> 
> > + * Head->a->b->c->NULL, after insert_tail(head, b) => 
> > Head->a->b->NULL
> > + * Head->a-b>->NULL, after insert_tail(head, b) => Head->a->b->b...
> 
> s/b>->/b->/
> 
> > + */
> > +if (!co->co_queue_next.sqe_next &&
> > +self->co_queue_wakeup.sqh_last != >co_queue_next.sqe_next) 
> > {
> > +QSIMPLEQ_INSERT_TAIL(>co_queue_wakeup, co, 
> > co_queue_next);
> > +}
> >  } else {
> >  qemu_aio_coroutine_enter(ctx, co);
> >  }
> 
> Intuitively, attacking the symptoms (avoiding bogus list insertion
> when it is already on the list) makes sense; but I want to make sure
> we attack the root cause.

Fixing nbd_negotiate_handle_starttls() can solve this problem; therefore,
I'm not sure whether this commit is still needed.

--
Best Regards,
Zhu Yangyang
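The corruption mode discussed above (re-inserting an element that is already queued) is easy to reproduce with a minimal singly-linked tail queue of the same shape as QSIMPLEQ. This is a standalone sketch, not the actual QEMU macros:

```c
#include <stddef.h>

struct node { struct node *next; char name; };
struct queue { struct node *first; struct node **last; };

static void q_init(struct queue *q) { q->first = NULL; q->last = &q->first; }

/* Same shape as QSIMPLEQ_INSERT_TAIL: no check for "already queued". */
static void q_insert_tail(struct queue *q, struct node *n)
{
    n->next = NULL;          /* if n is mid-queue, this orphans its tail! */
    *q->last = n;
    q->last = &n->next;
}

/* Guarded variant, mirroring the proposed fix: skip if n is already
 * linked (non-NULL next) or is already the tail element. */
static void q_insert_tail_once(struct queue *q, struct node *n)
{
    if (!n->next && q->last != &n->next) {
        q_insert_tail(q, n);
    }
}

static int q_len(struct queue *q)
{
    int len = 0;
    for (struct node *n = q->first; n; n = n->next) {
        len++;
    }
    return len;
}
```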




Re: [PATCH] tests/qemu-iotests: Test 157 and 227 require virtio-blk

2024-03-26 Thread Kevin Wolf
Am 25.03.2024 um 16:47 hat Thomas Huth geschrieben:
> Tests 157 and 227 use the virtio-blk device, so we have to mark these
> tests accordingly to be skipped if this device is not available (e.g.
> when running the tests with qemu-system-avr only).
> 
> Signed-off-by: Thomas Huth 

Thanks, applied to the block branch.

Kevin




Re: [PATCH v1] coroutine: avoid inserting duplicate coroutine to co_queue_wakeup

2024-03-26 Thread zhuyangyang via
On Mon, 25 Mar 2024 11:50:41 -0400, Stefan Hajnoczi wrote:
> On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang wrote:
> > If g_main_loop_run()/aio_poll() is called in the coroutine context,
> > the pending coroutine may be woken up repeatedly, and the co_queue_wakeup
> > may be disordered.
> 
> aio_poll() must not be called from coroutine context:
> 
>   bool no_coroutine_fn aio_poll(AioContext *ctx, bool blocking);
>^^^
> 
> Coroutines are not supposed to block. Instead, they should yield.
> 
> > When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means
> > some listened events is completed. Therefore, the completion callback
> > function is dispatched.
> > 
> > If this callback function needs to invoke aio_co_enter(), it will only
> > wake up the coroutine (because we are already in coroutine context),
> > which may cause that the data on this listening event_fd/socket_fd
> > is not read/cleared. When the next poll () exits, it will be woken up again
> > and inserted into the wakeup queue again.
> > 
> > For example, if TLS is enabled in NBD, the server will call 
> > g_main_loop_run()
> > in the coroutine, and repeatedly wake up the io_read event on a socket.
> > The call stack is as follows:
> > 
> > aio_co_enter()
> > aio_co_wake()
> > qio_channel_restart_read()
> > aio_dispatch_handler()
> > aio_dispatch_handlers()
> > aio_dispatch()
> > aio_ctx_dispatch()
> > g_main_context_dispatch()
> > g_main_loop_run()
> > nbd_negotiate_handle_starttls()
> 
> This code does not look like it was designed to run in coroutine
> context. Two options:
> 
> 1. Don't run it in coroutine context (e.g. use a BH instead). This
>avoids blocking the coroutine but calling g_main_loop_run() is still
>ugly, in my opinion.
> 
> 2. Get rid of data.loop and use coroutine APIs instead:
> 
>while (!data.complete) {
>qemu_coroutine_yield();
>}
> 
>and update nbd_tls_handshake() to call aio_co_wake(data->co) instead
>of g_main_loop_quit(data->loop).
> 
>This requires auditing the code to check whether the event loop might
>invoke something that interferes with
>nbd_negotiate_handle_starttls(). Typically this means monitor
>commands or fd activity that could change the state of this
>connection while it is yielded. This is where the real work is but
>hopefully it will not be that hard to figure out.

Thank you for your help, I'll try to fix it using the coroutine api.

> 
> > nbd_negotiate_options()
> > nbd_negotiate()
> > nbd_co_client_start()
> > coroutine_trampoline()
> > 

-- 
Best Regards,
Zhu Yangyang
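Option 2 from Stefan's reply (replacing the nested event loop with a yield/wake pair) can be sketched with ucontext coroutines. All names here are hypothetical stand-ins; qemu_coroutine_yield() and aio_co_wake() are the real QEMU APIs:

```c
#define _XOPEN_SOURCE 700
#include <stdbool.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];
static bool complete;       /* set when the "TLS handshake" finishes */
static int result;

static void co_yield_to_main(void)  /* stands in for qemu_coroutine_yield() */
{
    swapcontext(&co_ctx, &main_ctx);
}

static void negotiate_co(void)
{
    /* Instead of g_main_loop_run(), just yield until woken: */
    while (!complete) {
        co_yield_to_main();
    }
    result = 42;            /* handshake done, continue negotiation */
}

static int run_demo(void)
{
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = co_stack;
    co_ctx.uc_stack.ss_size = sizeof(co_stack);
    co_ctx.uc_link = &main_ctx;             /* return here when co ends */
    makecontext(&co_ctx, negotiate_co, 0);

    swapcontext(&main_ctx, &co_ctx);        /* enter coroutine; it yields */
    complete = true;                        /* "handshake finished" event */
    swapcontext(&main_ctx, &co_ctx);        /* aio_co_wake() equivalent */
    return result;
}
```

The point of the shape: while the coroutine is yielded, the real event loop keeps draining fds normally, so no handler can wake the coroutine twice for the same unread event.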



Question about block graph lock limitation with generated co-wrappers

2024-03-26 Thread Fiona Ebner
Hi,
we have a custom block driver downstream, which currently calls
bdrv_get_info() (for its file child) in the bdrv_refresh_limits()
callback. However, with graph locking, this doesn't work anymore.
AFAICT, the reason is the following:

The block driver has a backing file option.
During initialization, in bdrv_set_backing_hd(), the graph lock is
acquired exclusively.
Then the bdrv_refresh_limits() callback is invoked.
Now bdrv_get_info() is called, which is a generated co-wrapper.
The bdrv_co_get_info_entry() function tries to acquire the graph lock
for reading, sees that has_writer is true and so the coroutine will be
put to wait, leading to a deadlock.

For my specific case, I can move the bdrv_get_info() call to bdrv_open()
as a workaround. But I wanted to ask if there is a way to make generated
co-wrappers inside an exclusively locked section work? And if not, could
we introduce/extend the annotations, so the compiler can catch this kind
of issue, i.e. calling a generated co-wrapper while in an exclusively
locked section?

Best Regards,
Fiona




Re: [PATCH 1/2] copy-before-write: allow specifying minimum cluster size

2024-03-26 Thread Markus Armbruster
Fiona Ebner  writes:

> Useful to make discard-source work in the context of backup fleecing
> when the fleecing image has a larger granularity than the backup
> target.
>
> Copy-before-write operations will use at least this granularity and in
> particular, discard requests to the source node will too. If the
> granularity is too small, they will just be aligned down in
> cbw_co_pdiscard_snapshot() and thus effectively ignored.
>
> The QAPI uses uint32 so the value will be non-negative, but still fit
> into a uint64_t.
>
> Suggested-by: Vladimir Sementsov-Ogievskiy 
> Signed-off-by: Fiona Ebner 
> ---
>  block/block-copy.c | 17 +
>  block/copy-before-write.c  |  3 ++-
>  include/block/block-copy.h |  1 +
>  qapi/block-core.json   |  8 +++-
>  4 files changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/block/block-copy.c b/block/block-copy.c
> index 7e3b378528..adb1cbb440 100644
> --- a/block/block-copy.c
> +++ b/block/block-copy.c
> @@ -310,6 +310,7 @@ void block_copy_set_copy_opts(BlockCopyState *s, bool 
> use_copy_range,
>  }
>  
>  static int64_t block_copy_calculate_cluster_size(BlockDriverState *target,
> + int64_t min_cluster_size,
>   Error **errp)
>  {
>  int ret;
> @@ -335,7 +336,7 @@ static int64_t 
> block_copy_calculate_cluster_size(BlockDriverState *target,
>  "used. If the actual block size of the target exceeds "
>  "this default, the backup may be unusable",
>  BLOCK_COPY_CLUSTER_SIZE_DEFAULT);
> -return BLOCK_COPY_CLUSTER_SIZE_DEFAULT;
> +return MAX(min_cluster_size, BLOCK_COPY_CLUSTER_SIZE_DEFAULT);

min_cluster_size is int64_t, BLOCK_COPY_CLUSTER_SIZE_DEFAULT is int, and
gets converted to int64_t.  The return type is int64_t.  Okay.

>  } else if (ret < 0 && !target_does_cow) {
>  error_setg_errno(errp, -ret,
>  "Couldn't determine the cluster size of the target image, "
> @@ -345,16 +346,18 @@ static int64_t 
> block_copy_calculate_cluster_size(BlockDriverState *target,
>  return ret;
>  } else if (ret < 0 && target_does_cow) {
>  /* Not fatal; just trudge on ahead. */
> -return BLOCK_COPY_CLUSTER_SIZE_DEFAULT;
> +return MAX(min_cluster_size, BLOCK_COPY_CLUSTER_SIZE_DEFAULT);

Same.

>  }
>  
> -return MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size);
> +return MAX(min_cluster_size,
> +   MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size));

Similar: bdi.cluster_size is int.

>  }
>  
>  BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
>   BlockDriverState *copy_bitmap_bs,
>   const BdrvDirtyBitmap *bitmap,
>   bool discard_source,
> + int64_t min_cluster_size,
>   Error **errp)
>  {
>  ERRP_GUARD();
> @@ -365,7 +368,13 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, 
> BdrvChild *target,
>  
>  GLOBAL_STATE_CODE();
>  
> -cluster_size = block_copy_calculate_cluster_size(target->bs, errp);
> +if (min_cluster_size && !is_power_of_2(min_cluster_size)) {

min_cluster_size is int64_t, is_power_of_2() takes uint64_t.  Bad if
min_cluster_size is negative.  Could this happen?
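The concern can be probed directly with the usual bit trick (a sketch of the standard power-of-two test taking uint64_t, as the QEMU helper does). A negative int64_t converted to uint64_t is almost always rejected, but INT64_MIN (0x8000000000000000) slips through:

```c
#include <stdint.h>

/* The classic test: exactly one bit set. */
static int is_pow2_u64(uint64_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}
```

So if a negative value could ever reach the check, INT64_MIN would pass it and then propagate as a bogus cluster size.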


> +error_setg(errp, "min-cluster-size needs to be a power of 2");
> +return NULL;
> +}
> +
> +cluster_size = block_copy_calculate_cluster_size(target->bs,
> + min_cluster_size, errp);

min_cluster_size is int64_t, block_copy_calculate_cluster_size() takes
int64_t, returns int64_t, and cluster_size is int64_t.  Good.

>  if (cluster_size < 0) {
>  return NULL;
>  }
> diff --git a/block/copy-before-write.c b/block/copy-before-write.c
> index dac57481c5..f9896c6c1e 100644
> --- a/block/copy-before-write.c
> +++ b/block/copy-before-write.c
> @@ -476,7 +476,8 @@ static int cbw_open(BlockDriverState *bs, QDict *options, 
> int flags,
>  
>  s->discard_source = flags & BDRV_O_CBW_DISCARD_SOURCE;
>  s->bcs = block_copy_state_new(bs->file, s->target, bs, bitmap,
> -  flags & BDRV_O_CBW_DISCARD_SOURCE, errp);
> +  flags & BDRV_O_CBW_DISCARD_SOURCE,
> +  opts->min_cluster_size, errp);

@opts is BlockdevOptionsCbw *, opts->min_cluster_size is uint32_t (see
last hunk), block_copy_state_new() takes int64_t: opts->min_cluster_size
gets zero-extended.  Okay.

>  if (!s->bcs) {
>  error_prepend(errp, "Cannot create block-copy-state: ");
>  return -EINVAL;
> diff --git a/include/block/block-copy.h b/include/block/block-copy.h
> index bdc703bacd..77857c6c68 100644
> ---