Re: [PATCH v1] coroutine: avoid inserting duplicate coroutine to co_queue_wakeup

2024-03-27 Thread Eric Blake
On Mon, Mar 25, 2024 at 11:50:41AM -0400, Stefan Hajnoczi wrote:
> On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang wrote:
> > If g_main_loop_run()/aio_poll() is called in the coroutine context,
> > the pending coroutine may be woken up repeatedly, and the co_queue_wakeup
> > may be disordered.
> 
> aio_poll() must not be called from coroutine context:
> 
>   bool no_coroutine_fn aio_poll(AioContext *ctx, bool blocking);
>^^^
> 
> Coroutines are not supposed to block. Instead, they should yield.
> 
> > When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means
> > some listened events is completed. Therefore, the completion callback
> > function is dispatched.
> > 
> > If this callback function needs to invoke aio_co_enter(), it will only
> > wake up the coroutine (because we are already in coroutine context),
> > which may cause that the data on this listening event_fd/socket_fd
> > is not read/cleared. When the next poll () exits, it will be woken up again
> > and inserted into the wakeup queue again.
> > 
> > For example, if TLS is enabled in NBD, the server will call 
> > g_main_loop_run()
> > in the coroutine, and repeatedly wake up the io_read event on a socket.
> > The call stack is as follows:
> > 
> > aio_co_enter()
> > aio_co_wake()
> > qio_channel_restart_read()
> > aio_dispatch_handler()
> > aio_dispatch_handlers()
> > aio_dispatch()
> > aio_ctx_dispatch()
> > g_main_context_dispatch()
> > g_main_loop_run()
> > nbd_negotiate_handle_starttls()
> 
> This code does not look like it was designed to run in coroutine
> context. Two options:
> 
> 1. Don't run it in coroutine context (e.g. use a BH instead). This
>avoids blocking the coroutine but calling g_main_loop_run() is still
>ugly, in my opinion.
> 
> 2. Get rid of data.loop and use coroutine APIs instead:
> 
>while (!data.complete) {
>qemu_coroutine_yield();
>}
> 
>and update nbd_tls_handshake() to call aio_co_wake(data->co) instead
>of g_main_loop_quit(data->loop).
> 
>This requires auditing the code to check whether the event loop might
>invoke something that interferes with
>nbd_negotiate_handle_starttls(). Typically this means monitor
>commands or fd activity that could change the state of this
>connection while it is yielded. This is where the real work is but
>hopefully it will not be that hard to figure out.

I agree that 1) is ugly.  So far, I've added some temporary
assertions, to see when qio_channel_tls_handshake is reached; it looks
like nbd/server.c is calling it from within coroutine context, but
nbd/client.c is calling it from the main loop.  The qio code handles
either, but now I'm stuck in trying to get client.c into having the
right coroutine context; the TLS handshake is done before the usual
BlockDriverState *bs object is available, so I'm not even sure what
aio context, if any, I should be using.  But on my first try, I'm
hitting:

qemu-img: ../util/async.c:707: aio_co_enter: Assertion `self != co' failed.

so I obviously got something wrong.

It may be possible to use block_gen_c to turn nbd_receive_negotiate
and nbd_receive_export_list into co_wrapper functions, if I can audit
that all of their code goes through qio (and is therefore
coroutine-safe), but that work is still ongoing.

At any rate, qemu-iotest 233 should be good for testing that changes
in this area work; I've also been testing with libnbd (test
interop/interop-qemu-nbd-tls-certs hits qemu's server.c) and nbdkit
(test tests/test-tls-psk.sh hits qemu's client.c).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org




[PATCH 2/3] vvfat: Fix usage of `info.file.offset`

2024-03-27 Thread Amjad Alsharafi
The field is marked as "the offset in the file (in clusters)", but it
was being used like this
`cluster_size*(nums)+mapping->info.file.offset`, which is incorrect.

Additionally, removed the `abort` when `first_mapping_index` does not
match, as this matches the case when adding new clusters for files, and
its inevitable that we reach this condition when doing that if the
clusters are not after one another, so there is no reason to `abort`
here, execution continues and the new clusters are written to disk
correctly.

Signed-off-by: Amjad Alsharafi 
---
 block/vvfat.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/block/vvfat.c b/block/vvfat.c
index ab342f0743..cb3ab81e29 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -1408,7 +1408,7 @@ read_cluster_directory:
 
 assert(s->current_fd);
 
-
offset=s->cluster_size*(cluster_num-s->current_mapping->begin)+s->current_mapping->info.file.offset;
+offset=s->cluster_size*((cluster_num - s->current_mapping->begin) + 
s->current_mapping->info.file.offset);
 if(lseek(s->current_fd, offset, SEEK_SET)!=offset)
 return -3;
 s->cluster=s->cluster_buffer;
@@ -1929,8 +1929,8 @@ get_cluster_count_for_direntry(BDRVVVFATState* s, 
direntry_t* direntry, const ch
 (mapping->mode & MODE_DIRECTORY) == 0) {
 
 /* was modified in qcow */
-if (offset != mapping->info.file.offset + s->cluster_size
-* (cluster_num - mapping->begin)) {
+if (offset != s->cluster_size
+* ((cluster_num - mapping->begin) + 
mapping->info.file.offset)) {
 /* offset of this cluster in file chain has changed */
 abort();
 copy_it = 1;
@@ -1944,7 +1944,6 @@ get_cluster_count_for_direntry(BDRVVVFATState* s, 
direntry_t* direntry, const ch
 
 if (mapping->first_mapping_index != first_mapping_index
 && mapping->info.file.offset > 0) {
-abort();
 copy_it = 1;
 }
 
@@ -2404,7 +2403,7 @@ static int commit_mappings(BDRVVVFATState* s,
 (mapping->end - mapping->begin);
 } else
 next_mapping->info.file.offset = mapping->info.file.offset +
-mapping->end - mapping->begin;
+(mapping->end - mapping->begin);
 
 mapping = next_mapping;
 }
-- 
2.44.0




[PATCH 0/3] vvfat: Fix bugs in writing and reading files

2024-03-27 Thread Amjad Alsharafi
These patches fix some bugs found when modifying files in vvfat.

First, there was a bug when writing to the second+ cluster of a file, it
will copy the cluster before it instead.

Another issue was modifying the clusters of a file and adding new
clusters, this showed 2 issues:
- If the new cluster is not immediately after the last cluster, it will
  cause issues when reading from this file in the future.
- Generally, the usage of info.file.offset was incorrect, and the
  system would crash on abort() when the file is modified and a new
  cluster was added.





[PATCH 1/3] vvfat: Fix bug in writing to middle of file

2024-03-27 Thread Amjad Alsharafi
Before this commit, the behavior when calling `commit_one_file` for
example with `offset=0x2000` (second cluster), what will happen is that
we won't fetch the next cluster from the fat, and instead use the first
cluster for the read operation.

This is due to off-by-one error here, where `i=0x2000 !< offset=0x2000`,
thus not fetching the next cluster.

Signed-off-by: Amjad Alsharafi 
---
 block/vvfat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/vvfat.c b/block/vvfat.c
index 9d050ba3ae..ab342f0743 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -2525,7 +2525,7 @@ commit_one_file(BDRVVVFATState* s, int dir_index, 
uint32_t offset)
 return -1;
 }
 
-for (i = s->cluster_size; i < offset; i += s->cluster_size)
+for (i = s->cluster_size; i <= offset; i += s->cluster_size)
 c = modified_fat_get(s, c);
 
 fd = qemu_open_old(mapping->path, O_RDWR | O_CREAT | O_BINARY, 0666);
-- 
2.44.0




[PATCH 3/3] ffvat: Fix reading files with non-continuous clusters

2024-03-27 Thread Amjad Alsharafi
When reading with `read_cluster` we get the `mapping` with
`find_mapping_for_cluster` and then we call `open_file` for this
mapping.
The issue appear when its the same file, but a second cluster that is
not immediately after it, imagine clusters `500 -> 503`, this will give
us 2 mappings one has the range `500..501` and another `503..504`, both
point to the same file, but different offsets.

When we don't open the file since the path is the same, we won't assign
`s->current_mapping` and thus accessing way out of bound of the file.

>From our example above, after `open_file` (that didn't open anything) we
will get the offset into the file with
`s->cluster_size*(cluster_num-s->current_mapping->begin)`, which will
give us `0x2000 * (504-500)`, which is out of bound for this mapping and
will produce some issues.

Signed-off-by: Amjad Alsharafi 
---
 block/vvfat.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/block/vvfat.c b/block/vvfat.c
index cb3ab81e29..87165abc26 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -1360,15 +1360,22 @@ static int open_file(BDRVVVFATState* s,mapping_t* 
mapping)
 {
 if(!mapping)
 return -1;
+int new_path = 1;
 if(!s->current_mapping ||
-strcmp(s->current_mapping->path,mapping->path)) {
-/* open file */
-int fd = qemu_open_old(mapping->path,
+
s->current_mapping->first_mapping_index!=mapping->first_mapping_index ||
+(new_path = strcmp(s->current_mapping->path,mapping->path))) {
+
+if (new_path) {
+/* open file */
+int fd = qemu_open_old(mapping->path,
O_RDONLY | O_BINARY | O_LARGEFILE);
-if(fd<0)
-return -1;
-vvfat_close_current_file(s);
-s->current_fd = fd;
+if(fd<0)
+return -1;
+vvfat_close_current_file(s);
+
+s->current_fd = fd;
+}
+assert(s->current_fd);
 s->current_mapping = mapping;
 }
 return 0;
-- 
2.44.0




[PATCH] block: Remove unnecessary NULL check in bdrv_pad_request()

2024-03-27 Thread Kevin Wolf
Coverity complains that the check introduced in commit 3f934817 suggests
that qiov could be NULL and we dereference it before reaching the check.
In fact, all of the callers pass a non-NULL pointer, so just remove the
misleading check.

Resolves: Coverity CID 1542668
Signed-off-by: Kevin Wolf 
---
 block/io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index 395bea3bac..7217cf811b 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1730,7 +1730,7 @@ static int bdrv_pad_request(BlockDriverState *bs,
  * For prefetching in stream_populate(), no qiov is passed along, because
  * only copy-on-read matters.
  */
-if (qiov && *qiov) {
+if (*qiov) {
 sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
   _head, _tail,
   _niov);
-- 
2.44.0




bdrv_pad_request() Coverity report

2024-03-27 Thread Stefan Hajnoczi
Hi Fiona,
The Coverity static checker sent a report about commit 3f934817c82c
("block/io: accept NULL qiov in bdrv_pad_request").

Please take a look and send a follow-up patch, if necessary:

*** CID 1542668:  Null pointer dereferences  (REVERSE_INULL)
/builds/qemu-project/qemu/bloc
k/io.c: 1733 in bdrv_pad_request()
1727 }
1728
1729 /*
1730  * For prefetching in stream_populate(), no qiov is
passed along, because
1731  * only copy-on-read matters.
1732  */
>>> CID 1542668:  Null pointer dereferences  (REVERSE_INULL)
>>> Null-checking "qiov" suggests that it may be null, but it has already 
>>> been dereferenced on all paths leading to the check.
1733 if (qiov && *qiov) {
1734 sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
1735   _head, _tail,
1736   _niov);
1737
1738 /* Guaranteed by bdrv_check_request32() */

Thanks,
Stefan



Re: [PATCH-for-9.0 v2 16/19] hw/xen/xen_pt: Add missing license

2024-03-27 Thread Anthony PERARD
On Tue, Nov 14, 2023 at 03:38:12PM +0100, Philippe Mathieu-Daudé wrote:
> Commit eaab4d60d3 ("Introduce Xen PCI Passthrough, qdevice")
> introduced both xen_pt.[ch], but only added the license to
> xen_pt.c. Use the same license for xen_pt.h.
> 
> Suggested-by: David Woodhouse 
> Signed-off-by: Philippe Mathieu-Daudé 

Fine by me. Looks like there was a license header before:
https://xenbits.xen.org/gitweb/?p=qemu-xen-unstable.git;a=blob;f=hw/pass-through.h;h=0b5822414e24d199a064abccc4d378dcaf569bd6;hb=HEAD
I don't know why I didn't copied it over here.

Acked-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [PATCH-for-9.0 v2 15/19] hw/xen: Reduce inclusion of 'cpu.h' to target-specific sources

2024-03-27 Thread Anthony PERARD
On Tue, Nov 14, 2023 at 03:38:11PM +0100, Philippe Mathieu-Daudé wrote:
> We rarely need to include "cpu.h" in headers. Including it
> 'taint' headers to be target-specific. Here only the i386/arm
> implementations requires "cpu.h", so include it there and
> remove from the "hw/xen/xen-hvm-common.h" *common* header.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> Reviewed-by: Richard Henderson 
> Reviewed-by: David Woodhouse 

Acked-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [RFC PATCH-for-9.0 v2 13/19] hw/xen: Remove use of 'target_ulong' in handle_ioreq()

2024-03-27 Thread Anthony PERARD
On Tue, Nov 14, 2023 at 03:38:09PM +0100, Philippe Mathieu-Daudé wrote:
> Per commit f17068c1c7 ("xen-hvm: reorganize xen-hvm and move common
> function to xen-hvm-common"), handle_ioreq() is expected to be
> target-agnostic. However it uses 'target_ulong', which is a target
> specific definition.
> 
> Per xen/include/public/hvm/ioreq.h header:
> 
>   struct ioreq {
> uint64_t addr;  /* physical address */
> uint64_t data;  /* data (or paddr of data) */
> uint32_t count; /* for rep prefixes */
> uint32_t size;  /* size in bytes */
> uint32_t vp_eport;  /* evtchn for notifications to/from device model 
> */
> uint16_t _pad0;
> uint8_t state:4;
> uint8_t data_is_ptr:1;  /* if 1, data above is the guest paddr
>  * of the real data to use. */
> uint8_t dir:1;  /* 1=read, 0=write */
> uint8_t df:1;
> uint8_t _pad1:1;
> uint8_t type;   /* I/O type */
>   };
>   typedef struct ioreq ioreq_t;
> 
> If 'data' is not a pointer, it is a u64.
> 
> - In PIO / VMWARE_PORT modes, only 32-bit are used.
> 
> - In MMIO COPY mode, memory is accessed by chunks of 64-bit

Looks like it could also be 8, 16, or 32 as well, depending on
req->size.

> - In PCI_CONFIG mode, access is u8 or u16 or u32.
> 
> - None of TIMEOFFSET / INVALIDATE use 'req'.
> 
> - Fallback is only used in x86 for VMWARE_PORT.
> 
> Masking the upper bits of 'data' to keep 'req->size' low bits
> is irrelevant of the target word size. Remove the word size
> check and always extract the relevant bits.

When building QEMU for Xen, we tend to build the target "i386-softmmu",
which looks like to have target_ulong == uint32_t. So the `data`
clamping would only apply to size 8 and 16. The clamping with
target_ulong was introduce long time ago, here:
https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=b4a663b87df3954557434a2d31bff7f6b2706ec1
and they were more IOREQ types.
So my guess is it isn't relevant anymore, but extending the clamping to
32-bits request should be fine, when using qemu-system-i386 that is, as
it is already be done if one use qemu-system-x86_64.

So I think the patch is fine, and the tests I've ran so far worked fine.

> Signed-off-by: Philippe Mathieu-Daudé 

Reviewed-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()

2024-03-27 Thread David Hildenbrand


So I thought that for now we only support the "anonymous" mode, and if
in the future we have a use case where the user wants to specify the
name, we can add it later.


Okay, so for now you really only want an anonymous fd that behaves like
a memfd, got it.

Likely we should somehow fail if the "POSIX shared memory object"
already exists. Hmm.


`O_CREAT | O_EXCL` should provide just this behavior.
  From shm_open(3) manpage:

  O_EXCL  If O_CREAT was also specified, and a shared memory object
  with the given name already exists, return an error.  The
  check for the existence of the object, and its creation if
  it does not exist, are performed atomically.



Cool!





That said, if you think it's already useful from the beginning, I can
add the name as an optional parameter.



I'm also not quite sure if "host_memory_backend_get_name()" should be
used for the purpose here.


What problem do you see? As an alternative I thought of a static
counter.


If it's really just an "internal UUID", as we'll remove the file using
shm_unlink(), then the name in fact shouldn't matter and this would be
fine. Correct?


Right. It's a name that will only "appear" in the system for a very
short time since we call shm_unlink() right after shm_open(). So I just
need the unique name to prevent several QEMUs launched at the same time
from colliding with the name.


Okay, essentially emulating tmpfile(), but for weird shmem_open() :)





So I assume if we ever want to have non-anonymous fds here, we'd pass
in a new property "name", and change the logic how we open/unlink.
Makes sense.


Exactly! I can clarify this in the commit description as well.

Thanks for the helpful comments!
If there is anything I need to change for v3, please tell me ;-)


Besides some patch description extension, makes sense to me :)

--
Cheers,

David / dhildenb




Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()

2024-03-27 Thread Stefano Garzarella

On Wed, Mar 27, 2024 at 12:51:51PM +0100, David Hildenbrand wrote:

On 27.03.24 11:23, Stefano Garzarella wrote:

On Tue, Mar 26, 2024 at 03:45:52PM +0100, David Hildenbrand wrote:

+mode = 0;
+oflag = O_RDWR | O_CREAT | O_EXCL;
+backend_name = host_memory_backend_get_name(backend);
+
+/*
+ * Some operating systems allow creating anonymous POSIX shared memory
+ * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
+ * defined by POSIX, so let's create a unique name.
+ *
+ * From Linux's shm_open(3) man-page:
+ *   For  portable  use,  a shared  memory  object should be identified
+ *   by a name of the form /somename;"
+ */
+g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(),
+backend_name);


Hm, shouldn't we just let the user specify a name, and if no name was
specified, generate one ourselves?


I thought about it and initially did it that way, but then some problems
came up so I tried to keep it as simple as possible for the user and for
our use case (having an fd associated with memory and sharing it with
other processes).

The problems I had were:

- what mode_t to use if the object does not exist and needs to be
   created? >
- exclude O_EXCL if the user passes the name since they may have already
   created it?


I'd handle both like we handle files. But I understand now that you 
really just want to "simulate" memfd.


Right, maybe I should write that in the commit message and 
documentation.






- call shm_unlink() only at object deallocation?


Right, we don't really do that for ordinary files, they keep existing. 
We only have the "discard-data" property to free up memory.


For memfd, it just happens "automatically". They cannot be looked up 
by name, and once the last user closes the fd, it gets destroyed.


Yep, I see.





So I thought that for now we only support the "anonymous" mode, and if
in the future we have a use case where the user wants to specify the
name, we can add it later.


Okay, so for now you really only want an anonymous fd that behaves like 
a memfd, got it.


Likely we should somehow fail if the "POSIX shared memory object" 
already exists. Hmm.


`O_CREAT | O_EXCL` should provide just this behavior.
From shm_open(3) manpage:

O_EXCL  If O_CREAT was also specified, and a shared memory object
with the given name already exists, return an error.  The
check for the existence of the object, and its creation if
it does not exist, are performed atomically.





That said, if you think it's already useful from the beginning, I can
add the name as an optional parameter.



I'm also not quite sure if "host_memory_backend_get_name()" should be
used for the purpose here.


What problem do you see? As an alternative I thought of a static
counter.


If it's really just an "internal UUID", as we'll remove the file using 
shm_unlink(), then the name in fact shouldn't matter and this would be 
fine. Correct?


Right. It's a name that will only "appear" in the system for a very 
short time since we call shm_unlink() right after shm_open(). So I just 
need the unique name to prevent several QEMUs launched at the same time 
from colliding with the name.




So I assume if we ever want to have non-anonymous fds here, we'd pass 
in a new property "name", and change the logic how we open/unlink.  
Makes sense.


Exactly! I can clarify this in the commit description as well.

Thanks for the helpful comments!
If there is anything I need to change for v3, please tell me ;-)

Stefano




Re: [PATCH-for-9.1] rdma: Remove RDMA subsystem and pvrdma device

2024-03-27 Thread Fabiano Rosas
Philippe Mathieu-Daudé  writes:

> The whole RDMA subsystem was deprecated in commit e9a54265f5
> ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
> released in v8.2. Time to remove it.
>
> Keep the RAM_SAVE_FLAG_HOOK definition since it might appears
> in old migration streams.
>
> Remove the dependencies on libibumad and libibverbs.
>
> Remove the generated vmw_pvrdma/ directory from linux-headers.
>
> Remove RDMA handling from migration.
>
> Remove RDMA handling in GlusterFS block driver.
>
> Remove rdmacm-mux tool from contrib/.
>
> Remove PVRDMA device.
>
> Cc: Peter Xu 
> Cc: Li Zhijian 
> Cc: Yuval Shaia 
> Cc: Marcel Apfelbaum 
> Signed-off-by: Philippe Mathieu-Daudé 

Acked-by: Fabiano Rosas 



Re: [PATCH-for-9.0 v2 12/19] hw/xen: Merge 'hw/xen/arch_hvm.h' in 'hw/xen/xen-hvm-common.h'

2024-03-27 Thread Anthony PERARD
On Tue, Nov 14, 2023 at 03:38:08PM +0100, Philippe Mathieu-Daudé wrote:
> We don't need a target-specific header for common target-specific
> prototypes. Declare xen_arch_handle_ioreq() and xen_arch_set_memory()
> in "hw/xen/xen-hvm-common.h".
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> Reviewed-by: David Woodhouse 
> Reviewed-by: Richard Henderson 

Acked-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [PATCH-for-9.0 v2 11/19] hw/xen/xen_arch_hvm: Rename prototypes using 'xen_arch_' prefix

2024-03-27 Thread Anthony PERARD
On Tue, Nov 14, 2023 at 03:38:07PM +0100, Philippe Mathieu-Daudé wrote:
> Use a common 'xen_arch_' prefix for architecture-specific functions.
> Rename xen_arch_set_memory() and xen_arch_handle_ioreq().
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> Reviewed-by: David Woodhouse 
> Reviewed-by: Richard Henderson 

Acked-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [RFC PATCH-for-9.0 v2 09/19] hw/block/xen_blkif: Align structs with QEMU_ALIGNED() instead of #pragma

2024-03-27 Thread David Woodhouse
On 27 March 2024 13:31:52 GMT, Anthony PERARD  wrote:
>On Tue, Nov 14, 2023 at 03:38:05PM +0100, Philippe Mathieu-Daudé wrote:
>> Except imported source files, QEMU code base uses
>> the QEMU_ALIGNED() macro to align its structures.
>
>This patch only convert the alignment, but discard pack. We need both or
>the struct is changed.

Which means we need some build-time asserts on struct size and field offsets. 
That should never have passed a build test.




Re: [RFC PATCH-for-9.0 v2 09/19] hw/block/xen_blkif: Align structs with QEMU_ALIGNED() instead of #pragma

2024-03-27 Thread Anthony PERARD
On Tue, Nov 14, 2023 at 03:38:05PM +0100, Philippe Mathieu-Daudé wrote:
> Except imported source files, QEMU code base uses
> the QEMU_ALIGNED() macro to align its structures.

This patch only convert the alignment, but discard pack. We need both or
the struct is changed.

> ---
>  hw/block/xen_blkif.h | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/block/xen_blkif.h b/hw/block/xen_blkif.h
> index 99733529c1..c1d154d502 100644
> --- a/hw/block/xen_blkif.h
> +++ b/hw/block/xen_blkif.h
> @@ -18,7 +18,6 @@ struct blkif_common_response {
>  };
>  
>  /* i386 protocol version */
> -#pragma pack(push, 4)
>  struct blkif_x86_32_request {
>  uint8_toperation;/* BLKIF_OP_??? 
> */
>  uint8_tnr_segments;  /* number of segments   
> */
> @@ -26,7 +25,7 @@ struct blkif_x86_32_request {
>  uint64_t   id;   /* private guest value, echoed in resp  
> */
>  blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  
> */
>  struct blkif_request_segment seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> -};
> +} QEMU_ALIGNED(4);

E.g. for this one, I've compare the output of
`pahole --class_name=blkif_x86_32_request build/qemu-system-i386`:

--- before
+++ after
@@ -1,11 +1,15 @@
 struct blkif_x86_32_request {
uint8_toperation;/* 0 1 */
uint8_tnr_segments;  /* 1 1 */
uint16_t   handle;   /* 2 2 */
-   uint64_t   id;   /* 4 8 */
-   uint64_t   sector_number;/*12 8 */
-   struct blkif_request_segment seg[11];/*2088 */

-   /* size: 108, cachelines: 2, members: 6 */
-   /* last cacheline: 44 bytes */
-} __attribute__((__packed__));
+   /* XXX 4 bytes hole, try to pack */
+
+   uint64_t   id;   /* 8 8 */
+   uint64_t   sector_number;/*16 8 */
+   struct blkif_request_segment seg[11];/*2488 */
+
+   /* size: 112, cachelines: 2, members: 6 */
+   /* sum members: 108, holes: 1, sum holes: 4 */
+   /* last cacheline: 48 bytes */
+} __attribute__((__aligned__(8)));

Thanks,

-- 
Anthony PERARD



Re: [PATCH-for-9.0 v2 05/19] hw/display: Restrict xen_register_framebuffer() call to Xen

2024-03-27 Thread Anthony PERARD
On Tue, Nov 14, 2023 at 03:38:01PM +0100, Philippe Mathieu-Daudé wrote:
> Only call xen_register_framebuffer() when Xen is enabled.
> 
> Signed-off-by: Philippe Mathieu-Daudé 

I don't think this patch is very useful but it's fine, so:
Reviewed-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [PATCH-for-9.0 v2 08/19] hw/xen: Remove unused Xen stubs

2024-03-27 Thread Anthony PERARD
On Tue, Nov 14, 2023 at 03:38:04PM +0100, Philippe Mathieu-Daudé wrote:
> All these stubs are protected by a 'if (xen_enabled())' check.

Are you sure? There's still nothing that prevent a compiler from wanting
those, I don't think.

Sure, often compilers will remove dead code in `if(0){...}`, but there's
no guaranty, is there?

Cheers,

-- 
Anthony PERARD



Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()

2024-03-27 Thread David Hildenbrand

On 27.03.24 11:23, Stefano Garzarella wrote:

On Tue, Mar 26, 2024 at 03:45:52PM +0100, David Hildenbrand wrote:

+mode = 0;
+oflag = O_RDWR | O_CREAT | O_EXCL;
+backend_name = host_memory_backend_get_name(backend);
+
+/*
+ * Some operating systems allow creating anonymous POSIX shared memory
+ * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
+ * defined by POSIX, so let's create a unique name.
+ *
+ * From Linux's shm_open(3) man-page:
+ *   For  portable  use,  a shared  memory  object should be identified
+ *   by a name of the form /somename;"
+ */
+g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(),
+backend_name);


Hm, shouldn't we just let the user specify a name, and if no name was
specified, generate one ourselves?


I thought about it and initially did it that way, but then some problems
came up so I tried to keep it as simple as possible for the user and for
our use case (having an fd associated with memory and sharing it with
other processes).

The problems I had were:

- what mode_t to use if the object does not exist and needs to be
created? >
- exclude O_EXCL if the user passes the name since they may have already
created it?


I'd handle both like we handle files. But I understand now that you 
really just want to "simulate" memfd.




- call shm_unlink() only at object deallocation?


Right, we don't really do that for ordinary files, they keep existing. 
We only have the "discard-data" property to free up memory.


For memfd, it just happens "automatically". They cannot be looked up by 
name, and once the last user closes the fd, it gets destroyed.




So I thought that for now we only support the "anonymous" mode, and if
in the future we have a use case where the user wants to specify the
name, we can add it later.


Okay, so for now you really only want an anonymous fd that behaves like 
a memfd, got it.


Likely we should somehow fail if the "POSIX shared memory object" 
already exists. Hmm.




That said, if you think it's already useful from the beginning, I can
add the name as an optional parameter.



I'm also not quite sure if "host_memory_backend_get_name()" should be
used for the purpose here.


What problem do you see? As an alternative I thought of a static
counter.


If it's really just an "internal UUID", as we'll remove the file using 
shm_unlink(), then the name in fact shouldn't matter and this would be 
fine. Correct?



So I assume if we ever want to have non-anonymous fds here, we'd pass in 
a new property "name", and change the logic how we open/unlink. Makes sense.


--
Cheers,

David / dhildenb




Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()

2024-03-27 Thread Stefano Garzarella

On Tue, Mar 26, 2024 at 03:45:52PM +0100, David Hildenbrand wrote:

+mode = 0;
+oflag = O_RDWR | O_CREAT | O_EXCL;
+backend_name = host_memory_backend_get_name(backend);
+
+/*
+ * Some operating systems allow creating anonymous POSIX shared memory
+ * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
+ * defined by POSIX, so let's create a unique name.
+ *
+ * From Linux's shm_open(3) man-page:
+ *   For  portable  use,  a shared  memory  object should be identified
+ *   by a name of the form /somename;"
+ */
+g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(),
+backend_name);


Hm, shouldn't we just let the user specify a name, and if no name was 
specified, generate one ourselves?


I thought about it and initially did it that way, but then some problems 
came up so I tried to keep it as simple as possible for the user and for 
our use case (having an fd associated with memory and sharing it with 
other processes).


The problems I had were:

- what mode_t to use if the object does not exist and needs to be 
  created?


- exclude O_EXCL if the user passes the name since they may have already 
  created it?


- call shm_unlink() only at object deallocation?

So I thought that for now we only support the "anonymous" mode, and if 
in the future we have a use case where the user wants to specify the 
name, we can add it later.


That said, if you think it's already useful from the beginning, I can 
add the name as an optional parameter.




I'm also not quite sure if "host_memory_backend_get_name()" should be 
used for the purpose here.


What problem do you see? As an alternative I thought of a static 
counter.


Thanks,
Stefano




Re: [PATCH for-9.1 v2 04/11] vhost-user-server: don't abort if we can't set fd non-blocking

2024-03-27 Thread Stefano Garzarella

On Tue, Mar 26, 2024 at 09:40:12AM -0500, Eric Blake wrote:

On Tue, Mar 26, 2024 at 02:39:29PM +0100, Stefano Garzarella wrote:

In vhost-user-server we set all fd received from the other peer
in non-blocking mode. For some of them (e.g. memfd, shm_open, etc.)
if we fail, it's not really a problem, because we don't use these
fd with blocking operations, but only to map memory.

In these cases a failure is not bad, so let's just report a warning
instead of panicking if we fail to set some fd in non-blocking mode.

This for example occurs in macOS where setting shm_open() fd
non-blocking is failing (errno: 25).


What is errno 25 on MacOS?


It should be ENOTTY.
I'll add in the commit description.





Signed-off-by: Stefano Garzarella 
---
 util/vhost-user-server.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
index 3bfb1ad3ec..064999f0b7 100644
--- a/util/vhost-user-server.c
+++ b/util/vhost-user-server.c
@@ -66,7 +66,11 @@ static void vmsg_unblock_fds(VhostUserMsg *vmsg)
 {
 int i;
 for (i = 0; i < vmsg->fd_num; i++) {
-qemu_socket_set_nonblock(vmsg->fds[i]);
+int ret = qemu_socket_try_set_nonblock(vmsg->fds[i]);
+if (ret) {


Should this be 'if (ret < 0)'?


I was confused by the assert() in qemu_socket_set_nonblock():
void qemu_socket_set_nonblock(int fd)
{
int f;
f = qemu_socket_try_set_nonblock(fd);
assert(f == 0);
}

BTW, I see most of the code checks ret < 0, so I'll fix it.




+warn_report("Failed to set fd %d nonblock for request %d: %s",
+vmsg->fds[i], vmsg->request, strerror(-ret));
+}


This now ignores all errors even on pre-existing fds where we NEED
non-blocking, rather than just the specific (expected) error we are
seeing on MacOS.  Should this code be a bit more precise about
checking that -ret == EXXX for the expected errno value we are
ignoring for the specific fds where non-blocking is not essential?


Good point, maybe I'll just avoid calling vmsg_unblock_fds() when the
message is VHOST_USER_ADD_MEM_REG or VHOST_USER_SET_MEM_TABLE.

These should be the cases where carried fds are used for mmap() and so
there is no need to mark them non-blocking.

WDYT?

Stefano




Re: [PATCH for-9.1 v2 02/11] libvhost-user: fail vu_message_write() if sendmsg() is failing

2024-03-27 Thread David Hildenbrand


I was wondering if we could see some partial sendmsg()/write
succeeding. Meaning, we transferred some bytes but not all, and we'd
actually need to loop ...


Yep, true, but I would fix it in another patch/series if you agree.


Absolutely.

--
Cheers,

David / dhildenb




Re: [PATCH for-9.1 v2 03/11] libvhost-user: mask F_INFLIGHT_SHMFD if memfd is not supported

2024-03-27 Thread Stefano Garzarella

On Tue, Mar 26, 2024 at 09:36:54AM -0500, Eric Blake wrote:

On Tue, Mar 26, 2024 at 02:39:28PM +0100, Stefano Garzarella wrote:

libvhost-user will panic when receiving VHOST_USER_GET_INFLIGHT_FD
message if MFD_ALLOW_SEALING is not defined, since it's not able
to create a memfd.

VHOST_USER_GET_INFLIGHT_FD is used only if
VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD is negotiated. So, let's mask
that feature if the backend is not able to properly handle these
messages.

Signed-off-by: Stefano Garzarella 
---
 subprojects/libvhost-user/libvhost-user.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index a11afd1960..1c361ffd51 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -1674,6 +1674,16 @@ vu_get_protocol_features_exec(VuDev *dev, VhostUserMsg 
*vmsg)
 features |= dev->iface->get_protocol_features(dev);
 }

+/*
+ * If MFD_ALLOW_SEALING is not defined, we are not able to handle
+ * VHOST_USER_GET_INFLIGHT_FD messages, since we can't create a memfd.
+ * Those messages are used only if VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD
+ * is negotiated. A device implementation can enable it, so let's mask
+ * it to avoid a runtime panic.
+ */
+#ifndef MFD_ALLOW_SEALING
+features &= ~(1ULL << VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD);
+#endif


Masking the feature out of advertisement is obviously correct. But
should we also fix the code for handling
VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD to return an error to any client
that requests it in error when the feature was not advertised, instead
of panicking?


Totally agree!

Do I send a separate patch from this series or include it in this
series?
I would do the former because this one is already long enough.

Thanks,
Stefano




Re: [PATCH for-9.1 v2 02/11] libvhost-user: fail vu_message_write() if sendmsg() is failing

2024-03-27 Thread Stefano Garzarella

On Tue, Mar 26, 2024 at 03:36:52PM +0100, David Hildenbrand wrote:

On 26.03.24 15:34, Eric Blake wrote:

On Tue, Mar 26, 2024 at 02:39:27PM +0100, Stefano Garzarella wrote:

In vu_message_write() we use sendmsg() to send the message header,
then a write() to send the payload.

If sendmsg() fails we should avoid sending the payload, since we
were unable to send the header.

Discovered before fixing the issue with the previous patch, where
sendmsg() failed on macOS due to wrong parameters, but the frontend
still sent the payload which the backend incorrectly interpreted
as a wrong header.

Signed-off-by: Stefano Garzarella 
---
 subprojects/libvhost-user/libvhost-user.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index 22bea0c775..a11afd1960 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -639,6 +639,11 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg 
*vmsg)
 rc = sendmsg(conn_fd, , 0);
 } while (rc < 0 && (errno == EINTR || errno == EAGAIN));
+if (rc <= 0) {


Is rejecting a 0 return value correct?  Technically, a 0 return means
a successful write of 0 bytes - but then again, it is unwise to
attempt to send an fd or other auxilliary ddata without at least one
regular byte, so we should not be attempting a write of 0 bytes.  So I
guess this one is okay, although I might have used < instead of <=.


I blindly copied what they already do at the end of the same function.
However, your observation is correct. That said they have a sendmsg()
to send the header. After this we have a write() loop to send the
payload.

Now if sendmsg() returns 0, but we have not sent all the header, what
should we do? Try sendmsg() again? For how many times?



I was wondering if we could see some partial sendmsg()/write 
succeeding. Meaning, we transferred some bytes but not all, and we'd 
actually need to loop ...


Yep, true, but I would fix it in another patch/series if you agree.

Thanks,
Stefano