Re: [PATCH-for-9.1] rdma: Remove RDMA subsystem and pvrdma device
On 27/03/2024 11.55, Philippe Mathieu-Daudé wrote:

The whole RDMA subsystem was deprecated in commit e9a54265f5
("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
released in v8.2. Time to remove it.

Keep the RAM_SAVE_FLAG_HOOK definition since it might appear
in old migration streams.

Remove the dependencies on libibumad and libibverbs.
Remove the generated vmw_pvrdma/ directory from linux-headers.
Remove RDMA handling from migration.
Remove RDMA handling in GlusterFS block driver.
Remove rdmacm-mux tool from contrib/.
Remove PVRDMA device.

Cc: Peter Xu
Cc: Li Zhijian
Cc: Yuval Shaia
Cc: Marcel Apfelbaum
Signed-off-by: Philippe Mathieu-Daudé
---
 MAINTAINERS                                    |   17 -
 docs/about/deprecated.rst                      |    9 -
 docs/about/removed-features.rst                |    4 +
 docs/devel/migration/main.rst                  |    6 -
 docs/pvrdma.txt                                |  345 --
 docs/rdma.txt                                  |  420 --
 docs/system/device-url-syntax.rst.inc          |    4 +-
 docs/system/loongarch/virt.rst                 |    2 +-
 docs/system/qemu-block-drivers.rst.inc         |    1 -
 meson.build                                    |   59 -
 qapi/machine.json                              |   17 -
 qapi/migration.json                            |   31 +-
 qapi/qapi-schema.json                          |    1 -
 qapi/rdma.json                                 |   38 -
 contrib/rdmacm-mux/rdmacm-mux.h                |   61 -
 hw/rdma/rdma_backend.h                         |  129 -
 hw/rdma/rdma_backend_defs.h                    |   76 -
 hw/rdma/rdma_rm.h                              |   97 -
 hw/rdma/rdma_rm_defs.h                         |  146 -
 hw/rdma/rdma_utils.h                           |   63 -
 hw/rdma/trace.h                                |    1 -
 hw/rdma/vmw/pvrdma.h                           |  144 -
 hw/rdma/vmw/pvrdma_dev_ring.h                  |   46 -
 hw/rdma/vmw/pvrdma_qp_ops.h                    |   28 -
 hw/rdma/vmw/trace.h                            |    1 -
 include/hw/rdma/rdma.h                         |   37 -
 include/monitor/hmp.h                          |    1 -
 .../infiniband/hw/vmw_pvrdma/pvrdma_dev_api.h  |  685 ---
 .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.h    |  348 --
 .../standard-headers/rdma/vmw_pvrdma-abi.h     |  310 --
 migration/migration-stats.h                    |    6 +-
 migration/migration.h                          |    9 -
 migration/options.h                            |    2 -
 migration/rdma.h                               |   69 -
 block/gluster.c                                |   39 -
 contrib/rdmacm-mux/main.c                      |  831
 hw/core/machine-qmp-cmds.c                     |   32 -
 hw/rdma/rdma.c                                 |   30 -
 hw/rdma/rdma_backend.c                         | 1401 --
 hw/rdma/rdma_rm.c                              |  812
 hw/rdma/rdma_utils.c                           |  126 -
 hw/rdma/vmw/pvrdma_cmd.c                       |  815
 hw/rdma/vmw/pvrdma_dev_ring.c                  |  141 -
 hw/rdma/vmw/pvrdma_main.c                      |  735 ---
 hw/rdma/vmw/pvrdma_qp_ops.c                    |  298 --
 migration/migration-stats.c                    |    5 +-
 migration/migration.c                          |   31 -
 migration/options.c                            |   16 -
 migration/qemu-file.c                          |    1 -
 migration/ram.c                                |   86 +-
 migration/rdma.c                               | 4184 -
 migration/savevm.c                             |    2 +-
 monitor/qmp-cmds.c                             |    1 -
 Kconfig.host                                   |    3 -
 contrib/rdmacm-mux/meson.build                 |    7 -
 hmp-commands-info.hx                           |   13 -
 hw/Kconfig                                     |    1 -
 hw/meson.build                                 |    1 -
 hw/rdma/Kconfig                                |    3 -
 hw/rdma/meson.build                            |   12 -
 hw/rdma/trace-events                           |   31 -
 hw/rdma/vmw/trace-events                       |   17 -
 meson_options.txt                              |    4 -
 migration/meson.build                          |    1 -
 migration/trace-events                         |   68 +-
 qapi/meson.build                               |    1 -
 qemu-options.hx                                |    6 -
 .../ci/org.centos/stream/8/x86_64/configure    |    1 -
 scripts/ci/setup/build-environment.yml         |    2 -
 scripts/coverity-scan/run-coverity-scan        |    2 +-
 scripts/meson-buildoptions.sh                  |    6 -
 scripts/update-linux-headers.sh                |   27 -
 tests/lcitool/projects/qemu.yml                |    2 -
 tests/migration/guestperf/engine.py            |    4 +-
 74 files changed, 20 insertions(+), 12991 deletions(-)
Re: [PATCH v1] coroutine: avoid inserting duplicate coroutine to co_queue_wakeup
On Mon, Mar 25, 2024 at 11:50:41AM -0400, Stefan Hajnoczi wrote: > On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang wrote: > > If g_main_loop_run()/aio_poll() is called in the coroutine context, > > the pending coroutine may be woken up repeatedly, and the co_queue_wakeup > > may be disordered. > > aio_poll() must not be called from coroutine context: > > bool no_coroutine_fn aio_poll(AioContext *ctx, bool blocking); >^^^ > > Coroutines are not supposed to block. Instead, they should yield. > > > When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means > > some listened events is completed. Therefore, the completion callback > > function is dispatched. > > > > If this callback function needs to invoke aio_co_enter(), it will only > > wake up the coroutine (because we are already in coroutine context), > > which may cause that the data on this listening event_fd/socket_fd > > is not read/cleared. When the next poll () exits, it will be woken up again > > and inserted into the wakeup queue again. > > > > For example, if TLS is enabled in NBD, the server will call > > g_main_loop_run() > > in the coroutine, and repeatedly wake up the io_read event on a socket. > > The call stack is as follows: > > > > aio_co_enter() > > aio_co_wake() > > qio_channel_restart_read() > > aio_dispatch_handler() > > aio_dispatch_handlers() > > aio_dispatch() > > aio_ctx_dispatch() > > g_main_context_dispatch() > > g_main_loop_run() > > nbd_negotiate_handle_starttls() > > This code does not look like it was designed to run in coroutine > context. Two options: > > 1. Don't run it in coroutine context (e.g. use a BH instead). This >avoids blocking the coroutine but calling g_main_loop_run() is still >ugly, in my opinion. > > 2. Get rid of data.loop and use coroutine APIs instead: > >while (!data.complete) { >qemu_coroutine_yield(); >} > >and update nbd_tls_handshake() to call aio_co_wake(data->co) instead >of g_main_loop_quit(data->loop). 
> >This requires auditing the code to check whether the event loop might >invoke something that interferes with >nbd_negotiate_handle_starttls(). Typically this means monitor >commands or fd activity that could change the state of this >connection while it is yielded. This is where the real work is but >hopefully it will not be that hard to figure out. I agree that 1) is ugly. So far, I've added some temporary assertions, to see when qio_channel_tls_handshake is reached; it looks like nbd/server.c is calling it from within coroutine context, but nbd/client.c is calling it from the main loop. The qio code handles either, but now I'm stuck in trying to get client.c into having the right coroutine context; the TLS handshake is done before the usual BlockDriverState *bs object is available, so I'm not even sure what aio context, if any, I should be using. But on my first try, I'm hitting: qemu-img: ../util/async.c:707: aio_co_enter: Assertion `self != co' failed. so I obviously got something wrong. It may be possible to use block_gen_c to turn nbd_receive_negotiate and nbd_receive_export_list into co_wrapper functions, if I can audit that all of their code goes through qio (and is therefore coroutine-safe), but that work is still ongoing. At any rate, qemu-iotest 233 should be good for testing that changes in this area work; I've also been testing with libnbd (test interop/interop-qemu-nbd-tls-certs hits qemu's server.c) and nbdkit (test tests/test-tls-psk.sh hits qemu's client.c). -- Eric Blake, Principal Software Engineer Red Hat, Inc. Virtualization: qemu.org | libguestfs.org
[PATCH 2/3] vvfat: Fix usage of `info.file.offset`
The field is marked as "the offset in the file (in clusters)", but it was
being used like `cluster_size*(nums)+mapping->info.file.offset`, i.e. as a
byte offset, which is incorrect.

Additionally, remove the `abort()` when `first_mapping_index` does not
match: this case occurs when adding new clusters to a file, and it is
inevitable that we reach this condition when the new clusters are not
contiguous with the old ones. There is no reason to `abort()` here;
execution continues and the new clusters are written to disk correctly.

Signed-off-by: Amjad Alsharafi
---
 block/vvfat.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/block/vvfat.c b/block/vvfat.c
index ab342f0743..cb3ab81e29 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -1408,7 +1408,7 @@ read_cluster_directory:
         assert(s->current_fd);

-        offset=s->cluster_size*(cluster_num-s->current_mapping->begin)+s->current_mapping->info.file.offset;
+        offset=s->cluster_size*((cluster_num - s->current_mapping->begin) + s->current_mapping->info.file.offset);
         if(lseek(s->current_fd, offset, SEEK_SET)!=offset)
             return -3;
         s->cluster=s->cluster_buffer;
@@ -1929,8 +1929,8 @@ get_cluster_count_for_direntry(BDRVVVFATState* s, direntry_t* direntry, const ch
                     (mapping->mode & MODE_DIRECTORY) == 0) {
                 /* was modified in qcow */
-                if (offset != mapping->info.file.offset + s->cluster_size
-                        * (cluster_num - mapping->begin)) {
+                if (offset != s->cluster_size
+                        * ((cluster_num - mapping->begin) + mapping->info.file.offset)) {
                     /* offset of this cluster in file chain has changed */
                     abort();
                     copy_it = 1;
@@ -1944,7 +1944,6 @@ get_cluster_count_for_direntry(BDRVVVFATState* s, direntry_t* direntry, const ch
                 if (mapping->first_mapping_index != first_mapping_index
                         && mapping->info.file.offset > 0) {
-                    abort();
                     copy_it = 1;
                 }
@@ -2404,7 +2403,7 @@ static int commit_mappings(BDRVVVFATState* s,
                     (mapping->end - mapping->begin);
         } else
             next_mapping->info.file.offset = mapping->info.file.offset +
-                    mapping->end - mapping->begin;
+                    (mapping->end - mapping->begin);

         mapping = next_mapping;
     }
--
2.44.0
[PATCH 0/3] vvfat: Fix bugs in writing and reading files
These patches fix some bugs found when modifying files in vvfat. First, there was a bug when writing to the second or later cluster of a file: the cluster before it was copied instead. Another issue concerned modifying the clusters of a file and adding new clusters; this exposed two problems:
- If the new cluster is not immediately after the last cluster, reading from this file later gives wrong data.
- Generally, the usage of info.file.offset was incorrect, and the system would crash on abort() when the file was modified and a new cluster was added.
[PATCH 1/3] vvfat: Fix bug in writing to middle of file
Before this commit, when calling `commit_one_file` with, for example, `offset=0x2000` (the second cluster), we would not fetch the next cluster from the FAT and would instead use the first cluster for the read operation. This is due to an off-by-one error here: `i=0x2000` is not less than `offset=0x2000`, so the next cluster is never fetched.

Signed-off-by: Amjad Alsharafi
---
 block/vvfat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/vvfat.c b/block/vvfat.c
index 9d050ba3ae..ab342f0743 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -2525,7 +2525,7 @@ commit_one_file(BDRVVVFATState* s, int dir_index, uint32_t offset)
         return -1;
     }

-    for (i = s->cluster_size; i < offset; i += s->cluster_size)
+    for (i = s->cluster_size; i <= offset; i += s->cluster_size)
         c = modified_fat_get(s, c);

     fd = qemu_open_old(mapping->path, O_RDWR | O_CREAT | O_BINARY, 0666);
--
2.44.0
[PATCH 3/3] vvfat: Fix reading files with non-contiguous clusters
When reading with `read_cluster` we get the `mapping` with `find_mapping_for_cluster` and then call `open_file` for this mapping. The issue appears when it is the same file, but a second cluster that is not immediately after the first. Imagine clusters `500 -> 503`: this gives us two mappings, one with the range `500..501` and another with `503..504`, both pointing to the same file but at different offsets.

Since the path is the same, we do not reopen the file, so `s->current_mapping` is not reassigned, and we end up accessing far out of bounds of the file. From the example above, after `open_file` (which opened nothing) we compute the offset into the file with `s->cluster_size*(cluster_num-s->current_mapping->begin)`, which gives `0x2000 * (504-500)`, out of bounds for this mapping, and will produce some issues.

Signed-off-by: Amjad Alsharafi
---
 block/vvfat.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/block/vvfat.c b/block/vvfat.c
index cb3ab81e29..87165abc26 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -1360,15 +1360,22 @@ static int open_file(BDRVVVFATState* s,mapping_t* mapping)
 {
     if(!mapping)
         return -1;
+    int new_path = 1;
     if(!s->current_mapping ||
-            strcmp(s->current_mapping->path,mapping->path)) {
-        /* open file */
-        int fd = qemu_open_old(mapping->path,
+            s->current_mapping->first_mapping_index!=mapping->first_mapping_index ||
+            (new_path = strcmp(s->current_mapping->path,mapping->path))) {
+
+        if (new_path) {
+            /* open file */
+            int fd = qemu_open_old(mapping->path,
                                O_RDONLY | O_BINARY | O_LARGEFILE);
-        if(fd<0)
-            return -1;
-        vvfat_close_current_file(s);
-        s->current_fd = fd;
+            if(fd<0)
+                return -1;
+            vvfat_close_current_file(s);
+
+            s->current_fd = fd;
+        }
+        assert(s->current_fd);
         s->current_mapping = mapping;
     }
     return 0;
--
2.44.0
[PATCH] block: Remove unnecessary NULL check in bdrv_pad_request()
Coverity complains that the check introduced in commit 3f934817 suggests that qiov could be NULL, even though we dereference it before reaching the check. In fact, all of the callers pass a non-NULL pointer, so just remove the misleading check.

Resolves: Coverity CID 1542668
Signed-off-by: Kevin Wolf
---
 block/io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index 395bea3bac..7217cf811b 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1730,7 +1730,7 @@ static int bdrv_pad_request(BlockDriverState *bs,
      * For prefetching in stream_populate(), no qiov is passed along, because
      * only copy-on-read matters.
      */
-    if (qiov && *qiov) {
+    if (*qiov) {
         sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
                                       &sliced_head, &sliced_tail,
                                       &sliced_niov);
--
2.44.0
bdrv_pad_request() Coverity report
Hi Fiona,

The Coverity static checker sent a report about commit 3f934817c82c ("block/io: accept NULL qiov in bdrv_pad_request"). Please take a look and send a follow-up patch, if necessary:

*** CID 1542668: Null pointer dereferences (REVERSE_INULL)
/builds/qemu-project/qemu/block/io.c: 1733 in bdrv_pad_request()
1727     }
1728
1729     /*
1730      * For prefetching in stream_populate(), no qiov is passed along, because
1731      * only copy-on-read matters.
1732      */
>>> CID 1542668: Null pointer dereferences (REVERSE_INULL)
>>> Null-checking "qiov" suggests that it may be null, but it has already
>>> been dereferenced on all paths leading to the check.
1733     if (qiov && *qiov) {
1734         sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
1735                                       &sliced_head, &sliced_tail,
1736                                       &sliced_niov);
1737
1738         /* Guaranteed by bdrv_check_request32() */

Thanks,
Stefan
Re: [PATCH-for-9.0 v2 16/19] hw/xen/xen_pt: Add missing license
On Tue, Nov 14, 2023 at 03:38:12PM +0100, Philippe Mathieu-Daudé wrote: > Commit eaab4d60d3 ("Introduce Xen PCI Passthrough, qdevice") > introduced both xen_pt.[ch], but only added the license to > xen_pt.c. Use the same license for xen_pt.h. > > Suggested-by: David Woodhouse > Signed-off-by: Philippe Mathieu-Daudé Fine by me. Looks like there was a license header before: https://xenbits.xen.org/gitweb/?p=qemu-xen-unstable.git;a=blob;f=hw/pass-through.h;h=0b5822414e24d199a064abccc4d378dcaf569bd6;hb=HEAD I don't know why I didn't copy it over here. Acked-by: Anthony PERARD Thanks, -- Anthony PERARD
Re: [PATCH-for-9.0 v2 15/19] hw/xen: Reduce inclusion of 'cpu.h' to target-specific sources
On Tue, Nov 14, 2023 at 03:38:11PM +0100, Philippe Mathieu-Daudé wrote: > We rarely need to include "cpu.h" in headers. Including it > 'taint' headers to be target-specific. Here only the i386/arm > implementations requires "cpu.h", so include it there and > remove from the "hw/xen/xen-hvm-common.h" *common* header. > > Signed-off-by: Philippe Mathieu-Daudé > Reviewed-by: Richard Henderson > Reviewed-by: David Woodhouse Acked-by: Anthony PERARD Thanks, -- Anthony PERARD
Re: [RFC PATCH-for-9.0 v2 13/19] hw/xen: Remove use of 'target_ulong' in handle_ioreq()
On Tue, Nov 14, 2023 at 03:38:09PM +0100, Philippe Mathieu-Daudé wrote:
> Per commit f17068c1c7 ("xen-hvm: reorganize xen-hvm and move common
> function to xen-hvm-common"), handle_ioreq() is expected to be
> target-agnostic. However it uses 'target_ulong', which is a target
> specific definition.
>
> Per xen/include/public/hvm/ioreq.h header:
>
> struct ioreq {
>     uint64_t addr;        /* physical address */
>     uint64_t data;        /* data (or paddr of data) */
>     uint32_t count;       /* for rep prefixes */
>     uint32_t size;        /* size in bytes */
>     uint32_t vp_eport;    /* evtchn for notifications to/from device model */
>     uint16_t _pad0;
>     uint8_t state:4;
>     uint8_t data_is_ptr:1; /* if 1, data above is the guest paddr
>                             * of the real data to use. */
>     uint8_t dir:1;         /* 1=read, 0=write */
>     uint8_t df:1;
>     uint8_t _pad1:1;
>     uint8_t type;          /* I/O type */
> };
> typedef struct ioreq ioreq_t;
>
> If 'data' is not a pointer, it is a u64.
>
> - In PIO / VMWARE_PORT modes, only 32-bit are used.
>
> - In MMIO COPY mode, memory is accessed by chunks of 64-bit

Looks like it could also be 8, 16, or 32 as well, depending on req->size.

> - In PCI_CONFIG mode, access is u8 or u16 or u32.
>
> - None of TIMEOFFSET / INVALIDATE use 'req'.
>
> - Fallback is only used in x86 for VMWARE_PORT.
>
> Masking the upper bits of 'data' to keep 'req->size' low bits
> is irrelevant of the target word size. Remove the word size
> check and always extract the relevant bits.

When building QEMU for Xen, we tend to build the target "i386-softmmu", which looks like it has target_ulong == uint32_t. So the `data` clamping would only apply to sizes 8 and 16. The clamping with target_ulong was introduced a long time ago, here:
https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=b4a663b87df3954557434a2d31bff7f6b2706ec1
and there were more IOREQ types back then.
So my guess is it isn't relevant anymore, and extending the clamping to 32-bit requests should be fine when using qemu-system-i386, as it is already done when using qemu-system-x86_64. So I think the patch is fine, and the tests I've run so far worked fine.

> Signed-off-by: Philippe Mathieu-Daudé

Reviewed-by: Anthony PERARD

Thanks,
--
Anthony PERARD
Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()
So I thought that for now we only support the "anonymous" mode, and if in the future we have a use case where the user wants to specify the name, we can add it later. Okay, so for now you really only want an anonymous fd that behaves like a memfd, got it. Likely we should somehow fail if the "POSIX shared memory object" already exists. Hmm. `O_CREAT | O_EXCL` should provide just this behavior. From shm_open(3) manpage: O_EXCL If O_CREAT was also specified, and a shared memory object with the given name already exists, return an error. The check for the existence of the object, and its creation if it does not exist, are performed atomically. Cool! That said, if you think it's already useful from the beginning, I can add the name as an optional parameter. I'm also not quite sure if "host_memory_backend_get_name()" should be used for the purpose here. What problem do you see? As an alternative I thought of a static counter. If it's really just an "internal UUID", as we'll remove the file using shm_unlink(), then the name in fact shouldn't matter and this would be fine. Correct? Right. It's a name that will only "appear" in the system for a very short time since we call shm_unlink() right after shm_open(). So I just need the unique name to prevent several QEMUs launched at the same time from colliding with the name. Okay, essentially emulating tmpfile(), but for weird shmem_open() :) So I assume if we ever want to have non-anonymous fds here, we'd pass in a new property "name", and change the logic how we open/unlink. Makes sense. Exactly! I can clarify this in the commit description as well. Thanks for the helpful comments! If there is anything I need to change for v3, please tell me ;-) Besides some patch description extension, makes sense to me :) -- Cheers, David / dhildenb
Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()
On Wed, Mar 27, 2024 at 12:51:51PM +0100, David Hildenbrand wrote: On 27.03.24 11:23, Stefano Garzarella wrote: On Tue, Mar 26, 2024 at 03:45:52PM +0100, David Hildenbrand wrote: +mode = 0; +oflag = O_RDWR | O_CREAT | O_EXCL; +backend_name = host_memory_backend_get_name(backend); + +/* + * Some operating systems allow creating anonymous POSIX shared memory + * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not + * defined by POSIX, so let's create a unique name. + * + * From Linux's shm_open(3) man-page: + * For portable use, a shared memory object should be identified + * by a name of the form /somename;" + */ +g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(), +backend_name); Hm, shouldn't we just let the user specify a name, and if no name was specified, generate one ourselves? I thought about it and initially did it that way, but then some problems came up so I tried to keep it as simple as possible for the user and for our use case (having an fd associated with memory and sharing it with other processes). The problems I had were: - what mode_t to use if the object does not exist and needs to be created? > - exclude O_EXCL if the user passes the name since they may have already created it? I'd handle both like we handle files. But I understand now that you really just want to "simulate" memfd. Right, maybe I should write that in the commit message and documentation. - call shm_unlink() only at object deallocation? Right, we don't really do that for ordinary files, they keep existing. We only have the "discard-data" property to free up memory. For memfd, it just happens "automatically". They cannot be looked up by name, and once the last user closes the fd, it gets destroyed. Yep, I see. So I thought that for now we only support the "anonymous" mode, and if in the future we have a use case where the user wants to specify the name, we can add it later. 
Okay, so for now you really only want an anonymous fd that behaves like a memfd, got it. Likely we should somehow fail if the "POSIX shared memory object" already exists. Hmm. `O_CREAT | O_EXCL` should provide just this behavior. From shm_open(3) manpage: O_EXCL If O_CREAT was also specified, and a shared memory object with the given name already exists, return an error. The check for the existence of the object, and its creation if it does not exist, are performed atomically. That said, if you think it's already useful from the beginning, I can add the name as an optional parameter. I'm also not quite sure if "host_memory_backend_get_name()" should be used for the purpose here. What problem do you see? As an alternative I thought of a static counter. If it's really just an "internal UUID", as we'll remove the file using shm_unlink(), then the name in fact shouldn't matter and this would be fine. Correct? Right. It's a name that will only "appear" in the system for a very short time since we call shm_unlink() right after shm_open(). So I just need the unique name to prevent several QEMUs launched at the same time from colliding with the name. So I assume if we ever want to have non-anonymous fds here, we'd pass in a new property "name", and change the logic how we open/unlink. Makes sense. Exactly! I can clarify this in the commit description as well. Thanks for the helpful comments! If there is anything I need to change for v3, please tell me ;-) Stefano
Re: [PATCH-for-9.1] rdma: Remove RDMA subsystem and pvrdma device
Philippe Mathieu-Daudé writes: > The whole RDMA subsystem was deprecated in commit e9a54265f5 > ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem") > released in v8.2. Time to remove it. > > Keep the RAM_SAVE_FLAG_HOOK definition since it might appears > in old migration streams. > > Remove the dependencies on libibumad and libibverbs. > > Remove the generated vmw_pvrdma/ directory from linux-headers. > > Remove RDMA handling from migration. > > Remove RDMA handling in GlusterFS block driver. > > Remove rdmacm-mux tool from contrib/. > > Remove PVRDMA device. > > Cc: Peter Xu > Cc: Li Zhijian > Cc: Yuval Shaia > Cc: Marcel Apfelbaum > Signed-off-by: Philippe Mathieu-Daudé Acked-by: Fabiano Rosas
Re: [PATCH-for-9.0 v2 12/19] hw/xen: Merge 'hw/xen/arch_hvm.h' in 'hw/xen/xen-hvm-common.h'
On Tue, Nov 14, 2023 at 03:38:08PM +0100, Philippe Mathieu-Daudé wrote: > We don't need a target-specific header for common target-specific > prototypes. Declare xen_arch_handle_ioreq() and xen_arch_set_memory() > in "hw/xen/xen-hvm-common.h". > > Signed-off-by: Philippe Mathieu-Daudé > Reviewed-by: David Woodhouse > Reviewed-by: Richard Henderson Acked-by: Anthony PERARD Thanks, -- Anthony PERARD
Re: [PATCH-for-9.0 v2 11/19] hw/xen/xen_arch_hvm: Rename prototypes using 'xen_arch_' prefix
On Tue, Nov 14, 2023 at 03:38:07PM +0100, Philippe Mathieu-Daudé wrote: > Use a common 'xen_arch_' prefix for architecture-specific functions. > Rename xen_arch_set_memory() and xen_arch_handle_ioreq(). > > Signed-off-by: Philippe Mathieu-Daudé > Reviewed-by: David Woodhouse > Reviewed-by: Richard Henderson Acked-by: Anthony PERARD Thanks, -- Anthony PERARD
Re: [RFC PATCH-for-9.0 v2 09/19] hw/block/xen_blkif: Align structs with QEMU_ALIGNED() instead of #pragma
On 27 March 2024 13:31:52 GMT, Anthony PERARD wrote: >On Tue, Nov 14, 2023 at 03:38:05PM +0100, Philippe Mathieu-Daudé wrote: >> Except imported source files, QEMU code base uses >> the QEMU_ALIGNED() macro to align its structures. > >This patch only convert the alignment, but discard pack. We need both or >the struct is changed. Which means we need some build-time asserts on struct size and field offsets. That should never have passed a build test.
Re: [RFC PATCH-for-9.0 v2 09/19] hw/block/xen_blkif: Align structs with QEMU_ALIGNED() instead of #pragma
On Tue, Nov 14, 2023 at 03:38:05PM +0100, Philippe Mathieu-Daudé wrote:
> Except imported source files, QEMU code base uses
> the QEMU_ALIGNED() macro to align its structures.

This patch only converts the alignment, but discards the packing. We need both or the struct layout is changed.

> ---
>  hw/block/xen_blkif.h | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/hw/block/xen_blkif.h b/hw/block/xen_blkif.h
> index 99733529c1..c1d154d502 100644
> --- a/hw/block/xen_blkif.h
> +++ b/hw/block/xen_blkif.h
> @@ -18,7 +18,6 @@ struct blkif_common_response {
>  };
>
>  /* i386 protocol version */
> -#pragma pack(push, 4)
>  struct blkif_x86_32_request {
>      uint8_t        operation;    /* BLKIF_OP_???                        */
>      uint8_t        nr_segments;  /* number of segments                  */
> @@ -26,7 +25,7 @@ struct blkif_x86_32_request {
>      uint64_t       id;           /* private guest value, echoed in resp */
>      blkif_sector_t sector_number;/* start sector idx on disk (r/w only) */
>      struct blkif_request_segment seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> -};
> +} QEMU_ALIGNED(4);

E.g. for this one, I've compared the output of `pahole --class_name=blkif_x86_32_request build/qemu-system-i386`:

--- before
+++ after
@@ -1,11 +1,15 @@
 struct blkif_x86_32_request {
     uint8_t                    operation;            /*     0     1 */
     uint8_t                    nr_segments;          /*     1     1 */
     uint16_t                   handle;               /*     2     2 */
-    uint64_t                   id;                   /*     4     8 */
-    uint64_t                   sector_number;        /*    12     8 */
-    struct blkif_request_segment seg[11];            /*    20    88 */
-    /* size: 108, cachelines: 2, members: 6 */
-    /* last cacheline: 44 bytes */
-} __attribute__((__packed__));
+    /* XXX 4 bytes hole, try to pack */
+
+    uint64_t                   id;                   /*     8     8 */
+    uint64_t                   sector_number;        /*    16     8 */
+    struct blkif_request_segment seg[11];            /*    24    88 */
+
+    /* size: 112, cachelines: 2, members: 6 */
+    /* sum members: 108, holes: 1, sum holes: 4 */
+    /* last cacheline: 48 bytes */
+} __attribute__((__aligned__(8)));

Thanks,
--
Anthony PERARD
Re: [PATCH-for-9.0 v2 05/19] hw/display: Restrict xen_register_framebuffer() call to Xen
On Tue, Nov 14, 2023 at 03:38:01PM +0100, Philippe Mathieu-Daudé wrote: > Only call xen_register_framebuffer() when Xen is enabled. > > Signed-off-by: Philippe Mathieu-Daudé I don't think this patch is very useful but it's fine, so: Reviewed-by: Anthony PERARD Thanks, -- Anthony PERARD
Re: [PATCH-for-9.0 v2 08/19] hw/xen: Remove unused Xen stubs
On Tue, Nov 14, 2023 at 03:38:04PM +0100, Philippe Mathieu-Daudé wrote: > All these stubs are protected by a 'if (xen_enabled())' check. Are you sure? There's still nothing that prevents a compiler from wanting those, I don't think. Sure, compilers will often remove dead code in `if(0){...}`, but there's no guarantee, is there? Cheers, -- Anthony PERARD
Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()
On 27.03.24 11:23, Stefano Garzarella wrote: On Tue, Mar 26, 2024 at 03:45:52PM +0100, David Hildenbrand wrote: +mode = 0; +oflag = O_RDWR | O_CREAT | O_EXCL; +backend_name = host_memory_backend_get_name(backend); + +/* + * Some operating systems allow creating anonymous POSIX shared memory + * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not + * defined by POSIX, so let's create a unique name. + * + * From Linux's shm_open(3) man-page: + * For portable use, a shared memory object should be identified + * by a name of the form /somename;" + */ +g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(), +backend_name); Hm, shouldn't we just let the user specify a name, and if no name was specified, generate one ourselves? I thought about it and initially did it that way, but then some problems came up so I tried to keep it as simple as possible for the user and for our use case (having an fd associated with memory and sharing it with other processes). The problems I had were: - what mode_t to use if the object does not exist and needs to be created? > - exclude O_EXCL if the user passes the name since they may have already created it? I'd handle both like we handle files. But I understand now that you really just want to "simulate" memfd. - call shm_unlink() only at object deallocation? Right, we don't really do that for ordinary files, they keep existing. We only have the "discard-data" property to free up memory. For memfd, it just happens "automatically". They cannot be looked up by name, and once the last user closes the fd, it gets destroyed. So I thought that for now we only support the "anonymous" mode, and if in the future we have a use case where the user wants to specify the name, we can add it later. Okay, so for now you really only want an anonymous fd that behaves like a memfd, got it. Likely we should somehow fail if the "POSIX shared memory object" already exists. Hmm. 
That said, if you think it's already useful from the beginning, I can add the name as an optional parameter. I'm also not quite sure if "host_memory_backend_get_name()" should be used for the purpose here. What problem do you see? As an alternative I thought of a static counter. If it's really just an "internal UUID", as we'll remove the file using shm_unlink(), then the name in fact shouldn't matter and this would be fine. Correct? So I assume if we ever want to have non-anonymous fds here, we'd pass in a new property "name", and change the logic how we open/unlink. Makes sense. -- Cheers, David / dhildenb
Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()
On Tue, Mar 26, 2024 at 03:45:52PM +0100, David Hildenbrand wrote: +mode = 0; +oflag = O_RDWR | O_CREAT | O_EXCL; +backend_name = host_memory_backend_get_name(backend); + +/* + * Some operating systems allow creating anonymous POSIX shared memory + * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not + * defined by POSIX, so let's create a unique name. + * + * From Linux's shm_open(3) man-page: + * For portable use, a shared memory object should be identified + * by a name of the form /somename;" + */ +g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(), +backend_name); Hm, shouldn't we just let the user specify a name, and if no name was specified, generate one ourselves? I thought about it and initially did it that way, but then some problems came up so I tried to keep it as simple as possible for the user and for our use case (having an fd associated with memory and sharing it with other processes). The problems I had were: - what mode_t to use if the object does not exist and needs to be created? - exclude O_EXCL if the user passes the name since they may have already created it? - call shm_unlink() only at object deallocation? So I thought that for now we only support the "anonymous" mode, and if in the future we have a use case where the user wants to specify the name, we can add it later. That said, if you think it's already useful from the beginning, I can add the name as an optional parameter. I'm also not quite sure if "host_memory_backend_get_name()" should be used for the purpose here. What problem do you see? As an alternative I thought of a static counter. Thanks, Stefano
Re: [PATCH for-9.1 v2 04/11] vhost-user-server: don't abort if we can't set fd non-blocking
On Tue, Mar 26, 2024 at 09:40:12AM -0500, Eric Blake wrote:
> On Tue, Mar 26, 2024 at 02:39:29PM +0100, Stefano Garzarella wrote:
>> In vhost-user-server we set all fd received from the other peer in
>> non-blocking mode. For some of them (e.g. memfd, shm_open, etc.) if we
>> fail, it's not really a problem, because we don't use these fd with
>> blocking operations, but only to map memory.
>>
>> In these cases a failure is not bad, so let's just report a warning
>> instead of panicking if we fail to set some fd in non-blocking mode.
>>
>> This for example occurs in macOS where setting a shm_open() fd
>> non-blocking fails (errno: 25).
>
> What is errno 25 on MacOS?

It should be ENOTTY. I'll add it in the commit description.

>> Signed-off-by: Stefano Garzarella
>> ---
>>  util/vhost-user-server.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
>> index 3bfb1ad3ec..064999f0b7 100644
>> --- a/util/vhost-user-server.c
>> +++ b/util/vhost-user-server.c
>> @@ -66,7 +66,11 @@ static void vmsg_unblock_fds(VhostUserMsg *vmsg)
>>  {
>>      int i;
>>      for (i = 0; i < vmsg->fd_num; i++) {
>> -        qemu_socket_set_nonblock(vmsg->fds[i]);
>> +        int ret = qemu_socket_try_set_nonblock(vmsg->fds[i]);
>> +        if (ret) {
>
> Should this be 'if (ret < 0)'?

I was confused by the assert() in qemu_socket_set_nonblock():

    void qemu_socket_set_nonblock(int fd)
    {
        int f;
        f = qemu_socket_try_set_nonblock(fd);
        assert(f == 0);
    }

BTW, I see most of the code checks ret < 0, so I'll fix it.

>> +            warn_report("Failed to set fd %d nonblock for request %d: %s",
>> +                        vmsg->fds[i], vmsg->request, strerror(-ret));
>> +        }
>
> This now ignores all errors even on pre-existing fds where we NEED
> non-blocking, rather than just the specific (expected) error we are
> seeing on MacOS. Should this code be a bit more precise about checking
> that -ret == EXXX for the expected errno value we are ignoring for the
> specific fds where non-blocking is not essential?
Good point, maybe I'll just avoid calling vmsg_unblock_fds() when the
message is VHOST_USER_ADD_MEM_REG or VHOST_USER_SET_MEM_TABLE. These
should be the cases where the carried fds are used for mmap() and so
there is no need to mark them non-blocking.

WDYT?

Stefano
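The non-fatal variant being debated — try to set O_NONBLOCK and warn instead of aborting — reduces to a small fcntl() pattern. This is only a sketch of the 0 / -errno convention the thread refers to (the real qemu_socket_try_set_nonblock() lives in QEMU's oslib code and also handles Windows sockets); the helper names here are made up:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Return 0 on success, -errno on failure — the convention that lets
 * the caller print strerror(-ret), as the patch's warn_report() does. */
static int try_set_nonblock(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0) {
        return -errno;
    }
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        return -errno;
    }
    return 0;
}

/* Non-fatal wrapper: an fd that is only ever mmap()ed doesn't need
 * O_NONBLOCK, so a failure here rates a warning, not an abort. */
static void set_nonblock_or_warn(int fd)
{
    int ret = try_set_nonblock(fd);
    if (ret < 0) {
        fprintf(stderr, "warning: failed to set fd %d nonblock: %s\n",
                fd, strerror(-ret));
    }
}
```

Stefano's alternative goes one step further: skip the call entirely for VHOST_USER_ADD_MEM_REG / VHOST_USER_SET_MEM_TABLE, so the strict abort can stay in place for the fds that genuinely must be non-blocking.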
Re: [PATCH for-9.1 v2 02/11] libvhost-user: fail vu_message_write() if sendmsg() is failing
>> I was wondering if we could see some partial sendmsg()/write
>> succeeding. Meaning, we transferred some bytes but not all, and we'd
>> actually need to loop ...
>
> Yep, true, but I would fix it in another patch/series if you agree.

Absolutely.

--
Cheers,

David / dhildenb
Re: [PATCH for-9.1 v2 03/11] libvhost-user: mask F_INFLIGHT_SHMFD if memfd is not supported
On Tue, Mar 26, 2024 at 09:36:54AM -0500, Eric Blake wrote:
> On Tue, Mar 26, 2024 at 02:39:28PM +0100, Stefano Garzarella wrote:
>> libvhost-user will panic when receiving VHOST_USER_GET_INFLIGHT_FD
>> message if MFD_ALLOW_SEALING is not defined, since it's not able to
>> create a memfd.
>>
>> VHOST_USER_GET_INFLIGHT_FD is used only if
>> VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD is negotiated. So, let's mask
>> that feature if the backend is not able to properly handle these
>> messages.
>>
>> Signed-off-by: Stefano Garzarella
>> ---
>>  subprojects/libvhost-user/libvhost-user.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
>> index a11afd1960..1c361ffd51 100644
>> --- a/subprojects/libvhost-user/libvhost-user.c
>> +++ b/subprojects/libvhost-user/libvhost-user.c
>> @@ -1674,6 +1674,16 @@ vu_get_protocol_features_exec(VuDev *dev, VhostUserMsg *vmsg)
>>          features |= dev->iface->get_protocol_features(dev);
>>      }
>>
>> +    /*
>> +     * If MFD_ALLOW_SEALING is not defined, we are not able to handle
>> +     * VHOST_USER_GET_INFLIGHT_FD messages, since we can't create a memfd.
>> +     * Those messages are used only if VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD
>> +     * is negotiated. A device implementation can enable it, so let's mask
>> +     * it to avoid a runtime panic.
>> +     */
>> +#ifndef MFD_ALLOW_SEALING
>> +    features &= ~(1ULL << VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD);
>> +#endif
>
> Masking the feature out of advertisement is obviously correct. But
> should we also fix the code for handling
> VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD to return an error to any client
> that requests it in error when the feature was not advertised, instead
> of panicking?

Totally agree! Should I send a separate patch from this series or include
it in this series? I would do the former because this one is already long
enough.

Thanks,
Stefano
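Both halves of the fix discussed here — masking the bit at negotiation time and refusing the message when the bit was never advertised — come down to simple bit tests. A standalone sketch (the helper names are made up, and the bit index 12 is this sketch's assumption about VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD; check the vhost-user spec for the authoritative value):

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit index of VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD. */
enum { F_INFLIGHT_SHMFD = 12 };

/*
 * Negotiation side: advertise the inflight-shmfd feature only when the
 * backend can actually create a sealable memfd (i.e. MFD_ALLOW_SEALING
 * was available at build time, here modeled as a runtime flag).
 */
static uint64_t advertised_features(uint64_t features, bool have_memfd_sealing)
{
    if (!have_memfd_sealing) {
        features &= ~(1ULL << F_INFLIGHT_SHMFD);
    }
    return features;
}

/*
 * Message side — Eric's suggested hardening: before handling
 * GET_INFLIGHT_FD, check that the feature was actually negotiated, so
 * a misbehaving client gets an error reply instead of a panic.
 */
static bool may_handle_get_inflight_fd(uint64_t negotiated)
{
    return (negotiated & (1ULL << F_INFLIGHT_SHMFD)) != 0;
}
```

The design point is that the two checks guard against different peers: the mask protects a correct client from a constrained backend, while the per-message check protects the backend from a client that ignores the negotiation result.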
Re: [PATCH for-9.1 v2 02/11] libvhost-user: fail vu_message_write() if sendmsg() is failing
On Tue, Mar 26, 2024 at 03:36:52PM +0100, David Hildenbrand wrote:
> On 26.03.24 15:34, Eric Blake wrote:
>> On Tue, Mar 26, 2024 at 02:39:27PM +0100, Stefano Garzarella wrote:
>>> In vu_message_write() we use sendmsg() to send the message header,
>>> then a write() to send the payload.
>>>
>>> If sendmsg() fails we should avoid sending the payload, since we
>>> were unable to send the header.
>>>
>>> Discovered before fixing the issue with the previous patch, where
>>> sendmsg() failed on macOS due to wrong parameters, but the frontend
>>> still sent the payload which the backend incorrectly interpreted as
>>> a wrong header.
>>>
>>> Signed-off-by: Stefano Garzarella
>>> ---
>>>  subprojects/libvhost-user/libvhost-user.c | 5 +++++
>>>  1 file changed, 5 insertions(+)
>>>
>>> diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
>>> index 22bea0c775..a11afd1960 100644
>>> --- a/subprojects/libvhost-user/libvhost-user.c
>>> +++ b/subprojects/libvhost-user/libvhost-user.c
>>> @@ -639,6 +639,11 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
>>>          rc = sendmsg(conn_fd, &msg, 0);
>>>      } while (rc < 0 && (errno == EINTR || errno == EAGAIN));
>>>
>>> +    if (rc <= 0) {
>>
>> Is rejecting a 0 return value correct? Technically, a 0 return means
>> a successful write of 0 bytes - but then again, it is unwise to
>> attempt to send an fd or other auxiliary data without at least one
>> regular byte, so we should not be attempting a write of 0 bytes. So I
>> guess this one is okay, although I might have used < instead of <=.

I blindly copied what they already do at the end of the same function.
However, your observation is correct.

That said, they have a sendmsg() to send the header. After this we have
a write() loop to send the payload. Now if sendmsg() returns 0, but we
have not sent all the header, what should we do? Try sendmsg() again?
For how many times?

> I was wondering if we could see some partial sendmsg()/write
> succeeding. Meaning, we transferred some bytes but not all, and we'd
> actually need to loop ...
Yep, true, but I would fix it in another patch/series if you agree.

Thanks,
Stefano
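The looping David alludes to — retrying after a short write until the whole buffer is out — is the classic "write in full" pattern. A sketch with plain write() (the helper name is made up; sendmsg() would need the same treatment for the header, with the extra wrinkle that ancillary fds should only be attached to the first attempt):

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Keep writing until every byte has gone out: retry on EINTR/EAGAIN,
 * advance past partial writes, return 0 on success or -1 on error.
 */
static int write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t rc = write(fd, p, len);
        if (rc < 0) {
            if (errno == EINTR || errno == EAGAIN) {
                continue;   /* transient: retry the same chunk */
            }
            return -1;      /* real error */
        }
        /*
         * rc == 0 just loops here; whether to cap the number of
         * retries is exactly the open question in the thread.
         */
        p += rc;
        len -= rc;
    }
    return 0;
}
```

On a non-blocking fd the EAGAIN branch busy-spins, so real code would poll() there; for the blocking sockets vu_message_write() deals with, the loop above is the shape the eventual fix would take.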