Re: [PATCH V2 3/3] target/i386: Move host_cpu_enable_cpu_pm into kvm_cpu_realizefn()
On Fri, May 24, 2024 at 01:00:17PM -0700, Zide Chen wrote:
> Date: Fri, 24 May 2024 13:00:17 -0700
> From: Zide Chen
> Subject: [PATCH V2 3/3] target/i386: Move host_cpu_enable_cpu_pm into
>  kvm_cpu_realizefn()
> X-Mailer: git-send-email 2.34.1
>
> It seems not a good idea to expand features in host_cpu_realizefn,
> which is for host CPU only.

It is restricted by max_features and should be part of the "max" CPU, and the current target/i386/host-cpu.c should only serve the "host" CPU.

> Additionally, cpu-pm option is KVM
> specific, and it's cleaner to put it in kvm_cpu_realizefn(), together
> with the WAITPKG code.
>
> Fixes: f5cc5a5c1686 ("i386: split cpu accelerators from cpu.c, using
> AccelCPUClass")
> Signed-off-by: Zide Chen
> ---
>  target/i386/host-cpu.c    | 12
>  target/i386/kvm/kvm-cpu.c | 11 +--
>  2 files changed, 9 insertions(+), 14 deletions(-)
>
> diff --git a/target/i386/host-cpu.c b/target/i386/host-cpu.c
> index 280e427c017c..8b8bf5afeccf 100644
> --- a/target/i386/host-cpu.c
> +++ b/target/i386/host-cpu.c
> @@ -42,15 +42,6 @@ static uint32_t host_cpu_phys_bits(void)
>      return host_phys_bits;
>  }
>
> -static void host_cpu_enable_cpu_pm(X86CPU *cpu)
> -{
> -    CPUX86State *env = &cpu->env;
> -
> -    host_cpuid(5, 0, &cpu->mwait.eax, &cpu->mwait.ebx,
> -               &cpu->mwait.ecx, &cpu->mwait.edx);
> -    env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR;
> -}
> -
>  static uint32_t host_cpu_adjust_phys_bits(X86CPU *cpu)
>  {
>      uint32_t host_phys_bits = host_cpu_phys_bits();
> @@ -83,9 +74,6 @@ bool host_cpu_realizefn(CPUState *cs, Error **errp)
>      X86CPU *cpu = X86_CPU(cs);
>      CPUX86State *env = &cpu->env;
>
> -    if (cpu->max_features && enable_cpu_pm) {
> -        host_cpu_enable_cpu_pm(cpu);
> -    }
>      if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
>          uint32_t phys_bits = host_cpu_adjust_phys_bits(cpu);
>
> diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
> index 3adcedf0dbc3..197c892da89a 100644
> --- a/target/i386/kvm/kvm-cpu.c
> +++ b/target/i386/kvm/kvm-cpu.c
> @@ -64,9 +64,16 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
>       * cpu_common_realizefn() (via xcc->parent_realize)
>       */
>      if (cpu->max_features) {
> -        if (enable_cpu_pm && kvm_has_waitpkg()) {
> -            env->features[FEAT_7_0_ECX] |= CPUID_7_0_ECX_WAITPKG;
> +        if (enable_cpu_pm) {
> +            if (kvm_has_waitpkg()) {
> +                env->features[FEAT_7_0_ECX] |= CPUID_7_0_ECX_WAITPKG;
> +            }

If you agree with my comment in patch 2, here we need a mwait bit check.

> +            host_cpuid(5, 0, &cpu->mwait.eax, &cpu->mwait.ebx,
> +                       &cpu->mwait.ecx, &cpu->mwait.edx);
> +            env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR;
>          }
> +
>          if (cpu->ucode_rev == 0) {
>              cpu->ucode_rev =
>                  kvm_arch_get_supported_msr_feature(kvm_state,
> --
> 2.34.1
Re: Unexpected error in rme_configure_one() at ../target/arm/kvm-rme.c:159
On 5/31/24 14:19, Itaru Kitayama wrote:
>> On May 30, 2024, at 22:30, Philippe Mathieu-Daudé wrote:
>>
>> Cc'ing more developers
>>
>> On 30/5/24 06:30, Itaru Kitayama wrote:
>>> Hi,
>>>
>>> When I see a Realm VM creation fails with:
>>>
>>> Unexpected error in rme_configure_one() at ../target/arm/kvm-rme.c:159:
>>> qemu-system-aarch64: RME: failed to configure SVE: Invalid argument
>>> test.sh: line 8: 2502 Aborted qemu-system-aarch64 -M 'virt,acpi=off,gic-version=3' -cpu host -enable-kvm -smp 2 -m 512M -overcommit 'mem-lock=on' -M 'confidential-guest-support=rme0' -object 'rme-guest,id=rme0,measurement-algo=sha512,num-pmu-counters=6,sve-vector-length=256' -kernel Image -initrd rootfs.cpio -append 'earycon console=ttyAMA0 rdinit=/sbin/init' -nographic -net none
>>>
>>> do I need to suspect first the VMM, QEMU, or the Image? The kernel is built with LLVM, does it matter?
>>>
>>> Thanks,
>>> Itaru.
>
> I’m testing Jean’s repo at:
> https://git.codelinaro.org/linaro/dcap/qemu/-/tree/cca/v2?ref_type=heads

I got a chance to try the CCA software components, as suggested by [1]. However, edk2 is stuck somewhere and I haven't reached the stage of loading the guest kernel yet. I'm replying to see if anyone has an idea.
[1] https://linaro.atlassian.net/wiki/spaces/QEMU/pages/29051027459/Building+an+RME+stack+for+QEMU

---> tf-rmm
# git clone https://git.codelinaro.org/linaro/dcap/rmm.git tf-rmm
# cd tf-rmm
# git checkout origin/cca/v2 -b cca/v2
# git submodule update --init --recursive
# export CROSS_COMPILE=
# cmake -DCMAKE_BUILD_TYPE=Debug -DRMM_CONFIG=qemu_virt_defcfg -B build-qemu
# cmake --build build-qemu

---> edk2
# git clone g...@github.com:tianocore/edk2.git
# cd edk2
# git submodule update --init --recursive
# source edksetup.sh
# make -j -C BaseTools
# export GCC5_AARCH64_PREFIX=
# build -b RELEASE -a AARCH64 -t GCC5 -p ArmVirtPkg/ArmVirtQemuKernel.dsc

---> tf-a
# git clone https://git.codelinaro.org/linaro/dcap/tf-a/trusted-firmware-a.git tf-a
# cd tf-a
# git checkout origin/cca/v2 -b cca/v2
# make -j CROSS_COMPILE= PLAT=qemu ENABLE_RME=1 DEBUG=1 LOG_LEVEL=40 \
    QEMU_USE_GIC_DRIVER=QEMU_GICV3 RMM=../rmm/build-qemu/Debug/rmm.img \
    BL33=../edk2/Build/ArmVirtQemuKernel-AARCH64/RELEASE_GCC5/FV/QEMU_EFI.fd all fip
# dd if=build/qemu/debug/bl1.bin of=flash.bin
# dd if=build/qemu/debug/fip.bin of=flash.bin seek=64 bs=4096

---> QEMU
# git clone https://git.qemu.org/git/qemu.git qemu.main
# cd qemu.main
# ./configure --target-list=aarch64-softmmu
# make -j 60

---> boot the host on the emulated hardware
/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
    -M virt,virtualization=on,secure=on,gic-version=3,acpi=off \
    -cpu max,x-rme=on -m 8G -smp 8 \
    -nodefaults -monitor none -serial mon:stdio -nographic \
    -bios /home/gavin/sandbox/CCA/tf-a/flash.bin \
    -kernel /home/gavin/sandbox/CCA/linux/arch/arm64/boot/Image \
    -drive format=raw,if=none,file=${CCA}/buildroot/output/images/rootfs.ext4,id=hd0 \
    -device virtio-blk-pci,drive=hd0 \
    -append "root=/dev/vda console=ttyAMA0"
:
NOTICE:  Booting Trusted Firmware
NOTICE:  BL1: v2.10.0(debug):99e0b97aa-dirty
NOTICE:  BL1: Built : 00:31:35, May 31 2024
INFO:    BL1: RAM 0xe0ee000 - 0xe0f7000
INFO:    BL1: Loading BL2
INFO:    Loading image id=1 at address 0xe06b000
INFO:    Image id=1 loaded: 0xe06b000 - 0xe0742d1
NOTICE:  BL1: Booting BL2
INFO:    Entry point address = 0xe06b000
INFO:    SPSR = 0x3cd
INFO:    [GPT] Boot Configuration
INFO:      PPS/T:     0x2/40
INFO:      PGS/P:     0x0/12
INFO:      L0GPTSZ/S: 0x0/30
INFO:      PAS count: 0x6
INFO:      L0 base:   0xedfe000
INFO:    [GPT] PAS[0]: base 0xe001000, size 0xff000, GPI 0xa, type 0x1
INFO:    [GPT] PAS[1]: base 0xe10, size 0xcfe000, GPI 0x8, type 0x1
INFO:    [GPT] PAS[2]: base 0xedfe000, size 0x202000, GPI 0xa, type 0x1
INFO:    [GPT] PAS[3]: base 0x4000, size 0x10, GPI 0x9, type 0x1
INFO:    [GPT] PAS[4]: base 0x4010, size 0x280, GPI 0xb, type 0x1
INFO:    [GPT] PAS[5]: base 0x4290, size 0x1fd70, GPI 0x9, type 0x1
INFO:    Enabling Granule Protection Checks
NOTICE:  BL2: v2.10.0(debug):99e0b97aa-dirty
NOTICE:  BL2: Built : 00:31:35, May 31 2024
INFO:    BL2: Doing platform setup
INFO:    Reserved RMM memory [0x4010, 0x428f] in Device tree
INFO:    BL2: Loading image id 3
INFO:    Loading image id=3 at address 0xe0a
INFO:    Image id=3 loaded: 0xe0a - 0xe0b10c4
INFO:    BL2: Loading image id 35
INFO:    Loading image id=35 at address 0x4010
INFO:    Image id=35 loaded: 0x4010 - 0x403033b0
INFO:    BL2: Loading image id 5
INFO:    Loading image id=5 at address 0x6000
INFO:    Image id=5 loaded: 0x6000 - 0x6020
NOTICE:  BL2: Booting BL31
INFO:    Entry point address = 0xe0a
INFO:    SPSR = 0x3cd
NOTICE:  BL31: v2.10.0(debug
Re: [PATCH V2 2/3] target/i386: call cpu_exec_realizefn before x86_cpu_filter_features
Hi Zide,

On Fri, May 24, 2024 at 01:00:16PM -0700, Zide Chen wrote:
> Date: Fri, 24 May 2024 13:00:16 -0700
> From: Zide Chen
> Subject: [PATCH V2 2/3] target/i386: call cpu_exec_realizefn before
>  x86_cpu_filter_features
> X-Mailer: git-send-email 2.34.1
>
> cpu_exec_realizefn which calls the accel-specific realizefn may expand
> features. e.g., some accel-specific options may require extra features
> to be enabled, and it's appropriate to expand these features in accel-
> specific realizefn.
>
> One such example is the cpu-pm option, which may add CPUID_EXT_MONITOR.
>
> Thus, call cpu_exec_realizefn before x86_cpu_filter_features to ensure
> that it won't expose features not supported by the host.
>
> Fixes: 662175b91ff2 ("i386: reorder call to cpu_exec_realizefn")
> Suggested-by: Xiaoyao Li
> Signed-off-by: Zide Chen
> ---
>  target/i386/cpu.c         | 24
>  target/i386/kvm/kvm-cpu.c | 1 -
>  2 files changed, 12 insertions(+), 13 deletions(-)
>
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index bc2dceb647fa..a1c1c785bd2f 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -7604,6 +7604,18 @@ static void x86_cpu_realizefn(DeviceState *dev, Error **errp)
>          }
>      }
>
> +    /*
> +     * note: the call to the framework needs to happen after feature expansion,
> +     * but before the checks/modifications to ucode_rev, mwait, phys_bits.
> +     * These may be set by the accel-specific code,
> +     * and the results are subsequently checked / assumed in this function.
> +     */
> +    cpu_exec_realizefn(cs, &local_err);
> +    if (local_err != NULL) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +
>      x86_cpu_filter_features(cpu, cpu->check_cpuid || cpu->enforce_cpuid);

For your case, which sets cpu-pm=on via overcommit, x86_cpu_filter_features() will then complain that mwait is not supported. Such a warning is unnecessary, because the purpose of overcommit (from the code) is only to support mwait when possible, not to commit to supporting mwait in the guest.

Additionally, I understand x86_cpu_filter_features() is primarily intended to filter features configured by the user, and changes to CPUID after x86_cpu_filter_features() should by default be regarded as "QEMU knows what it is doing".

I feel adding a check for the CPUID mwait bit in host_cpu_realizefn() is enough; after all, this bit should be present if the host supports mwait and enable_cpu_pm (in kvm_arch_get_supported_cpuid()).

Thanks,
Zhao
[PATCH] cxl: Get rid of unused cfmw_list
There is no user for this member. All '-M cxl-fmw.N' options have been parsed and saved to CXLState.fixed_windows.

Signed-off-by: Li Zhijian
---
 hw/cxl/cxl-host.c    | 1 -
 include/hw/cxl/cxl.h | 1 -
 2 files changed, 2 deletions(-)

diff --git a/hw/cxl/cxl-host.c b/hw/cxl/cxl-host.c
index c5f5fcfd64d0..926d3d3da705 100644
--- a/hw/cxl/cxl-host.c
+++ b/hw/cxl/cxl-host.c
@@ -335,7 +335,6 @@ static void machine_set_cfmw(Object *obj, Visitor *v, const char *name,
     for (it = cfmw_list; it; it = it->next) {
         cxl_fixed_memory_window_config(state, it->value, errp);
     }
-    state->cfmw_list = cfmw_list;
 }
 
 void cxl_machine_init(Object *obj, CXLState *state)
diff --git a/include/hw/cxl/cxl.h b/include/hw/cxl/cxl.h
index 75e47b686441..e3ecbef03872 100644
--- a/include/hw/cxl/cxl.h
+++ b/include/hw/cxl/cxl.h
@@ -43,7 +43,6 @@ typedef struct CXLState {
     MemoryRegion host_mr;
     unsigned int next_mr_idx;
    GList *fixed_windows;
-    CXLFixedMemoryWindowOptionsList *cfmw_list;
 } CXLState;
 
 struct CXLHost {
-- 
2.29.2
Re: tests/avocado: Add LoongArch machine start test
On 31 May 2024 at 02:52 AM, gaosong wrote:
> On 30 May 2024 at 9:16 PM, Jiaxun Yang wrote:
>>
>> On 30 May 2024 at 2:00 PM, gaosong wrote:
>> [...]
>>>> FYI, the test does not seem to work anymore - apparently the binaries have changed and now the hashes do not match anymore. Could you please update it? (preferably with some versioned binaries that do not change in the course of time?)
>>> Thank you, I had sent a patch to fix it.
>>
>> Hi Song,
>>
>> As LoongArch EDK2 support has been merged long ago, do you want to make a clean build and add it to the pc-bios directory?
> EDK2 LoongArchVirt under OvmfPkg is being committed to upstream.
>
> PR:
> https://github.com/tianocore/edk2/pull/5208

I meant here:
https://gitlab.com/qemu-project/qemu/-/tree/master/pc-bios?ref_type=heads

Thanks

> Thanks
> Song Gao
>>
>> Thanks
>> - Jiaxun

-- 
- Jiaxun
RE: [PATCH v4 12/16] aspeed/soc: Add AST2700 support
Hi Cedric,

> From: Cédric Le Goater
> Subject: Re: [PATCH v4 12/16] aspeed/soc: Add AST2700 support
>
> On 5/27/24 10:02, Jamin Lin wrote:
> > Initial definitions for a simple machine using an AST2700 SOC (Cortex-a35 CPU).
> >
> > AST2700 SOC and its interrupt controller are too complex to handle in
> > the common Aspeed SoC framework. We introduce a new ast2700 class with
> > instance_init and realize handlers.
> >
> > AST2700 is a 64 bits quad core cpus and support 8 watchdog.
> > Update maximum ASPEED_CPUS_NUM to 4 and ASPEED_WDTS_NUM to 8.
> > In addition, update AspeedSocState to support scuio, sli, sliio and intc.
> >
> > Add TYPE_ASPEED27X0_SOC machine type.
> >
> > The SDMC controller is unlocked at SPL stage.
> > At present, only supports to emulate booting start from u-boot stage.
> > Set SDMC controller unlocked by default.
> >
> > In INTC, each interrupt of INT 128 to INT 136 combines 32 interrupts.
> > It connect GICINT IRQ GPIO-OUTPUT pins to GIC device with irq 128 to 136.
> > And, if a device irq is 128 to 136, its irq GPIO-OUTPUT pin is
> > connected to GICINT or-gates instead of GIC device.
> >
> > Signed-off-by: Troy Lee
> > Signed-off-by: Jamin Lin
> > ---
> >  hw/arm/aspeed_ast27x0.c     | 563
> >  hw/arm/meson.build          | 1 +
> >  include/hw/arm/aspeed_soc.h | 26 +-
> >  3 files changed, 588 insertions(+), 2 deletions(-)
> >  create mode 100644 hw/arm/aspeed_ast27x0.c
> >
> > diff --git a/hw/arm/aspeed_ast27x0.c b/hw/arm/aspeed_ast27x0.c
> > new file mode 100644
> > index 00..a3a03fc1ca
> > --- /dev/null
> > +++ b/hw/arm/aspeed_ast27x0.c
> > @@ -0,0 +1,563 @@
> > +/*
> > + * ASPEED SoC 27x0 family
> > + *
> > + * Copyright (C) 2024 ASPEED Technology Inc.
> > + *
> > + * This code is licensed under the GPL version 2 or later. See
> > + * the COPYING file in the top-level directory.
> > + *
> > + * Implementation extracted from the AST2600 and adapted for AST27x0.
> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "qapi/error.h"
> > +#include "hw/misc/unimp.h"
> > +#include "hw/arm/aspeed_soc.h"
> > +#include "qemu/module.h"
> > +#include "qemu/error-report.h"
> > +#include "hw/i2c/aspeed_i2c.h"
> > +#include "net/net.h"
> > +#include "sysemu/sysemu.h"
> > +#include "hw/intc/arm_gicv3.h"
> > +#include "qapi/qmp/qlist.h"
> > +
> > +static const hwaddr aspeed_soc_ast2700_memmap[] = {
> > +    [ASPEED_DEV_SPI_BOOT]  = 0x4,
> > +    [ASPEED_DEV_SRAM]      = 0x1000,
> > +    [ASPEED_DEV_SDMC]      = 0x12C0,
> > +    [ASPEED_DEV_SCU]       = 0x12C02000,
> > +    [ASPEED_DEV_SCUIO]     = 0x14C02000,
> > +    [ASPEED_DEV_UART0]     = 0X14C33000,
> > +    [ASPEED_DEV_UART1]     = 0X14C33100,
> > +    [ASPEED_DEV_UART2]     = 0X14C33200,
> > +    [ASPEED_DEV_UART3]     = 0X14C33300,
> > +    [ASPEED_DEV_UART4]     = 0X12C1A000,
> > +    [ASPEED_DEV_UART5]     = 0X14C33400,
> > +    [ASPEED_DEV_UART6]     = 0X14C33500,
> > +    [ASPEED_DEV_UART7]     = 0X14C33600,
> > +    [ASPEED_DEV_UART8]     = 0X14C33700,
> > +    [ASPEED_DEV_UART9]     = 0X14C33800,
> > +    [ASPEED_DEV_UART10]    = 0X14C33900,
> > +    [ASPEED_DEV_UART11]    = 0X14C33A00,
> > +    [ASPEED_DEV_UART12]    = 0X14C33B00,
> > +    [ASPEED_DEV_WDT]       = 0x14C37000,
> > +    [ASPEED_DEV_VUART]     = 0X14C3,
> > +    [ASPEED_DEV_FMC]       = 0x1400,
> > +    [ASPEED_DEV_SPI0]      = 0x1401,
> > +    [ASPEED_DEV_SPI1]      = 0x1402,
> > +    [ASPEED_DEV_SPI2]      = 0x1403,
> > +    [ASPEED_DEV_SDRAM]     = 0x4,
> > +    [ASPEED_DEV_MII1]      = 0x1404,
> > +    [ASPEED_DEV_MII2]      = 0x14040008,
> > +    [ASPEED_DEV_MII3]      = 0x14040010,
> > +    [ASPEED_DEV_ETH1]      = 0x1405,
> > +    [ASPEED_DEV_ETH2]      = 0x1406,
> > +    [ASPEED_DEV_ETH3]      = 0x1407,
> > +    [ASPEED_DEV_EMMC]      = 0x1209,
> > +    [ASPEED_DEV_INTC]      = 0x1210,
> > +    [ASPEED_DEV_SLI]       = 0x12C17000,
> > +    [ASPEED_DEV_SLIIO]     = 0x14C1E000,
> > +    [ASPEED_GIC_DIST]      = 0x1220,
> > +    [ASPEED_GIC_REDIST]    = 0x1228,
> > +};
> > +
> > +#define AST2700_MAX_IRQ 288
> > +
> > +/* Shared Peripheral Interrupt values below are offset by -32 from datasheet */
> > +static const int aspeed_soc_ast2700_irqmap[] = {
> > +    [ASPEED_DEV_UART0]     = 132,
> > +    [ASPEED_DEV_UART1]     = 132,
> > +    [ASPEED_DEV_UART2]     = 132,
> > +    [ASPEED_DEV_UART3]     = 132,
> > +    [ASPEED_DEV_UART4]     = 8,
> > +    [ASPEED_DEV_UART5]     = 132,
> > +    [ASPEED_DEV_UART6]     = 132,
> > +    [ASPEED_DEV_UART7]     = 132,
> > +    [ASPEED_DEV_UART8]     = 132,
> > +    [ASPEED_DEV_UART9]     = 132,
> > +    [ASPEED_DEV_UART10]    = 132,
> > +    [ASPEED_DEV_UART11]    = 132,
> > +    [ASPEED_DEV_UART12]    = 132,
> > +    [ASPEED_DEV_FMC]       = 131,
> > +    [ASPEED_DEV_SDMC]      = 0,
> > +    [ASPEED_DEV_SCU]       = 12,
> > +    [ASPEED_DEV_ADC]       =
Re: [PATCH] Use "void *" as parameter for functions that are used for aio_set_event_notifier()
On 29/05/2024 20.22, Stefan Hajnoczi wrote:
> On Wed, May 29, 2024 at 07:49:48PM +0200, Thomas Huth wrote:
>> aio_set_event_notifier() and aio_set_event_notifier_poll() in
>> util/aio-posix.c and util/aio-win32.c are casting function pointers of
>> functions that take an "EventNotifier *" pointer as parameter to
>> function pointers that take a "void *" pointer as parameter (i.e. the
>> IOHandler type). When those function pointers are later used to call
>> the referenced function, this triggers undefined behavior errors with
>> the latest version of Clang in Fedora 40 when compiling with the
>> option "-fsanitize=undefined". And this also prevents enabling the
>> strict mode of CFI which is currently disabled with
>> -fsanitize-cfi-icall-generalize-pointers.
>>
>> Thus let us avoid the problem by using "void *" as parameter in all
>> spots where it is needed.
>>
>> Signed-off-by: Thomas Huth
>> ---
>> Yes, I know, the patch looks ugly ... but I don't see a better way to
>> tackle this. If someone has a better idea, suggestions are welcome!
>
> An alternative is adding EventNotifierHandler *io_read, *io_poll_ready,
> *io_poll_begin, and *io_poll_end fields to EventNotifier so that
> aio_set_event_notifier() and aio_set_event_notifier_poll() can pass
> helper functions to the underlying aio_set_fd_handler() and
> aio_set_fd_poll() APIs. These helper functions then invoke the
> EventNotifier callbacks:
>
> /* Helpers */
> static void event_notifier_io_read(void *opaque)
> {
>     EventNotifier *notifier = opaque;
>
>     notifier->io_read(notifier);
> }

That's a nice idea, thanks, I'll give it a try!

 Thomas
Re: [PATCH V2 1/3] vl: Allow multiple -overcommit commands
On 30/05/2024 16.01, Zhao Liu wrote:
> On Mon, May 27, 2024 at 07:19:56AM +0200, Thomas Huth wrote:
>> Date: Mon, 27 May 2024 07:19:56 +0200
>> From: Thomas Huth
>> Subject: Re: [PATCH V2 1/3] vl: Allow multiple -overcommit commands
>>
>> On 24/05/2024 22.00, Zide Chen wrote:
>>> Both cpu-pm and mem-lock are related to system resource overcommit,
>>> but they are separate from each other, in terms of how they are
>>> realized, and of course, they are applied to different system resources.
>>>
>>> It's tempting to use separate command lines to specify their behavior.
>>> e.g., in the following example, the cpu-pm command is quietly
>>> overwritten, and it's not easy to notice it without careful inspection.
>>>
>>> --overcommit mem-lock=on
>>> --overcommit cpu-pm=on
>>>
>>> Fixes: c8c9dc42b7ca ("Remove the deprecated -realtime option")
>>> Suggested-by: Thomas Huth
>>> Signed-off-by: Zide Chen
>>> ---
>>> v2: Thanks to Thomas' suggestion, changed to this better approach,
>>> which is more generic and can handle situations like: "enabled the
>>> option in the config file, and now you'd like to disable it on the
>>> command line again".
>>>
>>>  system/vl.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/system/vl.c b/system/vl.c
>>> index a3eede5fa5b8..dfa6cdd9283b 100644
>>> --- a/system/vl.c
>>> +++ b/system/vl.c
>>> @@ -3545,8 +3545,8 @@ void qemu_init(int argc, char **argv)
>>>              if (!opts) {
>>>                  exit(1);
>>>              }
>>> -            enable_mlock = qemu_opt_get_bool(opts, "mem-lock", false);
>>> -            enable_cpu_pm = qemu_opt_get_bool(opts, "cpu-pm", false);
>>> +            enable_mlock = qemu_opt_get_bool(opts, "mem-lock", enable_mlock);
>>> +            enable_cpu_pm = qemu_opt_get_bool(opts, "cpu-pm", enable_cpu_pm);
>>>              break;
>>>          case QEMU_OPTION_compat: {
>>
>> Reviewed-by: Thomas Huth
>
> Hi Thomas,
>
> BTW, do you think it's a good idea to define overcommit the QAPI way
> (defined in a JSON file)? ;-) My rough understanding is that all APIs
> are better defined via QAPI to be JSON compatible.

Sorry, no clue whether it makes sense here... CC:-ing Markus for recommendations.

 Thomas
Re: [PATCH] hw/net: prevent potential NULL dereference
On Thu, May 30, 2024 at 10:03:51AM +0100, Peter Maydell wrote:
> On Thu, 30 May 2024 at 01:52, David Gibson wrote:
> >
> > On Wed, May 29, 2024 at 02:07:18PM +0300, Oleg Sviridov wrote:
> > > Pointer, returned from function 'spapr_vio_find_by_reg', may be NULL
> > > and is dereferenced immediately after.
> > >
> > > Found by Linux Verification Center (linuxtesting.org) with SVACE.
> > >
> > > Signed-off-by: Oleg Sviridov
> > > ---
> > >  hw/net/spapr_llan.c | 4
> > >  1 file changed, 4 insertions(+)
> > >
> > > diff --git a/hw/net/spapr_llan.c b/hw/net/spapr_llan.c
> > > index ecb30b7c76..f40b733229 100644
> > > --- a/hw/net/spapr_llan.c
> > > +++ b/hw/net/spapr_llan.c
> > > @@ -770,6 +770,10 @@ static target_ulong h_change_logical_lan_mac(PowerPCCPU *cpu,
> > >      SpaprVioVlan *dev = VIO_SPAPR_VLAN_DEVICE(sdev);
> >
> > Hmm... I thought VIO_SPAPR_VLAN_DEVICE() was supposed to abort if sdev
> > was NULL or not of the right type. Or have the rules for qom helpers
> > changed since I wrote this.
>
> QOM casts abort if the type is wrong, but a NULL pointer is
> passed through as a NULL pointer.

Ah, my mistake. LGTM, then.

-- 
David Gibson                   | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
                               | _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH 4/4] hw/s390x: Deprecate the QMP @dump-skeys command
On 30/05/2024 09.45, Philippe Mathieu-Daudé wrote:
> Prefer @dump-s390-skeys which is target agnostic.
>
> Signed-off-by: Philippe Mathieu-Daudé
> ---
>  docs/about/deprecated.rst | 5 +
>  qapi/misc-target.json     | 5 +
>  2 files changed, 10 insertions(+)
>
> diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
> index 40585ca7d5..3cb43085ba 100644
> --- a/docs/about/deprecated.rst
> +++ b/docs/about/deprecated.rst
> @@ -148,6 +148,11 @@ accepted incorrect commands will return an error. Users should make sure that
>  all arguments passed to ``device_add`` are consistent with the documented
>  property types.
>
> +``dump-skeys`` (since 9.1)
> +''
> +
> +Use the more generic ``dump-s390-skeys`` command.

FWIW, that's "more specific", not "more generic". But as I said in my reply to the cover letter, we should maybe consider a more generic command instead, indeed.

 Thomas
Re: [PATCH 0/4] hw/s390x: Alias @dump-skeys -> @dump-s390-skey and deprecate
On 30/05/2024 09.45, Philippe Mathieu-Daudé wrote:
> We are trying to unify all qemu-system-FOO to a single binary. In order
> to do that we need to remove QAPI target specific code.
>
> @dump-skeys is only available on qemu-system-s390x. This series renames
> it as @dump-s390-skeys, making it available on other binaries. We take
> care of backward compatibility via deprecation.
>
> Philippe Mathieu-Daudé (4):
>   hw/s390x: Introduce the @dump-s390-skeys QMP command
>   hw/s390x: Introduce the 'dump_s390_skeys' HMP command
>   hw/s390x: Deprecate the HMP 'dump_skeys' command
>   hw/s390x: Deprecate the QMP @dump-skeys command

Why do we have to rename the command? Just for the sake of it? I think renaming HMP commands is maybe ok, but breaking the API in QMP is something you should think twice about.

And even if we decide to rename it... maybe we should discuss whether it makes sense to come up with a generic command instead: As far as I know, ARM also has something similar, called MTE. Maybe we also want to dump MTE keys one day? So the new command should maybe be called "dump-memory-keys" instead? Or should it maybe rather be an option to the existing "dump-guest-memory" command instead?

 Thomas
Re: Unexpected error in rme_configure_one() at ../target/arm/kvm-rme.c:159
> On May 30, 2024, at 22:30, Philippe Mathieu-Daudé wrote:
>
> Cc'ing more developers
>
> On 30/5/24 06:30, Itaru Kitayama wrote:
>> Hi,
>> When I see a Realm VM creation fails with:
>> Unexpected error in rme_configure_one() at ../target/arm/kvm-rme.c:159:
>> qemu-system-aarch64: RME: failed to configure SVE: Invalid argument
>> test.sh: line 8: 2502 Aborted qemu-system-aarch64 -M 'virt,acpi=off,gic-version=3' -cpu host -enable-kvm -smp 2 -m 512M -overcommit 'mem-lock=on' -M 'confidential-guest-support=rme0' -object 'rme-guest,id=rme0,measurement-algo=sha512,num-pmu-counters=6,sve-vector-length=256' -kernel Image -initrd rootfs.cpio -append 'earycon console=ttyAMA0 rdinit=/sbin/init' -nographic -net none
>> do I need to suspect first the VMM, QEMU, or the Image? The kernel is built with LLVM, does it matter?
>> Thanks,
>> Itaru.

I’m testing Jean’s repo at:
https://git.codelinaro.org/linaro/dcap/qemu/-/tree/cca/v2?ref_type=heads

Itaru.
[PATCH v4 09/10] hw/nvme: add reservation protocol command
Add reservation acquire, reservation register, reservation release and reservation report commands in the nvme device layer.

By introducing these commands, this enables the nvme device to perform reservation-related tasks, including querying keys, querying reservation status, registering reservation keys, initiating and releasing reservations, as well as clearing and preempting reservations held by other keys. These commands are crucial for management and control of shared storage resources in a persistent manner.

Signed-off-by: Changqi Lu
Signed-off-by: zhenwei pi
---
 hw/nvme/ctrl.c       | 321 ++-
 hw/nvme/nvme.h       | 4 +
 include/block/nvme.h | 37 +
 3 files changed, 361 insertions(+), 1 deletion(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 182307a48b..033abd0afe 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -294,6 +294,10 @@ static const uint32_t nvme_cse_iocs_nvm[256] = {
     [NVME_CMD_COMPARE]      = NVME_CMD_EFF_CSUPP,
     [NVME_CMD_IO_MGMT_RECV] = NVME_CMD_EFF_CSUPP,
     [NVME_CMD_IO_MGMT_SEND] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
+    [NVME_CMD_RESV_REGISTER] = NVME_CMD_EFF_CSUPP,
+    [NVME_CMD_RESV_REPORT]   = NVME_CMD_EFF_CSUPP,
+    [NVME_CMD_RESV_ACQUIRE]  = NVME_CMD_EFF_CSUPP,
+    [NVME_CMD_RESV_RELEASE]  = NVME_CMD_EFF_CSUPP,
 };
 
 static const uint32_t nvme_cse_iocs_zoned[256] = {
@@ -308,6 +312,10 @@ static const uint32_t nvme_cse_iocs_zoned[256] = {
     [NVME_CMD_ZONE_APPEND]    = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
     [NVME_CMD_ZONE_MGMT_SEND] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
     [NVME_CMD_ZONE_MGMT_RECV] = NVME_CMD_EFF_CSUPP,
+    [NVME_CMD_RESV_REGISTER] = NVME_CMD_EFF_CSUPP,
+    [NVME_CMD_RESV_REPORT]   = NVME_CMD_EFF_CSUPP,
+    [NVME_CMD_RESV_ACQUIRE]  = NVME_CMD_EFF_CSUPP,
+    [NVME_CMD_RESV_RELEASE]  = NVME_CMD_EFF_CSUPP,
 };
 
 static void nvme_process_sq(void *opaque);
@@ -1745,6 +1753,7 @@ static void nvme_aio_err(NvmeRequest *req, int ret)
     switch (req->cmd.opcode) {
     case NVME_CMD_READ:
+    case NVME_CMD_RESV_REPORT:
         status = NVME_UNRECOVERED_READ;
         break;
     case NVME_CMD_FLUSH:
@@ -1752,6 +1761,9 @@ static void nvme_aio_err(NvmeRequest *req, int ret)
     case NVME_CMD_WRITE_ZEROES:
     case NVME_CMD_ZONE_APPEND:
     case NVME_CMD_COPY:
+    case NVME_CMD_RESV_REGISTER:
+    case NVME_CMD_RESV_ACQUIRE:
+    case NVME_CMD_RESV_RELEASE:
         status = NVME_WRITE_FAULT;
         break;
     default:
@@ -2127,7 +2139,10 @@ static inline bool nvme_is_write(NvmeRequest *req)
     return rw->opcode == NVME_CMD_WRITE ||
            rw->opcode == NVME_CMD_ZONE_APPEND ||
-           rw->opcode == NVME_CMD_WRITE_ZEROES;
+           rw->opcode == NVME_CMD_WRITE_ZEROES ||
+           rw->opcode == NVME_CMD_RESV_REGISTER ||
+           rw->opcode == NVME_CMD_RESV_ACQUIRE ||
+           rw->opcode == NVME_CMD_RESV_RELEASE;
 }
 
 static void nvme_misc_cb(void *opaque, int ret)
@@ -2692,6 +2707,302 @@ static uint16_t nvme_verify(NvmeCtrl *n, NvmeRequest *req)
     return NVME_NO_COMPLETE;
 }
 
+typedef struct NvmeKeyInfo {
+    uint64_t cr_key;
+    uint64_t nr_key;
+} NvmeKeyInfo;
+
+static uint16_t nvme_resv_register(NvmeCtrl *n, NvmeRequest *req)
+{
+    int ret;
+    NvmeKeyInfo key_info;
+    NvmeNamespace *ns = req->ns;
+    uint32_t cdw10 = le32_to_cpu(req->cmd.cdw10);
+    bool ignore_key = cdw10 >> 3 & 0x1;
+    uint8_t action = cdw10 & 0x7;
+    uint8_t ptpl = cdw10 >> 30 & 0x3;
+    bool aptpl;
+
+    switch (ptpl) {
+    case NVME_RESV_PTPL_NO_CHANGE:
+        aptpl = (ns->id_ns.rescap & NVME_PR_CAP_PTPL) ? true : false;
+        break;
+    case NVME_RESV_PTPL_DISABLE:
+        aptpl = false;
+        break;
+    case NVME_RESV_PTPL_ENABLE:
+        aptpl = true;
+        break;
+    default:
+        return NVME_INVALID_FIELD;
+    }
+
+    ret = nvme_h2c(n, (uint8_t *)&key_info, sizeof(NvmeKeyInfo), req);
+    if (ret) {
+        return ret;
+    }
+
+    switch (action) {
+    case NVME_RESV_REGISTER_ACTION_REGISTER:
+        req->aiocb = blk_aio_pr_register(ns->blkconf.blk, 0,
+                                         key_info.nr_key, 0, aptpl,
+                                         ignore_key, nvme_misc_cb,
+                                         req);
+        break;
+    case NVME_RESV_REGISTER_ACTION_UNREGISTER:
+        req->aiocb = blk_aio_pr_register(ns->blkconf.blk, key_info.cr_key, 0,
+                                         0, aptpl, ignore_key,
+                                         nvme_misc_cb, req);
+        break;
+    case NVME_RESV_REGISTER_ACTION_REPLACE:
+        req->aiocb = blk_aio_pr_register(ns->blkconf.blk, key_info.cr_key,
+                                         key_info.nr_key, 0, aptpl, ignore_key,
+                                         nvme_misc_cb, req);
+        break;
+    default:
+        return
[PATCH v4 02/10] block/raw: add persistent reservation in/out driver
Add persistent reservation in/out operations for the raw driver. The following methods are implemented: bdrv_co_pr_read_keys, bdrv_co_pr_read_reservation, bdrv_co_pr_register, bdrv_co_pr_reserve, bdrv_co_pr_release, bdrv_co_pr_clear and bdrv_co_pr_preempt.

Signed-off-by: Changqi Lu
Signed-off-by: zhenwei pi
---
 block/raw-format.c | 56 ++
 1 file changed, 56 insertions(+)

diff --git a/block/raw-format.c b/block/raw-format.c
index ac7e8495f6..3746bc1bd3 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -454,6 +454,55 @@ raw_co_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
     return bdrv_co_ioctl(bs->file->bs, req, buf);
 }
 
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_read_keys(BlockDriverState *bs, uint32_t *generation,
+                    uint32_t num_keys, uint64_t *keys)
+{
+
+    return bdrv_co_pr_read_keys(bs->file->bs, generation, num_keys, keys);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_read_reservation(BlockDriverState *bs, uint32_t *generation,
+                           uint64_t *key, BlockPrType *type)
+{
+    return bdrv_co_pr_read_reservation(bs->file->bs, generation, key, type);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_register(BlockDriverState *bs, uint64_t old_key,
+                   uint64_t new_key, BlockPrType type,
+                   bool ptpl, bool ignore_key)
+{
+    return bdrv_co_pr_register(bs->file->bs, old_key, new_key,
+                               type, ptpl, ignore_key);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_reserve(BlockDriverState *bs, uint64_t key, BlockPrType type)
+{
+    return bdrv_co_pr_reserve(bs->file->bs, key, type);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_release(BlockDriverState *bs, uint64_t key, BlockPrType type)
+{
+    return bdrv_co_pr_release(bs->file->bs, key, type);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_clear(BlockDriverState *bs, uint64_t key)
+{
+    return bdrv_co_pr_clear(bs->file->bs, key);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_preempt(BlockDriverState *bs, uint64_t old_key,
+                  uint64_t new_key, BlockPrType type, bool abort)
+{
+    return bdrv_co_pr_preempt(bs->file->bs, old_key, new_key, type, abort);
+}
+
 static int GRAPH_RDLOCK raw_has_zero_init(BlockDriverState *bs)
 {
     return bdrv_has_zero_init(bs->file->bs);
@@ -672,6 +721,13 @@ BlockDriver bdrv_raw = {
     .strong_runtime_opts   = raw_strong_runtime_opts,
     .mutable_opts          = mutable_opts,
     .bdrv_cancel_in_flight = raw_cancel_in_flight,
+    .bdrv_co_pr_read_keys        = raw_co_pr_read_keys,
+    .bdrv_co_pr_read_reservation = raw_co_pr_read_reservation,
+    .bdrv_co_pr_register         = raw_co_pr_register,
+    .bdrv_co_pr_reserve          = raw_co_pr_reserve,
+    .bdrv_co_pr_release          = raw_co_pr_release,
+    .bdrv_co_pr_clear            = raw_co_pr_clear,
+    .bdrv_co_pr_preempt          = raw_co_pr_preempt,
 };
 
 static void bdrv_raw_init(void)
-- 
2.20.1
[PATCH v4 10/10] block/iscsi: add persistent reservation in/out driver
Add persistent reservation in/out operations for the iscsi driver. The following methods are implemented: bdrv_co_pr_read_keys, bdrv_co_pr_read_reservation, bdrv_co_pr_register, bdrv_co_pr_reserve, bdrv_co_pr_release, bdrv_co_pr_clear and bdrv_co_pr_preempt.

Signed-off-by: Changqi Lu
Signed-off-by: zhenwei pi
---
 block/iscsi.c | 443 ++
 1 file changed, 443 insertions(+)

diff --git a/block/iscsi.c b/block/iscsi.c
index 2ff14b7472..d94ebe35bd 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -96,6 +96,7 @@ typedef struct IscsiLun {
     unsigned long *allocmap_valid;
     long allocmap_size;
     int cluster_size;
+    uint8_t pr_cap;
     bool use_16_for_rw;
     bool write_protected;
     bool lbpme;
@@ -280,6 +281,8 @@ iscsi_co_generic_cb(struct iscsi_context *iscsi, int status,
             iTask->err_code = -error;
             iTask->err_str = g_strdup(iscsi_get_error(iscsi));
         }
+    } else if (status == SCSI_STATUS_RESERVATION_CONFLICT) {
+        iTask->err_code = -EBADE;
     }
     }
 }
@@ -1792,6 +1795,52 @@ static void iscsi_save_designator(IscsiLun *lun,
     }
 }
 
+static void iscsi_get_pr_cap_sync(IscsiLun *iscsilun, Error **errp)
+{
+    struct scsi_task *task = NULL;
+    struct scsi_persistent_reserve_in_report_capabilities *rc = NULL;
+    int retries = ISCSI_CMD_RETRIES;
+    int xferlen = sizeof(struct scsi_persistent_reserve_in_report_capabilities);
+
+    do {
+        if (task != NULL) {
+            scsi_free_scsi_task(task);
+            task = NULL;
+        }
+
+        task = iscsi_persistent_reserve_in_sync(iscsilun->iscsi,
+                   iscsilun->lun, SCSI_PR_IN_REPORT_CAPABILITIES, xferlen);
+        if (task != NULL && task->status == SCSI_STATUS_GOOD) {
+            rc = scsi_datain_unmarshall(task);
+            if (rc == NULL) {
+                error_setg(errp,
+                    "iSCSI: Failed to unmarshall report capabilities data.");
+            } else {
+                iscsilun->pr_cap =
+                    scsi_pr_cap_to_block(rc->persistent_reservation_type_mask);
+                iscsilun->pr_cap |= (rc->ptpl_a) ? BLK_PR_CAP_PTPL : 0;
+            }
+            break;
+        }
+
+        if (task != NULL && task->status == SCSI_STATUS_CHECK_CONDITION
+            && task->sense.key == SCSI_SENSE_UNIT_ATTENTION) {
+            break;
+        }
+
+    } while (task != NULL && task->status == SCSI_STATUS_CHECK_CONDITION
+             && task->sense.key == SCSI_SENSE_UNIT_ATTENTION
+             && retries-- > 0);
+
+    if (task == NULL || task->status != SCSI_STATUS_GOOD) {
+        error_setg(errp, "iSCSI: failed to send report capabilities command");
+    }
+
+    if (task) {
+        scsi_free_scsi_task(task);
+    }
+}
+
 static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
                       Error **errp)
 {
@@ -2024,6 +2073,11 @@ static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
         bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
     }
 
+    iscsi_get_pr_cap_sync(iscsilun, &local_err);
+    if (local_err != NULL) {
+        error_propagate(errp, local_err);
+        ret = -EINVAL;
+    }
 out:
     qemu_opts_del(opts);
     g_free(initiator_name);
@@ -2110,6 +2164,8 @@ static void iscsi_refresh_limits(BlockDriverState *bs, Error **errp)
         bs->bl.opt_transfer = pow2floor(iscsilun->bl.opt_xfer_len *
                                         iscsilun->block_size);
     }
+
+    bs->bl.pr_cap = iscsilun->pr_cap;
 }
 
 /* Note that this will not re-establish a connection with an iSCSI target - it
@@ -2408,6 +2464,385 @@ out_unlock:
     return r;
 }
 
+static int coroutine_fn
+iscsi_co_pr_read_keys(BlockDriverState *bs, uint32_t *generation,
+                      uint32_t num_keys, uint64_t *keys)
+{
+    IscsiLun *iscsilun = bs->opaque;
+    QEMUIOVector qiov;
+    struct IscsiTask iTask;
+    int xferlen = sizeof(struct scsi_persistent_reserve_in_read_keys) +
+                  sizeof(uint64_t) * num_keys;
+    uint8_t *buf = g_malloc0(xferlen);
+    int32_t num_collect_keys = 0;
+    int r = 0;
+
+    qemu_iovec_init_buf(&qiov, buf, xferlen);
+    iscsi_co_init_iscsitask(iscsilun, &iTask);
+    qemu_mutex_lock(&iscsilun->mutex);
+retry:
+    iTask.task = iscsi_persistent_reserve_in_task(iscsilun->iscsi,
+                     iscsilun->lun, SCSI_PR_IN_READ_KEYS, xferlen,
+                     iscsi_co_generic_cb, &iTask);
+
+    if (iTask.task == NULL) {
+        qemu_mutex_unlock(&iscsilun->mutex);
+        return -ENOMEM;
+    }
+
+    scsi_task_set_iov_in(iTask.task, (struct scsi_iovec *)qiov.iov, qiov.niov);
+    iscsi_co_wait_for_task(&iTask, iscsilun);
+
+    if (iTask.task != NULL) {
+        scsi_free_scsi_task(iTask.task);
+        iTask.task = NULL;
+    }
+
+    if (iTask.do_retry) {
+        iTask.complete = 0;
+        goto retry;
+    }
+
[PATCH v4 05/10] hw/scsi: add persistent reservation in/out api for scsi device
Add persistent reservation in/out operations in the SCSI device layer. By introducing the persistent reservation in/out api, this enables the SCSI device to perform reservation-related tasks, including querying keys, querying reservation status, registering reservation keys, initiating and releasing reservations, as well as clearing and preempting reservations held by other keys. These operations are crucial for management and control of shared storage resources in a persistent manner. Signed-off-by: Changqi Lu Signed-off-by: zhenwei pi --- hw/scsi/scsi-disk.c | 352 1 file changed, 352 insertions(+) diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c index 4bd7af9d0c..0e964dbd87 100644 --- a/hw/scsi/scsi-disk.c +++ b/hw/scsi/scsi-disk.c @@ -32,6 +32,7 @@ #include "migration/vmstate.h" #include "hw/scsi/emulation.h" #include "scsi/constants.h" +#include "scsi/utils.h" #include "sysemu/block-backend.h" #include "sysemu/blockdev.h" #include "hw/block/block.h" @@ -42,6 +43,7 @@ #include "qemu/cutils.h" #include "trace.h" #include "qom/object.h" +#include "block/block_int.h" #ifdef __linux #include @@ -1474,6 +1476,346 @@ static void scsi_disk_emulate_read_data(SCSIRequest *req) scsi_req_complete(&r->req, GOOD); } +typedef struct SCSIPrReadKeys { +uint32_t generation; +uint32_t num_keys; +uint64_t *keys; +void *req; +} SCSIPrReadKeys; + +typedef struct SCSIPrReadReservation { +uint32_t generation; +uint64_t key; +BlockPrType type; +void *req; +} SCSIPrReadReservation; + +static void scsi_pr_read_keys_complete(void *opaque, int ret) +{ +int num_keys; +uint8_t *buf; +SCSIPrReadKeys *blk_keys = (SCSIPrReadKeys *)opaque; +SCSIDiskReq *r = (SCSIDiskReq *)blk_keys->req; +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev); + +assert(blk_get_aio_context(s->qdev.conf.blk) == +qemu_get_current_aio_context()); + +assert(r->req.aiocb != NULL); +r->req.aiocb = NULL; + +if (scsi_disk_req_check_error(r, ret, true)) { +goto done; +} + +buf = scsi_req_get_buf(&r->req); 
+num_keys = MIN(blk_keys->num_keys, ret); +blk_keys->generation = cpu_to_be32(blk_keys->generation); +memcpy(&buf[0], &blk_keys->generation, 4); +for (int i = 0; i < num_keys; i++) { +blk_keys->keys[i] = cpu_to_be64(blk_keys->keys[i]); +memcpy(&buf[8 + i * 8], &blk_keys->keys[i], 8); +} +num_keys = cpu_to_be32(num_keys * 8); +memcpy(&buf[4], &num_keys, 4); + +scsi_req_data(&r->req, r->buflen); +done: +scsi_req_unref(&r->req); +g_free(blk_keys->keys); +g_free(blk_keys); +} + +static int scsi_disk_emulate_pr_read_keys(SCSIRequest *req) +{ +SCSIPrReadKeys *blk_keys; +SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req); +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev); +int buflen = MIN(r->req.cmd.xfer, r->buflen); +int num_keys = (buflen - sizeof(uint32_t) * 2) / sizeof(uint64_t); + +blk_keys = g_new0(SCSIPrReadKeys, 1); +blk_keys->generation = 0; +/* num_keys is the maximum number of keys that can be transmitted */ +blk_keys->num_keys = num_keys; +blk_keys->keys = g_malloc(sizeof(uint64_t) * num_keys); +blk_keys->req = r; + +/* The request is used as the AIO opaque value, so add a ref. 
*/ +scsi_req_ref(&r->req); +r->req.aiocb = blk_aio_pr_read_keys(s->qdev.conf.blk, &blk_keys->generation, +blk_keys->num_keys, blk_keys->keys, +scsi_pr_read_keys_complete, blk_keys); +return 0; +} + +static void scsi_pr_read_reservation_complete(void *opaque, int ret) +{ +uint8_t *buf; +uint32_t additional_len = 0; +SCSIPrReadReservation *blk_rsv = (SCSIPrReadReservation *)opaque; +SCSIDiskReq *r = (SCSIDiskReq *)blk_rsv->req; +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev); + +assert(blk_get_aio_context(s->qdev.conf.blk) == +qemu_get_current_aio_context()); + +assert(r->req.aiocb != NULL); +r->req.aiocb = NULL; + +if (scsi_disk_req_check_error(r, ret, true)) { +goto done; +} + +buf = scsi_req_get_buf(&r->req); +blk_rsv->generation = cpu_to_be32(blk_rsv->generation); +memcpy(&buf[0], &blk_rsv->generation, 4); +if (ret) { +additional_len = cpu_to_be32(16); +blk_rsv->key = cpu_to_be64(blk_rsv->key); +memcpy(&buf[8], &blk_rsv->key, 8); +buf[21] = block_pr_type_to_scsi(blk_rsv->type) & 0xf; +} else { +additional_len = cpu_to_be32(0); +} + +memcpy(&buf[4], &additional_len, 4); +scsi_req_data(&r->req, r->buflen); + +done: +scsi_req_unref(&r->req); +g_free(blk_rsv); +} + +static int scsi_disk_emulate_pr_read_reservation(SCSIRequest *req) +{ +SCSIPrReadReservation *blk_rsv; +SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req); +SCSIDiskS
[PATCH v4 01/10] block: add persistent reservation in/out api
Add persistent reservation in/out operations at the block level. The following operations are included: - read_keys:retrieves the list of registered keys. - read_reservation: retrieves the current reservation status. - register: registers a new reservation key. - reserve: initiates a reservation for a specific key. - release: releases a reservation for a specific key. - clear:clears all existing reservations. - preempt: preempts a reservation held by another key. Signed-off-by: Changqi Lu Signed-off-by: zhenwei pi --- block/block-backend.c | 397 ++ block/io.c| 163 include/block/block-common.h | 40 +++ include/block/block-io.h | 20 ++ include/block/block_int-common.h | 84 +++ include/sysemu/block-backend-io.h | 24 ++ 6 files changed, 728 insertions(+) diff --git a/block/block-backend.c b/block/block-backend.c index db6f9b92a3..6707d94df7 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -1770,6 +1770,403 @@ BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf, return blk_aio_prwv(blk, req, 0, buf, blk_aio_ioctl_entry, 0, cb, opaque); } +typedef struct BlkPrInCo { +BlockBackend *blk; +uint32_t *generation; +uint32_t num_keys; +BlockPrType *type; +uint64_t *keys; +int ret; +} BlkPrInCo; + +typedef struct BlkPrInCB { +BlockAIOCB common; +BlkPrInCo prco; +bool has_returned; +} BlkPrInCB; + +static const AIOCBInfo blk_pr_in_aiocb_info = { +.aiocb_size = sizeof(BlkPrInCB), +}; + +static void blk_pr_in_complete(BlkPrInCB *acb) +{ +if (acb->has_returned) { +acb->common.cb(acb->common.opaque, acb->prco.ret); +blk_dec_in_flight(acb->prco.blk); +qemu_aio_unref(acb); +} +} + +static void blk_pr_in_complete_bh(void *opaque) +{ +BlkPrInCB *acb = opaque; +assert(acb->has_returned); +blk_pr_in_complete(acb); +} + +static BlockAIOCB *blk_aio_pr_in(BlockBackend *blk, uint32_t *generation, + uint32_t num_keys, BlockPrType *type, + uint64_t *keys, CoroutineEntry co_entry, + BlockCompletionFunc *cb, void *opaque) +{ +BlkPrInCB *acb; +Coroutine 
*co; + +blk_inc_in_flight(blk); +acb = blk_aio_get(&blk_pr_in_aiocb_info, blk, cb, opaque); +acb->prco = (BlkPrInCo) { +.blk= blk, +.generation = generation, +.num_keys = num_keys, +.type = type, +.ret= NOT_DONE, +.keys = keys, +}; +acb->has_returned = false; + +co = qemu_coroutine_create(co_entry, acb); +aio_co_enter(qemu_get_current_aio_context(), co); + +acb->has_returned = true; +if (acb->prco.ret != NOT_DONE) { +replay_bh_schedule_oneshot_event(qemu_get_current_aio_context(), + blk_pr_in_complete_bh, acb); +} + +return &acb->common; +} + +/* To be called between exactly one pair of blk_inc/dec_in_flight() */ +static int coroutine_fn +blk_aio_pr_do_read_keys(BlockBackend *blk, uint32_t *generation, +uint32_t num_keys, uint64_t *keys) +{ +IO_CODE(); + +blk_wait_while_drained(blk); +GRAPH_RDLOCK_GUARD(); + +if (!blk_co_is_available(blk)) { +return -ENOMEDIUM; +} + +return bdrv_co_pr_read_keys(blk_bs(blk), generation, num_keys, keys); +} + +static void coroutine_fn blk_aio_pr_read_keys_entry(void *opaque) +{ +BlkPrInCB *acb = opaque; +BlkPrInCo *prco = &acb->prco; + +prco->ret = blk_aio_pr_do_read_keys(prco->blk, prco->generation, +prco->num_keys, prco->keys); +blk_pr_in_complete(acb); +} + +BlockAIOCB *blk_aio_pr_read_keys(BlockBackend *blk, uint32_t *generation, + uint32_t num_keys, uint64_t *keys, + BlockCompletionFunc *cb, void *opaque) +{ +IO_CODE(); +return blk_aio_pr_in(blk, generation, num_keys, NULL, keys, + blk_aio_pr_read_keys_entry, cb, opaque); +} + +/* To be called between exactly one pair of blk_inc/dec_in_flight() */ +static int coroutine_fn +blk_aio_pr_do_read_reservation(BlockBackend *blk, uint32_t *generation, + uint64_t *key, BlockPrType *type) +{ +IO_CODE(); + +blk_wait_while_drained(blk); +GRAPH_RDLOCK_GUARD(); + +if (!blk_co_is_available(blk)) { +return -ENOMEDIUM; +} + +return bdrv_co_pr_read_reservation(blk_bs(blk), generation, key, type); +} + +static void coroutine_fn blk_aio_pr_read_reservation_entry(void *opaque) +{ +BlkPrInCB *acb = 
opaque; +BlkPrInCo *prco = &acb->prco; + +prco->ret = blk_aio_pr_do_read_reservation(prco->blk, prco->generation, + prco->keys, prco->
[PATCH v4 08/10] hw/nvme: enable ONCS and rescap function
This commit enables ONCS to support the reservation function at the controller level. Also enables rescap function in the namespace by detecting the supported reservation function in the backend driver. Signed-off-by: Changqi Lu Signed-off-by: zhenwei pi --- hw/nvme/ctrl.c | 3 ++- hw/nvme/ns.c | 5 + 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index 127c3d2383..182307a48b 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -8248,7 +8248,8 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev) id->nn = cpu_to_le32(NVME_MAX_NAMESPACES); id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROES | NVME_ONCS_TIMESTAMP | NVME_ONCS_FEATURES | NVME_ONCS_DSM | - NVME_ONCS_COMPARE | NVME_ONCS_COPY); + NVME_ONCS_COMPARE | NVME_ONCS_COPY | + NVME_ONCS_RESRVATIONS); /* * NOTE: If this device ever supports a command set that does NOT use 0x0 diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c index ea8db175db..320c9bf658 100644 --- a/hw/nvme/ns.c +++ b/hw/nvme/ns.c @@ -20,6 +20,7 @@ #include "qemu/bitops.h" #include "sysemu/sysemu.h" #include "sysemu/block-backend.h" +#include "block/block_int.h" #include "nvme.h" #include "trace.h" @@ -33,6 +34,7 @@ void nvme_ns_init_format(NvmeNamespace *ns) BlockDriverInfo bdi; int npdg, ret; int64_t nlbas; +uint8_t blk_pr_cap; ns->lbaf = id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)]; ns->lbasz = 1 << ns->lbaf.ds; @@ -55,6 +57,9 @@ void nvme_ns_init_format(NvmeNamespace *ns) } id_ns->npda = id_ns->npdg = npdg - 1; + +blk_pr_cap = blk_bs(ns->blkconf.blk)->file->bs->bl.pr_cap; +id_ns->rescap = block_pr_cap_to_nvme(blk_pr_cap); } static int nvme_ns_init(NvmeNamespace *ns, Error **errp) -- 2.20.1
[PATCH v4 03/10] scsi/constant: add persistent reservation in/out protocol constants
Add constants for the persistent reservation in/out protocol in the scsi/constant module. The constants include the persistent reservation command, type, and scope values defined in sections 6.13 and 6.14 of the SCSI Primary Commands-4 (SPC-4) specification. Signed-off-by: Changqi Lu Signed-off-by: zhenwei pi --- include/scsi/constants.h | 52 1 file changed, 52 insertions(+) diff --git a/include/scsi/constants.h b/include/scsi/constants.h index 9b98451912..922a314535 100644 --- a/include/scsi/constants.h +++ b/include/scsi/constants.h @@ -319,4 +319,56 @@ #define IDENT_DESCR_TGT_DESCR_SIZE 32 #define XCOPY_BLK2BLK_SEG_DESC_SIZE 28 +typedef enum { +SCSI_PR_WRITE_EXCLUSIVE = 0x01, +SCSI_PR_EXCLUSIVE_ACCESS= 0x03, +SCSI_PR_WRITE_EXCLUSIVE_REGS_ONLY = 0x05, +SCSI_PR_EXCLUSIVE_ACCESS_REGS_ONLY = 0x06, +SCSI_PR_WRITE_EXCLUSIVE_ALL_REGS= 0x07, +SCSI_PR_EXCLUSIVE_ACCESS_ALL_REGS = 0x08, +} SCSIPrType; + +typedef enum { +SCSI_PR_LU_SCOPE = 0x00, +} SCSIPrScope; + +typedef enum { +SCSI_PR_OUT_REGISTER = 0x0, +SCSI_PR_OUT_RESERVE = 0x1, +SCSI_PR_OUT_RELEASE = 0x2, +SCSI_PR_OUT_CLEAR= 0x3, +SCSI_PR_OUT_PREEMPT = 0x4, +SCSI_PR_OUT_PREEMPT_AND_ABORT= 0x5, +SCSI_PR_OUT_REG_AND_IGNORE_KEY = 0x6, +SCSI_PR_OUT_REG_AND_MOVE = 0x7, +} SCSIPrOutAction; + +typedef enum { +SCSI_PR_IN_READ_KEYS = 0x0, +SCSI_PR_IN_READ_RESERVATION = 0x1, +SCSI_PR_IN_REPORT_CAPABILITIES = 0x2, +} SCSIPrInAction; + +typedef enum { +/* Exclusive Access All Registrants reservation type */ +SCSI_PR_CAP_EX_AC_AR = 1 << 0, +/* Write Exclusive reservation type */ +SCSI_PR_CAP_WR_EX = 1 << 9, +/* Exclusive Access reservation type */ +SCSI_PR_CAP_EX_AC = 1 << 11, +/* Write Exclusive Registrants Only reservation type */ +SCSI_PR_CAP_WR_EX_RO = 1 << 13, +/* Exclusive Access Registrants Only reservation type */ +SCSI_PR_CAP_EX_AC_RO = 1 << 14, +/* Write Exclusive All Registrants reservation type */ +SCSI_PR_CAP_WR_EX_AR = 1 << 15, + +SCSI_PR_CAP_ALL = (SCSI_PR_CAP_EX_AC_AR | + SCSI_PR_CAP_WR_EX | + SCSI_PR_CAP_EX_AC | 
+ SCSI_PR_CAP_WR_EX_RO | + SCSI_PR_CAP_EX_AC_RO | + SCSI_PR_CAP_WR_EX_AR), +} SCSIPrCap; + #endif -- 2.20.1
[PATCH v4 04/10] scsi/util: add helper functions for persistent reservation types conversion
This commit introduces two helper functions that facilitate the conversion between the persistent reservation types used in the SCSI protocol and those used in the block layer. Signed-off-by: Changqi Lu Signed-off-by: zhenwei pi --- include/scsi/utils.h | 8 + scsi/utils.c | 81 2 files changed, 89 insertions(+) diff --git a/include/scsi/utils.h b/include/scsi/utils.h index d5c8efa16e..89a0b082fb 100644 --- a/include/scsi/utils.h +++ b/include/scsi/utils.h @@ -1,6 +1,8 @@ #ifndef SCSI_UTILS_H #define SCSI_UTILS_H +#include "block/block-common.h" +#include "scsi/constants.h" #ifdef CONFIG_LINUX #include #endif @@ -135,6 +137,12 @@ uint32_t scsi_data_cdb_xfer(uint8_t *buf); uint32_t scsi_cdb_xfer(uint8_t *buf); int scsi_cdb_length(uint8_t *buf); +BlockPrType scsi_pr_type_to_block(SCSIPrType type); +SCSIPrType block_pr_type_to_scsi(BlockPrType type); + +uint8_t scsi_pr_cap_to_block(uint16_t scsi_pr_cap); +uint16_t block_pr_cap_to_scsi(uint8_t block_pr_cap); + /* Linux SG_IO interface. */ #ifdef CONFIG_LINUX #define SG_ERR_DRIVER_TIMEOUT 0x06 diff --git a/scsi/utils.c b/scsi/utils.c index 357b036671..0dfdeb499d 100644 --- a/scsi/utils.c +++ b/scsi/utils.c @@ -658,3 +658,84 @@ int scsi_sense_from_host_status(uint8_t host_status, } return GOOD; } + +BlockPrType scsi_pr_type_to_block(SCSIPrType type) +{ +switch (type) { +case SCSI_PR_WRITE_EXCLUSIVE: +return BLK_PR_WRITE_EXCLUSIVE; +case SCSI_PR_EXCLUSIVE_ACCESS: +return BLK_PR_EXCLUSIVE_ACCESS; +case SCSI_PR_WRITE_EXCLUSIVE_REGS_ONLY: +return BLK_PR_WRITE_EXCLUSIVE_REGS_ONLY; +case SCSI_PR_EXCLUSIVE_ACCESS_REGS_ONLY: +return BLK_PR_EXCLUSIVE_ACCESS_REGS_ONLY; +case SCSI_PR_WRITE_EXCLUSIVE_ALL_REGS: +return BLK_PR_WRITE_EXCLUSIVE_ALL_REGS; +case SCSI_PR_EXCLUSIVE_ACCESS_ALL_REGS: +return BLK_PR_EXCLUSIVE_ACCESS_ALL_REGS; +} + +return 0; +} + +SCSIPrType block_pr_type_to_scsi(BlockPrType type) +{ +switch (type) { +case BLK_PR_WRITE_EXCLUSIVE: +return SCSI_PR_WRITE_EXCLUSIVE; +case BLK_PR_EXCLUSIVE_ACCESS: +return 
SCSI_PR_EXCLUSIVE_ACCESS; +case BLK_PR_WRITE_EXCLUSIVE_REGS_ONLY: +return SCSI_PR_WRITE_EXCLUSIVE_REGS_ONLY; +case BLK_PR_EXCLUSIVE_ACCESS_REGS_ONLY: +return SCSI_PR_EXCLUSIVE_ACCESS_REGS_ONLY; +case BLK_PR_WRITE_EXCLUSIVE_ALL_REGS: +return SCSI_PR_WRITE_EXCLUSIVE_ALL_REGS; +case BLK_PR_EXCLUSIVE_ACCESS_ALL_REGS: +return SCSI_PR_EXCLUSIVE_ACCESS_ALL_REGS; +} + +return 0; +} + + +uint8_t scsi_pr_cap_to_block(uint16_t scsi_pr_cap) +{ +uint8_t res = 0; + +res |= (scsi_pr_cap & SCSI_PR_CAP_WR_EX) ? + BLK_PR_CAP_WR_EX : 0; +res |= (scsi_pr_cap & SCSI_PR_CAP_EX_AC) ? + BLK_PR_CAP_EX_AC : 0; +res |= (scsi_pr_cap & SCSI_PR_CAP_WR_EX_RO) ? + BLK_PR_CAP_WR_EX_RO : 0; +res |= (scsi_pr_cap & SCSI_PR_CAP_EX_AC_RO) ? + BLK_PR_CAP_EX_AC_RO : 0; +res |= (scsi_pr_cap & SCSI_PR_CAP_WR_EX_AR) ? + BLK_PR_CAP_WR_EX_AR : 0; +res |= (scsi_pr_cap & SCSI_PR_CAP_EX_AC_AR) ? + BLK_PR_CAP_EX_AC_AR : 0; + +return res; +} + +uint16_t block_pr_cap_to_scsi(uint8_t block_pr_cap) +{ +uint16_t res = 0; + +res |= (block_pr_cap & BLK_PR_CAP_WR_EX) ? + SCSI_PR_CAP_WR_EX : 0; +res |= (block_pr_cap & BLK_PR_CAP_EX_AC) ? + SCSI_PR_CAP_EX_AC : 0; +res |= (block_pr_cap & BLK_PR_CAP_WR_EX_RO) ? + SCSI_PR_CAP_WR_EX_RO : 0; +res |= (block_pr_cap & BLK_PR_CAP_EX_AC_RO) ? + SCSI_PR_CAP_EX_AC_RO : 0; +res |= (block_pr_cap & BLK_PR_CAP_WR_EX_AR) ? + SCSI_PR_CAP_WR_EX_AR : 0; +res |= (block_pr_cap & BLK_PR_CAP_EX_AC_AR) ? + SCSI_PR_CAP_EX_AC_AR : 0; + +return res; +} -- 2.20.1
[PATCH v4 00/10] Support persistent reservation operations
Hi,

Patch v4 has been revised; many thanks to Klaus for the code review. Could anyone please review the core block layer and SCSI layer code?

v3->v4:
- At the NVMe layer, the two patches enabling the ONCS function and enabling rescap are combined into one.
- At the NVMe layer, add helper functions for PR capability conversion between the block layer and the NVMe layer.

v2->v3:
- In v2, Persist Through Power Loss (PTPL) was enabled by default. In v3, PTPL is supported and passed as a parameter.

v1->v2:
- Add sg_persist --report-capabilities for the SCSI protocol and enable oncs and rescap for the NVMe protocol.
- Add persistent reservation capability constants and helper functions for the SCSI and NVMe protocols.
- Add comments for the necessary APIs.

v1:
- Add seven APIs for persistent reservation commands at the block layer: reading keys, reading reservations, registering, reserving, releasing, clearing and preempting.
- Add the necessary PR-related operation APIs for both the SCSI protocol and the NVMe protocol at the device layer.
- Add a SCSI driver at the driver layer to verify the functionality.

Changqi Lu (10):
  block: add persistent reservation in/out api
  block/raw: add persistent reservation in/out driver
  scsi/constant: add persistent reservation in/out protocol constants
  scsi/util: add helper functions for persistent reservation types conversion
  hw/scsi: add persistent reservation in/out api for scsi device
  block/nvme: add reservation command protocol constants
  hw/nvme: add helper functions for converting reservation types
  hw/nvme: enable ONCS and rescap function
  hw/nvme: add reservation protocol command
  block/iscsi: add persistent reservation in/out driver

 block/block-backend.c | 397 ++
 block/io.c | 163
 block/iscsi.c | 443 ++
 block/raw-format.c | 56
 hw/nvme/ctrl.c | 324 +-
 hw/nvme/ns.c | 5 +
 hw/nvme/nvme.h | 84 ++
 hw/scsi/scsi-disk.c | 352
 include/block/block-common.h | 40 +++
 include/block/block-io.h | 20 ++
 include/block/block_int-common.h | 84 ++
 include/block/nvme.h | 98 +++
 include/scsi/constants.h | 52
 include/scsi/utils.h | 8 +
 include/sysemu/block-backend-io.h | 24 ++
 scsi/utils.c | 81 ++
 16 files changed, 2229 insertions(+), 2 deletions(-)

-- 2.20.1
[PATCH v4 07/10] hw/nvme: add helper functions for converting reservation types
This commit introduces two helper functions that facilitate the conversion between the reservation types used in the NVME protocol and those used in the block layer. Reviewed-by: Stefan Hajnoczi Signed-off-by: Changqi Lu Signed-off-by: zhenwei pi --- hw/nvme/nvme.h | 80 ++ 1 file changed, 80 insertions(+) diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h index bed8191bd5..b1ad27c8f2 100644 --- a/hw/nvme/nvme.h +++ b/hw/nvme/nvme.h @@ -474,6 +474,86 @@ static inline const char *nvme_io_opc_str(uint8_t opc) } } +static inline NvmeResvType block_pr_type_to_nvme(BlockPrType type) +{ +switch (type) { +case BLK_PR_WRITE_EXCLUSIVE: +return NVME_RESV_WRITE_EXCLUSIVE; +case BLK_PR_EXCLUSIVE_ACCESS: +return NVME_RESV_EXCLUSIVE_ACCESS; +case BLK_PR_WRITE_EXCLUSIVE_REGS_ONLY: +return NVME_RESV_WRITE_EXCLUSIVE_REGS_ONLY; +case BLK_PR_EXCLUSIVE_ACCESS_REGS_ONLY: +return NVME_RESV_EXCLUSIVE_ACCESS_REGS_ONLY; +case BLK_PR_WRITE_EXCLUSIVE_ALL_REGS: +return NVME_RESV_WRITE_EXCLUSIVE_ALL_REGS; +case BLK_PR_EXCLUSIVE_ACCESS_ALL_REGS: +return NVME_RESV_EXCLUSIVE_ACCESS_ALL_REGS; +} + +return 0; +} + +static inline BlockPrType nvme_pr_type_to_block(NvmeResvType type) +{ +switch (type) { +case NVME_RESV_WRITE_EXCLUSIVE: +return BLK_PR_WRITE_EXCLUSIVE; +case NVME_RESV_EXCLUSIVE_ACCESS: +return BLK_PR_EXCLUSIVE_ACCESS; +case NVME_RESV_WRITE_EXCLUSIVE_REGS_ONLY: +return BLK_PR_WRITE_EXCLUSIVE_REGS_ONLY; +case NVME_RESV_EXCLUSIVE_ACCESS_REGS_ONLY: +return BLK_PR_EXCLUSIVE_ACCESS_REGS_ONLY; +case NVME_RESV_WRITE_EXCLUSIVE_ALL_REGS: +return BLK_PR_WRITE_EXCLUSIVE_ALL_REGS; +case NVME_RESV_EXCLUSIVE_ACCESS_ALL_REGS: +return BLK_PR_EXCLUSIVE_ACCESS_ALL_REGS; +} + +return 0; +} + +static inline uint8_t nvme_pr_cap_to_block(uint16_t nvme_pr_cap) +{ +uint8_t res = 0; + +res |= (nvme_pr_cap & NVME_PR_CAP_WR_EX) ? + BLK_PR_CAP_WR_EX : 0; +res |= (nvme_pr_cap & NVME_PR_CAP_EX_AC) ? + BLK_PR_CAP_EX_AC : 0; +res |= (nvme_pr_cap & NVME_PR_CAP_WR_EX_RO) ? 
+ BLK_PR_CAP_WR_EX_RO : 0; +res |= (nvme_pr_cap & NVME_PR_CAP_EX_AC_RO) ? + BLK_PR_CAP_EX_AC_RO : 0; +res |= (nvme_pr_cap & NVME_PR_CAP_WR_EX_AR) ? + BLK_PR_CAP_WR_EX_AR : 0; +res |= (nvme_pr_cap & NVME_PR_CAP_EX_AC_AR) ? + BLK_PR_CAP_EX_AC_AR : 0; + +return res; +} + +static inline uint8_t block_pr_cap_to_nvme(uint8_t block_pr_cap) +{ +uint16_t res = 0; + +res |= (block_pr_cap & BLK_PR_CAP_WR_EX) ? + NVME_PR_CAP_WR_EX : 0; +res |= (block_pr_cap & BLK_PR_CAP_EX_AC) ? + NVME_PR_CAP_EX_AC : 0; +res |= (block_pr_cap & BLK_PR_CAP_WR_EX_RO) ? + NVME_PR_CAP_WR_EX_RO : 0; +res |= (block_pr_cap & BLK_PR_CAP_EX_AC_RO) ? + NVME_PR_CAP_EX_AC_RO : 0; +res |= (block_pr_cap & BLK_PR_CAP_WR_EX_AR) ? + NVME_PR_CAP_WR_EX_AR : 0; +res |= (block_pr_cap & BLK_PR_CAP_EX_AC_AR) ? + NVME_PR_CAP_EX_AC_AR : 0; + +return res; +} + typedef struct NvmeSQueue { struct NvmeCtrl *ctrl; uint16_tsqid; -- 2.20.1
[PATCH v4 06/10] block/nvme: add reservation command protocol constants
Add constants for the NVMe persistent command protocol. The constants include the reservation command opcode and reservation type values defined in section 7 of the NVMe 2.0 specification. Signed-off-by: Changqi Lu Signed-off-by: zhenwei pi --- include/block/nvme.h | 61 1 file changed, 61 insertions(+) diff --git a/include/block/nvme.h b/include/block/nvme.h index bb231d0b9a..da6ccb0f3b 100644 --- a/include/block/nvme.h +++ b/include/block/nvme.h @@ -633,6 +633,10 @@ enum NvmeIoCommands { NVME_CMD_WRITE_ZEROES = 0x08, NVME_CMD_DSM= 0x09, NVME_CMD_VERIFY = 0x0c, +NVME_CMD_RESV_REGISTER = 0x0d, +NVME_CMD_RESV_REPORT= 0x0e, +NVME_CMD_RESV_ACQUIRE = 0x11, +NVME_CMD_RESV_RELEASE = 0x15, NVME_CMD_IO_MGMT_RECV = 0x12, NVME_CMD_COPY = 0x19, NVME_CMD_IO_MGMT_SEND = 0x1d, @@ -641,6 +645,63 @@ enum NvmeIoCommands { NVME_CMD_ZONE_APPEND= 0x7d, }; +typedef enum { +NVME_RESV_REGISTER_ACTION_REGISTER = 0x00, +NVME_RESV_REGISTER_ACTION_UNREGISTER= 0x01, +NVME_RESV_REGISTER_ACTION_REPLACE = 0x02, +} NvmeReservationRegisterAction; + +typedef enum { +NVME_RESV_RELEASE_ACTION_RELEASE= 0x00, +NVME_RESV_RELEASE_ACTION_CLEAR = 0x01, +} NvmeReservationReleaseAction; + +typedef enum { +NVME_RESV_ACQUIRE_ACTION_ACQUIRE= 0x00, +NVME_RESV_ACQUIRE_ACTION_PREEMPT= 0x01, +NVME_RESV_ACQUIRE_ACTION_PREEMPT_AND_ABORT = 0x02, +} NvmeReservationAcquireAction; + +typedef enum { +NVME_RESV_WRITE_EXCLUSIVE = 0x01, +NVME_RESV_EXCLUSIVE_ACCESS = 0x02, +NVME_RESV_WRITE_EXCLUSIVE_REGS_ONLY = 0x03, +NVME_RESV_EXCLUSIVE_ACCESS_REGS_ONLY= 0x04, +NVME_RESV_WRITE_EXCLUSIVE_ALL_REGS = 0x05, +NVME_RESV_EXCLUSIVE_ACCESS_ALL_REGS = 0x06, +} NvmeResvType; + +typedef enum { +NVME_RESV_PTPL_NO_CHANGE = 0x00, +NVME_RESV_PTPL_DISABLE = 0x02, +NVME_RESV_PTPL_ENABLE= 0x03, +} NvmeResvPTPL; + +typedef enum NVMEPrCap { +/* Persist Through Power Loss */ +NVME_PR_CAP_PTPL = 1 << 0, +/* Write Exclusive reservation type */ +NVME_PR_CAP_WR_EX = 1 << 1, +/* Exclusive Access reservation type */ +NVME_PR_CAP_EX_AC = 1 << 2, +/* 
Write Exclusive Registrants Only reservation type */ +NVME_PR_CAP_WR_EX_RO = 1 << 3, +/* Exclusive Access Registrants Only reservation type */ +NVME_PR_CAP_EX_AC_RO = 1 << 4, +/* Write Exclusive All Registrants reservation type */ +NVME_PR_CAP_WR_EX_AR = 1 << 5, +/* Exclusive Access All Registrants reservation type */ +NVME_PR_CAP_EX_AC_AR = 1 << 6, + +NVME_PR_CAP_ALL = (NVME_PR_CAP_PTPL | + NVME_PR_CAP_WR_EX | + NVME_PR_CAP_EX_AC | + NVME_PR_CAP_WR_EX_RO | + NVME_PR_CAP_EX_AC_RO | + NVME_PR_CAP_WR_EX_AR | + NVME_PR_CAP_EX_AC_AR), +} NvmePrCap; + typedef struct QEMU_PACKED NvmeDeleteQ { uint8_t opcode; uint8_t flags; -- 2.20.1
Re: tests/avocado: Add LoongArch machine start test
On 2024/5/30 9:16 PM, Jiaxun Yang wrote:
On May 30, 2024, at 2:00 PM, gaosong wrote:
[...]
FYI, the test does not seem to work anymore - apparently the binaries have changed and now the hashes no longer match. Could you please update it? (preferably with some versioned binaries that do not change over time?)
Thank you, I have sent a patch to fix it.
Hi Song,
As LoongArch EDK2 support was merged long ago, do you plan to make a clean build and add it to the pc-bios directory?
EDK2 LoongArchVirt under OvmfPkg is being committed upstream.
PR: https://github.com/tianocore/edk2/pull/5208
Thanks,
Song Gao
Thanks
- Jiaxun
Re: [PATCH 5/5] core/cpu-common: initialise plugin state before thread creation
On 5/30/24 12:42, Alex Bennée wrote:

Originally I tried to move vCPU thread initialisation to later in realize. However, pulling that thread (sic) got gnarly really quickly. It turns out some steps of CPU realization need values that can only be determined from the running vCPU thread.

However, having moved enough out of the thread creation, we can now queue work before the thread starts (at least for TCG guests) and avoid the race between vcpu_init and other vcpu states a plugin might subscribe to.

Signed-off-by: Alex Bennée
---
 hw/core/cpu-common.c | 20
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c
index 6cfc01593a..bf1a7b8892 100644
--- a/hw/core/cpu-common.c
+++ b/hw/core/cpu-common.c
@@ -222,14 +222,6 @@ static void cpu_common_realizefn(DeviceState *dev, Error **errp)
         cpu_resume(cpu);
     }

-    /* Plugin initialization must wait until the cpu start executing code */
-#ifdef CONFIG_PLUGIN
-    if (tcg_enabled()) {
-        cpu->plugin_state = qemu_plugin_create_vcpu_state();
-        async_run_on_cpu(cpu, qemu_plugin_vcpu_init__async, RUN_ON_CPU_NULL);
-    }
-#endif
-
     /* NOTE: latest generic point where the cpu is fully realized */
 }

@@ -273,6 +265,18 @@ static void cpu_common_initfn(Object *obj)
     QTAILQ_INIT(&cpu->watchpoints);

     cpu_exec_initfn(cpu);
+
+    /*
+     * Plugin initialization must wait until the cpu start executing
+     * code, but we must queue this work before the threads are
+     * created to ensure we don't race.
+     */
+#ifdef CONFIG_PLUGIN
+    if (tcg_enabled()) {
+        cpu->plugin_state = qemu_plugin_create_vcpu_state();
+        async_run_on_cpu(cpu, qemu_plugin_vcpu_init__async, RUN_ON_CPU_NULL);
+    }
+#endif
 }

 static void cpu_common_finalize(Object *obj)

Could you check that it works for all combinations?
- user-mode
- system-mode tcg
- system-mode mttcg

When I tried to move this code around, one of them didn't work correctly.

Else,
Reviewed-by: Pierrick Bouvier
Re: [PATCH 4/5] plugins: remove special casing for cpu->realized
On 5/30/24 12:42, Alex Bennée wrote:

Now that the condition variable is initialised early on, we don't need to go through hoops to avoid calling async_run_on_cpu.

Signed-off-by: Alex Bennée
---
 plugins/core.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/plugins/core.c b/plugins/core.c
index 0726bc7f25..badede28cf 100644
--- a/plugins/core.c
+++ b/plugins/core.c
@@ -65,11 +65,7 @@ static void plugin_cpu_update__locked(gpointer k, gpointer v, gpointer udata)
     CPUState *cpu = container_of(k, CPUState, cpu_index);
     run_on_cpu_data mask = RUN_ON_CPU_HOST_ULONG(*plugin.mask);

-    if (DEVICE(cpu)->realized) {
-        async_run_on_cpu(cpu, plugin_cpu_update__async, mask);
-    } else {
-        plugin_cpu_update__async(cpu, mask);
-    }
+    async_run_on_cpu(cpu, plugin_cpu_update__async, mask);
 }

 void plugin_unregister_cb__locked(struct qemu_plugin_ctx *ctx,

Reviewed-by: Pierrick Bouvier
Re: [PATCH 3/5] cpu-target: don't set cpu->thread_id to bogus value
On 5/30/24 12:42, Alex Bennée wrote:

The thread_id isn't valid until the threads are created. There is no point setting it here. The only thing that cares about the thread_id is qmp_query_cpus_fast.

Signed-off-by: Alex Bennée
---
 cpu-target.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/cpu-target.c b/cpu-target.c
index 5af120e8aa..499facf774 100644
--- a/cpu-target.c
+++ b/cpu-target.c
@@ -241,7 +241,6 @@ void cpu_exec_initfn(CPUState *cpu)
     cpu->num_ases = 0;

 #ifndef CONFIG_USER_ONLY
-    cpu->thread_id = qemu_get_thread_id();
     cpu->memory = get_system_memory();
     object_ref(OBJECT(cpu->memory));
 #endif

Reviewed-by: Pierrick Bouvier
Re: [PATCH 1/5] hw/core: expand on the alignment of CPUState
On 5/30/24 12:42, Alex Bennée wrote:

Make the relationship between CPUState, ArchCPU and cpu_env a bit clearer in the kdoc comments.

Signed-off-by: Alex Bennée
---
 include/hw/core/cpu.h | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index bb398e8237..35d345371b 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -391,7 +391,8 @@ struct qemu_work_item;
 #define CPU_UNSET_NUMA_NODE_ID -1

 /**
- * CPUState:
+ * struct CPUState - common state of one CPU core or thread.
+ *
  * @cpu_index: CPU index (informative).
  * @cluster_index: Identifies which cluster this CPU is in.
  *   For boards which don't define clusters or for "loose" CPUs not assigned
@@ -439,10 +440,15 @@ struct qemu_work_item;
  * @kvm_fetch_index: Keeps the index that we last fetched from the per-vCPU
  *    dirty ring structure.
  *
- * State of one CPU core or thread.
+ * @neg_align: The CPUState is the common part of a concrete ArchCPU
+ * which is allocated when an individual CPU instance is created. As
+ * such, care is taken to ensure there is no gap between
+ * CPUState and CPUArchState within ArchCPU.
 *
- * Align, in order to match possible alignment required by CPUArchState,
- * and eliminate a hole between CPUState and CPUArchState within ArchCPU.
+ * @neg: The architectural register state ("cpu_env") immediately follows CPUState
+ * in ArchCPU and is passed to TCG code. The @neg structure holds some
+ * common TCG CPU variables which are accessed with a negative offset
+ * from cpu_env.
  */
 struct CPUState {
     /*< private >*/

Reviewed-by: Pierrick Bouvier
Re: [PATCH 2/5] cpu: move Qemu[Thread|Cond] setup into common code
On 5/30/24 12:42, Alex Bennée wrote: Aside from the round robin threads this is all common code. By moving the halt_cond setup we also no longer need hacks to work around the race between QOM object creation and thread creation. It is a little ugly to free stuff up for the round robin thread but better that it deals with its own specialisations than making the other accelerators jump through hoops. Signed-off-by: Alex Bennée --- include/hw/core/cpu.h | 4 accel/dummy-cpus.c| 3 --- accel/hvf/hvf-accel-ops.c | 4 accel/kvm/kvm-accel-ops.c | 3 --- accel/tcg/tcg-accel-ops-mttcg.c | 4 accel/tcg/tcg-accel-ops-rr.c | 14 +++--- hw/core/cpu-common.c | 5 + target/i386/nvmm/nvmm-accel-ops.c | 3 --- target/i386/whpx/whpx-accel-ops.c | 3 --- 9 files changed, 16 insertions(+), 27 deletions(-) diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h index 35d345371b..a405119eda 100644 --- a/include/hw/core/cpu.h +++ b/include/hw/core/cpu.h @@ -404,10 +404,14 @@ struct qemu_work_item; * @tcg_cflags: Pre-computed cflags for this cpu. * @nr_cores: Number of cores within this CPU package. * @nr_threads: Number of threads within this CPU core. + * @thread: Host thread details, only live once @created is #true + * @sem: WIN32 only semaphore used only for qtest + * @thread_id: native thread id of vCPU, only live once @created is #true * @running: #true if CPU is currently running (lockless). * @has_waiter: #true if a CPU is currently waiting for the cpu_exec_end; * valid under cpu_list_lock. * @created: Indicates whether the CPU thread has been successfully created. + * @halt_cond: condition variable sleeping threads can wait on. * @interrupt_request: Indicates a pending interrupt request. * @halted: Nonzero if the CPU is in suspended state. * @stop: Indicates a pending stop request. 
diff --git a/accel/dummy-cpus.c b/accel/dummy-cpus.c index 20519f1ea4..f32d8c8dc3 100644 --- a/accel/dummy-cpus.c +++ b/accel/dummy-cpus.c @@ -68,9 +68,6 @@ void dummy_start_vcpu_thread(CPUState *cpu) { char thread_name[VCPU_THREAD_NAME_SIZE]; -cpu->thread = g_malloc0(sizeof(QemuThread)); -cpu->halt_cond = g_malloc0(sizeof(QemuCond)); -qemu_cond_init(cpu->halt_cond); snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/DUMMY", cpu->cpu_index); qemu_thread_create(cpu->thread, thread_name, dummy_cpu_thread_fn, cpu, diff --git a/accel/hvf/hvf-accel-ops.c b/accel/hvf/hvf-accel-ops.c index 40d4187d9d..6f1e27ef46 100644 --- a/accel/hvf/hvf-accel-ops.c +++ b/accel/hvf/hvf-accel-ops.c @@ -463,10 +463,6 @@ static void hvf_start_vcpu_thread(CPUState *cpu) */ assert(hvf_enabled()); -cpu->thread = g_malloc0(sizeof(QemuThread)); -cpu->halt_cond = g_malloc0(sizeof(QemuCond)); -qemu_cond_init(cpu->halt_cond); - snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF", cpu->cpu_index); qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn, diff --git a/accel/kvm/kvm-accel-ops.c b/accel/kvm/kvm-accel-ops.c index 94c828ac8d..c239dfc87a 100644 --- a/accel/kvm/kvm-accel-ops.c +++ b/accel/kvm/kvm-accel-ops.c @@ -66,9 +66,6 @@ static void kvm_start_vcpu_thread(CPUState *cpu) { char thread_name[VCPU_THREAD_NAME_SIZE]; -cpu->thread = g_malloc0(sizeof(QemuThread)); -cpu->halt_cond = g_malloc0(sizeof(QemuCond)); -qemu_cond_init(cpu->halt_cond); snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/KVM", cpu->cpu_index); qemu_thread_create(cpu->thread, thread_name, kvm_vcpu_thread_fn, diff --git a/accel/tcg/tcg-accel-ops-mttcg.c b/accel/tcg/tcg-accel-ops-mttcg.c index c552b45b8e..49814ec4af 100644 --- a/accel/tcg/tcg-accel-ops-mttcg.c +++ b/accel/tcg/tcg-accel-ops-mttcg.c @@ -137,10 +137,6 @@ void mttcg_start_vcpu_thread(CPUState *cpu) g_assert(tcg_enabled()); tcg_cpu_init_cflags(cpu, current_machine->smp.max_cpus > 1); -cpu->thread = g_new0(QemuThread, 1); -cpu->halt_cond = 
g_malloc0(sizeof(QemuCond)); -qemu_cond_init(cpu->halt_cond); - /* create a thread per vCPU with TCG (MTTCG) */ snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/TCG", cpu->cpu_index); diff --git a/accel/tcg/tcg-accel-ops-rr.c b/accel/tcg/tcg-accel-ops-rr.c index 894e73e52c..84c36c1450 100644 --- a/accel/tcg/tcg-accel-ops-rr.c +++ b/accel/tcg/tcg-accel-ops-rr.c @@ -317,22 +317,22 @@ void rr_start_vcpu_thread(CPUState *cpu) tcg_cpu_init_cflags(cpu, false); if (!single_tcg_cpu_thread) { -cpu->thread = g_new0(QemuThread, 1); -cpu->halt_cond = g_new0(QemuCond, 1); -qemu_cond_init(cpu->halt_cond); +single_tcg_halt_cond = cpu->halt_cond; +single_tcg_cpu_thread = cpu->thread; /* share a single thread for all cpus with TCG */ snprintf(thread_name, VCPU_THREAD_NAME_SIZE,
[PATCH v3 3/6] sysemu: generalise qtest_warp_clock as qemu_clock_advance_virtual_time
From: Alex Bennée Move the key functionality of moving time forward into the clock sub-system itself. This will allow us to plumb time control into plugins. Signed-off-by: Alex Bennée Signed-off-by: Pierrick Bouvier --- include/qemu/timer.h | 15 +++ system/qtest.c | 25 +++-- util/qemu-timer.c| 26 ++ 3 files changed, 44 insertions(+), 22 deletions(-) diff --git a/include/qemu/timer.h b/include/qemu/timer.h index 9a366e551fb..910587d8293 100644 --- a/include/qemu/timer.h +++ b/include/qemu/timer.h @@ -245,6 +245,21 @@ bool qemu_clock_run_timers(QEMUClockType type); */ bool qemu_clock_run_all_timers(void); +/** + * qemu_clock_advance_virtual_time(): advance the virtual time tick + * @dest: target time in nanoseconds + * + * This function is used where the control of the flow of time has + * been delegated to outside the clock subsystem (be it qtest, icount + * or some other external source). You can ask the clock system to + * return @early at the first expired timer. + * + * Time can only move forward; attempts to reverse time would lead to + * an error. + * + * Returns: new virtual time. 
+ */ +int64_t qemu_clock_advance_virtual_time(int64_t dest); /* * QEMUTimerList diff --git a/system/qtest.c b/system/qtest.c index ee8b139e982..e6f6b4e62d5 100644 --- a/system/qtest.c +++ b/system/qtest.c @@ -337,26 +337,6 @@ void qtest_set_virtual_clock(int64_t count) qatomic_set_i64(&qtest_clock_counter, count); } -static void qtest_clock_warp(int64_t dest) -{ -int64_t clock = cpus_get_virtual_clock(); -AioContext *aio_context; -assert(qtest_enabled()); -aio_context = qemu_get_aio_context(); -while (clock < dest) { -int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL, - QEMU_TIMER_ATTR_ALL); -int64_t warp = qemu_soonest_timeout(dest - clock, deadline); - -cpus_set_virtual_clock(cpus_get_virtual_clock() + warp); - -qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL); -timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]); -clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL); -} -qemu_clock_notify(QEMU_CLOCK_VIRTUAL); -} - static bool (*process_command_cb)(CharBackend *chr, gchar **words); void qtest_set_command_cb(bool (*pc_cb)(CharBackend *chr, gchar **words)) @@ -755,7 +735,8 @@ static void qtest_process_command(CharBackend *chr, gchar **words) ns = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL, QEMU_TIMER_ATTR_ALL); } -qtest_clock_warp(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + ns); +qemu_clock_advance_virtual_time( +qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + ns); qtest_send_prefix(chr); qtest_sendf(chr, "OK %"PRIi64"\n", (int64_t)qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL)); @@ -781,7 +762,7 @@ static void qtest_process_command(CharBackend *chr, gchar **words) g_assert(words[1]); ret = qemu_strtoi64(words[1], NULL, 0, &ns); g_assert(ret == 0); -qtest_clock_warp(ns); +qemu_clock_advance_virtual_time(ns); qtest_send_prefix(chr); qtest_sendf(chr, "OK %"PRIi64"\n", (int64_t)qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL)); diff --git a/util/qemu-timer.c b/util/qemu-timer.c index 6a0de33dd2b..213114be68c 100644 --- a/util/qemu-timer.c +++ b/util/qemu-timer.c @@ -645,6 +645,11 @@ 
int64_t qemu_clock_get_ns(QEMUClockType type) } } +static void qemu_virtual_clock_set_ns(int64_t time) +{ +return cpus_set_virtual_clock(time); +} + void init_clocks(QEMUTimerListNotifyCB *notify_cb) { QEMUClockType type; @@ -675,3 +680,24 @@ bool qemu_clock_run_all_timers(void) return progress; } + +int64_t qemu_clock_advance_virtual_time(int64_t dest) +{ +int64_t clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL); +AioContext *aio_context; +aio_context = qemu_get_aio_context(); +while (clock < dest) { +int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL, + QEMU_TIMER_ATTR_ALL); +int64_t warp = qemu_soonest_timeout(dest - clock, deadline); + +qemu_virtual_clock_set_ns(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + warp); + +qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL); +timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]); +clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL); +} +qemu_clock_notify(QEMU_CLOCK_VIRTUAL); + +return clock; +} -- 2.39.2
[PATCH v3 4/6] qtest: move qtest_{get, set}_virtual_clock to accel/qtest/qtest.c
Signed-off-by: Pierrick Bouvier --- include/sysemu/qtest.h | 3 --- accel/qtest/qtest.c| 12 system/qtest.c | 12 3 files changed, 12 insertions(+), 15 deletions(-) diff --git a/include/sysemu/qtest.h b/include/sysemu/qtest.h index 45f3b7e1df5..c161d751650 100644 --- a/include/sysemu/qtest.h +++ b/include/sysemu/qtest.h @@ -34,9 +34,6 @@ void qtest_server_init(const char *qtest_chrdev, const char *qtest_log, Error ** void qtest_server_set_send_handler(void (*send)(void *, const char *), void *opaque); void qtest_server_inproc_recv(void *opaque, const char *buf); - -int64_t qtest_get_virtual_clock(void); -void qtest_set_virtual_clock(int64_t count); #endif #endif diff --git a/accel/qtest/qtest.c b/accel/qtest/qtest.c index 53182e6c2ae..bf14032d294 100644 --- a/accel/qtest/qtest.c +++ b/accel/qtest/qtest.c @@ -24,6 +24,18 @@ #include "qemu/main-loop.h" #include "hw/core/cpu.h" +static int64_t qtest_clock_counter; + +static int64_t qtest_get_virtual_clock(void) +{ +return qatomic_read_i64(&qtest_clock_counter); +} + +static void qtest_set_virtual_clock(int64_t count) +{ +qatomic_set_i64(&qtest_clock_counter, count); +} + static int qtest_init_accel(MachineState *ms) { return 0; diff --git a/system/qtest.c b/system/qtest.c index e6f6b4e62d5..ba210780ec0 100644 --- a/system/qtest.c +++ b/system/qtest.c @@ -325,18 +325,6 @@ static void qtest_irq_handler(void *opaque, int n, int level) } } -static int64_t qtest_clock_counter; - -int64_t qtest_get_virtual_clock(void) -{ -return qatomic_read_i64(&qtest_clock_counter); -} - -void qtest_set_virtual_clock(int64_t count) -{ -qatomic_set_i64(&qtest_clock_counter, count); -} - static bool (*process_command_cb)(CharBackend *chr, gchar **words); void qtest_set_command_cb(bool (*pc_cb)(CharBackend *chr, gchar **words)) -- 2.39.2
[PATCH v3 1/6] sysemu: add set_virtual_time to accel ops
From: Alex Bennée We are about to remove direct calls to individual accelerators for this information and will need a central point for plugins to hook into time changes. Signed-off-by: Alex Bennée Reviewed-by: Philippe Mathieu-Daudé Signed-off-by: Pierrick Bouvier --- include/sysemu/accel-ops.h | 18 +- include/sysemu/cpu-timers.h| 3 ++- ...et-virtual-clock.c => cpus-virtual-clock.c} | 5 + system/cpus.c | 11 +++ stubs/meson.build | 2 +- 5 files changed, 36 insertions(+), 3 deletions(-) rename stubs/{cpus-get-virtual-clock.c => cpus-virtual-clock.c} (68%) diff --git a/include/sysemu/accel-ops.h b/include/sysemu/accel-ops.h index ef91fc28bbd..a0886722305 100644 --- a/include/sysemu/accel-ops.h +++ b/include/sysemu/accel-ops.h @@ -20,7 +20,12 @@ typedef struct AccelOpsClass AccelOpsClass; DECLARE_CLASS_CHECKERS(AccelOpsClass, ACCEL_OPS, TYPE_ACCEL_OPS) -/* cpus.c operations interface */ +/** + * struct AccelOpsClass - accelerator interfaces + * + * This structure is used to abstract accelerator differences from the + * core CPU code. Not all have to be implemented. + */ struct AccelOpsClass { /*< private >*/ ObjectClass parent_class; @@ -44,7 +49,18 @@ struct AccelOpsClass { void (*handle_interrupt)(CPUState *cpu, int mask); +/** + * @get_virtual_clock: fetch virtual clock + * @set_virtual_clock: set virtual clock + * + * These allow the timer subsystem to defer to the accelerator to + * fetch time. The set function is needed if the accelerator wants + * to track the changes to time as the timer is warped through + * various timer events. 
+ */ int64_t (*get_virtual_clock)(void); +void (*set_virtual_clock)(int64_t time); + int64_t (*get_elapsed_ticks)(void); /* gdbstub hooks */ diff --git a/include/sysemu/cpu-timers.h b/include/sysemu/cpu-timers.h index d86738a378d..7bfa960fbd6 100644 --- a/include/sysemu/cpu-timers.h +++ b/include/sysemu/cpu-timers.h @@ -96,8 +96,9 @@ int64_t cpu_get_clock(void); void qemu_timer_notify_cb(void *opaque, QEMUClockType type); -/* get the VIRTUAL clock and VM elapsed ticks via the cpus accel interface */ +/* get/set VIRTUAL clock and VM elapsed ticks via the cpus accel interface */ int64_t cpus_get_virtual_clock(void); +void cpus_set_virtual_clock(int64_t new_time); int64_t cpus_get_elapsed_ticks(void); #endif /* SYSEMU_CPU_TIMERS_H */ diff --git a/stubs/cpus-get-virtual-clock.c b/stubs/cpus-virtual-clock.c similarity index 68% rename from stubs/cpus-get-virtual-clock.c rename to stubs/cpus-virtual-clock.c index fd447d53f3c..af7c1a1d403 100644 --- a/stubs/cpus-get-virtual-clock.c +++ b/stubs/cpus-virtual-clock.c @@ -6,3 +6,8 @@ int64_t cpus_get_virtual_clock(void) { return cpu_get_clock(); } + +void cpus_set_virtual_clock(int64_t new_time) +{ +/* do nothing */ +} diff --git a/system/cpus.c b/system/cpus.c index f8fa78f33d4..d3640c95030 100644 --- a/system/cpus.c +++ b/system/cpus.c @@ -229,6 +229,17 @@ int64_t cpus_get_virtual_clock(void) return cpu_get_clock(); } +/* + * Signal the new virtual time to the accelerator. This is only needed + * by accelerators that need to track the changes as we warp time. + */ +void cpus_set_virtual_clock(int64_t new_time) +{ +if (cpus_accel && cpus_accel->set_virtual_clock) { +cpus_accel->set_virtual_clock(new_time); +} +} + /* * return the time elapsed in VM between vm_start and vm_stop. 
Unless * icount is active, cpus_get_elapsed_ticks() uses units of the host CPU cycle diff --git a/stubs/meson.build b/stubs/meson.build index 3b9d42023cb..a1deafde08c 100644 --- a/stubs/meson.build +++ b/stubs/meson.build @@ -28,7 +28,7 @@ endif if have_block or have_ga stub_ss.add(files('replay-tools.c')) # stubs for hooks in util/main-loop.c, util/async.c etc. - stub_ss.add(files('cpus-get-virtual-clock.c')) + stub_ss.add(files('cpus-virtual-clock.c')) stub_ss.add(files('icount.c')) stub_ss.add(files('graph-lock.c')) if linux_io_uring.found() -- 2.39.2
Re: [PATCH v2 0/6] Implement icount=auto using TCG Plugins
A v3 was just sent, with a fix to the algorithm used. On 5/30/24 10:49, Pierrick Bouvier wrote: The goal here is to be able to temporally scale execution of qemu-user/system, using a given number of instructions per second. We define a virtual clock that can be late or in advance compared to real time. When we are in advance, we slow execution (by sleeping) until we catch up with real time. Finally, we should be able to clean up icount=auto mode completely, and keep icount usage for deterministic purposes only. It is built upon new TCG Plugins inline ops (store + conditional callbacks), now merged on master. Example in user-mode: - Retrieve number of instructions to execute /bin/true $ ./build/qemu-x86_64 -plugin ./build/tests/plugin/libinsn.so -d plugin /bin/true cpu 0 insns: 120546 total insns: 120546 - Slow execution to match 5 seconds $ time ./build/qemu-x86_64 -plugin ./build/contrib/plugins/libips.so,ips=$((120546/5)) /bin/true real 0m4.985s v2 -- - Added missing personal Signed-off-by for commits from Alex - Fix bad rebase in stubs/meson.build - move qtest_{get,set}_virtual_clock to accel/qtest/qtest.c - A race condition was identified for plugins init/idle/resume, but is not related to this series, and will be fixed in another one: https://lore.kernel.org/qemu-devel/20240529152219.825680-1-alex.ben...@linaro.org/ Alex Bennée (4): sysemu: add set_virtual_time to accel ops qtest: use cpu interface in qtest_clock_warp sysemu: generalise qtest_warp_clock as qemu_clock_advance_virtual_time plugins: add time control API Pierrick Bouvier (2): qtest: move qtest_{get,set}_virtual_clock to accel/qtest/qtest.c contrib/plugins: add ips plugin example for cost modeling include/qemu/qemu-plugin.h| 23 ++ include/qemu/timer.h | 15 ++ include/sysemu/accel-ops.h| 18 +- include/sysemu/cpu-timers.h | 3 +- include/sysemu/qtest.h| 2 - accel/qtest/qtest.c | 13 + contrib/plugins/ips.c | 239 ++ plugins/api.c | 31 +++ ...t-virtual-clock.c => cpus-virtual-clock.c} | 5 + system/cpus.c | 11 + 
system/qtest.c| 37 +-- util/qemu-timer.c | 26 ++ contrib/plugins/Makefile | 1 + plugins/qemu-plugins.symbols | 2 + stubs/meson.build | 2 +- 15 files changed, 389 insertions(+), 39 deletions(-) create mode 100644 contrib/plugins/ips.c rename stubs/{cpus-get-virtual-clock.c => cpus-virtual-clock.c} (68%)
[PATCH v3 0/6] Implement icount=auto using TCG Plugins
The goal here is to be able to temporally scale execution of qemu-user/system, using a given number of instructions per second. We define a virtual clock that can be late or in advance compared to real time. When we are in advance, we slow execution (by sleeping) until we catch up with real time. Finally, we should be able to clean up icount=auto mode completely, and keep icount usage for deterministic purposes only. It is built upon new TCG Plugins inline ops (store + conditional callbacks), now merged on master. Example in user-mode: - Retrieve number of instructions to execute /bin/true $ ./build/qemu-x86_64 -plugin ./build/tests/plugin/libinsn.so -d plugin /bin/true cpu 0 insns: 120546 total insns: 120546 - Slow execution to match 5 seconds $ time ./build/qemu-x86_64 -plugin ./build/contrib/plugins/libips.so,ips=$((120546/5)) /bin/true real 0m4.985s Tested in system mode by booting a full debian system, and using: $ sysbench cpu run Performance decreases linearly with the given number of ips. v2 -- - Added missing personal Signed-off-by for commits from Alex - Fix bad rebase in stubs/meson.build - move qtest_{get,set}_virtual_clock to accel/qtest/qtest.c - A race condition was identified for plugins init/idle/resume, but is not related to this series, and will be fixed in another one: https://lore.kernel.org/qemu-devel/20240529152219.825680-1-alex.ben...@linaro.org/ v3 -- - remove precise execution (per insn, per tb is enough) - fixed algorithm used. Instead of comparing from start time of the system, we just check on a given quantum of time that we didn't run too fast. It is simpler, there is no need to track idle/resume events, and a vcpu can sleep correctly. - use "sysbench cpu run" in system mode to check execution is slowed as expected. 
- do not use idle/resume callback Alex Bennée (4): sysemu: add set_virtual_time to accel ops qtest: use cpu interface in qtest_clock_warp sysemu: generalise qtest_warp_clock as qemu_clock_advance_virtual_time plugins: add time control API Pierrick Bouvier (2): qtest: move qtest_{get,set}_virtual_clock to accel/qtest/qtest.c contrib/plugins: add ips plugin example for cost modeling include/qemu/qemu-plugin.h| 23 +++ include/qemu/timer.h | 15 ++ include/sysemu/accel-ops.h| 18 +- include/sysemu/cpu-timers.h | 3 +- include/sysemu/qtest.h| 2 - accel/qtest/qtest.c | 13 ++ contrib/plugins/ips.c | 164 ++ plugins/api.c | 31 ...t-virtual-clock.c => cpus-virtual-clock.c} | 5 + system/cpus.c | 11 ++ system/qtest.c| 37 +--- util/qemu-timer.c | 26 +++ contrib/plugins/Makefile | 1 + plugins/qemu-plugins.symbols | 2 + stubs/meson.build | 2 +- 15 files changed, 314 insertions(+), 39 deletions(-) create mode 100644 contrib/plugins/ips.c rename stubs/{cpus-get-virtual-clock.c => cpus-virtual-clock.c} (68%) -- 2.39.2
[PATCH v3 6/6] contrib/plugins: add ips plugin example for cost modeling
This plugin uses the new time control interface to make decisions about the state of time during the emulation. The algorithm is currently very simple. The user specifies an ips rate which applies per core. If the core runs ahead of its allocated execution time the plugin sleeps for a bit to let real time catch up. Either way time is updated for the emulation as a function of total executed instructions with some adjustments for cores that idle. Examples Slow down execution of /bin/true: $ num_insn=$(./build/qemu-x86_64 -plugin ./build/tests/plugin/libinsn.so -d plugin /bin/true |& grep total | sed -e 's/.*: //') $ time ./build/qemu-x86_64 -plugin ./build/contrib/plugins/libips.so,ips=$(($num_insn/4)) /bin/true real 4.000s Boot a Linux kernel simulating a 250MHz cpu: $ ./build/qemu-system-x86_64 -kernel /boot/vmlinuz-6.1.0-21-amd64 -append "console=ttyS0" -plugin ./build/contrib/plugins/libips.so,ips=$((250*1000*1000)) -smp 1 -m 512 check time until kernel panic on serial0 Tested in system mode by booting a full debian system, and using: $ sysbench cpu run Performance decreases linearly with the given number of ips. Signed-off-by: Pierrick Bouvier --- contrib/plugins/ips.c| 164 +++ contrib/plugins/Makefile | 1 + 2 files changed, 165 insertions(+) create mode 100644 contrib/plugins/ips.c diff --git a/contrib/plugins/ips.c b/contrib/plugins/ips.c new file mode 100644 index 000..db77729264b --- /dev/null +++ b/contrib/plugins/ips.c @@ -0,0 +1,164 @@ +/* + * ips rate limiting plugin. + * + * This plugin can be used to restrict the execution of a system to a + * particular number of Instructions Per Second (ips). This controls + * time as seen by the guest so while wall-clock time may be longer + * from the guest's point of view time will pass at the normal rate. + * + * This uses the new plugin API which allows the plugin to control + * system time. 
+ * + * Copyright (c) 2023 Linaro Ltd + * + * SPDX-License-Identifier: GPL-2.0-or-later + */ + +#include +#include +#include + +QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION; + +/* how many times do we update time per sec */ +#define NUM_TIME_UPDATE_PER_SEC 10 +#define NSEC_IN_ONE_SEC (1000 * 1000 * 1000) + +static GMutex global_state_lock; + +static uint64_t max_insn_per_second = 1000 * 1000 * 1000; /* ips per core, per second */ +static uint64_t max_insn_per_quantum; /* trap every N instructions */ +static int64_t virtual_time_ns; /* last set virtual time */ + +static const void *time_handle; + +typedef struct { +uint64_t total_insn; +uint64_t quantum_insn; /* insn in last quantum */ +int64_t last_quantum_time; /* time when last quantum started */ +} vCPUTime; + +struct qemu_plugin_scoreboard *vcpus; + +/* return epoch time in ns */ +static int64_t now_ns(void) +{ +return g_get_real_time() * 1000; +} + +static uint64_t num_insn_during(int64_t elapsed_ns) +{ +double num_secs = elapsed_ns / (double) NSEC_IN_ONE_SEC; +return num_secs * (double) max_insn_per_second; +} + +static int64_t time_for_insn(uint64_t num_insn) +{ +double num_secs = (double) num_insn / (double) max_insn_per_second; +return num_secs * (double) NSEC_IN_ONE_SEC; +} + +static void update_system_time(vCPUTime *vcpu) +{ +int64_t elapsed_ns = now_ns() - vcpu->last_quantum_time; +uint64_t max_insn = num_insn_during(elapsed_ns); + +if (vcpu->quantum_insn >= max_insn) { +/* this vcpu ran faster than expected, so it has to sleep */ +uint64_t insn_advance = vcpu->quantum_insn - max_insn; +uint64_t time_advance_ns = time_for_insn(insn_advance); +int64_t sleep_us = time_advance_ns / 1000; +g_usleep(sleep_us); +} + +vcpu->total_insn += vcpu->quantum_insn; +vcpu->quantum_insn = 0; +vcpu->last_quantum_time = now_ns(); + +/* based on total number of instructions, what should be the new time? 
*/ +int64_t new_virtual_time = time_for_insn(vcpu->total_insn); + +g_mutex_lock(&global_state_lock); + +/* Time only moves forward. Another vcpu might have updated it already. */ +if (new_virtual_time > virtual_time_ns) { +qemu_plugin_update_ns(time_handle, new_virtual_time); +virtual_time_ns = new_virtual_time; +} + +g_mutex_unlock(&global_state_lock); +} + +static void vcpu_init(qemu_plugin_id_t id, unsigned int cpu_index) +{ +vCPUTime *vcpu = qemu_plugin_scoreboard_find(vcpus, cpu_index); +vcpu->total_insn = 0; +vcpu->quantum_insn = 0; +vcpu->last_quantum_time = now_ns(); +} + +static void vcpu_exit(qemu_plugin_id_t id, unsigned int cpu_index) +{ +vCPUTime *vcpu = qemu_plugin_scoreboard_find(vcpus, cpu_index); +update_system_time(vcpu); +} + +static void every_quantum_insn(unsigned int cpu_index, void *udata) +{ +vCPUTime *vcpu = qemu_plugin_scoreboard_find(vcpus, cpu_index); +g_assert(vcpu->quantum_insn >= max_insn_per_quantum); +update_system_time(v
[PATCH v3 5/6] plugins: add time control API
From: Alex Bennée Expose the ability to control time through the plugin API. Only one plugin can control time so it has to request control when loaded. There are probably more corner cases to catch here. Signed-off-by: Alex Bennée Signed-off-by: Pierrick Bouvier --- include/qemu/qemu-plugin.h | 23 +++ plugins/api.c| 31 +++ plugins/qemu-plugins.symbols | 2 ++ 3 files changed, 56 insertions(+) diff --git a/include/qemu/qemu-plugin.h b/include/qemu/qemu-plugin.h index 95703d8fec1..80b1637cede 100644 --- a/include/qemu/qemu-plugin.h +++ b/include/qemu/qemu-plugin.h @@ -661,6 +661,29 @@ void qemu_plugin_register_vcpu_mem_inline_per_vcpu( qemu_plugin_u64 entry, uint64_t imm); +/** + * qemu_plugin_request_time_control() - request the ability to control time + * + * This grants the plugin the ability to control system time. Only one + * plugin can control time so if multiple plugins request the ability + * all but the first will fail. + * + * Returns an opaque handle or NULL if it fails + */ +const void *qemu_plugin_request_time_control(void); + +/** + * qemu_plugin_update_ns() - update system emulation time + * @handle: opaque handle returned by qemu_plugin_request_time_control() + * @time: time in nanoseconds + * + * This allows an appropriately authorised plugin (i.e. holding the + * time control handle) to move system time forward to @time. + * + * Start time is 0. 
+ */ +void qemu_plugin_update_ns(const void *handle, int64_t time); + typedef void (*qemu_plugin_vcpu_syscall_cb_t)(qemu_plugin_id_t id, unsigned int vcpu_index, int64_t num, uint64_t a1, uint64_t a2, diff --git a/plugins/api.c b/plugins/api.c index 5a0a7f8c712..26822b69ea2 100644 --- a/plugins/api.c +++ b/plugins/api.c @@ -39,6 +39,7 @@ #include "qemu/main-loop.h" #include "qemu/plugin.h" #include "qemu/log.h" +#include "qemu/timer.h" #include "tcg/tcg.h" #include "exec/exec-all.h" #include "exec/gdbstub.h" @@ -583,3 +584,33 @@ uint64_t qemu_plugin_u64_sum(qemu_plugin_u64 entry) } return total; } + +/* + * Time control + */ +static bool has_control; + +const void *qemu_plugin_request_time_control(void) +{ +if (!has_control) { +has_control = true; +return &has_control; +} +return NULL; +} + +static void advance_virtual_time__async(CPUState *cpu, run_on_cpu_data data) +{ +int64_t new_time = data.host_ulong; +qemu_clock_advance_virtual_time(new_time); +} + +void qemu_plugin_update_ns(const void *handle, int64_t new_time) +{ +if (handle == &has_control) { +/* Need to execute out of cpu_exec, so bql can be locked. */ +async_run_on_cpu(current_cpu, + advance_virtual_time__async, + RUN_ON_CPU_HOST_ULONG(new_time)); +} +} diff --git a/plugins/qemu-plugins.symbols b/plugins/qemu-plugins.symbols index aa0a77a319f..ca773d8d9fe 100644 --- a/plugins/qemu-plugins.symbols +++ b/plugins/qemu-plugins.symbols @@ -38,6 +38,7 @@ qemu_plugin_register_vcpu_tb_exec_cond_cb; qemu_plugin_register_vcpu_tb_exec_inline_per_vcpu; qemu_plugin_register_vcpu_tb_trans_cb; + qemu_plugin_request_time_control; qemu_plugin_reset; qemu_plugin_scoreboard_free; qemu_plugin_scoreboard_find; @@ -51,5 +52,6 @@ qemu_plugin_u64_set; qemu_plugin_u64_sum; qemu_plugin_uninstall; + qemu_plugin_update_ns; qemu_plugin_vcpu_for_each; }; -- 2.39.2
[PATCH v3 2/6] qtest: use cpu interface in qtest_clock_warp
From: Alex Bennée This generalises the qtest_clock_warp code to use the AccelOps handlers for updating its own sense of time. This will make the next patch, which moves the warp code, closer to pure code motion. Signed-off-by: Alex Bennée Acked-by: Thomas Huth Signed-off-by: Pierrick Bouvier --- include/sysemu/qtest.h | 1 + accel/qtest/qtest.c| 1 + system/qtest.c | 6 +++--- 3 files changed, 5 insertions(+), 3 deletions(-) diff --git a/include/sysemu/qtest.h b/include/sysemu/qtest.h index b5d5fd34637..45f3b7e1df5 100644 --- a/include/sysemu/qtest.h +++ b/include/sysemu/qtest.h @@ -36,6 +36,7 @@ void qtest_server_set_send_handler(void (*send)(void *, const char *), void qtest_server_inproc_recv(void *opaque, const char *buf); int64_t qtest_get_virtual_clock(void); +void qtest_set_virtual_clock(int64_t count); #endif #endif diff --git a/accel/qtest/qtest.c b/accel/qtest/qtest.c index f6056ac8361..53182e6c2ae 100644 --- a/accel/qtest/qtest.c +++ b/accel/qtest/qtest.c @@ -52,6 +52,7 @@ static void qtest_accel_ops_class_init(ObjectClass *oc, void *data) ops->create_vcpu_thread = dummy_start_vcpu_thread; ops->get_virtual_clock = qtest_get_virtual_clock; +ops->set_virtual_clock = qtest_set_virtual_clock; }; static const TypeInfo qtest_accel_ops_type = { diff --git a/system/qtest.c b/system/qtest.c index 6da58b3874e..ee8b139e982 100644 --- a/system/qtest.c +++ b/system/qtest.c @@ -332,14 +332,14 @@ int64_t qtest_get_virtual_clock(void) return qatomic_read_i64(&qtest_clock_counter); } -static void qtest_set_virtual_clock(int64_t count) +void qtest_set_virtual_clock(int64_t count) { qatomic_set_i64(&qtest_clock_counter, count); } static void qtest_clock_warp(int64_t dest) { -int64_t clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL); +int64_t clock = cpus_get_virtual_clock(); AioContext *aio_context; assert(qtest_enabled()); aio_context = qemu_get_aio_context(); @@ -348,7 +348,7 @@ static void qtest_clock_warp(int64_t dest) QEMU_TIMER_ATTR_ALL); int64_t warp = 
qemu_soonest_timeout(dest - clock, deadline); -qtest_set_virtual_clock(qtest_get_virtual_clock() + warp); +cpus_set_virtual_clock(cpus_get_virtual_clock() + warp); qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL); timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]); -- 2.39.2
Re: [PATCH v2 13/18] monitor: fdset: Match against O_DIRECT
On Thu, May 23, 2024 at 04:05:43PM -0300, Fabiano Rosas wrote: > We're about to enable the use of O_DIRECT in the migration code and > due to the alignment restrictions imposed by filesystems we need to > make sure the flag is only used when doing aligned IO. > > The migration will do parallel IO to different regions of a file, so > we need to use more than one file descriptor. Those cannot be obtained > by duplicating (dup()) since duplicated file descriptors share the > file status flags, including O_DIRECT. If one migration channel does > unaligned IO while another sets O_DIRECT to do aligned IO, the > filesystem would fail the unaligned operation. > > The add-fd QMP command along with the fdset code are specifically > designed to allow the user to pass a set of file descriptors with > different access flags into QEMU to be later fetched by code that > needs to alternate between those flags when doing IO. > > Extend the fdset matching to behave the same with the O_DIRECT flag. > > Signed-off-by: Fabiano Rosas Reviewed-by: Peter Xu One thing I am confused about, though it's totally irrelevant to this specific change: I wonder why we need dupfds at all, and why the client needs remove-fd at all. That is, what would go wrong if the qmp client only did add-fd, and then, once an fd is consumed by anything, it got moved from the "fds" list to the "dupfds" list? The thing is, I don't expect the client to pass over any fd that will not be consumed. Then if it's always consumed, why bother with dup() at all, and why bother with an explicit remove-fd? -- Peter Xu
Re: [PATCH v2 11/18] migration/multifd: Add direct-io support
On Thu, May 23, 2024 at 04:05:41PM -0300, Fabiano Rosas wrote: > When multifd is used along with mapped-ram, we can take benefit of a > filesystem that supports the O_DIRECT flag and perform direct I/O in > the multifd threads. This brings a significant performance improvement > because direct-io writes bypass the page cache which would otherwise > be thrashed by the multifd data which is unlikely to be needed again > in a short period of time. > > To be able to use a multifd channel opened with O_DIRECT, we must > ensure that a certain aligment is used. Filesystems usually require a > block-size alignment for direct I/O. The way to achieve this is by > enabling the mapped-ram feature, which already aligns its I/O properly > (see MAPPED_RAM_FILE_OFFSET_ALIGNMENT at ram.c). > > By setting O_DIRECT on the multifd channels, all writes to the same > file descriptor need to be aligned as well, even the ones that come > from outside multifd, such as the QEMUFile I/O from the main migration > code. This makes it impossible to use the same file descriptor for the > QEMUFile and for the multifd channels. The various flags and metadata > written by the main migration code will always be unaligned by virtue > of their small size. To workaround this issue, we'll require a second > file descriptor to be used exclusively for direct I/O. > > The second file descriptor can be obtained by QEMU by re-opening the > migration file (already possible), or by being provided by the user or > management application (support to be added in future patches). 
> > Signed-off-by: Fabiano Rosas > --- > migration/file.c | 31 ++- > migration/file.h | 1 - > migration/migration.c | 23 +++ > 3 files changed, 49 insertions(+), 6 deletions(-) > > diff --git a/migration/file.c b/migration/file.c > index ba5b5c44ff..ac4d206492 100644 > --- a/migration/file.c > +++ b/migration/file.c > @@ -50,12 +50,31 @@ void file_cleanup_outgoing_migration(void) > outgoing_args.fname = NULL; > } > > +static void file_enable_direct_io(int *flags) > +{ > +#ifdef O_DIRECT > +if (migrate_direct_io()) { > +*flags |= O_DIRECT; > +} > +#else > +/* it should have been rejected when setting the parameter */ > +g_assert_not_reached(); > +#endif > +} > + > bool file_send_channel_create(gpointer opaque, Error **errp) > { > QIOChannelFile *ioc; > int flags = O_WRONLY; > bool ret = true; > > +/* > + * Attempt to enable O_DIRECT for the secondary channels. These > + * are used for sending ram pages and writes should be guaranteed > + * to be aligned to at least page size. > + */ > +file_enable_direct_io(&flags); Call this only if enabled? That looks clearer, IMHO: if (migrate_direct_io()) { file_enable_direct_io(&flags); } Then: static void file_enable_direct_io(int *flags) { #ifdef O_DIRECT *flags |= O_DIRECT; #else /* it should have been rejected when setting the parameter */ g_assert_not_reached(); #endif } If you remember we have similar multifd calls, and I hoped all multifd functions are only invoked when multifd is enabled first. Same thing. 
> + > ioc = qio_channel_file_new_path(outgoing_args.fname, flags, 0, errp); > if (!ioc) { > ret = false; > @@ -116,21 +135,23 @@ static gboolean > file_accept_incoming_migration(QIOChannel *ioc, > return G_SOURCE_REMOVE; > } > > -void file_create_incoming_channels(QIOChannel *ioc, Error **errp) > +static void file_create_incoming_channels(QIOChannel *ioc, char *filename, > + Error **errp) > { > -int i, fd, channels = 1; > +int i, channels = 1; > g_autofree QIOChannel **iocs = NULL; > +int flags = O_RDONLY; > > if (migrate_multifd()) { > channels += migrate_multifd_channels(); > +file_enable_direct_io(&flags); Same here. Other than that looks good. Thanks, > } > > iocs = g_new0(QIOChannel *, channels); > -fd = QIO_CHANNEL_FILE(ioc)->fd; > iocs[0] = ioc; > > for (i = 1; i < channels; i++) { > -QIOChannelFile *fioc = qio_channel_file_new_dupfd(fd, errp); > +QIOChannelFile *fioc = qio_channel_file_new_path(filename, flags, 0, > errp); > > if (!fioc) { > while (i) { > @@ -170,7 +191,7 @@ void file_start_incoming_migration(FileMigrationArgs > *file_args, Error **errp) > return; > } > > -file_create_incoming_channels(QIO_CHANNEL(fioc), errp); > +file_create_incoming_channels(QIO_CHANNEL(fioc), filename, errp); > } > > int file_write_ramblock_iov(QIOChannel *ioc, const struct iovec *iov, > diff --git a/migration/file.h b/migration/file.h > index 7699c04677..9f71e87f74 100644 > --- a/migration/file.h > +++ b/migration/file.h > @@ -20,7 +20,6 @@ void file_start_outgoing_migration(MigrationState *s, > int file_parse_offset(char *filespec, uint64_t
Re: [PATCH] hw/usb/hcd-ohci: Fix ohci_service_td: accept valid TDs
On Thu, May 30, 2024 at 2:14 PM Alex Bennée wrote: > David Hubbard writes: > > > From: Cord Amfmgm > > > > This changes the way the ohci emulation handles a Transfer Descriptor > with > > "Current Buffer Pointer" set to "Buffer End" + 1. > > > > The OHCI spec 4.3.1.2 Table 4-2 allows td.cbp to be one byte more than > td.be > > to signal the buffer has zero length. Currently qemu only accepts > zero-length > > Transfer Descriptors if the td.cbp is equal to 0, while actual OHCI > hardware > > accepts both cases. > > Which version of the OHCI spec is this? I can't find it in the one copy > Google throws up: > > > http://download.microsoft.com/download/1/6/1/161ba512-40e2-4cc9-843a-923143f3456c/ohci_11.pdf > > Replace http with https in that URL and it downloads the PDF OK - it is for IEEE-1394 Firewire though. Try this link: https://www.cs.usfca.edu/~cruse/cs698s10/hcir1_0a.pdf - I am on page 35/160 (the page is numbered "21" on the bottom) for the Table 4-2.
Re: [PATCH v2 10/18] migration: Add direct-io parameter
On Thu, May 23, 2024 at 04:05:40PM -0300, Fabiano Rosas wrote: > Add the direct-io migration parameter that tells the migration code to > use O_DIRECT when opening the migration stream file whenever possible. > > This is currently only used with the mapped-ram migration that has a > clear window guaranteed to perform aligned writes. > > Acked-by: Markus Armbruster > Signed-off-by: Fabiano Rosas Reviewed-by: Peter Xu -- Peter Xu
Re: hw/usb/hcd-ohci: Fix #1510, #303: pid not IN or OUT
On Thu, May 30, 2024 at 2:12 PM Alex Bennée wrote: > Cord Amfmgm writes: > > > On Thu, May 30, 2024 at 3:33 AM Alex Bennée > wrote: > > > > Cord Amfmgm writes: > > > > > On Tue, May 28, 2024 at 11:32 AM Peter Maydell < > peter.mayd...@linaro.org> wrote: > > > > > > On Tue, 28 May 2024 at 16:37, Cord Amfmgm > wrote: > > > > > > > > On Tue, May 28, 2024 at 9:03 AM Peter Maydell < > peter.mayd...@linaro.org> wrote: > > > >> > > > >> On Mon, 20 May 2024 at 23:24, Cord Amfmgm > wrote: > > > >> > On Mon, May 20, 2024 at 12:05 PM Peter Maydell < > peter.mayd...@linaro.org> wrote: > > > > > >> > And here's an example buffer of length 0 -- you probably > already know what I'm going to do here: > > > >> > > > > >> > char buf[0]; > > > >> > char * CurrentBufferPointer = &buf[0]; > > > >> > char * BufferEnd = &buf[-1]; // "address of the last byte in > the buffer" > > > >> > // The OHCI Host Controller than advances CurrentBufferPointer > like this: CurrentBufferPointer += 0 > > > >> > // After the transfer: > > > >> > // CurrentBufferPointer = &buf[0]; > > > >> > // BufferEnd = &buf[-1]; > > > >> > > > >> Right, but why do you think this is valid, rather than > > > >> being a guest software bug? My reading of the spec is that it's > > > >> pretty clear about how to say "zero length buffer", and this > > > >> isn't it. > > > >> > > > >> Is there some real-world guest OS that programs the OHCI > > > >> controller this way that we're trying to accommodate? > > > > > > > > > > > > qemu versions 4.2 and before allowed this behavior. > > > > > > So? That might just mean we had a bug and we fixed it. > > > 4.2 is a very old version of QEMU and nobody seems to have > > > complained in the four years since we released 5.0 about this, > > > which suggests that generally guest OS drivers don't try > > > to send zero-length buffers in this way. > > > > > > > I don't think it's valid to ask for a *popular* guest OS as a > proof-of-concept because I'm not an expert on those. 
> > > > > > I didn't ask for "popular"; I asked for "real-world". > > > What is the actual guest code you're running that falls over > > > because of the behaviour change? > > > > > > More generally, why do you want this behaviour to be > > > changed? Reasonable reasons might include: > > > * we're out of spec based on reading the documentation > > > * you're trying to run some old Windows VM/QNX/etc image, > > > and it doesn't work any more > > > * all the real hardware we tested behaves this way > > > > > > But don't necessarily include: > > > * something somebody wrote and only tested on QEMU happens to > > > assume the old behaviour rather than following the hw spec > > > > > > QEMU occasionally works around guest OS bugs, but only as > > > when we really have to. It's usually better to fix the > > > bug in the guest. > > > > > > It's not, and I've already demonstrated that real hardware is > consistent with the fix in this patch. > > > > > > Please check your tone. > > > > I don't think that is a particularly helpful comment for someone who is > > taking the time to review your patches. Reading through the thread I > > didn't see anything that said this is how real HW behaves but I may well > > have missed it. However you have a number of review comments to address > > so I suggest you spin a v2 of the series to address them and outline the > > reason to accept an out of spec transaction. > > > > I did a rework of the patch -- see my email from May 20, quoted below -- > and I was under the impression it addressed all the > > review comments. Did I miss something? I apologize if I did. > > Ahh I see - I'd only seen this thread continue so wasn't aware a new > version had been posted. For future patches consider using -vN when > sending them so we can clearly see a new revision is available. 
> > > > >> index acd6016980..71b54914d3 100644 > >> --- a/hw/usb/hcd-ohci.c > >> +++ b/hw/usb/hcd-ohci.c > >> @@ -941,8 +941,8 @@ static int ohci_service_td(OHCIState *ohci, struct > ohci_ed *ed) > >> if ((td.cbp & 0xf000) != (td.be & 0xf000)) { > >> len = (td.be & 0xfff) + 0x1001 - (td.cbp & 0xfff); > >> } else { > >> -if (td.cbp > td.be) { > >> -trace_usb_ohci_iso_td_bad_cc_overrun(td.cbp, td.be); > >> +if (td.cbp - 1 > td.be) { /* rely on td.cbp != 0 */ > > > >> Reading through the thread I didn't see anything that said this is how > real HW behaves but I may well have missed it. > > > > This is what I wrote regarding real HW: > > > > Results are: > > > > qemu 4.2 | qemu HEAD | actual HW > > ++ > > works fine | ohci_die() | works fine > > > > Would additional verification of the actual HW be useful? > > > > Peter posted the following which is more specific than "qemu 4.2" -- I > agree this is most likely the qemu commi
Re: [PATCH v2 09/18] io: Stop using qemu_open_old in channel-file
On Thu, May 23, 2024 at 04:05:39PM -0300, Fabiano Rosas wrote: > We want to make use of the Error object to report fdset errors from > qemu_open_internal() and passing the error pointer to qemu_open_old() > would require changing all callers. Move the file channel to the new > API instead. > > Reviewed-by: Daniel P. Berrangé > Signed-off-by: Fabiano Rosas Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH v2 08/18] monitor: Report errors from monitor_fdset_dup_fd_add
On Thu, May 23, 2024 at 04:05:38PM -0300, Fabiano Rosas wrote: > I'm keeping the EACCES because callers expect to be able to look at > errno. > > Signed-off-by: Fabiano Rosas Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH v2 06/18] monitor: Stop removing non-duplicated fds
On Thu, May 23, 2024 at 04:05:36PM -0300, Fabiano Rosas wrote: > We've been up until now cleaning up any file descriptors that have > been passed into QEMU and never duplicated[1,2]. A file descriptor > without duplicates indicates that no part of QEMU has made use of > it. This approach is starting to show some cracks now that we're > starting to consume fds from the migration code: > > - Doing cleanup every time the last monitor connection closes works to > reap unused fds, but also has the side effect of forcing the > management layer to pass the file descriptors again in case of a > disconnect/re-connect, if that happened to be the only monitor > connection. > > Another side effect is that removing an fd with qmp_remove_fd() is > effectively delayed until the last monitor connection closes. > > The reliance on mon_refcount is also problematic because it's racy. > > - Checking runstate_is_running() skips the cleanup unless the VM is > running and avoids premature cleanup of the fds, but also has the > side effect of blocking the legitimate removal of an fd via > qmp_remove_fd() if the VM happens to be in another state. > > This affects qmp_remove_fd() and qmp_query_fdsets() in particular > because requesting a removal at a bad time (guest stopped) might > cause an fd to never be removed, or to be removed at a much later > point in time, causing the query command to continue showing the > supposedly removed fd/fdset. > > Note that file descriptors that *have* been duplicated are owned by > the code that uses them and will be removed after qemu_close() is > called. Therefore we've decided that the best course of action to > avoid the undesired side-effects is to stop managing non-duplicated > file descriptors. 
> > 1- efb87c1697 ("monitor: Clean up fd sets on monitor disconnect") > 2- ebe52b592d ("monitor: Prevent removing fd from set during init") > > Signed-off-by: Fabiano Rosas > --- > monitor/fds.c | 15 --- > monitor/hmp.c | 2 -- > monitor/monitor-internal.h | 1 - > monitor/qmp.c | 2 -- > 4 files changed, 8 insertions(+), 12 deletions(-) > > diff --git a/monitor/fds.c b/monitor/fds.c > index 101e21720d..f7b98814fa 100644 > --- a/monitor/fds.c > +++ b/monitor/fds.c > @@ -169,6 +169,11 @@ int monitor_get_fd(Monitor *mon, const char *fdname, > Error **errp) > > static void monitor_fdset_free(MonFdset *mon_fdset) > { > +/* > + * Only remove an empty fdset. The fds are owned by the user and > + * should have been removed with qmp_remove_fd(). The dup_fds are > + * owned by QEMU and should have been removed with qemu_close(). > + */ > if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) { > QLIST_REMOVE(mon_fdset, next); > g_free(mon_fdset); > @@ -189,9 +194,7 @@ static void monitor_fdset_cleanup(MonFdset *mon_fdset) > MonFdsetFd *mon_fdset_fd_next; > > QLIST_FOREACH_SAFE(mon_fdset_fd, &mon_fdset->fds, next, > mon_fdset_fd_next) { > -if ((mon_fdset_fd->removed || > -(QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0)) && > -runstate_is_running()) { > +if (mon_fdset_fd->removed) { I don't know the code well, but I like it. 
> monitor_fdset_fd_free(mon_fdset_fd); > } > } > @@ -206,7 +209,7 @@ void monitor_fdsets_cleanup(void) > > QEMU_LOCK_GUARD(&mon_fdsets_lock); > QLIST_FOREACH_SAFE(mon_fdset, &mon_fdsets, next, mon_fdset_next) { > -monitor_fdset_cleanup(mon_fdset); > +monitor_fdset_free(mon_fdset); > } > } > > @@ -479,9 +482,7 @@ void monitor_fdset_dup_fd_remove(int dup_fd) > if (mon_fdset_fd_dup->fd == dup_fd) { > QLIST_REMOVE(mon_fdset_fd_dup, next); > g_free(mon_fdset_fd_dup); > -if (QLIST_EMPTY(&mon_fdset->dup_fds)) { > -monitor_fdset_cleanup(mon_fdset); > -} > +monitor_fdset_free(mon_fdset); This and above changes are not crystal clear to me where the _cleanup() does extra check "removed" and clean those fds. I think it'll make it easier for me to understand if this one and the next are squashed together. But maybe it's simply because I didn't fully understand. > return; > } > } > diff --git a/monitor/hmp.c b/monitor/hmp.c > index 69c1b7e98a..460e8832f6 100644 > --- a/monitor/hmp.c > +++ b/monitor/hmp.c > @@ -1437,11 +1437,9 @@ static void monitor_event(void *opaque, QEMUChrEvent > event) > monitor_resume(mon); > } > qemu_mutex_unlock(&mon->mon_lock); > -mon_refcount++; > break; > > case CHR_EVENT_CLOSED: > -mon_refcount--; > monitor_fdsets_cleanup(); > break; > > diff --git a/monitor/monitor-internal.h b/monitor/monitor-internal.h > index 252de85681..cb628f681d 100644 > --- a/
Re: [PATCH v2 05/18] monitor: Introduce monitor_fdset_*free
On Thu, May 23, 2024 at 04:05:35PM -0300, Fabiano Rosas wrote: > Introduce two new functions to remove and free no longer used fds and > fdsets. > > We need those to decouple the remove/free routines from > monitor_fdset_cleanup() which will go away in the next patches. > > The new functions: > > - monitor_fdset_free() will be used when a monitor connection closes > and when an fd is removed to cleanup any fdset that is now empty. > > - monitor_fdset_fd_free() will be used to remove one or more fds that > have been explicitly targeted by qmp_remove_fd(). > > Signed-off-by: Fabiano Rosas Reviewed-by: Peter Xu One nitpick below. > --- > monitor/fds.c | 26 ++ > 1 file changed, 18 insertions(+), 8 deletions(-) > > diff --git a/monitor/fds.c b/monitor/fds.c > index fb9f58c056..101e21720d 100644 > --- a/monitor/fds.c > +++ b/monitor/fds.c > @@ -167,6 +167,22 @@ int monitor_get_fd(Monitor *mon, const char *fdname, > Error **errp) > return -1; > } > > +static void monitor_fdset_free(MonFdset *mon_fdset) > +{ > +if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) { > +QLIST_REMOVE(mon_fdset, next); > +g_free(mon_fdset); > +} > +} Would monitor_fdset_free_if_empty() (or similar) slightly better? 
static void monitor_fdset_free(MonFdset *mon_fdset)
{
    QLIST_REMOVE(mon_fdset, next);
    g_free(mon_fdset);
}

static void monitor_fdset_free_if_empty(MonFdset *mon_fdset)
{
    if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
        monitor_fdset_free(mon_fdset);
    }
}

> +
> +static void monitor_fdset_fd_free(MonFdsetFd *mon_fdset_fd)
> +{
> +    close(mon_fdset_fd->fd);
> +    g_free(mon_fdset_fd->opaque);
> +    QLIST_REMOVE(mon_fdset_fd, next);
> +    g_free(mon_fdset_fd);
> +}
> +
>  static void monitor_fdset_cleanup(MonFdset *mon_fdset)
>  {
>      MonFdsetFd *mon_fdset_fd;
> @@ -176,17 +192,11 @@ static void monitor_fdset_cleanup(MonFdset *mon_fdset)
>          if ((mon_fdset_fd->removed ||
>              (QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0)) &&
>              runstate_is_running()) {
> -            close(mon_fdset_fd->fd);
> -            g_free(mon_fdset_fd->opaque);
> -            QLIST_REMOVE(mon_fdset_fd, next);
> -            g_free(mon_fdset_fd);
> +            monitor_fdset_fd_free(mon_fdset_fd);
>          }
>      }
> 
> -    if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
> -        QLIST_REMOVE(mon_fdset, next);
> -        g_free(mon_fdset);
> -    }
> +    monitor_fdset_free(mon_fdset);
>  }
> 
>  void monitor_fdsets_cleanup(void)
> --
> 2.35.3

-- 
Peter Xu
RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> > For rdma programming, the current mainstream implementation is to use > rdma_cm to establish a connection, and then use verbs to transmit data. > > > > rdma_cm and ibverbs create two FDs respectively. The two FDs have > > different responsibilities. rdma_cm fd is used to notify connection > > establishment events, and verbs fd is used to notify new CQEs. When > poll/epoll monitoring is directly performed on the rdma_cm fd, only a pollin > event can be monitored, which means that an rdma_cm event occurs. When > the verbs fd is directly polled/epolled, only the pollin event can be > listened, > which indicates that a new CQE is generated. > > > > Rsocket is a sub-module attached to the rdma_cm library and provides > > rdma calls that are completely similar to socket interfaces. However, > > this library returns only the rdma_cm fd for listening to link setup-related > events and does not expose the verbs fd (readable and writable events for > listening to data). Only the rpoll interface provided by the RSocket can be > used > to listen to related events. However, QEMU uses the ppoll interface to listen > to > the rdma_cm fd (gotten by raccept API). > > And cannot listen to the verbs fd event. Only some hacking methods can be > used to address this problem. > > > > Do you guys have any ideas? Thanks. The current rsocket code allows calling rpoll() with non-rsocket fd's, so an app can use rpoll() directly in place of poll(). It may be easiest to add an rppoll() call to rsockets and call that when using RDMA. In case the easy path isn't feasible: An extension could allow extracting the actual fd's under an rsocket, in order to allow a user to call poll()/ppoll() directly. But it would be non-trivial. The 'fd' that represents an rsocket happens to be the fd related to the RDMA CM. That's because an rsocket needs a unique integer value to report as an 'fd' value which will not conflict with any other fd value that the app may have. 
I would consider the fd value an implementation detail, rather than something which an app should depend upon. (For example, the 'fd' value returned for a datagram rsocket is actually a UDP socket fd). Once an rsocket is in the connected state, it's possible an extended rgetsockopt() or rfcntl() call could return the fd related to the CQ. But if an app tried to call poll() on that fd, the results would not be as expected. For example, it's possible for data to be available to receive on the rsocket without the CQ fd being signaled. Calling poll() on the CQ fd in this state could leave the app hanging. This is a natural? result of races in the RDMA CQ signaling. If you look at the rsocket rpoll() implementation, you'll see that it checks for data prior to sleeping. For an app to safely wait in poll/ppoll on the CQ fd, it would need to invoke some sort of 'pre-poll' routine, which would perform the same checks done in rpoll() prior to blocking. As a reference to a similar pre-poll routine, see the fi_trywait() call from this man page: https://ofiwg.github.io/libfabric/v1.21.0/man/fi_poll.3.html This is for a different library but deals with the same underlying problem. Obviously adding an rtrywait() to rsockets is possible but wouldn't align with any socket API equivalent. - Sean
[PATCH 1/5] hw/core: expand on the alignment of CPUState
Make the relationship between CPUState, ArchCPU and cpu_env a bit
clearer in the kdoc comments.

Signed-off-by: Alex Bennée
---
 include/hw/core/cpu.h | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index bb398e8237..35d345371b 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -391,7 +391,8 @@ struct qemu_work_item;
 #define CPU_UNSET_NUMA_NODE_ID -1
 
 /**
- * CPUState:
+ * struct CPUState - common state of one CPU core or thread.
+ *
  * @cpu_index: CPU index (informative).
  * @cluster_index: Identifies which cluster this CPU is in.
  *    For boards which don't define clusters or for "loose" CPUs not assigned
@@ -439,10 +440,15 @@ struct qemu_work_item;
  * @kvm_fetch_index: Keeps the index that we last fetched from the per-vCPU
  *    dirty ring structure.
  *
- * State of one CPU core or thread.
+ * @neg_align: The CPUState is the common part of a concrete ArchCPU
+ * which is allocated when an individual CPU instance is created. As
+ * such care is taken is ensure there is no gap between between
+ * CPUState and CPUArchState within ArchCPU.
  *
- * Align, in order to match possible alignment required by CPUArchState,
- * and eliminate a hole between CPUState and CPUArchState within ArchCPU.
+ * @neg: The architectural register state ("cpu_env") immediately follows CPUState
+ * in ArchCPU and is passed to TCG code. The @neg structure holds some
+ * common TCG CPU variables which are accessed with a negative offset
+ * from cpu_env.
  */
 struct CPUState {
     /*< private >*/
--
2.39.2
[PATCH 2/5] cpu: move Qemu[Thread|Cond] setup into common code
Aside from the round robin threads this is all common code. By moving the halt_cond setup we also no longer need hacks to work around the race between QOM object creation and thread creation. It is a little ugly to free stuff up for the round robin thread but better it deal with its own specialises than making the other accelerators jump through hoops. Signed-off-by: Alex Bennée --- include/hw/core/cpu.h | 4 accel/dummy-cpus.c| 3 --- accel/hvf/hvf-accel-ops.c | 4 accel/kvm/kvm-accel-ops.c | 3 --- accel/tcg/tcg-accel-ops-mttcg.c | 4 accel/tcg/tcg-accel-ops-rr.c | 14 +++--- hw/core/cpu-common.c | 5 + target/i386/nvmm/nvmm-accel-ops.c | 3 --- target/i386/whpx/whpx-accel-ops.c | 3 --- 9 files changed, 16 insertions(+), 27 deletions(-) diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h index 35d345371b..a405119eda 100644 --- a/include/hw/core/cpu.h +++ b/include/hw/core/cpu.h @@ -404,10 +404,14 @@ struct qemu_work_item; * @tcg_cflags: Pre-computed cflags for this cpu. * @nr_cores: Number of cores within this CPU package. * @nr_threads: Number of threads within this CPU core. + * @thread: Host thread details, only live once @created is #true + * @sem: WIN32 only semaphore used only for qtest + * @thread_id: native thread id of vCPU, only live once @created is #true * @running: #true if CPU is currently running (lockless). * @has_waiter: #true if a CPU is currently waiting for the cpu_exec_end; * valid under cpu_list_lock. * @created: Indicates whether the CPU thread has been successfully created. + * @halt_cond: condition variable sleeping threads can wait on. * @interrupt_request: Indicates a pending interrupt request. * @halted: Nonzero if the CPU is in suspended state. * @stop: Indicates a pending stop request. 
diff --git a/accel/dummy-cpus.c b/accel/dummy-cpus.c index 20519f1ea4..f32d8c8dc3 100644 --- a/accel/dummy-cpus.c +++ b/accel/dummy-cpus.c @@ -68,9 +68,6 @@ void dummy_start_vcpu_thread(CPUState *cpu) { char thread_name[VCPU_THREAD_NAME_SIZE]; -cpu->thread = g_malloc0(sizeof(QemuThread)); -cpu->halt_cond = g_malloc0(sizeof(QemuCond)); -qemu_cond_init(cpu->halt_cond); snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/DUMMY", cpu->cpu_index); qemu_thread_create(cpu->thread, thread_name, dummy_cpu_thread_fn, cpu, diff --git a/accel/hvf/hvf-accel-ops.c b/accel/hvf/hvf-accel-ops.c index 40d4187d9d..6f1e27ef46 100644 --- a/accel/hvf/hvf-accel-ops.c +++ b/accel/hvf/hvf-accel-ops.c @@ -463,10 +463,6 @@ static void hvf_start_vcpu_thread(CPUState *cpu) */ assert(hvf_enabled()); -cpu->thread = g_malloc0(sizeof(QemuThread)); -cpu->halt_cond = g_malloc0(sizeof(QemuCond)); -qemu_cond_init(cpu->halt_cond); - snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF", cpu->cpu_index); qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn, diff --git a/accel/kvm/kvm-accel-ops.c b/accel/kvm/kvm-accel-ops.c index 94c828ac8d..c239dfc87a 100644 --- a/accel/kvm/kvm-accel-ops.c +++ b/accel/kvm/kvm-accel-ops.c @@ -66,9 +66,6 @@ static void kvm_start_vcpu_thread(CPUState *cpu) { char thread_name[VCPU_THREAD_NAME_SIZE]; -cpu->thread = g_malloc0(sizeof(QemuThread)); -cpu->halt_cond = g_malloc0(sizeof(QemuCond)); -qemu_cond_init(cpu->halt_cond); snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/KVM", cpu->cpu_index); qemu_thread_create(cpu->thread, thread_name, kvm_vcpu_thread_fn, diff --git a/accel/tcg/tcg-accel-ops-mttcg.c b/accel/tcg/tcg-accel-ops-mttcg.c index c552b45b8e..49814ec4af 100644 --- a/accel/tcg/tcg-accel-ops-mttcg.c +++ b/accel/tcg/tcg-accel-ops-mttcg.c @@ -137,10 +137,6 @@ void mttcg_start_vcpu_thread(CPUState *cpu) g_assert(tcg_enabled()); tcg_cpu_init_cflags(cpu, current_machine->smp.max_cpus > 1); -cpu->thread = g_new0(QemuThread, 1); -cpu->halt_cond = 
g_malloc0(sizeof(QemuCond)); -qemu_cond_init(cpu->halt_cond); - /* create a thread per vCPU with TCG (MTTCG) */ snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/TCG", cpu->cpu_index); diff --git a/accel/tcg/tcg-accel-ops-rr.c b/accel/tcg/tcg-accel-ops-rr.c index 894e73e52c..84c36c1450 100644 --- a/accel/tcg/tcg-accel-ops-rr.c +++ b/accel/tcg/tcg-accel-ops-rr.c @@ -317,22 +317,22 @@ void rr_start_vcpu_thread(CPUState *cpu) tcg_cpu_init_cflags(cpu, false); if (!single_tcg_cpu_thread) { -cpu->thread = g_new0(QemuThread, 1); -cpu->halt_cond = g_new0(QemuCond, 1); -qemu_cond_init(cpu->halt_cond); +single_tcg_halt_cond = cpu->halt_cond; +single_tcg_cpu_thread = cpu->thread; /* share a single thread for all cpus with TCG */ snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "ALL CPUs/TCG"); qemu_thread_create(cpu->thread, thread_name,
[PATCH 4/5] plugins: remove special casing for cpu->realized
Now that the condition variable is initialised early on, we don't need
to go through hoops to avoid calling async_run_on_cpu.

Signed-off-by: Alex Bennée
---
 plugins/core.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/plugins/core.c b/plugins/core.c
index 0726bc7f25..badede28cf 100644
--- a/plugins/core.c
+++ b/plugins/core.c
@@ -65,11 +65,7 @@ static void plugin_cpu_update__locked(gpointer k, gpointer v, gpointer udata)
     CPUState *cpu = container_of(k, CPUState, cpu_index);
     run_on_cpu_data mask = RUN_ON_CPU_HOST_ULONG(*plugin.mask);
 
-    if (DEVICE(cpu)->realized) {
-        async_run_on_cpu(cpu, plugin_cpu_update__async, mask);
-    } else {
-        plugin_cpu_update__async(cpu, mask);
-    }
+    async_run_on_cpu(cpu, plugin_cpu_update__async, mask);
 }
 
 void plugin_unregister_cb__locked(struct qemu_plugin_ctx *ctx,
--
2.39.2
[PATCH 5/5] core/cpu-common: initialise plugin state before thread creation
Originally I tried to move vCPU thread initialisation to later in
realize. However pulling that thread (sic) got gnarly really quickly.
It turns out some steps of CPU realization need values that can only
be determined from the running vCPU thread.

However having moved enough out of the thread creation we can now
queue work before the thread starts (at least for TCG guests) and
avoid the race between vcpu_init and other vcpu states a plugin might
subscribe to.

Signed-off-by: Alex Bennée
---
 hw/core/cpu-common.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c
index 6cfc01593a..bf1a7b8892 100644
--- a/hw/core/cpu-common.c
+++ b/hw/core/cpu-common.c
@@ -222,14 +222,6 @@ static void cpu_common_realizefn(DeviceState *dev, Error **errp)
         cpu_resume(cpu);
     }
 
-    /* Plugin initialization must wait until the cpu start executing code */
-#ifdef CONFIG_PLUGIN
-    if (tcg_enabled()) {
-        cpu->plugin_state = qemu_plugin_create_vcpu_state();
-        async_run_on_cpu(cpu, qemu_plugin_vcpu_init__async, RUN_ON_CPU_NULL);
-    }
-#endif
-
     /* NOTE: latest generic point where the cpu is fully realized */
 }
 
@@ -273,6 +265,18 @@ static void cpu_common_initfn(Object *obj)
     QTAILQ_INIT(&cpu->watchpoints);
 
     cpu_exec_initfn(cpu);
+
+    /*
+     * Plugin initialization must wait until the cpu start executing
+     * code, but we must queue this work before the threads are
+     * created to ensure we don't race.
+     */
+#ifdef CONFIG_PLUGIN
+    if (tcg_enabled()) {
+        cpu->plugin_state = qemu_plugin_create_vcpu_state();
+        async_run_on_cpu(cpu, qemu_plugin_vcpu_init__async, RUN_ON_CPU_NULL);
+    }
+#endif
 }
 
 static void cpu_common_finalize(Object *obj)
--
2.39.2
[PATCH 0/5] cpus: a few tweaks to CPU realization
The recent IPS plugin exposed a race condition between vcpu_init callbacks and the other vcpu state callbacks. I originally thought there was some wider re-factoring to be done to clean this up but it turns out things are broadly where they should be. However some of the stuff allocated in the vCPU threads can clearly be done earlier so I've moved enough from cpu_common_realizefn to cpu_common_initfn to allow plugins to queue work before the threads start solving the race. Please review. Alex Bennée (5): hw/core: expand on the alignment of CPUState cpu: move Qemu[Thread|Cond] setup into common code cpu-target: don't set cpu->thread_id to bogus value plugins: remove special casing for cpu->realized core/cpu-common: initialise plugin state before thread creation include/hw/core/cpu.h | 18 ++ accel/dummy-cpus.c| 3 --- accel/hvf/hvf-accel-ops.c | 4 accel/kvm/kvm-accel-ops.c | 3 --- accel/tcg/tcg-accel-ops-mttcg.c | 4 accel/tcg/tcg-accel-ops-rr.c | 14 +++--- cpu-target.c | 1 - hw/core/cpu-common.c | 25 + plugins/core.c| 6 +- target/i386/nvmm/nvmm-accel-ops.c | 3 --- target/i386/whpx/whpx-accel-ops.c | 3 --- 11 files changed, 39 insertions(+), 45 deletions(-) -- 2.39.2
[PATCH 3/5] cpu-target: don't set cpu->thread_id to bogus value
The thread_id isn't valid until the threads are created. There is no
point setting it here. The only thing that cares about the thread_id
is qmp_query_cpus_fast.

Signed-off-by: Alex Bennée
---
 cpu-target.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/cpu-target.c b/cpu-target.c
index 5af120e8aa..499facf774 100644
--- a/cpu-target.c
+++ b/cpu-target.c
@@ -241,7 +241,6 @@ void cpu_exec_initfn(CPUState *cpu)
     cpu->num_ases = 0;
 
 #ifndef CONFIG_USER_ONLY
-    cpu->thread_id = qemu_get_thread_id();
     cpu->memory = get_system_memory();
     object_ref(OBJECT(cpu->memory));
 #endif
--
2.39.2
Re: [PATCH V1 00/26] Live update: cpr-exec
On Thu, May 30, 2024 at 01:17:05PM -0400, Steven Sistare wrote: > On 5/28/2024 12:42 PM, Peter Xu wrote: > > On Tue, May 28, 2024 at 11:10:27AM -0400, Steven Sistare wrote: > > > On 5/27/2024 1:45 PM, Peter Xu wrote: > > > > On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote: > > > > > I understand, thanks. If I can help with any of your todo list, > > > > > just ask - steve > > > > > > > > Thanks for offering the help, Steve. Started looking at this today, > > > > then I > > > > found that I miss something high-level. Let me ask here, and let me > > > > apologize already for starting to throw multiple questions.. > > > > > > > > IIUC the whole idea of this patchset is to allow efficient QEMU > > > > upgrade, in > > > > this case not host kernel but QEMU-only, and/or upper. > > > > > > > > Is there any justification on why the complexity is needed here? It > > > > looks > > > > to me this one is more involved than cpr-reboot, so I'm thinking how > > > > much > > > > we can get from the complexity, and whether it's worthwhile. 1000+ LOC > > > > is > > > > the min support, and if we even expect more to come, that's really > > > > important, IMHO. > > > > > > > > For example, what's the major motivation of this whole work? Is that > > > > more > > > > on performance, or is it more for supporting the special devices like > > > > VFIO > > > > which we used to not support, or something else? I can't find them in > > > > whatever cover letter I can find, including this one. > > > > > > > > Firstly, regarding performance, IMHO it'll be always nice to share even > > > > some very fundamental downtime measurement comparisons using the new > > > > exec > > > > mode v.s. the old migration ways to upgrade QEMU binary. Do you perhaps > > > > have some number on hand when you started working on this feature years > > > > ago? Or maybe some old links on the list would help too, as I didn't > > > > follow this work since the start. 
> > > > > > > > On VFIO, IIUC you started out this project without VFIO migration being > > > > there. Now we have VFIO migration so not sure how much it would work > > > > for > > > > the upgrade use case. Even with current VFIO migration, we may not want > > > > to > > > > migrate device states for a local upgrade I suppose, as that can be a > > > > lot > > > > depending on the type of device assigned. However it'll be nice to > > > > discuss > > > > this too if this is the major purpose of the series. > > > > > > > > I think one other challenge on QEMU upgrade with VFIO devices is that > > > > the > > > > dest QEMU won't be able to open the VFIO device when the src QEMU is > > > > still > > > > using it as the owner. IIUC this is a similar condition where QEMU > > > > wants > > > > to have proper ownership transfer of a shared block device, and AFAIR > > > > right > > > > now we resolved that issue using some form of file lock on the image > > > > file. > > > > In this case it won't easily apply to a VFIO dev fd, but maybe we still > > > > have other approaches, not sure whether you investigated any. E.g. > > > > could > > > > the VFIO handle be passed over using unix scm rights? I think this > > > > might > > > > remove one dependency of using exec which can cause quite some > > > > difference > > > > v.s. a generic migration (from which regard, cpr-reboot is still a > > > > pretty > > > > generic migration). > > > > > > > > You also mentioned vhost/tap, is that also a major goal of this series > > > > in > > > > the follow up patchsets? Is this a problem only because this solution > > > > will > > > > do exec? Can it work if either the exec()ed qemu or dst qemu create the > > > > vhost/tap fds when boot? > > > > > > > > Meanwhile, could you elaborate a bit on the implication on chardevs? > > > > From > > > > what I read in the doc update it looks like a major part of work in the > > > > future, but I don't yet understand the issue.. 
Is it also relevant to > > > > the > > > > exec() approach? > > > > > > > > In all cases, some of such discussion would be really appreciated. And > > > > if > > > > you used to consider other approaches to solve this problem it'll be > > > > great > > > > to mention how you chose this way. Considering this work contains too > > > > many > > > > things, it'll be nice if such discussion can start with the > > > > fundamentals, > > > > e.g. on why exec() is a must. > > > The main goal of cpr-exec is providing a fast and reliable way to update > > > qemu. cpr-reboot is not fast enough or general enough. It requires the > > > guest to support suspend and resume for all devices, and that takes > > > seconds. > > > If one actually reboots the host, that adds more seconds, depending on > > > system services. cpr-exec takes 0.1 secs, and works every time, unlike > > > live migration which can fail to converge on a busy system. Live > > > migration > > > also consumes more system and netwo
Re: [RFC PATCH] cpus: split qemu_init_vcpu and delay vCPU thread creation
Pierrick Bouvier writes: > On 5/29/24 08:22, Alex Bennée wrote: >> This ensures we don't start the thread until cpu_common_realizefn has >> finished. This ensures that plugins will always run >> qemu_plugin_vcpu_init__async first before any other states. It doesn't >> totally eliminate the race that plugin_cpu_update__locked has to work >> around though. I found this while reviewing the ips plugin which makes >> heavy use of the vcpu phase callbacks. >> An alternative might be to move the explicit creation of vCPU >> threads >> to qdev_machine_creation_done()? It doesn't affect user-mode which >> already has a thread to execute in and ensures the QOM object has >> completed creation in cpu_create() before continuing. >> Signed-off-by: Alex Bennée >> Cc: Pierrick Bouvier >> Cc: Philippe Mathieu-Daudé >> --- >> include/hw/core/cpu.h | 8 >> accel/tcg/user-exec-stub.c | 5 + >> hw/core/cpu-common.c | 7 ++- >> plugins/core.c | 5 + >> system/cpus.c | 15 ++- >> 5 files changed, 34 insertions(+), 6 deletions(-) >> diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h >> index bb398e8237..6920699585 100644 >> --- a/include/hw/core/cpu.h >> +++ b/include/hw/core/cpu.h >> @@ -1041,6 +1041,14 @@ void end_exclusive(void); >>*/ >> void qemu_init_vcpu(CPUState *cpu); >> +/** >> + * qemu_start_vcpu: >> + * @cpu: The vCPU to start. >> + * >> + * Create the vCPU thread and start it running. 
>> + */ >> +void qemu_start_vcpu(CPUState *cpu); >> + >> #define SSTEP_ENABLE 0x1 /* Enable simulated HW single stepping */ >> #define SSTEP_NOIRQ 0x2 /* Do not use IRQ while single stepping */ >> #define SSTEP_NOTIMER 0x4 /* Do not Timers while single stepping */ >> diff --git a/accel/tcg/user-exec-stub.c b/accel/tcg/user-exec-stub.c >> index 4fbe2dbdc8..162bb72bbe 100644 >> --- a/accel/tcg/user-exec-stub.c >> +++ b/accel/tcg/user-exec-stub.c >> @@ -18,6 +18,11 @@ void cpu_exec_reset_hold(CPUState *cpu) >> { >> } >> +void qemu_start_vcpu(CPUState *cpu) >> +{ >> +/* NOP for user-mode, we already have a thread */ >> +} >> + >> /* User mode emulation does not support record/replay yet. */ >> bool replay_exception(void) >> diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c >> index 0f0a247f56..68895ddd59 100644 >> --- a/hw/core/cpu-common.c >> +++ b/hw/core/cpu-common.c >> @@ -230,7 +230,12 @@ static void cpu_common_realizefn(DeviceState *dev, >> Error **errp) >> } >> #endif >> -/* NOTE: latest generic point where the cpu is fully realized >> */ >> +/* >> + * With everything set up we can finally start the vCPU thread. >> + * This is a NOP for linux-user. >> + * NOTE: latest generic point where the cpu is fully realized >> + */ >> +qemu_start_vcpu(cpu); >> } >> static void cpu_common_unrealizefn(DeviceState *dev) >> diff --git a/plugins/core.c b/plugins/core.c >> index 0726bc7f25..1e5da7853b 100644 >> --- a/plugins/core.c >> +++ b/plugins/core.c >> @@ -65,6 +65,11 @@ static void plugin_cpu_update__locked(gpointer k, >> gpointer v, gpointer udata) >> CPUState *cpu = container_of(k, CPUState, cpu_index); >> run_on_cpu_data mask = RUN_ON_CPU_HOST_ULONG(*plugin.mask); >> +/* >> + * There is a race condition between the starting of the vCPU >> + * thread at the end of cpu_common_realizefn and when realized is >> + * finally set. >> + */ > > Could we simply have an active wait here? 
> while (!DEVICE(cpu)->realized) {} > > We have a guarantee it will be realized shortly, and if it's too hard > to have a proper synchronization mechanism (introduce a > realize_cond?), then waiting for the proper state does not seem too > bad. > > It's a bit strange for me to document an existing race condition, > instead of finding a solution. If only it were that simple ;-) Having been digging into this today it looks like there is a careful set of dependencies on when threads need to be created during CPU realization. I did try pushing the thread creation out of realization but that breaks things like KVM which can't initialise some things until the thread is up. I'm now trying to move the plugin queuing its async work to after things are initialised and before threads are created. My only concern now is if I need to avoid kicking threads before they are created. -- Alex Bennée Virtualisation Tech Lead @ Linaro
Re: [PATCH] hw/usb/hcd-ohci: Fix ohci_service_td: accept valid TDs
David Hubbard writes: > From: Cord Amfmgm > > This changes the way the ohci emulation handles a Transfer Descriptor with > "Current Buffer Pointer" set to "Buffer End" + 1. > > The OHCI spec 4.3.1.2 Table 4-2 allows td.cbp to be one byte more than td.be > to signal the buffer has zero length. Currently qemu only accepts zero-length > Transfer Descriptors if the td.cbp is equal to 0, while actual OHCI hardware > accepts both cases. Which version of the OHCI spec is this? I can't find it in the one copy Google throws up: http://download.microsoft.com/download/1/6/1/161ba512-40e2-4cc9-843a-923143f3456c/ohci_11.pdf > The qemu ohci emulation has a regression in ohci_service_td. Version 4.2 > and earlier matched the spec. (I haven't taken the time to bisect exactly > where the logic was changed.) > > With a tiny OS[1] that boots and executes a test, the issue can be seen: > > * OS that sends USB requests to a USB mass storage device > but sends td.cbp = td.be + 1 > * qemu 4.2 > * qemu HEAD (4e66a0854) > * Actual OHCI controller (hardware) -- Alex Bennée Virtualisation Tech Lead @ Linaro
Re: hw/usb/hcd-ohci: Fix #1510, #303: pid not IN or OUT
Cord Amfmgm writes: > On Thu, May 30, 2024 at 3:33 AM Alex Bennée wrote: > > Cord Amfmgm writes: > > > On Tue, May 28, 2024 at 11:32 AM Peter Maydell > wrote: > > > > On Tue, 28 May 2024 at 16:37, Cord Amfmgm wrote: > > > > > > On Tue, May 28, 2024 at 9:03 AM Peter Maydell > wrote: > > >> > > >> On Mon, 20 May 2024 at 23:24, Cord Amfmgm wrote: > > >> > On Mon, May 20, 2024 at 12:05 PM Peter Maydell > wrote: > > > >> > And here's an example buffer of length 0 -- you probably already > know what I'm going to do here: > > >> > > > >> > char buf[0]; > > >> > char * CurrentBufferPointer = &buf[0]; > > >> > char * BufferEnd = &buf[-1]; // "address of the last byte in the > buffer" > > >> > // The OHCI Host Controller than advances CurrentBufferPointer like > this: CurrentBufferPointer += 0 > > >> > // After the transfer: > > >> > // CurrentBufferPointer = &buf[0]; > > >> > // BufferEnd = &buf[-1]; > > >> > > >> Right, but why do you think this is valid, rather than > > >> being a guest software bug? My reading of the spec is that it's > > >> pretty clear about how to say "zero length buffer", and this > > >> isn't it. > > >> > > >> Is there some real-world guest OS that programs the OHCI > > >> controller this way that we're trying to accommodate? > > > > > > > > > qemu versions 4.2 and before allowed this behavior. > > > > So? That might just mean we had a bug and we fixed it. > > 4.2 is a very old version of QEMU and nobody seems to have > > complained in the four years since we released 5.0 about this, > > which suggests that generally guest OS drivers don't try > > to send zero-length buffers in this way. > > > > > I don't think it's valid to ask for a *popular* guest OS as a > proof-of-concept because I'm not an expert on those. > > > > I didn't ask for "popular"; I asked for "real-world". > > What is the actual guest code you're running that falls over > > because of the behaviour change? > > > > More generally, why do you want this behaviour to be > > changed? 
Reasonable reasons might include: > > * we're out of spec based on reading the documentation > > * you're trying to run some old Windows VM/QNX/etc image, > > and it doesn't work any more > > * all the real hardware we tested behaves this way > > > > But don't necessarily include: > > * something somebody wrote and only tested on QEMU happens to > > assume the old behaviour rather than following the hw spec > > > > QEMU occasionally works around guest OS bugs, but only as > > when we really have to. It's usually better to fix the > > bug in the guest. > > > > It's not, and I've already demonstrated that real hardware is consistent > with the fix in this patch. > > > > Please check your tone. > > I don't think that is a particularly helpful comment for someone who is > taking the time to review your patches. Reading through the thread I > didn't see anything that said this is how real HW behaves but I may well > have missed it. However you have a number of review comments to address > so I suggest you spin a v2 of the series to address them and outline the > reason to accept an out of spec transaction. > > I did a rework of the patch -- see my email from May 20, quoted below -- and > I was under the impression it addressed all the > review comments. Did I miss something? I apologize if I did. Ahh I see - I'd only seen this thread continue so wasn't aware a new version had been posted. For future patches consider using -vN when sending them so we can clearly see a new revision is available. 
> >> index acd6016980..71b54914d3 100644 >> --- a/hw/usb/hcd-ohci.c >> +++ b/hw/usb/hcd-ohci.c >> @@ -941,8 +941,8 @@ static int ohci_service_td(OHCIState *ohci, struct >> ohci_ed *ed) >> if ((td.cbp & 0xfffff000) != (td.be & 0xfffff000)) { >> len = (td.be & 0xfff) + 0x1001 - (td.cbp & 0xfff); >> } else { >> -if (td.cbp > td.be) { >> -trace_usb_ohci_iso_td_bad_cc_overrun(td.cbp, td.be); >> +if (td.cbp - 1 > td.be) { /* rely on td.cbp != 0 */ > >> Reading through the thread I didn't see anything that said this is how real >> HW behaves but I may well have missed it. > This is what I wrote regarding real HW: > > Results are: > > qemu 4.2 | qemu HEAD | actual HW > -----------+------------+----------- > works fine | ohci_die() | works fine > > Would additional verification of the actual HW be useful? > > Peter posted the following which is more specific than "qemu 4.2" -- I agree > this is most likely the qemu commit where this > thread is focused: > >> Almost certainly this was commit 1328fe0c32d54 ("hw: usb: hcd-ohci: >> check len and frame_number variables"), which added these bounds >> checks. Prior to that we did no bounds checking at all, which >> meant that we permitted cbp=be+1 to mean a zero length, but also >> that we permitted the gu
Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr
On Thu, May 30, 2024 at 01:12:40PM -0400, Steven Sistare wrote: > On 5/29/2024 3:25 PM, Peter Xu wrote: > > On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote: > > > On 5/28/2024 5:44 PM, Peter Xu wrote: > > > > On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote: > > > > > Preserve fields of RAMBlocks that allocate their host memory during > > > > > CPR so > > > > > the RAM allocation can be recovered. > > > > > > > > This sentence itself did not explain much, IMHO. QEMU can share memory > > > > using fd based memory already of all kinds, as long as the memory > > > > backend > > > > is path-based it can be shared by sharing the same paths to dst. > > > > > > > > This reads very confusing as a generic concept. I mean, QEMU migration > > > > relies on so many things to work right. We mostly asks the users to > > > > "use > > > > exactly the same cmdline for src/dst QEMU unless you know what you're > > > > doing", otherwise many things can break. That should also include > > > > ramblock > > > > being matched between src/dst due to the same cmdlines provided on both > > > > sides. It'll be confusing to mention this when we thought the ramblocks > > > > also rely on that fact. > > > > > > > > So IIUC this sentence should be dropped in the real patch, and I'll try > > > > to > > > > guess the real reason with below.. > > > > > > The properties of the implicitly created ramblocks must be preserved. > > > The defaults can and do change between qemu releases, even when the > > > command-line > > > parameters do not change for the explicit objects that cause these > > > implicit > > > ramblocks to be created. > > > > AFAIU, QEMU relies on ramblocks to be the same before this series. Do you > > have an example? Would that already cause issue when migrate? > > Alignment has changed, and used_length vs max_length changed when > resizeable ramblocks were introduced. 
I have dealt with these issues > while supporting cpr for our internal use, and the learned lesson is to > explicitly communicate the creation-time parameters to new qemu. Why used_length can change? I'm looking at ram_mig_ram_block_resized(): if (!migration_is_idle()) { /* * Precopy code on the source cannot deal with the size of RAM blocks * changing at random points in time - especially after sending the * RAM block sizes in the migration stream, they must no longer change. * Abort and indicate a proper reason. */ error_setg(&err, "RAM block '%s' resized during precopy.", rb->idstr); migration_cancel(err); error_free(err); } We sent used_length upfront of a migration during SETUP phase. Looks like what you're describing can be something different, though? Regarding to rb->align: isn't that mostly a constant, reflecting the MR's alignment? It's set when ramblock is created IIUC: rb->align = mr->align; When will the alignment change? > > These are not an issue for migration because the ramblock is re-created > and the data copied into the new memory. > > > > > > Mirror the mr->align field in the RAMBlock to simplify the vmstate. > > > > > Preserve the old host address, even though it is immediately > > > > > discarded, > > > > > as it will be needed in the future for CPR with iommufd. Preserve > > > > > guest_memfd, even though CPR does not yet support it, to maintain > > > > > vmstate > > > > > compatibility when it becomes supported. > > > > > > > > .. It could be about the vfio vaddr update feature that you mentioned > > > > and > > > > only for iommufd (as IIUC vfio still relies on iova ranges, then it > > > > won't > > > > help here)? > > > > > > > > If so, IMHO we should have this patch (or any variance form) to be there > > > > for your upcoming vfio support. Keeping this around like this will make > > > > the series harder to review. Or is it needed even before VFIO? > > > > > > This patch is needed independently of vfio or iommufd. 
> > > > > > guest_memfd is independent of vfio or iommufd. It is a recent addition > > > which I have not tried to support, but I added this placeholder field > > > so it can be supported in the future without adding a new field later > > > and maintaining backwards compatibility. > > > > Is guest_memfd the only user so far, then? If so, would it be possible we > > split it as a separate effort on top of the base cpr-exec support? > > I don't understand the question. I am indeed deferring support for > guest_memfd > to a future time. For now, I am adding a blocker, and reserving a field for > it in the preserved ramblock attributes, to avoid adding a subsection later. I meant I'm thinking whether the new ramblock vmsd may not be required for the initial implementation. E.g., IIUC vaddr is required by iommufd, and so far that's not part of the initial support. Then I think a major thing is about the fds to be managed that will need to be shared. If we put
Re: [PATCH V1 17/26] machine: memfd-alloc option
On Thu, May 30, 2024 at 01:11:09PM -0400, Steven Sistare wrote: > On 5/29/2024 3:14 PM, Peter Xu wrote: > > On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote: > > > > > diff --git a/system/memory.c b/system/memory.c > > > > > index 49f1cb2..ca04a0e 100644 > > > > > --- a/system/memory.c > > > > > +++ b/system/memory.c > > > > > @@ -1552,8 +1552,9 @@ bool > > > > > memory_region_init_ram_nomigrate(MemoryRegion *mr, > > > > > uint64_t size, > > > > > Error **errp) > > > > >{ > > > > > +uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0; > > > > > > > > If there's a machine option to "use memfd for allocations", then it's > > > > shared mem... Hmm.. > > > > > > > > It is a bit confusing to me in quite a few levels: > > > > > > > > - Why memory allocation method will be defined by a machine > > > > property, > > > > even if we have memory-backend-* which should cover everything? > > > > > > Some memory regions are implicitly created, and have no explicit > > > representation > > > on the qemu command line. memfd-alloc affects those. > > > > > > More generally, memfd-alloc affects all ramblock allocations that are > > > not explicitly represented by memory-backend object. Thus the simple > > > command line "qemu -m 1G" does not explicitly describe an object, so it > > > goes through the anonymous allocation path, and is affected by > > > memfd-alloc. > > > > Can we simply now allow "qemu -m 1G" to work for cpr-exec? > > I assume you meant "simply not allow". > > Yes, I could do that, but I would need to explicitly add code to exclude this > case, and add a blocker. Right now it "just works" for all paths that lead to > ram_block_alloc_host, without any special logic at the memory-backend level. > And, I'm not convinced that simplifies the docs, as now I would need to tell > the user that "-m 1G" and similar constructions do not work with cpr. > > I can try to clarify the doc for -memfd-alloc as currently defined. 
Why do we need to keep cpr working for existing qemu cmdlines? We'll already need to add more new cmdline options anyway, right? cpr-reboot wasn't doing it, and that made sense to me, so that new features will require the user to opt-in for it, starting with changing its cmdlines. > > > AFAIU that's > > what we do with cpr-reboot: we ask the user to specify the right things to > > make other thing work. Otherwise it won't. > > > > > > > > Internally, create_default_memdev does create a memory-backend object. > > > That is what my doc comment above refers to: > > >Any associated memory-backend objects are created with share=on > > > > > > An explicit "qemu -object memory-backend-*" is not affected by > > > memfd-alloc. > > > > > > The qapi comments in patch "migration: cpr-exec mode" attempt to say all > > > that: > > > > > > +# Memory backend objects must have the share=on attribute, and > > > +# must be mmap'able in the new QEMU process. For example, > > > +# memory-backend-file is acceptable, but memory-backend-ram is > > > +# not. > > > +# > > > +# The VM must be started with the '-machine memfd-alloc=on' > > > +# option. This causes implicit ram blocks -- those not explicitly > > > +# described by a memory-backend object -- to be allocated by > > > +# mmap'ing a memfd. Examples include VGA, ROM, and even guest > > > +# RAM when it is specified without a memory-backend object. > > > > VGA is IIRC 16MB chunk, ROM is even smaller. If the user specifies -object > > memory-backend-file,share=on properly, these should be the only outliers? > > > > Are these important enough for the downtime? Can we put them into the > > migrated image alongside with the rest device states? > > It's not about downtime. vfio, vdpa, and iommufd pin all guest pages.
> The pages must remain pinned during CPR to support ongoing DMA activity > which could target those pages (which we do not quiesce), and the same > physical pages must be used for the ramblocks in the new qemu process. Ah ok, yes DMA can happen on the fly. Guest mem is definitely the major DMA target and that can be covered by -object memory-backend-*,shared=on cmdlines. ROM is definitely not a DMA target. So is VGA ram a target for, perhaps, an assigned vGPU device? Do we have a list of things that will need that? Can we make them work somehow by sharing them like guest mem? It'll be a complete tragedy if we introduced this whole thing only because of some minority. I want to understand whether there's any generic way to solve this problem rather than this magical machine property. IMHO it's very not trivial to maintain. > > > > > - Even if we have such a machine property, why setting "memfd" will > > > > always imply shared? why not private? After all it's not called > > > > "memfd-shared-alloc", and we can create private ma
Re: [PATCH V1 07/26] migration: VMStateId
On Thu, May 30, 2024 at 01:11:26PM -0400, Steven Sistare wrote: > On 5/29/2024 2:53 PM, Peter Xu wrote: > > On Wed, May 29, 2024 at 01:30:18PM -0400, Steven Sistare wrote: > > > How about a more general name for the type: > > > > > > migration/misc.h > > > typedef char (MigrationId)[256]; > > > > How about qemu/typedefs.h? Not sure whether it's applicable. Markus (in > > the loop) may have a better idea. > > > > Meanwhile, s/MigrationID/IDString/? > > typedefs.h has a different purpose; giving short names to types > defined in internal include files. > > This id is specific to migration, so I still think its name should reflect > migration and it belongs in some include/migration/*.h file. > > ramblocks and migration are already closely related. There is nothing wrong > with including a migration header in ramblock.h so it can use a migration > type. > We already have: > include/hw/acpi/ich9_tco.h:#include "migration/vmstate.h" > include/hw/display/ramfb.h:#include "migration/vmstate.h" > include/hw/input/pl050.h:#include "migration/vmstate.h" > include/hw/pci/shpc.h:#include "migration/vmstate.h" > include/hw/virtio/virtio.h:#include "migration/vmstate.h" > include/hw/hyperv/vmbus.h:#include "migration/vmstate.h" > > The 256 byte magic length already appears in too many places, and my code > would add more occurrences, so I really think that abstracting this type > would be cleaner. I agree having a typedef is nicer, but I don't understand why it must be migration related. It can be the type QEMU uses to represent any string based ID, and that's a generic concept to me. Migration can have a wrapper to process that type, but then migration will include the generic header in that case, it feels more natural that way? -- Peter Xu
[PATCH v2 6/6] contrib/plugins: add ips plugin example for cost modeling
This plugin uses the new time control interface to make decisions about the state of time during the emulation. The algorithm is currently very simple. The user specifies an ips rate which applies per core. If the core runs ahead of its allocated execution time the plugin sleeps for a bit to let real time catch up. Either way time is updated for the emulation as a function of total executed instructions with some adjustments for cores that idle. Examples Slow down execution of /bin/true: $ num_insn=$(./build/qemu-x86_64 -plugin ./build/tests/plugin/libinsn.so -d plugin /bin/true |& grep total | sed -e 's/.*: //') $ time ./build/qemu-x86_64 -plugin ./build/contrib/plugins/libips.so,ips=$(($num_insn/4)) /bin/true real 4.000s Boot a Linux kernel simulating a 250MHz cpu: $ /build/qemu-system-x86_64 -kernel /boot/vmlinuz-6.1.0-21-amd64 -append "console=ttyS0" -plugin ./build/contrib/plugins/libips.so,ips=$((250*1000*1000)) -smp 1 -m 512 check time until kernel panic on serial0 Signed-off-by: Pierrick Bouvier --- contrib/plugins/ips.c| 239 +++ contrib/plugins/Makefile | 1 + 2 files changed, 240 insertions(+) create mode 100644 contrib/plugins/ips.c diff --git a/contrib/plugins/ips.c b/contrib/plugins/ips.c new file mode 100644 index 000..cf3159df391 --- /dev/null +++ b/contrib/plugins/ips.c @@ -0,0 +1,239 @@ +/* + * ips rate limiting plugin. + * + * This plugin can be used to restrict the execution of a system to a + * particular number of Instructions Per Second (ips). This controls + * time as seen by the guest so while wall-clock time may be longer + * from the guests point of view time will pass at the normal rate. + * + * This uses the new plugin API which allows the plugin to control + * system time. 
+ * + * Copyright (c) 2023 Linaro Ltd + * + * SPDX-License-Identifier: GPL-2.0-or-later + */ + +#include +#include +#include + +QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION; + +/* how many times do we update time per sec */ +#define NUM_TIME_UPDATE_PER_SEC 10 +#define NSEC_IN_ONE_SEC (1000 * 1000 * 1000) + +static GMutex global_state_lock; + +static uint64_t insn_per_second = 1000 * 1000; /* ips per core, per second */ +static uint64_t insn_quantum; /* trap every N instructions */ +static bool precise_execution; /* count every instruction */ +static int64_t start_time_ns; /* time (ns since epoch) first vCPU started */ +static int64_t virtual_time_ns; /* last set virtual time */ + +static const void *time_handle; + +typedef enum { +UNKNOWN = 0, +EXECUTING, +IDLE, +FINISHED +} vCPUState; + +typedef struct { +uint64_t counter; +uint64_t track_insn; +vCPUState state; +/* timestamp when vCPU entered state */ +int64_t last_state_time; +} vCPUTime; + +struct qemu_plugin_scoreboard *vcpus; + +/* return epoch time in ns */ +static int64_t now_ns(void) +{ +return g_get_real_time() * 1000; +} + +static uint64_t num_insn_during(int64_t elapsed_ns) +{ +double num_secs = elapsed_ns / (double) NSEC_IN_ONE_SEC; +return num_secs * (double) insn_per_second; +} + +static int64_t time_for_insn(uint64_t num_insn) +{ +double num_secs = (double) num_insn / (double) insn_per_second; +return num_secs * (double) NSEC_IN_ONE_SEC; +} + +static int64_t uptime_ns(void) +{ +int64_t now = now_ns(); +g_assert(now >= start_time_ns); +return now - start_time_ns; +} + +static void vcpu_set_state(vCPUTime *vcpu, vCPUState new_state) +{ +vcpu->last_state_time = now_ns(); +vcpu->state = new_state; +} + +static void update_system_time(vCPUTime *vcpu) +{ +/* flush remaining instructions */ +vcpu->counter += vcpu->track_insn; +vcpu->track_insn = 0; + +int64_t uptime = uptime_ns(); +uint64_t expected_insn = num_insn_during(uptime); + +if (vcpu->counter >= expected_insn) { +/* this vcpu 
ran faster than expected, so it has to sleep */ +uint64_t insn_advance = vcpu->counter - expected_insn; +uint64_t time_advance_ns = time_for_insn(insn_advance); +int64_t sleep_us = time_advance_ns / 1000; +g_usleep(sleep_us); +} + +/* based on number of instructions, what should be the new time? */ +int64_t new_virtual_time = time_for_insn(vcpu->counter); + +g_mutex_lock(&global_state_lock); + +/* Time only moves forward. Another vcpu might have updated it already. */ +if (new_virtual_time > virtual_time_ns) { +qemu_plugin_update_ns(time_handle, new_virtual_time); +virtual_time_ns = new_virtual_time; +} + +g_mutex_unlock(&global_state_lock); +} + +static void set_start_time() +{ +g_mutex_lock(&global_state_lock); +if (!start_time_ns) { +start_time_ns = now_ns(); +} +g_mutex_unlock(&global_state_lock); +} + +static void vcpu_init(qemu_plugin_id_t id, unsigned int cpu_index) +{ +vCPUTime *vcpu = qemu_plugin_scoreboard_find(vcpus, cpu_index); +/* ensure start time
[PATCH v2 5/6] plugins: add time control API
From: Alex Bennée Expose the ability to control time through the plugin API. Only one plugin can control time so it has to request control when loaded. There are probably more corner cases to catch here. From: Alex Bennée Signed-off-by: Alex Bennée Signed-off-by: Pierrick Bouvier --- include/qemu/qemu-plugin.h | 23 +++ plugins/api.c| 31 +++ plugins/qemu-plugins.symbols | 2 ++ 3 files changed, 56 insertions(+) diff --git a/include/qemu/qemu-plugin.h b/include/qemu/qemu-plugin.h index 95703d8fec1..80b1637cede 100644 --- a/include/qemu/qemu-plugin.h +++ b/include/qemu/qemu-plugin.h @@ -661,6 +661,29 @@ void qemu_plugin_register_vcpu_mem_inline_per_vcpu( qemu_plugin_u64 entry, uint64_t imm); +/** + * qemu_plugin_request_time_control() - request the ability to control time + * + * This grants the plugin the ability to control system time. Only one + * plugin can control time so if multiple plugins request the ability + * all but the first will fail. + * + * Returns an opaque handle or NULL if fails + */ +const void *qemu_plugin_request_time_control(void); + +/** + * qemu_plugin_update_ns() - update system emulation time + * @handle: opaque handle returned by qemu_plugin_request_time_control() + * @time: time in nanoseconds + * + * This allows an appropriately authorised plugin (i.e. holding the + * time control handle) to move system time forward to @time. + * + * Start time is 0. 
+ */ +void qemu_plugin_update_ns(const void *handle, int64_t time); + typedef void (*qemu_plugin_vcpu_syscall_cb_t)(qemu_plugin_id_t id, unsigned int vcpu_index, int64_t num, uint64_t a1, uint64_t a2, diff --git a/plugins/api.c b/plugins/api.c index 5a0a7f8c712..26822b69ea2 100644 --- a/plugins/api.c +++ b/plugins/api.c @@ -39,6 +39,7 @@ #include "qemu/main-loop.h" #include "qemu/plugin.h" #include "qemu/log.h" +#include "qemu/timer.h" #include "tcg/tcg.h" #include "exec/exec-all.h" #include "exec/gdbstub.h" @@ -583,3 +584,33 @@ uint64_t qemu_plugin_u64_sum(qemu_plugin_u64 entry) } return total; } + +/* + * Time control + */ +static bool has_control; + +const void *qemu_plugin_request_time_control(void) +{ +if (!has_control) { +has_control = true; +return &has_control; +} +return NULL; +} + +static void advance_virtual_time__async(CPUState *cpu, run_on_cpu_data data) +{ +int64_t new_time = data.host_ulong; +qemu_clock_advance_virtual_time(new_time); +} + +void qemu_plugin_update_ns(const void *handle, int64_t new_time) +{ +if (handle == &has_control) { +/* Need to execute out of cpu_exec, so bql can be locked. */ +async_run_on_cpu(current_cpu, + advance_virtual_time__async, + RUN_ON_CPU_HOST_ULONG(new_time)); +} +} diff --git a/plugins/qemu-plugins.symbols b/plugins/qemu-plugins.symbols index aa0a77a319f..ca773d8d9fe 100644 --- a/plugins/qemu-plugins.symbols +++ b/plugins/qemu-plugins.symbols @@ -38,6 +38,7 @@ qemu_plugin_register_vcpu_tb_exec_cond_cb; qemu_plugin_register_vcpu_tb_exec_inline_per_vcpu; qemu_plugin_register_vcpu_tb_trans_cb; + qemu_plugin_request_time_control; qemu_plugin_reset; qemu_plugin_scoreboard_free; qemu_plugin_scoreboard_find; @@ -51,5 +52,6 @@ qemu_plugin_u64_set; qemu_plugin_u64_sum; qemu_plugin_uninstall; + qemu_plugin_update_ns; qemu_plugin_vcpu_for_each; }; -- 2.39.2
[PATCH v2 2/6] qtest: use cpu interface in qtest_clock_warp
From: Alex Bennée This generalises the qtest_clock_warp code to use the AccelOps handlers for updating its own sense of time. This will make the next patch which moves the warp code closer to pure code motion. From: Alex Bennée Signed-off-by: Alex Bennée Acked-by: Thomas Huth Signed-off-by: Pierrick Bouvier --- include/sysemu/qtest.h | 1 + accel/qtest/qtest.c| 1 + system/qtest.c | 6 +++--- 3 files changed, 5 insertions(+), 3 deletions(-) diff --git a/include/sysemu/qtest.h b/include/sysemu/qtest.h index b5d5fd34637..45f3b7e1df5 100644 --- a/include/sysemu/qtest.h +++ b/include/sysemu/qtest.h @@ -36,6 +36,7 @@ void qtest_server_set_send_handler(void (*send)(void *, const char *), void qtest_server_inproc_recv(void *opaque, const char *buf); int64_t qtest_get_virtual_clock(void); +void qtest_set_virtual_clock(int64_t count); #endif #endif diff --git a/accel/qtest/qtest.c b/accel/qtest/qtest.c index f6056ac8361..53182e6c2ae 100644 --- a/accel/qtest/qtest.c +++ b/accel/qtest/qtest.c @@ -52,6 +52,7 @@ static void qtest_accel_ops_class_init(ObjectClass *oc, void *data) ops->create_vcpu_thread = dummy_start_vcpu_thread; ops->get_virtual_clock = qtest_get_virtual_clock; +ops->set_virtual_clock = qtest_set_virtual_clock; }; static const TypeInfo qtest_accel_ops_type = { diff --git a/system/qtest.c b/system/qtest.c index 6da58b3874e..ee8b139e982 100644 --- a/system/qtest.c +++ b/system/qtest.c @@ -332,14 +332,14 @@ int64_t qtest_get_virtual_clock(void) return qatomic_read_i64(&qtest_clock_counter); } -static void qtest_set_virtual_clock(int64_t count) +void qtest_set_virtual_clock(int64_t count) { qatomic_set_i64(&qtest_clock_counter, count); } static void qtest_clock_warp(int64_t dest) { -int64_t clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL); +int64_t clock = cpus_get_virtual_clock(); AioContext *aio_context; assert(qtest_enabled()); aio_context = qemu_get_aio_context(); @@ -348,7 +348,7 @@ static void qtest_clock_warp(int64_t dest) QEMU_TIMER_ATTR_ALL); int64_t warp = 
qemu_soonest_timeout(dest - clock, deadline); -qtest_set_virtual_clock(qtest_get_virtual_clock() + warp); +cpus_set_virtual_clock(cpus_get_virtual_clock() + warp); qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL); timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]); -- 2.39.2
[PATCH v2 3/6] sysemu: generalise qtest_warp_clock as qemu_clock_advance_virtual_time
From: Alex Bennée Move the key functionality of moving time forward into the clock sub-system itself. This will allow us to plumb in time control into plugins. From: Alex Bennée Signed-off-by: Alex Bennée Signed-off-by: Pierrick Bouvier --- include/qemu/timer.h | 15 +++ system/qtest.c | 25 +++-- util/qemu-timer.c| 26 ++ 3 files changed, 44 insertions(+), 22 deletions(-) diff --git a/include/qemu/timer.h b/include/qemu/timer.h index 9a366e551fb..910587d8293 100644 --- a/include/qemu/timer.h +++ b/include/qemu/timer.h @@ -245,6 +245,21 @@ bool qemu_clock_run_timers(QEMUClockType type); */ bool qemu_clock_run_all_timers(void); +/** + * qemu_clock_advance_virtual_time(): advance the virtual time tick + * @target: target time in nanoseconds + * + * This function is used where the control of the flow of time has + * been delegated to outside the clock subsystem (be it qtest, icount + * or some other external source). You can ask the clock system to + * return @early at the first expired timer. + * + * Time can only move forward, attempts to reverse time would lead to + * an error. + * + * Returns: new virtual time. 
+ */ +int64_t qemu_clock_advance_virtual_time(int64_t dest); /* * QEMUTimerList diff --git a/system/qtest.c b/system/qtest.c index ee8b139e982..e6f6b4e62d5 100644 --- a/system/qtest.c +++ b/system/qtest.c @@ -337,26 +337,6 @@ void qtest_set_virtual_clock(int64_t count) qatomic_set_i64(&qtest_clock_counter, count); } -static void qtest_clock_warp(int64_t dest) -{ -int64_t clock = cpus_get_virtual_clock(); -AioContext *aio_context; -assert(qtest_enabled()); -aio_context = qemu_get_aio_context(); -while (clock < dest) { -int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL, - QEMU_TIMER_ATTR_ALL); -int64_t warp = qemu_soonest_timeout(dest - clock, deadline); - -cpus_set_virtual_clock(cpus_get_virtual_clock() + warp); - -qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL); -timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]); -clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL); -} -qemu_clock_notify(QEMU_CLOCK_VIRTUAL); -} - static bool (*process_command_cb)(CharBackend *chr, gchar **words); void qtest_set_command_cb(bool (*pc_cb)(CharBackend *chr, gchar **words)) @@ -755,7 +735,8 @@ static void qtest_process_command(CharBackend *chr, gchar **words) ns = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL, QEMU_TIMER_ATTR_ALL); } -qtest_clock_warp(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + ns); +qemu_clock_advance_virtual_time( +qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + ns); qtest_send_prefix(chr); qtest_sendf(chr, "OK %"PRIi64"\n", (int64_t)qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL)); @@ -781,7 +762,7 @@ static void qtest_process_command(CharBackend *chr, gchar **words) g_assert(words[1]); ret = qemu_strtoi64(words[1], NULL, 0, &ns); g_assert(ret == 0); -qtest_clock_warp(ns); +qemu_clock_advance_virtual_time(ns); qtest_send_prefix(chr); qtest_sendf(chr, "OK %"PRIi64"\n", (int64_t)qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL)); diff --git a/util/qemu-timer.c b/util/qemu-timer.c index 6a0de33dd2b..213114be68c 100644 --- a/util/qemu-timer.c +++ b/util/qemu-timer.c @@ -645,6 +645,11 @@ 
int64_t qemu_clock_get_ns(QEMUClockType type) } } +static void qemu_virtual_clock_set_ns(int64_t time) +{ +return cpus_set_virtual_clock(time); +} + void init_clocks(QEMUTimerListNotifyCB *notify_cb) { QEMUClockType type; @@ -675,3 +680,24 @@ bool qemu_clock_run_all_timers(void) return progress; } + +int64_t qemu_clock_advance_virtual_time(int64_t dest) +{ +int64_t clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL); +AioContext *aio_context; +aio_context = qemu_get_aio_context(); +while (clock < dest) { +int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL, + QEMU_TIMER_ATTR_ALL); +int64_t warp = qemu_soonest_timeout(dest - clock, deadline); + +qemu_virtual_clock_set_ns(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + warp); + +qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL); +timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]); +clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL); +} +qemu_clock_notify(QEMU_CLOCK_VIRTUAL); + +return clock; +} -- 2.39.2
[PATCH v2 4/6] qtest: move qtest_{get, set}_virtual_clock to accel/qtest/qtest.c
Signed-off-by: Pierrick Bouvier --- include/sysemu/qtest.h | 3 --- accel/qtest/qtest.c| 12 system/qtest.c | 12 3 files changed, 12 insertions(+), 15 deletions(-) diff --git a/include/sysemu/qtest.h b/include/sysemu/qtest.h index 45f3b7e1df5..c161d751650 100644 --- a/include/sysemu/qtest.h +++ b/include/sysemu/qtest.h @@ -34,9 +34,6 @@ void qtest_server_init(const char *qtest_chrdev, const char *qtest_log, Error ** void qtest_server_set_send_handler(void (*send)(void *, const char *), void *opaque); void qtest_server_inproc_recv(void *opaque, const char *buf); - -int64_t qtest_get_virtual_clock(void); -void qtest_set_virtual_clock(int64_t count); #endif #endif diff --git a/accel/qtest/qtest.c b/accel/qtest/qtest.c index 53182e6c2ae..bf14032d294 100644 --- a/accel/qtest/qtest.c +++ b/accel/qtest/qtest.c @@ -24,6 +24,18 @@ #include "qemu/main-loop.h" #include "hw/core/cpu.h" +static int64_t qtest_clock_counter; + +static int64_t qtest_get_virtual_clock(void) +{ +return qatomic_read_i64(&qtest_clock_counter); +} + +static void qtest_set_virtual_clock(int64_t count) +{ +qatomic_set_i64(&qtest_clock_counter, count); +} + static int qtest_init_accel(MachineState *ms) { return 0; diff --git a/system/qtest.c b/system/qtest.c index e6f6b4e62d5..ba210780ec0 100644 --- a/system/qtest.c +++ b/system/qtest.c @@ -325,18 +325,6 @@ static void qtest_irq_handler(void *opaque, int n, int level) } } -static int64_t qtest_clock_counter; - -int64_t qtest_get_virtual_clock(void) -{ -return qatomic_read_i64(&qtest_clock_counter); -} - -void qtest_set_virtual_clock(int64_t count) -{ -qatomic_set_i64(&qtest_clock_counter, count); -} - static bool (*process_command_cb)(CharBackend *chr, gchar **words); void qtest_set_command_cb(bool (*pc_cb)(CharBackend *chr, gchar **words)) -- 2.39.2
[PATCH v2 1/6] sysemu: add set_virtual_time to accel ops
From: Alex Bennée We are about to remove direct calls to individual accelerators for this information and will need a central point for plugins to hook into time changes. From: Alex Bennée Signed-off-by: Alex Bennée Reviewed-by: Philippe Mathieu-Daudé Signed-off-by: Pierrick Bouvier --- include/sysemu/accel-ops.h | 18 +- include/sysemu/cpu-timers.h| 3 ++- ...et-virtual-clock.c => cpus-virtual-clock.c} | 5 + system/cpus.c | 11 +++ stubs/meson.build | 2 +- 5 files changed, 36 insertions(+), 3 deletions(-) rename stubs/{cpus-get-virtual-clock.c => cpus-virtual-clock.c} (68%) diff --git a/include/sysemu/accel-ops.h b/include/sysemu/accel-ops.h index ef91fc28bbd..a0886722305 100644 --- a/include/sysemu/accel-ops.h +++ b/include/sysemu/accel-ops.h @@ -20,7 +20,12 @@ typedef struct AccelOpsClass AccelOpsClass; DECLARE_CLASS_CHECKERS(AccelOpsClass, ACCEL_OPS, TYPE_ACCEL_OPS) -/* cpus.c operations interface */ +/** + * struct AccelOpsClass - accelerator interfaces + * + * This structure is used to abstract accelerator differences from the + * core CPU code. Not all have to be implemented. + */ struct AccelOpsClass { /*< private >*/ ObjectClass parent_class; @@ -44,7 +49,18 @@ struct AccelOpsClass { void (*handle_interrupt)(CPUState *cpu, int mask); +/** + * @get_virtual_clock: fetch virtual clock + * @set_virtual_clock: set virtual clock + * + * These allow the timer subsystem to defer to the accelerator to + * fetch time. The set function is needed if the accelerator wants + * to track the changes to time as the timer is warped through + * various timer events. 
+ */ int64_t (*get_virtual_clock)(void); +void (*set_virtual_clock)(int64_t time); + int64_t (*get_elapsed_ticks)(void); /* gdbstub hooks */ diff --git a/include/sysemu/cpu-timers.h b/include/sysemu/cpu-timers.h index d86738a378d..7bfa960fbd6 100644 --- a/include/sysemu/cpu-timers.h +++ b/include/sysemu/cpu-timers.h @@ -96,8 +96,9 @@ int64_t cpu_get_clock(void); void qemu_timer_notify_cb(void *opaque, QEMUClockType type); -/* get the VIRTUAL clock and VM elapsed ticks via the cpus accel interface */ +/* get/set VIRTUAL clock and VM elapsed ticks via the cpus accel interface */ int64_t cpus_get_virtual_clock(void); +void cpus_set_virtual_clock(int64_t new_time); int64_t cpus_get_elapsed_ticks(void); #endif /* SYSEMU_CPU_TIMERS_H */ diff --git a/stubs/cpus-get-virtual-clock.c b/stubs/cpus-virtual-clock.c similarity index 68% rename from stubs/cpus-get-virtual-clock.c rename to stubs/cpus-virtual-clock.c index fd447d53f3c..af7c1a1d403 100644 --- a/stubs/cpus-get-virtual-clock.c +++ b/stubs/cpus-virtual-clock.c @@ -6,3 +6,8 @@ int64_t cpus_get_virtual_clock(void) { return cpu_get_clock(); } + +void cpus_set_virtual_clock(int64_t new_time) +{ +/* do nothing */ +} diff --git a/system/cpus.c b/system/cpus.c index f8fa78f33d4..d3640c95030 100644 --- a/system/cpus.c +++ b/system/cpus.c @@ -229,6 +229,17 @@ int64_t cpus_get_virtual_clock(void) return cpu_get_clock(); } +/* + * Signal the new virtual time to the accelerator. This is only needed + * by accelerators that need to track the changes as we warp time. + */ +void cpus_set_virtual_clock(int64_t new_time) +{ +if (cpus_accel && cpus_accel->set_virtual_clock) { +cpus_accel->set_virtual_clock(new_time); +} +} + /* * return the time elapsed in VM between vm_start and vm_stop. 
Unless * icount is active, cpus_get_elapsed_ticks() uses units of the host CPU cycle diff --git a/stubs/meson.build b/stubs/meson.build index 3b9d42023cb..a1deafde08c 100644 --- a/stubs/meson.build +++ b/stubs/meson.build @@ -28,7 +28,7 @@ endif if have_block or have_ga stub_ss.add(files('replay-tools.c')) # stubs for hooks in util/main-loop.c, util/async.c etc. - stub_ss.add(files('cpus-get-virtual-clock.c')) + stub_ss.add(files('cpus-virtual-clock.c')) stub_ss.add(files('icount.c')) stub_ss.add(files('graph-lock.c')) if linux_io_uring.found() -- 2.39.2
[PATCH v2 0/6] Implement icount=auto using TCG Plugins
The goal here is to be able to scale temporally the execution of
qemu-user/system, using a given number of instructions per second.

We define a virtual clock, that can be late or in advance compared to
real time. When we are in advance, we slow execution (by sleeping)
until catching up with real time.

Finally, we should be able to clean up icount=auto mode completely, and
keep icount usage for deterministic purposes only.

It is built upon new TCG Plugins inline ops (store + conditional
callbacks), now merged on master.

Example in user-mode:

- Retrieve number of instructions to execute /bin/true
$ ./build/qemu-x86_64 -plugin ./build/tests/plugin/libinsn.so -d plugin /bin/true
cpu 0 insns: 120546
total insns: 120546

- Slow execution to match 5 seconds
$ time ./build/qemu-x86_64 -plugin ./build/contrib/plugins/libips,ips=$((120546/5)) /bin/true
real    0m4.985s

v2
--
- Added missing personal Signed-off-by for commits from Alex
- Fix bad rebase in stubs/meson.build
- move qtest_{get,set}_virtual_clock to accel/qtest/qtest.c
- A race condition was identified for plugins init/idle/resume, but is
  not related to this series, and will be fixed in another one:
  https://lore.kernel.org/qemu-devel/20240529152219.825680-1-alex.ben...@linaro.org/

Alex Bennée (4):
  sysemu: add set_virtual_time to accel ops
  qtest: use cpu interface in qtest_clock_warp
  sysemu: generalise qtest_warp_clock as qemu_clock_advance_virtual_time
  plugins: add time control API

Pierrick Bouvier (2):
  qtest: move qtest_{get,set}_virtual_clock to accel/qtest/qtest.c
  contrib/plugins: add ips plugin example for cost modeling

 include/qemu/qemu-plugin.h                    |  23 ++
 include/qemu/timer.h                          |  15 ++
 include/sysemu/accel-ops.h                    |  18 +-
 include/sysemu/cpu-timers.h                   |   3 +-
 include/sysemu/qtest.h                        |   2 -
 accel/qtest/qtest.c                           |  13 +
 contrib/plugins/ips.c                         | 239 ++++++
 plugins/api.c                                 |  31 +++
 ...t-virtual-clock.c => cpus-virtual-clock.c} |   5 +
 system/cpus.c                                 |  11 +
 system/qtest.c                                |  37 +--
 util/qemu-timer.c                             |  26 ++
 contrib/plugins/Makefile                      |   1 +
 plugins/qemu-plugins.symbols                  |   2 +
 stubs/meson.build                             |   2 +-
 15 files changed, 389 insertions(+), 39 deletions(-)
 create mode 100644 contrib/plugins/ips.c
 rename stubs/{cpus-get-virtual-clock.c => cpus-virtual-clock.c} (68%)
--
2.39.2
Re: [PATCH RISU v2 06/13] risugen: Add sparc64 support
On 5/30/24 06:23, Peter Maydell wrote:
>> +sub open_asm($)
>> +{
>> +    my ($basename) = @_;
>> +    my $fname = $basename . ".s";
>> +    open(ASM, ">", $fname) or die "can't open $fname: $!";
>> +    select ASM;
>
> I think that using Perl select like this is liable to be rather
> confusing, because it has "action at a distance" effects on every
> other print statement. I would prefer it if we passed in the
> filehandle to the print statements explicitly. (We can use a global
> if handing the filehandle around to all the functions is annoying.)

I think I tried that and something didn't work exporting or importing
the variable. My perl fu is weak, so I probably made some trivial
error.

r~
Re: [PATCH 2/5] qtest: use cpu interface in qtest_clock_warp
On 5/29/24 23:32, Paolo Bonzini wrote: On Fri, May 17, 2024 at 12:21 AM Pierrick Bouvier wrote: From: Alex Bennée This generalises the qtest_clock_warp code to use the AccelOps handlers for updating its own sense of time. This will make the next patch which moves the warp code closer to pure code motion. From: Alex Bennée Signed-off-by: Alex Bennée Acked-by: Thomas Huth --- include/sysemu/qtest.h | 1 + accel/qtest/qtest.c| 1 + system/qtest.c | 6 +++--- 3 files changed, 5 insertions(+), 3 deletions(-) diff --git a/include/sysemu/qtest.h b/include/sysemu/qtest.h index b5d5fd34637..45f3b7e1df5 100644 --- a/include/sysemu/qtest.h +++ b/include/sysemu/qtest.h @@ -36,6 +36,7 @@ void qtest_server_set_send_handler(void (*send)(void *, const char *), void qtest_server_inproc_recv(void *opaque, const char *buf); int64_t qtest_get_virtual_clock(void); +void qtest_set_virtual_clock(int64_t count); You can move qtest_get_virtual_clock/qtest_set_virtual_clock to accel/qtest/qtest.c instead, and make them static. They are not used anymore in system/qtest.c, and it actually makes a lot more sense that they aren't. 
Changed for next revision, thanks Paolo #endif #endif diff --git a/accel/qtest/qtest.c b/accel/qtest/qtest.c index f6056ac8361..53182e6c2ae 100644 --- a/accel/qtest/qtest.c +++ b/accel/qtest/qtest.c @@ -52,6 +52,7 @@ static void qtest_accel_ops_class_init(ObjectClass *oc, void *data) ops->create_vcpu_thread = dummy_start_vcpu_thread; ops->get_virtual_clock = qtest_get_virtual_clock; +ops->set_virtual_clock = qtest_set_virtual_clock; }; static const TypeInfo qtest_accel_ops_type = { diff --git a/system/qtest.c b/system/qtest.c index 6da58b3874e..ee8b139e982 100644 --- a/system/qtest.c +++ b/system/qtest.c @@ -332,14 +332,14 @@ int64_t qtest_get_virtual_clock(void) return qatomic_read_i64(&qtest_clock_counter); } -static void qtest_set_virtual_clock(int64_t count) +void qtest_set_virtual_clock(int64_t count) { qatomic_set_i64(&qtest_clock_counter, count); } static void qtest_clock_warp(int64_t dest) { -int64_t clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL); +int64_t clock = cpus_get_virtual_clock(); AioContext *aio_context; assert(qtest_enabled()); aio_context = qemu_get_aio_context(); @@ -348,7 +348,7 @@ static void qtest_clock_warp(int64_t dest) QEMU_TIMER_ATTR_ALL); int64_t warp = qemu_soonest_timeout(dest - clock, deadline); -qtest_set_virtual_clock(qtest_get_virtual_clock() + warp); +cpus_set_virtual_clock(cpus_get_virtual_clock() + warp); qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL); timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]); -- 2.39.2
Re: [PATCH RISU v2 05/13] risugen: Be explicit about print destinations
On 5/30/24 05:51, Peter Maydell wrote: @@ -87,13 +87,13 @@ sub progress_update($) my $barlen = int($proglen * $done / $progmax); if ($barlen != $lastprog) { $lastprog = $barlen; -print "[" . "-" x $barlen . " " x ($proglen - $barlen) . "]\r"; +print STDOUT "[" . "-" x $barlen . " " x ($proglen - $barlen) . "]\r"; } } sub progress_end() { -print "[" . "-" x $proglen . "]\n"; +print STDOUT "[" . "-" x $proglen . "]\n"; $| = 0; } These are the progress-bar indicators -- shouldn't they go to STDERR, not STDOUT, if we're going to be careful about where we send output? Why? I would think that only errors would go do stderr... @@ -163,20 +163,20 @@ sub dump_insn_details($$) { # Dump the instruction details for one insn my ($insn, $rec) = @_; -print "insn $insn: "; +print STDOUT "insn $insn: "; my $insnwidth = $rec->{width}; my $fixedbits = $rec->{fixedbits}; my $fixedbitmask = $rec->{fixedbitmask}; my $constraint = $rec->{blocks}{"constraints"}; -print sprintf(" insnwidth %d fixedbits %08x mask %08x ", $insnwidth, $fixedbits, $fixedbitmask); +print STDOUT sprintf(" insnwidth %d fixedbits %08x mask %08x ", $insnwidth, $fixedbits, $fixedbitmask); if (defined $constraint) { -print "constraint $constraint "; +print STDOUT "constraint $constraint "; } for my $tuple (@{ $rec->{fields} }) { my ($var, $pos, $mask) = @$tuple; -print "($var, $pos, " . sprintf("%08x", $mask) . ") "; +print STDOUT "($var, $pos, " . sprintf("%08x", $mask) . ") "; } -print "\n"; +print STDOUT "\n"; } As a debug-print helper routine maybe this should be STDERR too? Likewise? r~
Re: [RFC PATCH] cpus: split qemu_init_vcpu and delay vCPU thread creation
On 5/29/24 08:22, Alex Bennée wrote: This ensures we don't start the thread until cpu_common_realizefn has finished. This ensures that plugins will always run qemu_plugin_vcpu_init__async first before any other states. It doesn't totally eliminate the race that plugin_cpu_update__locked has to work around though. I found this while reviewing the ips plugin which makes heavy use of the vcpu phase callbacks. An alternative might be to move the explicit creation of vCPU threads to qdev_machine_creation_done()? It doesn't affect user-mode which already has a thread to execute in and ensures the QOM object has completed creation in cpu_create() before continuing. Signed-off-by: Alex Bennée Cc: Pierrick Bouvier Cc: Philippe Mathieu-Daudé --- include/hw/core/cpu.h | 8 accel/tcg/user-exec-stub.c | 5 + hw/core/cpu-common.c | 7 ++- plugins/core.c | 5 + system/cpus.c | 15 ++- 5 files changed, 34 insertions(+), 6 deletions(-) diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h index bb398e8237..6920699585 100644 --- a/include/hw/core/cpu.h +++ b/include/hw/core/cpu.h @@ -1041,6 +1041,14 @@ void end_exclusive(void); */ void qemu_init_vcpu(CPUState *cpu); +/** + * qemu_start_vcpu: + * @cpu: The vCPU to start. + * + * Create the vCPU thread and start it running. + */ +void qemu_start_vcpu(CPUState *cpu); + #define SSTEP_ENABLE 0x1 /* Enable simulated HW single stepping */ #define SSTEP_NOIRQ 0x2 /* Do not use IRQ while single stepping */ #define SSTEP_NOTIMER 0x4 /* Do not Timers while single stepping */ diff --git a/accel/tcg/user-exec-stub.c b/accel/tcg/user-exec-stub.c index 4fbe2dbdc8..162bb72bbe 100644 --- a/accel/tcg/user-exec-stub.c +++ b/accel/tcg/user-exec-stub.c @@ -18,6 +18,11 @@ void cpu_exec_reset_hold(CPUState *cpu) { } +void qemu_start_vcpu(CPUState *cpu) +{ +/* NOP for user-mode, we already have a thread */ +} + /* User mode emulation does not support record/replay yet. 
*/ bool replay_exception(void) diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c index 0f0a247f56..68895ddd59 100644 --- a/hw/core/cpu-common.c +++ b/hw/core/cpu-common.c @@ -230,7 +230,12 @@ static void cpu_common_realizefn(DeviceState *dev, Error **errp) } #endif -/* NOTE: latest generic point where the cpu is fully realized */ +/* + * With everything set up we can finally start the vCPU thread. + * This is a NOP for linux-user. + * NOTE: latest generic point where the cpu is fully realized + */ +qemu_start_vcpu(cpu); } static void cpu_common_unrealizefn(DeviceState *dev) diff --git a/plugins/core.c b/plugins/core.c index 0726bc7f25..1e5da7853b 100644 --- a/plugins/core.c +++ b/plugins/core.c @@ -65,6 +65,11 @@ static void plugin_cpu_update__locked(gpointer k, gpointer v, gpointer udata) CPUState *cpu = container_of(k, CPUState, cpu_index); run_on_cpu_data mask = RUN_ON_CPU_HOST_ULONG(*plugin.mask); +/* + * There is a race condition between the starting of the vCPU + * thread at the end of cpu_common_realizefn and when realized is + * finally set. + */ Could we simply have an active wait here? while (!DEVICE(cpu)->realized) {} We have a guarantee it will be realized shortly, and if it's too hard to have a proper synchronization mechanism (introduce a realize_cond?), then waiting for the proper state does not seem too bad. It's a bit strange for me to document an existing race condition, instead of finding a solution. 
if (DEVICE(cpu)->realized) { async_run_on_cpu(cpu, plugin_cpu_update__async, mask); } else { diff --git a/system/cpus.c b/system/cpus.c index d3640c9503..7dd8464c5e 100644 --- a/system/cpus.c +++ b/system/cpus.c @@ -488,11 +488,13 @@ void cpus_kick_thread(CPUState *cpu) void qemu_cpu_kick(CPUState *cpu) { -qemu_cond_broadcast(cpu->halt_cond); -if (cpus_accel->kick_vcpu_thread) { -cpus_accel->kick_vcpu_thread(cpu); -} else { /* default */ -cpus_kick_thread(cpu); +if (cpu->halt_cond) { +qemu_cond_broadcast(cpu->halt_cond); +if (cpus_accel->kick_vcpu_thread) { +cpus_accel->kick_vcpu_thread(cpu); +} else { /* default */ +cpus_kick_thread(cpu); +} } } @@ -674,7 +676,10 @@ void qemu_init_vcpu(CPUState *cpu) cpu->num_ases = 1; cpu_address_space_init(cpu, 0, "cpu-memory", cpu->memory); } +} +void qemu_start_vcpu(CPUState *cpu) +{ /* accelerators all implement the AccelOpsClass */ g_assert(cpus_accel != NULL && cpus_accel->create_vcpu_thread != NULL); cpus_accel->create_vcpu_thread(cpu);
Re: [PATCH 1/5] sysemu: add set_virtual_time to accel ops
On 5/29/24 23:30, Paolo Bonzini wrote: On Fri, May 17, 2024 at 12:21 AM Pierrick Bouvier wrote: diff --git a/stubs/meson.build b/stubs/meson.build index 3b9d42023cb..672213b7482 100644 --- a/stubs/meson.build +++ b/stubs/meson.build @@ -3,6 +3,11 @@ # below, so that it is clear who needs the stubbed functionality. stub_ss.add(files('cpu-get-clock.c')) +stub_ss.add(files('cpus-virtual-clock.c')) +stub_ss.add(files('qemu-timer-notify-cb.c')) +stub_ss.add(files('icount.c')) +stub_ss.add(files('dump.c')) +stub_ss.add(files('error-printf.c')) Was there a problem when rebasing? Paolo Hi Paolo, Yes, Philippe made the same comment. I'll fix it in next revision. Thanks! stub_ss.add(files('fdset.c')) stub_ss.add(files('iothread-lock.c')) stub_ss.add(files('is-daemonized.c')) @@ -28,7 +33,6 @@ endif if have_block or have_ga stub_ss.add(files('replay-tools.c')) # stubs for hooks in util/main-loop.c, util/async.c etc. - stub_ss.add(files('cpus-get-virtual-clock.c')) stub_ss.add(files('icount.c')) stub_ss.add(files('graph-lock.c')) if linux_io_uring.found() -- 2.39.2
Re: [PATCH V1 00/26] Live update: cpr-exec
On 5/28/2024 12:42 PM, Peter Xu wrote: On Tue, May 28, 2024 at 11:10:27AM -0400, Steven Sistare wrote: On 5/27/2024 1:45 PM, Peter Xu wrote: On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote: I understand, thanks. If I can help with any of your todo list, just ask - steve Thanks for offering the help, Steve. Started looking at this today, then I found that I miss something high-level. Let me ask here, and let me apologize already for starting to throw multiple questions.. IIUC the whole idea of this patchset is to allow efficient QEMU upgrade, in this case not host kernel but QEMU-only, and/or upper. Is there any justification on why the complexity is needed here? It looks to me this one is more involved than cpr-reboot, so I'm thinking how much we can get from the complexity, and whether it's worthwhile. 1000+ LOC is the min support, and if we even expect more to come, that's really important, IMHO. For example, what's the major motivation of this whole work? Is that more on performance, or is it more for supporting the special devices like VFIO which we used to not support, or something else? I can't find them in whatever cover letter I can find, including this one. Firstly, regarding performance, IMHO it'll be always nice to share even some very fundamental downtime measurement comparisons using the new exec mode v.s. the old migration ways to upgrade QEMU binary. Do you perhaps have some number on hand when you started working on this feature years ago? Or maybe some old links on the list would help too, as I didn't follow this work since the start. On VFIO, IIUC you started out this project without VFIO migration being there. Now we have VFIO migration so not sure how much it would work for the upgrade use case. Even with current VFIO migration, we may not want to migrate device states for a local upgrade I suppose, as that can be a lot depending on the type of device assigned. 
However it'll be nice to discuss this too if this is the major purpose of the series. I think one other challenge on QEMU upgrade with VFIO devices is that the dest QEMU won't be able to open the VFIO device when the src QEMU is still using it as the owner. IIUC this is a similar condition where QEMU wants to have proper ownership transfer of a shared block device, and AFAIR right now we resolved that issue using some form of file lock on the image file. In this case it won't easily apply to a VFIO dev fd, but maybe we still have other approaches, not sure whether you investigated any. E.g. could the VFIO handle be passed over using unix scm rights? I think this might remove one dependency of using exec which can cause quite some difference v.s. a generic migration (from which regard, cpr-reboot is still a pretty generic migration). You also mentioned vhost/tap, is that also a major goal of this series in the follow up patchsets? Is this a problem only because this solution will do exec? Can it work if either the exec()ed qemu or dst qemu create the vhost/tap fds when boot? Meanwhile, could you elaborate a bit on the implication on chardevs? From what I read in the doc update it looks like a major part of work in the future, but I don't yet understand the issue.. Is it also relevant to the exec() approach? In all cases, some of such discussion would be really appreciated. And if you used to consider other approaches to solve this problem it'll be great to mention how you chose this way. Considering this work contains too many things, it'll be nice if such discussion can start with the fundamentals, e.g. on why exec() is a must. The main goal of cpr-exec is providing a fast and reliable way to update qemu. cpr-reboot is not fast enough or general enough. It requires the guest to support suspend and resume for all devices, and that takes seconds. If one actually reboots the host, that adds more seconds, depending on system services. 
cpr-exec takes 0.1 secs, and works every time, unlike like migration which can fail to converge on a busy system. Live migration also consumes more system and network resources. Right, but note that when I was thinking of a comparison between cpr-exec v.s. normal migration, I didn't mean a "normal live migration". I think it's more of the case whether exec() can be avoided. I had a feeling that this exec() will cause a major part of work elsewhere but maybe I am wrong as I didn't see the whole branch. The only parts of this work that are specific to exec are these patches and the qemu_clear_cloexec() calls in cpr.c. vl: helper to request re-exec migration: precreate vmstate for exec The rest would be the same if some other mechanism were used to start new qemu. Additional code would be needed for the new mechanism, such as SCM_RIGHTS sends. AFAIU, "cpr-exec takes 0.1 secs" is a conditional result. I think it at least should be relevant to what devices are attached to the VM, right? E.g., I observed at least two things that can dras
Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr
On 5/29/2024 3:25 PM, Peter Xu wrote: On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote: On 5/28/2024 5:44 PM, Peter Xu wrote: On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote: Preserve fields of RAMBlocks that allocate their host memory during CPR so the RAM allocation can be recovered. This sentence itself did not explain much, IMHO. QEMU can share memory using fd based memory already of all kinds, as long as the memory backend is path-based it can be shared by sharing the same paths to dst. This reads very confusing as a generic concept. I mean, QEMU migration relies on so many things to work right. We mostly asks the users to "use exactly the same cmdline for src/dst QEMU unless you know what you're doing", otherwise many things can break. That should also include ramblock being matched between src/dst due to the same cmdlines provided on both sides. It'll be confusing to mention this when we thought the ramblocks also rely on that fact. So IIUC this sentence should be dropped in the real patch, and I'll try to guess the real reason with below.. The properties of the implicitly created ramblocks must be preserved. The defaults can and do change between qemu releases, even when the command-line parameters do not change for the explicit objects that cause these implicit ramblocks to be created. AFAIU, QEMU relies on ramblocks to be the same before this series. Do you have an example? Would that already cause issue when migrate? Alignment has changed, and used_length vs max_length changed when resizeable ramblocks were introduced. I have dealt with these issues while supporting cpr for our internal use, and the learned lesson is to explicitly communicate the creation-time parameters to new qemu. These are not an issue for migration because the ramblock is re-created and the data copied into the new memory. Mirror the mr->align field in the RAMBlock to simplify the vmstate. 
Preserve the old host address, even though it is immediately discarded, as it will be needed in the future for CPR with iommufd. Preserve guest_memfd, even though CPR does not yet support it, to maintain vmstate compatibility when it becomes supported. .. It could be about the vfio vaddr update feature that you mentioned and only for iommufd (as IIUC vfio still relies on iova ranges, then it won't help here)? If so, IMHO we should have this patch (or any variant form) to be there for your upcoming vfio support. Keeping this around like this will make the series harder to review. Or is it needed even before VFIO? This patch is needed independently of vfio or iommufd. guest_memfd is independent of vfio or iommufd. It is a recent addition which I have not tried to support, but I added this placeholder field so it can be supported in the future without adding a new field later and maintaining backwards compatibility. Is guest_memfd the only user so far, then? If so, would it be possible we split it as a separate effort on top of the base cpr-exec support? I don't understand the question. I am indeed deferring support for guest_memfd to a future time. For now, I am adding a blocker, and reserving a field for it in the preserved ramblock attributes, to avoid adding a subsection later. - Steve
Re: [PATCH V1 07/26] migration: VMStateId
On 5/29/2024 2:53 PM, Peter Xu wrote: On Wed, May 29, 2024 at 01:30:18PM -0400, Steven Sistare wrote: How about a more general name for the type: migration/misc.h typedef char (MigrationId)[256]; How about qemu/typedefs.h? Not sure whether it's applicable. Markus (in the loop) may have a better idea. Meanwhile, s/MigrationID/IDString/? typedefs.h has a different purpose; giving short names to types defined in internal include files. This id is specific to migration, so I still think its name should reflect migration and it belongs in some include/migration/*.h file. ramblocks and migration are already closely related. There is nothing wrong with including a migration header in ramblock.h so it can use a migration type. We already have: include/hw/acpi/ich9_tco.h:#include "migration/vmstate.h" include/hw/display/ramfb.h:#include "migration/vmstate.h" include/hw/input/pl050.h:#include "migration/vmstate.h" include/hw/pci/shpc.h:#include "migration/vmstate.h" include/hw/virtio/virtio.h:#include "migration/vmstate.h" include/hw/hyperv/vmbus.h:#include "migration/vmstate.h" The 256 byte magic length already appears in too many places, and my code would add more occurrences, so I really think that abstracting this type would be cleaner. - Steve exec/ramblock.h struct RAMBlock { MigrationId idstr; migration/savevm.c typedef struct CompatEntry { MigrationId idstr; typedef struct SaveStateEntry { MigrationId idstr; - Steve
Re: [PATCH V1 17/26] machine: memfd-alloc option
On 5/29/2024 3:14 PM, Peter Xu wrote: On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote: diff --git a/system/memory.c b/system/memory.c index 49f1cb2..ca04a0e 100644 --- a/system/memory.c +++ b/system/memory.c @@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr, uint64_t size, Error **errp) { +uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0; If there's a machine option to "use memfd for allocations", then it's shared mem... Hmm.. It is a bit confusing to me in quite a few levels: - Why memory allocation method will be defined by a machine property, even if we have memory-backend-* which should cover everything? Some memory regions are implicitly created, and have no explicit representation on the qemu command line. memfd-alloc affects those. More generally, memfd-alloc affects all ramblock allocations that are not explicitly represented by memory-backend object. Thus the simple command line "qemu -m 1G" does not explicitly describe an object, so it goes through the anonymous allocation path, and is affected by memfd-alloc. Can we simply now allow "qemu -m 1G" to work for cpr-exec? I assume you meant "simply not allow". Yes, I could do that, but I would need to explicitly add code to exclude this case, and add a blocker. Right now it "just works" for all paths that lead to ram_block_alloc_host, without any special logic at the memory-backend level. And, I'm not convinced that simplifies the docs, as now I would need to tell the user that "-m 1G" and similar constructions do not work with cpr. I can try to clarify the doc for -memfd-alloc as currently defined. AFAIU that's what we do with cpr-reboot: we ask the user to specify the right things to make other thing work. Otherwise it won't. Internally, create_default_memdev does create a memory-backend object. 
That is what my doc comment above refers to: Any associated memory-backend objects are created with share=on An explicit "qemu -object memory-backend-*" is not affected by memfd-alloc. The qapi comments in patch "migration: cpr-exec mode" attempt to say all that: +# Memory backend objects must have the share=on attribute, and +# must be mmap'able in the new QEMU process. For example, +# memory-backend-file is acceptable, but memory-backend-ram is +# not. +# +# The VM must be started with the '-machine memfd-alloc=on' +# option. This causes implicit ram blocks -- those not explicitly +# described by a memory-backend object -- to be allocated by +# mmap'ing a memfd. Examples include VGA, ROM, and even guest +# RAM when it is specified without a memory-backend object. VGA is IIRC a 16MB chunk, ROM is even smaller. If the user specifies -object memory-backend-file,share=on properly, these should be the only outliers? Are these important enough for the downtime? Can we put them into the migrated image alongside the rest of the device states? It's not about downtime. vfio, vdpa, and iommufd pin all guest pages. The pages must remain pinned during CPR to support ongoing DMA activity which could target those pages (which we do not quiesce), and the same physical pages must be used for the ramblocks in the new qemu process. - Even if we have such a machine property, why does setting "memfd" always imply shared? Why not private? After all, it's not called "memfd-shared-alloc", and we can create private mappings using e.g. memory-backend-memfd,share=off. There is no use case for memfd-alloc with share=off, so no point IMO in making the option more verbose. Unfortunately this fact doesn't make the property easier to understand. :-( > For cpr, the mapping with all its modifications must be visible to new qemu when qemu mmaps it. So this might be the important part - do you mean migrating VGA/ROM/... small ramblocks won't work (besides any performance concerns)? Could you elaborate?
Pinning. Cpr-reboot already introduced lots of tricky knobs to QEMU. We may need to restrict that specialty to minimal, making the interfacing as clear as possible, or (at least migration) maintainers will start to be soon scared and running away, if such proposal was not shot down. In short, I hope when we introduce new knobs for cpr, we shouldn't always keep cpr-* modes in mind, but consider whenever the user can use it without cpr-*. I'm not sure whether it'll be always possible, but we should try. I agree in principle. FWIW, I have tried to generalize the functionality needed by cpr so it can be used in other ways: per-mode blockers, per-mode notifiers, precreate vmstate, factory objects; to base it on migration internals with minimal change (vmstate); and to make minimal changes in the migration control paths. - Steve
Re: [PATCH V1 05/26] migration: precreate vmstate
On 5/29/2024 2:39 PM, Peter Xu wrote: On Tue, May 28, 2024 at 11:09:49AM -0400, Steven Sistare wrote: On 5/27/2024 2:16 PM, Peter Xu wrote: On Mon, Apr 29, 2024 at 08:55:14AM -0700, Steve Sistare wrote: Provide the VMStateDescription precreate field to mark objects that must be loaded on the incoming side before devices have been created, because they provide properties that will be needed at creation time. They will be saved to and loaded from their own QEMUFile, via qemu_savevm_precreate_save and qemu_savevm_precreate_load, but these functions are not yet called in this patch. Allow them to be called before or after normal migration is active, when current_migration and current_incoming are not valid. Signed-off-by: Steve Sistare --- include/migration/vmstate.h | 6 migration/savevm.c | 69 + migration/savevm.h | 3 ++ 3 files changed, 73 insertions(+), 5 deletions(-) diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h index 294d2d8..4691334 100644 --- a/include/migration/vmstate.h +++ b/include/migration/vmstate.h @@ -198,6 +198,12 @@ struct VMStateDescription { * a QEMU_VM_SECTION_START section. */ bool early_setup; + +/* + * Send/receive this object in the precreate migration stream. 
+ */ +bool precreate; + int version_id; int minimum_version_id; MigrationPriority priority; diff --git a/migration/savevm.c b/migration/savevm.c index 9789823..a30bcd9 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -239,6 +239,7 @@ static SaveState savevm_state = { #define SAVEVM_FOREACH(se, entry)\ QTAILQ_FOREACH(se, &savevm_state.handlers, entry)\ +if (!se->vmsd || !se->vmsd->precreate) #define SAVEVM_FOREACH_ALL(se, entry)\ QTAILQ_FOREACH(se, &savevm_state.handlers, entry) @@ -1006,13 +1007,19 @@ static void save_section_header(QEMUFile *f, SaveStateEntry *se, } } +static bool send_section_footer(SaveStateEntry *se) +{ +return (se->vmsd && se->vmsd->precreate) || + migrate_get_current()->send_section_footer; +} Does the precreate vmsd "require" the footer? Or should it also work without it? IMHO it's less optimal to bind features together without good reason. It is not required. However, IMO we should not treat send-section-footer as a fungible feature. It is strictly an improvement, as it was added to catch misformatted sections. It is only registered as a feature for backwards compatibility with qemu 2.3 and xen. For a brand new data stream such as precreate, where we are not constrained by backwards compatibility, we should unconditionally use the better protocol, and always send the footer. I see your point, but I still don't think we should mangle these things. It makes future justification harder on whether the section footer should be sent. Take the example of whatever new feature is added for migration, like mapped-ram: we might also want to enforce it by adding "return migrate_mapped_ram() || ..." but it means we keep growing this with no benefit. The worry about "what if this is turned off" isn't a real one: nobody will turn it off! We started to deprecate machines, and I had a feeling it will be enforced at some point by default. That's fine, I'll delete the send_section_footer() function. - Steve
Re: [PATCH v2 03/18] tests/qtest/migration: Add a precopy file test with fdset
On Thu, May 23, 2024 at 04:05:33PM -0300, Fabiano Rosas wrote: > Add a test for file migration using fdset. The passing of fds is more > complex than using a file path. This is also the scenario where it's > most important we ensure that the initial migration stream offset is > respected because the fdset interface is the one used by the > management layer when providing a non empty migration file. > > Note that fd passing is not available on Windows, so anything that > uses add-fd needs to exclude that platform. > > Signed-off-by: Fabiano Rosas Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH V2 0/3] improve -overcommit cpu-pm=on|off
On 5/30/2024 6:54 AM, Zhao Liu wrote: > Hi Zide, > > On Wed, May 29, 2024 at 10:31:21AM -0700, Chen, Zide wrote: >> Date: Wed, 29 May 2024 10:31:21 -0700 >> From: "Chen, Zide" >> Subject: Re: [PATCH V2 0/3] improve -overcommit cpu-pm=on|off >> >> >> >> On 5/29/2024 5:46 AM, Igor Mammedov wrote: >>> On Tue, 28 May 2024 11:16:59 -0700 >>> "Chen, Zide" wrote: >>> On 5/28/2024 2:23 AM, Igor Mammedov wrote: > On Fri, 24 May 2024 13:00:14 -0700 > Zide Chen wrote: > >> Currently, if running "-overcommit cpu-pm=on" on hosts that don't >> have MWAIT support, the MWAIT/MONITOR feature is advertised to the >> guest and executing MWAIT/MONITOR on the guest triggers #UD. > > this is missing proper description how do you trigger issue > with reproducer and detailed description why guest sees MWAIT > when it's not supported by host. If "overcommit cpu-pm=on" and "-cpu host" are present, as shown in the >>> it's better to provide full QEMU CLI and host/guest kernels used and what >>> hardware was used if it's relevant so others can reproduce problem. >> >> I reproduced this on an older Intel Icelake machine, a >> Sapphire Rapids and a Sierra Forest, but I believe this is an x86-generic >> issue, not specific to particular models.
>> >> For the CLI, I think the only command line options that matter are >> -overcommit cpu-pm=on: to set enable_cpu_pm >> -cpu host: so that cpu->max_features is set >> >> For QEMU version, as long as it's after this commit: 662175b91ff2 >> ("i386: reorder call to cpu_exec_realizefn") >> >> The guest fails to boot: >> >> [ 24.825568] smpboot: x86: Booting SMP configuration: >> [ 24.826377] node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 >> #13 #14 #15 #17 >> [ 24.985799] node #1, CPUs: #128 #129 #130 #131 #132 #133 #134 #135 >> #136 #137 #138 #139 #140 #141 #142 #143 #145 >> [ 25.136955] invalid opcode: 1 PREEMPT SMP NOPTI >> [ 25.137790] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0 #2 >> [ 25.137790] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS >> rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/04 >> [ 25.137790] RIP: 0010:mwait_idle+0x35/0x80 >> [ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15 >> 47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8 >> [ 25.137790] RSP: :91403e70 EFLAGS: 00010046 >> [ 25.137790] RAX: 9140a980 RBX: 9140a980 RCX: >> >> [ 25.137790] RDX: RSI: 97f1ade21b20 RDI: >> 0004 >> [ 25.137790] RBP: R08: 0005da4709cb R09: >> 0001 >> [ 25.137790] R10: 5da4 R11: 0009 R12: >> >> [ 25.137790] R13: 98573ff90fc0 R14: 9140a038 R15: >> 00093ff0 >> [ 25.137790] FS: () GS:97f1ade0() >> knlGS: >> [ 25.137790] CS: 0010 DS: ES: CR0: 80050033 >> [ 25.137790] CR2: 97d8aa801000 CR3: 0049e9430001 CR4: >> 00770ef0 >> [ 25.137790] DR0: DR1: DR2: >> >> [ 25.137790] DR3: DR6: 07f0 DR7: >> 0400 >> [ 25.137790] PKRU: 5554 >> [ 25.137790] Call Trace: >> [ 25.137790] >> [ 25.137790] ? die+0x37/0x90 >> [ 25.137790] ? do_trap+0xe3/0x110 >> [ 25.137790] ? mwait_idle+0x35/0x80 >> [ 25.137790] ? do_error_trap+0x6a/0x90 >> [ 25.137790] ? mwait_idle+0x35/0x80 >> [ 25.137790] ? exc_invalid_op+0x52/0x70 >> [ 25.137790] ? mwait_idle+0x35/0x80 >> [ 25.137790] ? asm_exc_invalid_op+0x1a/0x20 >> [ 25.137790] ? 
mwait_idle+0x35/0x80 >> [ 25.137790] default_idle_call+0x30/0x100 >> [ 25.137790] cpuidle_idle_call+0x12c/0x170 >> [ 25.137790] ? tsc_verify_tsc_adjust+0x73/0xd0 >> [ 25.137790] do_idle+0x7f/0xd0 >> [ 25.137790] cpu_startup_entry+0x29/0x30 >> [ 25.137790] rest_init+0xcc/0xd0 >> [ 25.137790] start_kernel+0x396/0x5d0 >> [ 25.137790] x86_64_start_reservations+0x18/0x30 >> [ 25.137790] x86_64_start_kernel+0xe7/0xf0 >> [ 25.137790] common_startup_64+0x13e/0x148 >> [ 25.137790] >> [ 25.137790] Modules linked in: >> [ 25.137790] --[ end trace ]-- >> [ 25.137790] invalid opcode: 2 PREEMPT SMP NOPTI >> [ 25.137790] RIP: 0010:mwait_idle+0x35/0x80 >> [ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15 >> 47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8 >> >>> following, CPUID_EXT_MONITOR is set after x86_cpu_filter_features(), so that it doesn't have a chance to check MWAIT against host features and will be advertised to the guest regardless of whether it's supported by the host or not. x86_cpu_realizefn() x86_cpu_filter_features() cpu_exec_realizefn() kvm_cpu_realizefn host_cpu_realizefn host_cpu_enable_cpu_pm env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR; If it's not supported by the host, executi
Re: [PATCH v2 02/18] tests/qtest/migration: Fix file migration offset check
On Thu, May 23, 2024 at 04:05:32PM -0300, Fabiano Rosas wrote: > When doing file migration, QEMU accepts an offset that should be > skipped when writing the migration stream to the file. The purpose of > the offset is to allow the management layer to put its own metadata at > the start of the file. > > We have tests for this in migration-test, but only testing that the > migration stream starts at the correct offset and not that it actually > leaves the data intact. Unsurprisingly, there's been a bug in that > area that the tests didn't catch. > > Fix the tests to write some data to the offset region and check that > it's actually there after the migration. > > While here, switch to using g_get_file_contents() which is more > portable than mmap(). > > Signed-off-by: Fabiano Rosas Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH v2 01/18] migration: Fix file migration with fdset
On Thu, May 23, 2024 at 04:05:31PM -0300, Fabiano Rosas wrote: > When the "file:" migration support was added we missed the special > case in the qemu_open_old implementation that allows for a particular > file name format to be used to refer to a set of file descriptors that > have been previously provided to QEMU via the add-fd QMP command. > > When using this fdset feature, we should not truncate the migration > file because being given an fd means that the management layer is in > control of the file and will likely already have some data written to > it. This is further indicated by the presence of the 'offset' > argument, which indicates the start of the region where QEMU is > allowed to write. > > Fix the issue by replacing the O_TRUNC flag on open by an ftruncate > call, which will take the offset into consideration. > > Fixes: 385f510df5 ("migration: file URI offset") > Suggested-by: Daniel P. Berrangé > Signed-off-by: Fabiano Rosas Reviewed-by: Peter Xu Right below the change, did we forget to free the ioc if qio_channel_io_seek() failed? > --- > migration/file.c | 11 +-- > 1 file changed, 9 insertions(+), 2 deletions(-) > > diff --git a/migration/file.c b/migration/file.c > index ab18ba505a..ba5b5c44ff 100644 > --- a/migration/file.c > +++ b/migration/file.c > @@ -84,12 +84,19 @@ void file_start_outgoing_migration(MigrationState *s, > > trace_migration_file_outgoing(filename); > > -fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY | O_TRUNC, > - 0600, errp); > +fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY, 0600, > errp); > if (!fioc) { > return; > } > > +if (ftruncate(fioc->fd, offset)) { > +error_setg_errno(errp, errno, > + "failed to truncate migration file to offset %" > PRIx64, > + offset); > +object_unref(OBJECT(fioc)); > +return; > +} > + > outgoing_args.fname = g_strdup(filename); > > ioc = QIO_CHANNEL(fioc); > -- > 2.35.3 > -- Peter Xu
Re: hw/usb/hcd-ohci: Fix #1510, #303: pid not IN or OUT
On Thu, May 30, 2024 at 3:33 AM Alex Bennée wrote: > Cord Amfmgm writes: > > > On Tue, May 28, 2024 at 11:32 AM Peter Maydell > wrote: > > > > On Tue, 28 May 2024 at 16:37, Cord Amfmgm wrote: > > > > > > On Tue, May 28, 2024 at 9:03 AM Peter Maydell < > peter.mayd...@linaro.org> wrote: > > >> > > >> On Mon, 20 May 2024 at 23:24, Cord Amfmgm > wrote: > > >> > On Mon, May 20, 2024 at 12:05 PM Peter Maydell < > peter.mayd...@linaro.org> wrote: > > > >> > And here's an example buffer of length 0 -- you probably already > know what I'm going to do here: > > >> > > > >> > char buf[0]; > > >> > char * CurrentBufferPointer = &buf[0]; > > >> > char * BufferEnd = &buf[-1]; // "address of the last byte in the > buffer" > > >> > // The OHCI Host Controller then advances CurrentBufferPointer > like this: CurrentBufferPointer += 0 > > >> > // After the transfer: > > >> > // CurrentBufferPointer = &buf[0]; > > >> > // BufferEnd = &buf[-1]; > > >> > > >> Right, but why do you think this is valid, rather than > > >> being a guest software bug? My reading of the spec is that it's > > >> pretty clear about how to say "zero length buffer", and this > > >> isn't it. > > >> > > >> Is there some real-world guest OS that programs the OHCI > > >> controller this way that we're trying to accommodate? > > > > > > > > > qemu versions 4.2 and before allowed this behavior. > > > > So? That might just mean we had a bug and we fixed it. > > 4.2 is a very old version of QEMU and nobody seems to have > > complained in the four years since we released 5.0 about this, > > which suggests that generally guest OS drivers don't try > > to send zero-length buffers in this way. > > > > > I don't think it's valid to ask for a *popular* guest OS as a > proof-of-concept because I'm not an expert on those. > > > > I didn't ask for "popular"; I asked for "real-world". > > What is the actual guest code you're running that falls over > > because of the behaviour change?
> > > > More generally, why do you want this behaviour to be > > changed? Reasonable reasons might include: > > * we're out of spec based on reading the documentation > > * you're trying to run some old Windows VM/QNX/etc image, > > and it doesn't work any more > > * all the real hardware we tested behaves this way > > > > But don't necessarily include: > > * something somebody wrote and only tested on QEMU happens to > > assume the old behaviour rather than following the hw spec > > > > QEMU occasionally works around guest OS bugs, but only as > > when we really have to. It's usually better to fix the > > bug in the guest. > > > > It's not, and I've already demonstrated that real hardware is consistent > with the fix in this patch. > > > > Please check your tone. > > I don't think that is a particularly helpful comment for someone who is > taking the time to review your patches. Reading through the thread I > didn't see anything that said this is how real HW behaves but I may well > have missed it. However you have a number of review comments to address > so I suggest you spin a v2 of the series to address them and outline the > reason to accept an out of spec transaction. > > I did a rework of the patch -- see my email from May 20, quoted below -- and I was under the impression it addressed all the review comments. Did I miss something? I apologize if I did. > index acd6016980..71b54914d3 100644 > --- a/hw/usb/hcd-ohci.c > +++ b/hw/usb/hcd-ohci.c > @@ -941,8 +941,8 @@ static int ohci_service_td(OHCIState *ohci, struct ohci_ed *ed) > if ((td.cbp & 0xf000) != (td.be & 0xf000)) { > len = (td.be & 0xfff) + 0x1001 - (td.cbp & 0xfff); > } else { > -if (td.cbp > td.be) { > -trace_usb_ohci_iso_td_bad_cc_overrun(td.cbp, td.be); > +if (td.cbp - 1 > td.be) { /* rely on td.cbp != 0 */ > Reading through the thread I didn't see anything that said this is how real HW behaves but I may well have missed it. 
This is what I wrote regarding real HW. Results are:

qemu 4.2   | qemu HEAD  | actual HW
-----------+------------+-----------
works fine | ohci_die() | works fine

Would additional verification of the actual HW be useful? Peter posted the following which is more specific than "qemu 4.2" -- I agree this is most likely the qemu commit where this thread is focused: > Almost certainly this was commit 1328fe0c32d54 ("hw: usb: hcd-ohci: > check len and frame_number variables"), which added these bounds > checks. Prior to that we did no bounds checking at all, which > meant that we permitted cbp=be+1 to mean a zero length, but also > that we permitted the guest to overrun host-side buffers by > specifying completely bogus cbp and be values. The timeframe is > more or less right (2020), at least. > > -- PMM Where does the conversation go from here? I'm under the impression I have provided objective answers to all the questions and resolved all review com
Re: [PATCH] hw/riscv/virt.c: add address-cells in create_fdt_one_aplic()
On Thu, May 30, 2024 at 09:26:58AM -0300, Daniel Henrique Barboza wrote: > > > On 5/30/24 08:06, Andrew Jones wrote: > > On Thu, May 30, 2024 at 01:05:41PM GMT, Andrew Jones wrote: > > > On Thu, May 30, 2024 at 05:49:49AM GMT, Daniel Henrique Barboza wrote: > > > > We need #address-cells properties in all interrupt controllers that are > > > > referred by an interrupt-map [1]. For the RISC-V machine, both PLIC and > > > > APLIC controllers must have this property. > > > > > > > > PLIC already sets it in create_fdt_socket_plic(). Set the property for > > > > APLIC in create_fdt_one_aplic(). > > > > > > > > [1] > > > > https://lore.kernel.org/linux-arm-kernel/cal_jsqje15d-xxxmelsmud+jqhzzxgzdxvikchn6kfwqk6n...@mail.gmail.com/ > > > > > > There are other issues[2] with the DT nodes that we should address at the > > > same time. > > > > > > [2] https://lore.kernel.org/all/20240529-rust-tile-a05517a6260f@spud/ > > > > I meant to CC Conor. Doing that now. > > I'll take a look at these other DT nodes issues. > > Conor, mind give me pointers on how do I reproduce the validation you did > in [2]? Using upstream 'dtc' I have stuff like: > > ../qemu/qemu_dts.dts:261.4-68: Warning (interrupts_extended_property): > /soc/aplic@d00:interrupts-extended: cell 0 is not a phandle reference > > Which seems to also be an error but it's not what you reported. Are you > using 'dt-validate' from dt-schema? Yeah, dt-validate. 
There's probably some stuff that I could add to my machine to make it more interesting, but I ran: $(qemu) -smp 4 -M virt,aia=aplic-imsic,dumpdtb=$(qemu_dtb) -cpu max -m 1G -nographic dt-validate --schema $(processed_schema) $(qemu_dtb) 2>&1 | tee logs/dtbdump.log A processed schema is a pre-requisite, and I usually have one sitting around from running dtbs_check or dt_binding_check in Linux, but I think you can use dt-rebasing to generate one either: https://git.kernel.org/pub/scm/linux/kernel/git/devicetree/devicetree-rebasing.git/tree/Bindings/Makefile You'll see a bunch of noise from undocumented isa extensions, but at the end of the output should be the aplic complaints. I forgot I had it disabled to test something when I did that test the other day, there's also complaints about the imsics: qemu.dtb: imsics@2800: $nodename:0: 'imsics@2800' does not match '^interrupt-controller(@[0-9a-f,]+)*$' from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml# qemu.dtb: imsics@2800: compatible:0: 'riscv,imsics' is not one of ['qemu,imsics'] from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml# qemu.dtb: imsics@2800: compatible: ['riscv,imsics'] is too short from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml# qemu.dtb: imsics@2800: '#msi-cells' is a required property from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml# qemu.dtb: imsics@2800: Unevaluated properties are not allowed ('compatible' was unexpected) from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml# qemu.dtb: imsics@2400: $nodename:0: 'imsics@2400' does not match '^interrupt-controller(@[0-9a-f,]+)*$' from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml# qemu.dtb: imsics@2400: compatible:0: 'riscv,imsics' is not one of ['qemu,imsics'] from schema $id: 
http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml# qemu.dtb: imsics@2400: compatible: ['riscv,imsics'] is too short from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml# qemu.dtb: imsics@2400: '#msi-cells' is a required property from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml# qemu.dtb: imsics@2400: Unevaluated properties are not allowed ('compatible' was unexpected) from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml# Cheers, Conor.
Re: [PATCH 1/2] accel/tcg: Make TCGCPUOps::cpu_exec_halt return bool for whether to halt
On Tue, 30 Apr 2024 at 18:15, Alex Bennée wrote: > The x86 version is essentially being called for side effects. Do we want > to document this usage in the method? I plan to take these two patches into target-arm.next, with a slightly beefed up doc comment: /** * @cpu_exec_halt: Callback for handling halt in cpu_exec. * * The target CPU should do any special processing here that it needs * to do when the CPU is in the halted state. * * Return true to indicate that the CPU should now leave halt, false * if it should remain in the halted state. * * If this method is not provided, the default is to do nothing, and * to leave halt if cpu_has_work() returns true. */ which hopefully makes it a little clearer to the reader that "do stuff you need to do when in halt" is the primary purpose of the method. I'll do the follow-on tidyup rth suggests once these have made it into git. thanks -- PMM
[RFC PATCH 0/1] vhost-user: Add SHMEM_MAP/UNMAP requests
Hi all, This is an early attempt to have backends support dynamic fd mapping into shared memory regions. As such, there are a few things that need settling, so I wanted to post this first to have some early feedback. The usecase for this is, e.g., to support vhost-user-gpu RESOURCE_BLOB operations, or DAX Window requests for virtio-fs. In general, any operation where a backend would need to mmap an fd to a shared memory so that the guest can access it. The request will be processed by the VMM, which will, in turn, trigger a mmap with the instructed parameters (i.e., shmid, shm_offset, fd_offset, fd, length). As there are already a couple of devices that could benefit from such a feature, and more could require it in the future, my intention was to make it generic. To that end, I declared the shared memory region list in `VirtIODevice`. I could add a couple of convenience functions to add new regions to the list, so that the devices can use them. But I wanted to gather some feedback before refining it further, as I am probably missing some required steps or security concerns that I am not taking into account. Albert Esteve (1): vhost-user: add shmem mmap request docs/interop/vhost-user.rst | 23 hw/virtio/vhost-user.c | 106 hw/virtio/virtio.c | 2 + include/hw/virtio/virtio.h | 3 + 4 files changed, 134 insertions(+) -- 2.44.0
[RFC PATCH 1/1] vhost-user: add shmem mmap request
Add SHMEM_MAP/UNMAP requests to vhost-user. This request allows backends to dynamically map fds into a shared memory region identified by its `shmid`. Then, the fd memory is advertised to the frontend through a BAR+offset, so it can be read by the driver while it's valid. Later, the backend can munmap the memory range in a given shared memory region (again, identified by its `shmid`), to free it. After this, the region becomes private and shall not be accessed by the frontend anymore. Initializing the memory region is the responsibility of the PCI device that will be using it. Signed-off-by: Albert Esteve --- docs/interop/vhost-user.rst | 23 hw/virtio/vhost-user.c | 106 hw/virtio/virtio.c | 2 + include/hw/virtio/virtio.h | 3 + 4 files changed, 134 insertions(+) diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst index d8419fd2f1..3caf2a290c 100644 --- a/docs/interop/vhost-user.rst +++ b/docs/interop/vhost-user.rst @@ -1859,6 +1859,29 @@ is sent by the front-end. when the operation is successful, or non-zero otherwise. Note that if the operation fails, no fd is sent to the backend. +``VHOST_USER_BACKEND_SHMEM_MAP`` + :id: 9 + :equivalent ioctl: N/A + :request payload: fd and ``struct VhostUserMMap`` + :reply payload: N/A + + This message can be submitted by the backends to advertise a new mapping + to be made in a given shared memory region. Upon receiving the message, + QEMU will mmap the given fd into the shared memory region with the + requested ``shmid``. A reply is generated indicating whether mapping + succeeded. + +``VHOST_USER_BACKEND_SHMEM_UNMAP`` + :id: 10 + :equivalent ioctl: N/A + :request payload: ``struct VhostUserMMap`` + :reply payload: N/A + + This message can be submitted by the backends so that QEMU un-mmap + a given range (``offset``, ``len``) in the shared memory region with the + requested ``shmid``. + A reply is generated indicating whether unmapping succeeded. + ..
_reply_ack: VHOST_USER_PROTOCOL_F_REPLY_ACK diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index cdf9af4a4b..9526b9d07f 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -115,6 +115,8 @@ typedef enum VhostUserBackendRequest { VHOST_USER_BACKEND_SHARED_OBJECT_ADD = 6, VHOST_USER_BACKEND_SHARED_OBJECT_REMOVE = 7, VHOST_USER_BACKEND_SHARED_OBJECT_LOOKUP = 8, +VHOST_USER_BACKEND_SHMEM_MAP = 9, +VHOST_USER_BACKEND_SHMEM_UNMAP = 10, VHOST_USER_BACKEND_MAX } VhostUserBackendRequest; @@ -192,6 +194,23 @@ typedef struct VhostUserShared { unsigned char uuid[16]; } VhostUserShared; +/* For the flags field of VhostUserMMap */ +#define VHOST_USER_FLAG_MAP_R (1u << 0) +#define VHOST_USER_FLAG_MAP_W (1u << 1) + +typedef struct { +/* Shared memory region ID */ +uint8_t shmid; +/* File offset */ +uint64_t fd_offset; +/* Offset within the shared memory region */ +uint64_t shm_offset; +/* Size of region to map */ +uint64_t len; +/* Flags for the mmap operation, from VHOST_USER_FLAG_* */ +uint64_t flags; +} VhostUserMMap; + typedef struct { VhostUserRequest request; @@ -224,6 +243,7 @@ typedef union { VhostUserInflight inflight; VhostUserShared object; VhostUserTransferDeviceState transfer_state; +VhostUserMMap mmap; } VhostUserPayload; typedef struct VhostUserMsg { @@ -1748,6 +1768,85 @@ vhost_user_backend_handle_shared_object_lookup(struct vhost_user *u, return 0; } +static int +vhost_user_backend_handle_shmem_map(struct vhost_dev *dev, + VhostUserMMap *vu_mmap, + int fd) +{ +void *addr = 0; +MemoryRegion *mr = NULL; + +if (fd < 0) { +error_report("Bad fd for map"); +return -EBADF; +} + +if (!dev->vdev->shmem_list || +dev->vdev->n_shmem_regions <= vu_mmap->shmid) { +error_report("Shared memory region at " + "ID %d uninitialized", vu_mmap->shmid); +return -EFAULT; +} + +mr = &dev->vdev->shmem_list[vu_mmap->shmid]; + +if ((vu_mmap->shm_offset + vu_mmap->len) < vu_mmap->len || +(vu_mmap->shm_offset + vu_mmap->len) > mr->size) { +error_report("Bad 
offset/len for mmap %" PRIx64 "+%" PRIx64, + vu_mmap->shm_offset, vu_mmap->len); +return -EFAULT; +} + +void *shmem_ptr = memory_region_get_ram_ptr(mr); + +addr = mmap(shmem_ptr + vu_mmap->shm_offset, vu_mmap->len, +((vu_mmap->flags & VHOST_USER_FLAG_MAP_R) ? PROT_READ : 0) | +((vu_mmap->flags & VHOST_USER_FLAG_MAP_W) ? PROT_WRITE : 0), +MAP_SHARED | MAP_FIXED, fd, vu_mmap->fd_offset); +if (addr == MAP_FAILED) { +error_report("Failed to mmap mem fd"); +return -EFAULT; +} + +return 0; +} + +static int +vhost_user_backend_handle_shmem_unmap(struct vhost_dev *dev, +
Re: [Semihosting Tests PATCH v2 1/3] .editorconfig: add code conventions for tooling
On 5/30/2024 6:23 AM, Alex Bennée wrote: It's a pain when you come back to a code base you haven't touched in a while and realise whatever indent settings you were using haven't carried over. Add an editorconfig and be done with it. Signed-off-by: Alex Bennée Adding an editorconfig seems like a great idea IMO. But I wonder - will it result in unintentional additional changes when saving a file that contains baseline non-conformance? Related: would a .clang-format file also be useful? git-clang-format can be used to apply formatting changes only on the code that's been changed. Also: should we consider excluding any exceptional files that we don't expect to conform? --- v2 - drop mention of custom major modes (not needed here) - include section for assembly --- .editorconfig | 29 + 1 file changed, 29 insertions(+) create mode 100644 .editorconfig diff --git a/.editorconfig b/.editorconfig new file mode 100644 index 000..c72a55c --- /dev/null +++ b/.editorconfig @@ -0,0 +1,29 @@ +# EditorConfig is a file format and collection of text editor plugins +# for maintaining consistent coding styles between different editors +# and IDEs. Most popular editors support this either natively or via +# plugin. +# +# Check https://editorconfig.org for details. +# + +root = true + +[*] +end_of_line = lf +insert_final_newline = true +charset = utf-8 + +[Makefile*] +indent_style = tab +indent_size = 8 +emacs_mode = makefile + +[*.{c,h}] +indent_style = space +indent_size = 4 +emacs_mode = c + +[*.{s,S}] +indent_style = tab +indent_size = 8 +emacs_mode = asm
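On the .clang-format question: a minimal file approximating the same C settings might look like the following. This is a hypothetical sketch, not a proposal the project has agreed on — the base style and column limit would still need discussion, and git-clang-format keeps reformatting scoped to changed lines.

```yaml
# Hypothetical .clang-format matching the .editorconfig's C settings
BasedOnStyle: LLVM
IndentWidth: 4
UseTab: Never
ColumnLimit: 80
BreakBeforeBraces: Linux
```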
Re: [PATCH] hw/usb/hcd-ohci: Fix ohci_service_td: accept valid TDs
On Tue, 21 May 2024 at 00:26, David Hubbard wrote: > > From: Cord Amfmgm > > This changes the way the ohci emulation handles a Transfer Descriptor with > "Current Buffer Pointer" set to "Buffer End" + 1. > > The OHCI spec 4.3.1.2 Table 4-2 allows td.cbp to be one byte more than td.be > to signal the buffer has zero length. Currently qemu only accepts zero-length > Transfer Descriptors if the td.cbp is equal to 0, while actual OHCI hardware > accepts both cases. > > The qemu ohci emulation has a regression in ohci_service_td. Version 4.2 > and earlier matched the spec. (I haven't taken the time to bisect exactly > where the logic was changed.) Almost certainly this was commit 1328fe0c32d54 ("hw: usb: hcd-ohci: check len and frame_number variables"), which added these bounds checks. Prior to that we did no bounds checking at all, which meant that we permitted cbp=be+1 to mean a zero length, but also that we permitted the guest to overrun host-side buffers by specifying completely bogus cbp and be values. The timeframe is more or less right (2020), at least. -- PMM
[PATCH v2 0/3] semihosting: Restrict to TCG
v2: Address Paolo's comment Semihosting currently uses the TCG probe_access API, so it is pointless to have it in the binary when TCG isn't built in. It could be implemented for other accelerators, but that work has yet to be done. Meanwhile, do not enable it unless TCG is available. Philippe Mathieu-Daudé (3): target/mips: Restrict semihosting to TCG target/riscv: Restrict semihosting to TCG semihosting: Restrict to TCG semihosting/Kconfig | 1 + target/mips/Kconfig | 2 +- target/riscv/Kconfig | 4 ++-- 3 files changed, 4 insertions(+), 3 deletions(-) -- 2.41.0
[PATCH v2 2/3] target/riscv: Restrict semihosting to TCG
Semihosting currently uses the TCG probe_access API. To prepare for encoding the TCG dependency in Kconfig, do not enable it unless TCG is available. Suggested-by: Paolo Bonzini Signed-off-by: Philippe Mathieu-Daudé --- target/riscv/Kconfig | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/target/riscv/Kconfig b/target/riscv/Kconfig index 5f30df22f2..cbafbedaeb 100644 --- a/target/riscv/Kconfig +++ b/target/riscv/Kconfig @@ -1,9 +1,9 @@ config RISCV32 bool -select ARM_COMPATIBLE_SEMIHOSTING # for do_common_semihosting() +imply ARM_COMPATIBLE_SEMIHOSTING if TCG # for do_common_semihosting() select DEVICE_TREE # needed by boot.c config RISCV64 bool -select ARM_COMPATIBLE_SEMIHOSTING # for do_common_semihosting() +imply ARM_COMPATIBLE_SEMIHOSTING if TCG # for do_common_semihosting() select DEVICE_TREE # needed by boot.c -- 2.41.0
[PATCH v2 3/3] semihosting: Restrict to TCG
Semihosting currently uses the TCG probe_access API. It is pointless to have it in the binary when TCG isn't built in. Signed-off-by: Philippe Mathieu-Daudé Reviewed-by: Richard Henderson --- semihosting/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/semihosting/Kconfig b/semihosting/Kconfig index eaf3a20ef5..fbe6ac87f9 100644 --- a/semihosting/Kconfig +++ b/semihosting/Kconfig @@ -1,6 +1,7 @@ config SEMIHOSTING bool + depends on TCG config ARM_COMPATIBLE_SEMIHOSTING bool -- 2.41.0
Re: [PATCH V2 0/3] improve -overcommit cpu-pm=on|off
On Thu, May 30, 2024, Igor Mammedov wrote: > On Thu, 30 May 2024 21:54:47 +0800 Zhao Liu wrote: ... > > > >> following, CPUID_EXT_MONITOR is set after x86_cpu_filter_features(), so > > > >> that it doesn't have a chance to check MWAIT against host features and > > > >> will be advertised to the guest regardless of whether it's supported by > > > >> the host or not. > > > >> > > > >> x86_cpu_realizefn() > > > >> x86_cpu_filter_features() > > > >> cpu_exec_realizefn() > > > >> kvm_cpu_realizefn > > > >> host_cpu_realizefn > > > >> host_cpu_enable_cpu_pm > > > >> env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR; > > > >> > > > >> > > > >> If it's not supported by the host, executing MONITOR or MWAIT > > > >> instructions from the guest triggers #UD, no matter MWAIT_EXITING > > > >> control is set or not. > > > > > > > > If I recall right, kvm was able to emulate mwait/monitor. > > > > So question is why it leads to exception instead? Because KVM doesn't emulate MONITOR/MWAIT on #UD. > > > KVM can come into play only iff it can trigger MWAIT/MONITOR VM exits. I > > > didn't find explicit proof from Intel SDM that #UD exceptions take > > > precedence over MWAIT/MONITOR VM exits, but this is my speculation. Yeah, typically #UD takes priority over VM-Exit interception checks. AMD's APM is much more explicit and states that all exceptions are checked on MONITOR/MWAIT before the interception check. > > > For example, in ancient machines which don't support MWAIT yet, the only > > > way it can do is #UD, not MWAIT VM exit? Not really relevant, because such CPUs wouldn't have MWAIT-exiting. > > For the Host which doesn't support MWAIT, it shouldn't have the VMX > > control bit for mwait exit either, right? > > > > Could you pls check this on your machine? If VMX doesn't support this > > exit event, then triggering an exception will make sense. > > My assumption (probably wrong) was that KVM would emulate mwait if it's > unavailable, Nope. 
In order to limit the attack surface of the emulator on modern CPUs, KVM only emulates select instructions in response to a #UD. But even if KVM did emulate MONITOR/MWAIT on #UD, this is inarguably a QEMU bug, e.g. QEMU will effectively coerce the guest into using an idle-polling mechanism.
[PATCH v2 1/3] target/mips: Restrict semihosting to TCG
Semihosting currently uses the TCG probe_access API. To prepare for encoding the TCG dependency in Kconfig, do not enable it unless TCG is available. Suggested-by: Paolo Bonzini Signed-off-by: Philippe Mathieu-Daudé --- target/mips/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/target/mips/Kconfig b/target/mips/Kconfig index eb19c94c7d..876048b150 100644 --- a/target/mips/Kconfig +++ b/target/mips/Kconfig @@ -1,6 +1,6 @@ config MIPS bool -select SEMIHOSTING +imply SEMIHOSTING if TCG config MIPS64 bool -- 2.41.0
Re: [PATCH V2 0/3] improve -overcommit cpu-pm=on|off
On Thu, 30 May 2024 21:54:47 +0800 Zhao Liu wrote: > Hi Zide, > > On Wed, May 29, 2024 at 10:31:21AM -0700, Chen, Zide wrote: > > Date: Wed, 29 May 2024 10:31:21 -0700 > > From: "Chen, Zide" > > Subject: Re: [PATCH V2 0/3] improve -overcommit cpu-pm=on|off > > > > > > > > On 5/29/2024 5:46 AM, Igor Mammedov wrote: > > > On Tue, 28 May 2024 11:16:59 -0700 > > > "Chen, Zide" wrote: > > > > > >> On 5/28/2024 2:23 AM, Igor Mammedov wrote: > > >>> On Fri, 24 May 2024 13:00:14 -0700 > > >>> Zide Chen wrote: > > >>> > > Currently, if running "-overcommit cpu-pm=on" on hosts that don't > > have MWAIT support, the MWAIT/MONITOR feature is advertised to the > > guest and executing MWAIT/MONITOR on the guest triggers #UD. > > >>> > > >>> this is missing proper description how do you trigger issue > > >>> with reproducer and detailed description why guest sees MWAIT > > >>> when it's not supported by host. > > >> > > >> If "overcommit cpu-pm=on" and "-cpu host" are present, as shown in the > > > it's better to provide full QEMU CLI and host/guest kernels used and what > > > hardware was used if it's relevant so others can reproduce problem. > > > > I reproduced this on an older Intel Icelake machine, a > > Sapphire Rapids and a Sierra Forest, but I believe this is an x86-generic > > issue, not specific to particular models. 
> > > > For the CLI, I think the only command line options that matter are > > -overcommit cpu-pm=on: to set enable_cpu_pm > > -cpu host: so that cpu->max_features is set > > > > For QEMU version, as long as it's after this commit: 662175b91ff2 > > ("i386: reorder call to cpu_exec_realizefn") > > > > The guest fails to boot: > > > > [ 24.825568] smpboot: x86: Booting SMP configuration: > > [ 24.826377] node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 > > #13 #14 #15 #17 > > [ 24.985799] node #1, CPUs: #128 #129 #130 #131 #132 #133 #134 #135 > > #136 #137 #138 #139 #140 #141 #142 #143 #145 > > [ 25.136955] invalid opcode: 1 PREEMPT SMP NOPTI > > [ 25.137790] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0 #2 > > [ 25.137790] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > > rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/04 > > [ 25.137790] RIP: 0010:mwait_idle+0x35/0x80 > > [ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15 > > 47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8 > > [ 25.137790] RSP: :91403e70 EFLAGS: 00010046 > > [ 25.137790] RAX: 9140a980 RBX: 9140a980 RCX: > > > > [ 25.137790] RDX: RSI: 97f1ade21b20 RDI: > > 0004 > > [ 25.137790] RBP: R08: 0005da4709cb R09: > > 0001 > > [ 25.137790] R10: 5da4 R11: 0009 R12: > > > > [ 25.137790] R13: 98573ff90fc0 R14: 9140a038 R15: > > 00093ff0 > > [ 25.137790] FS: () GS:97f1ade0() > > knlGS: > > [ 25.137790] CS: 0010 DS: ES: CR0: 80050033 > > [ 25.137790] CR2: 97d8aa801000 CR3: 0049e9430001 CR4: > > 00770ef0 > > [ 25.137790] DR0: DR1: DR2: > > > > [ 25.137790] DR3: DR6: 07f0 DR7: > > 0400 > > [ 25.137790] PKRU: 5554 > > [ 25.137790] Call Trace: > > [ 25.137790] > > [ 25.137790] ? die+0x37/0x90 > > [ 25.137790] ? do_trap+0xe3/0x110 > > [ 25.137790] ? mwait_idle+0x35/0x80 > > [ 25.137790] ? do_error_trap+0x6a/0x90 > > [ 25.137790] ? mwait_idle+0x35/0x80 > > [ 25.137790] ? exc_invalid_op+0x52/0x70 > > [ 25.137790] ? mwait_idle+0x35/0x80 > > [ 25.137790] ? 
asm_exc_invalid_op+0x1a/0x20 > > [ 25.137790] ? mwait_idle+0x35/0x80 > > [ 25.137790] default_idle_call+0x30/0x100 > > [ 25.137790] cpuidle_idle_call+0x12c/0x170 > > [ 25.137790] ? tsc_verify_tsc_adjust+0x73/0xd0 > > [ 25.137790] do_idle+0x7f/0xd0 > > [ 25.137790] cpu_startup_entry+0x29/0x30 > > [ 25.137790] rest_init+0xcc/0xd0 > > [ 25.137790] start_kernel+0x396/0x5d0 > > [ 25.137790] x86_64_start_reservations+0x18/0x30 > > [ 25.137790] x86_64_start_kernel+0xe7/0xf0 > > [ 25.137790] common_startup_64+0x13e/0x148 > > [ 25.137790] > > [ 25.137790] Modules linked in: > > [ 25.137790] --[ end trace ]-- > > [ 25.137790] invalid opcode: 2 PREEMPT SMP NOPTI > > [ 25.137790] RIP: 0010:mwait_idle+0x35/0x80 > > [ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15 > > 47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8 > > > > > > > >> following, CPUID_EXT_MONITOR is set after x86_cpu_filter_features(), so > > >> that it doesn't have a chance to check MWAIT against host features and > > >> will be advertised to the guest regardless of whether it's supported by > > >> the host or not. > > >> > > >> x86_cpu_realizefn() > > >> x86_cpu_filter_features() > > >> cpu_exec_realizefn() > > >> kvm_cpu_realizefn > >
Re: [PATCH] target/arm: Disable SVE extensions when SVE is disabled
On Sun, 26 May 2024 at 21:46, Richard Henderson wrote: > > From: Marcin Juszkiewicz > > Cc: qemu-sta...@nongnu.org > Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2304 > Reported-by: Marcin Juszkiewicz > Signed-off-by: Richard Henderson > --- Applied to target-arm.next, thanks. (Looks like we already got this right for SME: arm_cpu_sme_finalize() clears ID_AA64SMFR0_EL1 if SME is disabled.) -- PMM
Re: [PATCH] hw/net: prevent potential NULL dereference
Thanks for the review. Would it be correct to use hcall_dprintf() as in other functions of the module? For example, in h_add_logical_lan_buffer(). Best regards, Oleg. 29.05.2024 16:52, Philippe Mathieu-Daudé wrote: On 29/5/24 13:07, Oleg Sviridov wrote: The pointer returned from spapr_vio_find_by_reg() may be NULL and is dereferenced immediately afterwards. Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Oleg Sviridov --- hw/net/spapr_llan.c | 4 1 file changed, 4 insertions(+) diff --git a/hw/net/spapr_llan.c b/hw/net/spapr_llan.c index ecb30b7c76..f40b733229 100644 --- a/hw/net/spapr_llan.c +++ b/hw/net/spapr_llan.c @@ -770,6 +770,10 @@ static target_ulong h_change_logical_lan_mac(PowerPCCPU *cpu, SpaprVioVlan *dev = VIO_SPAPR_VLAN_DEVICE(sdev); int i; Trying to change a MAC when no NIC is present is dubious; we could at least report this using qemu_log_mask(LOG_GUEST_ERROR). + if (!dev) { + return H_PARAMETER; + } + for (i = 0; i < ETH_ALEN; i++) { dev->nicconf.macaddr.a[ETH_ALEN - i - 1] = macaddr & 0xff; macaddr >>= 8;
Re: [QEMU][PATCH 1/1] hw/net/can: Fix sorting of the tx queue
Hi Shiva, On Wed, May 29, 2024 at 03:54:58PM +0530, Shiva sagar Myana wrote: > Returning a uint32_t cast to a gint from g_cmp_ids causes the tx queue to > become wrongly sorted when executing g_slist_sort. Fix this by always > returning -1 or 1 from g_cmp_ids based on the ID comparison instead. > Also, if two message IDs are the same, sort them by using their index and > transmit the message at the lowest index first. > > Signed-off-by: Shiva sagar Myana With the subject line modified as below (for clarity): "hw/net/can/xlnx-versal-canfd: Fix sorting of the tx queue" Reviewed-by: Francisco Iglesias BR, F > --- > hw/net/can/xlnx-versal-canfd.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/hw/net/can/xlnx-versal-canfd.c b/hw/net/can/xlnx-versal-canfd.c > index 47a14cfe63..5f083c21e9 100644 > --- a/hw/net/can/xlnx-versal-canfd.c > +++ b/hw/net/can/xlnx-versal-canfd.c > @@ -1312,7 +1312,10 @@ static gint g_cmp_ids(gconstpointer data1, > gconstpointer data2) > tx_ready_reg_info *tx_reg_1 = (tx_ready_reg_info *) data1; > tx_ready_reg_info *tx_reg_2 = (tx_ready_reg_info *) data2; > > -return tx_reg_1->can_id - tx_reg_2->can_id; > +if (tx_reg_1->can_id == tx_reg_2->can_id) { > +return (tx_reg_1->reg_num < tx_reg_2->reg_num) ? -1 : 1; > +} > +return (tx_reg_1->can_id < tx_reg_2->can_id) ? -1 : 1; > } > > static void free_list(GSList *list) > -- > 2.34.1 >
Re: [PATCH v3 00/33] target/arm: Convert a64 advsimd to decodetree (part 1b)
On Tue, 28 May 2024 at 21:31, Richard Henderson wrote: > > Changes for v3: > * Reword prefetch unpredictable patch. > * Validate vector length when qc is an implied operand. > * Adjust some legacy decode based on review. > * Apply r-b. > > Patches needing review: > 01-target-arm-Diagnose-UNPREDICTABLE-operands-to-PLD.patch > 03-target-arm-Assert-oprsz-in-range-when-using-vfp.q.patch > 04-target-arm-Convert-SUQADD-and-USQADD-to-gvec.patch > 10-target-arm-Convert-SRSHL-and-URSHL-register-to-gv.patch > 12-target-arm-Convert-SQSHL-and-UQSHL-register-to-gv.patch > 31-target-arm-Convert-SQDMULH-SQRDMULH-to-decodetree.patch > 32-target-arm-Convert-FMADD-FMSUB-FNMADD-FNMSUB-to-d.patch Applied 2-33 to target-arm.next, thanks. (Dropped the PLD patch for the reasons we discussed in the other thread.) -- PMM
Re: [Semihosting Tests PATCH v2 0/3] add SYS_GET_CMDLINE test
On Thu, 30 May 2024 at 12:23, Alex Bennée wrote: > > Hi Peter, > > Looking at bug #2322 I wanted to make sure SYS_GET_CMDLINE works as I > expected. While at it I needed to fix a compile error with headers > which I guess we got away with on earlier compilers. > > I've added an editorconfig for good measure. > > v2 > - addressed review comments > > Alex Bennée (3): > .editorconfig: add code conventions for tooling > update includes for bare metal compiling > add SYS_GET_CMDLINE test Applied to the semihosting-tests git repo, thanks. -- PMM