Re: [PATCH V2 3/3] target/i386: Move host_cpu_enable_cpu_pm into kvm_cpu_realizefn()

2024-05-30 Thread Zhao Liu
On Fri, May 24, 2024 at 01:00:17PM -0700, Zide Chen wrote:
> Date: Fri, 24 May 2024 13:00:17 -0700
> From: Zide Chen 
> Subject: [PATCH V2 3/3] target/i386: Move host_cpu_enable_cpu_pm into
>  kvm_cpu_realizefn()
> X-Mailer: git-send-email 2.34.1
> 
> It does not seem like a good idea to expand features in host_cpu_realizefn,
> which is for the host CPU only. 

It is restricted by max_features and should be part of the "max" CPU,
and the current target/i386/host-cpu.c should only serve the "host" CPU.

> Additionally, cpu-pm option is KVM
> specific, and it's cleaner to put it in kvm_cpu_realizefn(), together
> with the WAITPKG code.
> 
> Fixes: f5cc5a5c1686 ("i386: split cpu accelerators from cpu.c, using AccelCPUClass")
> Signed-off-by: Zide Chen 
> ---
>  target/i386/host-cpu.c| 12 
>  target/i386/kvm/kvm-cpu.c | 11 +--
>  2 files changed, 9 insertions(+), 14 deletions(-)
> 
> diff --git a/target/i386/host-cpu.c b/target/i386/host-cpu.c
> index 280e427c017c..8b8bf5afeccf 100644
> --- a/target/i386/host-cpu.c
> +++ b/target/i386/host-cpu.c
> @@ -42,15 +42,6 @@ static uint32_t host_cpu_phys_bits(void)
>  return host_phys_bits;
>  }
>  
> -static void host_cpu_enable_cpu_pm(X86CPU *cpu)
> -{
> -CPUX86State *env = &cpu->env;
> -
> -host_cpuid(5, 0, &cpu->mwait.eax, &cpu->mwait.ebx,
> -   &cpu->mwait.ecx, &cpu->mwait.edx);
> -env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR;
> -}
> -
>  static uint32_t host_cpu_adjust_phys_bits(X86CPU *cpu)
>  {
>  uint32_t host_phys_bits = host_cpu_phys_bits();
> @@ -83,9 +74,6 @@ bool host_cpu_realizefn(CPUState *cs, Error **errp)
>  X86CPU *cpu = X86_CPU(cs);
>  CPUX86State *env = &cpu->env;
>  
> -if (cpu->max_features && enable_cpu_pm) {
> -host_cpu_enable_cpu_pm(cpu);
> -}
>  if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
>  uint32_t phys_bits = host_cpu_adjust_phys_bits(cpu);
>  
> diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
> index 3adcedf0dbc3..197c892da89a 100644
> --- a/target/i386/kvm/kvm-cpu.c
> +++ b/target/i386/kvm/kvm-cpu.c
> @@ -64,9 +64,16 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
>   *   cpu_common_realizefn() (via xcc->parent_realize)
>   */
>  if (cpu->max_features) {
> -if (enable_cpu_pm && kvm_has_waitpkg()) {
> -env->features[FEAT_7_0_ECX] |= CPUID_7_0_ECX_WAITPKG;
> +if (enable_cpu_pm) {
> +if (kvm_has_waitpkg()) {
> +env->features[FEAT_7_0_ECX] |= CPUID_7_0_ECX_WAITPKG;
> +}

If you agree with my comment in patch 2, we need an mwait bit check here.

> +host_cpuid(5, 0, &cpu->mwait.eax, &cpu->mwait.ebx,
> +   &cpu->mwait.ecx, &cpu->mwait.edx);
> +env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR;
>  }
> +
>  if (cpu->ucode_rev == 0) {
>  cpu->ucode_rev =
>  kvm_arch_get_supported_msr_feature(kvm_state,
> -- 
> 2.34.1
> 
> 



Re: Unexpected error in rme_configure_one() at ../target/arm/kvm-rme.c:159

2024-05-30 Thread Gavin Shan

On 5/31/24 14:19, Itaru Kitayama wrote:

On May 30, 2024, at 22:30, Philippe Mathieu-Daudé  wrote:
Cc'ing more developers

On 30/5/24 06:30, Itaru Kitayama wrote:

Hi,
When I see a Realm VM creation fail with:
Unexpected error in rme_configure_one() at ../target/arm/kvm-rme.c:159:
qemu-system-aarch64: RME: failed to configure SVE: Invalid argument
test.sh: line 8:  2502 Aborted qemu-system-aarch64 \
-M 'virt,acpi=off,gic-version=3' -cpu host -enable-kvm -smp 2 -m 512M \
-overcommit 'mem-lock=on' -M 'confidential-guest-support=rme0' \
-object 'rme-guest,id=rme0,measurement-algo=sha512,num-pmu-counters=6,sve-vector-length=256' \
-kernel Image -initrd rootfs.cpio \
-append 'earycon console=ttyAMA0 rdinit=/sbin/init' -nographic -net none
do I need to suspect the VMM, QEMU, or the Image first? The kernel is built
with LLVM; does that matter?
Thanks,
Itaru.




I’m testing Jean’s repo at:

https://git.codelinaro.org/linaro/dcap/qemu/-/tree/cca/v2?ref_type=heads



I got a chance to try the CCA software components, as suggested by [1].
However, edk2 is stuck somewhere, and I haven't reached the stage of
loading the guest kernel yet. I'm replying to see if anyone has an idea.

[1] 
https://linaro.atlassian.net/wiki/spaces/QEMU/pages/29051027459/Building+an+RME+stack+for+QEMU


---> tf-rmm

# git clone https://git.codelinaro.org/linaro/dcap/rmm.git tf-rmm
# cd tf-rmm
# git checkout origin/cca/v2 -b cca/v2
# git submodule update --init --recursive
# export CROSS_COMPILE=
# cmake -DCMAKE_BUILD_TYPE=Debug -DRMM_CONFIG=qemu_virt_defcfg -B build-qemu
# cmake --build build-qemu

---> edk2

# git clone g...@github.com:tianocore/edk2.git
# cd edk2
# git submodule update --init --recursive
# source edksetup.sh
# make -j -C BaseTools
# export GCC5_AARCH64_PREFIX=
# build -b RELEASE -a AARCH64 -t GCC5 -p ArmVirtPkg/ArmVirtQemuKernel.dsc

---> tf-a

# git clone https://git.codelinaro.org/linaro/dcap/tf-a/trusted-firmware-a.git 
tf-a
# cd tf-a
# git checkout origin/cca/v2 -b cca/v2
# make -j CROSS_COMPILE= PLAT=qemu ENABLE_RME=1 DEBUG=1 LOG_LEVEL=40 \
  QEMU_USE_GIC_DRIVER=QEMU_GICV3 RMM=../rmm/build-qemu/Debug/rmm.img \
  BL33=../edk2/Build/ArmVirtQemuKernel-AARCH64/RELEASE_GCC5/FV/QEMU_EFI.fd all 
fip
# dd if=build/qemu/debug/bl1.bin of=flash.bin
# dd if=build/qemu/debug/fip.bin of=flash.bin seek=64 bs=4096

---> QEMU

# git clone https://git.qemu.org/git/qemu.git qemu.main
# cd qemu.main
# ./configure --target-list=aarch64-softmmu
# make -j 60

---> boot the host on the emulated hardware

/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64                          \
-M virt,virtualization=on,secure=on,gic-version=3,acpi=off                       \
-cpu max,x-rme=on -m 8G -smp 8                                                   \
-nodefaults -monitor none -serial mon:stdio -nographic                           \
-bios /home/gavin/sandbox/CCA/tf-a/flash.bin                                     \
-kernel /home/gavin/sandbox/CCA/linux/arch/arm64/boot/Image                      \
-drive format=raw,if=none,file=${CCA}/buildroot/output/images/rootfs.ext4,id=hd0 \
-device virtio-blk-pci,drive=hd0                                                 \
-append "root=/dev/vda console=ttyAMA0"
  :
NOTICE:  Booting Trusted Firmware
NOTICE:  BL1: v2.10.0(debug):99e0b97aa-dirty
NOTICE:  BL1: Built : 00:31:35, May 31 2024
INFO:BL1: RAM 0xe0ee000 - 0xe0f7000
INFO:BL1: Loading BL2
INFO:Loading image id=1 at address 0xe06b000
INFO:Image id=1 loaded: 0xe06b000 - 0xe0742d1
NOTICE:  BL1: Booting BL2
INFO:Entry point address = 0xe06b000
INFO:SPSR = 0x3cd
INFO:[GPT] Boot Configuration
INFO:  PPS/T: 0x2/40
INFO:  PGS/P: 0x0/12
INFO:  L0GPTSZ/S: 0x0/30
INFO:  PAS count: 0x6
INFO:  L0 base:   0xedfe000
INFO:[GPT] PAS[0]: base 0xe001000, size 0xff000, GPI 0xa, type 0x1
INFO:[GPT] PAS[1]: base 0xe10, size 0xcfe000, GPI 0x8, type 0x1
INFO:[GPT] PAS[2]: base 0xedfe000, size 0x202000, GPI 0xa, type 0x1
INFO:[GPT] PAS[3]: base 0x4000, size 0x10, GPI 0x9, type 0x1
INFO:[GPT] PAS[4]: base 0x4010, size 0x280, GPI 0xb, type 0x1
INFO:[GPT] PAS[5]: base 0x4290, size 0x1fd70, GPI 0x9, type 0x1
INFO:Enabling Granule Protection Checks
NOTICE:  BL2: v2.10.0(debug):99e0b97aa-dirty
NOTICE:  BL2: Built : 00:31:35, May 31 2024
INFO:BL2: Doing platform setup
INFO:Reserved RMM memory [0x4010, 0x428f] in Device tree
INFO:BL2: Loading image id 3
INFO:Loading image id=3 at address 0xe0a
INFO:Image id=3 loaded: 0xe0a - 0xe0b10c4
INFO:BL2: Loading image id 35
INFO:Loading image id=35 at address 0x4010
INFO:Image id=35 loaded: 0x4010 - 0x403033b0
INFO:BL2: Loading image id 5
INFO:Loading image id=5 at address 0x6000
INFO:Image id=5 loaded: 0x6000 - 0x6020
NOTICE:  BL2: Booting BL31
INFO:Entry point address = 0xe0a
INFO:SPSR = 0x3cd
NOTICE:  BL31: v2.10.0(debug

Re: [PATCH V2 2/3] target/i386: call cpu_exec_realizefn before x86_cpu_filter_features

2024-05-30 Thread Zhao Liu
Hi Zide,

On Fri, May 24, 2024 at 01:00:16PM -0700, Zide Chen wrote:
> Date: Fri, 24 May 2024 13:00:16 -0700
> From: Zide Chen 
> Subject: [PATCH V2 2/3] target/i386: call cpu_exec_realizefn before
>  x86_cpu_filter_features
> X-Mailer: git-send-email 2.34.1
> 
> cpu_exec_realizefn, which calls the accel-specific realizefn, may expand
> features.  E.g., some accel-specific options may require extra features
> to be enabled, and it's appropriate to expand these features in the
> accel-specific realizefn.
> 
> One such example is the cpu-pm option, which may add CPUID_EXT_MONITOR.
> 
> Thus, call cpu_exec_realizefn before x86_cpu_filter_features to ensure
> that it won't expose features not supported by the host.
> 
> Fixes: 662175b91ff2 ("i386: reorder call to cpu_exec_realizefn")
> Suggested-by: Xiaoyao Li 
> Signed-off-by: Zide Chen 
> ---
>  target/i386/cpu.c | 24 
>  target/i386/kvm/kvm-cpu.c |  1 -
>  2 files changed, 12 insertions(+), 13 deletions(-)
> 
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index bc2dceb647fa..a1c1c785bd2f 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -7604,6 +7604,18 @@ static void x86_cpu_realizefn(DeviceState *dev, Error **errp)
>  }
>  }
>  
> +/*
> + * note: the call to the framework needs to happen after feature 
> expansion,
> + * but before the checks/modifications to ucode_rev, mwait, phys_bits.
> + * These may be set by the accel-specific code,
> + * and the results are subsequently checked / assumed in this function.
> + */
> +cpu_exec_realizefn(cs, &local_err);
> +if (local_err != NULL) {
> +error_propagate(errp, local_err);
> +return;
> +}
> +
>  x86_cpu_filter_features(cpu, cpu->check_cpuid || cpu->enforce_cpuid);

In your case, which sets cpu-pm=on via -overcommit,
x86_cpu_filter_features() will complain that mwait is not supported.

Such a warning is not necessary, because the purpose of -overcommit
(judging from the code) is only to support mwait when possible, not to
commit to supporting mwait in the guest.

Additionally, I understand x86_cpu_filter_features() is primarily
intended to filter features configured by the user, and changes to
CPUID after x86_cpu_filter_features() should by default be regarded as
"QEMU knows what it is doing".

I feel adding a check for the CPUID mwait bit in host_cpu_realizefn()
is enough; after all, this bit should be present if the host supports
mwait and enable_cpu_pm is set (per kvm_arch_get_supported_cpuid()).

Thanks,
Zhao
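
Zhao's suggested guard can be sketched standalone. This is a hedged
illustration: `may_enable_cpu_pm` is a hypothetical helper, not a real QEMU
function, but the bit position matches QEMU's CPUID_EXT_MONITOR
(CPUID.01H:ECX bit 3, the MONITOR/MWAIT feature flag).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID.01H:ECX bit 3 advertises MONITOR/MWAIT support. */
#define CPUID_EXT_MONITOR (1u << 3)

/* Hypothetical helper: only take the cpu-pm path when the host's
 * CPUID leaf 1 ECX actually advertises MONITOR/MWAIT. */
static bool may_enable_cpu_pm(uint32_t host_cpuid_1_ecx, bool enable_cpu_pm)
{
    return enable_cpu_pm && (host_cpuid_1_ecx & CPUID_EXT_MONITOR) != 0;
}
```

With such a guard, cpu-pm=on on a host without MONITOR/MWAIT would silently
stay off instead of tripping the feature filter later.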




[PATCH] cxl: Get rid of unused cfmw_list

2024-05-30 Thread Li Zhijian via
There is no user for this member. All '-M cxl-fmw.N' options have
been parsed and saved to CXLState.fixed_windows.

Signed-off-by: Li Zhijian 
---
 hw/cxl/cxl-host.c| 1 -
 include/hw/cxl/cxl.h | 1 -
 2 files changed, 2 deletions(-)

diff --git a/hw/cxl/cxl-host.c b/hw/cxl/cxl-host.c
index c5f5fcfd64d0..926d3d3da705 100644
--- a/hw/cxl/cxl-host.c
+++ b/hw/cxl/cxl-host.c
@@ -335,7 +335,6 @@ static void machine_set_cfmw(Object *obj, Visitor *v, const char *name,
 for (it = cfmw_list; it; it = it->next) {
 cxl_fixed_memory_window_config(state, it->value, errp);
 }
-state->cfmw_list = cfmw_list;
 }
 
 void cxl_machine_init(Object *obj, CXLState *state)
diff --git a/include/hw/cxl/cxl.h b/include/hw/cxl/cxl.h
index 75e47b686441..e3ecbef03872 100644
--- a/include/hw/cxl/cxl.h
+++ b/include/hw/cxl/cxl.h
@@ -43,7 +43,6 @@ typedef struct CXLState {
 MemoryRegion host_mr;
 unsigned int next_mr_idx;
 GList *fixed_windows;
-CXLFixedMemoryWindowOptionsList *cfmw_list;
 } CXLState;
 
 struct CXLHost {
-- 
2.29.2




Re: tests/avocado: Add LoongArch machine start test

2024-05-30 Thread Jiaxun Yang



On May 31, 2024 at 2:52 AM, gaosong wrote:
> On 2024/5/30 at 9:16 PM, Jiaxun Yang wrote:
>>
>> On May 30, 2024 at 2:00 PM, gaosong wrote:
>> [...]
 FYI, the test does not seem to work anymore - apparently the binaries
 have changed and now the hashes do not match anymore. Could you please
 update it? (preferably with some versioned binaries that do not change
 in the course of time?)

>>> Thank you, I have sent a patch to fix it.
>> Hi Song,
>>
>> As LoongArch EDK2 support was merged long ago, do you want to make a clean
>> build and add it to the pc-bios directory?
> EDK2 LoongArchVirt under OvmfPkg is being committed to upstream.
>
> PR:
> https://github.com/tianocore/edk2/pull/5208

I meant here:

https://gitlab.com/qemu-project/qemu/-/tree/master/pc-bios?ref_type=heads

Thanks
>
> Thanks
> Song Gao
>>
>> Thanks
>> - Jiaxun

-- 
- Jiaxun



RE: [PATCH v4 12/16] aspeed/soc: Add AST2700 support

2024-05-30 Thread Jamin Lin
Hi Cedric,

> From: Cédric Le Goater 
> Subject: Re: [PATCH v4 12/16] aspeed/soc: Add AST2700 support
> 
> On 5/27/24 10:02, Jamin Lin wrote:
> > Initial definitions for a simple machine using an AST2700 SOC (Cortex-a35
> CPU).
> >
> > AST2700 SOC and its interrupt controller are too complex to handle in
> > the common Aspeed SoC framework. We introduce a new ast2700 class with
> > instance_init and realize handlers.
> >
> > AST2700 is a 64-bit SoC with quad-core CPUs and supports 8 watchdogs.
> > Update the maximum ASPEED_CPUS_NUM to 4 and ASPEED_WDTS_NUM to 8.
> > In addition, update AspeedSocState to support scuio, sli, sliio and intc.
> >
> > Add TYPE_ASPEED27X0_SOC machine type.
> >
> > The SDMC controller is unlocked at the SPL stage. At present, only
> > booting that starts from the u-boot stage is emulated, so set the
> > SDMC controller to unlocked by default.
> >
> > In the INTC, each of the interrupts INT 128 to INT 136 combines 32
> > interrupts. The GICINT IRQ GPIO-OUTPUT pins are connected to the GIC
> > device as IRQs 128 to 136. If a device's IRQ is in the range 128 to
> > 136, its IRQ GPIO-OUTPUT pin is connected to the GICINT or-gates
> > instead of the GIC device.
> >
> > Signed-off-by: Troy Lee 
> > Signed-off-by: Jamin Lin 
> > ---
> >   hw/arm/aspeed_ast27x0.c | 563
> 
> >   hw/arm/meson.build  |   1 +
> >   include/hw/arm/aspeed_soc.h |  26 +-
> >   3 files changed, 588 insertions(+), 2 deletions(-)
> >   create mode 100644 hw/arm/aspeed_ast27x0.c
> >
> > diff --git a/hw/arm/aspeed_ast27x0.c b/hw/arm/aspeed_ast27x0.c new
> > file mode 100644 index 00..a3a03fc1ca
> > --- /dev/null
> > +++ b/hw/arm/aspeed_ast27x0.c
> > @@ -0,0 +1,563 @@
> > +/*
> > + * ASPEED SoC 27x0 family
> > + *
> > + * Copyright (C) 2024 ASPEED Technology Inc.
> > + *
> > + * This code is licensed under the GPL version 2 or later.  See
> > + * the COPYING file in the top-level directory.
> > + *
> > + * Implementation extracted from the AST2600 and adapted for AST27x0.
> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "qapi/error.h"
> > +#include "hw/misc/unimp.h"
> > +#include "hw/arm/aspeed_soc.h"
> > +#include "qemu/module.h"
> > +#include "qemu/error-report.h"
> > +#include "hw/i2c/aspeed_i2c.h"
> > +#include "net/net.h"
> > +#include "sysemu/sysemu.h"
> > +#include "hw/intc/arm_gicv3.h"
> > +#include "qapi/qmp/qlist.h"
> > +
> > +static const hwaddr aspeed_soc_ast2700_memmap[] = {
> > +[ASPEED_DEV_SPI_BOOT]  =  0x4,
> > +[ASPEED_DEV_SRAM]  =  0x1000,
> > +[ASPEED_DEV_SDMC]  =  0x12C0,
> > +[ASPEED_DEV_SCU]   =  0x12C02000,
> > +[ASPEED_DEV_SCUIO] =  0x14C02000,
> > +[ASPEED_DEV_UART0] =  0X14C33000,
> > +[ASPEED_DEV_UART1] =  0X14C33100,
> > +[ASPEED_DEV_UART2] =  0X14C33200,
> > +[ASPEED_DEV_UART3] =  0X14C33300,
> > +[ASPEED_DEV_UART4] =  0X12C1A000,
> > +[ASPEED_DEV_UART5] =  0X14C33400,
> > +[ASPEED_DEV_UART6] =  0X14C33500,
> > +[ASPEED_DEV_UART7] =  0X14C33600,
> > +[ASPEED_DEV_UART8] =  0X14C33700,
> > +[ASPEED_DEV_UART9] =  0X14C33800,
> > +[ASPEED_DEV_UART10]=  0X14C33900,
> > +[ASPEED_DEV_UART11]=  0X14C33A00,
> > +[ASPEED_DEV_UART12]=  0X14C33B00,
> > +[ASPEED_DEV_WDT]   =  0x14C37000,
> > +[ASPEED_DEV_VUART] =  0X14C3,
> > +[ASPEED_DEV_FMC]   =  0x1400,
> > +[ASPEED_DEV_SPI0]  =  0x1401,
> > +[ASPEED_DEV_SPI1]  =  0x1402,
> > +[ASPEED_DEV_SPI2]  =  0x1403,
> > +[ASPEED_DEV_SDRAM] =  0x4,
> > +[ASPEED_DEV_MII1]  =  0x1404,
> > +[ASPEED_DEV_MII2]  =  0x14040008,
> > +[ASPEED_DEV_MII3]  =  0x14040010,
> > +[ASPEED_DEV_ETH1]  =  0x1405,
> > +[ASPEED_DEV_ETH2]  =  0x1406,
> > +[ASPEED_DEV_ETH3]  =  0x1407,
> > +[ASPEED_DEV_EMMC]  =  0x1209,
> > +[ASPEED_DEV_INTC]  =  0x1210,
> > +[ASPEED_DEV_SLI]   =  0x12C17000,
> > +[ASPEED_DEV_SLIIO] =  0x14C1E000,
> > +[ASPEED_GIC_DIST]  =  0x1220,
> > +[ASPEED_GIC_REDIST]=  0x1228,
> > +};
> > +
> > +#define AST2700_MAX_IRQ 288
> > +
> > +/* Shared Peripheral Interrupt values below are offset by -32 from
> > +datasheet */ static const int aspeed_soc_ast2700_irqmap[] = {
> > +[ASPEED_DEV_UART0] = 132,
> > +[ASPEED_DEV_UART1] = 132,
> > +[ASPEED_DEV_UART2] = 132,
> > +[ASPEED_DEV_UART3] = 132,
> > +[ASPEED_DEV_UART4] = 8,
> > +[ASPEED_DEV_UART5] = 132,
> > +[ASPEED_DEV_UART6] = 132,
> > +[ASPEED_DEV_UART7] = 132,
> > +[ASPEED_DEV_UART8] = 132,
> > +[ASPEED_DEV_UART9] = 132,
> > +[ASPEED_DEV_UART10]= 132,
> > +[ASPEED_DEV_UART11]= 132,
> > +[ASPEED_DEV_UART12]= 132,
> > +[ASPEED_DEV_FMC]   = 131,
> > +[ASPEED_DEV_SDMC]  = 0,
> > +[ASPEED_DEV_SCU]   = 12,
> > +[ASPEED_DEV_ADC]   =
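
The irqmap above routes all of the UARTs to shared GICINT 132. The fan-in
that the commit message describes, 32 device interrupts combined into one
GIC input, can be modelled with a small or-gate sketch. This is an
illustration only, not the actual INTC implementation; the type and
function names are invented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one GICINT or-gate: up to 32 device IRQ
 * sources are tracked in a pending bitmask, and the single output
 * line to the GIC (e.g. SPI 132 for the UARTs) is asserted while
 * any source is pending. */
typedef struct OrGate {
    uint32_t pending;          /* one bit per combined source IRQ */
} OrGate;

static void or_gate_set_input(OrGate *g, unsigned line, bool level)
{
    if (level) {
        g->pending |= 1u << line;
    } else {
        g->pending &= ~(1u << line);
    }
}

static bool or_gate_output(const OrGate *g)
{
    return g->pending != 0;    /* logical OR of all 32 inputs */
}
```

The output only drops once every combined source has been deasserted,
which is why devices on IRQs 128 to 136 wire their output pins to the
or-gates rather than directly to the GIC.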

Re: [PATCH] Use "void *" as parameter for functions that are used for aio_set_event_notifier()

2024-05-30 Thread Thomas Huth

On 29/05/2024 20.22, Stefan Hajnoczi wrote:

On Wed, May 29, 2024 at 07:49:48PM +0200, Thomas Huth wrote:

aio_set_event_notifier() and aio_set_event_notifier_poll() in
util/aio-posix.c and util/aio-win32.c are casting function pointers of
functions that take an "EventNotifier *" pointer as parameter to function
pointers that take a "void *" pointer as parameter (i.e. the IOHandler
type). When those function pointers are later used to call the referenced
function, this triggers undefined behavior errors with the latest version
of Clang in Fedora 40 when compiling with the option "-fsanitize=undefined".
And this also prevents enabling the strict mode of CFI which is currently
disabled with -fsanitize-cfi-icall-generalize-pointers. Thus let us avoid
the problem by using "void *" as parameter in all spots where it is needed.

Signed-off-by: Thomas Huth 
---
  Yes, I know, the patch looks ugly ... but I don't see a better way to
  tackle this. If someone has a better idea, suggestions are welcome!


An alternative is adding EventNotifierHandler *io_read, *io_poll_ready,
*io_poll_begin, and *io_poll_end fields to EventNotifier so that
aio_set_event_notifier() and aio_set_event_notifier_poll() can pass
helper functions to the underlying aio_set_fd_handler() and
aio_set_fd_poll() APIs. These helper functions then invoke the
EventNotifier callbacks:

/* Helpers */
static void event_notifier_io_read(void *opaque)
{
 EventNotifier *notifier = opaque;
 notifier->io_read(notifier);
}
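
A self-contained sketch of that adapter pattern, with illustrative field
names rather than QEMU's actual layout: the event loop only knows
"void (*)(void *)" handlers (QEMU's IOHandler type), so a helper with
exactly that signature recovers the typed object and invokes its typed
callback, with no function-pointer cast anywhere.

```c
#include <assert.h>

typedef struct EventNotifier EventNotifier;
typedef void EventNotifierHandler(EventNotifier *);

struct EventNotifier {
    EventNotifierHandler *io_read;   /* typed callback */
    int reads;                       /* demo state for the sketch */
};

/* Helper with the IOHandler-compatible "void *" signature: recover
 * the typed pointer and dispatch to the typed callback. */
static void event_notifier_io_read(void *opaque)
{
    EventNotifier *notifier = opaque;
    notifier->io_read(notifier);
}

static void count_read(EventNotifier *n)
{
    n->reads++;
}
```

aio_set_fd_handler() would then be given event_notifier_io_read as the
handler and the notifier itself as the opaque pointer, keeping both call
sites type-correct.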


That's a nice idea, thanks, I'll give it a try!

 Thomas





Re: [PATCH V2 1/3] vl: Allow multiple -overcommit commands

2024-05-30 Thread Thomas Huth

On 30/05/2024 16.01, Zhao Liu wrote:

On Mon, May 27, 2024 at 07:19:56AM +0200, Thomas Huth wrote:

Date: Mon, 27 May 2024 07:19:56 +0200
From: Thomas Huth 
Subject: Re: [PATCH V2 1/3] vl: Allow multiple -overcommit commands

On 24/05/2024 22.00, Zide Chen wrote:

Both cpu-pm and mem-lock are related to system resource overcommit, but
they are separate from each other, in terms of how they are realized,
and of course, they are applied to different system resources.

It's tempting to use separate command-line options to specify their
behavior; e.g., in the following example, the cpu-pm setting is quietly
overwritten, and it's not easy to notice without careful inspection.

--overcommit mem-lock=on
--overcommit cpu-pm=on

Fixes: c8c9dc42b7ca ("Remove the deprecated -realtime option")
Suggested-by: Thomas Huth 
Signed-off-by: Zide Chen 
---

v2:

Thanks to Thomas' suggestion, changed to this better approach, which
is more generic and can handle situations like: "enabled the option in
the config file, and now you'd like to disable it on the command line
again".

   system/vl.c | 4 ++--
   1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/system/vl.c b/system/vl.c
index a3eede5fa5b8..dfa6cdd9283b 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -3545,8 +3545,8 @@ void qemu_init(int argc, char **argv)
   if (!opts) {
   exit(1);
   }
-enable_mlock = qemu_opt_get_bool(opts, "mem-lock", false);
-enable_cpu_pm = qemu_opt_get_bool(opts, "cpu-pm", false);
+enable_mlock = qemu_opt_get_bool(opts, "mem-lock", 
enable_mlock);
+enable_cpu_pm = qemu_opt_get_bool(opts, "cpu-pm", 
enable_cpu_pm);
   break;
   case QEMU_OPTION_compat:
   {
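
Why the fix works can be shown with a standalone sketch (the option-string
parser below is a crude stand-in for QemuOpts, not QEMU's real code): each
-overcommit occurrence is parsed with the *current* global as the default,
so a later "-overcommit cpu-pm=on" no longer resets an earlier
"-overcommit mem-lock=on" back to false.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Crude lookup: return the value of "name=on"/"name=off" in the option
 * string, or defval when the name is absent. */
static bool opt_get_bool(const char *opts, const char *name, bool defval)
{
    const char *p = strstr(opts, name);
    if (!p) {
        return defval;
    }
    return strncmp(p + strlen(name), "=on", 3) == 0;
}

/* Mirrors the patched vl.c hunk: the current value is the default, so
 * options absent from this occurrence keep their earlier setting. */
static void apply_overcommit(const char *opts, bool *mem_lock, bool *cpu_pm)
{
    *mem_lock = opt_get_bool(opts, "mem-lock", *mem_lock);
    *cpu_pm = opt_get_bool(opts, "cpu-pm", *cpu_pm);
}
```

Before the patch, the second call would have passed `false` as the default
and clobbered mem-lock; with the current-value default, both settings
survive.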


Reviewed-by: Thomas Huth 



Hi Thomas,

BTW, do you think it's a good idea to define overcommit the QAPI way
(defined in a JSON file)? ;-)

My rough understanding is that all APIs are better defined via QAPI to
be JSON compatible.

Sorry, no clue whether it makes sense here... CC:-ing Markus for 
recommendations.


 Thomas





Re: [PATCH] hw/net: prevent potential NULL dereference

2024-05-30 Thread David Gibson
On Thu, May 30, 2024 at 10:03:51AM +0100, Peter Maydell wrote:
> On Thu, 30 May 2024 at 01:52, David Gibson wrote:
> >
> > On Wed, May 29, 2024 at 02:07:18PM +0300, Oleg Sviridov wrote:
> > > The pointer returned from 'spapr_vio_find_by_reg' may be NULL and
> > > is dereferenced immediately afterwards.
> > >
> > > Found by Linux Verification Center (linuxtesting.org) with SVACE.
> > >
> > > Signed-off-by: Oleg Sviridov 
> > > ---
> > >  hw/net/spapr_llan.c | 4 
> > >  1 file changed, 4 insertions(+)
> > >
> > > diff --git a/hw/net/spapr_llan.c b/hw/net/spapr_llan.c
> > > index ecb30b7c76..f40b733229 100644
> > > --- a/hw/net/spapr_llan.c
> > > +++ b/hw/net/spapr_llan.c
> > > @@ -770,6 +770,10 @@ static target_ulong h_change_logical_lan_mac(PowerPCCPU *cpu,
> > >  SpaprVioVlan *dev = VIO_SPAPR_VLAN_DEVICE(sdev);
> >
> > Hmm... I thought VIO_SPAPR_VLAN_DEVICE() was supposed to abort if sdev
> > was NULL or not of the right type.  Or have the rules for qom helpers
> > changed since I wrote this.
> 
> QOM casts abort if the type is wrong, but a NULL pointer is
> passed through as a NULL pointer.

Ah, my mistake.  LGTM, then.
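
The behaviour being discussed (a QOM cast aborts on a wrong type but
passes NULL through) means the lookup result must be checked before use.
A standalone illustration of the guard the patch adds, with invented
types and an illustrative error value rather than the real spapr code:

```c
#include <assert.h>
#include <stddef.h>

#define H_SUCCESS   0
#define H_PARAMETER (-4)   /* illustrative hypervisor error code */

typedef struct Device { int reg; } Device;

static Device devices[2] = { { 0x1000 }, { 0x2000 } };

/* A lookup that can legitimately fail: returns NULL for unknown reg.
 * A QOM-style cast of that NULL would pass it through unchanged, and
 * the first field access would crash. */
static Device *find_by_reg(int reg)
{
    for (size_t i = 0; i < 2; i++) {
        if (devices[i].reg == reg) {
            return &devices[i];
        }
    }
    return NULL;
}

static int change_mac(int reg)
{
    Device *dev = find_by_reg(reg);
    if (!dev) {
        return H_PARAMETER;    /* the guard the patch adds */
    }
    return H_SUCCESS;
}
```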

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH 4/4] hw/s390x: Deprecate the QMP @dump-skeys command

2024-05-30 Thread Thomas Huth

On 30/05/2024 09.45, Philippe Mathieu-Daudé wrote:

Prefer @dump-s390-skeys which is target agnostic.

Signed-off-by: Philippe Mathieu-Daudé 
---
  docs/about/deprecated.rst | 5 +
  qapi/misc-target.json | 5 +
  2 files changed, 10 insertions(+)

diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
index 40585ca7d5..3cb43085ba 100644
--- a/docs/about/deprecated.rst
+++ b/docs/about/deprecated.rst
@@ -148,6 +148,11 @@ accepted incorrect commands will return an error. Users 
should make sure that
  all arguments passed to ``device_add`` are consistent with the documented
  property types.
  
+``dump-skeys`` (since 9.1)

+''
+
+Use the more generic ``dump-s390-skeys`` command.


FWIW, that's "more specific", not "more generic".

But as I said in my reply to the cover letter, we should maybe consider a 
more generic command instead, indeed.


 Thomas




Re: [PATCH 0/4] hw/s390x: Alias @dump-skeys -> @dump-s390-skey and deprecate

2024-05-30 Thread Thomas Huth

On 30/05/2024 09.45, Philippe Mathieu-Daudé wrote:

We are trying to unify all qemu-system-FOO into a single binary.
In order to do that we need to remove QAPI target-specific code.

@dump-skeys is only available in qemu-system-s390x. This series
renames it as @dump-s390-skeys, making it available in other
binaries. We take care of backward compatibility via deprecation.

Philippe Mathieu-Daudé (4):
   hw/s390x: Introduce the @dump-s390-skeys QMP command
   hw/s390x: Introduce the 'dump_s390_skeys' HMP command
   hw/s390x: Deprecate the HMP 'dump_skeys' command
   hw/s390x: Deprecate the QMP @dump-skeys command


Why do we have to rename the command? Just for the sake of it? I think 
renaming HMP commands is maybe ok, but breaking the API in QMP is something 
you should consider twice.


And even if we decide to rename ... maybe we should discuss whether it makes 
sense to come up with a generic command instead: As far as I know, ARM also 
has something similar, called MTE. Maybe we also want to dump MTE keys one 
day? So the new command should maybe be called "dump-memory-keys" instead? 
Or should it maybe rather be an option to the existing "dump-guest-memory" 
command instead?


 Thomas




Re: Unexpected error in rme_configure_one() at ../target/arm/kvm-rme.c:159

2024-05-30 Thread Itaru Kitayama



> On May 30, 2024, at 22:30, Philippe Mathieu-Daudé  wrote:
> 
> Cc'ing more developers
> 
> On 30/5/24 06:30, Itaru Kitayama wrote:
>> Hi,
>> When I see a Realm VM creation fails with:
>> Unexpected error in rme_configure_one() at ../target/arm/kvm-rme.c:159:
>> qemu-system-aarch64: RME: failed to configure SVE: Invalid argument
>> test.sh: line 8:  2502 Aborted qemu-system-aarch64 -M 
>> 'virt,acpi=off,gic-version=3' -cpu host -enable-kvm -smp 2 -m 512M 
>> -overcommit 'mem-lock=on' -M 'confidential-guest-support=rme0' -object 
>> 'rme-guest,id=rme0,measurement-algo=sha512,num-pmu-counters=6,sve-vector-length=256'
>>  -kernel Image -initrd rootfs.cpio -append 'earycon console=ttyAMA0 
>> rdinit=/sbin/init' -nographic -net none
>> do I need to suspect first the VMM, QEMU, or the Image? The kernel is built 
>> with LLVM, does it matter?
>> Thanks,
>> Itaru.
> 

I’m testing Jean’s repo at:

https://git.codelinaro.org/linaro/dcap/qemu/-/tree/cca/v2?ref_type=heads

Itaru.


[PATCH v4 09/10] hw/nvme: add reservation protocol command

2024-05-30 Thread Changqi Lu
Add reservation acquire, reservation register,
reservation release and reservation report commands
to the nvme device layer.

These commands enable the nvme device to perform
reservation-related tasks, including querying keys,
querying reservation status, registering reservation
keys, initiating and releasing reservations, as well
as clearing and preempting reservations held by
other keys.

These commands are crucial for management and control of
shared storage resources in a persistent manner.

Signed-off-by: Changqi Lu 
Signed-off-by: zhenwei pi 
---
 hw/nvme/ctrl.c   | 321 ++-
 hw/nvme/nvme.h   |   4 +
 include/block/nvme.h |  37 +
 3 files changed, 361 insertions(+), 1 deletion(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 182307a48b..033abd0afe 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -294,6 +294,10 @@ static const uint32_t nvme_cse_iocs_nvm[256] = {
 [NVME_CMD_COMPARE]  = NVME_CMD_EFF_CSUPP,
 [NVME_CMD_IO_MGMT_RECV] = NVME_CMD_EFF_CSUPP,
 [NVME_CMD_IO_MGMT_SEND] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
+[NVME_CMD_RESV_REGISTER]= NVME_CMD_EFF_CSUPP,
+[NVME_CMD_RESV_REPORT]  = NVME_CMD_EFF_CSUPP,
+[NVME_CMD_RESV_ACQUIRE] = NVME_CMD_EFF_CSUPP,
+[NVME_CMD_RESV_RELEASE] = NVME_CMD_EFF_CSUPP,
 };
 
 static const uint32_t nvme_cse_iocs_zoned[256] = {
@@ -308,6 +312,10 @@ static const uint32_t nvme_cse_iocs_zoned[256] = {
 [NVME_CMD_ZONE_APPEND]  = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
 [NVME_CMD_ZONE_MGMT_SEND]   = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
 [NVME_CMD_ZONE_MGMT_RECV]   = NVME_CMD_EFF_CSUPP,
+[NVME_CMD_RESV_REGISTER]= NVME_CMD_EFF_CSUPP,
+[NVME_CMD_RESV_REPORT]  = NVME_CMD_EFF_CSUPP,
+[NVME_CMD_RESV_ACQUIRE] = NVME_CMD_EFF_CSUPP,
+[NVME_CMD_RESV_RELEASE] = NVME_CMD_EFF_CSUPP,
 };
 
 static void nvme_process_sq(void *opaque);
@@ -1745,6 +1753,7 @@ static void nvme_aio_err(NvmeRequest *req, int ret)
 
 switch (req->cmd.opcode) {
 case NVME_CMD_READ:
+case NVME_CMD_RESV_REPORT:
 status = NVME_UNRECOVERED_READ;
 break;
 case NVME_CMD_FLUSH:
@@ -1752,6 +1761,9 @@ static void nvme_aio_err(NvmeRequest *req, int ret)
 case NVME_CMD_WRITE_ZEROES:
 case NVME_CMD_ZONE_APPEND:
 case NVME_CMD_COPY:
+case NVME_CMD_RESV_REGISTER:
+case NVME_CMD_RESV_ACQUIRE:
+case NVME_CMD_RESV_RELEASE:
 status = NVME_WRITE_FAULT;
 break;
 default:
@@ -2127,7 +2139,10 @@ static inline bool nvme_is_write(NvmeRequest *req)
 
 return rw->opcode == NVME_CMD_WRITE ||
rw->opcode == NVME_CMD_ZONE_APPEND ||
-   rw->opcode == NVME_CMD_WRITE_ZEROES;
+   rw->opcode == NVME_CMD_WRITE_ZEROES ||
+   rw->opcode == NVME_CMD_RESV_REGISTER ||
+   rw->opcode == NVME_CMD_RESV_ACQUIRE ||
+   rw->opcode == NVME_CMD_RESV_RELEASE;
 }
 
 static void nvme_misc_cb(void *opaque, int ret)
@@ -2692,6 +2707,302 @@ static uint16_t nvme_verify(NvmeCtrl *n, NvmeRequest *req)
 return NVME_NO_COMPLETE;
 }
 
+typedef struct NvmeKeyInfo {
+uint64_t cr_key;
+uint64_t nr_key;
+} NvmeKeyInfo;
+
+static uint16_t nvme_resv_register(NvmeCtrl *n, NvmeRequest *req)
+{
+int ret;
+NvmeKeyInfo key_info;
+NvmeNamespace *ns = req->ns;
+uint32_t cdw10 = le32_to_cpu(req->cmd.cdw10);
+bool ignore_key = cdw10 >> 3 & 0x1;
+uint8_t action = cdw10 & 0x7;
+uint8_t ptpl = cdw10 >> 30 & 0x3;
+bool aptpl;
+
+switch (ptpl) {
+case NVME_RESV_PTPL_NO_CHANGE:
+aptpl = (ns->id_ns.rescap & NVME_PR_CAP_PTPL) ? true : false;
+break;
+case NVME_RESV_PTPL_DISABLE:
+aptpl = false;
+break;
+case NVME_RESV_PTPL_ENABLE:
+aptpl = true;
+break;
+default:
+return NVME_INVALID_FIELD;
+}
+
+ret = nvme_h2c(n, (uint8_t *)&key_info, sizeof(NvmeKeyInfo), req);
+if (ret) {
+return ret;
+}
+
+switch (action) {
+case NVME_RESV_REGISTER_ACTION_REGISTER:
+req->aiocb = blk_aio_pr_register(ns->blkconf.blk, 0,
+ key_info.nr_key, 0, aptpl,
+ ignore_key, nvme_misc_cb,
+ req);
+break;
+case NVME_RESV_REGISTER_ACTION_UNREGISTER:
+req->aiocb = blk_aio_pr_register(ns->blkconf.blk, key_info.cr_key, 0,
+ 0, aptpl, ignore_key,
+ nvme_misc_cb, req);
+break;
+case NVME_RESV_REGISTER_ACTION_REPLACE:
+req->aiocb = blk_aio_pr_register(ns->blkconf.blk, key_info.cr_key,
+ key_info.nr_key, 0, aptpl, ignore_key,
+ nvme_misc_cb, req);
+break;
+default:
+return
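
The CDW10 field extraction in nvme_resv_register() above can be shown in
isolation. Per the patch's shifts and masks, the action sits in bits 2:0,
the ignore-key flag in bit 3, and the PTPL change field in bits 31:30;
the struct and function names below are illustrative, not from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t action;     /* bits 2:0  - register/unregister/replace */
    bool ignore_key;    /* bit 3     - IEKEY */
    uint8_t ptpl;       /* bits 31:30 - change PTPL state */
} ResvRegisterFields;

/* Decode CDW10 of a Reservation Register command, mirroring the
 * shift/mask expressions in the patch. */
static ResvRegisterFields decode_cdw10(uint32_t cdw10)
{
    ResvRegisterFields f;
    f.action = cdw10 & 0x7;
    f.ignore_key = (cdw10 >> 3) & 0x1;
    f.ptpl = (cdw10 >> 30) & 0x3;
    return f;
}
```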

[PATCH v4 02/10] block/raw: add persistent reservation in/out driver

2024-05-30 Thread Changqi Lu
Add persistent reservation in/out operations for raw driver.
The following methods are implemented: bdrv_co_pr_read_keys,
bdrv_co_pr_read_reservation, bdrv_co_pr_register, bdrv_co_pr_reserve,
bdrv_co_pr_release, bdrv_co_pr_clear and bdrv_co_pr_preempt.

Signed-off-by: Changqi Lu 
Signed-off-by: zhenwei pi 
---
 block/raw-format.c | 56 ++
 1 file changed, 56 insertions(+)

diff --git a/block/raw-format.c b/block/raw-format.c
index ac7e8495f6..3746bc1bd3 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -454,6 +454,55 @@ raw_co_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
 return bdrv_co_ioctl(bs->file->bs, req, buf);
 }
 
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_read_keys(BlockDriverState *bs, uint32_t *generation,
+uint32_t num_keys, uint64_t *keys)
+{
+
+return bdrv_co_pr_read_keys(bs->file->bs, generation, num_keys, keys);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_read_reservation(BlockDriverState *bs, uint32_t *generation,
+   uint64_t *key, BlockPrType *type)
+{
+return bdrv_co_pr_read_reservation(bs->file->bs, generation, key, type);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_register(BlockDriverState *bs, uint64_t old_key,
+   uint64_t new_key, BlockPrType type,
+   bool ptpl, bool ignore_key)
+{
+return bdrv_co_pr_register(bs->file->bs, old_key, new_key,
+   type, ptpl, ignore_key);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_reserve(BlockDriverState *bs, uint64_t key, BlockPrType type)
+{
+return bdrv_co_pr_reserve(bs->file->bs, key, type);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_release(BlockDriverState *bs, uint64_t key, BlockPrType type)
+{
+return bdrv_co_pr_release(bs->file->bs, key, type);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_clear(BlockDriverState *bs, uint64_t key)
+{
+return bdrv_co_pr_clear(bs->file->bs, key);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_preempt(BlockDriverState *bs, uint64_t old_key,
+  uint64_t new_key, BlockPrType type, bool abort)
+{
+return bdrv_co_pr_preempt(bs->file->bs, old_key, new_key, type, abort);
+}
+
 static int GRAPH_RDLOCK raw_has_zero_init(BlockDriverState *bs)
 {
 return bdrv_has_zero_init(bs->file->bs);
@@ -672,6 +721,13 @@ BlockDriver bdrv_raw = {
 .strong_runtime_opts  = raw_strong_runtime_opts,
 .mutable_opts = mutable_opts,
 .bdrv_cancel_in_flight = raw_cancel_in_flight,
+.bdrv_co_pr_read_keys= raw_co_pr_read_keys,
+.bdrv_co_pr_read_reservation = raw_co_pr_read_reservation,
+.bdrv_co_pr_register = raw_co_pr_register,
+.bdrv_co_pr_reserve  = raw_co_pr_reserve,
+.bdrv_co_pr_release  = raw_co_pr_release,
+.bdrv_co_pr_clear= raw_co_pr_clear,
+.bdrv_co_pr_preempt  = raw_co_pr_preempt,
 };
 
 static void bdrv_raw_init(void)
-- 
2.20.1




[PATCH v4 10/10] block/iscsi: add persistent reservation in/out driver

2024-05-30 Thread Changqi Lu
Add persistent reservation in/out operations for the iSCSI driver.
The following methods are implemented: bdrv_co_pr_read_keys,
bdrv_co_pr_read_reservation, bdrv_co_pr_register, bdrv_co_pr_reserve,
bdrv_co_pr_release, bdrv_co_pr_clear and bdrv_co_pr_preempt.

Signed-off-by: Changqi Lu 
Signed-off-by: zhenwei pi 
---
 block/iscsi.c | 443 ++
 1 file changed, 443 insertions(+)

diff --git a/block/iscsi.c b/block/iscsi.c
index 2ff14b7472..d94ebe35bd 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -96,6 +96,7 @@ typedef struct IscsiLun {
 unsigned long *allocmap_valid;
 long allocmap_size;
 int cluster_size;
+uint8_t pr_cap;
 bool use_16_for_rw;
 bool write_protected;
 bool lbpme;
@@ -280,6 +281,8 @@ iscsi_co_generic_cb(struct iscsi_context *iscsi, int status,
 iTask->err_code = -error;
 iTask->err_str = g_strdup(iscsi_get_error(iscsi));
 }
+} else if (status == SCSI_STATUS_RESERVATION_CONFLICT) {
+iTask->err_code = -EBADE;
 }
 }
 }
@@ -1792,6 +1795,52 @@ static void iscsi_save_designator(IscsiLun *lun,
 }
 }
 
+static void iscsi_get_pr_cap_sync(IscsiLun *iscsilun, Error **errp)
+{
+struct scsi_task *task = NULL;
+struct scsi_persistent_reserve_in_report_capabilities *rc = NULL;
+int retries = ISCSI_CMD_RETRIES;
+int xferlen = sizeof(struct scsi_persistent_reserve_in_report_capabilities);
+
+do {
+if (task != NULL) {
+scsi_free_scsi_task(task);
+task = NULL;
+}
+
+task = iscsi_persistent_reserve_in_sync(iscsilun->iscsi,
+   iscsilun->lun, SCSI_PR_IN_REPORT_CAPABILITIES, xferlen);
+if (task != NULL && task->status == SCSI_STATUS_GOOD) {
+rc = scsi_datain_unmarshall(task);
+if (rc == NULL) {
+error_setg(errp,
+"iSCSI: Failed to unmarshall report capabilities data.");
+} else {
+iscsilun->pr_cap =
+scsi_pr_cap_to_block(rc->persistent_reservation_type_mask);
+iscsilun->pr_cap |= (rc->ptpl_a) ? BLK_PR_CAP_PTPL : 0;
+}
+break;
+}
+
+if (task != NULL && task->status == SCSI_STATUS_CHECK_CONDITION
+&& task->sense.key == SCSI_SENSE_UNIT_ATTENTION) {
+break;
+}
+
+} while (task != NULL && task->status == SCSI_STATUS_CHECK_CONDITION
+ && task->sense.key == SCSI_SENSE_UNIT_ATTENTION
+ && retries-- > 0);
+
+if (task == NULL || task->status != SCSI_STATUS_GOOD) {
+error_setg(errp, "iSCSI: failed to send report capabilities command");
+}
+
+if (task) {
+scsi_free_scsi_task(task);
+}
+}
+
 static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
   Error **errp)
 {
@@ -2024,6 +2073,11 @@ static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
 bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
 }
 
+iscsi_get_pr_cap_sync(iscsilun, &local_err);
+if (local_err != NULL) {
+error_propagate(errp, local_err);
+ret = -EINVAL;
+}
 out:
 qemu_opts_del(opts);
 g_free(initiator_name);
@@ -2110,6 +2164,8 @@ static void iscsi_refresh_limits(BlockDriverState *bs, Error **errp)
 bs->bl.opt_transfer = pow2floor(iscsilun->bl.opt_xfer_len *
 iscsilun->block_size);
 }
+
+bs->bl.pr_cap = iscsilun->pr_cap;
 }
 
 /* Note that this will not re-establish a connection with an iSCSI target - it
@@ -2408,6 +2464,385 @@ out_unlock:
 return r;
 }
 
+static int coroutine_fn
+iscsi_co_pr_read_keys(BlockDriverState *bs, uint32_t *generation,
+  uint32_t num_keys, uint64_t *keys)
+{
+IscsiLun *iscsilun = bs->opaque;
+QEMUIOVector qiov;
+struct IscsiTask iTask;
+int xferlen = sizeof(struct scsi_persistent_reserve_in_read_keys) +
+  sizeof(uint64_t) * num_keys;
+uint8_t *buf = g_malloc0(xferlen);
+int32_t num_collect_keys = 0;
+int r = 0;
+
+qemu_iovec_init_buf(&qiov, buf, xferlen);
+iscsi_co_init_iscsitask(iscsilun, &iTask);
+qemu_mutex_lock(&iscsilun->mutex);
+retry:
+iTask.task = iscsi_persistent_reserve_in_task(iscsilun->iscsi,
+ iscsilun->lun, SCSI_PR_IN_READ_KEYS, xferlen,
+ iscsi_co_generic_cb, &iTask);
+
+if (iTask.task == NULL) {
+qemu_mutex_unlock(&iscsilun->mutex);
+return -ENOMEM;
+}
+
+scsi_task_set_iov_in(iTask.task, (struct scsi_iovec *)qiov.iov, qiov.niov);
+iscsi_co_wait_for_task(&iTask, iscsilun);
+
+if (iTask.task != NULL) {
+scsi_free_scsi_task(iTask.task);
+iTask.task = NULL;
+}
+
+if (iTask.do_retry) {
+iTask.complete = 0;
+goto retry;
+}
+
+ 

[PATCH v4 05/10] hw/scsi: add persistent reservation in/out api for scsi device

2024-05-30 Thread Changqi Lu
Add persistent reservation in/out operations to the
SCSI device layer. The persistent reservation in/out
API enables the SCSI device to perform
reservation-related tasks: querying keys, querying
reservation status, registering reservation keys,
initiating and releasing reservations, and clearing
and preempting reservations held by other keys.

These operations are crucial for the management and control
of shared storage resources in a persistent manner.

Signed-off-by: Changqi Lu 
Signed-off-by: zhenwei pi 
---
 hw/scsi/scsi-disk.c | 352 
 1 file changed, 352 insertions(+)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 4bd7af9d0c..0e964dbd87 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -32,6 +32,7 @@
 #include "migration/vmstate.h"
 #include "hw/scsi/emulation.h"
 #include "scsi/constants.h"
+#include "scsi/utils.h"
 #include "sysemu/block-backend.h"
 #include "sysemu/blockdev.h"
 #include "hw/block/block.h"
@@ -42,6 +43,7 @@
 #include "qemu/cutils.h"
 #include "trace.h"
 #include "qom/object.h"
+#include "block/block_int.h"
 
 #ifdef __linux
 #include 
@@ -1474,6 +1476,346 @@ static void scsi_disk_emulate_read_data(SCSIRequest *req)
 scsi_req_complete(&r->req, GOOD);
 }
 
+typedef struct SCSIPrReadKeys {
+uint32_t generation;
+uint32_t num_keys;
+uint64_t *keys;
+void *req;
+} SCSIPrReadKeys;
+
+typedef struct SCSIPrReadReservation {
+uint32_t generation;
+uint64_t key;
+BlockPrType type;
+void *req;
+} SCSIPrReadReservation;
+
+static void scsi_pr_read_keys_complete(void *opaque, int ret)
+{
+int num_keys;
+uint8_t *buf;
+SCSIPrReadKeys *blk_keys = (SCSIPrReadKeys *)opaque;
+SCSIDiskReq *r = (SCSIDiskReq *)blk_keys->req;
+SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
+
+assert(blk_get_aio_context(s->qdev.conf.blk) ==
+qemu_get_current_aio_context());
+
+assert(r->req.aiocb != NULL);
+r->req.aiocb = NULL;
+
+if (scsi_disk_req_check_error(r, ret, true)) {
+goto done;
+}
+
+buf = scsi_req_get_buf(&r->req);
+num_keys = MIN(blk_keys->num_keys, ret);
+blk_keys->generation = cpu_to_be32(blk_keys->generation);
+memcpy(&buf[0], &blk_keys->generation, 4);
+for (int i = 0; i < num_keys; i++) {
+blk_keys->keys[i] = cpu_to_be64(blk_keys->keys[i]);
+memcpy(&buf[8 + i * 8], &blk_keys->keys[i], 8);
+}
+num_keys = cpu_to_be32(num_keys * 8);
+memcpy(&buf[4], &num_keys, 4);
+
+scsi_req_data(&r->req, r->buflen);
+done:
+scsi_req_unref(&r->req);
+g_free(blk_keys->keys);
+g_free(blk_keys);
+}
+
+static int scsi_disk_emulate_pr_read_keys(SCSIRequest *req)
+{
+SCSIPrReadKeys *blk_keys;
+SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req);
+SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev);
+int buflen = MIN(r->req.cmd.xfer, r->buflen);
+int num_keys = (buflen - sizeof(uint32_t) * 2) / sizeof(uint64_t);
+
+blk_keys = g_new0(SCSIPrReadKeys, 1);
+blk_keys->generation = 0;
+/* num_keys is the maximum number of keys that can be transmitted */
+blk_keys->num_keys = num_keys;
+blk_keys->keys = g_malloc(sizeof(uint64_t) * num_keys);
+blk_keys->req = r;
+
+/* The request is used as the AIO opaque value, so add a ref.  */
+scsi_req_ref(&r->req);
+r->req.aiocb = blk_aio_pr_read_keys(s->qdev.conf.blk, &blk_keys->generation,
+blk_keys->num_keys, blk_keys->keys,
+scsi_pr_read_keys_complete, blk_keys);
+return 0;
+}
+
+static void scsi_pr_read_reservation_complete(void *opaque, int ret)
+{
+uint8_t *buf;
+uint32_t additional_len = 0;
+SCSIPrReadReservation *blk_rsv = (SCSIPrReadReservation *)opaque;
+SCSIDiskReq *r = (SCSIDiskReq *)blk_rsv->req;
+SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
+
+assert(blk_get_aio_context(s->qdev.conf.blk) ==
+qemu_get_current_aio_context());
+
+assert(r->req.aiocb != NULL);
+r->req.aiocb = NULL;
+
+if (scsi_disk_req_check_error(r, ret, true)) {
+goto done;
+}
+
+buf = scsi_req_get_buf(&r->req);
+blk_rsv->generation = cpu_to_be32(blk_rsv->generation);
+memcpy(&buf[0], &blk_rsv->generation, 4);
+if (ret) {
+additional_len = cpu_to_be32(16);
+blk_rsv->key = cpu_to_be64(blk_rsv->key);
+memcpy(&buf[8], &blk_rsv->key, 8);
+buf[21] = block_pr_type_to_scsi(blk_rsv->type) & 0xf;
+} else {
+additional_len = cpu_to_be32(0);
+}
+
+memcpy(&buf[4], &additional_len, 4);
+scsi_req_data(&r->req, r->buflen);
+
+done:
+scsi_req_unref(&r->req);
+g_free(blk_rsv);
+}
+
+static int scsi_disk_emulate_pr_read_reservation(SCSIRequest *req)
+{
+SCSIPrReadReservation *blk_rsv;
+SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req);
+SCSIDiskS

[PATCH v4 01/10] block: add persistent reservation in/out api

2024-05-30 Thread Changqi Lu
Add persistent reservation in/out operations
at the block level. The following operations
are included:

- read_keys:retrieves the list of registered keys.
- read_reservation: retrieves the current reservation status.
- register:  registers a new reservation key.
- reserve:   initiates a reservation for a specific key.
- release:   releases a reservation for a specific key.
- clear: clears all existing reservations.
- preempt:   preempts a reservation held by another key.

Signed-off-by: Changqi Lu 
Signed-off-by: zhenwei pi 
---
 block/block-backend.c | 397 ++
 block/io.c| 163 
 include/block/block-common.h  |  40 +++
 include/block/block-io.h  |  20 ++
 include/block/block_int-common.h  |  84 +++
 include/sysemu/block-backend-io.h |  24 ++
 6 files changed, 728 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index db6f9b92a3..6707d94df7 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1770,6 +1770,403 @@ BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
 return blk_aio_prwv(blk, req, 0, buf, blk_aio_ioctl_entry, 0, cb, opaque);
 }
 
+typedef struct BlkPrInCo {
+BlockBackend *blk;
+uint32_t *generation;
+uint32_t num_keys;
+BlockPrType *type;
+uint64_t *keys;
+int ret;
+} BlkPrInCo;
+
+typedef struct BlkPrInCB {
+BlockAIOCB common;
+BlkPrInCo prco;
+bool has_returned;
+} BlkPrInCB;
+
+static const AIOCBInfo blk_pr_in_aiocb_info = {
+.aiocb_size = sizeof(BlkPrInCB),
+};
+
+static void blk_pr_in_complete(BlkPrInCB *acb)
+{
+if (acb->has_returned) {
+acb->common.cb(acb->common.opaque, acb->prco.ret);
+blk_dec_in_flight(acb->prco.blk);
+qemu_aio_unref(acb);
+}
+}
+
+static void blk_pr_in_complete_bh(void *opaque)
+{
+BlkPrInCB *acb = opaque;
+assert(acb->has_returned);
+blk_pr_in_complete(acb);
+}
+
+static BlockAIOCB *blk_aio_pr_in(BlockBackend *blk, uint32_t *generation,
+ uint32_t num_keys, BlockPrType *type,
+ uint64_t *keys, CoroutineEntry co_entry,
+ BlockCompletionFunc *cb, void *opaque)
+{
+BlkPrInCB *acb;
+Coroutine *co;
+
+blk_inc_in_flight(blk);
+acb = blk_aio_get(&blk_pr_in_aiocb_info, blk, cb, opaque);
+acb->prco = (BlkPrInCo) {
+.blk= blk,
+.generation = generation,
+.num_keys   = num_keys,
+.type   = type,
+.ret= NOT_DONE,
+.keys   = keys,
+};
+acb->has_returned = false;
+
+co = qemu_coroutine_create(co_entry, acb);
+aio_co_enter(qemu_get_current_aio_context(), co);
+
+acb->has_returned = true;
+if (acb->prco.ret != NOT_DONE) {
+replay_bh_schedule_oneshot_event(qemu_get_current_aio_context(),
+ blk_pr_in_complete_bh, acb);
+}
+
+return &acb->common;
+}
+
+/* To be called between exactly one pair of blk_inc/dec_in_flight() */
+static int coroutine_fn
+blk_aio_pr_do_read_keys(BlockBackend *blk, uint32_t *generation,
+uint32_t num_keys, uint64_t *keys)
+{
+IO_CODE();
+
+blk_wait_while_drained(blk);
+GRAPH_RDLOCK_GUARD();
+
+if (!blk_co_is_available(blk)) {
+return -ENOMEDIUM;
+}
+
+return bdrv_co_pr_read_keys(blk_bs(blk), generation, num_keys, keys);
+}
+
+static void coroutine_fn blk_aio_pr_read_keys_entry(void *opaque)
+{
+BlkPrInCB *acb = opaque;
+BlkPrInCo *prco = &acb->prco;
+
+prco->ret = blk_aio_pr_do_read_keys(prco->blk, prco->generation,
+prco->num_keys, prco->keys);
+blk_pr_in_complete(acb);
+}
+
+BlockAIOCB *blk_aio_pr_read_keys(BlockBackend *blk, uint32_t *generation,
+ uint32_t num_keys, uint64_t *keys,
+ BlockCompletionFunc *cb, void *opaque)
+{
+IO_CODE();
+return blk_aio_pr_in(blk, generation, num_keys, NULL, keys,
+ blk_aio_pr_read_keys_entry, cb, opaque);
+}
+
+/* To be called between exactly one pair of blk_inc/dec_in_flight() */
+static int coroutine_fn
+blk_aio_pr_do_read_reservation(BlockBackend *blk, uint32_t *generation,
+   uint64_t *key, BlockPrType *type)
+{
+IO_CODE();
+
+blk_wait_while_drained(blk);
+GRAPH_RDLOCK_GUARD();
+
+if (!blk_co_is_available(blk)) {
+return -ENOMEDIUM;
+}
+
+return bdrv_co_pr_read_reservation(blk_bs(blk), generation, key, type);
+}
+
+static void coroutine_fn blk_aio_pr_read_reservation_entry(void *opaque)
+{
+BlkPrInCB *acb = opaque;
+BlkPrInCo *prco = &acb->prco;
+
+prco->ret = blk_aio_pr_do_read_reservation(prco->blk, prco->generation,
+   prco->keys, prco->

[PATCH v4 08/10] hw/nvme: enable ONCS and rescap function

2024-05-30 Thread Changqi Lu
This commit sets the Reservations bit in ONCS to advertise
reservation support at the controller level. It also populates
the rescap field in the namespace from the reservation
capabilities reported by the backend driver.

Signed-off-by: Changqi Lu 
Signed-off-by: zhenwei pi 
---
 hw/nvme/ctrl.c | 3 ++-
 hw/nvme/ns.c   | 5 +
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 127c3d2383..182307a48b 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -8248,7 +8248,8 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
 id->nn = cpu_to_le32(NVME_MAX_NAMESPACES);
 id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROES | NVME_ONCS_TIMESTAMP |
NVME_ONCS_FEATURES | NVME_ONCS_DSM |
-   NVME_ONCS_COMPARE | NVME_ONCS_COPY);
+   NVME_ONCS_COMPARE | NVME_ONCS_COPY |
+   NVME_ONCS_RESRVATIONS);
 
 /*
  * NOTE: If this device ever supports a command set that does NOT use 0x0
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index ea8db175db..320c9bf658 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -20,6 +20,7 @@
 #include "qemu/bitops.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/block-backend.h"
+#include "block/block_int.h"
 
 #include "nvme.h"
 #include "trace.h"
@@ -33,6 +34,7 @@ void nvme_ns_init_format(NvmeNamespace *ns)
 BlockDriverInfo bdi;
 int npdg, ret;
 int64_t nlbas;
+uint8_t blk_pr_cap;
 
 ns->lbaf = id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
 ns->lbasz = 1 << ns->lbaf.ds;
@@ -55,6 +57,9 @@ void nvme_ns_init_format(NvmeNamespace *ns)
 }
 
 id_ns->npda = id_ns->npdg = npdg - 1;
+
+blk_pr_cap = blk_bs(ns->blkconf.blk)->file->bs->bl.pr_cap;
+id_ns->rescap = block_pr_cap_to_nvme(blk_pr_cap);
 }
 
 static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
-- 
2.20.1




[PATCH v4 03/10] scsi/constant: add persistent reservation in/out protocol constants

2024-05-30 Thread Changqi Lu
Add constants for the persistent reservation in/out protocol
in the scsi/constant module. The constants include the persistent
reservation command, type, and scope values defined in sections
6.13 and 6.14 of the SCSI Primary Commands-4 (SPC-4) specification.

Signed-off-by: Changqi Lu 
Signed-off-by: zhenwei pi 
---
 include/scsi/constants.h | 52 
 1 file changed, 52 insertions(+)

diff --git a/include/scsi/constants.h b/include/scsi/constants.h
index 9b98451912..922a314535 100644
--- a/include/scsi/constants.h
+++ b/include/scsi/constants.h
@@ -319,4 +319,56 @@
 #define IDENT_DESCR_TGT_DESCR_SIZE 32
 #define XCOPY_BLK2BLK_SEG_DESC_SIZE 28
 
+typedef enum {
+SCSI_PR_WRITE_EXCLUSIVE = 0x01,
+SCSI_PR_EXCLUSIVE_ACCESS= 0x03,
+SCSI_PR_WRITE_EXCLUSIVE_REGS_ONLY   = 0x05,
+SCSI_PR_EXCLUSIVE_ACCESS_REGS_ONLY  = 0x06,
+SCSI_PR_WRITE_EXCLUSIVE_ALL_REGS= 0x07,
+SCSI_PR_EXCLUSIVE_ACCESS_ALL_REGS   = 0x08,
+} SCSIPrType;
+
+typedef enum {
+SCSI_PR_LU_SCOPE  = 0x00,
+} SCSIPrScope;
+
+typedef enum {
+SCSI_PR_OUT_REGISTER = 0x0,
+SCSI_PR_OUT_RESERVE  = 0x1,
+SCSI_PR_OUT_RELEASE  = 0x2,
+SCSI_PR_OUT_CLEAR= 0x3,
+SCSI_PR_OUT_PREEMPT  = 0x4,
+SCSI_PR_OUT_PREEMPT_AND_ABORT= 0x5,
+SCSI_PR_OUT_REG_AND_IGNORE_KEY   = 0x6,
+SCSI_PR_OUT_REG_AND_MOVE = 0x7,
+} SCSIPrOutAction;
+
+typedef enum {
+SCSI_PR_IN_READ_KEYS = 0x0,
+SCSI_PR_IN_READ_RESERVATION  = 0x1,
+SCSI_PR_IN_REPORT_CAPABILITIES   = 0x2,
+} SCSIPrInAction;
+
+typedef enum {
+/* Exclusive Access All Registrants reservation type */
+SCSI_PR_CAP_EX_AC_AR = 1 << 0,
+/* Write Exclusive reservation type */
+SCSI_PR_CAP_WR_EX = 1 << 9,
+/* Exclusive Access reservation type */
+SCSI_PR_CAP_EX_AC = 1 << 11,
+/* Write Exclusive Registrants Only reservation type */
+SCSI_PR_CAP_WR_EX_RO = 1 << 13,
+/* Exclusive Access Registrants Only reservation type */
+SCSI_PR_CAP_EX_AC_RO = 1 << 14,
+/* Write Exclusive All Registrants reservation type */
+SCSI_PR_CAP_WR_EX_AR = 1 << 15,
+
+SCSI_PR_CAP_ALL = (SCSI_PR_CAP_EX_AC_AR |
+  SCSI_PR_CAP_WR_EX |
+  SCSI_PR_CAP_EX_AC |
+  SCSI_PR_CAP_WR_EX_RO |
+  SCSI_PR_CAP_EX_AC_RO |
+  SCSI_PR_CAP_WR_EX_AR),
+} SCSIPrCap;
+
 #endif
-- 
2.20.1




[PATCH v4 04/10] scsi/util: add helper functions for persistent reservation types conversion

2024-05-30 Thread Changqi Lu
This commit introduces helper functions that convert
the persistent reservation types and capability masks
used in the SCSI protocol to and from those used in
the block layer.

Signed-off-by: Changqi Lu 
Signed-off-by: zhenwei pi 
---
 include/scsi/utils.h |  8 +
 scsi/utils.c | 81 
 2 files changed, 89 insertions(+)

diff --git a/include/scsi/utils.h b/include/scsi/utils.h
index d5c8efa16e..89a0b082fb 100644
--- a/include/scsi/utils.h
+++ b/include/scsi/utils.h
@@ -1,6 +1,8 @@
 #ifndef SCSI_UTILS_H
 #define SCSI_UTILS_H
 
+#include "block/block-common.h"
+#include "scsi/constants.h"
 #ifdef CONFIG_LINUX
 #include 
 #endif
@@ -135,6 +137,12 @@ uint32_t scsi_data_cdb_xfer(uint8_t *buf);
 uint32_t scsi_cdb_xfer(uint8_t *buf);
 int scsi_cdb_length(uint8_t *buf);
 
+BlockPrType scsi_pr_type_to_block(SCSIPrType type);
+SCSIPrType block_pr_type_to_scsi(BlockPrType type);
+
+uint8_t scsi_pr_cap_to_block(uint16_t scsi_pr_cap);
+uint16_t block_pr_cap_to_scsi(uint8_t block_pr_cap);
+
 /* Linux SG_IO interface.  */
 #ifdef CONFIG_LINUX
 #define SG_ERR_DRIVER_TIMEOUT  0x06
diff --git a/scsi/utils.c b/scsi/utils.c
index 357b036671..0dfdeb499d 100644
--- a/scsi/utils.c
+++ b/scsi/utils.c
@@ -658,3 +658,84 @@ int scsi_sense_from_host_status(uint8_t host_status,
 }
 return GOOD;
 }
+
+BlockPrType scsi_pr_type_to_block(SCSIPrType type)
+{
+switch (type) {
+case SCSI_PR_WRITE_EXCLUSIVE:
+return BLK_PR_WRITE_EXCLUSIVE;
+case SCSI_PR_EXCLUSIVE_ACCESS:
+return BLK_PR_EXCLUSIVE_ACCESS;
+case SCSI_PR_WRITE_EXCLUSIVE_REGS_ONLY:
+return BLK_PR_WRITE_EXCLUSIVE_REGS_ONLY;
+case SCSI_PR_EXCLUSIVE_ACCESS_REGS_ONLY:
+return BLK_PR_EXCLUSIVE_ACCESS_REGS_ONLY;
+case SCSI_PR_WRITE_EXCLUSIVE_ALL_REGS:
+return BLK_PR_WRITE_EXCLUSIVE_ALL_REGS;
+case SCSI_PR_EXCLUSIVE_ACCESS_ALL_REGS:
+return BLK_PR_EXCLUSIVE_ACCESS_ALL_REGS;
+}
+
+return 0;
+}
+
+SCSIPrType block_pr_type_to_scsi(BlockPrType type)
+{
+switch (type) {
+case BLK_PR_WRITE_EXCLUSIVE:
+return SCSI_PR_WRITE_EXCLUSIVE;
+case BLK_PR_EXCLUSIVE_ACCESS:
+return SCSI_PR_EXCLUSIVE_ACCESS;
+case BLK_PR_WRITE_EXCLUSIVE_REGS_ONLY:
+return SCSI_PR_WRITE_EXCLUSIVE_REGS_ONLY;
+case BLK_PR_EXCLUSIVE_ACCESS_REGS_ONLY:
+return SCSI_PR_EXCLUSIVE_ACCESS_REGS_ONLY;
+case BLK_PR_WRITE_EXCLUSIVE_ALL_REGS:
+return SCSI_PR_WRITE_EXCLUSIVE_ALL_REGS;
+case BLK_PR_EXCLUSIVE_ACCESS_ALL_REGS:
+return SCSI_PR_EXCLUSIVE_ACCESS_ALL_REGS;
+}
+
+return 0;
+}
+
+
+uint8_t scsi_pr_cap_to_block(uint16_t scsi_pr_cap)
+{
+uint8_t res = 0;
+
+res |= (scsi_pr_cap & SCSI_PR_CAP_WR_EX) ?
+   BLK_PR_CAP_WR_EX : 0;
+res |= (scsi_pr_cap & SCSI_PR_CAP_EX_AC) ?
+   BLK_PR_CAP_EX_AC : 0;
+res |= (scsi_pr_cap & SCSI_PR_CAP_WR_EX_RO) ?
+   BLK_PR_CAP_WR_EX_RO : 0;
+res |= (scsi_pr_cap & SCSI_PR_CAP_EX_AC_RO) ?
+   BLK_PR_CAP_EX_AC_RO : 0;
+res |= (scsi_pr_cap & SCSI_PR_CAP_WR_EX_AR) ?
+   BLK_PR_CAP_WR_EX_AR : 0;
+res |= (scsi_pr_cap & SCSI_PR_CAP_EX_AC_AR) ?
+   BLK_PR_CAP_EX_AC_AR : 0;
+
+return res;
+}
+
+uint16_t block_pr_cap_to_scsi(uint8_t block_pr_cap)
+{
+uint16_t res = 0;
+
+res |= (block_pr_cap & BLK_PR_CAP_WR_EX) ?
+  SCSI_PR_CAP_WR_EX : 0;
+res |= (block_pr_cap & BLK_PR_CAP_EX_AC) ?
+  SCSI_PR_CAP_EX_AC : 0;
+res |= (block_pr_cap & BLK_PR_CAP_WR_EX_RO) ?
+  SCSI_PR_CAP_WR_EX_RO : 0;
+res |= (block_pr_cap & BLK_PR_CAP_EX_AC_RO) ?
+  SCSI_PR_CAP_EX_AC_RO : 0;
+res |= (block_pr_cap & BLK_PR_CAP_WR_EX_AR) ?
+  SCSI_PR_CAP_WR_EX_AR : 0;
+res |= (block_pr_cap & BLK_PR_CAP_EX_AC_AR) ?
+  SCSI_PR_CAP_EX_AC_AR : 0;
+
+return res;
+}
-- 
2.20.1




[PATCH v4 00/10] Support persistent reservation operations

2024-05-30 Thread Changqi Lu
Hi,

Patch v4 has been modified. Many thanks to Klaus for the
code review.

Could anyone please review the core block layer and
SCSI layer related code?

v3->v4:
- At the nvme layer, the two patches of enabling the ONCS
  function and enabling rescap are combined into one.
- At the nvme layer, add helper functions for pr capacity
  conversion between the block layer and the nvme layer.

v2->v3:
In v2, Persist Through Power Loss (PTPL) was enabled by default.
In v3, PTPL is configurable and is passed as a parameter.

v1->v2:
- Add sg_persist --report-capabilities for SCSI protocol and enable
  oncs and rescap for NVMe protocol.
- Add persistent reservation capabilities constants and helper functions for
  SCSI and NVMe protocol.
- Add comments for necessary APIs.

v1:
- Add seven APIs about persistent reservation command for block layer.
  These APIs including reading keys, reading reservations, registering,
  reserving, releasing, clearing and preempting.
- Add the necessary pr-related operation APIs for both the
  SCSI protocol and NVMe protocol at the device layer.
- Add a SCSI driver at the driver layer to verify the functions.

Changqi Lu (10):
  block: add persistent reservation in/out api
  block/raw: add persistent reservation in/out driver
  scsi/constant: add persistent reservation in/out protocol constants
  scsi/util: add helper functions for persistent reservation types
conversion
  hw/scsi: add persistent reservation in/out api for scsi device
  block/nvme: add reservation command protocol constants
  hw/nvme: add helper functions for converting reservation types
  hw/nvme: enable ONCS and rescap function
  hw/nvme: add reservation protocal command
  block/iscsi: add persistent reservation in/out driver

 block/block-backend.c | 397 ++
 block/io.c| 163 +++
 block/iscsi.c | 443 ++
 block/raw-format.c|  56 
 hw/nvme/ctrl.c| 324 +-
 hw/nvme/ns.c  |   5 +
 hw/nvme/nvme.h|  84 ++
 hw/scsi/scsi-disk.c   | 352 
 include/block/block-common.h  |  40 +++
 include/block/block-io.h  |  20 ++
 include/block/block_int-common.h  |  84 ++
 include/block/nvme.h  |  98 +++
 include/scsi/constants.h  |  52 
 include/scsi/utils.h  |   8 +
 include/sysemu/block-backend-io.h |  24 ++
 scsi/utils.c  |  81 ++
 16 files changed, 2229 insertions(+), 2 deletions(-)

-- 
2.20.1




[PATCH v4 07/10] hw/nvme: add helper functions for converting reservation types

2024-05-30 Thread Changqi Lu
This commit introduces helper functions that convert
the reservation types and capability masks used in
the NVMe protocol to and from those used in the
block layer.

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Changqi Lu 
Signed-off-by: zhenwei pi 
---
 hw/nvme/nvme.h | 80 ++
 1 file changed, 80 insertions(+)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index bed8191bd5..b1ad27c8f2 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -474,6 +474,86 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
 }
 }
 
+static inline NvmeResvType block_pr_type_to_nvme(BlockPrType type)
+{
+switch (type) {
+case BLK_PR_WRITE_EXCLUSIVE:
+return NVME_RESV_WRITE_EXCLUSIVE;
+case BLK_PR_EXCLUSIVE_ACCESS:
+return NVME_RESV_EXCLUSIVE_ACCESS;
+case BLK_PR_WRITE_EXCLUSIVE_REGS_ONLY:
+return NVME_RESV_WRITE_EXCLUSIVE_REGS_ONLY;
+case BLK_PR_EXCLUSIVE_ACCESS_REGS_ONLY:
+return NVME_RESV_EXCLUSIVE_ACCESS_REGS_ONLY;
+case BLK_PR_WRITE_EXCLUSIVE_ALL_REGS:
+return NVME_RESV_WRITE_EXCLUSIVE_ALL_REGS;
+case BLK_PR_EXCLUSIVE_ACCESS_ALL_REGS:
+return NVME_RESV_EXCLUSIVE_ACCESS_ALL_REGS;
+}
+
+return 0;
+}
+
+static inline BlockPrType nvme_pr_type_to_block(NvmeResvType type)
+{
+switch (type) {
+case NVME_RESV_WRITE_EXCLUSIVE:
+return BLK_PR_WRITE_EXCLUSIVE;
+case NVME_RESV_EXCLUSIVE_ACCESS:
+return BLK_PR_EXCLUSIVE_ACCESS;
+case NVME_RESV_WRITE_EXCLUSIVE_REGS_ONLY:
+return BLK_PR_WRITE_EXCLUSIVE_REGS_ONLY;
+case NVME_RESV_EXCLUSIVE_ACCESS_REGS_ONLY:
+return BLK_PR_EXCLUSIVE_ACCESS_REGS_ONLY;
+case NVME_RESV_WRITE_EXCLUSIVE_ALL_REGS:
+return BLK_PR_WRITE_EXCLUSIVE_ALL_REGS;
+case NVME_RESV_EXCLUSIVE_ACCESS_ALL_REGS:
+return BLK_PR_EXCLUSIVE_ACCESS_ALL_REGS;
+}
+
+return 0;
+}
+
+static inline uint8_t nvme_pr_cap_to_block(uint16_t nvme_pr_cap)
+{
+uint8_t res = 0;
+
+res |= (nvme_pr_cap & NVME_PR_CAP_WR_EX) ?
+   BLK_PR_CAP_WR_EX : 0;
+res |= (nvme_pr_cap & NVME_PR_CAP_EX_AC) ?
+   BLK_PR_CAP_EX_AC : 0;
+res |= (nvme_pr_cap & NVME_PR_CAP_WR_EX_RO) ?
+   BLK_PR_CAP_WR_EX_RO : 0;
+res |= (nvme_pr_cap & NVME_PR_CAP_EX_AC_RO) ?
+   BLK_PR_CAP_EX_AC_RO : 0;
+res |= (nvme_pr_cap & NVME_PR_CAP_WR_EX_AR) ?
+   BLK_PR_CAP_WR_EX_AR : 0;
+res |= (nvme_pr_cap & NVME_PR_CAP_EX_AC_AR) ?
+   BLK_PR_CAP_EX_AC_AR : 0;
+
+return res;
+}
+
+static inline uint8_t block_pr_cap_to_nvme(uint8_t block_pr_cap)
+{
+uint16_t res = 0;
+
+res |= (block_pr_cap & BLK_PR_CAP_WR_EX) ?
+  NVME_PR_CAP_WR_EX : 0;
+res |= (block_pr_cap & BLK_PR_CAP_EX_AC) ?
+  NVME_PR_CAP_EX_AC : 0;
+res |= (block_pr_cap & BLK_PR_CAP_WR_EX_RO) ?
+  NVME_PR_CAP_WR_EX_RO : 0;
+res |= (block_pr_cap & BLK_PR_CAP_EX_AC_RO) ?
+  NVME_PR_CAP_EX_AC_RO : 0;
+res |= (block_pr_cap & BLK_PR_CAP_WR_EX_AR) ?
+  NVME_PR_CAP_WR_EX_AR : 0;
+res |= (block_pr_cap & BLK_PR_CAP_EX_AC_AR) ?
+  NVME_PR_CAP_EX_AC_AR : 0;
+
+return res;
+}
+
 typedef struct NvmeSQueue {
 struct NvmeCtrl *ctrl;
 uint16_tsqid;
-- 
2.20.1




[PATCH v4 06/10] block/nvme: add reservation command protocol constants

2024-05-30 Thread Changqi Lu
Add constants for the NVMe persistent reservation command protocol.
The constants include the reservation command opcodes and
reservation type values defined in section 7 of the NVMe
2.0 specification.

Signed-off-by: Changqi Lu 
Signed-off-by: zhenwei pi 
---
 include/block/nvme.h | 61 
 1 file changed, 61 insertions(+)

diff --git a/include/block/nvme.h b/include/block/nvme.h
index bb231d0b9a..da6ccb0f3b 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -633,6 +633,10 @@ enum NvmeIoCommands {
 NVME_CMD_WRITE_ZEROES   = 0x08,
 NVME_CMD_DSM= 0x09,
 NVME_CMD_VERIFY = 0x0c,
+NVME_CMD_RESV_REGISTER  = 0x0d,
+NVME_CMD_RESV_REPORT= 0x0e,
+NVME_CMD_RESV_ACQUIRE   = 0x11,
+NVME_CMD_RESV_RELEASE   = 0x15,
 NVME_CMD_IO_MGMT_RECV   = 0x12,
 NVME_CMD_COPY   = 0x19,
 NVME_CMD_IO_MGMT_SEND   = 0x1d,
@@ -641,6 +645,63 @@ enum NvmeIoCommands {
 NVME_CMD_ZONE_APPEND= 0x7d,
 };
 
+typedef enum {
+NVME_RESV_REGISTER_ACTION_REGISTER  = 0x00,
+NVME_RESV_REGISTER_ACTION_UNREGISTER= 0x01,
+NVME_RESV_REGISTER_ACTION_REPLACE   = 0x02,
+} NvmeReservationRegisterAction;
+
+typedef enum {
+NVME_RESV_RELEASE_ACTION_RELEASE= 0x00,
+NVME_RESV_RELEASE_ACTION_CLEAR  = 0x01,
+} NvmeReservationReleaseAction;
+
+typedef enum {
+NVME_RESV_ACQUIRE_ACTION_ACQUIRE= 0x00,
+NVME_RESV_ACQUIRE_ACTION_PREEMPT= 0x01,
+NVME_RESV_ACQUIRE_ACTION_PREEMPT_AND_ABORT  = 0x02,
+} NvmeReservationAcquireAction;
+
+typedef enum {
+NVME_RESV_WRITE_EXCLUSIVE   = 0x01,
+NVME_RESV_EXCLUSIVE_ACCESS  = 0x02,
+NVME_RESV_WRITE_EXCLUSIVE_REGS_ONLY = 0x03,
+NVME_RESV_EXCLUSIVE_ACCESS_REGS_ONLY= 0x04,
+NVME_RESV_WRITE_EXCLUSIVE_ALL_REGS  = 0x05,
+NVME_RESV_EXCLUSIVE_ACCESS_ALL_REGS = 0x06,
+} NvmeResvType;
+
+typedef enum {
+NVME_RESV_PTPL_NO_CHANGE = 0x00,
+NVME_RESV_PTPL_DISABLE   = 0x02,
+NVME_RESV_PTPL_ENABLE= 0x03,
+} NvmeResvPTPL;
+
+typedef enum NVMEPrCap {
+/* Persist Through Power Loss */
+NVME_PR_CAP_PTPL = 1 << 0,
+/* Write Exclusive reservation type */
+NVME_PR_CAP_WR_EX = 1 << 1,
+/* Exclusive Access reservation type */
+NVME_PR_CAP_EX_AC = 1 << 2,
+/* Write Exclusive Registrants Only reservation type */
+NVME_PR_CAP_WR_EX_RO = 1 << 3,
+/* Exclusive Access Registrants Only reservation type */
+NVME_PR_CAP_EX_AC_RO = 1 << 4,
+/* Write Exclusive All Registrants reservation type */
+NVME_PR_CAP_WR_EX_AR = 1 << 5,
+/* Exclusive Access All Registrants reservation type */
+NVME_PR_CAP_EX_AC_AR = 1 << 6,
+
+NVME_PR_CAP_ALL = (NVME_PR_CAP_PTPL |
+  NVME_PR_CAP_WR_EX |
+  NVME_PR_CAP_EX_AC |
+  NVME_PR_CAP_WR_EX_RO |
+  NVME_PR_CAP_EX_AC_RO |
+  NVME_PR_CAP_WR_EX_AR |
+  NVME_PR_CAP_EX_AC_AR),
+} NvmePrCap;
+
 typedef struct QEMU_PACKED NvmeDeleteQ {
 uint8_t opcode;
 uint8_t flags;
-- 
2.20.1




Re: tests/avocado: Add LoongArch machine start test

2024-05-30 Thread gaosong

On 2024/5/30 9:16 PM, Jiaxun Yang wrote:


On 30 May 2024, 2:00 PM, gaosong wrote:
[...]

FYI, the test does not seem to work anymore - apparently the binaries
have changed and now the hashes do not match anymore. Could you please
update it? (preferably with some versioned binaries that do not change
in the course of time?)


Thank you, I have sent a patch to fix it.

Hi Song,

As LoongArch EDK2 support was merged long ago, do you want to make a clean
build and add it to the pc-bios directory?

EDK2 LoongArchVirt under OvmfPkg is being committed to upstream.

PR:
https://github.com/tianocore/edk2/pull/5208

Thanks
Song Gao


Thanks
- Jiaxun





Re: [PATCH 5/5] core/cpu-common: initialise plugin state before thread creation

2024-05-30 Thread Pierrick Bouvier

On 5/30/24 12:42, Alex Bennée wrote:

Originally I tried to move vCPU thread initialisation to later
in realize. However pulling that thread (sic) got gnarly really
quickly. It turns out some steps of CPU realization need values that
can only be determined from the running vCPU thread.

However having moved enough out of the thread creation we can now
queue work before the thread starts (at least for TCG guests) and
avoid the race between vcpu_init and other vcpu states a plugin might
subscribe to.

Signed-off-by: Alex Bennée 
---
  hw/core/cpu-common.c | 20 
  1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c
index 6cfc01593a..bf1a7b8892 100644
--- a/hw/core/cpu-common.c
+++ b/hw/core/cpu-common.c
@@ -222,14 +222,6 @@ static void cpu_common_realizefn(DeviceState *dev, Error **errp)
  cpu_resume(cpu);
  }
  
-/* Plugin initialization must wait until the cpu start executing code */
-#ifdef CONFIG_PLUGIN
-if (tcg_enabled()) {
-cpu->plugin_state = qemu_plugin_create_vcpu_state();
-async_run_on_cpu(cpu, qemu_plugin_vcpu_init__async, RUN_ON_CPU_NULL);
-}
-#endif
-
  /* NOTE: latest generic point where the cpu is fully realized */
  }
  
@@ -273,6 +265,18 @@ static void cpu_common_initfn(Object *obj)

  QTAILQ_INIT(&cpu->watchpoints);
  
  cpu_exec_initfn(cpu);

+
+/*
+ * Plugin initialization must wait until the cpu start executing
+ * code, but we must queue this work before the threads are
+ * created to ensure we don't race.
+ */
+#ifdef CONFIG_PLUGIN
+if (tcg_enabled()) {
+cpu->plugin_state = qemu_plugin_create_vcpu_state();
+async_run_on_cpu(cpu, qemu_plugin_vcpu_init__async, RUN_ON_CPU_NULL);
+}
+#endif
  }
  
  static void cpu_common_finalize(Object *obj)


Could you check it works for all combinations?
- user-mode
- system-mode tcg
- system-mode mttcg

When I tried to move this code around, one of them didn't work correctly.

Else,
Reviewed-by: Pierrick Bouvier 


Re: [PATCH 4/5] plugins: remove special casing for cpu->realized

2024-05-30 Thread Pierrick Bouvier

On 5/30/24 12:42, Alex Bennée wrote:

Now that the condition variable is initialised early on, we don't need
to go through hoops to avoid calling async_run_on_cpu.

Signed-off-by: Alex Bennée 
---
  plugins/core.c | 6 +-
  1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/plugins/core.c b/plugins/core.c
index 0726bc7f25..badede28cf 100644
--- a/plugins/core.c
+++ b/plugins/core.c
@@ -65,11 +65,7 @@ static void plugin_cpu_update__locked(gpointer k, gpointer v, gpointer udata)
  CPUState *cpu = container_of(k, CPUState, cpu_index);
  run_on_cpu_data mask = RUN_ON_CPU_HOST_ULONG(*plugin.mask);
  
-if (DEVICE(cpu)->realized) {

-async_run_on_cpu(cpu, plugin_cpu_update__async, mask);
-} else {
-plugin_cpu_update__async(cpu, mask);
-}
+async_run_on_cpu(cpu, plugin_cpu_update__async, mask);
  }
  
  void plugin_unregister_cb__locked(struct qemu_plugin_ctx *ctx,


Reviewed-by: Pierrick Bouvier 


Re: [PATCH 3/5] cpu-target: don't set cpu->thread_id to bogus value

2024-05-30 Thread Pierrick Bouvier

On 5/30/24 12:42, Alex Bennée wrote:

The thread_id isn't valid until the threads are created. There is no
point setting it here. The only thing that cares about the thread_id
is qmp_query_cpus_fast.

Signed-off-by: Alex Bennée 
---
  cpu-target.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/cpu-target.c b/cpu-target.c
index 5af120e8aa..499facf774 100644
--- a/cpu-target.c
+++ b/cpu-target.c
@@ -241,7 +241,6 @@ void cpu_exec_initfn(CPUState *cpu)
  cpu->num_ases = 0;
  
  #ifndef CONFIG_USER_ONLY

-cpu->thread_id = qemu_get_thread_id();
  cpu->memory = get_system_memory();
  object_ref(OBJECT(cpu->memory));
  #endif


Reviewed-by: Pierrick Bouvier 


Re: [PATCH 1/5] hw/core: expand on the alignment of CPUState

2024-05-30 Thread Pierrick Bouvier

On 5/30/24 12:42, Alex Bennée wrote:

Make the relationship between CPUState, ArchCPU and cpu_env a bit
clearer in the kdoc comments.

Signed-off-by: Alex Bennée 
---
  include/hw/core/cpu.h | 14 ++
  1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index bb398e8237..35d345371b 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -391,7 +391,8 @@ struct qemu_work_item;
  #define CPU_UNSET_NUMA_NODE_ID -1
  
  /**

- * CPUState:
+ * struct CPUState - common state of one CPU core or thread.
+ *
   * @cpu_index: CPU index (informative).
   * @cluster_index: Identifies which cluster this CPU is in.
   *   For boards which don't define clusters or for "loose" CPUs not assigned
@@ -439,10 +440,15 @@ struct qemu_work_item;
   * @kvm_fetch_index: Keeps the index that we last fetched from the per-vCPU
   *dirty ring structure.
   *
- * State of one CPU core or thread.
+ * @neg_align: The CPUState is the common part of a concrete ArchCPU
+ * which is allocated when an individual CPU instance is created. As
+ * such, care is taken to ensure there is no gap between
+ * CPUState and CPUArchState within ArchCPU.
   *
- * Align, in order to match possible alignment required by CPUArchState,
- * and eliminate a hole between CPUState and CPUArchState within ArchCPU.
+ * @neg: The architectural register state ("cpu_env") immediately follows CPUState
+ * in ArchCPU and is passed to TCG code. The @neg structure holds some
+ * common TCG CPU variables which are accessed with a negative offset
+ * from cpu_env.
   */
  struct CPUState {
  /*< private >*/


Reviewed-by: Pierrick Bouvier 


Re: [PATCH 2/5] cpu: move Qemu[Thread|Cond] setup into common code

2024-05-30 Thread Pierrick Bouvier

On 5/30/24 12:42, Alex Bennée wrote:

Aside from the round robin threads this is all common code. By
moving the halt_cond setup we also no longer need hacks to work around
the race between QOM object creation and thread creation.

It is a little ugly to free stuff up for the round robin thread but
better that it deals with its own specialisations than making the
other accelerators jump through hoops.

Signed-off-by: Alex Bennée 
---
  include/hw/core/cpu.h |  4 
  accel/dummy-cpus.c|  3 ---
  accel/hvf/hvf-accel-ops.c |  4 
  accel/kvm/kvm-accel-ops.c |  3 ---
  accel/tcg/tcg-accel-ops-mttcg.c   |  4 
  accel/tcg/tcg-accel-ops-rr.c  | 14 +++---
  hw/core/cpu-common.c  |  5 +
  target/i386/nvmm/nvmm-accel-ops.c |  3 ---
  target/i386/whpx/whpx-accel-ops.c |  3 ---
  9 files changed, 16 insertions(+), 27 deletions(-)

diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index 35d345371b..a405119eda 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -404,10 +404,14 @@ struct qemu_work_item;
   * @tcg_cflags: Pre-computed cflags for this cpu.
   * @nr_cores: Number of cores within this CPU package.
   * @nr_threads: Number of threads within this CPU core.
+ * @thread: Host thread details, only live once @created is #true
+ * @sem: WIN32 only semaphore used only for qtest
+ * @thread_id: native thread id of vCPU, only live once @created is #true
   * @running: #true if CPU is currently running (lockless).
   * @has_waiter: #true if a CPU is currently waiting for the cpu_exec_end;
   * valid under cpu_list_lock.
   * @created: Indicates whether the CPU thread has been successfully created.
+ * @halt_cond: condition variable sleeping threads can wait on.
   * @interrupt_request: Indicates a pending interrupt request.
   * @halted: Nonzero if the CPU is in suspended state.
   * @stop: Indicates a pending stop request.
diff --git a/accel/dummy-cpus.c b/accel/dummy-cpus.c
index 20519f1ea4..f32d8c8dc3 100644
--- a/accel/dummy-cpus.c
+++ b/accel/dummy-cpus.c
@@ -68,9 +68,6 @@ void dummy_start_vcpu_thread(CPUState *cpu)
  {
  char thread_name[VCPU_THREAD_NAME_SIZE];
  
-cpu->thread = g_malloc0(sizeof(QemuThread));

-cpu->halt_cond = g_malloc0(sizeof(QemuCond));
-qemu_cond_init(cpu->halt_cond);
  snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/DUMMY",
   cpu->cpu_index);
  qemu_thread_create(cpu->thread, thread_name, dummy_cpu_thread_fn, cpu,
diff --git a/accel/hvf/hvf-accel-ops.c b/accel/hvf/hvf-accel-ops.c
index 40d4187d9d..6f1e27ef46 100644
--- a/accel/hvf/hvf-accel-ops.c
+++ b/accel/hvf/hvf-accel-ops.c
@@ -463,10 +463,6 @@ static void hvf_start_vcpu_thread(CPUState *cpu)
   */
  assert(hvf_enabled());
  
-cpu->thread = g_malloc0(sizeof(QemuThread));

-cpu->halt_cond = g_malloc0(sizeof(QemuCond));
-qemu_cond_init(cpu->halt_cond);
-
  snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
   cpu->cpu_index);
  qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
diff --git a/accel/kvm/kvm-accel-ops.c b/accel/kvm/kvm-accel-ops.c
index 94c828ac8d..c239dfc87a 100644
--- a/accel/kvm/kvm-accel-ops.c
+++ b/accel/kvm/kvm-accel-ops.c
@@ -66,9 +66,6 @@ static void kvm_start_vcpu_thread(CPUState *cpu)
  {
  char thread_name[VCPU_THREAD_NAME_SIZE];
  
-cpu->thread = g_malloc0(sizeof(QemuThread));

-cpu->halt_cond = g_malloc0(sizeof(QemuCond));
-qemu_cond_init(cpu->halt_cond);
  snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/KVM",
   cpu->cpu_index);
  qemu_thread_create(cpu->thread, thread_name, kvm_vcpu_thread_fn,
diff --git a/accel/tcg/tcg-accel-ops-mttcg.c b/accel/tcg/tcg-accel-ops-mttcg.c
index c552b45b8e..49814ec4af 100644
--- a/accel/tcg/tcg-accel-ops-mttcg.c
+++ b/accel/tcg/tcg-accel-ops-mttcg.c
@@ -137,10 +137,6 @@ void mttcg_start_vcpu_thread(CPUState *cpu)
  g_assert(tcg_enabled());
  tcg_cpu_init_cflags(cpu, current_machine->smp.max_cpus > 1);
  
-cpu->thread = g_new0(QemuThread, 1);

-cpu->halt_cond = g_malloc0(sizeof(QemuCond));
-qemu_cond_init(cpu->halt_cond);
-
  /* create a thread per vCPU with TCG (MTTCG) */
  snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/TCG",
   cpu->cpu_index);
diff --git a/accel/tcg/tcg-accel-ops-rr.c b/accel/tcg/tcg-accel-ops-rr.c
index 894e73e52c..84c36c1450 100644
--- a/accel/tcg/tcg-accel-ops-rr.c
+++ b/accel/tcg/tcg-accel-ops-rr.c
@@ -317,22 +317,22 @@ void rr_start_vcpu_thread(CPUState *cpu)
  tcg_cpu_init_cflags(cpu, false);
  
  if (!single_tcg_cpu_thread) {

-cpu->thread = g_new0(QemuThread, 1);
-cpu->halt_cond = g_new0(QemuCond, 1);
-qemu_cond_init(cpu->halt_cond);
+single_tcg_halt_cond = cpu->halt_cond;
+single_tcg_cpu_thread = cpu->thread;
  
  /* share a single thread for all cpus with TCG */

  snprintf(thread_name, VCPU_THREAD_NAME_SIZE, 

[PATCH v3 3/6] sysemu: generalise qtest_warp_clock as qemu_clock_advance_virtual_time

2024-05-30 Thread Pierrick Bouvier
From: Alex Bennée 

Move the key functionality of moving time forward into the clock
sub-system itself. This will allow us to plumb time control into
plugins.

Signed-off-by: Alex Bennée 
Signed-off-by: Pierrick Bouvier 
---
 include/qemu/timer.h | 15 +++
 system/qtest.c   | 25 +++--
 util/qemu-timer.c| 26 ++
 3 files changed, 44 insertions(+), 22 deletions(-)

diff --git a/include/qemu/timer.h b/include/qemu/timer.h
index 9a366e551fb..910587d8293 100644
--- a/include/qemu/timer.h
+++ b/include/qemu/timer.h
@@ -245,6 +245,21 @@ bool qemu_clock_run_timers(QEMUClockType type);
  */
 bool qemu_clock_run_all_timers(void);
 
+/**
+ * qemu_clock_advance_virtual_time(): advance the virtual time tick
+ * @target: target time in nanoseconds
+ *
+ * This function is used where the control of the flow of time has
+ * been delegated to outside the clock subsystem (be it qtest, icount
+ * or some other external source). You can ask the clock system to
+ * return @early at the first expired timer.
+ *
+ * Time can only move forward; attempts to reverse time would lead to
+ * an error.
+ *
+ * Returns: new virtual time.
+ */
+int64_t qemu_clock_advance_virtual_time(int64_t dest);
 
 /*
  * QEMUTimerList
diff --git a/system/qtest.c b/system/qtest.c
index ee8b139e982..e6f6b4e62d5 100644
--- a/system/qtest.c
+++ b/system/qtest.c
@@ -337,26 +337,6 @@ void qtest_set_virtual_clock(int64_t count)
 qatomic_set_i64(&qtest_clock_counter, count);
 }
 
-static void qtest_clock_warp(int64_t dest)
-{
-int64_t clock = cpus_get_virtual_clock();
-AioContext *aio_context;
-assert(qtest_enabled());
-aio_context = qemu_get_aio_context();
-while (clock < dest) {
-int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL,
-  QEMU_TIMER_ATTR_ALL);
-int64_t warp = qemu_soonest_timeout(dest - clock, deadline);
-
-cpus_set_virtual_clock(cpus_get_virtual_clock() + warp);
-
-qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL);
-timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]);
-clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
-}
-qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
-}
-
 static bool (*process_command_cb)(CharBackend *chr, gchar **words);
 
 void qtest_set_command_cb(bool (*pc_cb)(CharBackend *chr, gchar **words))
@@ -755,7 +735,8 @@ static void qtest_process_command(CharBackend *chr, gchar **words)
 ns = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL,
 QEMU_TIMER_ATTR_ALL);
 }
-qtest_clock_warp(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + ns);
+qemu_clock_advance_virtual_time(
+qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + ns);
 qtest_send_prefix(chr);
 qtest_sendf(chr, "OK %"PRIi64"\n",
 (int64_t)qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL));
@@ -781,7 +762,7 @@ static void qtest_process_command(CharBackend *chr, gchar **words)
 g_assert(words[1]);
 ret = qemu_strtoi64(words[1], NULL, 0, &ns);
 g_assert(ret == 0);
-qtest_clock_warp(ns);
+qemu_clock_advance_virtual_time(ns);
 qtest_send_prefix(chr);
 qtest_sendf(chr, "OK %"PRIi64"\n",
 (int64_t)qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL));
diff --git a/util/qemu-timer.c b/util/qemu-timer.c
index 6a0de33dd2b..213114be68c 100644
--- a/util/qemu-timer.c
+++ b/util/qemu-timer.c
@@ -645,6 +645,11 @@ int64_t qemu_clock_get_ns(QEMUClockType type)
 }
 }
 
+static void qemu_virtual_clock_set_ns(int64_t time)
+{
+return cpus_set_virtual_clock(time);
+}
+
 void init_clocks(QEMUTimerListNotifyCB *notify_cb)
 {
 QEMUClockType type;
@@ -675,3 +680,24 @@ bool qemu_clock_run_all_timers(void)
 
 return progress;
 }
+
+int64_t qemu_clock_advance_virtual_time(int64_t dest)
+{
+int64_t clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
+AioContext *aio_context;
+aio_context = qemu_get_aio_context();
+while (clock < dest) {
+int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL,
+  QEMU_TIMER_ATTR_ALL);
+int64_t warp = qemu_soonest_timeout(dest - clock, deadline);
+
+qemu_virtual_clock_set_ns(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + warp);
+
+qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL);
+timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]);
+clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
+}
+qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
+
+return clock;
+}
-- 
2.39.2




[PATCH v3 4/6] qtest: move qtest_{get, set}_virtual_clock to accel/qtest/qtest.c

2024-05-30 Thread Pierrick Bouvier
Signed-off-by: Pierrick Bouvier 
---
 include/sysemu/qtest.h |  3 ---
 accel/qtest/qtest.c| 12 
 system/qtest.c | 12 
 3 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/include/sysemu/qtest.h b/include/sysemu/qtest.h
index 45f3b7e1df5..c161d751650 100644
--- a/include/sysemu/qtest.h
+++ b/include/sysemu/qtest.h
@@ -34,9 +34,6 @@ void qtest_server_init(const char *qtest_chrdev, const char *qtest_log, Error **
 void qtest_server_set_send_handler(void (*send)(void *, const char *),
  void *opaque);
 void qtest_server_inproc_recv(void *opaque, const char *buf);
-
-int64_t qtest_get_virtual_clock(void);
-void qtest_set_virtual_clock(int64_t count);
 #endif
 
 #endif
diff --git a/accel/qtest/qtest.c b/accel/qtest/qtest.c
index 53182e6c2ae..bf14032d294 100644
--- a/accel/qtest/qtest.c
+++ b/accel/qtest/qtest.c
@@ -24,6 +24,18 @@
 #include "qemu/main-loop.h"
 #include "hw/core/cpu.h"
 
+static int64_t qtest_clock_counter;
+
+static int64_t qtest_get_virtual_clock(void)
+{
+return qatomic_read_i64(&qtest_clock_counter);
+}
+
+static void qtest_set_virtual_clock(int64_t count)
+{
+qatomic_set_i64(&qtest_clock_counter, count);
+}
+
 static int qtest_init_accel(MachineState *ms)
 {
 return 0;
diff --git a/system/qtest.c b/system/qtest.c
index e6f6b4e62d5..ba210780ec0 100644
--- a/system/qtest.c
+++ b/system/qtest.c
@@ -325,18 +325,6 @@ static void qtest_irq_handler(void *opaque, int n, int level)
 }
 }
 
-static int64_t qtest_clock_counter;
-
-int64_t qtest_get_virtual_clock(void)
-{
-return qatomic_read_i64(&qtest_clock_counter);
-}
-
-void qtest_set_virtual_clock(int64_t count)
-{
-qatomic_set_i64(&qtest_clock_counter, count);
-}
-
 static bool (*process_command_cb)(CharBackend *chr, gchar **words);
 
 void qtest_set_command_cb(bool (*pc_cb)(CharBackend *chr, gchar **words))
-- 
2.39.2




[PATCH v3 1/6] sysemu: add set_virtual_time to accel ops

2024-05-30 Thread Pierrick Bouvier
From: Alex Bennée 

We are about to remove direct calls to individual accelerators for
this information and will need a central point for plugins to hook
into time changes.

Signed-off-by: Alex Bennée 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Pierrick Bouvier 
---
 include/sysemu/accel-ops.h | 18 +-
 include/sysemu/cpu-timers.h|  3 ++-
 ...et-virtual-clock.c => cpus-virtual-clock.c} |  5 +
 system/cpus.c  | 11 +++
 stubs/meson.build  |  2 +-
 5 files changed, 36 insertions(+), 3 deletions(-)
 rename stubs/{cpus-get-virtual-clock.c => cpus-virtual-clock.c} (68%)

diff --git a/include/sysemu/accel-ops.h b/include/sysemu/accel-ops.h
index ef91fc28bbd..a0886722305 100644
--- a/include/sysemu/accel-ops.h
+++ b/include/sysemu/accel-ops.h
@@ -20,7 +20,12 @@
 typedef struct AccelOpsClass AccelOpsClass;
 DECLARE_CLASS_CHECKERS(AccelOpsClass, ACCEL_OPS, TYPE_ACCEL_OPS)
 
-/* cpus.c operations interface */
+/**
+ * struct AccelOpsClass - accelerator interfaces
+ *
+ * This structure is used to abstract accelerator differences from the
+ * core CPU code. Not all have to be implemented.
+ */
 struct AccelOpsClass {
 /*< private >*/
 ObjectClass parent_class;
@@ -44,7 +49,18 @@ struct AccelOpsClass {
 
 void (*handle_interrupt)(CPUState *cpu, int mask);
 
+/**
+ * @get_virtual_clock: fetch virtual clock
+ * @set_virtual_clock: set virtual clock
+ *
+ * These allow the timer subsystem to defer to the accelerator to
+ * fetch time. The set function is needed if the accelerator wants
+ * to track the changes to time as the timer is warped through
+ * various timer events.
+ */
 int64_t (*get_virtual_clock)(void);
+void (*set_virtual_clock)(int64_t time);
+
 int64_t (*get_elapsed_ticks)(void);
 
 /* gdbstub hooks */
diff --git a/include/sysemu/cpu-timers.h b/include/sysemu/cpu-timers.h
index d86738a378d..7bfa960fbd6 100644
--- a/include/sysemu/cpu-timers.h
+++ b/include/sysemu/cpu-timers.h
@@ -96,8 +96,9 @@ int64_t cpu_get_clock(void);
 
 void qemu_timer_notify_cb(void *opaque, QEMUClockType type);
 
-/* get the VIRTUAL clock and VM elapsed ticks via the cpus accel interface */
+/* get/set VIRTUAL clock and VM elapsed ticks via the cpus accel interface */
 int64_t cpus_get_virtual_clock(void);
+void cpus_set_virtual_clock(int64_t new_time);
 int64_t cpus_get_elapsed_ticks(void);
 
 #endif /* SYSEMU_CPU_TIMERS_H */
diff --git a/stubs/cpus-get-virtual-clock.c b/stubs/cpus-virtual-clock.c
similarity index 68%
rename from stubs/cpus-get-virtual-clock.c
rename to stubs/cpus-virtual-clock.c
index fd447d53f3c..af7c1a1d403 100644
--- a/stubs/cpus-get-virtual-clock.c
+++ b/stubs/cpus-virtual-clock.c
@@ -6,3 +6,8 @@ int64_t cpus_get_virtual_clock(void)
 {
 return cpu_get_clock();
 }
+
+void cpus_set_virtual_clock(int64_t new_time)
+{
+/* do nothing */
+}
diff --git a/system/cpus.c b/system/cpus.c
index f8fa78f33d4..d3640c95030 100644
--- a/system/cpus.c
+++ b/system/cpus.c
@@ -229,6 +229,17 @@ int64_t cpus_get_virtual_clock(void)
 return cpu_get_clock();
 }
 
+/*
+ * Signal the new virtual time to the accelerator. This is only needed
+ * by accelerators that need to track the changes as we warp time.
+ */
+void cpus_set_virtual_clock(int64_t new_time)
+{
+if (cpus_accel && cpus_accel->set_virtual_clock) {
+cpus_accel->set_virtual_clock(new_time);
+}
+}
+
 /*
  * return the time elapsed in VM between vm_start and vm_stop.  Unless
  * icount is active, cpus_get_elapsed_ticks() uses units of the host CPU cycle
diff --git a/stubs/meson.build b/stubs/meson.build
index 3b9d42023cb..a1deafde08c 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -28,7 +28,7 @@ endif
 if have_block or have_ga
   stub_ss.add(files('replay-tools.c'))
   # stubs for hooks in util/main-loop.c, util/async.c etc.
-  stub_ss.add(files('cpus-get-virtual-clock.c'))
+  stub_ss.add(files('cpus-virtual-clock.c'))
   stub_ss.add(files('icount.c'))
   stub_ss.add(files('graph-lock.c'))
   if linux_io_uring.found()
-- 
2.39.2




Re: [PATCH v2 0/6] Implement icount=auto using TCG Plugins

2024-05-30 Thread Pierrick Bouvier

A v3 was just sent, with a fix to the algorithm used.

On 5/30/24 10:49, Pierrick Bouvier wrote:

The goal here is to be able to temporally scale execution of qemu-user/system,
using a given number of instructions per second.

We define a virtual clock that can be late or in advance compared to real
time. When we are in advance, we slow execution (by sleeping) until real
time catches up.

Finally, we should be able to clean up icount=auto mode completely, and keep
icount usage for deterministic purposes only.

It is built upon new TCG Plugins inline ops (store + conditional callbacks), now
merged on master.

Example in user-mode:

- Retrieve number of instructions to execute /bin/true
$ ./build/qemu-x86_64 -plugin ./build/tests/plugin/libinsn.so -d plugin /bin/true
cpu 0 insns: 120546
total insns: 120546
- Slow execution to match 5 seconds
$ time ./build/qemu-x86_64 -plugin ./build/contrib/plugins/libips,ips=$((120546/5)) /bin/true
real    0m4.985s

v2
--

- Added missing personal Signed-off-by for commits from Alex
- Fix bad rebase in stubs/meson.build
- move qtest_{get,set}_virtual_clock to accel/qtest/qtest.c
- A race condition was identified for plugins init/idle/resume, but is not
   related to this series, and will be fixed in another one:
   
https://lore.kernel.org/qemu-devel/20240529152219.825680-1-alex.ben...@linaro.org/

Alex Bennée (4):
   sysemu: add set_virtual_time to accel ops
   qtest: use cpu interface in qtest_clock_warp
   sysemu: generalise qtest_warp_clock as qemu_clock_advance_virtual_time
   plugins: add time control API

Pierrick Bouvier (2):
   qtest: move qtest_{get,set}_virtual_clock to accel/qtest/qtest.c
   contrib/plugins: add ips plugin example for cost modeling

  include/qemu/qemu-plugin.h|  23 ++
  include/qemu/timer.h  |  15 ++
  include/sysemu/accel-ops.h|  18 +-
  include/sysemu/cpu-timers.h   |   3 +-
  include/sysemu/qtest.h|   2 -
  accel/qtest/qtest.c   |  13 +
  contrib/plugins/ips.c | 239 ++
  plugins/api.c |  31 +++
  ...t-virtual-clock.c => cpus-virtual-clock.c} |   5 +
  system/cpus.c |  11 +
  system/qtest.c|  37 +--
  util/qemu-timer.c |  26 ++
  contrib/plugins/Makefile  |   1 +
  plugins/qemu-plugins.symbols  |   2 +
  stubs/meson.build |   2 +-
  15 files changed, 389 insertions(+), 39 deletions(-)
  create mode 100644 contrib/plugins/ips.c
  rename stubs/{cpus-get-virtual-clock.c => cpus-virtual-clock.c} (68%)



[PATCH v3 0/6] Implement icount=auto using TCG Plugins

2024-05-30 Thread Pierrick Bouvier
The goal here is to be able to temporally scale execution of qemu-user/system,
using a given number of instructions per second.

We define a virtual clock that can be late or in advance compared to real
time. When we are in advance, we slow execution (by sleeping) until real
time catches up.

Finally, we should be able to clean up icount=auto mode completely, and keep
icount usage for deterministic purposes only.

It is built upon new TCG Plugins inline ops (store + conditional callbacks), now
merged on master.

Example in user-mode:

- Retrieve number of instructions to execute /bin/true
$ ./build/qemu-x86_64 -plugin ./build/tests/plugin/libinsn.so -d plugin /bin/true
cpu 0 insns: 120546
total insns: 120546
- Slow execution to match 5 seconds
$ time ./build/qemu-x86_64 -plugin ./build/contrib/plugins/libips,ips=$((120546/5)) /bin/true
real    0m4.985s

Tested in system mode by booting a full debian system, and using:
$ sysbench cpu run
Performance decreases linearly with the given number of ips.

v2
--

- Added missing personal Signed-off-by for commits from Alex
- Fix bad rebase in stubs/meson.build
- move qtest_{get,set}_virtual_clock to accel/qtest/qtest.c
- A race condition was identified for plugins init/idle/resume, but is not
  related to this series, and will be fixed in another one:
  
https://lore.kernel.org/qemu-devel/20240529152219.825680-1-alex.ben...@linaro.org/

v3
--

- remove precise execution (per insn, per tb is enough)
- fixed the algorithm used. Instead of comparing from the start time of the
  system, we just check over a given quantum of time that we didn't run too
  fast. It is simpler: there is no need to track idle/resume events, and a
  vcpu can sleep correctly.
- use "sysbench cpu run" in system mode to check execution is slowed as 
expected.
- do not use idle/resume callback

Alex Bennée (4):
  sysemu: add set_virtual_time to accel ops
  qtest: use cpu interface in qtest_clock_warp
  sysemu: generalise qtest_warp_clock as qemu_clock_advance_virtual_time
  plugins: add time control API

Pierrick Bouvier (2):
  qtest: move qtest_{get,set}_virtual_clock to accel/qtest/qtest.c
  contrib/plugins: add ips plugin example for cost modeling

 include/qemu/qemu-plugin.h|  23 +++
 include/qemu/timer.h  |  15 ++
 include/sysemu/accel-ops.h|  18 +-
 include/sysemu/cpu-timers.h   |   3 +-
 include/sysemu/qtest.h|   2 -
 accel/qtest/qtest.c   |  13 ++
 contrib/plugins/ips.c | 164 ++
 plugins/api.c |  31 
 ...t-virtual-clock.c => cpus-virtual-clock.c} |   5 +
 system/cpus.c |  11 ++
 system/qtest.c|  37 +---
 util/qemu-timer.c |  26 +++
 contrib/plugins/Makefile  |   1 +
 plugins/qemu-plugins.symbols  |   2 +
 stubs/meson.build |   2 +-
 15 files changed, 314 insertions(+), 39 deletions(-)
 create mode 100644 contrib/plugins/ips.c
 rename stubs/{cpus-get-virtual-clock.c => cpus-virtual-clock.c} (68%)

-- 
2.39.2




[PATCH v3 6/6] contrib/plugins: add ips plugin example for cost modeling

2024-05-30 Thread Pierrick Bouvier
This plugin uses the new time control interface to make decisions
about the state of time during the emulation. The algorithm is
currently very simple. The user specifies an ips rate which applies
per core. If the core runs ahead of its allocated execution time the
plugin sleeps for a bit to let real time catch up. Either way time is
updated for the emulation as a function of total executed instructions
with some adjustments for cores that idle.

Examples


Slow down execution of /bin/true:
$ num_insn=$(./build/qemu-x86_64 -plugin ./build/tests/plugin/libinsn.so -d plugin /bin/true |& grep total | sed -e 's/.*: //')
$ time ./build/qemu-x86_64 -plugin ./build/contrib/plugins/libips.so,ips=$(($num_insn/4)) /bin/true
real 4.000s

Boot a Linux kernel simulating a 250MHz cpu:
$ /build/qemu-system-x86_64 -kernel /boot/vmlinuz-6.1.0-21-amd64 -append "console=ttyS0" -plugin ./build/contrib/plugins/libips.so,ips=$((250*1000*1000)) -smp 1 -m 512
check time until kernel panic on serial0

Tested in system mode by booting a full debian system, and using:
$ sysbench cpu run
Performance decreases linearly with the given number of ips.

Signed-off-by: Pierrick Bouvier 
---
 contrib/plugins/ips.c| 164 +++
 contrib/plugins/Makefile |   1 +
 2 files changed, 165 insertions(+)
 create mode 100644 contrib/plugins/ips.c

diff --git a/contrib/plugins/ips.c b/contrib/plugins/ips.c
new file mode 100644
index 000..db77729264b
--- /dev/null
+++ b/contrib/plugins/ips.c
@@ -0,0 +1,164 @@
+/*
+ * ips rate limiting plugin.
+ *
+ * This plugin can be used to restrict the execution of a system to a
+ * particular number of Instructions Per Second (ips). This controls
+ * time as seen by the guest, so while wall-clock time may be longer,
+ * from the guest's point of view time will pass at the normal rate.
+ *
+ * This uses the new plugin API which allows the plugin to control
+ * system time.
+ *
+ * Copyright (c) 2023 Linaro Ltd
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include <stdio.h>
+#include <glib.h>
+#include <qemu-plugin.h>
+
+QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;
+
+/* how many times do we update time per sec */
+#define NUM_TIME_UPDATE_PER_SEC 10
+#define NSEC_IN_ONE_SEC (1000 * 1000 * 1000)
+
+static GMutex global_state_lock;
+
+static uint64_t max_insn_per_second = 1000 * 1000 * 1000; /* ips per core, per second */
+static uint64_t max_insn_per_quantum; /* trap every N instructions */
+static int64_t virtual_time_ns; /* last set virtual time */
+
+static const void *time_handle;
+
+typedef struct {
+uint64_t total_insn;
+uint64_t quantum_insn; /* insn in last quantum */
+int64_t last_quantum_time; /* time when last quantum started */
+} vCPUTime;
+
+struct qemu_plugin_scoreboard *vcpus;
+
+/* return epoch time in ns */
+static int64_t now_ns(void)
+{
+return g_get_real_time() * 1000;
+}
+
+static uint64_t num_insn_during(int64_t elapsed_ns)
+{
+double num_secs = elapsed_ns / (double) NSEC_IN_ONE_SEC;
+return num_secs * (double) max_insn_per_second;
+}
+
+static int64_t time_for_insn(uint64_t num_insn)
+{
+double num_secs = (double) num_insn / (double) max_insn_per_second;
+return num_secs * (double) NSEC_IN_ONE_SEC;
+}
+
+static void update_system_time(vCPUTime *vcpu)
+{
+int64_t elapsed_ns = now_ns() - vcpu->last_quantum_time;
+uint64_t max_insn = num_insn_during(elapsed_ns);
+
+if (vcpu->quantum_insn >= max_insn) {
+/* this vcpu ran faster than expected, so it has to sleep */
+uint64_t insn_advance = vcpu->quantum_insn - max_insn;
+uint64_t time_advance_ns = time_for_insn(insn_advance);
+int64_t sleep_us = time_advance_ns / 1000;
+g_usleep(sleep_us);
+}
+
+vcpu->total_insn += vcpu->quantum_insn;
+vcpu->quantum_insn = 0;
+vcpu->last_quantum_time = now_ns();
+
+/* based on total number of instructions, what should be the new time? */
+int64_t new_virtual_time = time_for_insn(vcpu->total_insn);
+
+g_mutex_lock(&global_state_lock);
+
+/* Time only moves forward. Another vcpu might have updated it already. */
+if (new_virtual_time > virtual_time_ns) {
+qemu_plugin_update_ns(time_handle, new_virtual_time);
+virtual_time_ns = new_virtual_time;
+}
+
+g_mutex_unlock(&global_state_lock);
+}
+
+static void vcpu_init(qemu_plugin_id_t id, unsigned int cpu_index)
+{
+vCPUTime *vcpu = qemu_plugin_scoreboard_find(vcpus, cpu_index);
+vcpu->total_insn = 0;
+vcpu->quantum_insn = 0;
+vcpu->last_quantum_time = now_ns();
+}
+
+static void vcpu_exit(qemu_plugin_id_t id, unsigned int cpu_index)
+{
+vCPUTime *vcpu = qemu_plugin_scoreboard_find(vcpus, cpu_index);
+update_system_time(vcpu);
+}
+
+static void every_quantum_insn(unsigned int cpu_index, void *udata)
+{
+vCPUTime *vcpu = qemu_plugin_scoreboard_find(vcpus, cpu_index);
+g_assert(vcpu->quantum_insn >= max_insn_per_quantum);
+update_system_time(v

[PATCH v3 5/6] plugins: add time control API

2024-05-30 Thread Pierrick Bouvier
From: Alex Bennée 

Expose the ability to control time through the plugin API. Only one
plugin can control time so it has to request control when loaded.
There are probably more corner cases to catch here.

Signed-off-by: Alex Bennée 
Signed-off-by: Pierrick Bouvier 
---
 include/qemu/qemu-plugin.h   | 23 +++
 plugins/api.c| 31 +++
 plugins/qemu-plugins.symbols |  2 ++
 3 files changed, 56 insertions(+)

diff --git a/include/qemu/qemu-plugin.h b/include/qemu/qemu-plugin.h
index 95703d8fec1..80b1637cede 100644
--- a/include/qemu/qemu-plugin.h
+++ b/include/qemu/qemu-plugin.h
@@ -661,6 +661,29 @@ void qemu_plugin_register_vcpu_mem_inline_per_vcpu(
 qemu_plugin_u64 entry,
 uint64_t imm);
 
+/**
+ * qemu_plugin_request_time_control() - request the ability to control time
+ *
+ * This grants the plugin the ability to control system time. Only one
+ * plugin can control time so if multiple plugins request the ability
+ * all but the first will fail.
+ *
+ * Returns an opaque handle or NULL on failure
+ */
+const void *qemu_plugin_request_time_control(void);
+
+/**
+ * qemu_plugin_update_ns() - update system emulation time
+ * @handle: opaque handle returned by qemu_plugin_request_time_control()
+ * @time: time in nanoseconds
+ *
+ * This allows an appropriately authorised plugin (i.e. holding the
+ * time control handle) to move system time forward to @time.
+ *
+ * Start time is 0.
+ */
+void qemu_plugin_update_ns(const void *handle, int64_t time);
+
 typedef void
 (*qemu_plugin_vcpu_syscall_cb_t)(qemu_plugin_id_t id, unsigned int vcpu_index,
  int64_t num, uint64_t a1, uint64_t a2,
diff --git a/plugins/api.c b/plugins/api.c
index 5a0a7f8c712..26822b69ea2 100644
--- a/plugins/api.c
+++ b/plugins/api.c
@@ -39,6 +39,7 @@
 #include "qemu/main-loop.h"
 #include "qemu/plugin.h"
 #include "qemu/log.h"
+#include "qemu/timer.h"
 #include "tcg/tcg.h"
 #include "exec/exec-all.h"
 #include "exec/gdbstub.h"
@@ -583,3 +584,33 @@ uint64_t qemu_plugin_u64_sum(qemu_plugin_u64 entry)
 }
 return total;
 }
+
+/*
+ * Time control
+ */
+static bool has_control;
+
+const void *qemu_plugin_request_time_control(void)
+{
+if (!has_control) {
+has_control = true;
+return &has_control;
+}
+return NULL;
+}
+
+static void advance_virtual_time__async(CPUState *cpu, run_on_cpu_data data)
+{
+int64_t new_time = data.host_ulong;
+qemu_clock_advance_virtual_time(new_time);
+}
+
+void qemu_plugin_update_ns(const void *handle, int64_t new_time)
+{
+if (handle == &has_control) {
+/* Need to execute out of cpu_exec, so bql can be locked. */
+async_run_on_cpu(current_cpu,
+ advance_virtual_time__async,
+ RUN_ON_CPU_HOST_ULONG(new_time));
+}
+}
diff --git a/plugins/qemu-plugins.symbols b/plugins/qemu-plugins.symbols
index aa0a77a319f..ca773d8d9fe 100644
--- a/plugins/qemu-plugins.symbols
+++ b/plugins/qemu-plugins.symbols
@@ -38,6 +38,7 @@
   qemu_plugin_register_vcpu_tb_exec_cond_cb;
   qemu_plugin_register_vcpu_tb_exec_inline_per_vcpu;
   qemu_plugin_register_vcpu_tb_trans_cb;
+  qemu_plugin_request_time_control;
   qemu_plugin_reset;
   qemu_plugin_scoreboard_free;
   qemu_plugin_scoreboard_find;
@@ -51,5 +52,6 @@
   qemu_plugin_u64_set;
   qemu_plugin_u64_sum;
   qemu_plugin_uninstall;
+  qemu_plugin_update_ns;
   qemu_plugin_vcpu_for_each;
 };
-- 
2.39.2




[PATCH v3 2/6] qtest: use cpu interface in qtest_clock_warp

2024-05-30 Thread Pierrick Bouvier
From: Alex Bennée 

This generalises the qtest_clock_warp code to use the AccelOps
handlers for updating its own sense of time. This will make the next
patch, which moves the warp code, closer to pure code motion.

From: Alex Bennée 
Signed-off-by: Alex Bennée 
Acked-by: Thomas Huth 
Signed-off-by: Pierrick Bouvier 
---
 include/sysemu/qtest.h | 1 +
 accel/qtest/qtest.c| 1 +
 system/qtest.c | 6 +++---
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/sysemu/qtest.h b/include/sysemu/qtest.h
index b5d5fd34637..45f3b7e1df5 100644
--- a/include/sysemu/qtest.h
+++ b/include/sysemu/qtest.h
@@ -36,6 +36,7 @@ void qtest_server_set_send_handler(void (*send)(void *, const char *),
 void qtest_server_inproc_recv(void *opaque, const char *buf);
 
 int64_t qtest_get_virtual_clock(void);
+void qtest_set_virtual_clock(int64_t count);
 #endif
 
 #endif
diff --git a/accel/qtest/qtest.c b/accel/qtest/qtest.c
index f6056ac8361..53182e6c2ae 100644
--- a/accel/qtest/qtest.c
+++ b/accel/qtest/qtest.c
@@ -52,6 +52,7 @@ static void qtest_accel_ops_class_init(ObjectClass *oc, void *data)
 
 ops->create_vcpu_thread = dummy_start_vcpu_thread;
 ops->get_virtual_clock = qtest_get_virtual_clock;
+ops->set_virtual_clock = qtest_set_virtual_clock;
 };
 
 static const TypeInfo qtest_accel_ops_type = {
diff --git a/system/qtest.c b/system/qtest.c
index 6da58b3874e..ee8b139e982 100644
--- a/system/qtest.c
+++ b/system/qtest.c
@@ -332,14 +332,14 @@ int64_t qtest_get_virtual_clock(void)
 return qatomic_read_i64(&qtest_clock_counter);
 }
 
-static void qtest_set_virtual_clock(int64_t count)
+void qtest_set_virtual_clock(int64_t count)
 {
 qatomic_set_i64(&qtest_clock_counter, count);
 }
 
 static void qtest_clock_warp(int64_t dest)
 {
-int64_t clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
+int64_t clock = cpus_get_virtual_clock();
 AioContext *aio_context;
 assert(qtest_enabled());
 aio_context = qemu_get_aio_context();
@@ -348,7 +348,7 @@ static void qtest_clock_warp(int64_t dest)
   QEMU_TIMER_ATTR_ALL);
 int64_t warp = qemu_soonest_timeout(dest - clock, deadline);
 
-qtest_set_virtual_clock(qtest_get_virtual_clock() + warp);
+cpus_set_virtual_clock(cpus_get_virtual_clock() + warp);
 
 qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL);
 timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]);
-- 
2.39.2




Re: [PATCH v2 13/18] monitor: fdset: Match against O_DIRECT

2024-05-30 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:43PM -0300, Fabiano Rosas wrote:
> We're about to enable the use of O_DIRECT in the migration code and
> due to the alignment restrictions imposed by filesystems we need to
> make sure the flag is only used when doing aligned IO.
> 
> The migration will do parallel IO to different regions of a file, so
> we need to use more than one file descriptor. Those cannot be obtained
> by duplicating (dup()) since duplicated file descriptors share the
> file status flags, including O_DIRECT. If one migration channel does
> unaligned IO while another sets O_DIRECT to do aligned IO, the
> filesystem would fail the unaligned operation.
> 
> The add-fd QMP command along with the fdset code are specifically
> designed to allow the user to pass a set of file descriptors with
> different access flags into QEMU to be later fetched by code that
> needs to alternate between those flags when doing IO.
> 
> Extend the fdset matching to behave the same with the O_DIRECT flag.
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

One thing I am confused but totally irrelevant to this specific change.

I wonder why we need dupfds at all, and why the client needs remove-fd at
all.

It's about what would go wrong if qmp client only add-fd, then if it's
consumed by anything, it gets moved from "fds" list to "dupfds" list.  The
thing is I don't expect the client should pass over any fd that will not be
consumed.  Then if it's always consumed, why bother dup() at all, and why
bother an explicit remove-fd.

-- 
Peter Xu




Re: [PATCH v2 11/18] migration/multifd: Add direct-io support

2024-05-30 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:41PM -0300, Fabiano Rosas wrote:
> When multifd is used along with mapped-ram, we can take benefit of a
> filesystem that supports the O_DIRECT flag and perform direct I/O in
> the multifd threads. This brings a significant performance improvement
> because direct-io writes bypass the page cache which would otherwise
> be thrashed by the multifd data which is unlikely to be needed again
> in a short period of time.
> 
> To be able to use a multifd channel opened with O_DIRECT, we must
> ensure that a certain alignment is used. Filesystems usually require a
> block-size alignment for direct I/O. The way to achieve this is by
> enabling the mapped-ram feature, which already aligns its I/O properly
> (see MAPPED_RAM_FILE_OFFSET_ALIGNMENT at ram.c).
> 
> By setting O_DIRECT on the multifd channels, all writes to the same
> file descriptor need to be aligned as well, even the ones that come
> from outside multifd, such as the QEMUFile I/O from the main migration
> code. This makes it impossible to use the same file descriptor for the
> QEMUFile and for the multifd channels. The various flags and metadata
> written by the main migration code will always be unaligned by virtue
> of their small size. To workaround this issue, we'll require a second
> file descriptor to be used exclusively for direct I/O.
> 
> The second file descriptor can be obtained by QEMU by re-opening the
> migration file (already possible), or by being provided by the user or
> management application (support to be added in future patches).
> 
> Signed-off-by: Fabiano Rosas 
> ---
>  migration/file.c  | 31 ++-
>  migration/file.h  |  1 -
>  migration/migration.c | 23 +++
>  3 files changed, 49 insertions(+), 6 deletions(-)
> 
> diff --git a/migration/file.c b/migration/file.c
> index ba5b5c44ff..ac4d206492 100644
> --- a/migration/file.c
> +++ b/migration/file.c
> @@ -50,12 +50,31 @@ void file_cleanup_outgoing_migration(void)
>  outgoing_args.fname = NULL;
>  }
>  
> +static void file_enable_direct_io(int *flags)
> +{
> +#ifdef O_DIRECT
> +if (migrate_direct_io()) {
> +*flags |= O_DIRECT;
> +}
> +#else
> +/* it should have been rejected when setting the parameter */
> +g_assert_not_reached();
> +#endif
> +}
> +
>  bool file_send_channel_create(gpointer opaque, Error **errp)
>  {
>  QIOChannelFile *ioc;
>  int flags = O_WRONLY;
>  bool ret = true;
>  
> +/*
> + * Attempt to enable O_DIRECT for the secondary channels. These
> + * are used for sending ram pages and writes should be guaranteed
> + * to be aligned to at least page size.
> + */
> +file_enable_direct_io(&flags);

Call this only if enabled?  That looks clearer, IMHO:

   if (migrate_direct_io()) {
   file_enable_direct_io(&flags);
   }

Then:

static void file_enable_direct_io(int *flags)
{
#ifdef O_DIRECT
*flags |= O_DIRECT;
#else
/* it should have been rejected when setting the parameter */
g_assert_not_reached();
#endif
}

If you remember we have similar multifd calls, and I hoped all multifd
functions are only invoked when multifd is enabled first.  Same thing.

> +
>  ioc = qio_channel_file_new_path(outgoing_args.fname, flags, 0, errp);
>  if (!ioc) {
>  ret = false;
> @@ -116,21 +135,23 @@ static gboolean file_accept_incoming_migration(QIOChannel *ioc,
>  return G_SOURCE_REMOVE;
>  }
>  
> -void file_create_incoming_channels(QIOChannel *ioc, Error **errp)
> +static void file_create_incoming_channels(QIOChannel *ioc, char *filename,
> +  Error **errp)
>  {
> -int i, fd, channels = 1;
> +int i, channels = 1;
>  g_autofree QIOChannel **iocs = NULL;
> +int flags = O_RDONLY;
>  
>  if (migrate_multifd()) {
>  channels += migrate_multifd_channels();
> +file_enable_direct_io(&flags);

Same here.

Other than that looks good.

Thanks,

>  }
>  
>  iocs = g_new0(QIOChannel *, channels);
> -fd = QIO_CHANNEL_FILE(ioc)->fd;
>  iocs[0] = ioc;
>  
>  for (i = 1; i < channels; i++) {
> -QIOChannelFile *fioc = qio_channel_file_new_dupfd(fd, errp);
> +QIOChannelFile *fioc = qio_channel_file_new_path(filename, flags, 0, errp);
>  
>  if (!fioc) {
>  while (i) {
> @@ -170,7 +191,7 @@ void file_start_incoming_migration(FileMigrationArgs *file_args, Error **errp)
>  return;
>  }
>  
> -file_create_incoming_channels(QIO_CHANNEL(fioc), errp);
> +file_create_incoming_channels(QIO_CHANNEL(fioc), filename, errp);
>  }
>  
>  int file_write_ramblock_iov(QIOChannel *ioc, const struct iovec *iov,
> diff --git a/migration/file.h b/migration/file.h
> index 7699c04677..9f71e87f74 100644
> --- a/migration/file.h
> +++ b/migration/file.h
> @@ -20,7 +20,6 @@ void file_start_outgoing_migration(MigrationState *s,
>  int file_parse_offset(char *filespec, uint64_t 

Re: [PATCH] hw/usb/hcd-ohci: Fix ohci_service_td: accept valid TDs

2024-05-30 Thread Cord Amfmgm
On Thu, May 30, 2024 at 2:14 PM Alex Bennée  wrote:

> David Hubbard  writes:
>
> > From: Cord Amfmgm 
> >
> > This changes the way the ohci emulation handles a Transfer Descriptor
> with
> > "Current Buffer Pointer" set to "Buffer End" + 1.
> >
> > The OHCI spec 4.3.1.2 Table 4-2 allows td.cbp to be one byte more than
> td.be
> > to signal the buffer has zero length. Currently qemu only accepts
> zero-length
> > Transfer Descriptors if the td.cbp is equal to 0, while actual OHCI
> hardware
> > accepts both cases.
>
> Which version of the OHCI spec is this? I can't find it in the one copy
> Google throws up:
>
>
> http://download.microsoft.com/download/1/6/1/161ba512-40e2-4cc9-843a-923143f3456c/ohci_11.pdf
>
>
Replace http with https in that URL and it downloads the PDF OK - it is for
IEEE-1394 Firewire though.

Try this link: https://www.cs.usfca.edu/~cruse/cs698s10/hcir1_0a.pdf - I am
on page 35/160 (the page is numbered "21" on the bottom) for the Table 4-2.


Re: [PATCH v2 10/18] migration: Add direct-io parameter

2024-05-30 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:40PM -0300, Fabiano Rosas wrote:
> Add the direct-io migration parameter that tells the migration code to
> use O_DIRECT when opening the migration stream file whenever possible.
> 
> This is currently only used with the mapped-ram migration that has a
> clear window guaranteed to perform aligned writes.
> 
> Acked-by: Markus Armbruster 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: hw/usb/hcd-ohci: Fix #1510, #303: pid not IN or OUT

2024-05-30 Thread Cord Amfmgm
On Thu, May 30, 2024 at 2:12 PM Alex Bennée  wrote:

> Cord Amfmgm  writes:
>
> > On Thu, May 30, 2024 at 3:33 AM Alex Bennée 
> wrote:
> >
> >  Cord Amfmgm  writes:
> >
> >  > On Tue, May 28, 2024 at 11:32 AM Peter Maydell <
> peter.mayd...@linaro.org> wrote:
> >  >
> >  >  On Tue, 28 May 2024 at 16:37, Cord Amfmgm 
> wrote:
> >  >  >
> >  >  > On Tue, May 28, 2024 at 9:03 AM Peter Maydell <
> peter.mayd...@linaro.org> wrote:
> >  >  >>
> >  >  >> On Mon, 20 May 2024 at 23:24, Cord Amfmgm 
> wrote:
> >  >  >> > On Mon, May 20, 2024 at 12:05 PM Peter Maydell <
> peter.mayd...@linaro.org> wrote:
> >  
> >  >  >> > And here's an example buffer of length 0 -- you probably
> already know what I'm going to do here:
> >  >  >> >
> >  >  >> > char buf[0];
> >  >  >> > char * CurrentBufferPointer = &buf[0];
> >  >  >> > char * BufferEnd = &buf[-1]; // "address of the last byte in
> the buffer"
> >  >  >> > // The OHCI Host Controller than advances CurrentBufferPointer
> like this: CurrentBufferPointer += 0
> >  >  >> > // After the transfer:
> >  >  >> > // CurrentBufferPointer = &buf[0];
> >  >  >> > // BufferEnd = &buf[-1];
> >  >  >>
> >  >  >> Right, but why do you think this is valid, rather than
> >  >  >> being a guest software bug? My reading of the spec is that it's
> >  >  >> pretty clear about how to say "zero length buffer", and this
> >  >  >> isn't it.
> >  >  >>
> >  >  >> Is there some real-world guest OS that programs the OHCI
> >  >  >> controller this way that we're trying to accommodate?
> >  >  >
> >  >  >
> >  >  > qemu versions 4.2 and before allowed this behavior.
> >  >
> >  >  So? That might just mean we had a bug and we fixed it.
> >  >  4.2 is a very old version of QEMU and nobody seems to have
> >  >  complained in the four years since we released 5.0 about this,
> >  >  which suggests that generally guest OS drivers don't try
> >  >  to send zero-length buffers in this way.
> >  >
> >  >  > I don't think it's valid to ask for a *popular* guest OS as a
> proof-of-concept because I'm not an expert on those.
> >  >
> >  >  I didn't ask for "popular"; I asked for "real-world".
> >  >  What is the actual guest code you're running that falls over
> >  >  because of the behaviour change?
> >  >
> >  >  More generally, why do you want this behaviour to be
> >  >  changed? Reasonable reasons might include:
> >  >   * we're out of spec based on reading the documentation
> >  >   * you're trying to run some old Windows VM/QNX/etc image,
> >  > and it doesn't work any more
> >  >   * all the real hardware we tested behaves this way
> >  >
> >  >  But don't necessarily include:
> >  >   * something somebody wrote and only tested on QEMU happens to
> >  > assume the old behaviour rather than following the hw spec
> >  >
> >  >  QEMU occasionally works around guest OS bugs, but only as
> >  >  when we really have to. It's usually better to fix the
> >  >  bug in the guest.
> >  >
> >  > It's not, and I've already demonstrated that real hardware is
> consistent with the fix in this patch.
> >  >
> >  > Please check your tone.
> >
> >  I don't think that is a particularly helpful comment for someone who is
> >  taking the time to review your patches. Reading through the thread I
> >  didn't see anything that said this is how real HW behaves but I may well
> >  have missed it. However you have a number of review comments to address
> >  so I suggest you spin a v2 of the series to address them and outline the
> >  reason to accept an out of spec transaction.
> >
> > I did a rework of the patch -- see my email from May 20, quoted below --
> and I was under the impression it addressed all the
> > review comments. Did I miss something? I apologize if I did.
>
> Ahh I see - I'd only seen this thread continue so wasn't aware a new
> version had been posted. For future patches consider using -vN when
> sending them so we can clearly see a new revision is available.
>
> >
> >> index acd6016980..71b54914d3 100644
> >> --- a/hw/usb/hcd-ohci.c
> >> +++ b/hw/usb/hcd-ohci.c
> >> @@ -941,8 +941,8 @@ static int ohci_service_td(OHCIState *ohci, struct ohci_ed *ed)
> >>  if ((td.cbp & 0xfffff000) != (td.be & 0xfffff000)) {
> >>  len = (td.be & 0xfff) + 0x1001 - (td.cbp & 0xfff);
> >>  } else {
> >> -if (td.cbp > td.be) {
> >> -trace_usb_ohci_iso_td_bad_cc_overrun(td.cbp, td.be);
> >> +if (td.cbp - 1 > td.be) {  /* rely on td.cbp != 0 */
> >
> >> Reading through the thread I didn't see anything that said this is how
> real HW behaves but I may well have missed it.
> >
> > This is what I wrote regarding real HW:
> >
> > Results are:
> >
> >  qemu 4.2   | qemu HEAD  | actual HW
> > ++
> >  works fine | ohci_die() | works fine
> >
> > Would additional verification of the actual HW be useful?
> >
> > Peter posted the following which is more specific than "qemu 4.2" -- I
> agree this is most likely the qemu commi

Re: [PATCH v2 09/18] io: Stop using qemu_open_old in channel-file

2024-05-30 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:39PM -0300, Fabiano Rosas wrote:
> We want to make use of the Error object to report fdset errors from
> qemu_open_internal() and passing the error pointer to qemu_open_old()
> would require changing all callers. Move the file channel to the new
> API instead.
> 
> Reviewed-by: Daniel P. Berrangé 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v2 08/18] monitor: Report errors from monitor_fdset_dup_fd_add

2024-05-30 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:38PM -0300, Fabiano Rosas wrote:
> I'm keeping the EACCES because callers expect to be able to look at
> errno.
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v2 06/18] monitor: Stop removing non-duplicated fds

2024-05-30 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:36PM -0300, Fabiano Rosas wrote:
> We've been up until now cleaning up any file descriptors that have
> been passed into QEMU and never duplicated[1,2]. A file descriptor
> without duplicates indicates that no part of QEMU has made use of
> it. This approach is starting to show some cracks now that we're
> starting to consume fds from the migration code:
> 
> - Doing cleanup every time the last monitor connection closes works to
>   reap unused fds, but also has the side effect of forcing the
>   management layer to pass the file descriptors again in case of a
>   disconnect/re-connect, if that happened to be the only monitor
>   connection.
> 
>   Another side effect is that removing an fd with qmp_remove_fd() is
>   effectively delayed until the last monitor connection closes.
> 
>   The reliance on mon_refcount is also problematic because it's racy.
> 
> - Checking runstate_is_running() skips the cleanup unless the VM is
>   running and avoids premature cleanup of the fds, but also has the
>   side effect of blocking the legitimate removal of an fd via
>   qmp_remove_fd() if the VM happens to be in another state.
> 
>   This affects qmp_remove_fd() and qmp_query_fdsets() in particular
>   because requesting a removal at a bad time (guest stopped) might
>   cause an fd to never be removed, or to be removed at a much later
>   point in time, causing the query command to continue showing the
>   supposedly removed fd/fdset.
> 
> Note that file descriptors that *have* been duplicated are owned by
> the code that uses them and will be removed after qemu_close() is
> called. Therefore we've decided that the best course of action to
> avoid the undesired side-effects is to stop managing non-duplicated
> file descriptors.
> 
> 1- efb87c1697 ("monitor: Clean up fd sets on monitor disconnect")
> 2- ebe52b592d ("monitor: Prevent removing fd from set during init")
> 
> Signed-off-by: Fabiano Rosas 
> ---
>  monitor/fds.c  | 15 ---
>  monitor/hmp.c  |  2 --
>  monitor/monitor-internal.h |  1 -
>  monitor/qmp.c  |  2 --
>  4 files changed, 8 insertions(+), 12 deletions(-)
> 
> diff --git a/monitor/fds.c b/monitor/fds.c
> index 101e21720d..f7b98814fa 100644
> --- a/monitor/fds.c
> +++ b/monitor/fds.c
> @@ -169,6 +169,11 @@ int monitor_get_fd(Monitor *mon, const char *fdname, Error **errp)
>  
>  static void monitor_fdset_free(MonFdset *mon_fdset)
>  {
> +/*
> + * Only remove an empty fdset. The fds are owned by the user and
> + * should have been removed with qmp_remove_fd(). The dup_fds are
> + * owned by QEMU and should have been removed with qemu_close().
> + */
>  if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
>  QLIST_REMOVE(mon_fdset, next);
>  g_free(mon_fdset);
> @@ -189,9 +194,7 @@ static void monitor_fdset_cleanup(MonFdset *mon_fdset)
>  MonFdsetFd *mon_fdset_fd_next;
>  
> QLIST_FOREACH_SAFE(mon_fdset_fd, &mon_fdset->fds, next, mon_fdset_fd_next) {
> -if ((mon_fdset_fd->removed ||
> -(QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0)) &&
> -runstate_is_running()) {
> +if (mon_fdset_fd->removed) {

I don't know the code well, but I like it.

>  monitor_fdset_fd_free(mon_fdset_fd);
>  }
>  }
> @@ -206,7 +209,7 @@ void monitor_fdsets_cleanup(void)
>  
>  QEMU_LOCK_GUARD(&mon_fdsets_lock);
>  QLIST_FOREACH_SAFE(mon_fdset, &mon_fdsets, next, mon_fdset_next) {
> -monitor_fdset_cleanup(mon_fdset);
> +monitor_fdset_free(mon_fdset);
>  }
>  }
>  
> @@ -479,9 +482,7 @@ void monitor_fdset_dup_fd_remove(int dup_fd)
>  if (mon_fdset_fd_dup->fd == dup_fd) {
>  QLIST_REMOVE(mon_fdset_fd_dup, next);
>  g_free(mon_fdset_fd_dup);
> -if (QLIST_EMPTY(&mon_fdset->dup_fds)) {
> -monitor_fdset_cleanup(mon_fdset);
> -}
> +monitor_fdset_free(mon_fdset);

This and above changes are not crystal clear to me where the _cleanup()
does extra check "removed" and clean those fds.

I think it'll make it easier for me to understand if this one and the next
are squashed together.  But maybe it's simply because I didn't fully
understand.

>  return;
>  }
>  }
> diff --git a/monitor/hmp.c b/monitor/hmp.c
> index 69c1b7e98a..460e8832f6 100644
> --- a/monitor/hmp.c
> +++ b/monitor/hmp.c
> @@ -1437,11 +1437,9 @@ static void monitor_event(void *opaque, QEMUChrEvent event)
>  monitor_resume(mon);
>  }
>  qemu_mutex_unlock(&mon->mon_lock);
> -mon_refcount++;
>  break;
>  
>  case CHR_EVENT_CLOSED:
> -mon_refcount--;
>  monitor_fdsets_cleanup();
>  break;
>  
> diff --git a/monitor/monitor-internal.h b/monitor/monitor-internal.h
> index 252de85681..cb628f681d 100644
> --- a/

Re: [PATCH v2 05/18] monitor: Introduce monitor_fdset_*free

2024-05-30 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:35PM -0300, Fabiano Rosas wrote:
> Introduce two new functions to remove and free no longer used fds and
> fdsets.
> 
> We need those to decouple the remove/free routines from
> monitor_fdset_cleanup() which will go away in the next patches.
> 
> The new functions:
> 
> - monitor_fdset_free() will be used when a monitor connection closes
>   and when an fd is removed to cleanup any fdset that is now empty.
> 
> - monitor_fdset_fd_free() will be used to remove one or more fds that
>   have been explicitly targeted by qmp_remove_fd().
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

One nitpick below.

> ---
>  monitor/fds.c | 26 ++
>  1 file changed, 18 insertions(+), 8 deletions(-)
> 
> diff --git a/monitor/fds.c b/monitor/fds.c
> index fb9f58c056..101e21720d 100644
> --- a/monitor/fds.c
> +++ b/monitor/fds.c
> @@ -167,6 +167,22 @@ int monitor_get_fd(Monitor *mon, const char *fdname, Error **errp)
>  return -1;
>  }
>  
> +static void monitor_fdset_free(MonFdset *mon_fdset)
> +{
> +if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
> +QLIST_REMOVE(mon_fdset, next);
> +g_free(mon_fdset);
> +}
> +}

Would monitor_fdset_free_if_empty() (or similar) slightly better?

static void monitor_fdset_free(MonFdset *mon_fdset)
{
QLIST_REMOVE(mon_fdset, next);
g_free(mon_fdset);
}

static void monitor_fdset_free_if_empty(MonFdset *mon_fdset)
{
if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
monitor_fdset_free(mon_fdset);
}
}

> +
> +static void monitor_fdset_fd_free(MonFdsetFd *mon_fdset_fd)
> +{
> +close(mon_fdset_fd->fd);
> +g_free(mon_fdset_fd->opaque);
> +QLIST_REMOVE(mon_fdset_fd, next);
> +g_free(mon_fdset_fd);
> +}
> +
>  static void monitor_fdset_cleanup(MonFdset *mon_fdset)
>  {
>  MonFdsetFd *mon_fdset_fd;
> @@ -176,17 +192,11 @@ static void monitor_fdset_cleanup(MonFdset *mon_fdset)
>  if ((mon_fdset_fd->removed ||
>  (QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0)) &&
>  runstate_is_running()) {
> -close(mon_fdset_fd->fd);
> -g_free(mon_fdset_fd->opaque);
> -QLIST_REMOVE(mon_fdset_fd, next);
> -g_free(mon_fdset_fd);
> +monitor_fdset_fd_free(mon_fdset_fd);
>  }
>  }
>  
> -if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
> -QLIST_REMOVE(mon_fdset, next);
> -g_free(mon_fdset);
> -}
> +monitor_fdset_free(mon_fdset);
>  }
>  
>  void monitor_fdsets_cleanup(void)
> -- 
> 2.35.3
> 

-- 
Peter Xu




RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-30 Thread Sean Hefty
> > For rdma programming, the current mainstream implementation is to use
> rdma_cm to establish a connection, and then use verbs to transmit data.
> >
> > rdma_cm and ibverbs create two FDs respectively. The two FDs have
> > different responsibilities. rdma_cm fd is used to notify connection
> > establishment events, and verbs fd is used to notify new CQEs. When
> poll/epoll monitoring is directly performed on the rdma_cm fd, only a pollin
> event can be monitored, which means that an rdma_cm event occurs. When
> the verbs fd is directly polled/epolled, only the pollin event can be 
> listened,
> which indicates that a new CQE is generated.
> >
> > Rsocket is a sub-module attached to the rdma_cm library and provides
> > rdma calls that are completely similar to socket interfaces. However,
> > this library returns only the rdma_cm fd for listening to link setup-related
> events and does not expose the verbs fd (readable and writable events for
> listening to data). Only the rpoll interface provided by the RSocket can be 
> used
> to listen to related events. However, QEMU uses the ppoll interface to listen 
> to
> the rdma_cm fd (gotten by raccept API).
> > And cannot listen to the verbs fd event. Only some hacking methods can be
> used to address this problem.
> >
> > Do you guys have any ideas? Thanks.

The current rsocket code allows calling rpoll() with non-rsocket fd's, so an 
app can use rpoll() directly in place of poll().  It may be easiest to add an 
rppoll() call to rsockets and call that when using RDMA.

In case the easy path isn't feasible:

An extension could allow extracting the actual fd's under an rsocket, in order 
to allow a user to call poll()/ppoll() directly.  But it would be non-trivial.

The 'fd' that represents an rsocket happens to be the fd related to the RDMA 
CM.  That's because an rsocket needs a unique integer value to report as an 
'fd' value which will not conflict with any other fd value that the app may 
have.  I would consider the fd value an implementation detail, rather than 
something which an app should depend upon.  (For example, the 'fd' value 
returned for a datagram rsocket is actually a UDP socket fd).

Once an rsocket is in the connected state, it's possible an extended 
rgetsockopt() or rfcntl() call could return the fd related to the CQ.  But if 
an app tried to call poll() on that fd, the results would not be as expected.  
For example, it's possible for data to be available to receive on the rsocket 
without the CQ fd being signaled.  Calling poll() on the CQ fd in this state 
could leave the app hanging.  This is a natural? result of races in the RDMA CQ 
signaling.  If you look at the rsocket rpoll() implementation, you'll see that 
it checks for data prior to sleeping.

For an app to safely wait in poll/ppoll on the CQ fd, it would need to invoke 
some sort of 'pre-poll' routine, which would perform the same checks done in 
rpoll() prior to blocking.  As a reference to a similar pre-poll routine, see 
the fi_trywait() call from this man page: 

https://ofiwg.github.io/libfabric/v1.21.0/man/fi_poll.3.html

This is for a different library but deals with the same underlying problem.  
Obviously adding an rtrywait() to rsockets is possible but wouldn't align with 
any socket API equivalent.

- Sean


[PATCH 1/5] hw/core: expand on the alignment of CPUState

2024-05-30 Thread Alex Bennée
Make the relationship between CPUState, ArchCPU and cpu_env a bit
clearer in the kdoc comments.

Signed-off-by: Alex Bennée 
---
 include/hw/core/cpu.h | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index bb398e8237..35d345371b 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -391,7 +391,8 @@ struct qemu_work_item;
 #define CPU_UNSET_NUMA_NODE_ID -1
 
 /**
- * CPUState:
+ * struct CPUState - common state of one CPU core or thread.
+ *
  * @cpu_index: CPU index (informative).
  * @cluster_index: Identifies which cluster this CPU is in.
  *   For boards which don't define clusters or for "loose" CPUs not assigned
@@ -439,10 +440,15 @@ struct qemu_work_item;
  * @kvm_fetch_index: Keeps the index that we last fetched from the per-vCPU
  *dirty ring structure.
  *
- * State of one CPU core or thread.
+ * @neg_align: The CPUState is the common part of a concrete ArchCPU
+ * which is allocated when an individual CPU instance is created. As
+ * such, care is taken to ensure there is no gap between CPUState
+ * and CPUArchState within ArchCPU.
  *
- * Align, in order to match possible alignment required by CPUArchState,
- * and eliminate a hole between CPUState and CPUArchState within ArchCPU.
+ * @neg: The architectural register state ("cpu_env") immediately follows CPUState
+ * in ArchCPU and is passed to TCG code. The @neg structure holds some
+ * common TCG CPU variables which are accessed with a negative offset
+ * from cpu_env.
  */
 struct CPUState {
 /*< private >*/
-- 
2.39.2




[PATCH 2/5] cpu: move Qemu[Thread|Cond] setup into common code

2024-05-30 Thread Alex Bennée
Aside from the round robin threads this is all common code. By
moving the halt_cond setup we also no longer need hacks to work around
the race between QOM object creation and thread creation.

It is a little ugly to free things up for the round-robin thread, but
better that it handles its own specialisation than making the other
accelerators jump through hoops.

Signed-off-by: Alex Bennée 
---
 include/hw/core/cpu.h |  4 
 accel/dummy-cpus.c|  3 ---
 accel/hvf/hvf-accel-ops.c |  4 
 accel/kvm/kvm-accel-ops.c |  3 ---
 accel/tcg/tcg-accel-ops-mttcg.c   |  4 
 accel/tcg/tcg-accel-ops-rr.c  | 14 +++---
 hw/core/cpu-common.c  |  5 +
 target/i386/nvmm/nvmm-accel-ops.c |  3 ---
 target/i386/whpx/whpx-accel-ops.c |  3 ---
 9 files changed, 16 insertions(+), 27 deletions(-)

diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index 35d345371b..a405119eda 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -404,10 +404,14 @@ struct qemu_work_item;
  * @tcg_cflags: Pre-computed cflags for this cpu.
  * @nr_cores: Number of cores within this CPU package.
  * @nr_threads: Number of threads within this CPU core.
+ * @thread: Host thread details, only live once @created is #true
+ * @sem: WIN32 only semaphore used only for qtest
+ * @thread_id: native thread id of vCPU, only live once @created is #true
  * @running: #true if CPU is currently running (lockless).
  * @has_waiter: #true if a CPU is currently waiting for the cpu_exec_end;
  * valid under cpu_list_lock.
  * @created: Indicates whether the CPU thread has been successfully created.
+ * @halt_cond: condition variable sleeping threads can wait on.
  * @interrupt_request: Indicates a pending interrupt request.
  * @halted: Nonzero if the CPU is in suspended state.
  * @stop: Indicates a pending stop request.
diff --git a/accel/dummy-cpus.c b/accel/dummy-cpus.c
index 20519f1ea4..f32d8c8dc3 100644
--- a/accel/dummy-cpus.c
+++ b/accel/dummy-cpus.c
@@ -68,9 +68,6 @@ void dummy_start_vcpu_thread(CPUState *cpu)
 {
 char thread_name[VCPU_THREAD_NAME_SIZE];
 
-cpu->thread = g_malloc0(sizeof(QemuThread));
-cpu->halt_cond = g_malloc0(sizeof(QemuCond));
-qemu_cond_init(cpu->halt_cond);
 snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/DUMMY",
  cpu->cpu_index);
 qemu_thread_create(cpu->thread, thread_name, dummy_cpu_thread_fn, cpu,
diff --git a/accel/hvf/hvf-accel-ops.c b/accel/hvf/hvf-accel-ops.c
index 40d4187d9d..6f1e27ef46 100644
--- a/accel/hvf/hvf-accel-ops.c
+++ b/accel/hvf/hvf-accel-ops.c
@@ -463,10 +463,6 @@ static void hvf_start_vcpu_thread(CPUState *cpu)
  */
 assert(hvf_enabled());
 
-cpu->thread = g_malloc0(sizeof(QemuThread));
-cpu->halt_cond = g_malloc0(sizeof(QemuCond));
-qemu_cond_init(cpu->halt_cond);
-
 snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
  cpu->cpu_index);
 qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
diff --git a/accel/kvm/kvm-accel-ops.c b/accel/kvm/kvm-accel-ops.c
index 94c828ac8d..c239dfc87a 100644
--- a/accel/kvm/kvm-accel-ops.c
+++ b/accel/kvm/kvm-accel-ops.c
@@ -66,9 +66,6 @@ static void kvm_start_vcpu_thread(CPUState *cpu)
 {
 char thread_name[VCPU_THREAD_NAME_SIZE];
 
-cpu->thread = g_malloc0(sizeof(QemuThread));
-cpu->halt_cond = g_malloc0(sizeof(QemuCond));
-qemu_cond_init(cpu->halt_cond);
 snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/KVM",
  cpu->cpu_index);
 qemu_thread_create(cpu->thread, thread_name, kvm_vcpu_thread_fn,
diff --git a/accel/tcg/tcg-accel-ops-mttcg.c b/accel/tcg/tcg-accel-ops-mttcg.c
index c552b45b8e..49814ec4af 100644
--- a/accel/tcg/tcg-accel-ops-mttcg.c
+++ b/accel/tcg/tcg-accel-ops-mttcg.c
@@ -137,10 +137,6 @@ void mttcg_start_vcpu_thread(CPUState *cpu)
 g_assert(tcg_enabled());
 tcg_cpu_init_cflags(cpu, current_machine->smp.max_cpus > 1);
 
-cpu->thread = g_new0(QemuThread, 1);
-cpu->halt_cond = g_malloc0(sizeof(QemuCond));
-qemu_cond_init(cpu->halt_cond);
-
 /* create a thread per vCPU with TCG (MTTCG) */
 snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/TCG",
  cpu->cpu_index);
diff --git a/accel/tcg/tcg-accel-ops-rr.c b/accel/tcg/tcg-accel-ops-rr.c
index 894e73e52c..84c36c1450 100644
--- a/accel/tcg/tcg-accel-ops-rr.c
+++ b/accel/tcg/tcg-accel-ops-rr.c
@@ -317,22 +317,22 @@ void rr_start_vcpu_thread(CPUState *cpu)
 tcg_cpu_init_cflags(cpu, false);
 
 if (!single_tcg_cpu_thread) {
-cpu->thread = g_new0(QemuThread, 1);
-cpu->halt_cond = g_new0(QemuCond, 1);
-qemu_cond_init(cpu->halt_cond);
+single_tcg_halt_cond = cpu->halt_cond;
+single_tcg_cpu_thread = cpu->thread;
 
 /* share a single thread for all cpus with TCG */
 snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "ALL CPUs/TCG");
 qemu_thread_create(cpu->thread, thread_name,

[PATCH 4/5] plugins: remove special casing for cpu->realized

2024-05-30 Thread Alex Bennée
Now that the condition variable is initialised early on, we don't need to
jump through hoops to avoid calling async_run_on_cpu.

Signed-off-by: Alex Bennée 
---
 plugins/core.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/plugins/core.c b/plugins/core.c
index 0726bc7f25..badede28cf 100644
--- a/plugins/core.c
+++ b/plugins/core.c
@@ -65,11 +65,7 @@ static void plugin_cpu_update__locked(gpointer k, gpointer 
v, gpointer udata)
 CPUState *cpu = container_of(k, CPUState, cpu_index);
 run_on_cpu_data mask = RUN_ON_CPU_HOST_ULONG(*plugin.mask);
 
-if (DEVICE(cpu)->realized) {
-async_run_on_cpu(cpu, plugin_cpu_update__async, mask);
-} else {
-plugin_cpu_update__async(cpu, mask);
-}
+async_run_on_cpu(cpu, plugin_cpu_update__async, mask);
 }
 
 void plugin_unregister_cb__locked(struct qemu_plugin_ctx *ctx,
-- 
2.39.2




[PATCH 5/5] core/cpu-common: initialise plugin state before thread creation

2024-05-30 Thread Alex Bennée
Originally I tried to move vCPU thread initialisation to later
in realize. However pulling on that thread (sic) got gnarly really
quickly. It turns out some steps of CPU realization need values that
can only be determined from the running vCPU thread.

However, having moved enough out of the thread creation, we can now
queue work before the thread starts (at least for TCG guests) and
avoid the race between vcpu_init and the other vcpu states a plugin
might subscribe to.

Signed-off-by: Alex Bennée 
---
 hw/core/cpu-common.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c
index 6cfc01593a..bf1a7b8892 100644
--- a/hw/core/cpu-common.c
+++ b/hw/core/cpu-common.c
@@ -222,14 +222,6 @@ static void cpu_common_realizefn(DeviceState *dev, Error 
**errp)
 cpu_resume(cpu);
 }
 
-/* Plugin initialization must wait until the cpu start executing code */
-#ifdef CONFIG_PLUGIN
-if (tcg_enabled()) {
-cpu->plugin_state = qemu_plugin_create_vcpu_state();
-async_run_on_cpu(cpu, qemu_plugin_vcpu_init__async, RUN_ON_CPU_NULL);
-}
-#endif
-
 /* NOTE: latest generic point where the cpu is fully realized */
 }
 
@@ -273,6 +265,18 @@ static void cpu_common_initfn(Object *obj)
 QTAILQ_INIT(&cpu->watchpoints);
 
 cpu_exec_initfn(cpu);
+
+/*
+ * Plugin initialization must wait until the cpu starts executing
+ * code, but we must queue this work before the threads are
+ * created to ensure we don't race.
+ */
+#ifdef CONFIG_PLUGIN
+if (tcg_enabled()) {
+cpu->plugin_state = qemu_plugin_create_vcpu_state();
+async_run_on_cpu(cpu, qemu_plugin_vcpu_init__async, RUN_ON_CPU_NULL);
+}
+#endif
 }
 
 static void cpu_common_finalize(Object *obj)
-- 
2.39.2




[PATCH 0/5] cpus: a few tweaks to CPU realization

2024-05-30 Thread Alex Bennée
The recent IPS plugin exposed a race condition between vcpu_init
callbacks and the other vcpu state callbacks. I originally thought
there was some wider re-factoring to be done to clean this up but it
turns out things are broadly where they should be. However some of the
stuff allocated in the vCPU threads can clearly be done earlier so
I've moved enough from cpu_common_realizefn to cpu_common_initfn to
allow plugins to queue work before the threads start, solving the race.

Please review.

Alex Bennée (5):
  hw/core: expand on the alignment of CPUState
  cpu: move Qemu[Thread|Cond] setup into common code
  cpu-target: don't set cpu->thread_id to bogus value
  plugins: remove special casing for cpu->realized
  core/cpu-common: initialise plugin state before thread creation

 include/hw/core/cpu.h | 18 ++
 accel/dummy-cpus.c|  3 ---
 accel/hvf/hvf-accel-ops.c |  4 
 accel/kvm/kvm-accel-ops.c |  3 ---
 accel/tcg/tcg-accel-ops-mttcg.c   |  4 
 accel/tcg/tcg-accel-ops-rr.c  | 14 +++---
 cpu-target.c  |  1 -
 hw/core/cpu-common.c  | 25 +
 plugins/core.c|  6 +-
 target/i386/nvmm/nvmm-accel-ops.c |  3 ---
 target/i386/whpx/whpx-accel-ops.c |  3 ---
 11 files changed, 39 insertions(+), 45 deletions(-)

-- 
2.39.2




[PATCH 3/5] cpu-target: don't set cpu->thread_id to bogus value

2024-05-30 Thread Alex Bennée
The thread_id isn't valid until the threads are created. There is no
point setting it here. The only thing that cares about the thread_id
is qmp_query_cpus_fast.

Signed-off-by: Alex Bennée 
---
 cpu-target.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/cpu-target.c b/cpu-target.c
index 5af120e8aa..499facf774 100644
--- a/cpu-target.c
+++ b/cpu-target.c
@@ -241,7 +241,6 @@ void cpu_exec_initfn(CPUState *cpu)
 cpu->num_ases = 0;
 
 #ifndef CONFIG_USER_ONLY
-cpu->thread_id = qemu_get_thread_id();
 cpu->memory = get_system_memory();
 object_ref(OBJECT(cpu->memory));
 #endif
-- 
2.39.2




Re: [PATCH V1 00/26] Live update: cpr-exec

2024-05-30 Thread Peter Xu
On Thu, May 30, 2024 at 01:17:05PM -0400, Steven Sistare wrote:
> On 5/28/2024 12:42 PM, Peter Xu wrote:
> > On Tue, May 28, 2024 at 11:10:27AM -0400, Steven Sistare wrote:
> > > On 5/27/2024 1:45 PM, Peter Xu wrote:
> > > > On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote:
> > > > > I understand, thanks.  If I can help with any of your todo list,
> > > > > just ask - steve
> > > > 
> > > > Thanks for offering the help, Steve.  Started looking at this today, 
> > > > then I
> > > > found that I miss something high-level.  Let me ask here, and let me
> > > > apologize already for starting to throw multiple questions..
> > > > 
> > > > IIUC the whole idea of this patchset is to allow efficient QEMU 
> > > > upgrade, in
> > > > this case not host kernel but QEMU-only, and/or upper.
> > > > 
> > > > Is there any justification on why the complexity is needed here?  It 
> > > > looks
> > > > to me this one is more involved than cpr-reboot, so I'm thinking how 
> > > > much
> > > > we can get from the complexity, and whether it's worthwhile.  1000+ LOC 
> > > > is
> > > > the min support, and if we even expect more to come, that's really
> > > > important, IMHO.
> > > > 
> > > > For example, what's the major motivation of this whole work?  Is that 
> > > > more
> > > > on performance, or is it more for supporting the special devices like 
> > > > VFIO
> > > > which we used to not support, or something else?  I can't find them in
> > > > whatever cover letter I can find, including this one.
> > > > 
> > > > Firstly, regarding performance, IMHO it'll be always nice to share even
> > > > some very fundamental downtime measurement comparisons using the new 
> > > > exec
> > > > mode v.s. the old migration ways to upgrade QEMU binary.  Do you perhaps
> > > > have some number on hand when you started working on this feature years
> > > > ago?  Or maybe some old links on the list would help too, as I didn't
> > > > follow this work since the start.
> > > > 
> > > > On VFIO, IIUC you started out this project without VFIO migration being
> > > > there.  Now we have VFIO migration so not sure how much it would work 
> > > > for
> > > > the upgrade use case. Even with current VFIO migration, we may not want 
> > > > to
> > > > migrate device states for a local upgrade I suppose, as that can be a 
> > > > lot
> > > > depending on the type of device assigned.  However it'll be nice to 
> > > > discuss
> > > > this too if this is the major purpose of the series.
> > > > 
> > > > I think one other challenge on QEMU upgrade with VFIO devices is that 
> > > > the
> > > > dest QEMU won't be able to open the VFIO device when the src QEMU is 
> > > > still
> > > > using it as the owner.  IIUC this is a similar condition where QEMU 
> > > > wants
> > > > to have proper ownership transfer of a shared block device, and AFAIR 
> > > > right
> > > > now we resolved that issue using some form of file lock on the image 
> > > > file.
> > > > In this case it won't easily apply to a VFIO dev fd, but maybe we still
> > > > have other approaches, not sure whether you investigated any.  E.g. 
> > > > could
> > > > the VFIO handle be passed over using unix scm rights?  I think this 
> > > > might
> > > > remove one dependency of using exec which can cause quite some 
> > > > difference
> > > > v.s. a generic migration (from which regard, cpr-reboot is still a 
> > > > pretty
> > > > generic migration).
> > > > 
> > > > You also mentioned vhost/tap, is that also a major goal of this series 
> > > > in
> > > > the follow up patchsets?  Is this a problem only because this solution 
> > > > will
> > > > do exec?  Can it work if either the exec()ed qemu or dst qemu create the
> > > > vhost/tap fds when boot?
> > > > 
> > > > Meanwhile, could you elaborate a bit on the implication on chardevs?  
> > > > From
> > > > what I read in the doc update it looks like a major part of work in the
> > > > future, but I don't yet understand the issue..  Is it also relevant to 
> > > > the
> > > > exec() approach?
> > > > 
> > > > In all cases, some of such discussion would be really appreciated.  And 
> > > > if
> > > > you used to consider other approaches to solve this problem it'll be 
> > > > great
> > > > to mention how you chose this way.  Considering this work contains too 
> > > > many
> > > > things, it'll be nice if such discussion can start with the 
> > > > fundamentals,
> > > > e.g. on why exec() is a must.
> > > 
> > > The main goal of cpr-exec is providing a fast and reliable way to update
> > > qemu. cpr-reboot is not fast enough or general enough.  It requires the
> > > guest to support suspend and resume for all devices, and that takes 
> > > seconds.
> > > If one actually reboots the host, that adds more seconds, depending on
> > > system services.  cpr-exec takes 0.1 secs, and works every time, unlike
> > > live migration which can fail to converge on a busy system.  Live 
> > > migration
> > > also consumes more system and netwo

Re: [RFC PATCH] cpus: split qemu_init_vcpu and delay vCPU thread creation

2024-05-30 Thread Alex Bennée
Pierrick Bouvier  writes:

> On 5/29/24 08:22, Alex Bennée wrote:
>> This ensures we don't start the thread until cpu_common_realizefn has
>> finished. This ensures that plugins will always run
>> qemu_plugin_vcpu_init__async first before any other states. It doesn't
>> totally eliminate the race that plugin_cpu_update__locked has to work
>> around though. I found this while reviewing the ips plugin which makes
>> heavy use of the vcpu phase callbacks.
>> An alternative might be to move the explicit creation of vCPU
>> threads
>> to qdev_machine_creation_done()? It doesn't affect user-mode which
>> already has a thread to execute in and ensures the QOM object has
>> completed creation in cpu_create() before continuing.
>> Signed-off-by: Alex Bennée 
>> Cc: Pierrick Bouvier 
>> Cc: Philippe Mathieu-Daudé 
>> ---
>>   include/hw/core/cpu.h  |  8 
>>   accel/tcg/user-exec-stub.c |  5 +
>>   hw/core/cpu-common.c   |  7 ++-
>>   plugins/core.c |  5 +
>>   system/cpus.c  | 15 ++-
>>   5 files changed, 34 insertions(+), 6 deletions(-)
>> diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
>> index bb398e8237..6920699585 100644
>> --- a/include/hw/core/cpu.h
>> +++ b/include/hw/core/cpu.h
>> @@ -1041,6 +1041,14 @@ void end_exclusive(void);
>>*/
>>   void qemu_init_vcpu(CPUState *cpu);
>>   +/**
>> + * qemu_start_vcpu:
>> + * @cpu: The vCPU to start.
>> + *
>> + * Create the vCPU thread and start it running.
>> + */
>> +void qemu_start_vcpu(CPUState *cpu);
>> +
>>   #define SSTEP_ENABLE  0x1  /* Enable simulated HW single stepping */
>>   #define SSTEP_NOIRQ   0x2  /* Do not use IRQ while single stepping */
>>   #define SSTEP_NOTIMER 0x4  /* Do not Timers while single stepping */
>> diff --git a/accel/tcg/user-exec-stub.c b/accel/tcg/user-exec-stub.c
>> index 4fbe2dbdc8..162bb72bbe 100644
>> --- a/accel/tcg/user-exec-stub.c
>> +++ b/accel/tcg/user-exec-stub.c
>> @@ -18,6 +18,11 @@ void cpu_exec_reset_hold(CPUState *cpu)
>>   {
>>   }
>>   +void qemu_start_vcpu(CPUState *cpu)
>> +{
>> +/* NOP for user-mode, we already have a thread */
>> +}
>> +
>>   /* User mode emulation does not support record/replay yet.  */
>> bool replay_exception(void)
>> diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c
>> index 0f0a247f56..68895ddd59 100644
>> --- a/hw/core/cpu-common.c
>> +++ b/hw/core/cpu-common.c
>> @@ -230,7 +230,12 @@ static void cpu_common_realizefn(DeviceState *dev, 
>> Error **errp)
>>   }
>>   #endif
>>   -/* NOTE: latest generic point where the cpu is fully realized
>> */
>> +/*
>> + * With everything set up we can finally start the vCPU thread.
>> + * This is a NOP for linux-user.
>> + * NOTE: latest generic point where the cpu is fully realized
>> + */
>> +qemu_start_vcpu(cpu);
>>   }
>> static void cpu_common_unrealizefn(DeviceState *dev)
>> diff --git a/plugins/core.c b/plugins/core.c
>> index 0726bc7f25..1e5da7853b 100644
>> --- a/plugins/core.c
>> +++ b/plugins/core.c
>> @@ -65,6 +65,11 @@ static void plugin_cpu_update__locked(gpointer k, 
>> gpointer v, gpointer udata)
>>   CPUState *cpu = container_of(k, CPUState, cpu_index);
>>   run_on_cpu_data mask = RUN_ON_CPU_HOST_ULONG(*plugin.mask);
>>   +/*
>> + * There is a race condition between the starting of the vCPU
>> + * thread at the end of cpu_common_realizefn and when realized is
>> + * finally set.
>> + */
>
> Could we simply have an active wait here?
> while (!DEVICE(cpu)->realized) {}
>
> We have a guarantee it will be realized shortly, and if it's too hard
> to have a proper synchronization mechanism (introduce a
> realize_cond?), then waiting for the proper state does not seem too
> bad.
>
> It's a bit strange for me to document an existing race condition,
> instead of finding a solution.

If only it were that simple ;-)

Having been digging into this today it looks like there is a careful set
of dependencies on when threads need to be created during CPU
realization. I did try pushing the thread creation out of realization
but that breaks things like KVM which can't initialise some things
until the thread is up.

I'm now trying to move the plugin queuing its async work to after things
are initialised and before threads are created. My only concern now is
if I need to avoid kicking threads before they are created.

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



Re: [PATCH] hw/usb/hcd-ohci: Fix ohci_service_td: accept valid TDs

2024-05-30 Thread Alex Bennée
David Hubbard  writes:

> From: Cord Amfmgm 
>
> This changes the way the ohci emulation handles a Transfer Descriptor with
> "Current Buffer Pointer" set to "Buffer End" + 1.
>
> The OHCI spec 4.3.1.2 Table 4-2 allows td.cbp to be one byte more than td.be
> to signal the buffer has zero length. Currently qemu only accepts zero-length
> Transfer Descriptors if the td.cbp is equal to 0, while actual OHCI hardware
> accepts both cases.

Which version of the OHCI spec is this? I can't find it in the one copy
Google throws up:

  
http://download.microsoft.com/download/1/6/1/161ba512-40e2-4cc9-843a-923143f3456c/ohci_11.pdf

> The qemu ohci emulation has a regression in ohci_service_td. Version 4.2
> and earlier matched the spec. (I haven't taken the time to bisect exactly
> where the logic was changed.)
>
> With a tiny OS[1] that boots and executes a test, the issue can be seen:
>
> * OS that sends USB requests to a USB mass storage device
>   but sends td.cbp = td.be + 1
> * qemu 4.2
> * qemu HEAD (4e66a0854)
> * Actual OHCI controller (hardware)


-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



Re: hw/usb/hcd-ohci: Fix #1510, #303: pid not IN or OUT

2024-05-30 Thread Alex Bennée
Cord Amfmgm  writes:

> On Thu, May 30, 2024 at 3:33 AM Alex Bennée  wrote:
>
>  Cord Amfmgm  writes:
>
>  > On Tue, May 28, 2024 at 11:32 AM Peter Maydell  
> wrote:
>  >
>  >  On Tue, 28 May 2024 at 16:37, Cord Amfmgm  wrote:
>  >  >
>  >  > On Tue, May 28, 2024 at 9:03 AM Peter Maydell 
>  wrote:
>  >  >>
>  >  >> On Mon, 20 May 2024 at 23:24, Cord Amfmgm  wrote:
>  >  >> > On Mon, May 20, 2024 at 12:05 PM Peter Maydell 
>  wrote:
>  
>  >  >> > And here's an example buffer of length 0 -- you probably already 
> know what I'm going to do here:
>  >  >> >
>  >  >> > char buf[0];
>  >  >> > char * CurrentBufferPointer = &buf[0];
>  >  >> > char * BufferEnd = &buf[-1]; // "address of the last byte in the 
> buffer"
>  >  >> > // The OHCI Host Controller than advances CurrentBufferPointer like 
> this: CurrentBufferPointer += 0
>  >  >> > // After the transfer:
>  >  >> > // CurrentBufferPointer = &buf[0];
>  >  >> > // BufferEnd = &buf[-1];
>  >  >>
>  >  >> Right, but why do you think this is valid, rather than
>  >  >> being a guest software bug? My reading of the spec is that it's
>  >  >> pretty clear about how to say "zero length buffer", and this
>  >  >> isn't it.
>  >  >>
>  >  >> Is there some real-world guest OS that programs the OHCI
>  >  >> controller this way that we're trying to accommodate?
>  >  >
>  >  >
>  >  > qemu versions 4.2 and before allowed this behavior.
>  >
>  >  So? That might just mean we had a bug and we fixed it.
>  >  4.2 is a very old version of QEMU and nobody seems to have
>  >  complained in the four years since we released 5.0 about this,
>  >  which suggests that generally guest OS drivers don't try
>  >  to send zero-length buffers in this way.
>  >
>  >  > I don't think it's valid to ask for a *popular* guest OS as a 
> proof-of-concept because I'm not an expert on those.
>  >
>  >  I didn't ask for "popular"; I asked for "real-world".
>  >  What is the actual guest code you're running that falls over
>  >  because of the behaviour change?
>  >
>  >  More generally, why do you want this behaviour to be
>  >  changed? Reasonable reasons might include:
>  >   * we're out of spec based on reading the documentation
>  >   * you're trying to run some old Windows VM/QNX/etc image,
>  > and it doesn't work any more
>  >   * all the real hardware we tested behaves this way
>  >
>  >  But don't necessarily include:
>  >   * something somebody wrote and only tested on QEMU happens to
>  > assume the old behaviour rather than following the hw spec
>  >
>  >  QEMU occasionally works around guest OS bugs, but only as
>  >  when we really have to. It's usually better to fix the
>  >  bug in the guest.
>  >
>  > It's not, and I've already demonstrated that real hardware is consistent 
> with the fix in this patch.
>  >
>  > Please check your tone.
>
>  I don't think that is a particularly helpful comment for someone who is
>  taking the time to review your patches. Reading through the thread I
>  didn't see anything that said this is how real HW behaves but I may well
>  have missed it. However you have a number of review comments to address
>  so I suggest you spin a v2 of the series to address them and outline the
>  reason to accept an out of spec transaction.
>
> I did a rework of the patch -- see my email from May 20, quoted below -- and 
> I was under the impression it addressed all the
> review comments. Did I miss something? I apologize if I did.

Ahh I see - I'd only seen this thread continue so wasn't aware a new
version had been posted. For future patches consider using -vN when
sending them so we can clearly see a new revision is available.

>
>> index acd6016980..71b54914d3 100644
>> --- a/hw/usb/hcd-ohci.c
>> +++ b/hw/usb/hcd-ohci.c
>> @@ -941,8 +941,8 @@ static int ohci_service_td(OHCIState *ohci, struct 
>> ohci_ed *ed)
>>  if ((td.cbp & 0xfffff000) != (td.be & 0xfffff000)) {
>>  len = (td.be & 0xfff) + 0x1001 - (td.cbp & 0xfff);
>>  } else {
>> -if (td.cbp > td.be) {
>> -trace_usb_ohci_iso_td_bad_cc_overrun(td.cbp, td.be);
>> +if (td.cbp - 1 > td.be) {  /* rely on td.cbp != 0 */
>
>> Reading through the thread I didn't see anything that said this is how real 
>> HW behaves but I may well have missed it.
>
> This is what I wrote regarding real HW:
>
> Results are:
>
>  qemu 4.2   | qemu HEAD  | actual HW
> ++
>  works fine | ohci_die() | works fine
>
> Would additional verification of the actual HW be useful?
>
> Peter posted the following which is more specific than "qemu 4.2" -- I agree 
> this is most likely the qemu commit where this
> thread is focused:
>
>> Almost certainly this was commit 1328fe0c32d54 ("hw: usb: hcd-ohci:
>> check len and frame_number variables"), which added these bounds
>> checks. Prior to that we did no bounds checking at all, which
>> meant that we permitted cbp=be+1 to mean a zero length, but also
>> that we permitted the gu

Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr

2024-05-30 Thread Peter Xu
On Thu, May 30, 2024 at 01:12:40PM -0400, Steven Sistare wrote:
> On 5/29/2024 3:25 PM, Peter Xu wrote:
> > On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote:
> > > On 5/28/2024 5:44 PM, Peter Xu wrote:
> > > > On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:
> > > > > Preserve fields of RAMBlocks that allocate their host memory during 
> > > > > CPR so
> > > > > the RAM allocation can be recovered.
> > > > 
> > > > This sentence itself did not explain much, IMHO.  QEMU can share memory
> > > > using fd based memory already of all kinds, as long as the memory 
> > > > backend
> > > > is path-based it can be shared by sharing the same paths to dst.
> > > > 
> > > > This reads very confusing as a generic concept.  I mean, QEMU migration
> > > > relies on so many things to work right.  We mostly asks the users to 
> > > > "use
> > > > exactly the same cmdline for src/dst QEMU unless you know what you're
> > > > doing", otherwise many things can break.  That should also include 
> > > > ramblock
> > > > being matched between src/dst due to the same cmdlines provided on both
> > > > sides.  It'll be confusing to mention this when we thought the ramblocks
> > > > also rely on that fact.
> > > > 
> > > > So IIUC this sentence should be dropped in the real patch, and I'll try 
> > > > to
> > > > guess the real reason with below..
> > > 
> > > The properties of the implicitly created ramblocks must be preserved.
> > > The defaults can and do change between qemu releases, even when the 
> > > command-line
> > > parameters do not change for the explicit objects that cause these 
> > > implicit
> > > ramblocks to be created.
> > 
> > AFAIU, QEMU relies on ramblocks to be the same before this series.  Do you
> > have an example?  Would that already cause issue when migrate?
> 
> Alignment has changed, and used_length vs max_length changed when
> resizeable ramblocks were introduced.  I have dealt with these issues
> while supporting cpr for our internal use, and the learned lesson is to
> explicitly communicate the creation-time parameters to new qemu.

Why can used_length change?  I'm looking at ram_mig_ram_block_resized():

if (!migration_is_idle()) {
/*
 * Precopy code on the source cannot deal with the size of RAM blocks
 * changing at random points in time - especially after sending the
 * RAM block sizes in the migration stream, they must no longer change.
 * Abort and indicate a proper reason.
 */
error_setg(&err, "RAM block '%s' resized during precopy.", rb->idstr);
migration_cancel(err);
error_free(err);
}

We send used_length up front, during the migration SETUP phase.  Looks like
what you're describing is something different, though?

Regarding to rb->align: isn't that mostly a constant, reflecting the MR's
alignment?  It's set when ramblock is created IIUC:

rb->align = mr->align;

When will the alignment change?

> 
> These are not an issue for migration because the ramblock is re-created
> and the data copied into the new memory.
> 
> > > > > Mirror the mr->align field in the RAMBlock to simplify the vmstate.
> > > > > Preserve the old host address, even though it is immediately 
> > > > > discarded,
> > > > > as it will be needed in the future for CPR with iommufd.  Preserve
> > > > > guest_memfd, even though CPR does not yet support it, to maintain 
> > > > > vmstate
> > > > > compatibility when it becomes supported.
> > > > 
> > > > .. It could be about the vfio vaddr update feature that you mentioned 
> > > > and
> > > > only for iommufd (as IIUC vfio still relies on iova ranges, then it 
> > > > won't
> > > > help here)?
> > > > 
> > > > If so, IMHO we should have this patch (or any variance form) to be there
> > > > for your upcoming vfio support.  Keeping this around like this will make
> > > > the series harder to review.  Or is it needed even before VFIO?
> > > 
> > > This patch is needed independently of vfio or iommufd.
> > > 
> > > guest_memfd is independent of vfio or iommufd.  It is a recent addition
> > > which I have not tried to support, but I added this placeholder field
> > > to it can be supported in the future without adding a new field later
> > > and maintaining backwards compatibility.
> > 
> > Is guest_memfd the only user so far, then?  If so, would it be possible we
> > split it as a separate effort on top of the base cpr-exec support?
> 
> I don't understand the question.  I am indeed deferring support for 
> guest_memfd
> to a future time.  For now, I am adding a blocker, and reserving a field for
> it in the preserved ramblock attributes, to avoid adding a subsection later.

I meant I'm wondering whether the new ramblock vmsd is really required for
the initial implementation.

E.g., IIUC vaddr is required by iommufd, and so far that's not part of the
initial support.

Then I think a major thing is about the fds to be managed that will need to
be shared.  If we put 

Re: [PATCH V1 17/26] machine: memfd-alloc option

2024-05-30 Thread Peter Xu
On Thu, May 30, 2024 at 01:11:09PM -0400, Steven Sistare wrote:
> On 5/29/2024 3:14 PM, Peter Xu wrote:
> > On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote:
> > > > > diff --git a/system/memory.c b/system/memory.c
> > > > > index 49f1cb2..ca04a0e 100644
> > > > > --- a/system/memory.c
> > > > > +++ b/system/memory.c
> > > > > @@ -1552,8 +1552,9 @@ bool 
> > > > > memory_region_init_ram_nomigrate(MemoryRegion *mr,
> > > > >  uint64_t size,
> > > > >  Error **errp)
> > > > >{
> > > > > +uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
> > > > 
> > > > If there's a machine option to "use memfd for allocations", then it's
> > > > shared mem... Hmm..
> > > > 
> > > > It is a bit confusing to me in quite a few levels:
> > > > 
> > > > - Why memory allocation method will be defined by a machine 
> > > > property,
> > > >   even if we have memory-backend-* which should cover everything?
> > > 
> > > Some memory regions are implicitly created, and have no explicit 
> > > representation
> > > on the qemu command line.  memfd-alloc affects those.
> > > 
> > > More generally, memfd-alloc affects all ramblock allocations that are
> > > not explicitly represented by memory-backend object.  Thus the simple
> > > command line "qemu -m 1G" does not explicitly describe an object, so it
> > > goes through the anonymous allocation path, and is affected by 
> > > memfd-alloc.
> > 
> > Can we simply now allow "qemu -m 1G" to work for cpr-exec?
> 
> I assume you meant "simply not allow".
> 
> Yes, I could do that, but I would need to explicitly add code to exclude this
> case, and add a blocker.  Right now it "just works" for all paths that lead to
> ram_block_alloc_host, without any special logic at the memory-backend level.
> And, I'm not convinced that simplifies the docs, as now I would need to tell
> the user that "-m 1G" and similar constructions do not work with cpr.
> 
> I can try to clarify the doc for -memfd-alloc as currently defined.

Why do we need to keep cpr working for existing qemu cmdlines?  We'll
already need to add new cmdline options anyway, right?

cpr-reboot wasn't doing it, and that made sense to me: new features
should require the user to opt in, starting with changing their
cmdlines.

> 
> > AFAIU that's
> > what we do with cpr-reboot: we ask the user to specify the right things to
> > make other thing work.  Otherwise it won't.
> > 
> > > 
> > > Internally, create_default_memdev does create a memory-backend object.
> > > That is what my doc comment above refers to:
> > >Any associated memory-backend objects are created with share=on
> > > 
> > > An explicit "qemu -object memory-backend-*" is not affected by 
> > > memfd-alloc.
> > > 
> > > The qapi comments in patch "migration: cpr-exec mode" attempt to say all 
> > > that:
> > > 
> > > +# Memory backend objects must have the share=on attribute, and
> > > +# must be mmap'able in the new QEMU process.  For example,
> > > +# memory-backend-file is acceptable, but memory-backend-ram is
> > > +# not.
> > > +#
> > > +# The VM must be started with the '-machine memfd-alloc=on'
> > > +# option.  This causes implicit ram blocks -- those not explicitly
> > > +# described by a memory-backend object -- to be allocated by
> > > +# mmap'ing a memfd.  Examples include VGA, ROM, and even guest
> > > +# RAM when it is specified without a memory-backend object.
> > 
> > VGA is IIRC 16MB chunk, ROM is even smaller.  If the user specifies -object
> > memory-backend-file,share=on propertly, these should be the only outliers?
> > 
> > Are these important enough for the downtime?  Can we put them into the
> > migrated image alongside with the rest device states?
> 
> It's not about downtime.  vfio, vdpa, and iommufd pin all guest pages.
> The pages must remain pinned during CPR to support ongoing DMA activity
> which could target those pages (which we do not quiesce), and the same
> physical pages must be used for the ramblocks in the new qemu process.

Ah ok, yes DMA can happen on the fly.

Guest mem is definitely the major DMA target and that can be covered by
-object memory-backend-*,share=on cmdlines.

ROM is definitely not a DMA target.  So is VGA ram a target for, perhaps,
an assigned vGPU device?  Do we have a list of things that will need that?
Can we make them work somehow by sharing them like guest mem?

It'll be a complete tragedy if we introduced this whole thing only because
of some minority.  I want to understand whether there's any generic way to
solve this problem rather than this magical machine property.  IMHO it's
far from trivial to maintain.
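For concreteness, the kind of configuration being discussed -- all guest RAM explicitly backed by a shareable, mmap'able object so the exec'd QEMU can re-map the same pages -- would look roughly like this. Note that memfd-alloc is the machine option proposed in this series, not an existing QEMU flag, and the path and sizes are placeholders:

```shell
qemu-system-x86_64 \
  -machine q35,memfd-alloc=on,memory-backend=ram0 \
  -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/vm0,share=on
```

Guest RAM is covered by the explicit share=on backend; memfd-alloc would only change how the implicit ramblocks (VGA, ROM, and similar) are allocated.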

> 
> > > > - Even if we have such a machine property, why setting "memfd" will
> > > >   always imply shared?  why not private?  After all it's not called
> > > >   "memfd-shared-alloc", and we can create private ma

Re: [PATCH V1 07/26] migration: VMStateId

2024-05-30 Thread Peter Xu
On Thu, May 30, 2024 at 01:11:26PM -0400, Steven Sistare wrote:
> On 5/29/2024 2:53 PM, Peter Xu wrote:
> > On Wed, May 29, 2024 at 01:30:18PM -0400, Steven Sistare wrote:
> > > How about a more general name for the type:
> > > 
> > > migration/misc.h
> > >  typedef char (MigrationId)[256];
> > 
> > How about qemu/typedefs.h?  Not sure whether it's applicable. Markus (in
> > the loop) may have a better idea.
> > 
> > Meanwhile, s/MigrationID/IDString/?
> 
> typedefs.h has a different purpose: giving short names to types
> defined in internal include files.
> 
> This id is specific to migration, so I still think its name should reflect
> migration and it belongs in some include/migration/*.h file.
> 
> ramblocks and migration are already closely related.  There is nothing wrong
> with including a migration header in ramblock.h so it can use a migration 
> type.
> We already have:
>   include/hw/acpi/ich9_tco.h:#include "migration/vmstate.h"
>   include/hw/display/ramfb.h:#include "migration/vmstate.h"
>   include/hw/input/pl050.h:#include "migration/vmstate.h"
>   include/hw/pci/shpc.h:#include "migration/vmstate.h"
>   include/hw/virtio/virtio.h:#include "migration/vmstate.h"
>   include/hw/hyperv/vmbus.h:#include "migration/vmstate.h"
> 
> The 256-byte magic length already appears in too many places, and my code
> would add more occurrences, so I really think that abstracting this type
> would be cleaner.

I agree having a typedef is nicer, but I don't understand why it must be
migration related.  It can be the type QEMU uses to represent any
string-based ID, and that's a generic concept to me.

Migration can have a wrapper to process that type, and migration would then
include the generic header; that feels more natural to me.

-- 
Peter Xu




[PATCH v2 6/6] contrib/plugins: add ips plugin example for cost modeling

2024-05-30 Thread Pierrick Bouvier
This plugin uses the new time control interface to make decisions
about the state of time during the emulation. The algorithm is
currently very simple. The user specifies an ips rate, which applies
per core. If the core runs ahead of its allocated execution time, the
plugin sleeps for a bit to let real time catch up. Either way, time is
updated for the emulation as a function of the total executed
instructions, with some adjustments for cores that idle.

Examples


Slow down execution of /bin/true:
$ num_insn=$(./build/qemu-x86_64 -plugin ./build/tests/plugin/libinsn.so -d plugin /bin/true |& grep total | sed -e 's/.*: //')
$ time ./build/qemu-x86_64 -plugin ./build/contrib/plugins/libips.so,ips=$(($num_insn/4)) /bin/true
real 4.000s

Boot a Linux kernel simulating a 250MHz cpu:
$ /build/qemu-system-x86_64 -kernel /boot/vmlinuz-6.1.0-21-amd64 -append "console=ttyS0" -plugin ./build/contrib/plugins/libips.so,ips=$((250*1000*1000)) -smp 1 -m 512
check time until kernel panic on serial0

Signed-off-by: Pierrick Bouvier 
---
 contrib/plugins/ips.c| 239 +++
 contrib/plugins/Makefile |   1 +
 2 files changed, 240 insertions(+)
 create mode 100644 contrib/plugins/ips.c

diff --git a/contrib/plugins/ips.c b/contrib/plugins/ips.c
new file mode 100644
index 000..cf3159df391
--- /dev/null
+++ b/contrib/plugins/ips.c
@@ -0,0 +1,239 @@
+/*
+ * ips rate limiting plugin.
+ *
+ * This plugin can be used to restrict the execution of a system to a
+ * particular number of Instructions Per Second (ips). This controls
+ * time as seen by the guest, so while wall-clock time may be longer,
+ * from the guest's point of view time will pass at the normal rate.
+ *
+ * This uses the new plugin API which allows the plugin to control
+ * system time.
+ *
+ * Copyright (c) 2023 Linaro Ltd
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include 
+#include 
+#include 
+
+QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;
+
+/* how many times do we update time per sec */
+#define NUM_TIME_UPDATE_PER_SEC 10
+#define NSEC_IN_ONE_SEC (1000 * 1000 * 1000)
+
+static GMutex global_state_lock;
+
+static uint64_t insn_per_second = 1000 * 1000; /* ips per core, per second */
+static uint64_t insn_quantum; /* trap every N instructions */
+static bool precise_execution; /* count every instruction */
+static int64_t start_time_ns; /* time (ns since epoch) first vCPU started */
+static int64_t virtual_time_ns; /* last set virtual time */
+
+static const void *time_handle;
+
+typedef enum {
+UNKNOWN = 0,
+EXECUTING,
+IDLE,
+FINISHED
+} vCPUState;
+
+typedef struct {
+uint64_t counter;
+uint64_t track_insn;
+vCPUState state;
+/* timestamp when vCPU entered state */
+int64_t last_state_time;
+} vCPUTime;
+
+struct qemu_plugin_scoreboard *vcpus;
+
+/* return epoch time in ns */
+static int64_t now_ns(void)
+{
+return g_get_real_time() * 1000;
+}
+
+static uint64_t num_insn_during(int64_t elapsed_ns)
+{
+double num_secs = elapsed_ns / (double) NSEC_IN_ONE_SEC;
+return num_secs * (double) insn_per_second;
+}
+
+static int64_t time_for_insn(uint64_t num_insn)
+{
+double num_secs = (double) num_insn / (double) insn_per_second;
+return num_secs * (double) NSEC_IN_ONE_SEC;
+}
+
+static int64_t uptime_ns(void)
+{
+int64_t now = now_ns();
+g_assert(now >= start_time_ns);
+return now - start_time_ns;
+}
+
+static void vcpu_set_state(vCPUTime *vcpu, vCPUState new_state)
+{
+vcpu->last_state_time = now_ns();
+vcpu->state = new_state;
+}
+
+static void update_system_time(vCPUTime *vcpu)
+{
+/* flush remaining instructions */
+vcpu->counter += vcpu->track_insn;
+vcpu->track_insn = 0;
+
+int64_t uptime = uptime_ns();
+uint64_t expected_insn = num_insn_during(uptime);
+
+if (vcpu->counter >= expected_insn) {
+/* this vcpu ran faster than expected, so it has to sleep */
+uint64_t insn_advance = vcpu->counter - expected_insn;
+uint64_t time_advance_ns = time_for_insn(insn_advance);
+int64_t sleep_us = time_advance_ns / 1000;
+g_usleep(sleep_us);
+}
+
+/* based on number of instructions, what should be the new time? */
+int64_t new_virtual_time = time_for_insn(vcpu->counter);
+
+g_mutex_lock(&global_state_lock);
+
+/* Time only moves forward. Another vcpu might have updated it already. */
+if (new_virtual_time > virtual_time_ns) {
+qemu_plugin_update_ns(time_handle, new_virtual_time);
+virtual_time_ns = new_virtual_time;
+}
+
+g_mutex_unlock(&global_state_lock);
+}
+
+static void set_start_time(void)
+{
+g_mutex_lock(&global_state_lock);
+if (!start_time_ns) {
+start_time_ns = now_ns();
+}
+g_mutex_unlock(&global_state_lock);
+}
+
+static void vcpu_init(qemu_plugin_id_t id, unsigned int cpu_index)
+{
+vCPUTime *vcpu = qemu_plugin_scoreboard_find(vcpus, cpu_index);
+/* ensure start time

[PATCH v2 5/6] plugins: add time control API

2024-05-30 Thread Pierrick Bouvier
From: Alex Bennée 

Expose the ability to control time through the plugin API. Only one
plugin can control time, so it has to request control when loaded.
There are probably more corner cases to catch here.

Signed-off-by: Alex Bennée 
Signed-off-by: Pierrick Bouvier 
---
 include/qemu/qemu-plugin.h   | 23 +++
 plugins/api.c| 31 +++
 plugins/qemu-plugins.symbols |  2 ++
 3 files changed, 56 insertions(+)

diff --git a/include/qemu/qemu-plugin.h b/include/qemu/qemu-plugin.h
index 95703d8fec1..80b1637cede 100644
--- a/include/qemu/qemu-plugin.h
+++ b/include/qemu/qemu-plugin.h
@@ -661,6 +661,29 @@ void qemu_plugin_register_vcpu_mem_inline_per_vcpu(
 qemu_plugin_u64 entry,
 uint64_t imm);
 
+/**
+ * qemu_plugin_request_time_control() - request the ability to control time
+ *
+ * This grants the plugin the ability to control system time. Only one
+ * plugin can control time so if multiple plugins request the ability
+ * all but the first will fail.
+ *
+ * Returns an opaque handle or NULL if it fails
+ */
+const void *qemu_plugin_request_time_control(void);
+
+/**
+ * qemu_plugin_update_ns() - update system emulation time
+ * @handle: opaque handle returned by qemu_plugin_request_time_control()
+ * @time: time in nanoseconds
+ *
+ * This allows an appropriately authorised plugin (i.e. holding the
+ * time control handle) to move system time forward to @time.
+ *
+ * Start time is 0.
+ */
+void qemu_plugin_update_ns(const void *handle, int64_t time);
+
 typedef void
 (*qemu_plugin_vcpu_syscall_cb_t)(qemu_plugin_id_t id, unsigned int vcpu_index,
  int64_t num, uint64_t a1, uint64_t a2,
diff --git a/plugins/api.c b/plugins/api.c
index 5a0a7f8c712..26822b69ea2 100644
--- a/plugins/api.c
+++ b/plugins/api.c
@@ -39,6 +39,7 @@
 #include "qemu/main-loop.h"
 #include "qemu/plugin.h"
 #include "qemu/log.h"
+#include "qemu/timer.h"
 #include "tcg/tcg.h"
 #include "exec/exec-all.h"
 #include "exec/gdbstub.h"
@@ -583,3 +584,33 @@ uint64_t qemu_plugin_u64_sum(qemu_plugin_u64 entry)
 }
 return total;
 }
+
+/*
+ * Time control
+ */
+static bool has_control;
+
+const void *qemu_plugin_request_time_control(void)
+{
+if (!has_control) {
+has_control = true;
+return &has_control;
+}
+return NULL;
+}
+
+static void advance_virtual_time__async(CPUState *cpu, run_on_cpu_data data)
+{
+int64_t new_time = data.host_ulong;
+qemu_clock_advance_virtual_time(new_time);
+}
+
+void qemu_plugin_update_ns(const void *handle, int64_t new_time)
+{
+if (handle == &has_control) {
+/* Need to execute out of cpu_exec, so bql can be locked. */
+async_run_on_cpu(current_cpu,
+ advance_virtual_time__async,
+ RUN_ON_CPU_HOST_ULONG(new_time));
+}
+}
diff --git a/plugins/qemu-plugins.symbols b/plugins/qemu-plugins.symbols
index aa0a77a319f..ca773d8d9fe 100644
--- a/plugins/qemu-plugins.symbols
+++ b/plugins/qemu-plugins.symbols
@@ -38,6 +38,7 @@
   qemu_plugin_register_vcpu_tb_exec_cond_cb;
   qemu_plugin_register_vcpu_tb_exec_inline_per_vcpu;
   qemu_plugin_register_vcpu_tb_trans_cb;
+  qemu_plugin_request_time_control;
   qemu_plugin_reset;
   qemu_plugin_scoreboard_free;
   qemu_plugin_scoreboard_find;
@@ -51,5 +52,6 @@
   qemu_plugin_u64_set;
   qemu_plugin_u64_sum;
   qemu_plugin_uninstall;
+  qemu_plugin_update_ns;
   qemu_plugin_vcpu_for_each;
 };
-- 
2.39.2




[PATCH v2 2/6] qtest: use cpu interface in qtest_clock_warp

2024-05-30 Thread Pierrick Bouvier
From: Alex Bennée 

This generalises the qtest_clock_warp code to use the AccelOps
handlers for updating its own sense of time. This will make the next
patch, which moves the warp code, closer to pure code motion.

Signed-off-by: Alex Bennée 
Acked-by: Thomas Huth 
Signed-off-by: Pierrick Bouvier 
---
 include/sysemu/qtest.h | 1 +
 accel/qtest/qtest.c| 1 +
 system/qtest.c | 6 +++---
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/sysemu/qtest.h b/include/sysemu/qtest.h
index b5d5fd34637..45f3b7e1df5 100644
--- a/include/sysemu/qtest.h
+++ b/include/sysemu/qtest.h
@@ -36,6 +36,7 @@ void qtest_server_set_send_handler(void (*send)(void *, const 
char *),
 void qtest_server_inproc_recv(void *opaque, const char *buf);
 
 int64_t qtest_get_virtual_clock(void);
+void qtest_set_virtual_clock(int64_t count);
 #endif
 
 #endif
diff --git a/accel/qtest/qtest.c b/accel/qtest/qtest.c
index f6056ac8361..53182e6c2ae 100644
--- a/accel/qtest/qtest.c
+++ b/accel/qtest/qtest.c
@@ -52,6 +52,7 @@ static void qtest_accel_ops_class_init(ObjectClass *oc, void 
*data)
 
 ops->create_vcpu_thread = dummy_start_vcpu_thread;
 ops->get_virtual_clock = qtest_get_virtual_clock;
+ops->set_virtual_clock = qtest_set_virtual_clock;
 };
 
 static const TypeInfo qtest_accel_ops_type = {
diff --git a/system/qtest.c b/system/qtest.c
index 6da58b3874e..ee8b139e982 100644
--- a/system/qtest.c
+++ b/system/qtest.c
@@ -332,14 +332,14 @@ int64_t qtest_get_virtual_clock(void)
 return qatomic_read_i64(&qtest_clock_counter);
 }
 
-static void qtest_set_virtual_clock(int64_t count)
+void qtest_set_virtual_clock(int64_t count)
 {
 qatomic_set_i64(&qtest_clock_counter, count);
 }
 
 static void qtest_clock_warp(int64_t dest)
 {
-int64_t clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
+int64_t clock = cpus_get_virtual_clock();
 AioContext *aio_context;
 assert(qtest_enabled());
 aio_context = qemu_get_aio_context();
@@ -348,7 +348,7 @@ static void qtest_clock_warp(int64_t dest)
   QEMU_TIMER_ATTR_ALL);
 int64_t warp = qemu_soonest_timeout(dest - clock, deadline);
 
-qtest_set_virtual_clock(qtest_get_virtual_clock() + warp);
+cpus_set_virtual_clock(cpus_get_virtual_clock() + warp);
 
 qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL);
 timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]);
-- 
2.39.2




[PATCH v2 3/6] sysemu: generalise qtest_warp_clock as qemu_clock_advance_virtual_time

2024-05-30 Thread Pierrick Bouvier
From: Alex Bennée 

Move the key functionality of moving time forward into the clock
sub-system itself. This will allow us to plumb in time control into
plugins.

Signed-off-by: Alex Bennée 
Signed-off-by: Pierrick Bouvier 
---
 include/qemu/timer.h | 15 +++
 system/qtest.c   | 25 +++--
 util/qemu-timer.c| 26 ++
 3 files changed, 44 insertions(+), 22 deletions(-)

diff --git a/include/qemu/timer.h b/include/qemu/timer.h
index 9a366e551fb..910587d8293 100644
--- a/include/qemu/timer.h
+++ b/include/qemu/timer.h
@@ -245,6 +245,21 @@ bool qemu_clock_run_timers(QEMUClockType type);
  */
 bool qemu_clock_run_all_timers(void);
 
+/**
+ * qemu_clock_advance_virtual_time(): advance the virtual time tick
+ * @target: target time in nanoseconds
+ *
+ * This function is used where the control of the flow of time has
+ * been delegated to outside the clock subsystem (be it qtest, icount
+ * or some other external source). You can ask the clock system to
+ * return @early at the first expired timer.
+ *
+ * Time can only move forward, attempts to reverse time would lead to
+ * an error.
+ *
+ * Returns: new virtual time.
+ */
+int64_t qemu_clock_advance_virtual_time(int64_t dest);
 
 /*
  * QEMUTimerList
diff --git a/system/qtest.c b/system/qtest.c
index ee8b139e982..e6f6b4e62d5 100644
--- a/system/qtest.c
+++ b/system/qtest.c
@@ -337,26 +337,6 @@ void qtest_set_virtual_clock(int64_t count)
 qatomic_set_i64(&qtest_clock_counter, count);
 }
 
-static void qtest_clock_warp(int64_t dest)
-{
-int64_t clock = cpus_get_virtual_clock();
-AioContext *aio_context;
-assert(qtest_enabled());
-aio_context = qemu_get_aio_context();
-while (clock < dest) {
-int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL,
-  QEMU_TIMER_ATTR_ALL);
-int64_t warp = qemu_soonest_timeout(dest - clock, deadline);
-
-cpus_set_virtual_clock(cpus_get_virtual_clock() + warp);
-
-qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL);
-timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]);
-clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
-}
-qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
-}
-
 static bool (*process_command_cb)(CharBackend *chr, gchar **words);
 
 void qtest_set_command_cb(bool (*pc_cb)(CharBackend *chr, gchar **words))
@@ -755,7 +735,8 @@ static void qtest_process_command(CharBackend *chr, gchar 
**words)
 ns = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL,
 QEMU_TIMER_ATTR_ALL);
 }
-qtest_clock_warp(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + ns);
+qemu_clock_advance_virtual_time(
+qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + ns);
 qtest_send_prefix(chr);
 qtest_sendf(chr, "OK %"PRIi64"\n",
 (int64_t)qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL));
@@ -781,7 +762,7 @@ static void qtest_process_command(CharBackend *chr, gchar 
**words)
 g_assert(words[1]);
 ret = qemu_strtoi64(words[1], NULL, 0, &ns);
 g_assert(ret == 0);
-qtest_clock_warp(ns);
+qemu_clock_advance_virtual_time(ns);
 qtest_send_prefix(chr);
 qtest_sendf(chr, "OK %"PRIi64"\n",
 (int64_t)qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL));
diff --git a/util/qemu-timer.c b/util/qemu-timer.c
index 6a0de33dd2b..213114be68c 100644
--- a/util/qemu-timer.c
+++ b/util/qemu-timer.c
@@ -645,6 +645,11 @@ int64_t qemu_clock_get_ns(QEMUClockType type)
 }
 }
 
+static void qemu_virtual_clock_set_ns(int64_t time)
+{
+return cpus_set_virtual_clock(time);
+}
+
 void init_clocks(QEMUTimerListNotifyCB *notify_cb)
 {
 QEMUClockType type;
@@ -675,3 +680,24 @@ bool qemu_clock_run_all_timers(void)
 
 return progress;
 }
+
+int64_t qemu_clock_advance_virtual_time(int64_t dest)
+{
+int64_t clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
+AioContext *aio_context;
+aio_context = qemu_get_aio_context();
+while (clock < dest) {
+int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL,
+  QEMU_TIMER_ATTR_ALL);
+int64_t warp = qemu_soonest_timeout(dest - clock, deadline);
+
+qemu_virtual_clock_set_ns(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 
warp);
+
+qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL);
+timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]);
+clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
+}
+qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
+
+return clock;
+}
-- 
2.39.2




[PATCH v2 4/6] qtest: move qtest_{get, set}_virtual_clock to accel/qtest/qtest.c

2024-05-30 Thread Pierrick Bouvier
Signed-off-by: Pierrick Bouvier 
---
 include/sysemu/qtest.h |  3 ---
 accel/qtest/qtest.c| 12 
 system/qtest.c | 12 
 3 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/include/sysemu/qtest.h b/include/sysemu/qtest.h
index 45f3b7e1df5..c161d751650 100644
--- a/include/sysemu/qtest.h
+++ b/include/sysemu/qtest.h
@@ -34,9 +34,6 @@ void qtest_server_init(const char *qtest_chrdev, const char 
*qtest_log, Error **
 void qtest_server_set_send_handler(void (*send)(void *, const char *),
  void *opaque);
 void qtest_server_inproc_recv(void *opaque, const char *buf);
-
-int64_t qtest_get_virtual_clock(void);
-void qtest_set_virtual_clock(int64_t count);
 #endif
 
 #endif
diff --git a/accel/qtest/qtest.c b/accel/qtest/qtest.c
index 53182e6c2ae..bf14032d294 100644
--- a/accel/qtest/qtest.c
+++ b/accel/qtest/qtest.c
@@ -24,6 +24,18 @@
 #include "qemu/main-loop.h"
 #include "hw/core/cpu.h"
 
+static int64_t qtest_clock_counter;
+
+static int64_t qtest_get_virtual_clock(void)
+{
+return qatomic_read_i64(&qtest_clock_counter);
+}
+
+static void qtest_set_virtual_clock(int64_t count)
+{
+qatomic_set_i64(&qtest_clock_counter, count);
+}
+
 static int qtest_init_accel(MachineState *ms)
 {
 return 0;
diff --git a/system/qtest.c b/system/qtest.c
index e6f6b4e62d5..ba210780ec0 100644
--- a/system/qtest.c
+++ b/system/qtest.c
@@ -325,18 +325,6 @@ static void qtest_irq_handler(void *opaque, int n, int 
level)
 }
 }
 
-static int64_t qtest_clock_counter;
-
-int64_t qtest_get_virtual_clock(void)
-{
-return qatomic_read_i64(&qtest_clock_counter);
-}
-
-void qtest_set_virtual_clock(int64_t count)
-{
-qatomic_set_i64(&qtest_clock_counter, count);
-}
-
 static bool (*process_command_cb)(CharBackend *chr, gchar **words);
 
 void qtest_set_command_cb(bool (*pc_cb)(CharBackend *chr, gchar **words))
-- 
2.39.2




[PATCH v2 1/6] sysemu: add set_virtual_time to accel ops

2024-05-30 Thread Pierrick Bouvier
From: Alex Bennée 

We are about to remove direct calls to individual accelerators for
this information and will need a central point for plugins to hook
into time changes.

Signed-off-by: Alex Bennée 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Pierrick Bouvier 
---
 include/sysemu/accel-ops.h | 18 +-
 include/sysemu/cpu-timers.h|  3 ++-
 ...et-virtual-clock.c => cpus-virtual-clock.c} |  5 +
 system/cpus.c  | 11 +++
 stubs/meson.build  |  2 +-
 5 files changed, 36 insertions(+), 3 deletions(-)
 rename stubs/{cpus-get-virtual-clock.c => cpus-virtual-clock.c} (68%)

diff --git a/include/sysemu/accel-ops.h b/include/sysemu/accel-ops.h
index ef91fc28bbd..a0886722305 100644
--- a/include/sysemu/accel-ops.h
+++ b/include/sysemu/accel-ops.h
@@ -20,7 +20,12 @@
 typedef struct AccelOpsClass AccelOpsClass;
 DECLARE_CLASS_CHECKERS(AccelOpsClass, ACCEL_OPS, TYPE_ACCEL_OPS)
 
-/* cpus.c operations interface */
+/**
+ * struct AccelOpsClass - accelerator interfaces
+ *
+ * This structure is used to abstract accelerator differences from the
+ * core CPU code. Not all have to be implemented.
+ */
 struct AccelOpsClass {
 /*< private >*/
 ObjectClass parent_class;
@@ -44,7 +49,18 @@ struct AccelOpsClass {
 
 void (*handle_interrupt)(CPUState *cpu, int mask);
 
+/**
+ * @get_virtual_clock: fetch virtual clock
+ * @set_virtual_clock: set virtual clock
+ *
+ * These allow the timer subsystem to defer to the accelerator to
+ * fetch time. The set function is needed if the accelerator wants
+ * to track the changes to time as the timer is warped through
+ * various timer events.
+ */
 int64_t (*get_virtual_clock)(void);
+void (*set_virtual_clock)(int64_t time);
+
 int64_t (*get_elapsed_ticks)(void);
 
 /* gdbstub hooks */
diff --git a/include/sysemu/cpu-timers.h b/include/sysemu/cpu-timers.h
index d86738a378d..7bfa960fbd6 100644
--- a/include/sysemu/cpu-timers.h
+++ b/include/sysemu/cpu-timers.h
@@ -96,8 +96,9 @@ int64_t cpu_get_clock(void);
 
 void qemu_timer_notify_cb(void *opaque, QEMUClockType type);
 
-/* get the VIRTUAL clock and VM elapsed ticks via the cpus accel interface */
+/* get/set VIRTUAL clock and VM elapsed ticks via the cpus accel interface */
 int64_t cpus_get_virtual_clock(void);
+void cpus_set_virtual_clock(int64_t new_time);
 int64_t cpus_get_elapsed_ticks(void);
 
 #endif /* SYSEMU_CPU_TIMERS_H */
diff --git a/stubs/cpus-get-virtual-clock.c b/stubs/cpus-virtual-clock.c
similarity index 68%
rename from stubs/cpus-get-virtual-clock.c
rename to stubs/cpus-virtual-clock.c
index fd447d53f3c..af7c1a1d403 100644
--- a/stubs/cpus-get-virtual-clock.c
+++ b/stubs/cpus-virtual-clock.c
@@ -6,3 +6,8 @@ int64_t cpus_get_virtual_clock(void)
 {
 return cpu_get_clock();
 }
+
+void cpus_set_virtual_clock(int64_t new_time)
+{
+/* do nothing */
+}
diff --git a/system/cpus.c b/system/cpus.c
index f8fa78f33d4..d3640c95030 100644
--- a/system/cpus.c
+++ b/system/cpus.c
@@ -229,6 +229,17 @@ int64_t cpus_get_virtual_clock(void)
 return cpu_get_clock();
 }
 
+/*
+ * Signal the new virtual time to the accelerator. This is only needed
+ * by accelerators that need to track the changes as we warp time.
+ */
+void cpus_set_virtual_clock(int64_t new_time)
+{
+if (cpus_accel && cpus_accel->set_virtual_clock) {
+cpus_accel->set_virtual_clock(new_time);
+}
+}
+
 /*
  * return the time elapsed in VM between vm_start and vm_stop.  Unless
  * icount is active, cpus_get_elapsed_ticks() uses units of the host CPU cycle
diff --git a/stubs/meson.build b/stubs/meson.build
index 3b9d42023cb..a1deafde08c 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -28,7 +28,7 @@ endif
 if have_block or have_ga
   stub_ss.add(files('replay-tools.c'))
   # stubs for hooks in util/main-loop.c, util/async.c etc.
-  stub_ss.add(files('cpus-get-virtual-clock.c'))
+  stub_ss.add(files('cpus-virtual-clock.c'))
   stub_ss.add(files('icount.c'))
   stub_ss.add(files('graph-lock.c'))
   if linux_io_uring.found()
-- 
2.39.2




[PATCH v2 0/6] Implement icount=auto using TCG Plugins

2024-05-30 Thread Pierrick Bouvier
The goal here is to be able to temporally scale the execution of
qemu-user/system, using a given number of instructions per second.

We define a virtual clock that can lag behind or run ahead of real time.
When we are ahead, we slow execution (by sleeping) until real time catches
up.

Finally, we should be able to clean up icount=auto mode completely, and keep
icount usage for deterministic purposes only.

It is built upon new TCG Plugins inline ops (store + conditional callbacks), now
merged on master.

Example in user-mode:

- Retrieve number of instructions to execute /bin/true
$ ./build/qemu-x86_64 -plugin ./build/tests/plugin/libinsn.so -d plugin /bin/true
cpu 0 insns: 120546
total insns: 120546
- Slow execution to match 5 seconds
$ time ./build/qemu-x86_64 -plugin ./build/contrib/plugins/libips,ips=$((120546/5)) /bin/true
real    0m4.985s

v2
--

- Added missing personal Signed-off-by for commits from Alex
- Fix bad rebase in stubs/meson.build
- move qtest_{get,set}_virtual_clock to accel/qtest/qtest.c
- A race condition was identified for plugins init/idle/resume, but is not
  related to this series, and will be fixed in another one:
  
https://lore.kernel.org/qemu-devel/20240529152219.825680-1-alex.ben...@linaro.org/

Alex Bennée (4):
  sysemu: add set_virtual_time to accel ops
  qtest: use cpu interface in qtest_clock_warp
  sysemu: generalise qtest_warp_clock as qemu_clock_advance_virtual_time
  plugins: add time control API

Pierrick Bouvier (2):
  qtest: move qtest_{get,set}_virtual_clock to accel/qtest/qtest.c
  contrib/plugins: add ips plugin example for cost modeling

 include/qemu/qemu-plugin.h|  23 ++
 include/qemu/timer.h  |  15 ++
 include/sysemu/accel-ops.h|  18 +-
 include/sysemu/cpu-timers.h   |   3 +-
 include/sysemu/qtest.h|   2 -
 accel/qtest/qtest.c   |  13 +
 contrib/plugins/ips.c | 239 ++
 plugins/api.c |  31 +++
 ...t-virtual-clock.c => cpus-virtual-clock.c} |   5 +
 system/cpus.c |  11 +
 system/qtest.c|  37 +--
 util/qemu-timer.c |  26 ++
 contrib/plugins/Makefile  |   1 +
 plugins/qemu-plugins.symbols  |   2 +
 stubs/meson.build |   2 +-
 15 files changed, 389 insertions(+), 39 deletions(-)
 create mode 100644 contrib/plugins/ips.c
 rename stubs/{cpus-get-virtual-clock.c => cpus-virtual-clock.c} (68%)

-- 
2.39.2




Re: [PATCH RISU v2 06/13] risugen: Add sparc64 support

2024-05-30 Thread Richard Henderson

On 5/30/24 06:23, Peter Maydell wrote:

+sub open_asm($)
+{
+my ($basename) = @_;
+my $fname = $basename . ".s";
+open(ASM, ">", $fname) or die "can't open $fname: $!";
+select ASM;


I think that using Perl select like this is liable to be
rather confusing, because it has "action at a distance"
effects on every other print statement. I would prefer it
if we passed in the filehandle to the print statements
explicitly. (We can use a global if handing the filehandle
around to all the functions is annoying.)


I think I tried that and something didn't work when exporting or importing
the variable.  My perl fu is weak, so I probably made some trivial error.



r~



Re: [PATCH 2/5] qtest: use cpu interface in qtest_clock_warp

2024-05-30 Thread Pierrick Bouvier

On 5/29/24 23:32, Paolo Bonzini wrote:

On Fri, May 17, 2024 at 12:21 AM Pierrick Bouvier
 wrote:


From: Alex Bennée 

This generalises the qtest_clock_warp code to use the AccelOps
handlers for updating its own sense of time. This will make the next
patch, which moves the warp code, closer to pure code motion.

Signed-off-by: Alex Bennée 
Acked-by: Thomas Huth 
---
  include/sysemu/qtest.h | 1 +
  accel/qtest/qtest.c| 1 +
  system/qtest.c | 6 +++---
  3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/sysemu/qtest.h b/include/sysemu/qtest.h
index b5d5fd34637..45f3b7e1df5 100644
--- a/include/sysemu/qtest.h
+++ b/include/sysemu/qtest.h
@@ -36,6 +36,7 @@ void qtest_server_set_send_handler(void (*send)(void *, const 
char *),
  void qtest_server_inproc_recv(void *opaque, const char *buf);

  int64_t qtest_get_virtual_clock(void);
+void qtest_set_virtual_clock(int64_t count);


You can move qtest_get_virtual_clock/qtest_set_virtual_clock to
accel/qtest/qtest.c instead, and make them static.

They are not used anymore in system/qtest.c, and it actually makes a
lot more sense that they aren't.



Changed for next revision,
thanks


Paolo


  #endif

  #endif
diff --git a/accel/qtest/qtest.c b/accel/qtest/qtest.c
index f6056ac8361..53182e6c2ae 100644
--- a/accel/qtest/qtest.c
+++ b/accel/qtest/qtest.c
@@ -52,6 +52,7 @@ static void qtest_accel_ops_class_init(ObjectClass *oc, void 
*data)

  ops->create_vcpu_thread = dummy_start_vcpu_thread;
  ops->get_virtual_clock = qtest_get_virtual_clock;
+ops->set_virtual_clock = qtest_set_virtual_clock;
  };

  static const TypeInfo qtest_accel_ops_type = {
diff --git a/system/qtest.c b/system/qtest.c
index 6da58b3874e..ee8b139e982 100644
--- a/system/qtest.c
+++ b/system/qtest.c
@@ -332,14 +332,14 @@ int64_t qtest_get_virtual_clock(void)
  return qatomic_read_i64(&qtest_clock_counter);
  }

-static void qtest_set_virtual_clock(int64_t count)
+void qtest_set_virtual_clock(int64_t count)
  {
  qatomic_set_i64(&qtest_clock_counter, count);
  }

  static void qtest_clock_warp(int64_t dest)
  {
-int64_t clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
+int64_t clock = cpus_get_virtual_clock();
  AioContext *aio_context;
  assert(qtest_enabled());
  aio_context = qemu_get_aio_context();
@@ -348,7 +348,7 @@ static void qtest_clock_warp(int64_t dest)
QEMU_TIMER_ATTR_ALL);
  int64_t warp = qemu_soonest_timeout(dest - clock, deadline);

-qtest_set_virtual_clock(qtest_get_virtual_clock() + warp);
+cpus_set_virtual_clock(cpus_get_virtual_clock() + warp);

  qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL);
  timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]);
--
2.39.2





Re: [PATCH RISU v2 05/13] risugen: Be explicit about print destinations

2024-05-30 Thread Richard Henderson

On 5/30/24 05:51, Peter Maydell wrote:

@@ -87,13 +87,13 @@ sub progress_update($)
  my $barlen = int($proglen * $done / $progmax);
  if ($barlen != $lastprog) {
  $lastprog = $barlen;
-print "[" . "-" x $barlen . " " x ($proglen - $barlen) . "]\r";
+print STDOUT "[" . "-" x $barlen . " " x ($proglen - $barlen) . "]\r";
  }
  }

  sub progress_end()
  {
-print "[" . "-" x $proglen . "]\n";
+print STDOUT "[" . "-" x $proglen . "]\n";
  $| = 0;
  }


These are the progress-bar indicators -- shouldn't they go to STDERR,
not STDOUT, if we're going to be careful about where we send output?


Why?  I would think that only errors would go to stderr...




@@ -163,20 +163,20 @@ sub dump_insn_details($$)
  {
  # Dump the instruction details for one insn
  my ($insn, $rec) = @_;
-print "insn $insn: ";
+print STDOUT "insn $insn: ";
  my $insnwidth = $rec->{width};
  my $fixedbits = $rec->{fixedbits};
  my $fixedbitmask = $rec->{fixedbitmask};
  my $constraint = $rec->{blocks}{"constraints"};
-print sprintf(" insnwidth %d fixedbits %08x mask %08x ", $insnwidth, 
$fixedbits, $fixedbitmask);
+print STDOUT sprintf(" insnwidth %d fixedbits %08x mask %08x ", 
$insnwidth, $fixedbits, $fixedbitmask);
  if (defined $constraint) {
-print "constraint $constraint ";
+print STDOUT "constraint $constraint ";
  }
  for my $tuple (@{ $rec->{fields} }) {
  my ($var, $pos, $mask) = @$tuple;
-print "($var, $pos, " . sprintf("%08x", $mask) . ") ";
+print STDOUT "($var, $pos, " . sprintf("%08x", $mask) . ") ";
  }
-print "\n";
+print STDOUT "\n";
  }


As a debug-print helper routine maybe this should be STDERR too?


Likewise?


r~




Re: [RFC PATCH] cpus: split qemu_init_vcpu and delay vCPU thread creation

2024-05-30 Thread Pierrick Bouvier

On 5/29/24 08:22, Alex Bennée wrote:

This ensures we don't start the thread until cpu_common_realizefn has
finished, so that plugins will always run qemu_plugin_vcpu_init__async
before any other state callbacks. It doesn't totally eliminate the race
that plugin_cpu_update__locked has to work around, though. I found this
while reviewing the ips plugin, which makes heavy use of the vcpu phase
callbacks.

An alternative might be to move the explicit creation of vCPU threads
to qdev_machine_creation_done()? It doesn't affect user-mode which
already has a thread to execute in and ensures the QOM object has
completed creation in cpu_create() before continuing.

Signed-off-by: Alex Bennée 
Cc: Pierrick Bouvier 
Cc: Philippe Mathieu-Daudé 
---
  include/hw/core/cpu.h  |  8 
  accel/tcg/user-exec-stub.c |  5 +
  hw/core/cpu-common.c   |  7 ++-
  plugins/core.c |  5 +
  system/cpus.c  | 15 ++-
  5 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index bb398e8237..6920699585 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -1041,6 +1041,14 @@ void end_exclusive(void);
   */
  void qemu_init_vcpu(CPUState *cpu);
  
+/**
+ * qemu_start_vcpu:
+ * @cpu: The vCPU to start.
+ *
+ * Create the vCPU thread and start it running.
+ */
+void qemu_start_vcpu(CPUState *cpu);
+
  #define SSTEP_ENABLE  0x1  /* Enable simulated HW single stepping */
  #define SSTEP_NOIRQ   0x2  /* Do not use IRQ while single stepping */
  #define SSTEP_NOTIMER 0x4  /* Do not Timers while single stepping */
diff --git a/accel/tcg/user-exec-stub.c b/accel/tcg/user-exec-stub.c
index 4fbe2dbdc8..162bb72bbe 100644
--- a/accel/tcg/user-exec-stub.c
+++ b/accel/tcg/user-exec-stub.c
@@ -18,6 +18,11 @@ void cpu_exec_reset_hold(CPUState *cpu)
  {
  }
  
+void qemu_start_vcpu(CPUState *cpu)

+{
+/* NOP for user-mode, we already have a thread */
+}
+
  /* User mode emulation does not support record/replay yet.  */
  
  bool replay_exception(void)

diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c
index 0f0a247f56..68895ddd59 100644
--- a/hw/core/cpu-common.c
+++ b/hw/core/cpu-common.c
@@ -230,7 +230,12 @@ static void cpu_common_realizefn(DeviceState *dev, Error 
**errp)
  }
  #endif
  
-/* NOTE: latest generic point where the cpu is fully realized */

+/*
+ * With everything set up we can finally start the vCPU thread.
+ * This is a NOP for linux-user.
+ * NOTE: latest generic point where the cpu is fully realized
+ */
+qemu_start_vcpu(cpu);
  }
  
  static void cpu_common_unrealizefn(DeviceState *dev)

diff --git a/plugins/core.c b/plugins/core.c
index 0726bc7f25..1e5da7853b 100644
--- a/plugins/core.c
+++ b/plugins/core.c
@@ -65,6 +65,11 @@ static void plugin_cpu_update__locked(gpointer k, gpointer 
v, gpointer udata)
  CPUState *cpu = container_of(k, CPUState, cpu_index);
  run_on_cpu_data mask = RUN_ON_CPU_HOST_ULONG(*plugin.mask);
  
+/*

+ * There is a race condition between the starting of the vCPU
+ * thread at the end of cpu_common_realizefn and when realized is
+ * finally set.
+ */


Could we simply have an active wait here?
while (!DEVICE(cpu)->realized) {}

We have a guarantee it will be realized shortly, and if it's too hard to 
have a proper synchronization mechanism (introduce a realize_cond?), 
then waiting for the proper state does not seem too bad.


It's a bit strange for me to document an existing race condition, 
instead of finding a solution.



  if (DEVICE(cpu)->realized) {
  async_run_on_cpu(cpu, plugin_cpu_update__async, mask);
  } else {
diff --git a/system/cpus.c b/system/cpus.c
index d3640c9503..7dd8464c5e 100644
--- a/system/cpus.c
+++ b/system/cpus.c
@@ -488,11 +488,13 @@ void cpus_kick_thread(CPUState *cpu)
  
  void qemu_cpu_kick(CPUState *cpu)

  {
-qemu_cond_broadcast(cpu->halt_cond);
-if (cpus_accel->kick_vcpu_thread) {
-cpus_accel->kick_vcpu_thread(cpu);
-} else { /* default */
-cpus_kick_thread(cpu);
+if (cpu->halt_cond) {
+qemu_cond_broadcast(cpu->halt_cond);
+if (cpus_accel->kick_vcpu_thread) {
+cpus_accel->kick_vcpu_thread(cpu);
+} else { /* default */
+cpus_kick_thread(cpu);
+}
  }
  }
  
@@ -674,7 +676,10 @@ void qemu_init_vcpu(CPUState *cpu)

  cpu->num_ases = 1;
  cpu_address_space_init(cpu, 0, "cpu-memory", cpu->memory);
  }
+}
  
+void qemu_start_vcpu(CPUState *cpu)

+{
  /* accelerators all implement the AccelOpsClass */
  g_assert(cpus_accel != NULL && cpus_accel->create_vcpu_thread != NULL);
  cpus_accel->create_vcpu_thread(cpu);


Re: [PATCH 1/5] sysemu: add set_virtual_time to accel ops

2024-05-30 Thread Pierrick Bouvier

On 5/29/24 23:30, Paolo Bonzini wrote:

On Fri, May 17, 2024 at 12:21 AM Pierrick Bouvier
 wrote:

diff --git a/stubs/meson.build b/stubs/meson.build
index 3b9d42023cb..672213b7482 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -3,6 +3,11 @@
  # below, so that it is clear who needs the stubbed functionality.

  stub_ss.add(files('cpu-get-clock.c'))
+stub_ss.add(files('cpus-virtual-clock.c'))
+stub_ss.add(files('qemu-timer-notify-cb.c'))
+stub_ss.add(files('icount.c'))
+stub_ss.add(files('dump.c'))
+stub_ss.add(files('error-printf.c'))


Was there a problem when rebasing?

Paolo



Hi Paolo,

Yes, Philippe made the same comment. I'll fix it in next revision.
Thanks!


  stub_ss.add(files('fdset.c'))
  stub_ss.add(files('iothread-lock.c'))
  stub_ss.add(files('is-daemonized.c'))
@@ -28,7 +33,6 @@ endif
  if have_block or have_ga
stub_ss.add(files('replay-tools.c'))
# stubs for hooks in util/main-loop.c, util/async.c etc.
-  stub_ss.add(files('cpus-get-virtual-clock.c'))
stub_ss.add(files('icount.c'))
stub_ss.add(files('graph-lock.c'))
if linux_io_uring.found()
--
2.39.2





Re: [PATCH V1 00/26] Live update: cpr-exec

2024-05-30 Thread Steven Sistare via

On 5/28/2024 12:42 PM, Peter Xu wrote:

On Tue, May 28, 2024 at 11:10:27AM -0400, Steven Sistare wrote:

On 5/27/2024 1:45 PM, Peter Xu wrote:

On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote:

I understand, thanks.  If I can help with any of your todo list,
just ask - steve


Thanks for offering the help, Steve.  Started looking at this today, then I
found that I miss something high-level.  Let me ask here, and let me
apologize already for starting to throw multiple questions..

IIUC the whole idea of this patchset is to allow efficient QEMU upgrade, in
this case not host kernel but QEMU-only, and/or upper.

Is there any justification on why the complexity is needed here?  It looks
to me this one is more involved than cpr-reboot, so I'm thinking how much
we can get from the complexity, and whether it's worthwhile.  1000+ LOC is
the min support, and if we even expect more to come, that's really
important, IMHO.

For example, what's the major motivation of this whole work?  Is that more
on performance, or is it more for supporting the special devices like VFIO
which we used to not support, or something else?  I can't find them in
whatever cover letter I can find, including this one.

Firstly, regarding performance, IMHO it'll be always nice to share even
some very fundamental downtime measurement comparisons using the new exec
mode v.s. the old migration ways to upgrade QEMU binary.  Do you perhaps
have some number on hand when you started working on this feature years
ago?  Or maybe some old links on the list would help too, as I didn't
follow this work since the start.

On VFIO, IIUC you started out this project without VFIO migration being
there.  Now we have VFIO migration so not sure how much it would work for
the upgrade use case. Even with current VFIO migration, we may not want to
migrate device states for a local upgrade I suppose, as that can be a lot
depending on the type of device assigned.  However it'll be nice to discuss
this too if this is the major purpose of the series.

I think one other challenge on QEMU upgrade with VFIO devices is that the
dest QEMU won't be able to open the VFIO device when the src QEMU is still
using it as the owner.  IIUC this is a similar condition where QEMU wants
to have proper ownership transfer of a shared block device, and AFAIR right
now we resolved that issue using some form of file lock on the image file.
In this case it won't easily apply to a VFIO dev fd, but maybe we still
have other approaches, not sure whether you investigated any.  E.g. could
the VFIO handle be passed over using unix scm rights?  I think this might
remove one dependency of using exec which can cause quite some difference
v.s. a generic migration (from which regard, cpr-reboot is still a pretty
generic migration).

You also mentioned vhost/tap, is that also a major goal of this series in
the follow up patchsets?  Is this a problem only because this solution will
do exec?  Can it work if either the exec()ed qemu or dst qemu create the
vhost/tap fds when boot?

Meanwhile, could you elaborate a bit on the implication on chardevs?  From
what I read in the doc update it looks like a major part of work in the
future, but I don't yet understand the issue..  Is it also relevant to the
exec() approach?

In all cases, some of such discussion would be really appreciated.  And if
you used to consider other approaches to solve this problem it'll be great
to mention how you chose this way.  Considering this work contains too many
things, it'll be nice if such discussion can start with the fundamentals,
e.g. on why exec() is a must.


The main goal of cpr-exec is providing a fast and reliable way to update
qemu. cpr-reboot is not fast enough or general enough.  It requires the
guest to support suspend and resume for all devices, and that takes seconds.
If one actually reboots the host, that adds more seconds, depending on
system services.  cpr-exec takes 0.1 secs, and works every time, unlike
migration which can fail to converge on a busy system.  Live migration
also consumes more system and network resources.


Right, but note that when I was thinking of a comparison between cpr-exec
v.s. normal migration, I didn't mean a "normal live migration".  I think
it's more of the case whether exec() can be avoided.  I had a feeling that
this exec() will cause a major part of work elsewhere but maybe I am wrong
as I didn't see the whole branch.


The only parts of this work that are specific to exec are these patches
and the qemu_clear_cloexec() calls in cpr.c.
  vl: helper to request re-exec
  migration: precreate vmstate for exec

The rest would be the same if some other mechanism were used to start
new qemu.   Additional code would be needed for the new mechanism, such
as SCM_RIGHTS sends.


AFAIU, "cpr-exec takes 0.1 secs" is a conditional result.  I think it at
least should be relevant to what devices are attached to the VM, right?

E.g., I observed at least two things that can dras

Re: [PATCH V1 19/26] physmem: preserve ram blocks for cpr

2024-05-30 Thread Steven Sistare via

On 5/29/2024 3:25 PM, Peter Xu wrote:

On Wed, May 29, 2024 at 01:31:53PM -0400, Steven Sistare wrote:

On 5/28/2024 5:44 PM, Peter Xu wrote:

On Mon, Apr 29, 2024 at 08:55:28AM -0700, Steve Sistare wrote:

Preserve fields of RAMBlocks that allocate their host memory during CPR so
the RAM allocation can be recovered.


This sentence itself did not explain much, IMHO.  QEMU can share memory
using fd based memory already of all kinds, as long as the memory backend
is path-based it can be shared by sharing the same paths to dst.

This reads very confusing as a generic concept.  I mean, QEMU migration
relies on so many things to work right.  We mostly asks the users to "use
exactly the same cmdline for src/dst QEMU unless you know what you're
doing", otherwise many things can break.  That should also include ramblock
being matched between src/dst due to the same cmdlines provided on both
sides.  It'll be confusing to mention this when we thought the ramblocks
also rely on that fact.

So IIUC this sentence should be dropped in the real patch, and I'll try to
guess the real reason with below..


The properties of the implicitly created ramblocks must be preserved.
The defaults can and do change between qemu releases, even when the command-line
parameters do not change for the explicit objects that cause these implicit
ramblocks to be created.


AFAIU, QEMU relies on ramblocks to be the same before this series.  Do you
have an example?  Would that already cause issue when migrate?


Alignment has changed, and used_length vs max_length changed when
resizeable ramblocks were introduced.  I have dealt with these issues
while supporting cpr for our internal use, and the learned lesson is to
explicitly communicate the creation-time parameters to new qemu.

These are not an issue for migration because the ramblock is re-created
and the data copied into the new memory.


Mirror the mr->align field in the RAMBlock to simplify the vmstate.
Preserve the old host address, even though it is immediately discarded,
as it will be needed in the future for CPR with iommufd.  Preserve
guest_memfd, even though CPR does not yet support it, to maintain vmstate
compatibility when it becomes supported.


.. It could be about the vfio vaddr update feature that you mentioned and
only for iommufd (as IIUC vfio still relies on iova ranges, then it won't
help here)?

If so, IMHO we should have this patch (or any variance form) to be there
for your upcoming vfio support.  Keeping this around like this will make
the series harder to review.  Or is it needed even before VFIO?


This patch is needed independently of vfio or iommufd.

guest_memfd is independent of vfio or iommufd.  It is a recent addition
which I have not tried to support, but I added this placeholder field
to it can be supported in the future without adding a new field later
and maintaining backwards compatibility.


Is guest_memfd the only user so far, then?  If so, would it be possible we
split it as a separate effort on top of the base cpr-exec support?


I don't understand the question.  I am indeed deferring support for guest_memfd
to a future time.  For now, I am adding a blocker, and reserving a field for
it in the preserved ramblock attributes, to avoid adding a subsection later.

- Steve



Re: [PATCH V1 07/26] migration: VMStateId

2024-05-30 Thread Steven Sistare via

On 5/29/2024 2:53 PM, Peter Xu wrote:

On Wed, May 29, 2024 at 01:30:18PM -0400, Steven Sistare wrote:

How about a more general name for the type:

migration/misc.h
 typedef char (MigrationId)[256];


How about qemu/typedefs.h?  Not sure whether it's applicable. Markus (in
the loop) may have a better idea.

Meanwhile, s/MigrationID/IDString/?


typedefs.h has a different purpose; giving short names to types
defined in internal include files.

This id is specific to migration, so I still think its name should reflect
migration and it belongs in some include/migration/*.h file.

ramblocks and migration are already closely related.  There is nothing wrong
with including a migration header in ramblock.h so it can use a migration type.
We already have:
  include/hw/acpi/ich9_tco.h:#include "migration/vmstate.h"
  include/hw/display/ramfb.h:#include "migration/vmstate.h"
  include/hw/input/pl050.h:#include "migration/vmstate.h"
  include/hw/pci/shpc.h:#include "migration/vmstate.h"
  include/hw/virtio/virtio.h:#include "migration/vmstate.h"
  include/hw/hyperv/vmbus.h:#include "migration/vmstate.h"

The 256 byte magic length already appears in too many places, and my code
would add more occurrences, so I really think that abstracting this type
would be cleaner.

- Steve


exec/ramblock.h
 struct RAMBlock {
 MigrationId idstr;

migration/savevm.c
 typedef struct CompatEntry {
 MigrationId idstr;

 typedef struct SaveStateEntry {
 MigrationId idstr;


- Steve







Re: [PATCH V1 17/26] machine: memfd-alloc option

2024-05-30 Thread Steven Sistare via

On 5/29/2024 3:14 PM, Peter Xu wrote:

On Wed, May 29, 2024 at 01:31:38PM -0400, Steven Sistare wrote:

diff --git a/system/memory.c b/system/memory.c
index 49f1cb2..ca04a0e 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -1552,8 +1552,9 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
 uint64_t size,
 Error **errp)
   {
+uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;


If there's a machine option to "use memfd for allocations", then it's
shared mem... Hmm..

It is a bit confusing to me in quite a few levels:

- Why memory allocation method will be defined by a machine property,
  even if we have memory-backend-* which should cover everything?


Some memory regions are implicitly created, and have no explicit representation
on the qemu command line.  memfd-alloc affects those.

More generally, memfd-alloc affects all ramblock allocations that are
not explicitly represented by memory-backend object.  Thus the simple
command line "qemu -m 1G" does not explicitly describe an object, so it
goes through the anonymous allocation path, and is affected by memfd-alloc.


Can we simply now allow "qemu -m 1G" to work for cpr-exec?  


I assume you meant "simply not allow".

Yes, I could do that, but I would need to explicitly add code to exclude this
case, and add a blocker.  Right now it "just works" for all paths that lead to
ram_block_alloc_host, without any special logic at the memory-backend level.
And, I'm not convinced that simplifies the docs, as now I would need to tell
the user that "-m 1G" and similar constructions do not work with cpr.

I can try to clarify the doc for -memfd-alloc as currently defined.


AFAIU that's
what we do with cpr-reboot: we ask the user to specify the right things to
make other things work.  Otherwise it won't.



Internally, create_default_memdev does create a memory-backend object.
That is what my doc comment above refers to:
   Any associated memory-backend objects are created with share=on

An explicit "qemu -object memory-backend-*" is not affected by memfd-alloc.

The qapi comments in patch "migration: cpr-exec mode" attempt to say all that:

+# Memory backend objects must have the share=on attribute, and
+# must be mmap'able in the new QEMU process.  For example,
+# memory-backend-file is acceptable, but memory-backend-ram is
+# not.
+#
+# The VM must be started with the '-machine memfd-alloc=on'
+# option.  This causes implicit ram blocks -- those not explicitly
+# described by a memory-backend object -- to be allocated by
+# mmap'ing a memfd.  Examples include VGA, ROM, and even guest
+# RAM when it is specified without a memory-backend object.


VGA is IIRC 16MB chunk, ROM is even smaller.  If the user specifies -object
memory-backend-file,share=on propertly, these should be the only outliers?

Are these important enough for the downtime?  Can we put them into the
migrated image alongside with the rest device states?


It's not about downtime.  vfio, vdpa, and iommufd pin all guest pages.
The pages must remain pinned during CPR to support ongoing DMA activity
which could target those pages (which we do not quiesce), and the same
physical pages must be used for the ramblocks in the new qemu process.


- Even if we have such a machine property, why setting "memfd" will
  always imply shared?  why not private?  After all it's not called
  "memfd-shared-alloc", and we can create private mappings using
  e.g. memory-backend-memfd,share=off.


There is no use case for memfd-alloc with share=off, so no point IMO in
making the option more verbose.


Unfortunately this fact doesn't make the property easier to understand. :-(

For cpr, the mapping with all its modifications must be visible to new
qemu when qemu mmaps it.


So this might be the important part - do you mean migrating
VGA/ROM/... small ramblocks won't work (besides any performance concerns)?
Could you elaborate?


Pinning.


Cpr-reboot already introduced lots of tricky knobs to QEMU.  We may need to
restrict that specialty to minimal, making the interfacing as clear as
possible, or (at least migration) maintainers will start to be soon scared
and running away, if such proposal was not shot down.

In short, I hope when we introduce new knobs for cpr, we shouldn't always
keep cpr-* modes in mind, but consider whenever the user can use it without
cpr-*.  I'm not sure whether it'll be always possible, but we should try.


I agree in principle.  FWIW, I have tried to generalize the functionality needed
by cpr so it can be used in other ways: per-mode blockers, per-mode notifiers,
precreate vmstate, factory objects; to base it on migration internals with
minimal change (vmstate); and to make minimal changes in the migration control
paths.

- Steve



Re: [PATCH V1 05/26] migration: precreate vmstate

2024-05-30 Thread Steven Sistare via

On 5/29/2024 2:39 PM, Peter Xu wrote:

On Tue, May 28, 2024 at 11:09:49AM -0400, Steven Sistare wrote:

On 5/27/2024 2:16 PM, Peter Xu wrote:

On Mon, Apr 29, 2024 at 08:55:14AM -0700, Steve Sistare wrote:

Provide the VMStateDescription precreate field to mark objects that must
be loaded on the incoming side before devices have been created, because
they provide properties that will be needed at creation time.  They will
be saved to and loaded from their own QEMUFile, via
qemu_savevm_precreate_save and qemu_savevm_precreate_load, but these
functions are not yet called in this patch.  Allow them to be called
before or after normal migration is active, when current_migration and
current_incoming are not valid.

Signed-off-by: Steve Sistare 
---
   include/migration/vmstate.h |  6 
   migration/savevm.c  | 69 
+
   migration/savevm.h  |  3 ++
   3 files changed, 73 insertions(+), 5 deletions(-)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 294d2d8..4691334 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -198,6 +198,12 @@ struct VMStateDescription {
* a QEMU_VM_SECTION_START section.
*/
   bool early_setup;
+
+/*
+ * Send/receive this object in the precreate migration stream.
+ */
+bool precreate;
+
   int version_id;
   int minimum_version_id;
   MigrationPriority priority;
diff --git a/migration/savevm.c b/migration/savevm.c
index 9789823..a30bcd9 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -239,6 +239,7 @@ static SaveState savevm_state = {
   #define SAVEVM_FOREACH(se, entry)\
   QTAILQ_FOREACH(se, &savevm_state.handlers, entry)\
+if (!se->vmsd || !se->vmsd->precreate)
   #define SAVEVM_FOREACH_ALL(se, entry)\
   QTAILQ_FOREACH(se, &savevm_state.handlers, entry)
@@ -1006,13 +1007,19 @@ static void save_section_header(QEMUFile *f, 
SaveStateEntry *se,
   }
   }
+static bool send_section_footer(SaveStateEntry *se)
+{
+return (se->vmsd && se->vmsd->precreate) ||
+   migrate_get_current()->send_section_footer;
+}


Does the precreate vmsd "require" the footer?  Or it should also work?
IMHO it's less optimal to bind features without good reasons.


It is not required.  However, IMO we should not treat send-section-footer as
a fungible feature.  It is strictly an improvement, as it was added to catch
misformatted sections.  It is only registered as a feature for backwards
compatibility with qemu 2.3 and xen.

For a brand new data stream such as precreate, where we are not constrained
by backwards compatibility, we should unconditionally use the better protocol,
and always send the footer.


I see your point, but I still don't think we should mangle these things.
It makes future justification harder on whether section footer should be
sent.

Take example of whatever new feature added for migration like mapped-ram,
we might also want to enforce it by adding "return migrate_mapped_ram() ||
..." but it means we keep growing this with no benefit.

What you worry on "what if this is turned off" isn't a real one: nobody
will turn it off!  We started to deprecate machines, and I had a feeling
it will be enforced at some point by default.


That's fine, I'll delete the send_section_footer() function.

- Steve



Re: [PATCH v2 03/18] tests/qtest/migration: Add a precopy file test with fdset

2024-05-30 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:33PM -0300, Fabiano Rosas wrote:
> Add a test for file migration using fdset. The passing of fds is more
> complex than using a file path. This is also the scenario where it's
> most important we ensure that the initial migration stream offset is
> respected because the fdset interface is the one used by the
> management layer when providing a non empty migration file.
> 
> Note that fd passing is not available on Windows, so anything that
> uses add-fd needs to exclude that platform.
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH V2 0/3] improve -overcommit cpu-pm=on|off

2024-05-30 Thread Chen, Zide



On 5/30/2024 6:54 AM, Zhao Liu wrote:
> Hi Zide,
> 
> On Wed, May 29, 2024 at 10:31:21AM -0700, Chen, Zide wrote:
>> Date: Wed, 29 May 2024 10:31:21 -0700
>> From: "Chen, Zide" 
>> Subject: Re: [PATCH V2 0/3] improve -overcommit cpu-pm=on|off
>>
>>
>>
>> On 5/29/2024 5:46 AM, Igor Mammedov wrote:
>>> On Tue, 28 May 2024 11:16:59 -0700
>>> "Chen, Zide"  wrote:
>>>
 On 5/28/2024 2:23 AM, Igor Mammedov wrote:
> On Fri, 24 May 2024 13:00:14 -0700
> Zide Chen  wrote:
>   
>> Currently, if running "-overcommit cpu-pm=on" on hosts that don't
>> have MWAIT support, the MWAIT/MONITOR feature is advertised to the
>> guest and executing MWAIT/MONITOR on the guest triggers #UD.  
>
> this is missing proper description how do you trigger issue
> with reproducer and detailed description why guest sees MWAIT
> when it's not supported by host.  

 If "overcommit cpu-pm=on" and "-cpu host" are present, as shown in the
>>> it's better to provide full QEMU CLI and host/guest kernels used and what
>>> hardware was used if it's relevant so others can reproduce problem.
>>
>> I ever reproduced this on an older Intel Icelake machine, a
>> Sapphire Rapids and a Sierra Forest, but I believe this is a x86 generic
>> issue, not specific to particular models.
>>
>> For the CLI, I think the only command line options that matter are
>>  -overcommit cpu-pm=on: to set enable_cpu_pm
>>  -cpu host: so that cpu->max_features is set
>>
>> For QEMU version, as long as it's after this commit: 662175b91ff2
>> ("i386: reorder call to cpu_exec_realizefn")
>>
>> The guest fails to boot:
>>
>> [ 24.825568] smpboot: x86: Booting SMP configuration:
>> [ 24.826377]  node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12
>> #13 #14 #15 #17
>> [ 24.985799]  node #1, CPUs: #128 #129 #130 #131 #132 #133 #134 #135
>> #136 #137 #138 #139 #140 #141 #142 #143 #145
>> [ 25.136955] invalid opcode:  1 PREEMPT SMP NOPTI
>> [ 25.137790] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0 #2
>> [ 25.137790] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
>> rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/04
>> [ 25.137790] RIP: 0010:mwait_idle+0x35/0x80
>> [ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15
>> 47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8
>> [ 25.137790] RSP: :91403e70 EFLAGS: 00010046
>> [ 25.137790] RAX: 9140a980 RBX: 9140a980 RCX:
>> 
>> [ 25.137790] RDX:  RSI: 97f1ade21b20 RDI:
>> 0004
>> [ 25.137790] RBP:  R08: 0005da4709cb R09:
>> 0001
>> [ 25.137790] R10: 5da4 R11: 0009 R12:
>> 
>> [ 25.137790] R13: 98573ff90fc0 R14: 9140a038 R15:
>> 00093ff0
>> [ 25.137790] FS: () GS:97f1ade0()
>> knlGS:
>> [ 25.137790] CS: 0010 DS:  ES:  CR0: 80050033
>> [ 25.137790] CR2: 97d8aa801000 CR3: 0049e9430001 CR4:
>> 00770ef0
>> [ 25.137790] DR0:  DR1:  DR2:
>> 
>> [ 25.137790] DR3:  DR6: 07f0 DR7:
>> 0400
>> [ 25.137790] PKRU: 5554
>> [ 25.137790] Call Trace:
>> [ 25.137790] 
>> [ 25.137790] ? die+0x37/0x90
>> [ 25.137790] ? do_trap+0xe3/0x110
>> [ 25.137790] ? mwait_idle+0x35/0x80
>> [ 25.137790] ? do_error_trap+0x6a/0x90
>> [ 25.137790] ? mwait_idle+0x35/0x80
>> [ 25.137790] ? exc_invalid_op+0x52/0x70
>> [ 25.137790] ? mwait_idle+0x35/0x80
>> [ 25.137790] ? asm_exc_invalid_op+0x1a/0x20
>> [ 25.137790] ? mwait_idle+0x35/0x80
>> [ 25.137790] default_idle_call+0x30/0x100
>> [ 25.137790] cpuidle_idle_call+0x12c/0x170
>> [ 25.137790] ? tsc_verify_tsc_adjust+0x73/0xd0
>> [ 25.137790] do_idle+0x7f/0xd0
>> [ 25.137790] cpu_startup_entry+0x29/0x30
>> [ 25.137790] rest_init+0xcc/0xd0
>> [ 25.137790] start_kernel+0x396/0x5d0
>> [ 25.137790] x86_64_start_reservations+0x18/0x30
>> [ 25.137790] x86_64_start_kernel+0xe7/0xf0
>> [ 25.137790] common_startup_64+0x13e/0x148
>> [ 25.137790] 
>> [ 25.137790] Modules linked in:
>> [ 25.137790] --[ end trace  ]--
>> [ 25.137790] invalid opcode:  2 PREEMPT SMP NOPTI
>> [ 25.137790] RIP: 0010:mwait_idle+0x35/0x80
>> [ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15
>> 47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8
>>
>>>
 following, CPUID_EXT_MONITOR is set after x86_cpu_filter_features(), so
 that it doesn't have a chance to check MWAIT against host features and
 will be advertised to the guest regardless of whether it's supported by
 the host or not.

 x86_cpu_realizefn()
   x86_cpu_filter_features()
   cpu_exec_realizefn()
 kvm_cpu_realizefn
   host_cpu_realizefn
 host_cpu_enable_cpu_pm
   env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR;


 If it's not supported by the host, executi

Re: [PATCH v2 02/18] tests/qtest/migration: Fix file migration offset check

2024-05-30 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:32PM -0300, Fabiano Rosas wrote:
> When doing file migration, QEMU accepts an offset that should be
> skipped when writing the migration stream to the file. The purpose of
> the offset is to allow the management layer to put its own metadata at
> the start of the file.
> 
> We have tests for this in migration-test, but only testing that the
> migration stream starts at the correct offset and not that it actually
> leaves the data intact. Unsurprisingly, there's been a bug in that
> area that the tests didn't catch.
> 
> Fix the tests to write some data to the offset region and check that
> it's actually there after the migration.
> 
> While here, switch to using g_get_file_contents() which is more
> portable than mmap().
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v2 01/18] migration: Fix file migration with fdset

2024-05-30 Thread Peter Xu
On Thu, May 23, 2024 at 04:05:31PM -0300, Fabiano Rosas wrote:
> When the "file:" migration support was added we missed the special
> case in the qemu_open_old implementation that allows for a particular
> file name format to be used to refer to a set of file descriptors that
> have been previously provided to QEMU via the add-fd QMP command.
> 
> When using this fdset feature, we should not truncate the migration
> file because being given an fd means that the management layer is in
> control of the file and will likely already have some data written to
> it. This is further indicated by the presence of the 'offset'
> argument, which indicates the start of the region where QEMU is
> allowed to write.
> 
> Fix the issue by replacing the O_TRUNC flag on open by an ftruncate
> call, which will take the offset into consideration.
> 
> Fixes: 385f510df5 ("migration: file URI offset")
> Suggested-by: Daniel P. Berrangé 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

Right below the change, did we forget to free the ioc if
qio_channel_io_seek() failed?

> ---
>  migration/file.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/file.c b/migration/file.c
> index ab18ba505a..ba5b5c44ff 100644
> --- a/migration/file.c
> +++ b/migration/file.c
> @@ -84,12 +84,19 @@ void file_start_outgoing_migration(MigrationState *s,
>  
>  trace_migration_file_outgoing(filename);
>  
> -fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY | O_TRUNC,
> - 0600, errp);
> +fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY, 0600, 
> errp);
>  if (!fioc) {
>  return;
>  }
>  
> +if (ftruncate(fioc->fd, offset)) {
> +error_setg_errno(errp, errno,
> + "failed to truncate migration file to offset %" 
> PRIx64,
> + offset);
> +object_unref(OBJECT(fioc));
> +return;
> +}
> +
>  outgoing_args.fname = g_strdup(filename);
>  
>  ioc = QIO_CHANNEL(fioc);
> -- 
> 2.35.3
> 

-- 
Peter Xu




Re: hw/usb/hcd-ohci: Fix #1510, #303: pid not IN or OUT

2024-05-30 Thread Cord Amfmgm
On Thu, May 30, 2024 at 3:33 AM Alex Bennée  wrote:

> Cord Amfmgm  writes:
>
> > On Tue, May 28, 2024 at 11:32 AM Peter Maydell 
> wrote:
> >
> >  On Tue, 28 May 2024 at 16:37, Cord Amfmgm  wrote:
> >  >
> >  > On Tue, May 28, 2024 at 9:03 AM Peter Maydell <
> peter.mayd...@linaro.org> wrote:
> >  >>
> >  >> On Mon, 20 May 2024 at 23:24, Cord Amfmgm 
> wrote:
> >  >> > On Mon, May 20, 2024 at 12:05 PM Peter Maydell <
> peter.mayd...@linaro.org> wrote:
> 
> >  >> > And here's an example buffer of length 0 -- you probably already
> know what I'm going to do here:
> >  >> >
> >  >> > char buf[0];
> >  >> > char * CurrentBufferPointer = &buf[0];
> >  >> > char * BufferEnd = &buf[-1]; // "address of the last byte in the
> buffer"
> >  >> > // The OHCI Host Controller than advances CurrentBufferPointer
> like this: CurrentBufferPointer += 0
> >  >> > // After the transfer:
> >  >> > // CurrentBufferPointer = &buf[0];
> >  >> > // BufferEnd = &buf[-1];
> >  >>
> >  >> Right, but why do you think this is valid, rather than
> >  >> being a guest software bug? My reading of the spec is that it's
> >  >> pretty clear about how to say "zero length buffer", and this
> >  >> isn't it.
> >  >>
> >  >> Is there some real-world guest OS that programs the OHCI
> >  >> controller this way that we're trying to accommodate?
> >  >
> >  >
> >  > qemu versions 4.2 and before allowed this behavior.
> >
> >  So? That might just mean we had a bug and we fixed it.
> >  4.2 is a very old version of QEMU and nobody seems to have
> >  complained in the four years since we released 5.0 about this,
> >  which suggests that generally guest OS drivers don't try
> >  to send zero-length buffers in this way.
> >
> >  > I don't think it's valid to ask for a *popular* guest OS as a
> proof-of-concept because I'm not an expert on those.
> >
> >  I didn't ask for "popular"; I asked for "real-world".
> >  What is the actual guest code you're running that falls over
> >  because of the behaviour change?
> >
> >  More generally, why do you want this behaviour to be
> >  changed? Reasonable reasons might include:
> >   * we're out of spec based on reading the documentation
> >   * you're trying to run some old Windows VM/QNX/etc image,
> > and it doesn't work any more
> >   * all the real hardware we tested behaves this way
> >
> >  But don't necessarily include:
> >   * something somebody wrote and only tested on QEMU happens to
> > assume the old behaviour rather than following the hw spec
> >
> >  QEMU occasionally works around guest OS bugs, but only as
> >  when we really have to. It's usually better to fix the
> >  bug in the guest.
> >
> > It's not, and I've already demonstrated that real hardware is consistent
> with the fix in this patch.
> >
> > Please check your tone.
>
> I don't think that is a particularly helpful comment for someone who is
> taking the time to review your patches. Reading through the thread I
> didn't see anything that said this is how real HW behaves but I may well
> have missed it. However you have a number of review comments to address
> so I suggest you spin a v2 of the series to address them and outline the
> reason to accept an out of spec transaction.
>
>
I did a rework of the patch -- see my email from May 20, quoted below --
and I was under the impression it addressed all the review comments. Did I
miss something? I apologize if I did.

> index acd6016980..71b54914d3 100644
> --- a/hw/usb/hcd-ohci.c
> +++ b/hw/usb/hcd-ohci.c
> @@ -941,8 +941,8 @@ static int ohci_service_td(OHCIState *ohci, struct
ohci_ed *ed)
>  if ((td.cbp & 0xf000) != (td.be & 0xf000)) {
>  len = (td.be & 0xfff) + 0x1001 - (td.cbp & 0xfff);
>  } else {
> -if (td.cbp > td.be) {
> -trace_usb_ohci_iso_td_bad_cc_overrun(td.cbp, td.be);
> +if (td.cbp - 1 > td.be) {  /* rely on td.cbp != 0 */


> Reading through the thread I didn't see anything that said this is how
real HW behaves but I may well have missed it.

This is what I wrote regarding real HW:

Results are:

 qemu 4.2   | qemu HEAD  | actual HW
------------+------------+-----------
 works fine | ohci_die() | works fine

Would additional verification of the actual HW be useful?

Peter posted the following which is more specific than "qemu 4.2" -- I
agree this is most likely the qemu commit where this thread is focused:

> Almost certainly this was commit 1328fe0c32d54 ("hw: usb: hcd-ohci:
> check len and frame_number variables"), which added these bounds
> checks. Prior to that we did no bounds checking at all, which
> meant that we permitted cbp=be+1 to mean a zero length, but also
> that we permitted the guest to overrun host-side buffers by
> specifying completely bogus cbp and be values. The timeframe is
> more or less right (2020), at least.
>
> -- PMM
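The length rule under debate — cbp and be possibly in different 4K pages, and cbp = be + 1 within a page denoting a zero-length buffer (OHCI spec 4.3.1.2) — can be sketched as a standalone function. This is an illustration of the rule, not the QEMU hcd-ohci code itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute a Transfer Descriptor's buffer length from CurrentBufferPointer
 * (cbp) and BufferEnd (be).  A cleared cbp, or cbp == be + 1 within one
 * page, both denote an empty buffer.
 */
static uint32_t ohci_td_len(uint32_t cbp, uint32_t be)
{
    if (cbp == 0) {
        return 0;                    /* CurrentBufferPointer cleared */
    }
    if ((cbp & 0xfffff000) != (be & 0xfffff000)) {
        /* buffer crosses a 4K page boundary */
        return (be & 0xfff) + 0x1001 - (cbp & 0xfff);
    }
    return be - cbp + 1;             /* cbp == be + 1 yields 0 */
}
```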

Where does the conversation go from here? I'm under the impression I have
provided objective answers to all the questions and resolved all review
comments.

Re: [PATCH] hw/riscv/virt.c: add address-cells in create_fdt_one_aplic()

2024-05-30 Thread Conor Dooley
On Thu, May 30, 2024 at 09:26:58AM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 5/30/24 08:06, Andrew Jones wrote:
> > On Thu, May 30, 2024 at 01:05:41PM GMT, Andrew Jones wrote:
> > > On Thu, May 30, 2024 at 05:49:49AM GMT, Daniel Henrique Barboza wrote:
> > > > We need #address-cells properties in all interrupt controllers that are
> > > > referred by an interrupt-map [1]. For the RISC-V machine, both PLIC and
> > > > APLIC controllers must have this property.
> > > > 
> > > > PLIC already sets it in create_fdt_socket_plic(). Set the property for
> > > > APLIC in create_fdt_one_aplic().
> > > > 
> > > > [1] 
> > > > https://lore.kernel.org/linux-arm-kernel/cal_jsqje15d-xxxmelsmud+jqhzzxgzdxvikchn6kfwqk6n...@mail.gmail.com/
> > > 
> > > There are other issues[2] with the DT nodes that we should address at the
> > > same time.
> > > 
> > > [2] https://lore.kernel.org/all/20240529-rust-tile-a05517a6260f@spud/
> > 
> > I meant to CC Conor. Doing that now.
> 
> I'll take a look at these other DT nodes issues.
> 
> Conor, mind give me pointers on how do I reproduce the validation you did
> in [2]? Using upstream 'dtc' I have stuff like:
> 
> ../qemu/qemu_dts.dts:261.4-68: Warning (interrupts_extended_property): 
> /soc/aplic@d00:interrupts-extended: cell 0 is not a phandle reference
> 
> Which seems to also be an error but it's not what you reported. Are you
> using 'dt-validate' from dt-schema?

Yeah, dt-validate. There's probably some stuff that I could add to my
machine to make it more interesting, but I ran:
$(qemu) -smp 4 -M virt,aia=aplic-imsic,dumpdtb=$(qemu_dtb) -cpu max -m 1G -nographic
dt-validate --schema $(processed_schema) $(qemu_dtb) 2>&1 | tee logs/dtbdump.log

A processed schema is a pre-requisite, and I usually have one sitting
around from running dtbs_check or dt_binding_check in Linux, but I think
you can use dt-rebasing to generate one either:
https://git.kernel.org/pub/scm/linux/kernel/git/devicetree/devicetree-rebasing.git/tree/Bindings/Makefile

You'll see a bunch of noise from undocumented isa extensions, but at the
end of the output should be the aplic complaints.

I forgot I had it disabled to test something when I did that test the
other day, there's also complaints about the imsics:
qemu.dtb: imsics@2800: $nodename:0: 'imsics@2800' does not match '^interrupt-controller(@[0-9a-f,]+)*$'
from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml#
qemu.dtb: imsics@2800: compatible:0: 'riscv,imsics' is not one of ['qemu,imsics']
from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml#
qemu.dtb: imsics@2800: compatible: ['riscv,imsics'] is too short
from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml#
qemu.dtb: imsics@2800: '#msi-cells' is a required property
from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml#
qemu.dtb: imsics@2800: Unevaluated properties are not allowed ('compatible' was unexpected)
from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml#
qemu.dtb: imsics@2400: $nodename:0: 'imsics@2400' does not match '^interrupt-controller(@[0-9a-f,]+)*$'
from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml#
qemu.dtb: imsics@2400: compatible:0: 'riscv,imsics' is not one of ['qemu,imsics']
from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml#
qemu.dtb: imsics@2400: compatible: ['riscv,imsics'] is too short
from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml#
qemu.dtb: imsics@2400: '#msi-cells' is a required property
from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml#
qemu.dtb: imsics@2400: Unevaluated properties are not allowed ('compatible' was unexpected)
from schema $id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml#

Cheers,
Conor.



signature.asc
Description: PGP signature


Re: [PATCH 1/2] accel/tcg: Make TCGCPUOps::cpu_exec_halt return bool for whether to halt

2024-05-30 Thread Peter Maydell
On Tue, 30 Apr 2024 at 18:15, Alex Bennée  wrote:
> The x86 version is essentially being called for side effects. Do we want
> to document this usage in the method?

I plan to take these two patches into target-arm.next, with
a slightly beefed up doc comment:

/**
 * @cpu_exec_halt: Callback for handling halt in cpu_exec.
 *
 * The target CPU should do any special processing here that it needs
 * to do when the CPU is in the halted state.
 *
 * Return true to indicate that the CPU should now leave halt, false
 * if it should remain in the halted state.
 *
 * If this method is not provided, the default is to do nothing, and
 * to leave halt if cpu_has_work() returns true.
 */

which hopefully makes it a little clearer to the reader
that "do stuff you need to do when in halt" is the
primary purpose of the method.
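The dispatch that comment describes — call the hook when provided, otherwise fall back to cpu_has_work() — might be sketched like this, using stub types rather than the real CPUState/TCGCPUOps:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for CPUState and its ops table, for illustration only. */
typedef struct CPUStateStub CPUStateStub;
struct CPUStateStub {
    bool has_work;
    bool (*cpu_exec_halt)(CPUStateStub *cpu);   /* optional hook */
};

static bool cpu_has_work_stub(CPUStateStub *cpu)
{
    return cpu->has_work;
}

/* Returns true when the CPU should now leave the halted state. */
static bool handle_halt(CPUStateStub *cpu)
{
    if (cpu->cpu_exec_halt) {
        return cpu->cpu_exec_halt(cpu);   /* target-specific processing */
    }
    return cpu_has_work_stub(cpu);        /* default described above */
}
```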

I'll do the follow-on tidyup rth suggests once these
have made it into git.

thanks
-- PMM



[RFC PATCH 0/1] vhost-user: Add SHMEM_MAP/UNMAP requests

2024-05-30 Thread Albert Esteve
Hi all,

This is an early attempt to have backends
support dynamic fd mapping into shared
memory regions. As such, there are a few
things that need settling, so I wanted to
post this first to have some early feedback.

The use case for this is, e.g., to support
vhost-user-gpu RESOURCE_BLOB operations,
or DAX Window request for virtio-fs. In
general, any operation where a backend
would need to mmap an fd to a shared
memory so that the guest can access it.
The request will be processed by the VMM,
that will, in turn, trigger a mmap with
the instructed parameters (i.e., shmid,
shm_offset, fd_offset, fd, length).

As there are already a couple devices
that could benefit from such a feature,
and more could require it in the future,
my intention was to make it generic.

To that end, I declared the shared
memory region list in `VirtIODevice`.
I could add a couple commodity
functions to add new regions to the list,
so that the devices can use them. But
I wanted to gather some feedback before
refining it further, as I am probably
missing some required steps/or security
concerns that I am not taking into account.
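One of the checks such a VMM-side handler would need before honouring a map request — rejecting an (shm_offset, len) pair that wraps around or falls outside the region — can be sketched as below. The function name is illustrative, not part of the proposed protocol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Validate the (shm_offset, len) pair of a map request before calling
 * mmap(): the sum must not overflow uint64_t and must stay inside the
 * shared memory region.
 */
static bool shmem_map_in_bounds(uint64_t shm_offset, uint64_t len,
                                uint64_t region_size)
{
    if (shm_offset + len < len) {        /* uint64 overflow check */
        return false;
    }
    return shm_offset + len <= region_size;
}
```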

Albert Esteve (1):
  vhost-user: add shmem mmap request

 docs/interop/vhost-user.rst |  23 
 hw/virtio/vhost-user.c  | 106 
 hw/virtio/virtio.c  |   2 +
 include/hw/virtio/virtio.h  |   3 +
 4 files changed, 134 insertions(+)

-- 
2.44.0




[RFC PATCH 1/1] vhost-user: add shmem mmap request

2024-05-30 Thread Albert Esteve
Add SHMEM_MAP/UNMAP requests to vhost-user.

This request allows backends to dynamically map
fds into a shared memory region identified by
its `shmid`. Then, the fd memory is advertised
to the frontend through a BAR+offset, so it can
be read by the driver while it's valid.

Then, the backend can munmap the memory range
in a given shared memory region (again, identified
by its `shmid`), to free it. After this, the
region becomes private and shall not be accessed
by the frontend anymore.

Initializing the memory region is the
responsibility of the PCI device that will be using it.

Signed-off-by: Albert Esteve 
---
 docs/interop/vhost-user.rst |  23 
 hw/virtio/vhost-user.c  | 106 
 hw/virtio/virtio.c  |   2 +
 include/hw/virtio/virtio.h  |   3 +
 4 files changed, 134 insertions(+)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index d8419fd2f1..3caf2a290c 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -1859,6 +1859,29 @@ is sent by the front-end.
   when the operation is successful, or non-zero otherwise. Note that if the
   operation fails, no fd is sent to the backend.
 
+``VHOST_USER_BACKEND_SHMEM_MAP``
+  :id: 9
+  :equivalent ioctl: N/A
+  :request payload: fd and ``struct VhostUserMMap``
+  :reply payload: N/A
+
+  This message can be submitted by the backends to advertise a new mapping
+  to be made in a given shared memory region. Upon receiving the message,
+  QEMU will mmap the given fd into the shared memory region with the
+  requested ``shmid``. A reply is generated indicating whether mapping
+  succeeded.
+
+``VHOST_USER_BACKEND_SHMEM_UNMAP``
+  :id: 10
+  :equivalent ioctl: N/A
+  :request payload: ``struct VhostUserMMap``
+  :reply payload: N/A
+
+  This message can be submitted by the backends so that QEMU un-mmap
+  a given range (``offset``, ``len``) in the shared memory region with the
+  requested ``shmid``.
+  A reply is generated indicating whether unmapping succeeded.
+
 .. _reply_ack:
 
 VHOST_USER_PROTOCOL_F_REPLY_ACK
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index cdf9af4a4b..9526b9d07f 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -115,6 +115,8 @@ typedef enum VhostUserBackendRequest {
 VHOST_USER_BACKEND_SHARED_OBJECT_ADD = 6,
 VHOST_USER_BACKEND_SHARED_OBJECT_REMOVE = 7,
 VHOST_USER_BACKEND_SHARED_OBJECT_LOOKUP = 8,
+VHOST_USER_BACKEND_SHMEM_MAP = 9,
+VHOST_USER_BACKEND_SHMEM_UNMAP = 10,
 VHOST_USER_BACKEND_MAX
 }  VhostUserBackendRequest;
 
@@ -192,6 +194,23 @@ typedef struct VhostUserShared {
 unsigned char uuid[16];
 } VhostUserShared;
 
+/* For the flags field of VhostUserMMap */
+#define VHOST_USER_FLAG_MAP_R (1u << 0)
+#define VHOST_USER_FLAG_MAP_W (1u << 1)
+
+typedef struct {
+/* Shared memory region ID */
+uint8_t shmid;
+/* File offset */
+uint64_t fd_offset;
+/* Offset within the shared memory region */
+uint64_t shm_offset;
+/* Size of region to map */
+uint64_t len;
+/* Flags for the mmap operation, from VHOST_USER_FLAG_* */
+uint64_t flags;
+} VhostUserMMap;
+
 typedef struct {
 VhostUserRequest request;
 
@@ -224,6 +243,7 @@ typedef union {
 VhostUserInflight inflight;
 VhostUserShared object;
 VhostUserTransferDeviceState transfer_state;
+VhostUserMMap mmap;
 } VhostUserPayload;
 
 typedef struct VhostUserMsg {
@@ -1748,6 +1768,85 @@ vhost_user_backend_handle_shared_object_lookup(struct 
vhost_user *u,
 return 0;
 }
 
+static int
+vhost_user_backend_handle_shmem_map(struct vhost_dev *dev,
+  VhostUserMMap *vu_mmap,
+  int fd)
+{
+void *addr = 0;
+MemoryRegion *mr = NULL;
+
+if (fd < 0) {
+error_report("Bad fd for map");
+return -EBADF;
+}
+
+if (!dev->vdev->shmem_list ||
+dev->vdev->n_shmem_regions <= vu_mmap->shmid) {
+error_report("Shared memory region at "
+ "ID %d unitialized", vu_mmap->shmid);
+return -EFAULT;
+}
+
+mr = &dev->vdev->shmem_list[vu_mmap->shmid];
+
+if ((vu_mmap->shm_offset + vu_mmap->len) < vu_mmap->len ||
+(vu_mmap->shm_offset + vu_mmap->len) > mr->size) {
+error_report("Bad offset/len for mmap %" PRIx64 "+%" PRIx64,
+ vu_mmap->shm_offset, vu_mmap->len);
+return -EFAULT;
+}
+
+void *shmem_ptr = memory_region_get_ram_ptr(mr);
+
+addr = mmap(shmem_ptr + vu_mmap->shm_offset, vu_mmap->len,
+((vu_mmap->flags & VHOST_USER_FLAG_MAP_R) ? PROT_READ : 0) |
+((vu_mmap->flags & VHOST_USER_FLAG_MAP_W) ? PROT_WRITE : 0),
+MAP_SHARED | MAP_FIXED, fd, vu_mmap->fd_offset);
+if (addr == MAP_FAILED) {
+error_report("Failed to mmap mem fd");
+return -EFAULT;
+}
+
+return 0;
+}
+
+static int
+vhost_user_backend_handle_shmem_unmap(struct vhost_dev *dev,
+  

Re: [Semihosting Tests PATCH v2 1/3] .editorconfig: add code conventions for tooling

2024-05-30 Thread Brian Cain



On 5/30/2024 6:23 AM, Alex Bennée wrote:

It's a pain when you come back to a code base you haven't touched in a
while and realise whatever indent settings you were using having
carried over. Add an editorconfig and be done with it.

Signed-off-by: Alex Bennée 



Adding an editorconfig seems like a great idea IMO.  But I wonder - will 
it result in unintentional additional changes when saving a file that 
contains baseline non-conformance?


Related: would a .clang-format file also be useful? git-clang-format can 
be used to apply formatting changes only on the code that's been changed.


Also: should we consider excluding any exceptional files that we don't 
expect to conform?




---
v2
   - drop mention of custom major modes (not needed here)
   - include section for assembly
---
  .editorconfig | 29 +
  1 file changed, 29 insertions(+)
  create mode 100644 .editorconfig

diff --git a/.editorconfig b/.editorconfig
new file mode 100644
index 000..c72a55c
--- /dev/null
+++ b/.editorconfig
@@ -0,0 +1,29 @@
+# EditorConfig is a file format and collection of text editor plugins
+# for maintaining consistent coding styles between different editors
+# and IDEs. Most popular editors support this either natively or via
+# plugin.
+#
+# Check https://editorconfig.org for details.
+#
+
+root = true
+
+[*]
+end_of_line = lf
+insert_final_newline = true
+charset = utf-8
+
+[Makefile*]
+indent_style = tab
+indent_size = 8
+emacs_mode = makefile
+
+[*.{c,h}]
+indent_style = space
+indent_size = 4
+emacs_mode = c
+
+[*.{s,S}]
+indent_style = tab
+indent_size = 8
+emacs_mode = asm




Re: [PATCH] hw/usb/hcd-ohci: Fix ohci_service_td: accept valid TDs

2024-05-30 Thread Peter Maydell
On Tue, 21 May 2024 at 00:26, David Hubbard  wrote:
>
> From: Cord Amfmgm 
>
> This changes the way the ohci emulation handles a Transfer Descriptor with
> "Current Buffer Pointer" set to "Buffer End" + 1.
>
> The OHCI spec 4.3.1.2 Table 4-2 allows td.cbp to be one byte more than td.be
> to signal the buffer has zero length. Currently qemu only accepts zero-length
> Transfer Descriptors if the td.cbp is equal to 0, while actual OHCI hardware
> accepts both cases.
>
> The qemu ohci emulation has a regression in ohci_service_td. Version 4.2
> and earlier matched the spec. (I haven't taken the time to bisect exactly
> where the logic was changed.)

Almost certainly this was commit 1328fe0c32d54 ("hw: usb: hcd-ohci:
check len and frame_number variables"), which added these bounds
checks. Prior to that we did no bounds checking at all, which
meant that we permitted cbp=be+1 to mean a zero length, but also
that we permitted the guest to overrun host-side buffers by
specifying completely bogus cbp and be values. The timeframe is
more or less right (2020), at least.

-- PMM



[PATCH v2 0/3] semihosting: Restrict to TCG

2024-05-30 Thread Philippe Mathieu-Daudé
v2: Address Paolo's comment

Semihosting currently uses the TCG probe_access API,
so it is pointless to have it in the binary when TCG
isn't.

It could be implemented for other accelerators, but
work need to be done. Meanwhile, do not enable it
unless TCG is available.

Philippe Mathieu-Daudé (3):
  target/mips: Restrict semihosting to TCG
  target/riscv: Restrict semihosting to TCG
  semihosting: Restrict to TCG

 semihosting/Kconfig  | 1 +
 target/mips/Kconfig  | 2 +-
 target/riscv/Kconfig | 4 ++--
 3 files changed, 4 insertions(+), 3 deletions(-)

-- 
2.41.0




[PATCH v2 2/3] target/riscv: Restrict semihosting to TCG

2024-05-30 Thread Philippe Mathieu-Daudé
Semihosting currently uses the TCG probe_access API. To prepare for
encoding the TCG dependency in Kconfig, do not enable it unless TCG
is available.

Suggested-by: Paolo Bonzini 
Signed-off-by: Philippe Mathieu-Daudé 
---
 target/riscv/Kconfig | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target/riscv/Kconfig b/target/riscv/Kconfig
index 5f30df22f2..cbafbedaeb 100644
--- a/target/riscv/Kconfig
+++ b/target/riscv/Kconfig
@@ -1,9 +1,9 @@
 config RISCV32
 bool
-select ARM_COMPATIBLE_SEMIHOSTING # for do_common_semihosting()
+imply ARM_COMPATIBLE_SEMIHOSTING if TCG # for do_common_semihosting()
 select DEVICE_TREE # needed by boot.c
 
 config RISCV64
 bool
-select ARM_COMPATIBLE_SEMIHOSTING # for do_common_semihosting()
+imply ARM_COMPATIBLE_SEMIHOSTING if TCG # for do_common_semihosting()
 select DEVICE_TREE # needed by boot.c
-- 
2.41.0




[PATCH v2 3/3] semihosting: Restrict to TCG

2024-05-30 Thread Philippe Mathieu-Daudé
Semihosting currently uses the TCG probe_access API.
It is pointless to have it in the binary when TCG isn't.

Signed-off-by: Philippe Mathieu-Daudé 
Reviewed-by: Richard Henderson 
---
 semihosting/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/semihosting/Kconfig b/semihosting/Kconfig
index eaf3a20ef5..fbe6ac87f9 100644
--- a/semihosting/Kconfig
+++ b/semihosting/Kconfig
@@ -1,6 +1,7 @@
 
 config SEMIHOSTING
bool
+   depends on TCG
 
 config ARM_COMPATIBLE_SEMIHOSTING
bool
-- 
2.41.0




Re: [PATCH V2 0/3] improve -overcommit cpu-pm=on|off

2024-05-30 Thread Sean Christopherson
On Thu, May 30, 2024, Igor Mammedov wrote:
> On Thu, 30 May 2024 21:54:47 +0800 Zhao Liu  wrote:

...

> > > >> following, CPUID_EXT_MONITOR is set after x86_cpu_filter_features(), so
> > > >> that it doesn't have a chance to check MWAIT against host features and
> > > >> will be advertised to the guest regardless of whether it's supported by
> > > >> the host or not.
> > > >>
> > > >> x86_cpu_realizefn()
> > > >>   x86_cpu_filter_features()
> > > >>   cpu_exec_realizefn()
> > > >> kvm_cpu_realizefn
> > > >>   host_cpu_realizefn
> > > >> host_cpu_enable_cpu_pm
> > > >>   env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR;
> > > >>
> > > >>
> > > >> If it's not supported by the host, executing MONITOR or MWAIT
> > > >> instructions from the guest triggers #UD, no matter MWAIT_EXITING
> > > >> control is set or not.  
> > > > 
> > > > If I recall right, kvm was able to emulate mwait/monitor.
> > > > So question is why it leads to exception instead?

Because KVM doesn't emulate MONITOR/MWAIT on #UD.

> > > KVM can come to play only iff it can trigger MWAIT/MONITOR VM exits. I
> > > didn't find explicit proof from Intel SDM that #UD exceptions take
> > > precedence over MWAIT/MONITOR VM exits, but this is my speculation.

Yeah, typically #UD takes priority over VM-Exit interception checks.  AMD's APM
is much more explicit and states that all exceptions are checked on 
MONITOR/MWAIT
before the interception check.

> > > For example, in ancient machines which don't support MWAIT yet, the only
> > > way it can do is #UD, not MWAIT VM exit?  

Not really relevant, because such CPUs wouldn't have MWAIT-exiting.

> > For the Host which doesn't support MWAIT, it shouldn't have the VMX
> > control bit for mwait exit either, right?
> > 
> > Could you pls check this on your machine? If VMX doesn't support this
> > exit event, then triggering an exception will make sense.
> 
> My assumption (probably wrong) was that KVM would emulate mwait if it's 
> unavailable,

Nope.  In order to limit the attack surface of the emulator on modern CPUs, KVM
only emulates select instructions in response to a #UD.

But even if KVM did emulate MONITOR/MWAIT on #UD, this is inarguably a QEMU bug,
e.g. QEMU will effectively coerce the guest into using an idle-polling mechanism.
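The ordering bug discussed in this thread reduces to a simple invariant: feature filtering masks requested bits against host capability, so any bit OR'ed in *after* the filter pass (as host_cpu_enable_cpu_pm did) is never checked. A toy model, using the real CPUID.1:ECX bit position for MONITOR:

```c
#include <assert.h>
#include <stdint.h>

#define CPUID_EXT_MONITOR (1u << 3)   /* CPUID.1:ECX bit 3, MONITOR/MWAIT */

/* Drop every requested feature bit the host cannot provide. */
static uint32_t filter_features(uint32_t requested, uint32_t host)
{
    return requested & host;
}
```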



[PATCH v2 1/3] target/mips: Restrict semihosting to TCG

2024-05-30 Thread Philippe Mathieu-Daudé
Semihosting currently uses the TCG probe_access API. To prepare for
encoding the TCG dependency in Kconfig, do not enable it unless TCG
is available.

Suggested-by: Paolo Bonzini 
Signed-off-by: Philippe Mathieu-Daudé 
---
 target/mips/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/mips/Kconfig b/target/mips/Kconfig
index eb19c94c7d..876048b150 100644
--- a/target/mips/Kconfig
+++ b/target/mips/Kconfig
@@ -1,6 +1,6 @@
 config MIPS
 bool
-select SEMIHOSTING
+imply SEMIHOSTING if TCG
 
 config MIPS64
 bool
-- 
2.41.0




Re: [PATCH V2 0/3] improve -overcommit cpu-pm=on|off

2024-05-30 Thread Igor Mammedov
On Thu, 30 May 2024 21:54:47 +0800
Zhao Liu  wrote:

> Hi Zide,
> 
> On Wed, May 29, 2024 at 10:31:21AM -0700, Chen, Zide wrote:
> > Date: Wed, 29 May 2024 10:31:21 -0700
> > From: "Chen, Zide" 
> > Subject: Re: [PATCH V2 0/3] improve -overcommit cpu-pm=on|off
> > 
> > 
> > 
> > On 5/29/2024 5:46 AM, Igor Mammedov wrote:  
> > > On Tue, 28 May 2024 11:16:59 -0700
> > > "Chen, Zide"  wrote:
> > >   
> > >> On 5/28/2024 2:23 AM, Igor Mammedov wrote:  
> > >>> On Fri, 24 May 2024 13:00:14 -0700
> > >>> Zide Chen  wrote:
> > >>> 
> >  Currently, if running "-overcommit cpu-pm=on" on hosts that don't
> >  have MWAIT support, the MWAIT/MONITOR feature is advertised to the
> >  guest and executing MWAIT/MONITOR on the guest triggers #UD.
> > >>>
> > >>> this is missing proper description how do you trigger issue
> > >>> with reproducer and detailed description why guest sees MWAIT
> > >>> when it's not supported by host.
> > >>
> > >> If "overcommit cpu-pm=on" and "-cpu host" are present, as shown in the  
> > > it's bette to provide full QEMU CLI and host/guest kernels used and what
> > > hardware was used if it's relevant so others can reproduce problem.  
> > 
> > I ever reproduced this on an older Intel Icelake machine, a
> > Sapphire Rapids and a Sierra Forest, but I believe this is a x86 generic
> > issue, not specific to particular models.
> > 
> > For the CLI, I think the only command line options that matter are
> >  -overcommit cpu-pm=on: to set enable_cpu_pm
> >  -cpu host: so that cpu->max_features is set
> > 
> > For QEMU version, as long as it's after this commit: 662175b91ff2
> > ("i386: reorder call to cpu_exec_realizefn")
> > 
> > The guest fails to boot:
> > 
> > [ 24.825568] smpboot: x86: Booting SMP configuration:
> > [ 24.826377]  node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12
> > #13 #14 #15 #17
> > [ 24.985799]  node #1, CPUs: #128 #129 #130 #131 #132 #133 #134 #135
> > #136 #137 #138 #139 #140 #141 #142 #143 #145
> > [ 25.136955] invalid opcode:  1 PREEMPT SMP NOPTI
> > [ 25.137790] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0 #2
> > [ 25.137790] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> > rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/04
> > [ 25.137790] RIP: 0010:mwait_idle+0x35/0x80
> > [ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15
> > 47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8
> > [ 25.137790] RSP: :91403e70 EFLAGS: 00010046
> > [ 25.137790] RAX: 9140a980 RBX: 9140a980 RCX:
> > 
> > [ 25.137790] RDX:  RSI: 97f1ade21b20 RDI:
> > 0004
> > [ 25.137790] RBP:  R08: 0005da4709cb R09:
> > 0001
> > [ 25.137790] R10: 5da4 R11: 0009 R12:
> > 
> > [ 25.137790] R13: 98573ff90fc0 R14: 9140a038 R15:
> > 00093ff0
> > [ 25.137790] FS: () GS:97f1ade0()
> > knlGS:
> > [ 25.137790] CS: 0010 DS:  ES:  CR0: 80050033
> > [ 25.137790] CR2: 97d8aa801000 CR3: 0049e9430001 CR4:
> > 00770ef0
> > [ 25.137790] DR0:  DR1:  DR2:
> > 
> > [ 25.137790] DR3:  DR6: 07f0 DR7:
> > 0400
> > [ 25.137790] PKRU: 5554
> > [ 25.137790] Call Trace:
> > [ 25.137790] 
> > [ 25.137790] ? die+0x37/0x90
> > [ 25.137790] ? do_trap+0xe3/0x110
> > [ 25.137790] ? mwait_idle+0x35/0x80
> > [ 25.137790] ? do_error_trap+0x6a/0x90
> > [ 25.137790] ? mwait_idle+0x35/0x80
> > [ 25.137790] ? exc_invalid_op+0x52/0x70
> > [ 25.137790] ? mwait_idle+0x35/0x80
> > [ 25.137790] ? asm_exc_invalid_op+0x1a/0x20
> > [ 25.137790] ? mwait_idle+0x35/0x80
> > [ 25.137790] default_idle_call+0x30/0x100
> > [ 25.137790] cpuidle_idle_call+0x12c/0x170
> > [ 25.137790] ? tsc_verify_tsc_adjust+0x73/0xd0
> > [ 25.137790] do_idle+0x7f/0xd0
> > [ 25.137790] cpu_startup_entry+0x29/0x30
> > [ 25.137790] rest_init+0xcc/0xd0
> > [ 25.137790] start_kernel+0x396/0x5d0
> > [ 25.137790] x86_64_start_reservations+0x18/0x30
> > [ 25.137790] x86_64_start_kernel+0xe7/0xf0
> > [ 25.137790] common_startup_64+0x13e/0x148
> > [ 25.137790] 
> > [ 25.137790] Modules linked in:
> > [ 25.137790] --[ end trace  ]--
> > [ 25.137790] invalid opcode:  2 PREEMPT SMP NOPTI
> > [ 25.137790] RIP: 0010:mwait_idle+0x35/0x80
> > [ 25.137790] Code: 6f f0 80 48 02 20 48 8b 10 83 e2 08 75 3e 65 48 8b 15
> > 47 d6 56 6f 48 0f ba e2 27 72 41 31 d2 48 89 d8
> >   
> > >   
> > >> following, CPUID_EXT_MONITOR is set after x86_cpu_filter_features(), so
> > >> that it doesn't have a chance to check MWAIT against host features and
> > >> will be advertised to the guest regardless of whether it's supported by
> > >> the host or not.
> > >>
> > >> x86_cpu_realizefn()
> > >>   x86_cpu_filter_features()
> > >>   cpu_exec_realizefn()
> > >> kvm_cpu_realizefn
> > 

Re: [PATCH] target/arm: Disable SVE extensions when SVE is disabled

2024-05-30 Thread Peter Maydell
On Sun, 26 May 2024 at 21:46, Richard Henderson
 wrote:
>
> From: Marcin Juszkiewicz 
>
> Cc: qemu-sta...@nongnu.org
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2304
> Reported-by: Marcin Juszkiewicz 
> Signed-off-by: Richard Henderson 
> ---

Applied to target-arm.next, thanks.

(Looks like we already got this right for SME:
arm_cpu_sme_finalize() clears ID_AA64SMFR0_EL1 if
SME is disabled.)

-- PMM



Re: [PATCH] hw/net: prevent potential NULL dereference

2024-05-30 Thread Олег Свиридов

Thanks for the review.

Would it be correct to use hcall_dprintf() as in other functions of the 
module? For example, in h_add_logical_lan_buffer().


Best regards,
Oleg.

On 29.05.2024 16:52, Philippe Mathieu-Daudé wrote:

On 29/5/24 13:07, Oleg Sviridov wrote:
The pointer returned from 'spapr_vio_find_by_reg' may be NULL
and is dereferenced immediately afterwards.


Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Oleg Sviridov 
---
  hw/net/spapr_llan.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/hw/net/spapr_llan.c b/hw/net/spapr_llan.c
index ecb30b7c76..f40b733229 100644
--- a/hw/net/spapr_llan.c
+++ b/hw/net/spapr_llan.c
@@ -770,6 +770,10 @@ static target_ulong 
h_change_logical_lan_mac(PowerPCCPU *cpu,

  SpaprVioVlan *dev = VIO_SPAPR_VLAN_DEVICE(sdev);
  int i;


Trying to change a MAC when no NIC is present is dubious, we could
at least report this using qemu_log_mask(LOG_GUEST_ERROR).


+    if (!dev) {
+    return H_PARAMETER;
+    }
+
  for (i = 0; i < ETH_ALEN; i++) {
  dev->nicconf.macaddr.a[ETH_ALEN - i - 1] = macaddr & 0xff;
  macaddr >>= 8;
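The reviewer's suggestion — keep the NULL guard but also report the guest error — might look like the sketch below. The lookup, the log call, and `H_PARAMETER`'s value are stand-ins for illustration (real QEMU would use qemu_log_mask(LOG_GUEST_ERROR, ...) and the PAPR constants).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define H_SUCCESS   0
#define H_PARAMETER (-4)   /* illustrative value, not pulled from PAPR headers */

/* Stand-in for the spapr VIO device table lookup. */
typedef struct { uint64_t reg; } VlanStub;

static VlanStub *find_by_reg_stub(VlanStub *table, int n, uint64_t reg)
{
    for (int i = 0; i < n; i++) {
        if (table[i].reg == reg) {
            return &table[i];
        }
    }
    return NULL;   /* no device at this reg -> caller must check */
}

static long h_change_mac_sketch(VlanStub *table, int n, uint64_t reg)
{
    VlanStub *dev = find_by_reg_stub(table, n, reg);
    if (!dev) {
        /* qemu_log_mask(LOG_GUEST_ERROR, ...) in real QEMU */
        fprintf(stderr, "guest error: no VLAN device at reg 0x%lx\n",
                (unsigned long)reg);
        return H_PARAMETER;
    }
    /* ... proceed to update dev->nicconf.macaddr ... */
    return H_SUCCESS;
}
```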






> > >>   x86_cpu_filter_features()
> > >>   cpu_exec_realizefn()
> > >> kvm_cpu_realizefn
> > 

Re: [QEMU][PATCH 1/1] hw/net/can: Fix sorting of the tx queue

2024-05-30 Thread Francisco Iglesias
Hi Shiva,

On Wed, May 29, 2024 at 03:54:58PM +0530, Shiva sagar Myana wrote:
> Returning a uint32_t cast to a gint from g_cmp_ids causes the tx queue to
> become wrongly sorted when executing g_slist_sort. Fix this by always
> returning -1 or 1 from g_cmp_ids based on the ID comparison instead.
> Also, if two message IDs are the same, sort them by their index and
> transmit the message at the lowest index first.
> 
> Signed-off-by: Shiva sagar Myana 

With the subject line modified to below (for clarity):

"hw/net/can/xlnx-versal-canfd: Fix sorting of the tx queue"

Reviewed-by: Francisco Iglesias 

BR,
F

> ---
>  hw/net/can/xlnx-versal-canfd.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/net/can/xlnx-versal-canfd.c b/hw/net/can/xlnx-versal-canfd.c
> index 47a14cfe63..5f083c21e9 100644
> --- a/hw/net/can/xlnx-versal-canfd.c
> +++ b/hw/net/can/xlnx-versal-canfd.c
> @@ -1312,7 +1312,10 @@ static gint g_cmp_ids(gconstpointer data1, 
> gconstpointer data2)
>  tx_ready_reg_info *tx_reg_1 = (tx_ready_reg_info *) data1;
>  tx_ready_reg_info *tx_reg_2 = (tx_ready_reg_info *) data2;
>  
> -return tx_reg_1->can_id - tx_reg_2->can_id;
> +if (tx_reg_1->can_id == tx_reg_2->can_id) {
> +return (tx_reg_1->reg_num < tx_reg_2->reg_num) ? -1 : 1;
> +}
> +return (tx_reg_1->can_id < tx_reg_2->can_id) ? -1 : 1;
>  }
>  
>  static void free_list(GSList *list)
> -- 
> 2.34.1
> 



Re: [PATCH v3 00/33] target/arm: Convert a64 advsimd to decodetree (part 1b)

2024-05-30 Thread Peter Maydell
On Tue, 28 May 2024 at 21:31, Richard Henderson
 wrote:
>
> Changes for v3:
>   * Reword prefetch unpredictable patch.
>   * Validate vector length when qc is an implied operand.
>   * Adjust some legacy decode based on review.
>   * Apply r-b.
>
> Patches needing review:
>   01-target-arm-Diagnose-UNPREDICTABLE-operands-to-PLD.patch
>   03-target-arm-Assert-oprsz-in-range-when-using-vfp.q.patch
>   04-target-arm-Convert-SUQADD-and-USQADD-to-gvec.patch
>   10-target-arm-Convert-SRSHL-and-URSHL-register-to-gv.patch
>   12-target-arm-Convert-SQSHL-and-UQSHL-register-to-gv.patch
>   31-target-arm-Convert-SQDMULH-SQRDMULH-to-decodetree.patch
>   32-target-arm-Convert-FMADD-FMSUB-FNMADD-FNMSUB-to-d.patch


Applied 2-33 to target-arm.next, thanks. (Dropped the PLD
patch for the reasons we discussed in the other thread.)

-- PMM



Re: [Semihosting Tests PATCH v2 0/3] add SYS_GET_CMDLINE test

2024-05-30 Thread Peter Maydell
On Thu, 30 May 2024 at 12:23, Alex Bennée  wrote:
>
> Hi Peter,
>
> Looking at bug #2322 I wanted to make sure SYS_GET_CMDLINE works as I
> expected. While at it I needed to fix a compile error with headers
> which I guess we got away with on earlier compilers.
>
> I've added an editorconfig for good measure.
>
> v2
>   - addressed review comments
>
> Alex Bennée (3):
>   .editorconfig: add code conventions for tooling
>   update includes for bare metal compiling
>   add SYS_GET_CMDLINE test

Applied to the semihosting-tests git repo, thanks.

-- PMM
