date:20240606

Re: [PATCH 0/3] virtio-gpu: Enable virglrenderer backend for rutabaga

2024-06-06 Thread Weifeng Liu

Hi Alex,

On Thu, 2024-06-06 at 11:43 +0100, Alex Bennée wrote:
> Weifeng Liu  writes:
> 
> > Greetings,
> > 
> > I'd like to introduce my attempt to enable the virglrenderer backend for
> > the rutabaga-powered virtio-gpu device.  I am aware that there has been
> > an effort to support venus in virtio-gpu-virgl.c [1], but there is no
> > reason not to leverage the virglrenderer component in rutabaga_gfx as
> > well, especially since this functionality is not very hard to add.
> > 
> > Generally, the gap is the polling capability, i.e., virglrenderer
> > requires the main thread (namely the GPU command handling thread) to
> > poll virglrenderer at proper moments, which is not yet supported in
> > the virtio-gpu-rutabaga device. This patch set tries to add this so that
> > the virglrenderer backend (including virgl and venus) can work as expected.
> > 
> > A slight change to rutabaga_gfx_ffi is also required, which is
> > included in [2].
> > 
> > Further effort is required to tune the performance, since the rendered
> > images are copied before they get displayed. But I still think this
> > patch set could be a good starting point for the pending work.
> > 
> > For those interested in setting up an environment and playing around
> > with this patch set, here is a brief guide:
> > 
> > 1. Clone the master/main branch of virglrenderer, compile and install it.
> > 
> > git clone https://gitlab.freedesktop.org/virgl/virglrenderer
> > cd virglrenderer
> > meson setup builddir \
> >   --prefix=$INSTALL_DIR/virglrenderer \
> >   -Dvenus=true
> > ninja -C builddir install
> > 
> > 2. Clone the patched CrosVM, build and install rutabaga_gfx_ffi.
> > 
> > git clone -b rutabaga_ffi_virgl https://github.com/phreer/crosvm.git
> > cd crosvm/rutabaga_gfx/ffi
> > export PKG_CONFIG_PATH=$INSTALL_DIR/virglrenderer/lib64/pkgconfig/
> > meson setup builddir/ \
> >   --prefix $HOME/install/rutabaga_gfx/rutabaga_gfx_ffi/ \
> >   -Dvirglrenderer=true
> > ninja -C builddir install
> 
> Is there a PR going in for this? The moving parts for rutabaga are
> complex enough I think we need support upstream before merging this.
> 

It's true that this patch set depends on the change to
rutabaga_gfx_ffi. I am trying to get the modifications to
crosvm/rutabaga_gfx_ffi merged upstream; please refer to this link:
https://chromium-review.googlesource.com/c/crosvm/crosvm/+/5599645

> Is this branch where I should be getting the poll helpers from?
> 
>   cc -m64 @qemu-system-arm.rsp
>   /usr/bin/ld: libcommon.fa.p/hw_display_virtio-gpu-rutabaga.c.o: in function `virtio_gpu_fence_poll':
>   /home/alex/lsrc/qemu.git/builds/vulkan/../../hw/display/virtio-gpu-rutabaga.c:909: undefined reference to `rutabaga_poll'
>   /usr/bin/ld: libcommon.fa.p/hw_display_virtio-gpu-rutabaga.c.o: in function `virtio_gpu_rutabaga_init':
>   /home/alex/lsrc/qemu.git/builds/vulkan/../../hw/display/virtio-gpu-rutabaga.c:1122: undefined reference to `rutabaga_poll_descriptor'
>   collect2: error: ld returned 1 exit status
>   ninja: build stopped: subcommand failed.
> 

The required patches are already applied to the rutabaga_ffi_virgl branch
of my clone of crosvm, so please check out that branch.

> 
> > 3. Apply this patch set to QEMU, then build and install it:
> > 
> > cd qemu 
> > # Apply this patch set atop main branch ...
> > mkdir builddir; cd builddir
> > ../configure --prefix=$INSTALL_DIR/qemu \
> >   --target-list=x86_64-softmmu \
> >   --disable-virglrenderer \
> >   --enable-rutabaga_gfx
> > ninja install
> > 
> > 4. If you are lucky and everything goes fine, you are ready to launch a
> >    VM with a virglrenderer-backed virtio-gpu-rutabaga device:
> > 
> > export LD_LIBRARY_PATH=$INSTALL_DIR/virglrenderer/lib64/:$LD_LIBRARY_PATH
> > export LD_LIBRARY_PATH=$INSTALL_DIR/rutabaga_gfx_ffi/lib64/:$LD_LIBRARY_PATH
> > QEMU=$INSTALL_DIR/qemu/bin/qemu-system-x86_64
> > $QEMU -d guest_errors -enable-kvm -M q35 -smp 4 -m $MEM \
> >   -object memory-backend-memfd,id=mem1,size=$MEM \
> >   -machine memory-backend=mem1 \
> >   -device virtio-vga-rutabaga,venus=on,virgl2=on,wsi=surfaceless,hostmem=$MEM \
> > 
> 
> This should go into docs/system/devices/virtio-gpu.rst with some
> explanation. Is there anything we need on the guest side or does this
> skip the encapsulating requirements of wayland?
> 

Yeah, it's a good idea to add documentation explaining the usage, thanks!

Best regards,
Weifeng

> > Note:
> > 
> > - You might need this patch set [3] to avoid KVM bad address error when
> >   you are running on a GPU using TTM for memory management.
> > 
> > [1] https://lore.kernel.org/all/dba6eb97-e1d1-4694-bfb6-e72db9571...@daynix.com/T/
> > [2] https://chromium-review.googlesource.com/c/crosvm/crosvm/+/5599645/1
> > [3] https://lore.kernel.org/kvm/20240229025759.1187910-1-steve...@google.com/
> > 
> > Weifeng

Re: [PATCH v3 1/4] qom: allow to mark objects as deprecated or not secure.

2024-06-06 Thread Philippe Mathieu-Daudé


On 6/6/24 16:30, Gerd Hoffmann wrote:

Add flags to ObjectClass for objects which are deprecated or not secure.
Add 'deprecated' and 'not-secure' bools to ObjectTypeInfo, report in
'qom-list-types'.  Print the flags when listing devices via '-device
help'.

Signed-off-by: Gerd Hoffmann 
---
  include/qom/object.h  | 3 +++
  qom/qom-qmp-cmds.c| 8 
  system/qdev-monitor.c | 8 
  qapi/qom.json | 8 +++-
  4 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/qom/object.h b/include/qom/object.h
index 13d3a655ddf9..419bd9a4b219 100644
--- a/include/qom/object.h
+++ b/include/qom/object.h
@@ -136,6 +136,9 @@ struct ObjectClass
  ObjectUnparent *unparent;
  
  GHashTable *properties;

+
+bool deprecated;
+bool not_secure;


LGTM but I'd rather use a reason string instead of a boolean,
so we are forced to justify.

That would be in line with MachineClass::deprecation_reason:

 * MachineClass:
 * @deprecation_reason: If set, the machine is marked as deprecated.
 *The string should provide some clear information about what to
 *use instead.
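
To illustrate the suggestion above, here is a minimal standalone sketch of
the reason-string approach. DemoClass and demo_format_devinfo() are
hypothetical names made up for this illustration and are not part of QEMU;
the only thing taken from the thread is the idea that a non-NULL reason
marks the type as deprecated, as with MachineClass::deprecation_reason.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a class struct; this is not QEMU's ObjectClass. */
typedef struct DemoClass {
    const char *name;
    /* NULL means "not deprecated"; otherwise it says what to use instead. */
    const char *deprecation_reason;
} DemoClass;

/* Format a one-line listing into buf, appending the reason when one is set. */
static int demo_format_devinfo(const DemoClass *dc, char *buf, size_t len)
{
    assert(dc && buf);
    if (dc->deprecation_reason) {
        /* An empty reason would defeat the point of forcing a justification. */
        assert(strlen(dc->deprecation_reason) > 0);
        return snprintf(buf, len, "name \"%s\", deprecated (%s)",
                        dc->name, dc->deprecation_reason);
    }
    return snprintf(buf, len, "name \"%s\"", dc->name);
}
```

Compared with a plain bool, the listing can then tell the user what to
migrate to, at the cost of one string per deprecated type.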


  };
  
  /**

diff --git a/qom/qom-qmp-cmds.c b/qom/qom-qmp-cmds.c
index e91a2353472a..325ff0ba2a25 100644
--- a/qom/qom-qmp-cmds.c
+++ b/qom/qom-qmp-cmds.c
@@ -101,6 +101,14 @@ static void qom_list_types_tramp(ObjectClass *klass, void *data)
  if (parent) {
  info->parent = g_strdup(object_class_get_name(parent));
  }
+if (klass->deprecated) {
+info->has_deprecated = true;
+info->deprecated = true;
+}
+if (klass->not_secure) {
+info->has_not_secure = true;
+info->not_secure = true;
+}
  
  QAPI_LIST_PREPEND(*pret, info);

  }
diff --git a/system/qdev-monitor.c b/system/qdev-monitor.c
index 6af6ef7d667f..effdc95d21d3 100644
--- a/system/qdev-monitor.c
+++ b/system/qdev-monitor.c
@@ -144,6 +144,8 @@ static bool qdev_class_has_alias(DeviceClass *dc)
  
  static void qdev_print_devinfo(DeviceClass *dc)

  {
+ObjectClass *klass = OBJECT_CLASS(dc);
+
  qemu_printf("name \"%s\"", object_class_get_name(OBJECT_CLASS(dc)));
  if (dc->bus_type) {
  qemu_printf(", bus %s", dc->bus_type);
@@ -157,6 +159,12 @@ static void qdev_print_devinfo(DeviceClass *dc)
  if (!dc->user_creatable) {
  qemu_printf(", no-user");
  }
+if (klass->deprecated) {
+qemu_printf(", deprecated");
+}
+if (klass->not_secure) {
+qemu_printf(", not-secure");
+}
  qemu_printf("\n");
  }
  
diff --git a/qapi/qom.json b/qapi/qom.json

index 8bd299265e39..3f20d4c6413b 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -163,10 +163,16 @@
  #
  # @parent: Name of parent type, if any (since 2.10)
  #
+# @deprecated: the type is deprecated (since 9.1)
+#
+# @not-secure: the type (typically a device) is not considered
+# a security boundary (since 9.1)
+#
  # Since: 1.1
  ##
  { 'struct': 'ObjectTypeInfo',
-  'data': { 'name': 'str', '*abstract': 'bool', '*parent': 'str' } }
+  'data': { 'name': 'str', '*abstract': 'bool', '*parent': 'str',
+'*deprecated': 'bool', '*not-secure': 'bool' } }
  
  ##

  # @qom-list-types:

Re: [PATCH 0/5] s390x: Add Full Boot Order Support

2024-06-06 Thread Thomas Huth


On 06/06/2024 21.22, Jared Rossi wrote:



On 6/5/24 4:02 AM, Thomas Huth wrote:

On 29/05/2024 17.43, jro...@linux.ibm.com wrote:

From: Jared Rossi 

This patch set primarily adds support for the specification of multiple boot
devices, allowing for the guest to automatically use an alternative device on
a failed boot without needing to be reconfigured. It additionally provides the
ability to define the loadparm attribute on a per-device basis, which allows
boot devices to use different loadparm values if needed.

In brief, an IPLB is generated for each designated boot device (up to a maximum
of 8) and stored in guest memory immediately before BIOS. If a device fails to
boot, the next IPLB is retrieved and we jump back to the start of BIOS.

Devices can be specified using the standard qemu device tag "bootindex" as with
other architectures. Lower number indices are tried first, with "bootindex=0"
indicating the first device to try.
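
As a rough illustration of the ordering rule described above (lower
bootindex first, at most 8 boot devices), the following standalone sketch
sorts a device list by bootindex. DemoBootDev, demo_boot_order() and the
convention that a negative bootindex means "not a boot device" are
assumptions made for this example; they are not taken from the patch set.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_MAX_BOOT_DEVS 8    /* maximum stated in the cover letter */

/* Hypothetical device record; bootindex < 0 means "not a boot device". */
typedef struct DemoBootDev {
    const char *id;
    int bootindex;
} DemoBootDev;

static int demo_cmp_bootindex(const void *a, const void *b)
{
    const DemoBootDev *da = a, *db = b;
    return (da->bootindex > db->bootindex) - (da->bootindex < db->bootindex);
}

/*
 * Move the boot candidates to the front of devs in ascending bootindex
 * order (the array is modified in place) and return how many of them
 * would be tried, capped at DEMO_MAX_BOOT_DEVS.
 */
static size_t demo_boot_order(DemoBootDev *devs, size_t n)
{
    size_t count = 0;

    assert(devs || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (devs[i].bootindex >= 0) {
            devs[count++] = devs[i];
        }
    }
    if (count > 1) {
        qsort(devs, count, sizeof(devs[0]), demo_cmp_bootindex);
    }
    return count > DEMO_MAX_BOOT_DEVS ? DEMO_MAX_BOOT_DEVS : count;
}
```

With devices carrying bootindex 3, 2 and 0 plus one device without a
bootindex, the sketch yields the order 0, 2, 3 and a count of three.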


Is this supposed to work with multiple scsi-hd devices, too? I tried to boot a 
guest with two scsi disks (attached to a single virtio-scsi-ccw adapter) 
where only the second disk had a bootable installation, but that failed...?


 Thomas




Hi Thomas,

Yes, I would expect that to work. I tried to reproduce this using a 
non-bootable scsi disk as the first boot device and then a known-good 
bootable scsi disk as the second boot device, with one controller.  In my 
instance the BIOS was not able to identify the first disk as bootable and so 
that device failed to IPL, but it did move on to the next disk after that, 
and the guest successfully IPL'd from the second device.


When you say it failed, do you mean the first disk failed to boot (as 
expected), but then the guest died without attempting to boot from the 
second disk?  Or did something else happen? I am either not understanding 
your configuration or I am not understanding your error.


I did this:

 $ ./qemu-system-s390x -bios pc-bios/s390-ccw/s390-ccw.img -accel kvm \
   -device virtio-scsi-ccw  -drive if=none,id=d2,file=/tmp/bad.qcow2 \
   -device scsi-hd,drive=d2,bootindex=2 \
   -drive if=none,id=d8,file=/tmp/good.qcow2 \
   -device scsi-hd,drive=d8,bootindex=3 -m 4G -nographic
 LOADPARM=[]
 Using virtio-scsi.
 Using guessed DASD geometry.
 Using ECKD scheme (block size   512), CDL
 No zIPL section in IPL2 record.
 zIPL load failed.

 Trying next boot device...
 LOADPARM=[]
 Using virtio-scsi.
 Using guessed DASD geometry.
 Using ECKD scheme (block size   512), CDL
 No zIPL section in IPL2 record.
 zIPL load failed.

So it claims to try to load from the second disk, but it fails.
If I change the "bootindex=3" of the second disk to "bootindex=1", it boots 
perfectly fine, so I'm sure that the installation on good.qcow2 is working fine.


 Thomas

Re: [PATCH] i386/apic: Add hint on boot failure because of disabling x2APIC

2024-06-06 Thread Philippe Mathieu-Daudé


On 6/6/24 16:08, Zhao Liu wrote:

Currently, Q35 supports up to 4096 vCPUs (since v9.0), but with TCG, if
x2APIC is not explicitly enabled when booting more than 255 vCPUs (e.g.,
qemu-system-i386 -M pc-q35-9.0 -smp 666), the following error is
reported:

Unexpected error in apic_common_set_id() at ../hw/intc/apic_common.c:449:
qemu-system-i386: APIC ID 255 requires x2APIC feature in CPU
Aborted (core dumped)

This error can be resolved by setting x2apic=on in -cpu. In order to
better help users deal with this scenario, add the error hint to
instruct users on how to enable the x2apic feature.


Why not automatically set x2apic=on in this case instead?


Then, the error
report becomes the following:

Unexpected error in apic_common_set_id() at ../hw/intc/apic_common.c:448:
qemu-system-i386: APIC ID 255 requires x2APIC feature in CPU
Try x2apic=on in -cpu.
Aborted (core dumped)

Note that since @errp is &error_abort, error_append_hint() can't be applied
to @errp directly. And in order to keep the exact error message separate
from the (perhaps effective) hint, adding the hint via error_append_hint()
is still necessary. Therefore, introduce @local_error in
apic_common_set_id() to handle both the error message and the error
hint.
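
The local-error pattern described above can be sketched in isolation as
follows. This is a toy imitation under stated assumptions: DemoError and
the demo_* functions are invented for the example and only mimic the split
between message and hint; they are not QEMU's Error API. The threshold of
255 is taken from the error message quoted in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy error object: the message and the hint are stored separately. */
typedef struct DemoError {
    char msg[96];
    char hint[96];
} DemoError;

/* Append text to the hint without touching the error message itself. */
static void demo_error_append_hint(DemoError *err, const char *hint)
{
    size_t used = strlen(err->hint);
    snprintf(err->hint + used, sizeof(err->hint) - used, "%s", hint);
}

/*
 * Build the error in a local object first, so the hint can be attached
 * before the caller decides what to do with it (print, propagate, abort).
 * Returns NULL on success; the caller frees a non-NULL result.
 */
static DemoError *demo_set_apic_id(unsigned id, bool has_x2apic)
{
    DemoError *local_err = NULL;

    if (id >= 255 && !has_x2apic) {
        local_err = calloc(1, sizeof(*local_err));
        assert(local_err);
        snprintf(local_err->msg, sizeof(local_err->msg),
                 "APIC ID %u requires x2APIC feature in CPU", id);
        demo_error_append_hint(local_err, "Try x2apic=on in -cpu.");
    }
    return local_err;
}
```

The point of the sketch is only the ordering: the hint is attached while
the error is still local, before any abort-on-error handling takes over.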

Suggested-by: Philippe Mathieu-Daudé 
Signed-off-by: Zhao Liu 
---
  hw/intc/apic_common.c | 7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)


Reviewed-by: Philippe Mathieu-Daudé

Re: [PATCH v3 5/6] Move tcg implementation of x86 get_physical_address into common helper code.

2024-06-06 Thread Philippe Mathieu-Daudé


On 6/6/24 16:02, Don Porter wrote:

Signed-off-by: Don Porter 
---
  target/i386/cpu.h|  42 ++
  target/i386/helper.c | 515 +
  target/i386/tcg/sysemu/excp_helper.c | 555 +--
  3 files changed, 562 insertions(+), 550 deletions(-)


Reviewed-by: Philippe Mathieu-Daudé

Re: [PATCH 3/5] s390x: Build IPLB chain for multiple boot devices

2024-06-06 Thread Thomas Huth


On 05/06/2024 22.01, Jared Rossi wrote:


On 6/4/24 2:26 PM, Thomas Huth wrote:

On 29/05/2024 17.43, jro...@linux.ibm.com wrote:

From: Jared Rossi 

Write a chain of IPLBs into memory for future use.

The IPLB chain is placed immediately before the BIOS in memory at the highest
unused page boundary providing sufficient space to fit the chain. Because this
is not a fixed address, the location of the next IPLB and number of remaining
boot devices is stored in the QIPL global variable for later access.

At this stage the IPLB chain is not accessed by the guest during IPL.

Signed-off-by: Jared Rossi 
---
  hw/s390x/ipl.h  |   1 +
  include/hw/s390x/ipl/qipl.h |   4 +-
  hw/s390x/ipl.c  | 129 +++-
  3 files changed, 103 insertions(+), 31 deletions(-)

diff --git a/hw/s390x/ipl.h b/hw/s390x/ipl.h
index 1dcb8984bb..4f098d3a81 100644
--- a/hw/s390x/ipl.h
+++ b/hw/s390x/ipl.h
@@ -20,6 +20,7 @@
  #include "qom/object.h"
    #define DIAG308_FLAGS_LP_VALID 0x80
+#define MAX_IPLB_CHAIN 7
    void s390_ipl_set_loadparm(char *ascii_lp, uint8_t *ebcdic_lp);
  void s390_ipl_fmt_loadparm(uint8_t *loadparm, char *str, Error **errp);
diff --git a/include/hw/s390x/ipl/qipl.h b/include/hw/s390x/ipl/qipl.h
index a6ce6ddfe3..481c459a53 100644
--- a/include/hw/s390x/ipl/qipl.h
+++ b/include/hw/s390x/ipl/qipl.h
@@ -34,7 +34,9 @@ struct QemuIplParameters {
  uint8_t  reserved1[3];
  uint64_t netboot_start_addr;
  uint32_t boot_menu_timeout;
-    uint8_t  reserved2[12];
+    uint8_t  reserved2[2];
+    uint16_t num_iplbs;
+    uint64_t next_iplb;
  }  QEMU_PACKED;
  typedef struct QemuIplParameters QemuIplParameters;
  diff --git a/hw/s390x/ipl.c b/hw/s390x/ipl.c
index 2d4f5152b3..79429acabd 100644
--- a/hw/s390x/ipl.c
+++ b/hw/s390x/ipl.c
@@ -55,6 +55,13 @@ static bool iplb_extended_needed(void *opaque)
  return ipl->iplbext_migration;
  }
  +/* Start IPLB chain from the boundary of the first unused page before BIOS */


I'd maybe say "upper boundary" to make it clear that this is at the end of 
the page, not at the beginning?


The chain does start at the beginning of a page.  That being said, the 
comment still needs to be reworded, I'm just not sure exactly how. "Start 
the IPLB chain from the nearest page boundary providing sufficient space 
before BIOS?"  Basically, because each IPLB is 4K, the chain will occupy the 
N unused pages before the start of BIOS, where N is the number of chained 
IPLBs (assuming 4K pages).
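
The placement rule explained above can be written down as a small
calculation. This is only a sketch under the assumptions stated in the
thread (4K pages, 4K per IPLB); demo_iplb_chain_start() and the example
addresses used with it are hypothetical and do not come from hw/s390x/ipl.c.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u    /* assumed page size */
#define DEMO_IPLB_SIZE 4096u    /* sizeof(IplParameterBlock) per the thread */

/*
 * Address of the first IPLB of a chain of n blocks that ends at the page
 * boundary at or below bios_start. Returns 0 if the chain does not fit.
 */
static uint64_t demo_iplb_chain_start(uint64_t bios_start, unsigned n)
{
    uint64_t len = (uint64_t)n * DEMO_IPLB_SIZE;
    /* Round the BIOS start down to a page boundary. */
    uint64_t top = bios_start & ~(uint64_t)(DEMO_PAGE_SIZE - 1);

    if (len > top) {
        return 0;
    }
    return top - len;
}
```

For a BIOS starting at the made-up address 0x400000, a chain of three
IPLBs would begin at 0x3fd000 and occupy exactly the three pages below
the BIOS.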


Ah, right, I missed that sizeof(IplParameterBlock) == 4096 (I guess I was 
looking at the old version in pc-bios/s390-ccw/iplb.h that does not seem to 
have the padding), sorry for the confusion! It's really good that you now 
unify the headers in your first patch!


 Thomas

Re: [PATCH v3 1/6] Add an "info pg" command that prints the current page tables

2024-06-06 Thread Philippe Mathieu-Daudé


On 6/6/24 16:02, Don Porter wrote:

The new "info pg" monitor command prints the current page table,
including virtual address ranges, flag bits, and snippets of physical
page numbers.  Completely filled regions of the page table with
compatible flags are "folded", with the result that the complete
output for a freshly booted x86-64 Linux VM can fit in a single
terminal window.  The output looks like this:

VPN range Entry Flags Physical page
[7f000-7f000] PML4[0fe] ---DA--UWP
   [7f28c-7f28f]  PDP[0a3] ---DA--UWP
 [7f28c4600-7f28c47ff]  PDE[023] ---DA--UWP
   [7f28c4655-7f28c4656]  PTE[055-056] X--D---U-P 007f14-007f15
   [7f28c465b-7f28c465b]  PTE[05b] A--U-P 001cfc
...
[ff800-ff800] PML4[1ff] ---DA--UWP
   [8-b]  PDP[1fe] ---DA---WP
 [81000-81dff]  PDE[008-00e] -GSDA---WP 001000-001dff
   [c-f]  PDP[1ff] ---DA--UWP
 [ff400-ff5ff]  PDE[1fa] ---DA--UWP
   [ff5fb-ff5fc]  PTE[1fb-1fc] XG-DACT-WP 0fec00 0fee00
 [ff600-ff7ff]  PDE[1fb] ---DA--UWP
   [ff600-ff600]  PTE[000] -G-DA--U-P 001467

This draws heavy inspiration from Austin Clements' original patch.

This also adds a generic page table walker, which other monitor
and execution commands will be migrated to in subsequent patches.
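
The "folding" of adjacent entries mentioned in the commit message can be
sketched as below. This is a simplified illustration, not the code from the
patch: DemoEntry, DemoRange and demo_fold() are hypothetical names, and
"compatible flags" is reduced here to "identical flags".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DemoEntry {
    uint64_t vpn;       /* virtual page number */
    uint32_t flags;     /* opaque flag bits */
} DemoEntry;

typedef struct DemoRange {
    uint64_t first_vpn;
    uint64_t last_vpn;
    uint32_t flags;
} DemoRange;

/*
 * Fold entries (sorted by vpn) into ranges: an entry extends the current
 * range when its vpn is adjacent and its flags are identical. Stops when
 * out is full. Returns the number of ranges written.
 */
static size_t demo_fold(const DemoEntry *in, size_t n,
                        DemoRange *out, size_t max_out)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (count > 0 &&
            out[count - 1].last_vpn + 1 == in[i].vpn &&
            out[count - 1].flags == in[i].flags) {
            out[count - 1].last_vpn = in[i].vpn;
            continue;
        }
        if (count == max_out) {
            break;
        }
        out[count].first_vpn = in[i].vpn;
        out[count].last_vpn = in[i].vpn;
        out[count].flags = in[i].flags;
        count++;
    }
    return count;
}
```

Fed with the VPNs 7f28c4655, 7f28c4656 and 7f28c465b from the sample
output, where the first two share their flags, the sketch produces two
ranges, matching the two PTE lines shown there.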

Signed-off-by: Don Porter 
---
  hmp-commands-info.hx  |  13 ++
  hw/core/cpu-sysemu.c  | 140 
  include/hw/core/cpu.h |  34 ++-
  include/hw/core/sysemu-cpu-ops.h  | 156 +
  include/monitor/hmp-target.h  |   1 +
  monitor/hmp-cmds-target.c | 198 +
  target/i386/arch_memory_mapping.c | 351 +-
  target/i386/cpu.c |  11 +
  target/i386/cpu.h |  15 ++
  target/i386/monitor.c | 165 ++
  10 files changed, 1082 insertions(+), 2 deletions(-)

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index 20a9835ea8..a873841920 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -242,6 +242,19 @@ SRST
  Show memory tree.
  ERST
  
+{

+.name   = "pg",
+.args_type  = "",
+.params = "",
+.help   = "show the page table",
+.cmd= hmp_info_pg,
+},
+
+SRST
+  ``info pg``
+Show the active page table.
+ERST
+
  #if defined(CONFIG_TCG)
  {
  .name   = "jit",
diff --git a/hw/core/cpu-sysemu.c b/hw/core/cpu-sysemu.c
index 2a9a2a4eb5..fd936fa90c 100644
--- a/hw/core/cpu-sysemu.c
+++ b/hw/core/cpu-sysemu.c
@@ -142,3 +142,143 @@ GuestPanicInformation *cpu_get_crash_info(CPUState *cpu)
  }
  return res;
  }
+
+/**
+ * _for_each_pte - recursive helper function
+ *
+ * @cs - CPU state
+ * @fn(cs, data, pte, vaddr, height) - User-provided function to call on each
+ * pte.
+ *   * @cs - pass through cs
+ *   * @data - user-provided, opaque pointer
+ *   * @pte - current pte
+ *   * @vaddr_in - virtual address translated by pte
+ *   * @height - height in the tree of pte
+ * @data - user-provided, opaque pointer, passed to fn()
+ * @visit_interior_nodes - if true, call fn() on page table entries in
+ * interior nodes.  If false, only call fn() on page
+ * table entries in leaves.
+ * @visit_not_present - if true, call fn() on entries that are not present.
+ * if false, visit only present entries.
+ * @node - The physical address of the current page table radix tree node
+ * @vaddr_in - The virtual address bits translated in walking the page
+ *  table to node
+ * @height - The height of node in the radix tree
+ *
+ * height starts at the max and counts down.
+ * In a 4 level x86 page table, pml4e is level 4, pdpe is level 3,
+ *  pde is level 2, and pte is level 1
+ *
+ * Returns true on success, false on error.
+ */
+static bool
+_for_each_pte(CPUState *cs,


Please avoid '_' prefix.


+  int (*fn)(CPUState *cs, void *data, PTE_t *pte,
+vaddr vaddr_in, int height, int offset),
+  void *data, bool visit_interior_nodes,
+  bool visit_not_present, hwaddr node,
+  vaddr vaddr_in, int height)
+{
+int ptes_per_node;
+int i;
+
+assert(height > 0);
+
+CPUClass *cc = CPU_GET_CLASS(cs);
+
+if ((!cc->sysemu_ops->page_table_entries_per_node)
+|| (!cc->sysemu_ops->get_pte)
+|| (!cc->sysemu_ops->pte_present)
+|| (!cc->sysemu_ops->pte_leaf)
+|| (!cc->sysemu_ops->pte_child)) {
+return false;


Since this function has local scope, it shouldn't be called with
any of these unset. If you are unsure, we can assert() on them.


+}
+
+ptes_per_node = cc->sysemu_ops->page_table_entries_per_node(cs, height);
+
+fo

Re: [PULL 07/20] virtio-net: Do not propagate ebpf-rss-fds errors

2024-06-06 Thread Akihiko Odaki


On 2024/06/06 16:59, Daniel P. Berrangé wrote:

On Thu, Jun 06, 2024 at 04:19:11PM +0900, Akihiko Odaki wrote:

On 2024/06/06 16:14, Daniel P. Berrangé wrote:

On Thu, Jun 06, 2024 at 05:14:20AM +0900, Akihiko Odaki wrote:

On 2024/06/05 19:23, Daniel P. Berrangé wrote:

On Tue, Jun 04, 2024 at 03:37:42PM +0800, Jason Wang wrote:

From: Akihiko Odaki 

Propagating ebpf-rss-fds errors has several problems.

First, it makes device realization fail and disables the fallback to the
conventional eBPF loading.


AFAICT, this is not a bug - this is desired behaviour.

If the user/mgmt app has told QEMU to use FDs it has passed
in, then any failure to do this *MUST* be treated as a fatal
error. Falling back to other codepaths is ignoring a direct
user request.


The FD options are more of an aid than a request. When QEMU
does not have permission to load eBPF programs, a user can get the eBPF
programs with the request-ebpf QMP command, load them, and pass the FDs to
QEMU.


That still doesn't alter the fact that if the user has chosen to pass FDs
and QEMU fails to use them, it *MUST* report that error back to the user.


The user should be more interested in whether the eBPF functionality is
successfully enabled or not, and that is independent of whether the eBPF
program is loaded by QEMU or someone else.


No, this is wrong. A mgmt application or user will have made a decision
about *how* it wants QEMU to configure a particular feature. QEMU must
always honour the mgmt application's request, and not try to do something
different.

If the mgmt app did not want the FDs to be used, it would not have
passed them to QEMU in the first place. Ignoring the FDs is not likely
to work, because QEMU is unlikely to have permission to open the FDs
itself.

Ignoring the errors when creating the FDs makes it much, much harder
to detect and diagnose deployment problems, because the root cause
error is being discarded, and replaced by a later error which misleads
the app managing QEMU.

Always honouring the user requested config, or giving an error back
when it fails, is standard QEMU practice.
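
The policy argued for above can be condensed into a small decision
function. This is a sketch of the principle only: the enum, the function
name and its three boolean inputs are hypothetical and do not correspond to
the virtio-net code.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum DemoRssResult {
    DEMO_RSS_EBPF_FROM_FDS,     /* user-supplied FDs were used */
    DEMO_RSS_EBPF_SELF_LOADED,  /* the program was loaded without FDs */
    DEMO_RSS_NO_EBPF,           /* nothing requested, nothing loadable */
    DEMO_RSS_ERROR              /* explicit request could not be honoured */
} DemoRssResult;

static DemoRssResult demo_setup_rss(bool fds_given, bool fds_usable,
                                    bool can_self_load)
{
    if (fds_given) {
        /* Explicit request: honour it or fail, never fall back silently. */
        return fds_usable ? DEMO_RSS_EBPF_FROM_FDS : DEMO_RSS_ERROR;
    }
    /* No explicit request, so choosing a path here hides no user error. */
    return can_self_load ? DEMO_RSS_EBPF_SELF_LOADED : DEMO_RSS_NO_EBPF;
}
```

The case that matters for this discussion is the first branch: when FDs
were passed but cannot be used, the result is an error even if loading
without them would have worked.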


I see.

I'll append a follow-up patch to the series "[PATCH 0/3] virtio-net: 
Convert feature properties to OnOffAuto" to remove the fallback path. We 
can keep this for now to remove the flawed error handling code.


Regards,
Akihiko Odaki

Re: [PATCH v3 2/6] Convert 'info tlb' to use generic iterator

2024-06-06 Thread Philippe Mathieu-Daudé


Hi Don,

(Cc'ing Daniel for HumanReadableText)

On 6/6/24 16:02, Don Porter wrote:

Signed-off-by: Don Porter 
---
  include/hw/core/sysemu-cpu-ops.h |   7 +
  monitor/hmp-cmds-target.c|   1 +
  target/i386/cpu.h|   2 +
  target/i386/monitor.c| 217 ++-
  4 files changed, 53 insertions(+), 174 deletions(-)

diff --git a/include/hw/core/sysemu-cpu-ops.h b/include/hw/core/sysemu-cpu-ops.h
index eb16a1c3e2..bf3de3e004 100644
--- a/include/hw/core/sysemu-cpu-ops.h
+++ b/include/hw/core/sysemu-cpu-ops.h
@@ -243,6 +243,13 @@ typedef struct SysemuCPUOps {
  bool (*mon_flush_page_print_state)(CPUState *cs,
 struct mem_print_state *state);
  
+/**

+ * @mon_print_pte: Hook called by the monitor to print a page
+ * table entry at address addr, with contents pte.
+ */
+void (*mon_print_pte) (Monitor *mon, CPUArchState *env, hwaddr addr,
+   hwaddr pte);


IMO the SysemuCPUOps prototype should not take the monitor and should
instead return a HumanReadableText:

  HumanReadableText *(*mon_print_pte)(CPUArchState *env,
  hwaddr addr, hwaddr pte);

Then define a QMP handler, itself registered to the monitor using
monitor_register_hmp_info_hrt().
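
The general shape of this suggestion, a formatter that returns text
instead of printing to a monitor, can be sketched without any QEMU types.
In the sketch below demo_format_pte() and the DEMO_* macros are
hypothetical names; the bit positions are the architectural x86 ones, and
the flag letters follow the print_pte() code removed by this patch.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_PG_RW           (1ULL << 1)
#define DEMO_PG_USER         (1ULL << 2)
#define DEMO_PG_PWT          (1ULL << 3)
#define DEMO_PG_PCD          (1ULL << 4)
#define DEMO_PG_ACCESSED     (1ULL << 5)
#define DEMO_PG_DIRTY        (1ULL << 6)
#define DEMO_PG_PSE          (1ULL << 7)
#define DEMO_PG_GLOBAL       (1ULL << 8)
#define DEMO_PG_NX           (1ULL << 63)
#define DEMO_PG_ADDRESS_MASK 0x000ffffffffff000ULL

/*
 * Return a heap-allocated line describing one PTE, or NULL when out of
 * memory. The caller decides where the text goes and frees it afterwards.
 */
static char *demo_format_pte(uint64_t addr, uint64_t pte)
{
    static const uint64_t bits[] = {
        DEMO_PG_NX, DEMO_PG_GLOBAL, DEMO_PG_PSE, DEMO_PG_DIRTY,
        DEMO_PG_ACCESSED, DEMO_PG_PCD, DEMO_PG_PWT, DEMO_PG_USER, DEMO_PG_RW
    };
    static const char letters[] = "XGPDACTUW";
    char flags[sizeof(letters)];
    const size_t len = 64;
    char *line;

    for (size_t i = 0; i < strlen(letters); i++) {
        flags[i] = (pte & bits[i]) ? letters[i] : '-';
    }
    flags[strlen(letters)] = '\0';

    line = malloc(len);
    if (!line) {
        return NULL;
    }
    snprintf(line, len, "%016" PRIx64 ": %016" PRIx64 " %s",
             addr, pte & DEMO_PG_ADDRESS_MASK, flags);
    return line;
}
```

Because the function no longer knows about the output channel, the same
text could be handed to a monitor, a log, or a test.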

Otherwise the cleanup is nice!

Regards,

Phil.


+
  } SysemuCPUOps;
  
  #endif /* SYSEMU_CPU_OPS_H */

diff --git a/monitor/hmp-cmds-target.c b/monitor/hmp-cmds-target.c
index 60a8bd0c37..3393e5ad0b 100644
--- a/monitor/hmp-cmds-target.c
+++ b/monitor/hmp-cmds-target.c
@@ -318,6 +318,7 @@ void hmp_info_pg(Monitor *mon, const QDict *qdict)
  /* Print last entry, if one present */
  cc->sysemu_ops->mon_flush_page_print_state(cs, &state);
  }
+
  static void memory_dump(Monitor *mon, int count, int format, int wsize,
  hwaddr addr, int is_physical)
  {
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index cbb6f6fc4d..1346ec0033 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2167,6 +2167,8 @@ bool x86_mon_init_page_table_iterator(Monitor *mon,
struct mem_print_state *state);
  void x86_mon_info_pg_print_header(Monitor *mon, struct mem_print_state *state);
  bool x86_mon_flush_print_pg_state(CPUState *cs, struct mem_print_state *state);
+void x86_mon_print_pte(Monitor *mon, CPUArchState *env, hwaddr addr,
+   hwaddr pte);
  
  void x86_cpu_dump_state(CPUState *cs, FILE *f, int flags);
  
diff --git a/target/i386/monitor.c b/target/i386/monitor.c

index 65e82e73e8..ecde164857 100644
--- a/target/i386/monitor.c
+++ b/target/i386/monitor.c
@@ -214,202 +214,71 @@ static hwaddr addr_canonical(CPUArchState *env, hwaddr addr)
  return addr;
  }
  
-static void print_pte(Monitor *mon, CPUArchState *env, hwaddr addr,

-  hwaddr pte, hwaddr mask)
+void x86_mon_print_pte(Monitor *mon, CPUArchState *env, hwaddr addr,
+   hwaddr pte)
  {
+char buf[128];
+char *pos = buf, *end = buf + sizeof(buf);
+
  addr = addr_canonical(env, addr);
  
-monitor_printf(mon, HWADDR_FMT_plx ": " HWADDR_FMT_plx

-   " %c%c%c%c%c%c%c%c%c\n",
-   addr,
-   pte & mask,
-   pte & PG_NX_MASK ? 'X' : '-',
-   pte & PG_GLOBAL_MASK ? 'G' : '-',
-   pte & PG_PSE_MASK ? 'P' : '-',
-   pte & PG_DIRTY_MASK ? 'D' : '-',
-   pte & PG_ACCESSED_MASK ? 'A' : '-',
-   pte & PG_PCD_MASK ? 'C' : '-',
-   pte & PG_PWT_MASK ? 'T' : '-',
-   pte & PG_USER_MASK ? 'U' : '-',
-   pte & PG_RW_MASK ? 'W' : '-');
-}
+pos += snprintf(pos, end - pos, HWADDR_FMT_plx ": " HWADDR_FMT_plx " ",
+addr, (hwaddr) (pte & PG_ADDRESS_MASK));
  
-static void tlb_info_32(Monitor *mon, CPUArchState *env)

-{
-unsigned int l1, l2;
-uint32_t pgd, pde, pte;
+pos += snprintf(pos, end - pos, " %s", pg_bits(pte));
  
-pgd = env->cr[3] & ~0xfff;

-for(l1 = 0; l1 < 1024; l1++) {
-cpu_physical_memory_read(pgd + l1 * 4, &pde, 4);
-pde = le32_to_cpu(pde);
-if (pde & PG_PRESENT_MASK) {
-if ((pde & PG_PSE_MASK) && (env->cr[4] & CR4_PSE_MASK)) {
-/* 4M pages */
-print_pte(mon, env, (l1 << 22), pde, ~((1 << 21) - 1));
-} else {
-for(l2 = 0; l2 < 1024; l2++) {
-cpu_physical_memory_read((pde & ~0xfff) + l2 * 4, &pte, 4);
-pte = le32_to_cpu(pte);
-if (pte & PG_PRESENT_MASK) {
-print_pte(mon, env, (l1 << 22) + (l2 << 12),
-  pte & ~PG_PSE_MASK,
-  ~0xfff);
-}
-}
-}
-}
+/* Trim line to fit screen */
+

Re: [PATCH 5/5] s390x: Enable and document boot device fallback on panic

2024-06-06 Thread Thomas Huth


On 05/06/2024 16.48, Jared Rossi wrote:



diff --git a/pc-bios/s390-ccw/s390-ccw.h b/pc-bios/s390-ccw/s390-ccw.h
index c977a52b50..de3d1f0d5a 100644
--- a/pc-bios/s390-ccw/s390-ccw.h
+++ b/pc-bios/s390-ccw/s390-ccw.h
@@ -43,6 +43,7 @@ typedef unsigned long long u64;
  #include "iplb.h"
    /* start.s */
+extern char _start[];
  void disabled_wait(void) __attribute__ ((__noreturn__));
  void consume_sclp_int(void);
  void consume_io_int(void);
@@ -88,6 +89,11 @@ __attribute__ ((__noreturn__))
  static inline void panic(const char *string)
  {
  sclp_print(string);
+    if (load_next_iplb()) {
+    sclp_print("\nTrying next boot device...");
+    jump_to_IPL_code((long)_start);
+    }
+
  disabled_wait();
  }


Honestly, I am unsure whether this is a really cool idea or a very ugly 
hack ... but I think I tend towards the latter, sorry. Jumping back to the 
startup code might cause various problems, e.g. pre-initialized variables 
don't get their values reset, causing different behavior when the s390-ccw 
bios runs a function a second time this way. Thus this sounds very 
fragile. Could we please try to get things cleaned up correctly, so that 
functions return with error codes instead of panicking when we can 
continue with another boot device? Even if it's more work right now, I 
think this will be much more maintainable in the future.


 Thomas



Thanks Thomas, I appreciate your insight.  Your hesitation is perfectly 
understandable as well.  My initial design was like you suggest, where the 
functions return instead of panic, but the issue I ran into is that netboot 
uses a separate image, which we jump into at the start of IPL from a 
network device (see zipl_load() in pc-bios/s390-ccw/bootmap.c). I wasn't 
able to come up with a simple way to return to the main BIOS code if a 
netboot fails other than by jumping back.  So, it seems to me that netboot 
kind of throws a monkey wrench into the basic idea of reworking the panics 
into returns.


I'm open to suggestions on a better way to recover from a failed netboot, 
and it's certainly possible I've overlooked something, but as far as I can 
tell a jump is necessary in that particular case at least. Netboot could 
perhaps be handled as a special case where the jump back is permitted 
whereas other device types return, but I don't think that actually solves 
the main issue.


What are your thoughts on this?


Yes, I agree that jumping is currently required to get back from the netboot 
code. So if you could rework your patches in a way that limits the jumping 
to a failed netboot, that would be acceptable, I think.
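
The structure discussed above, where boot functions return an error code
and the caller moves on to the next device, could look roughly like the
following. This is a generic sketch, not s390-ccw BIOS code: DemoBootFn,
demo_try_boot() and the two stub loaders are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A loader returns 0 on success and a negative value on failure. */
typedef int (*DemoBootFn)(void);

/*
 * Try each loader in order and return the index of the first one that
 * succeeds, or -1 if all of them fail. No loader panics; failures are
 * reported through the return value so the loop can continue.
 */
static int demo_try_boot(const DemoBootFn *loaders, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (loaders[i] && loaders[i]() == 0) {
            return (int)i;
        }
    }
    return -1;
}

/* Stub loaders standing in for real device-specific boot code. */
static int demo_boot_fail(void) { return -1; }
static int demo_boot_ok(void)   { return 0; }
```

A loader that cannot return at all, which is the netboot case described
in this thread, does not fit this loop and would still need the jump
back as a special case.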


Apart from that: We originally decided to put the netboot code into a 
separate binary since the required roms/SLOF module might not always have 
been checked out (it needed to be done manually), so we were not able to 
compile it in all cases. But nowadays, this is handled in a much nicer way, 
the submodule is automatically checked out once you compile the 
s390x-softmmu target and have a s390x compiler available, so I wonder 
whether we should maybe do the next step and integrate the netboot code into 
the main s390-ccw.img now? Anybody got an opinion on this?


 Thomas

Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API

2024-06-06 Thread Jinpu Wang

Hi Gonglei, hi folks on the list,

On Tue, Jun 4, 2024 at 2:14 PM Gonglei  wrote:
>
> From: Jialin Wang 
>
> Hi,
>
> This patch series attempts to refactor RDMA live migration by
> introducing a new QIOChannelRDMA class based on the rsocket API.
>
> The /usr/include/rdma/rsocket.h header provides a higher-level rsocket API
> that is a 1-1 match of the normal kernel 'sockets' API, which hides the
> details of the RDMA protocol inside rsocket and allows us to add support
> for some modern features like multifd more easily.
>
> Here is the previous discussion on refactoring RDMA live migration using
> the rsocket API:
>
> https://lore.kernel.org/qemu-devel/20240328130255.52257-1-phi...@linaro.org/
>
> We have encountered some bugs when using rsocket and plan to submit them to
> the rdma-core community.
>
> In addition, the use of rsocket makes our programming more convenient,
> but it must be noted that this method introduces multiple memory copies,
> so a certain performance degradation is to be expected. We hope that
> friends with RDMA network cards can help verify this, thank you!
First, thanks for the effort. We are running migration tests on our IB
fabric with different generations of Mellanox HCAs. The migration works
OK, though there are a few failures; Yu will share the results separately.

The one blocker for the change is that the old implementation and the new
rsocket implementation can't talk to each other, because they use different
wire protocols during connection establishment. E.g., the old RDMA migration
uses a special control message during the migration flow, while rsocket uses
a different control message, so there is no way to migrate a VM over the RDMA
transport from a version predating the rsocket patch set to a new version
with the rsocket implementation.

Probably we should keep both implementations for a while, mark the old
implementation as deprecated, promote the new implementation, and
highlight in the docs that they are not compatible.

Regards!
Jinpu



>
> Jialin Wang (6):
>   migration: remove RDMA live migration temporarily
>   io: add QIOChannelRDMA class
>   io/channel-rdma: support working in coroutine
>   tests/unit: add test-io-channel-rdma.c
>   migration: introduce new RDMA live migration
>   migration/rdma: support multifd for RDMA migration
>
>  docs/rdma.txt |  420 ---
>  include/io/channel-rdma.h |  165 ++
>  io/channel-rdma.c |  798 ++
>  io/meson.build|1 +
>  io/trace-events   |   14 +
>  meson.build   |6 -
>  migration/meson.build |3 +-
>  migration/migration-stats.c   |5 +-
>  migration/migration-stats.h   |4 -
>  migration/migration.c |   13 +-
>  migration/migration.h |9 -
>  migration/multifd.c   |   10 +
>  migration/options.c   |   16 -
>  migration/options.h   |2 -
>  migration/qemu-file.c |1 -
>  migration/ram.c   |   90 +-
>  migration/rdma.c  | 4205 +
>  migration/rdma.h  |   67 +-
>  migration/savevm.c|2 +-
>  migration/trace-events|   68 +-
>  qapi/migration.json   |   13 +-
>  scripts/analyze-migration.py  |3 -
>  tests/unit/meson.build|1 +
>  tests/unit/test-io-channel-rdma.c |  276 ++
>  24 files changed, 1360 insertions(+), 4832 deletions(-)
>  delete mode 100644 docs/rdma.txt
>  create mode 100644 include/io/channel-rdma.h
>  create mode 100644 io/channel-rdma.c
>  create mode 100644 tests/unit/test-io-channel-rdma.c
>
> --
> 2.43.0
>

Re: [PATCH v4 0/6] target/riscv: Support RISC-V privilege 1.13 spec

2024-06-06 Thread Alistair Francis

On Thu, Jun 6, 2024 at 11:50 PM Fea.Wang  wrote:
>
> Based on the change log for the RISC-V privilege 1.13 spec, add the
> support for ss1p13.
>
> base-commit: 7a2356147f3a5faebf95dba4140247ec6e5607b1
>
> * Reorder commits
>
> [v3]
> * Correct the mstateen0 for P1P13 in commit message
> * Refactor commit by splitting to two commits
>
> [v2]
> * Check HEDELEGH by hmode32 instead of any32
> * Remove unnecessary code
> * Refine calling functions
>
> [v1]
>
> Ref:https://github.com/riscv/riscv-isa-manual/blob/a7d93c9/src/priv-preface.adoc?plain=1#L40-L72
>
> Lists what to do without clarification or document format.
> * Redefined misa.MXL to be read-only, making MXLEN a constant. (Skip, implementation ignored)
> * Added the constraint that SXLEN≥UXLEN. (Skip, implementation ignored)
> * Defined the misa.V field to reflect that the V extension has been implemented. (Skip, existed)
> * Defined the RV32-only medelegh and hedelegh CSRs. (Done in these patches)
> * Defined the misaligned atomicity granule PMA, superseding the proposed Zam extension. (Skip, implementation ignored)
> * Allocated interrupt 13 for Sscofpmf LCOFI interrupt. (Skip, existed)
> * Defined hardware error and software check exception codes. (Done in these patches)
> * Specified synchronization requirements when changing the PBMTE fields in menvcfg and henvcfg. (Skip, implementation ignored)
> * Incorporated Svade and Svadu extension specifications. (Skip, existed)
>
> Fea.Wang (5):
>   target/riscv: Define macros and variables for ss1p13
>   target/riscv: Add 'P1P13' bit in SMSTATEEN0
>   target/riscv: Add MEDELEGH, HEDELEGH csrs for RV32
>   target/riscv: Reserve exception codes for sw-check and hw-err
>   target/riscv: Support the version for ss1p13
>
> Jim Shu (1):
>   target/riscv: Reuse the conversion function of priv_spec

Thanks!

Applied to riscv-to-apply.next

Alistair

>
>  target/riscv/cpu.c |  8 ++--
>  target/riscv/cpu.h |  5 -
>  target/riscv/cpu_bits.h|  5 +
>  target/riscv/cpu_cfg.h |  1 +
>  target/riscv/csr.c | 39 ++
>  target/riscv/tcg/tcg-cpu.c | 17 -
>  6 files changed, 63 insertions(+), 12 deletions(-)
>
> --
> 2.34.1
>
>

Re: [PATCH 2/3] plugins: Free CPUPluginState before destroying vCPU state

2024-06-06 Thread Philippe Mathieu-Daudé


On 6/6/24 23:14, Pierrick Bouvier wrote:

> On 6/6/24 05:40, Philippe Mathieu-Daudé wrote:
>> cpu::plugin_state is allocated in cpu_common_initfn() when
>> the vCPU state is created. Release it in cpu_common_finalize()
>> when we are done.
>>
>> Signed-off-by: Philippe Mathieu-Daudé
>> ---
>>   include/qemu/plugin.h | 3 +++
>>   hw/core/cpu-common.c  | 5 +
>>   2 files changed, 8 insertions(+)
>>
>> diff --git a/include/qemu/plugin.h b/include/qemu/plugin.h
>> index bc5aef979e..af5f9db469 100644
>> --- a/include/qemu/plugin.h
>> +++ b/include/qemu/plugin.h
>> @@ -149,6 +149,9 @@ struct CPUPluginState {
>>   /**
>>    * qemu_plugin_create_vcpu_state: allocate plugin state
>> + *
>> + * The returned data must be released with g_free()
>> + * when no longer required.
>>    */
>>   CPUPluginState *qemu_plugin_create_vcpu_state(void);
>> diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c
>> index bf1a7b8892..cd15402552 100644
>> --- a/hw/core/cpu-common.c
>> +++ b/hw/core/cpu-common.c
>> @@ -283,6 +283,11 @@ static void cpu_common_finalize(Object *obj)
>>   {
>>   CPUState *cpu = CPU(obj);
>> +#ifdef CONFIG_PLUGIN
>> +    if (tcg_enabled()) {
>> +    g_free(cpu->plugin_state);
>> +    }
>> +#endif
>>   g_array_free(cpu->gdb_regs, TRUE);
>>   qemu_lockcnt_destroy(&cpu->in_ioctl_lock);
>>   qemu_mutex_destroy(&cpu->work_mutex);
>
> To ensure I get it right, order of cpu init/deinit is:
> - init
> - realize
> - unrealize
> - finalize
> Is that correct?


Yes, this is valid for all QDev (CPU is based on it).

+ init: allocate state, expose configurable properties
. user configure properties
+ realize: consume properties to tune the object
+ reset: set default values
. object is used
+ unrealize: undo stuff from realize because the object
  might be realized again (unplug - plug)
+ finalize: release resources

See 
https://lore.kernel.org/qemu-devel/20240209123226.32576-1-phi...@linaro.org/



> Reviewed-by: Pierrick Bouvier


Thanks!
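
The init / realize / unrealize / finalize ordering described above can be
illustrated with a toy object. This is only a sketch under stated
assumptions: ToyCPU and the toy_cpu_*() functions are hypothetical names
and do not use QEMU's QOM API, and calloc()/free() stand in for
g_malloc0()/g_free(). The point it tries to show is that state allocated
in init is released in finalize rather than in unrealize, since the object
may be realized again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a device object; this is not QEMU's CPUState. */
typedef struct ToyCPU {
    void *plugin_state;   /* allocated in init, released in finalize */
    int prop;             /* user-configurable property */
    int tuned_value;      /* derived from the property at realize time */
    bool realized;
} ToyCPU;

/* init: allocate state and give properties their defaults. */
static void toy_cpu_init(ToyCPU *cpu)
{
    cpu->plugin_state = calloc(1, 64);
    cpu->prop = 1;
    cpu->tuned_value = 0;
    cpu->realized = false;
}

/* realize: consume properties to tune the object. */
static void toy_cpu_realize(ToyCPU *cpu)
{
    assert(!cpu->realized);
    cpu->tuned_value = cpu->prop * 2;
    cpu->realized = true;
}

/* unrealize: undo realize only; the object might be realized again. */
static void toy_cpu_unrealize(ToyCPU *cpu)
{
    assert(cpu->realized);
    cpu->tuned_value = 0;
    cpu->realized = false;
}

/* finalize: release the resources that init allocated. */
static void toy_cpu_finalize(ToyCPU *cpu)
{
    assert(!cpu->realized);
    free(cpu->plugin_state);
    cpu->plugin_state = NULL;
}
```

A caller would run init once, may alternate realize and unrealize any
number of times (unplug, then plug), and runs finalize exactly once at the
end.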

[PATCH 1/3] linux-user: Adjust comment to reflect the code.

2024-06-06 Thread Warner Losh

If the user didn't specify a reserved_va, an else branch reserves 32 bits
of address space when a target with 32 (or fewer) address bits runs on a
64-bit host. Update the comment to reflect this, and rejustify it
to 80 columns.

Signed-off-by: Warner Losh 
---
 linux-user/main.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/linux-user/main.c b/linux-user/main.c
index 94e4c47f052..94c99a1366f 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -814,10 +814,10 @@ int main(int argc, char **argv, char **envp)
 thread_cpu = cpu;
 
 /*
- * Reserving too much vm space via mmap can run into problems
- * with rlimits, oom due to page table creation, etc.  We will
- * still try it, if directed by the command-line option, but
- * not by default.
+ * Reserving too much vm space via mmap can run into problems with rlimits,
+ * oom due to page table creation, etc.  We will still try it, if directed
+ * by the command-line option, but not by default. Unless we're running a
+ * target address space of 32 or fewer bits on a host with 64 bits.
  */
 max_reserved_va = MAX_RESERVED_VA(cpu);
 if (reserved_va != 0) {
-- 
2.43.0

[PATCH 2/3] bsd-user: port linux-user:ff8a8bbc2ad1 for variable page sizes

2024-06-06 Thread Warner Losh

Bring in Richard Henderson's ff8a8bbc2ad1 to finalize the page size to
allow TARGET_PAGE_BITS_VARY. bsd-user's "blitz" fork has aarch64
support, which is now variable page size. Add support for it here, even
though it's effectively a nop in upstream qemu.

Signed-off-by: Warner Losh 
---
 bsd-user/main.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/bsd-user/main.c b/bsd-user/main.c
index 29a629d8779..d685734d087 100644
--- a/bsd-user/main.c
+++ b/bsd-user/main.c
@@ -46,6 +46,7 @@
 #include "crypto/init.h"
 #include "qemu/guest-random.h"
 #include "gdbstub/user.h"
+#include "exec/page-vary.h"
 
 #include "host-os.h"
 #include "target_arch_cpu.h"
@@ -291,6 +292,7 @@ int main(int argc, char **argv)
 char **target_environ, **wrk;
 envlist_t *envlist = NULL;
 char *argv0 = NULL;
+int host_page_size;
 
 adjust_ssize();
 
@@ -476,6 +478,16 @@ int main(int argc, char **argv)
  opt_one_insn_per_tb, &error_abort);
 ac->init_machine(NULL);
 }
+
+/*
+ * Finalize page size before creating CPUs.
+ * This will do nothing if !TARGET_PAGE_BITS_VARY.
+ * The most efficient setting is to match the host.
+ */
+host_page_size = qemu_real_host_page_size();
+set_preferred_target_page_bits(ctz32(host_page_size));
+finalize_target_page_bits();
+
 cpu = cpu_create(cpu_type);
 env = cpu_env(cpu);
 cpu_reset(cpu);
-- 
2.43.0

[PATCH 3/3] bsd-user: Catch up to run-time reserved_va math

2024-06-06 Thread Warner Losh

Catch up to linux-user's 8f67b9c694d0, 13c13397556a, 2f7828b57293, and
95059f9c313a by Richard Henderson which made reserved_va a run-time
calculation, defaulting to nothing except in the case of 64-bit host
32-bit target. Also include the adjustment of the comment heading that
work submitted in the same patch stream. Since this is a direct copy,
squash it into one patch rather than follow the Linux evolution since
breaking this down further at this point doesn't make sense for this
"new code".

Signed-off-by: Warner Losh 
---
 bsd-user/main.c | 39 +++
 1 file changed, 27 insertions(+), 12 deletions(-)

diff --git a/bsd-user/main.c b/bsd-user/main.c
index d685734d087..dcad266c2c9 100644
--- a/bsd-user/main.c
+++ b/bsd-user/main.c
@@ -77,25 +77,16 @@ bool have_guest_base;
 # if HOST_LONG_BITS > TARGET_VIRT_ADDR_SPACE_BITS
 #  if TARGET_VIRT_ADDR_SPACE_BITS == 32 && \
   (TARGET_LONG_BITS == 32 || defined(TARGET_ABI32))
-#   define MAX_RESERVED_VA  0xfffffffful
+#   define MAX_RESERVED_VA(CPU)  0xfffffffful
 #  else
-#   define MAX_RESERVED_VA  ((1ul << TARGET_VIRT_ADDR_SPACE_BITS) - 1)
+#   define MAX_RESERVED_VA(CPU)  ((1ul << TARGET_VIRT_ADDR_SPACE_BITS) - 1)
 #  endif
 # else
-#  define MAX_RESERVED_VA  0
+#  define MAX_RESERVED_VA(CPU)  0
 # endif
 #endif
 
-/*
- * That said, reserving *too* much vm space via mmap can run into problems
- * with rlimits, oom due to page table creation, etc.  We will still try it,
- * if directed by the command-line option, but not by default.
- */
-#if HOST_LONG_BITS == 64 && TARGET_VIRT_ADDR_SPACE_BITS <= 32
-unsigned long reserved_va = MAX_RESERVED_VA;
-#else
 unsigned long reserved_va;
-#endif
 
 const char *interp_prefix = CONFIG_QEMU_INTERP_PREFIX;
 const char *qemu_uname_release;
@@ -293,6 +284,7 @@ int main(int argc, char **argv)
 envlist_t *envlist = NULL;
 char *argv0 = NULL;
 int host_page_size;
+unsigned long max_reserved_va;
 
 adjust_ssize();
 
@@ -493,6 +485,29 @@ int main(int argc, char **argv)
 cpu_reset(cpu);
 thread_cpu = cpu;
 
+/*
+ * Reserving too much vm space via mmap can run into problems with rlimits,
+ * oom due to page table creation, etc.  We will still try it, if directed
+ * by the command-line option, but not by default. Unless we're running a
+ * target address space of 32 or fewer bits on a host with 64 bits.
+ */
+max_reserved_va = MAX_RESERVED_VA(cpu);
+if (reserved_va != 0) {
+if ((reserved_va + 1) % host_page_size) {
+char *s = size_to_str(host_page_size);
+fprintf(stderr, "Reserved virtual address not aligned mod %s\n", s);
+g_free(s);
+exit(EXIT_FAILURE);
+}
+if (max_reserved_va && reserved_va > max_reserved_va) {
+fprintf(stderr, "Reserved virtual address too big\n");
+exit(EXIT_FAILURE);
+}
+} else if (HOST_LONG_BITS == 64 && TARGET_VIRT_ADDR_SPACE_BITS <= 32) {
+/* MAX_RESERVED_VA + 1 is a large power of 2, so is aligned. */
+reserved_va = max_reserved_va;
+}
+
 if (getenv("QEMU_STRACE")) {
 do_strace = 1;
 }
-- 
2.43.0
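
The run-time reserved_va selection in the patch above can be sketched as a
standalone function. This is an illustration, not code from the series:
pick_reserved_va() is a hypothetical helper, and its want_default
parameter stands in for the compile-time condition
HOST_LONG_BITS == 64 && TARGET_VIRT_ADDR_SPACE_BITS <= 32. Since
reserved_va is the highest reserved address, it is reserved_va + 1 that
has to be a multiple of the host page size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the checks in the patch.
 * Returns true and stores the effective value in *out on success,
 * false where the patch would print an error and exit.
 */
static bool pick_reserved_va(uint64_t requested, uint64_t max_reserved_va,
                             uint64_t host_page_size, bool want_default,
                             uint64_t *out)
{
    assert(host_page_size != 0);

    if (requested != 0) {
        if ((requested + 1) % host_page_size) {
            return false;   /* not aligned to the host page size */
        }
        if (max_reserved_va && requested > max_reserved_va) {
            return false;   /* larger than the target allows */
        }
        *out = requested;
    } else if (want_default) {
        /* max_reserved_va + 1 is a large power of 2, so it is aligned. */
        *out = max_reserved_va;
    } else {
        *out = 0;           /* reserve nothing by default */
    }
    return true;
}
```

With a 4096-byte host page, a request of 0x7fffffff passes the alignment
check because 0x80000000 is page aligned, while 0x7ffffffe is rejected.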

[PATCH 0/3] bsd-user: Baby Steps towards eliminating qemu_host_page_size, et al

2024-06-06 Thread Warner Losh

First baby-steps towards eliminating qemu_host_page_size: tackle the reserved_va
calculation (which is easier to copy from linux-user than to fix).

Warner Losh (3):
  linux-user: Adjust comment to reflect the code.
  bsd-user: port linux-user:ff8a8bbc2ad1 for variable page sizes
  bsd-user: Catch up to run-time reserved_va math

 bsd-user/main.c   | 51 ---
 linux-user/main.c |  8 
 2 files changed, 43 insertions(+), 16 deletions(-)

-- 
2.43.0

[PATCH v3] target/loongarch/kvm: Add software breakpoint support

2024-06-06 Thread Bibo Mao

With KVM virtualization, the debug exception for a normal break
instruction is injected into the guest kernel rather than the host.
Here a hypercall instruction with a special code is used for software
breakpoints; the exact instruction is provided by the KVM kernel
through the user API KVM_REG_LOONGARCH_DEBUG_INST.

For now only software breakpoints are supported, and they can be
inserted and removed. The guest kernel can be debugged with gdb once
the kernel is loaded; hardware breakpoint support will be added later.

Signed-off-by: Bibo Mao 
---
v2 ... v3:
  1. Refresh the patch on top of the latest version; it now compiles and
runs since the kvm uapi header files have been updated.
v1 ... v2:
  1. Enable TARGET_KVM_HAVE_GUEST_DEBUG on loongarch64 platform
---
 configs/targets/loongarch64-softmmu.mak |  1 +
 target/loongarch/kvm/kvm.c  | 76 +
 2 files changed, 77 insertions(+)

diff --git a/configs/targets/loongarch64-softmmu.mak b/configs/targets/loongarch64-softmmu.mak
index 84beb19b90..65b65e0c34 100644
--- a/configs/targets/loongarch64-softmmu.mak
+++ b/configs/targets/loongarch64-softmmu.mak
@@ -1,5 +1,6 @@
 TARGET_ARCH=loongarch64
 TARGET_BASE_ARCH=loongarch
+TARGET_KVM_HAVE_GUEST_DEBUG=y
 TARGET_SUPPORTS_MTTCG=y
 TARGET_XML_FILES= gdb-xml/loongarch-base32.xml gdb-xml/loongarch-base64.xml gdb-xml/loongarch-fpu.xml
 # all boards require libfdt
diff --git a/target/loongarch/kvm/kvm.c b/target/loongarch/kvm/kvm.c
index 8e6e27c8bf..e1be6a6959 100644
--- a/target/loongarch/kvm/kvm.c
+++ b/target/loongarch/kvm/kvm.c
@@ -28,6 +28,7 @@
 #include "trace.h"
 
 static bool cap_has_mp_state;
+static unsigned int brk_insn;
 const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
 KVM_CAP_LAST_INFO
 };
@@ -664,7 +665,14 @@ static void kvm_loongarch_vm_stage_change(void *opaque, bool running,
 
 int kvm_arch_init_vcpu(CPUState *cs)
 {
+uint64_t val;
+
 qemu_add_vm_change_state_handler(kvm_loongarch_vm_stage_change, cs);
+
+if (!kvm_get_one_reg(cs, KVM_REG_LOONGARCH_DEBUG_INST, &val)) {
+brk_insn = val;
+}
+
 return 0;
 }
 
@@ -739,6 +747,67 @@ bool kvm_arch_stop_on_emulation_error(CPUState *cs)
 return true;
 }
 
+void kvm_arch_update_guest_debug(CPUState *cpu, struct kvm_guest_debug *dbg)
+{
+if (kvm_sw_breakpoints_active(cpu)) {
+dbg->control |= KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_USE_SW_BP;
+}
+}
+
+int kvm_arch_insert_sw_breakpoint(CPUState *cs, struct kvm_sw_breakpoint *bp)
+{
+if (cpu_memory_rw_debug(cs, bp->pc, (uint8_t *)&bp->saved_insn, 4, 0) ||
+cpu_memory_rw_debug(cs, bp->pc, (uint8_t *)&brk_insn, 4, 1)) {
+error_report("%s failed", __func__);
+return -EINVAL;
+}
+return 0;
+}
+
+int kvm_arch_remove_sw_breakpoint(CPUState *cs, struct kvm_sw_breakpoint *bp)
+{
+static uint32_t brk;
+
+if (cpu_memory_rw_debug(cs, bp->pc, (uint8_t *)&brk, 4, 0) ||
+brk != brk_insn ||
+cpu_memory_rw_debug(cs, bp->pc, (uint8_t *)&bp->saved_insn, 4, 1)) {
+error_report("%s failed", __func__);
+return -EINVAL;
+}
+return 0;
+}
+
+int kvm_arch_insert_hw_breakpoint(vaddr addr, vaddr len, int type)
+{
+return -ENOSYS;
+}
+
+int kvm_arch_remove_hw_breakpoint(vaddr addr, vaddr len, int type)
+{
+return -ENOSYS;
+}
+
+void kvm_arch_remove_all_hw_breakpoints(void)
+{
+}
+
+static bool kvm_loongarch_handle_debug(CPUState *cs, struct kvm_run *run)
+{
+LoongArchCPU *cpu = LOONGARCH_CPU(cs);
+CPULoongArchState *env = &cpu->env;
+
+kvm_cpu_synchronize_state(cs);
+if (cs->singlestep_enabled) {
+return true;
+}
+
+if (kvm_find_sw_breakpoint(cs, env->pc)) {
+return true;
+}
+
+return false;
+}
+
 int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
 {
 int ret = 0;
@@ -757,6 +826,13 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
  run->iocsr_io.len,
  run->iocsr_io.is_write);
 break;
+
+case KVM_EXIT_DEBUG:
+if (kvm_loongarch_handle_debug(cs, run)) {
+ret = EXCP_DEBUG;
+}
+break;
+
 default:
 ret = -1;
 warn_report("KVM: unknown exit reason %d", run->exit_reason);

base-commit: dec9742cbc59415a8b83e382e7ae36395394e4bd
-- 
2.39.3
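
The insert/remove flow in the patch above (save the original instruction,
write the break instruction, and on removal verify and restore) can be
sketched against a plain byte array. This is a hedged illustration only:
ToyBreakpoint, toy_mem_rw() and the toy_*_sw_breakpoint() functions are
hypothetical, with toy_mem_rw() standing in for cpu_memory_rw_debug(), and
the break instruction is passed in as a parameter instead of being read
from KVM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical breakpoint record, for illustration only. */
typedef struct ToyBreakpoint {
    uint64_t pc;          /* byte offset into the toy guest memory */
    uint32_t saved_insn;  /* original instruction found at pc */
} ToyBreakpoint;

/*
 * Copy one 4-byte instruction from guest memory (is_write == 0) or to
 * guest memory (is_write != 0). Returns 0 on success, -1 if out of range.
 */
static int toy_mem_rw(uint8_t *mem, size_t mem_size, uint64_t addr,
                      uint32_t *val, int is_write)
{
    assert(mem != NULL && val != NULL);

    if (mem_size < sizeof(*val) || addr > mem_size - sizeof(*val)) {
        return -1;
    }
    if (is_write) {
        memcpy(mem + addr, val, sizeof(*val));
    } else {
        memcpy(val, mem + addr, sizeof(*val));
    }
    return 0;
}

/* Save the original instruction, then overwrite it with brk_insn. */
static int toy_insert_sw_breakpoint(uint8_t *mem, size_t mem_size,
                                    ToyBreakpoint *bp, uint32_t brk_insn)
{
    if (toy_mem_rw(mem, mem_size, bp->pc, &bp->saved_insn, 0) ||
        toy_mem_rw(mem, mem_size, bp->pc, &brk_insn, 1)) {
        return -1;
    }
    return 0;
}

/* Restore the saved instruction, but only if brk_insn is still in place. */
static int toy_remove_sw_breakpoint(uint8_t *mem, size_t mem_size,
                                    ToyBreakpoint *bp, uint32_t brk_insn)
{
    uint32_t cur;

    if (toy_mem_rw(mem, mem_size, bp->pc, &cur, 0) ||
        cur != brk_insn ||
        toy_mem_rw(mem, mem_size, bp->pc, &bp->saved_insn, 1)) {
        return -1;
    }
    return 0;
}
```

The comparison against brk_insn on removal guards against restoring over
memory that the guest has modified in the meantime.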

Re: [PATCH v2 2/2] util/bufferiszero: Add loongarch64 vector acceleration

2024-06-06 Thread maobibo





On 2024/6/7 8:24 AM, Richard Henderson wrote:

Use inline assembly because no release compiler allows
per-function selection of the ISA.

Signed-off-by: Richard Henderson 
---
  .../loongarch64/host/bufferiszero.c.inc   | 143 ++
  1 file changed, 143 insertions(+)
  create mode 100644 host/include/loongarch64/host/bufferiszero.c.inc

diff --git a/host/include/loongarch64/host/bufferiszero.c.inc b/host/include/loongarch64/host/bufferiszero.c.inc
new file mode 100644
index 0000000000..69891eac80
--- /dev/null
+++ b/host/include/loongarch64/host/bufferiszero.c.inc
@@ -0,0 +1,143 @@
+/*
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ * buffer_is_zero acceleration, loongarch64 version.
+ */
+
+/*
+ * Builtins for LSX and LASX are introduced by gcc 14 and llvm 18,
+ * but as yet neither has support for attribute target, so neither
+ * is able to enable the optimization without globally enabling
+ * vector support.  Since we want runtime detection, use assembly.
+ */
+
+static bool buffer_is_zero_lsx(const void *buf, size_t len)
+{
+const void *p = QEMU_ALIGN_PTR_DOWN(buf + 16, 16);
+const void *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 16) - (7 * 16);
+const void *l = buf + len;
+bool ret;
+
+asm("vld $vr0,%2,0\n\t" /* first: buf + 0 */
+"vld $vr1,%4,-16\n\t"   /* last: buf + len - 16 */
+"vld $vr2,%3,0\n\t" /* e[0] */
+"vld $vr3,%3,16\n\t"/* e[1] */
+"vld $vr4,%3,32\n\t"/* e[2] */
+"vld $vr5,%3,48\n\t"/* e[3] */
+"vld $vr6,%3,64\n\t"/* e[4] */
+"vld $vr7,%3,80\n\t"/* e[5] */
+"vld $vr8,%3,96\n\t"/* e[6] */
+"vor.v $vr0,$vr0,$vr1\n\t"
+"vor.v $vr2,$vr2,$vr3\n\t"
+"vor.v $vr4,$vr4,$vr5\n\t"
+"vor.v $vr6,$vr6,$vr7\n\t"
+"vor.v $vr0,$vr0,$vr2\n\t"
+"vor.v $vr4,$vr4,$vr6\n\t"
+"vor.v $vr0,$vr0,$vr4\n\t"
+"vor.v $vr0,$vr0,$vr8\n\t"
+"or %0,$r0,$r0\n"   /* prepare return false */
+"1:\n\t"
+"vsetnez.v $fcc0,$vr0\n\t"
+"bcnez $fcc0,2f\n\t"
+"vld $vr0,%1,0\n\t" /* p[0] */
+"vld $vr1,%1,16\n\t"/* p[1] */
+"vld $vr2,%1,32\n\t"/* p[2] */
+"vld $vr3,%1,48\n\t"/* p[3] */
+"vld $vr4,%1,64\n\t"/* p[4] */
+"vld $vr5,%1,80\n\t"/* p[5] */
+"vld $vr6,%1,96\n\t"/* p[6] */
+"vld $vr7,%1,112\n\t"   /* p[7] */
+"addi.d %1,%1,128\n\t"
+"vor.v $vr0,$vr0,$vr1\n\t"
+"vor.v $vr2,$vr2,$vr3\n\t"
+"vor.v $vr4,$vr4,$vr5\n\t"
+"vor.v $vr6,$vr6,$vr7\n\t"
+"vor.v $vr0,$vr0,$vr2\n\t"
+"vor.v $vr4,$vr4,$vr6\n\t"
+"vor.v $vr0,$vr0,$vr4\n\t"
+"bltu %1,%3,1b\n\t"
+"vsetnez.v $fcc0,$vr0\n\t"
+"bcnez $fcc0,2f\n\t"
+"ori %0,$r0,1\n"
+"2:"
+: "=&r"(ret), "+r"(p)
+: "r"(buf), "r"(e), "r"(l)
+: "f0", "f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "fcc0");
+
+return ret;
+}
+
+static bool buffer_is_zero_lasx(const void *buf, size_t len)
+{
+const void *p = QEMU_ALIGN_PTR_DOWN(buf + 32, 32);
+const void *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 32) - (7 * 32);
+const void *l = buf + len;
+bool ret;
+
+asm("xvld $xr0,%2,0\n\t" /* first: buf + 0 */
+"xvld $xr1,%4,-32\n\t"   /* last: buf + len - 32 */
+"xvld $xr2,%3,0\n\t" /* e[0] */
+"xvld $xr3,%3,32\n\t"/* e[1] */
+"xvld $xr4,%3,64\n\t"/* e[2] */
+"xvld $xr5,%3,96\n\t"/* e[3] */
+"xvld $xr6,%3,128\n\t"   /* e[4] */
+"xvld $xr7,%3,160\n\t"   /* e[5] */
+"xvld $xr8,%3,192\n\t"   /* e[6] */
+"xvor.v $xr0,$xr0,$xr1\n\t"
+"xvor.v $xr2,$xr2,$xr3\n\t"
+"xvor.v $xr4,$xr4,$xr5\n\t"
+"xvor.v $xr6,$xr6,$xr7\n\t"
+"xvor.v $xr0,$xr0,$xr2\n\t"
+"xvor.v $xr4,$xr4,$xr6\n\t"
+"xvor.v $xr0,$xr0,$xr4\n\t"
+"xvor.v $xr0,$xr0,$xr8\n\t"
+"or %0,$r0,$r0\n\t"  /* prepare return false */
+"bgeu %1,%3,2f\n"
+"1:\n\t"
+"xvsetnez.v $fcc0,$xr0\n\t"
+"bcnez $fcc0,3f\n\t"
+"xvld $xr0,%1,0\n\t" /* p[0] */
+"xvld $xr1,%1,32\n\t"/* p[1] */
+"xvld $xr2,%1,64\n\t"/* p[2] */
+"xvld $xr3,%1,96\n\t"/* p[3] */
+"xvld $xr4,%1,128\n\t"   /* p[4] */
+"xvld $xr5,%1,160\n\t"   /* p[5] */
+"xvld $xr6,%1,192\n\t"   /* p[6] */
+"xvld $xr7,%1,224\n\t"   /* p[7] */
+"addi.d %1,%1,256\n\t"
+"xvor.v $xr0,$xr0,$xr1\n\t"
+"xvor.v $xr2,$xr2,$xr3\n\t"
+"xvor.v $xr4,$xr4,$xr5\n\t"
+"xvor.v $xr6,$xr6,$xr7\n\t"
+

Re: [RFC PATCH] migration/savevm: do not schedule snapshot_save_job_bh in qemu_aio_context

2024-06-06 Thread Stefan Hajnoczi

On Wed, Jun 05, 2024 at 02:08:48PM +0200, Fiona Ebner wrote:
> The fact that the snapshot_save_job_bh() is scheduled in the main
> loop's qemu_aio_context AioContext means that it might get executed
> during a vCPU thread's aio_poll(). But saving of the VM state cannot
> happen while the guest or devices are active and can lead to assertion
> failures. See issue #2111 for two examples. Avoid the problem by
> scheduling the snapshot_save_job_bh() in the iohandler AioContext,
> which is not polled by vCPU threads.
> 
> Solves Issue #2111.
> 
> This change also solves the following issue:
> 
> Since commit effd60c878 ("monitor: only run coroutine commands in
> qemu_aio_context"), the 'snapshot-save' QMP call would not respond
> right after starting the job anymore, but only after the job finished,
> which can take a long time. The reason is, because after commit
> effd60c878, do_qmp_dispatch_bh() runs in the iohandler AioContext.
> When do_qmp_dispatch_bh() wakes the qmp_dispatch() coroutine, the
> coroutine cannot be entered immediately anymore, but needs to be
> scheduled to the main loop's qemu_aio_context AioContext. But
> snapshot_save_job_bh() was scheduled first to the same AioContext and
> thus gets executed first.
> 
> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/2111
> Signed-off-by: Fiona Ebner 
> ---
> 
> While initial smoke testing seems fine, I'm not familiar enough with
> this to rule out any pitfalls with the approach. Any reason why
> scheduling to the iohandler AioContext could be wrong here?

If something waits for a BlockJob to finish using aio_poll() from
qemu_aio_context then a deadlock is possible since the iohandler_ctx
won't get a chance to execute. The only suspicious code path I found was
job_completed_txn_abort_locked() -> job_finish_sync_locked() but I'm not
sure whether it triggers this scenario. Please check that code path.

> Should the same be done for the snapshot_load_job_bh and
> snapshot_delete_job_bh to keep it consistent?

In the long term it would be cleaner to move away from synchronous APIs
that rely on nested event loops. They have been a source of bugs for
years.

If vm_stop() and perhaps other operations in save_snapshot() were
asynchronous, then it would be safe to run the operation in
qemu_aio_context without using iohandler_ctx. vm_stop() wouldn't invoke
its callback until vCPUs were quiesced and outside device emulation
code.

I think this patch is fine as a one-line bug fix, but we should be
careful about falling back on this trick because it makes the codebase
harder to understand and more fragile.

> 
>  migration/savevm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/migration/savevm.c b/migration/savevm.c
> index c621f2359b..0086b76ab0 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -3459,7 +3459,7 @@ static int coroutine_fn snapshot_save_job_run(Job *job, Error **errp)
>  SnapshotJob *s = container_of(job, SnapshotJob, common);
>  s->errp = errp;
>  s->co = qemu_coroutine_self();
> -aio_bh_schedule_oneshot(qemu_get_aio_context(),
> +aio_bh_schedule_oneshot(iohandler_get_aio_context(),
>  snapshot_save_job_bh, job);
>  qemu_coroutine_yield();
>  return s->ret ? 0 : -1;
> -- 
> 2.39.2



Re: [PATCH v4 6/6] target/riscv: Support the version for ss1p13

2024-06-06 Thread Alistair Francis

On Thu, Jun 6, 2024 at 11:51 PM Fea.Wang  wrote:
>
> Add RISC-V privilege 1.13 support.
>
> Signed-off-by: Fea.Wang 
> Signed-off-by: Fea.Wang 
> Reviewed-by: Frank Chang 
> Reviewed-by: Weiwei Li 
> Reviewed-by: LIU Zhiwei 

Reviewed-by: Alistair Francis 

Alistair

> ---
>  target/riscv/cpu.c | 6 +-
>  target/riscv/tcg/tcg-cpu.c | 4 
>  2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
> index fd0f09c468..4760cb2cc1 100644
> --- a/target/riscv/cpu.c
> +++ b/target/riscv/cpu.c
> @@ -1779,7 +1779,9 @@ static int priv_spec_from_str(const char *priv_spec_str)
>  {
>  int priv_version = -1;
>
> -if (!g_strcmp0(priv_spec_str, PRIV_VER_1_12_0_STR)) {
> +if (!g_strcmp0(priv_spec_str, PRIV_VER_1_13_0_STR)) {
> +priv_version = PRIV_VERSION_1_13_0;
> +} else if (!g_strcmp0(priv_spec_str, PRIV_VER_1_12_0_STR)) {
>  priv_version = PRIV_VERSION_1_12_0;
>  } else if (!g_strcmp0(priv_spec_str, PRIV_VER_1_11_0_STR)) {
>  priv_version = PRIV_VERSION_1_11_0;
> @@ -1799,6 +1801,8 @@ const char *priv_spec_to_str(int priv_version)
>  return PRIV_VER_1_11_0_STR;
>  case PRIV_VERSION_1_12_0:
>  return PRIV_VER_1_12_0_STR;
> +case PRIV_VERSION_1_13_0:
> +return PRIV_VER_1_13_0_STR;
>  default:
>  return NULL;
>  }
> diff --git a/target/riscv/tcg/tcg-cpu.c b/target/riscv/tcg/tcg-cpu.c
> index 4c6141f947..eb6f7b9d12 100644
> --- a/target/riscv/tcg/tcg-cpu.c
> +++ b/target/riscv/tcg/tcg-cpu.c
> @@ -318,6 +318,10 @@ static void riscv_cpu_update_named_features(RISCVCPU *cpu)
>  cpu->cfg.has_priv_1_12 = true;
>  }
>
> +if (cpu->env.priv_ver >= PRIV_VERSION_1_13_0) {
> +cpu->cfg.has_priv_1_13 = true;
> +}
> +
>  /* zic64b is 1.12 or later */
>  cpu->cfg.ext_zic64b = cpu->cfg.cbom_blocksize == 64 &&
>cpu->cfg.cbop_blocksize == 64 &&
> --
> 2.34.1
>
>

Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API

2024-06-06 Thread Zhijian Li (Fujitsu)



On 06/06/2024 19:31, Leon Romanovsky wrote:
> On Wed, Jun 05, 2024 at 10:00:24AM +, Gonglei (Arei) wrote:
>>
>>
>>> -Original Message-
>>> From: Michael S. Tsirkin [mailto:m...@redhat.com]
>>> Sent: Wednesday, June 5, 2024 3:57 PM
>>> To: Gonglei (Arei) 
>>> Cc: qemu-devel@nongnu.org; pet...@redhat.com; yu.zh...@ionos.com;
>>> mgal...@akamai.com; elmar.ger...@ionos.com; zhengchuan
>>> ; berra...@redhat.com; arm...@redhat.com;
>>> lizhij...@fujitsu.com; pbonz...@redhat.com; Xiexiangyou
>>> ; linux-r...@vger.kernel.org; lixiao (H)
>>> ; jinpu.w...@ionos.com; Wangjialin
>>> 
>>> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
>>>
>>> On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
>>>> From: Jialin Wang
>>>>
>>>> Hi,
>>>>
>>>> This patch series attempts to refactor RDMA live migration by
>>>> introducing a new QIOChannelRDMA class based on the rsocket API.
>>>>
>>>> The /usr/include/rdma/rsocket.h provides a higher level rsocket API
>>>> that is a 1-1 match of the normal kernel 'sockets' API, which hides
>>>> the detail of rdma protocol into rsocket and allows us to add support
>>>> for some modern features like multifd more easily.
>>>>
>>>> Here is the previous discussion on refactoring RDMA live migration
>>>> using the rsocket API:
>>>>
>>>> https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
>>>>
>>>> We have encountered some bugs when using rsocket and plan to submit
>>>> them to the rdma-core community.
>>>>
>>>> In addition, the use of rsocket makes our programming more convenient,
>>>> but it must be noted that this method introduces multiple memory
>>>> copies, which can be imagined that there will be a certain performance
>>>> degradation, hoping that friends with RDMA network cards can help verify,
>>>> thank you!
>>>
>>> So you didn't test it with an RDMA card?
>>
>> Yep, we tested it by Soft-ROCE.
> 
> Does Soft-RoCE (RXE) support live migration?


Yes, it does


Thanks
Zhijian

> 
> Thanks
>

Re: [PATCH 2/2] util/bufferiszero: Add simd acceleration for loongarch64

2024-06-06 Thread Richard Henderson


On 6/5/24 21:00, maobibo wrote:
>> No, because the ifdef checks that the *compiler* is prepared to use LASX/LSX
>> instructions itself without further checks.  There's no point in qemu checking further.
> By my understanding, currently compiler option is the same with all files, there is no
> separate compiler option with single file or file function.
>
> So if compiler is prepared to use LASX/LSX instructions itself, host hardware must support
> LASX/LSX instructions, else there will be problem.

Correct.

> My main concern is that there is one hw machine which supports LSX, but no LASX, no KVM
> neither.  QEMU binary maybe fails to run on such hw machine if it is compiled with LASX
> option.


Yes, that would be a problem for packaging qemu for distribution.

An alternative is to write these functions in assembly.  While it's worth prioritizing
implementation of __attribute__((target())) in the compilers, the very earliest that could
happen is gcc 15, which is far from being something qemu can rely on.  Writing the functions
in assembly would also allow this optimization to happen with gcc 13, which doesn't support
the builtins either.


I just sent a patch set along these lines.


r~
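
The distinction discussed above, a compile-time ifdef versus run-time
detection, can be sketched with a function pointer chosen at startup. This
is a minimal sketch under stated assumptions, not QEMU's implementation:
biz_generic(), biz_vector() and biz_select() are hypothetical names, the
host_has_vector flag stands in for real CPU feature detection, and
biz_vector() is a placeholder that simply calls the generic routine where
hand-written assembly would go.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef bool (*biz_fn)(const void *buf, size_t len);

/* Portable fallback: OR all bytes together and test the result. */
static bool biz_generic(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    unsigned char acc = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        acc |= p[i];
    }
    return acc == 0;
}

/*
 * Placeholder for an accelerated routine. A real version would use vector
 * instructions that the rest of the binary is not compiled to require.
 */
static bool biz_vector(const void *buf, size_t len)
{
    return biz_generic(buf, len);
}

/* Pick an implementation from features detected at run time. */
static biz_fn biz_select(bool host_has_vector)
{
    return host_has_vector ? biz_vector : biz_generic;
}
```

Because the choice is made from a run-time flag, one binary can run on
hosts with and without the vector extension, which is the packaging
concern raised in the thread.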

[PATCH v2 0/2] util/bufferiszero: Split out hosts, add loongarch64

2024-06-06 Thread Richard Henderson

Based-on: 20240527211912.14060-1-richard.hender...@linaro.org
("[PATCH 00/18] tcg/loongarch64: Support v64 and v256")
For "util/loongarch64: Detect LASX vector support"

For v2:
  * Rename to bufferiszero.c.inc (philmd).
  * Add inline assembly for loongarch64.

On cfarm400.cfarm.net (Loongson-3C5000L-LL @ 2.0GHz):

# Start of bufferiszero tests
# buffer_is_zero #0:  1KB    11021 MB/sec
# buffer_is_zero #0:  4KB    32107 MB/sec
# buffer_is_zero #0: 16KB    59118 MB/sec
# buffer_is_zero #0: 64KB    67940 MB/sec
#
# buffer_is_zero #1:  1KB     9540 MB/sec
# buffer_is_zero #1:  4KB    24050 MB/sec
# buffer_is_zero #1: 16KB    38082 MB/sec
# buffer_is_zero #1: 64KB    36399 MB/sec
#
# buffer_is_zero #2:  1KB     8026 MB/sec
# buffer_is_zero #2:  4KB    15493 MB/sec
# buffer_is_zero #2: 16KB    20865 MB/sec
# buffer_is_zero #2: 64KB    19694 MB/sec


r~


Richard Henderson (2):
  util/bufferiszero: Split out host include files
  util/bufferiszero: Add loongarch64 vector acceleration

 util/bufferiszero.c   | 191 +-
 host/include/aarch64/host/bufferiszero.c.inc  |  76 +++
 host/include/generic/host/bufferiszero.c.inc  |  10 +
 host/include/i386/host/bufferiszero.c.inc | 124 
 .../loongarch64/host/bufferiszero.c.inc   | 143 +
 host/include/x86_64/host/bufferiszero.c.inc   |   1 +
 6 files changed, 355 insertions(+), 190 deletions(-)
 create mode 100644 host/include/aarch64/host/bufferiszero.c.inc
 create mode 100644 host/include/generic/host/bufferiszero.c.inc
 create mode 100644 host/include/i386/host/bufferiszero.c.inc
 create mode 100644 host/include/loongarch64/host/bufferiszero.c.inc
 create mode 100644 host/include/x86_64/host/bufferiszero.c.inc

-- 
2.34.1

[PATCH v2 1/2] util/bufferiszero: Split out host include files

2024-06-06 Thread Richard Henderson

Split out host/bufferiszero.c.inc for x86, aarch64 and generic
in order to avoid an overlong ifdef ladder.

Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
---
 util/bufferiszero.c  | 191 +--
 host/include/aarch64/host/bufferiszero.c.inc |  76 
 host/include/generic/host/bufferiszero.c.inc |  10 +
 host/include/i386/host/bufferiszero.c.inc| 124 
 host/include/x86_64/host/bufferiszero.c.inc  |   1 +
 5 files changed, 212 insertions(+), 190 deletions(-)
 create mode 100644 host/include/aarch64/host/bufferiszero.c.inc
 create mode 100644 host/include/generic/host/bufferiszero.c.inc
 create mode 100644 host/include/i386/host/bufferiszero.c.inc
 create mode 100644 host/include/x86_64/host/bufferiszero.c.inc

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 11c080e02c..522146dab9 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -81,196 +81,7 @@ static bool buffer_is_zero_int_ge256(const void *buf, size_t len)
 return t == 0;
 }
 
-#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
-#include <immintrin.h>
-
-/* Helper for preventing the compiler from reassociating
-   chains of binary vector operations.  */
-#define SSE_REASSOC_BARRIER(vec0, vec1) asm("" : "+x"(vec0), "+x"(vec1))
-
-/* Note that these vectorized functions may assume len >= 256.  */
-
-static bool __attribute__((target("sse2")))
-buffer_zero_sse2(const void *buf, size_t len)
-{
-/* Unaligned loads at head/tail.  */
-__m128i v = *(__m128i_u *)(buf);
-__m128i w = *(__m128i_u *)(buf + len - 16);
-/* Align head/tail to 16-byte boundaries.  */
-const __m128i *p = QEMU_ALIGN_PTR_DOWN(buf + 16, 16);
-const __m128i *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 16);
-__m128i zero = { 0 };
-
-/* Collect a partial block at tail end.  */
-v |= e[-1]; w |= e[-2];
-SSE_REASSOC_BARRIER(v, w);
-v |= e[-3]; w |= e[-4];
-SSE_REASSOC_BARRIER(v, w);
-v |= e[-5]; w |= e[-6];
-SSE_REASSOC_BARRIER(v, w);
-v |= e[-7]; v |= w;
-
-/*
- * Loop over complete 128-byte blocks.
- * With the head and tail removed, e - p >= 14, so the loop
- * must iterate at least once.
- */
-do {
-v = _mm_cmpeq_epi8(v, zero);
-if (unlikely(_mm_movemask_epi8(v) != 0xFFFF)) {
-return false;
-}
-v = p[0]; w = p[1];
-SSE_REASSOC_BARRIER(v, w);
-v |= p[2]; w |= p[3];
-SSE_REASSOC_BARRIER(v, w);
-v |= p[4]; w |= p[5];
-SSE_REASSOC_BARRIER(v, w);
-v |= p[6]; w |= p[7];
-SSE_REASSOC_BARRIER(v, w);
-v |= w;
-p += 8;
-} while (p < e - 7);
-
-return _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) == 0xFFFF;
-}
-
-#ifdef CONFIG_AVX2_OPT
-static bool __attribute__((target("avx2")))
-buffer_zero_avx2(const void *buf, size_t len)
-{
-/* Unaligned loads at head/tail.  */
-__m256i v = *(__m256i_u *)(buf);
-__m256i w = *(__m256i_u *)(buf + len - 32);
-/* Align head/tail to 32-byte boundaries.  */
-const __m256i *p = QEMU_ALIGN_PTR_DOWN(buf + 32, 32);
-const __m256i *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 32);
-__m256i zero = { 0 };
-
-/* Collect a partial block at tail end.  */
-v |= e[-1]; w |= e[-2];
-SSE_REASSOC_BARRIER(v, w);
-v |= e[-3]; w |= e[-4];
-SSE_REASSOC_BARRIER(v, w);
-v |= e[-5]; w |= e[-6];
-SSE_REASSOC_BARRIER(v, w);
-v |= e[-7]; v |= w;
-
-/* Loop over complete 256-byte blocks.  */
-for (; p < e - 7; p += 8) {
-/* PTEST is not profitable here.  */
-v = _mm256_cmpeq_epi8(v, zero);
-if (unlikely(_mm256_movemask_epi8(v) != 0xFFFFFFFF)) {
-return false;
-}
-v = p[0]; w = p[1];
-SSE_REASSOC_BARRIER(v, w);
-v |= p[2]; w |= p[3];
-SSE_REASSOC_BARRIER(v, w);
-v |= p[4]; w |= p[5];
-SSE_REASSOC_BARRIER(v, w);
-v |= p[6]; w |= p[7];
-SSE_REASSOC_BARRIER(v, w);
-v |= w;
-}
-
-return _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zero)) == 0xFFFFFFFF;
-}
-#endif /* CONFIG_AVX2_OPT */
-
-static biz_accel_fn const accel_table[] = {
-buffer_is_zero_int_ge256,
-buffer_zero_sse2,
-#ifdef CONFIG_AVX2_OPT
-buffer_zero_avx2,
-#endif
-};
-
-static unsigned best_accel(void)
-{
-#ifdef CONFIG_AVX2_OPT
-unsigned info = cpuinfo_init();
-
-if (info & CPUINFO_AVX2) {
-return 2;
-}
-#endif
-return 1;
-}
-
-#elif defined(__aarch64__) && defined(__ARM_NEON)
-#include <arm_neon.h>
-
-/*
- * Helper for preventing the compiler from reassociating
- * chains of binary vector operations.
- */
-#define REASSOC_BARRIER(vec0, vec1) asm("" : "+w"(vec0), "+w"(vec1))
-
-static bool buffer_is_zero_simd(const void *buf, size_t len)
-{
-uint32x4_t t0, t1, t2, t3;
-
-/* Align head/tail to 16-byte boundaries.  */
-const uint32x4_t *p = QEMU_ALIGN_PTR_DOWN(buf + 16, 16);
-const uint32x4_t *e = QEMU_ALIGN_PTR_DOWN(buf + len -

[PATCH v2 2/2] util/bufferiszero: Add loongarch64 vector acceleration

2024-06-06 Thread Richard Henderson

Use inline assembly because no release compiler allows
per-function selection of the ISA.

Signed-off-by: Richard Henderson 
---
 .../loongarch64/host/bufferiszero.c.inc   | 143 ++
 1 file changed, 143 insertions(+)
 create mode 100644 host/include/loongarch64/host/bufferiszero.c.inc

diff --git a/host/include/loongarch64/host/bufferiszero.c.inc b/host/include/loongarch64/host/bufferiszero.c.inc
new file mode 100644
index 00..69891eac80
--- /dev/null
+++ b/host/include/loongarch64/host/bufferiszero.c.inc
@@ -0,0 +1,143 @@
+/*
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ * buffer_is_zero acceleration, loongarch64 version.
+ */
+
+/*
+ * Builtins for LSX and LASX are introduced by gcc 14 and llvm 18,
+ * but as yet neither has support for attribute target, so neither
+ * is able to enable the optimization without globally enabling
+ * vector support.  Since we want runtime detection, use assembly.
+ */
+
+static bool buffer_is_zero_lsx(const void *buf, size_t len)
+{
+const void *p = QEMU_ALIGN_PTR_DOWN(buf + 16, 16);
+const void *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 16) - (7 * 16);
+const void *l = buf + len;
+bool ret;
+
+asm("vld $vr0,%2,0\n\t" /* first: buf + 0 */
+"vld $vr1,%4,-16\n\t"   /* last: buf + len - 16 */
+"vld $vr2,%3,0\n\t" /* e[0] */
+"vld $vr3,%3,16\n\t"/* e[1] */
+"vld $vr4,%3,32\n\t"/* e[2] */
+"vld $vr5,%3,48\n\t"/* e[3] */
+"vld $vr6,%3,64\n\t"/* e[4] */
+"vld $vr7,%3,80\n\t"/* e[5] */
+"vld $vr8,%3,96\n\t"/* e[6] */
+"vor.v $vr0,$vr0,$vr1\n\t"
+"vor.v $vr2,$vr2,$vr3\n\t"
+"vor.v $vr4,$vr4,$vr5\n\t"
+"vor.v $vr6,$vr6,$vr7\n\t"
+"vor.v $vr0,$vr0,$vr2\n\t"
+"vor.v $vr4,$vr4,$vr6\n\t"
+"vor.v $vr0,$vr0,$vr4\n\t"
+"vor.v $vr0,$vr0,$vr8\n\t"
+"or %0,$r0,$r0\n"   /* prepare return false */
+"1:\n\t"
+"vsetnez.v $fcc0,$vr0\n\t"
+"bcnez $fcc0,2f\n\t"
+"vld $vr0,%1,0\n\t" /* p[0] */
+"vld $vr1,%1,16\n\t"/* p[1] */
+"vld $vr2,%1,32\n\t"/* p[2] */
+"vld $vr3,%1,48\n\t"/* p[3] */
+"vld $vr4,%1,64\n\t"/* p[4] */
+"vld $vr5,%1,80\n\t"/* p[5] */
+"vld $vr6,%1,96\n\t"/* p[6] */
+"vld $vr7,%1,112\n\t"   /* p[7] */
+"addi.d %1,%1,128\n\t"
+"vor.v $vr0,$vr0,$vr1\n\t"
+"vor.v $vr2,$vr2,$vr3\n\t"
+"vor.v $vr4,$vr4,$vr5\n\t"
+"vor.v $vr6,$vr6,$vr7\n\t"
+"vor.v $vr0,$vr0,$vr2\n\t"
+"vor.v $vr4,$vr4,$vr6\n\t"
+"vor.v $vr0,$vr0,$vr4\n\t"
+"bltu %1,%3,1b\n\t"
+"vsetnez.v $fcc0,$vr0\n\t"
+"bcnez $fcc0,2f\n\t"
+"ori %0,$r0,1\n"
+"2:"
+: "=&r"(ret), "+r"(p)
+: "r"(buf), "r"(e), "r"(l)
+: "f0", "f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "fcc0");
+
+return ret;
+}
+
+static bool buffer_is_zero_lasx(const void *buf, size_t len)
+{
+const void *p = QEMU_ALIGN_PTR_DOWN(buf + 32, 32);
+const void *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 32) - (7 * 32);
+const void *l = buf + len;
+bool ret;
+
+asm("xvld $xr0,%2,0\n\t" /* first: buf + 0 */
+"xvld $xr1,%4,-32\n\t"   /* last: buf + len - 32 */
+"xvld $xr2,%3,0\n\t" /* e[0] */
+"xvld $xr3,%3,32\n\t"/* e[1] */
+"xvld $xr4,%3,64\n\t"/* e[2] */
+"xvld $xr5,%3,96\n\t"/* e[3] */
+"xvld $xr6,%3,128\n\t"   /* e[4] */
+"xvld $xr7,%3,160\n\t"   /* e[5] */
+"xvld $xr8,%3,192\n\t"   /* e[6] */
+"xvor.v $xr0,$xr0,$xr1\n\t"
+"xvor.v $xr2,$xr2,$xr3\n\t"
+"xvor.v $xr4,$xr4,$xr5\n\t"
+"xvor.v $xr6,$xr6,$xr7\n\t"
+"xvor.v $xr0,$xr0,$xr2\n\t"
+"xvor.v $xr4,$xr4,$xr6\n\t"
+"xvor.v $xr0,$xr0,$xr4\n\t"
+"xvor.v $xr0,$xr0,$xr8\n\t"
+"or %0,$r0,$r0\n\t"  /* prepare return false */
+"bgeu %1,%3,2f\n"
+"1:\n\t"
+"xvsetnez.v $fcc0,$xr0\n\t"
+"bcnez $fcc0,3f\n\t"
+"xvld $xr0,%1,0\n\t" /* p[0] */
+"xvld $xr1,%1,32\n\t"/* p[1] */
+"xvld $xr2,%1,64\n\t"/* p[2] */
+"xvld $xr3,%1,96\n\t"/* p[3] */
+"xvld $xr4,%1,128\n\t"   /* p[4] */
+"xvld $xr5,%1,160\n\t"   /* p[5] */
+"xvld $xr6,%1,192\n\t"   /* p[6] */
+"xvld $xr7,%1,224\n\t"   /* p[7] */
+"addi.d %1,%1,256\n\t"
+"xvor.v $xr0,$xr0,$xr1\n\t"
+"xvor.v $xr2,$xr2,$xr3\n\t"
+"xvor.v $xr4,$xr4,$xr5\n\t"
+"xvor.v $xr6,$xr6,$xr7\n\t"
+"xvor.v $xr0,$xr0,$xr2\n\t"
+"xvor.v $xr4,$xr

Re: [PULL 0/6] loongarch-to-apply queue

2024-06-06 Thread Richard Henderson


On 6/5/24 21:01, Song Gao wrote:

The following changes since commit db2feb2df8d19592c9859efb3f682404e0052957:

   Merge tag 'pull-misc-20240605' of https://gitlab.com/rth7680/qemu into staging (2024-06-05 14:17:01 -0700)

are available in the Git repository at:

   https://gitlab.com/gaosong/qemu.git  tags/pull-loongarch-20240606

for you to fetch changes up to 78f932ea1f7b3b9b0ac628dc2a91281318fe51fa:

   target/loongarch: fix a wrong print in cpu dump (2024-06-06 11:58:06 +0800)


pull-loongarch-20240606


Applied, thanks.  Please update https://wiki.qemu.org/ChangeLog/9.1 as 
appropriate.


r~

Re: [PATCH RFC] hw/arm/virt: Avoid unexpected warning from Linux guest on host with Fujitsu CPUs

2024-06-06 Thread Robin Murphy


On 2024-06-06 6:13 pm, Jonathan Cameron wrote:

On Thu, 6 Jun 2024 12:56:59 +0100
Peter Maydell  wrote:


On Thu, 6 Jun 2024 at 11:48, Zhenyu Zhang  wrote:


Multiple warning messages and corresponding backtraces are observed when Linux
guest is booted on the host with Fujitsu CPUs. One of them is shown as below.

[0.032443] [ cut here ]
[0.032446] uart-pl011 900.pl011: ARCH_DMA_MINALIGN smaller than CTR_EL0.CWG (128 < 256)
[0.032454] WARNING: CPU: 0 PID: 1 at arch/arm64/mm/dma-mapping.c:54 arch_setup_dma_ops+0xbc/0xcc
[0.032470] Modules linked in:
[0.032475] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-452.el9.aarch64 #1
[0.032481] Hardware name: linux,dummy-virt (DT)
[0.032484] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[0.032490] pc : arch_setup_dma_ops+0xbc/0xcc
[0.032496] lr : arch_setup_dma_ops+0xbc/0xcc
[0.032501] sp : 80008003b860
[0.032503] x29: 80008003b860 x28:  x27: aae4b949049c
[0.032510] x26:  x25:  x24: 
[0.032517] x23: 0100 x22:  x21: 
[0.032523] x20: 0001 x19: 2f06c02ea400 x18: 
[0.032529] x17: 208a5f76 x16: 6589dbcb x15: aae4ba071c89
[0.032535] x14:  x13: aae4ba071c84 x12: 455f525443206e61
[0.032541] x11: 68742072656c6c61 x10: 0029 x9 : aae4b7d21da4
[0.032547] x8 : 0029 x7 : 4c414e494d5f414d x6 : 0029
[0.032553] x5 : 000f x4 : aae4b9617a00 x3 : 0001
[0.032558] x2 :  x1 :  x0 : 2f06c029be40
[0.032564] Call trace:
[0.032566]  arch_setup_dma_ops+0xbc/0xcc
[0.032572]  of_dma_configure_id+0x138/0x300
[0.032591]  amba_dma_configure+0x34/0xc0
[0.032600]  really_probe+0x78/0x3dc
[0.032614]  __driver_probe_device+0x108/0x160
[0.032619]  driver_probe_device+0x44/0x114
[0.032624]  __device_attach_driver+0xb8/0x14c
[0.032629]  bus_for_each_drv+0x88/0xe4
[0.032634]  __device_attach+0xb0/0x1e0
[0.032638]  device_initial_probe+0x18/0x20
[0.032643]  bus_probe_device+0xa8/0xb0
[0.032648]  device_add+0x4b4/0x6c0
[0.032652]  amba_device_try_add.part.0+0x48/0x360
[0.032657]  amba_device_add+0x104/0x144
[0.032662]  of_amba_device_create.isra.0+0x100/0x1c4
[0.032666]  of_platform_bus_create+0x294/0x35c
[0.032669]  of_platform_populate+0x5c/0x150
[0.032672]  of_platform_default_populate_init+0xd0/0xec
[0.032697]  do_one_initcall+0x4c/0x2e0
[0.032701]  do_initcalls+0x100/0x13c
[0.032707]  kernel_init_freeable+0x1c8/0x21c
[0.032712]  kernel_init+0x28/0x140
[0.032731]  ret_from_fork+0x10/0x20
[0.032735] ---[ end trace  ]---

In Linux, a check is applied to every device which is exposed through a
device-tree node. The warning message is raised when the device isn't DMA
coherent and the cache line size is larger than ARCH_DMA_MINALIGN (128 bytes).
The cache line size is taken from CTR_EL0[CWG], which corresponds to 256 bytes
on the guest CPUs.
The DMA coherent capability is claimed through 'dma-coherent' in their
device-tree nodes.
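
As a rough illustration of the check described above, here is a minimal sketch of how the 256-byte figure falls out of CTR_EL0. It relies on the architectural encoding of CWG (bits [27:24], log2 of the granule in 4-byte words); the function names and the fallback for a CWG of zero are assumptions made for this example, not the kernel's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed for illustration: the arm64 value quoted in the warning (128 bytes). */
#define ARCH_DMA_MINALIGN 128u

/*
 * CTR_EL0.CWG sits in bits [27:24] and encodes log2 of the cache writeback
 * granule in 4-byte words, so the granule in bytes is 4 << CWG.
 */
static unsigned cwg_bytes(uint64_t ctr_el0)
{
    unsigned cwg = (unsigned)((ctr_el0 >> 24) & 0xf);

    /*
     * CWG == 0 means the register gives no granule information; falling
     * back to ARCH_DMA_MINALIGN here is a choice made for this sketch.
     */
    return cwg ? 4u << cwg : ARCH_DMA_MINALIGN;
}

/* Hypothetical predicate for the condition described in the report. */
static bool dma_minalign_warning(bool dma_coherent, uint64_t ctr_el0)
{
    return !dma_coherent && cwg_bytes(ctr_el0) > ARCH_DMA_MINALIGN;
}
```

With CWG = 6 the granule is 4 << 6 = 256 bytes, which matches the "128 < 256" in the log; marking the device as coherent makes the predicate false regardless of CWG.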


In QEMU all our emulated DMA is always coherent, so where we
have DMA-capable devices we should definitely tell the kernel
that that DMA is coherent.


The trick for that is to put the "dma-coherent" property right in the 
root of the DT so it plausibly communicates "the whole platform is 
coherent", and is then inherited by all devices, even those which 
shouldn't technically need it.



Our pl011 does not do DMA, though (we do not set the dmas property), so
it's kind of bogus for the kernel to complain about that.


The issue there is, per the history Jonathan dug up, DT on Arm got the 
assumption baked into it from day one that "dma-ranges" was implied for 
simple-bus and similar, and thus there is no easy generic way to 
indicate that any MMIO device *can't* do DMA. For Linux this means we 
end up having to assume that everything *might* be DMA-capable, since 
the only thing which knows for sure is a driver, but we have further 
legacy in the driver model which means we have to perform the basic 
DMA setup for any device *before* its driver probes. Yes, it's a bit 
rubbish; feel free to shake your fist at the past.


(At least we learned and got it right in ACPI for arm64 by making the 
_CCA method mandatory for DMA-capable devices...)



So I think we should take these changes where they refer to DMA
capable devices and ask the kernel folks to fix the warnings
where they refer to devices that aren't doing DMA. Looking through
the patch, though, my initial impression is that all these are
in the latter category...


I was curious and have a very slow test running, so took a look.
of_dma_configure() is being passed force_dma = true.
https://elixi

Re: [PATCH 2/3] plugins: Free CPUPluginState before destroying vCPU state

2024-06-06 Thread Pierrick Bouvier


On 6/6/24 05:40, Philippe Mathieu-Daudé wrote:

cpu::plugin_state is allocated in cpu_common_initfn() when
the vCPU state is created. Release it in cpu_common_finalize()
when we are done.

Signed-off-by: Philippe Mathieu-Daudé 
---
  include/qemu/plugin.h | 3 +++
  hw/core/cpu-common.c  | 5 +
  2 files changed, 8 insertions(+)

diff --git a/include/qemu/plugin.h b/include/qemu/plugin.h
index bc5aef979e..af5f9db469 100644
--- a/include/qemu/plugin.h
+++ b/include/qemu/plugin.h
@@ -149,6 +149,9 @@ struct CPUPluginState {
  
  /**

   * qemu_plugin_create_vcpu_state: allocate plugin state
+ *
+ * The returned data must be released with g_free()
+ * when no longer required.
   */
  CPUPluginState *qemu_plugin_create_vcpu_state(void);
  
diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c

index bf1a7b8892..cd15402552 100644
--- a/hw/core/cpu-common.c
+++ b/hw/core/cpu-common.c
@@ -283,6 +283,11 @@ static void cpu_common_finalize(Object *obj)
  {
  CPUState *cpu = CPU(obj);
  
+#ifdef CONFIG_PLUGIN

+if (tcg_enabled()) {
+g_free(cpu->plugin_state);
+}
+#endif
  g_array_free(cpu->gdb_regs, TRUE);
  qemu_lockcnt_destroy(&cpu->in_ioctl_lock);
  qemu_mutex_destroy(&cpu->work_mutex);


To ensure I get it right, order of cpu init/deinit is:
- init
- realize
- unrealize
- finalize
Is that correct?

Reviewed-by: Pierrick Bouvier
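
For reference, a small sketch of the pairing the patch relies on, assuming the order listed above (init, realize, unrealize, finalize) is the usual QOM sequence. DemoCPU and the demo_* functions are hypothetical stand-ins, not QEMU structures or APIs; the point is only that state allocated in the init step is released in the finalize step, whether or not the object was ever realized.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for CPUState; not QEMU's real structure. */
typedef struct DemoCPU {
    void *plugin_state;   /* allocated in init, freed in finalize */
    int realized;
    char trace[64];       /* records the order of lifecycle calls */
} DemoCPU;

static void demo_init(DemoCPU *cpu)
{
    memset(cpu, 0, sizeof(*cpu));
    cpu->plugin_state = calloc(1, 16);
    strcat(cpu->trace, "init ");
}

static void demo_realize(DemoCPU *cpu)
{
    cpu->realized = 1;
    strcat(cpu->trace, "realize ");
}

static void demo_unrealize(DemoCPU *cpu)
{
    cpu->realized = 0;
    strcat(cpu->trace, "unrealize ");
}

static void demo_finalize(DemoCPU *cpu)
{
    /* Finalize runs last, so nothing may still be using the state. */
    assert(!cpu->realized);
    free(cpu->plugin_state);
    cpu->plugin_state = NULL;
    strcat(cpu->trace, "finalize");
}
```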

Re: [PATCH 1/3] plugins: Ensure vCPU index is assigned in init/exit hooks

2024-06-06 Thread Pierrick Bouvier


On 6/6/24 05:40, Philippe Mathieu-Daudé wrote:

Since vCPUs are hashed by their index, this index can't
be uninitialized (UNASSIGNED_CPU_INDEX).

Signed-off-by: Philippe Mathieu-Daudé 
---
  plugins/core.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/plugins/core.c b/plugins/core.c
index badede28cf..d339b3db4d 100644
--- a/plugins/core.c
+++ b/plugins/core.c
@@ -245,6 +245,7 @@ void qemu_plugin_vcpu_init_hook(CPUState *cpu)
  {
  bool success;
  
+assert(cpu->cpu_index != UNASSIGNED_CPU_INDEX);

  qemu_rec_mutex_lock(&plugin.lock);
  plugin.num_vcpus = MAX(plugin.num_vcpus, cpu->cpu_index + 1);
  plugin_cpu_update__locked(&cpu->cpu_index, NULL, NULL);
@@ -263,6 +264,7 @@ void qemu_plugin_vcpu_exit_hook(CPUState *cpu)
  
  plugin_vcpu_cb__simple(cpu, QEMU_PLUGIN_EV_VCPU_EXIT);
  
+assert(cpu->cpu_index != UNASSIGNED_CPU_INDEX);

  qemu_rec_mutex_lock(&plugin.lock);
  success = g_hash_table_remove(plugin.cpu_ht, &cpu->cpu_index);
  g_assert(success);


Reviewed-by: Pierrick Bouvier

Re: [PATCH v3 03/13] hw/riscv: add RISC-V IOMMU base emulation

2024-06-06 Thread Daniel Henrique Barboza





On 5/29/24 10:39 PM, Eric Cheng wrote:

On 5/24/2024 1:39 AM, Daniel Henrique Barboza wrote:

From: Tomasz Jeznach 

The RISC-V IOMMU specification is now ratified as per the RISC-V
international process. The latest frozen specification can be found
at:

https://github.com/riscv-non-isa/riscv-iommu/releases/download/v1.0/riscv-iommu.pdf

Add the foundation of the device emulation for RISC-V IOMMU, which
includes an IOMMU that has no capabilities but MSI interrupt support and
fault queue interfaces. We'll add add more features incrementally in the

   ^^^  ^^^
repeated 'add'


Fixed.




next patches.

Co-developed-by: Sebastien Boeuf 
Signed-off-by: Sebastien Boeuf 
Signed-off-by: Tomasz Jeznach 
Signed-off-by: Daniel Henrique Barboza 
---
  hw/riscv/Kconfig |    4 +
  hw/riscv/meson.build |    1 +
  hw/riscv/riscv-iommu.c   | 1602 ++
  hw/riscv/riscv-iommu.h   |  141 
  hw/riscv/trace-events    |   11 +
  hw/riscv/trace.h |    1 +
  include/hw/riscv/iommu.h |   36 +
  meson.build  |    1 +
  8 files changed, 1797 insertions(+)
  create mode 100644 hw/riscv/riscv-iommu.c
  create mode 100644 hw/riscv/riscv-iommu.h
  create mode 100644 hw/riscv/trace-events
  create mode 100644 hw/riscv/trace.h
  create mode 100644 include/hw/riscv/iommu.h

diff --git a/hw/riscv/Kconfig b/hw/riscv/Kconfig
index a2030e3a6f..f69d6e3c8e 100644
--- a/hw/riscv/Kconfig
+++ b/hw/riscv/Kconfig
@@ -1,3 +1,6 @@
+config RISCV_IOMMU
+    bool
+
  config RISCV_NUMA
  bool
@@ -47,6 +50,7 @@ config RISCV_VIRT
  select SERIAL
  select RISCV_ACLINT
  select RISCV_APLIC
+    select RISCV_IOMMU
  select RISCV_IMSIC
  select SIFIVE_PLIC
  select SIFIVE_TEST
diff --git a/hw/riscv/meson.build b/hw/riscv/meson.build
index f872674093..cbc99c6e8e 100644
--- a/hw/riscv/meson.build
+++ b/hw/riscv/meson.build
@@ -10,5 +10,6 @@ riscv_ss.add(when: 'CONFIG_SIFIVE_U', if_true: files('sifive_u.c'))
  riscv_ss.add(when: 'CONFIG_SPIKE', if_true: files('spike.c'))
  riscv_ss.add(when: 'CONFIG_MICROCHIP_PFSOC', if_true: files('microchip_pfsoc.c'))
  riscv_ss.add(when: 'CONFIG_ACPI', if_true: files('virt-acpi-build.c'))
+riscv_ss.add(when: 'CONFIG_RISCV_IOMMU', if_true: files('riscv-iommu.c'))
  hw_arch += {'riscv': riscv_ss}
diff --git a/hw/riscv/riscv-iommu.c b/hw/riscv/riscv-iommu.c
new file mode 100644
index 00..39b4ff1405
--- /dev/null
+++ b/hw/riscv/riscv-iommu.c
@@ -0,0 +1,1602 @@
+/*
+ * QEMU emulation of an RISC-V IOMMU
+ *
+ * Copyright (C) 2021-2023, Rivos Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see .
+ */
+
+#include "qemu/osdep.h"
+#include "qom/object.h"
+#include "hw/pci/pci_bus.h"
+#include "hw/pci/pci_device.h"
+#include "hw/qdev-properties.h"
+#include "hw/riscv/riscv_hart.h"
+#include "migration/vmstate.h"
+#include "qapi/error.h"
+#include "qemu/timer.h"
+
+#include "cpu_bits.h"
+#include "riscv-iommu.h"
+#include "riscv-iommu-bits.h"
+#include "trace.h"
+
+#define LIMIT_CACHE_CTX   (1U << 7)
+#define LIMIT_CACHE_IOT   (1U << 20)
+
+/* Physical page number conversions */
+#define PPN_PHYS(ppn) ((ppn) << TARGET_PAGE_BITS)
+#define PPN_DOWN(phy) ((phy) >> TARGET_PAGE_BITS)
+
+typedef struct RISCVIOMMUContext RISCVIOMMUContext;
+typedef struct RISCVIOMMUEntry RISCVIOMMUEntry;
+
+/* Device assigned I/O address space */
+struct RISCVIOMMUSpace {
+    IOMMUMemoryRegion iova_mr;  /* IOVA memory region for attached device */
+    AddressSpace iova_as;   /* IOVA address space for attached device */
+    RISCVIOMMUState *iommu; /* Managing IOMMU device state */
+    uint32_t devid; /* Requester identifier, AKA device_id */
+    bool notifier;  /* IOMMU unmap notifier enabled */
+    QLIST_ENTRY(RISCVIOMMUSpace) list;
+};
+
+/* Device translation context state. */
+struct RISCVIOMMUContext {
+    uint64_t devid:24;  /* Requester Id, AKA device_id */
+    uint64_t pasid:20;  /* Process Address Space ID */
+    uint64_t __rfu:20;  /* reserved */
+    uint64_t tc;    /* Translation Control */
+    uint64_t ta;    /* Translation Attributes */
+    uint64_t msi_addr_mask; /* MSI filtering - address mask */
+    uint64_t msi_addr_pattern;  /* MSI filtering - address pattern */
+    uint64_t msiptp;    /* MSI r

Re: [PATCH 0/5] s390x: Add Full Boot Order Support

2024-06-06 Thread Jared Rossi





On 6/5/24 4:02 AM, Thomas Huth wrote:

On 29/05/2024 17.43, jro...@linux.ibm.com wrote:

From: Jared Rossi 

This patch set primarily adds support for the specification of multiple boot
devices, allowing for the guest to automatically use an alternative device on
a failed boot without needing to be reconfigured. It additionally provides the
ability to define the loadparm attribute on a per-device basis, which allows
boot devices to use different loadparm values if needed.

In brief, an IPLB is generated for each designated boot device (up to a maximum
of 8) and stored in guest memory immediately before BIOS. If a device fails to
boot, the next IPLB is retrieved and we jump back to the start of BIOS.

Devices can be specified using the standard qemu device tag "bootindex" as with
other architectures. Lower number indices are tried first, with "bootindex=0"
indicating the first device to try.
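
The fallback behaviour described above can be modelled roughly as follows. This is only a sketch: BootDev, its bootable flag and try_boot_in_order() are hypothetical names for illustration and say nothing about how the IPLBs are actually laid out or how the BIOS retries; the limit of 8 is the one quoted in the cover letter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_BOOT_DEVICES 8   /* the maximum quoted in the cover letter */

/* Hypothetical model of one designated boot device. */
typedef struct BootDev {
    int bootindex;    /* lower values are tried first */
    bool bootable;    /* stand-in for "IPL from this device succeeds" */
} BootDev;

static int cmp_bootindex(const void *a, const void *b)
{
    const BootDev *x = a, *y = b;

    return (x->bootindex > y->bootindex) - (x->bootindex < y->bootindex);
}

/* Returns the bootindex of the device that booted, or -1 if none did. */
static int try_boot_in_order(BootDev *devs, size_t n)
{
    /* Order candidates by ascending bootindex, then cap the list. */
    qsort(devs, n, sizeof(*devs), cmp_bootindex);
    if (n > MAX_BOOT_DEVICES) {
        n = MAX_BOOT_DEVICES;
    }
    for (size_t i = 0; i < n; i++) {
        if (devs[i].bootable) {
            return devs[i].bootindex;
        }
        /* Otherwise fall through to the next candidate. */
    }
    return -1;
}
```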


Is this supposed to work with multiple scsi-hd devices, too? I tried to boot a 
guest with two scsi disks (attached to a single virtio-scsi-ccw 
adapter) where only the second disk had a bootable installation, but 
that failed...?


 Thomas




Hi Thomas,

Yes, I would expect that to work. I tried to reproduce this using a 
non-bootable scsi disk as the first boot device and then a known-good 
bootable scsi disk as the second boot device, with one controller.  In 
my instance the BIOS was not able to identify the first disk as bootable 
and so that device failed to IPL, but it did move on to the next disk 
after that, and the guest successfully IPL'd from the second device.


When you say it failed, do you mean the first disk failed to boot (as 
expected), but then the guest died without attempting to boot from the 
second disk?  Or did something else happen? I am either not 
understanding your configuration or I am not understanding your error.


Regards,

Jared Rossi

Re: [PULL 00/12] testing cleanups (ci, vm, lcitool, ansible)

2024-06-06 Thread Richard Henderson


On 6/6/24 04:50, Alex Bennée wrote:

The following changes since commit db2feb2df8d19592c9859efb3f682404e0052957:

   Merge tag 'pull-misc-20240605' of https://gitlab.com/rth7680/qemu into staging (2024-06-05 14:17:01 -0700)

are available in the Git repository at:

   https://gitlab.com/stsquad/qemu.git  tags/pull-maintainer-june24-060624-1

for you to fetch changes up to c99064d03fc574254ab098562798c937a4761161:

   scripts/ci: drive ubuntu/build-environment.yml from lcitool (2024-06-06 10:26:22 +0100)


testing cleanups (ci, vm, lcitool, ansible):

   - clean up leftover CentOS 8 references
   - use -fno-sanitize=function to avoid non-useful errors
   - bump lcitool and update images (alpine, fedora)
   - make sure we have mingw-w64-tools for windows builds
   - drive ansible scripts with lcitool package lists


Applied, thanks.  Please update https://wiki.qemu.org/ChangeLog/9.1 as 
appropriate.


r~

Re: [PATCH 0/6] target/riscv: Add support for Control Transfer Records Ext.

2024-06-06 Thread Beeman Strong

PR for the spec fix, in case anyone is interested.  Found a couple of other
references to Smcsrind that I also removed.

https://github.com/riscv/riscv-control-transfer-records/pull/29
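
For what it's worth, the dependency rule as I read the thread below (both Smctr and Ssctr need S-mode and Sscsrind, neither needs Smcsrind) could be checked along these lines. DemoCfg and demo_validate_ctr() are hypothetical names for illustration only; this is not the code in riscv_cpu_validate_set_extensions(), and the reading of the spec is an assumption until the PR above lands.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical config flags; not QEMU's RISCVCPUConfig. */
typedef struct DemoCfg {
    bool has_s_mode;
    bool ext_sscsrind;
    bool ext_smctr;
    bool ext_ssctr;
} DemoCfg;

/* Returns NULL if the configuration is consistent, else an error message. */
static const char *demo_validate_ctr(const DemoCfg *cfg)
{
    if (!cfg->ext_smctr && !cfg->ext_ssctr) {
        return NULL;    /* CTR not requested, nothing to check */
    }
    if (!cfg->has_s_mode) {
        return "Smctr and Ssctr require S-mode";
    }
    if (!cfg->ext_sscsrind) {
        return "Smctr and Ssctr require Sscsrind";
    }
    return NULL;
}
```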


On Tue, Jun 4, 2024 at 8:41 PM Beeman Strong  wrote:

> Ah, good catch.  We removed that dependency late.  I'll fix it.
>
> On Tue, Jun 4, 2024 at 8:34 PM Jason Chien  wrote:
>
>> Thank you for pointing that out. CTR does not use miselect and mireg*.
>> There is no dependency on Smcsrind. I believe the spec needs to be
>> corrected.
>>
>> Chapter 1 states that:
>> Smctr depends on the Smcsrind extension, while Ssctr depends on the
>> Sscsrind extension. Further, both Smctr and Ssctr depend upon
>> implementation of S-mode.
>> Beeman Strong wrote on 2024/6/5 at 06:32 AM:
>>
>> There is no dependency on Smcsrind, only Sscsrind.
>>
>> On Tue, Jun 4, 2024 at 12:29 AM Jason Chien 
>> wrote:
>>
>>> Smctr depends on the Smcsrind extension, Ssctr depends on the Sscsrind
>>> extension, and both Smctr and Ssctr depend upon implementation of S-mode.
>>> There should be a dependency check in
>>> riscv_cpu_validate_set_extensions().
>>>
>>> Rajnesh Kanwal wrote on 2024/5/30 at 12:09 AM:
>>> > This series enables Control Transfer Records extension support on riscv
>>> > platform. This extension is similar to Arch LBR in x86 and BRBE in ARM.
>>> > The extension has been stable and the latest release can be found here [0].
>>> >
>>> > The CTR extension depends on a couple of other extensions:
>>> >
>>> > 1. S[m|s]csrind: The indirect CSR extension [1], which defines additional
>>> > ([M|S|VS]IREG2-[M|S|VS]IREG6) registers to address the size limitation of
>>> > the RISC-V CSR address space. CTR accesses the ctrsource, ctrtarget and
>>> > ctrdata CSRs using the sscsrind extension.
>>> >
>>> > 2. Smstateen: The mstateen bit[54] controls S-mode access to the CTR
>>> > extension.
>>> >
>>> > 3. Sscofpmf: Counter overflow and privilege mode filtering. [2]
>>> >
>>> > The series is based on Smcdeleg/Ssccfg counter delegation extension [3]
>>> > patches. CTR itself doesn't depend on counter delegation support. This
>>> > rebase is basically to include the Smcsrind patches.
>>> >
>>> > Due to the dependency on these extensions, the following extensions must be
>>> > enabled to use the control transfer records feature.
>>> >
>>> > "smstateen=true,sscofpmf=true,smcsrind=true,sscsrind=true,smctr=true,ssctr=true"
>>> >
>>> > Here is the link to a quick guide [5] to set up and run a basic perf demo on
>>> > Linux to use the CTR extension.
>>> >
>>> > The Qemu patches can be found here:
>>> > https://github.com/rajnesh-kanwal/qemu/tree/ctr_upstream
>>> >
>>> > The opensbi patch can be found here:
>>> > https://github.com/rajnesh-kanwal/opensbi/tree/ctr_upstream
>>> >
>>> > The Linux kernel patches can be found here:
>>> > https://github.com/rajnesh-kanwal/linux/tree/ctr_upstream
>>> >
>>> > [0]:https://github.com/riscv/riscv-control-transfer-records/release
>>> > [1]:https://github.com/riscv/riscv-indirect-csr-access
>>> > [2]:https://github.com/riscvarchive/riscv-count-overflow/tree/main
>>> > [3]:https://github.com/riscv/riscv-smcdeleg-ssccfg
>>> > [4]: https://lore.kernel.org/all/20240217000134.3634191-1-ati...@rivosinc.com/
>>> > [5]: https://github.com/rajnesh-kanwal/linux/wiki/Running-CTR-basic-demo-on-QEMU-RISC%E2%80%90V-Virt-machine
>>> >
>>> > Rajnesh Kanwal (6):
>>> >   target/riscv: Remove obsolete sfence.vm instruction
>>> >   target/riscv: Add Control Transfer Records CSR definitions.
>>> >   target/riscv: Add support for Control Transfer Records extension CSRs.
>>> >   target/riscv: Add support to record CTR entries.
>>> >   target/riscv: Add CTR sctrclr instruction.
>>> >   target/riscv: Add support to access ctrsource, ctrtarget, ctrdata regs.
>>> >
>>> >   target/riscv/cpu.c|   4 +
>>> >   target/riscv/cpu.h|  14 +
>>> >   target/riscv/cpu_bits.h   | 154 +
>>> >   target/riscv/cpu_cfg.h|   2 +
>>> >   target/riscv/cpu_helper.c | 213 
>>> >   target/riscv/csr.c| 312
>>> +-
>>> >   target/riscv/helper.h |   8 +-
>>> >   target/riscv/insn32.decode|   2 +-
>>> >   .../riscv/insn_trans/trans_privileged.c.inc   |  21 +-
>>> >   target/riscv/insn_trans/trans_rvi.c.inc   |  27 ++
>>> >   target/riscv/op_helper.c  | 117 ++-
>>> >   target/riscv/translate.c  |   9 +
>>> >   12 files changed, 870 insertions(+), 13 deletions(-)
>>> >
>>>
>>

Re: [PATCH qemu ] hw/acpi: Fix big endian host creation of Generic Port Affinity Structures

2024-06-06 Thread Jonathan Cameron via

On Thu, 6 Jun 2024 16:06:53 +0200
Igor Mammedov  wrote:

> On Wed, 5 Jun 2024 19:04:55 +0100
> Jonathan Cameron  wrote:
> 
> > Treating the HID as an integer caused it to get bit reversed
> > on big endian hosts running little endian guests.  Treat it
> > as a character array instead.
> > 
> > Fixes hw/acpi: Generic Port Affinity Structure Support
> > Tested-by: Richard Henderson 
> > Signed-off-by: Jonathan Cameron 
> > 
> > ---
> > Richard ran the version posted in the thread on an s390 instance.
> > Thanks for the help!
> > 
> > Difference from version in thread:
> > - Instantiate i in the for loop.
> > 
> > Sending out now so Michael can decide whether to fold this in, or
> > drop the GP series for now from his pull request (in which case
> > I'll do an updated version with this and Markus' docs feedback
> > folded in.)
> > 
> > ---
> >  include/hw/acpi/acpi_generic_initiator.h | 2 +-
> >  hw/acpi/acpi_generic_initiator.c | 4 +++-
> >  2 files changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/hw/acpi/acpi_generic_initiator.h b/include/hw/acpi/acpi_generic_initiator.h
> > index 1a899af30f..5baefda33a 100644
> > --- a/include/hw/acpi/acpi_generic_initiator.h
> > +++ b/include/hw/acpi/acpi_generic_initiator.h
> > @@ -61,7 +61,7 @@ typedef struct PCIDeviceHandle {
> >  uint16_t bdf;
> >  };
> >  struct {
> > -uint64_t hid;
> > +char hid[8];
> >  uint32_t uid;
> >  };
> >  };  
> 
> not sure on top of what this patch applies but I have some generic comments 
> wrt it

https://lore.kernel.org/qemu-devel/20240524100507.32106-1-jonathan.came...@huawei.com/

Comments are all on elements of the existing upstream code, but I'm touching it
anyway so will look at making the improvements you suggest as new precursors
to v3 given we are going around again anyway.

> 
> why PCIDeviceHandle is in header file? is there plan for it
> being used outside of acpi_generic_initiator.c?

I'll add a precursor patch to my series that moves
it and anything else that should be more local.  May well move
to being local in aml_build.c given your later comments with the
various fields passed in as parameters.

> 
> 
> > diff --git a/hw/acpi/acpi_generic_initiator.c b/hw/acpi/acpi_generic_initiator.c
> > index 78b80dcf08..f064753b67 100644
> > --- a/hw/acpi/acpi_generic_initiator.c
> > +++ b/hw/acpi/acpi_generic_initiator.c
> > @@ -151,7 +151,9 @@ build_srat_generic_node_affinity(GArray *table_data, int node,
> >  build_append_int_noprefix(table_data, 0, 12);
> >  } else {
> >  /* Device Handle - ACPI */
> > -build_append_int_noprefix(table_data, handle->hid, 8);
> > +for (int i = 0; i < sizeof(handle->hid); i++) {
> > +build_append_int_noprefix(table_data, handle->hid[i], 1);
> > +}
> >  build_append_int_noprefix(table_data, handle->uid, 4);
> >  build_append_int_noprefix(table_data, 0, 4);  
> 
> instead of open coding the structure
> 
> it might be better to introduce a helper in aml_build.c,
> something like 
>   /* proper reference to spec as we do for other ACPI primitives */
>   build_append_srat_acpi_device_handle(GArray *table_data, char *hid, uint32_t uid)
>   assert(strlen(hid) ...
>   for() {
>     build_append_byte()
>   }  
>   ...
> 
> the same applies to "Device Handle - PCI" structure

I'll look at moving that stuff and the affinity structure creation
code themselves in there. I think they ended up in this file because
of the other infrastructure needed to create these nodes and it
will have felt natural to keep this together.

Putting it in aml_build.c will put it with similar code though
which makes sense to me.
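
To make the direction concrete, here is a rough standalone sketch of what such a helper boils down to, writing into a plain byte buffer instead of a GArray. demo_append_acpi_device_handle() is a hypothetical name and not the eventual aml_build.c function; the layout (8 HID bytes, 4-byte UID, 4 reserved bytes) follows the code in the patch above. Emitting the HID one character at a time is what makes the output independent of host endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_HID_LEN 8

/*
 * Hypothetical helper: emit an ACPI device handle as 8 HID characters,
 * a 4-byte little-endian UID and 4 reserved bytes.  Returns the number
 * of bytes written (16).
 */
static size_t demo_append_acpi_device_handle(uint8_t *out, const char *hid,
                                             uint32_t uid)
{
    size_t n = 0;

    /* Copy the HID byte by byte, so no integer byte order is involved. */
    for (size_t i = 0; i < DEMO_HID_LEN; i++) {
        out[n++] = (uint8_t)hid[i];
    }
    /* The UID is an integer field; write it explicitly as little endian. */
    for (int i = 0; i < 4; i++) {
        out[n++] = (uint8_t)(uid >> (8 * i));
    }
    /* Reserved. */
    for (int i = 0; i < 4; i++) {
        out[n++] = 0;
    }
    return n;
}
```

The shifts on the UID give the same bytes on any host, which is the property the original uint64_t HID lacked.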

> 
> Also get rid of PCI deps in acpi_generic_initiator.c 
> move build_all_acpi_generic_initiators/build_srat_generic_pci_initiator into
> hw/acpi/pci.c

Today it's used only for PCI devices, but that's partly an artifact
of how we get to the root complex via the bus below it.

Spec wise, it's just as applicable to platform devices etc, but maybe
we can move it to pci.c for now and move it out again if it gains other
users. Or leave it in acpi_generic_initiator.c but have all the aml
stuff in aml_build.c as you suggest. 

> file if it has to access PCI code/structures directly
> (which I'm not convinced it should, can we get/expose what it needs as QOM 
> properties?)

Maybe. I'll see what I can come up with.  This feels involved
however so I'm more doubtful about this as a precursor.

> 
> btw:
> build_all_acpi_generic_initiators() name doesn't match what it's doing.
> it composes only one initiator entry.

I'll look at tidying up all the relevant naming.

Jonathan

> 
> >  }  
> 
>

Re: [PATCH] target/sparc: use signed denominator in sdiv helper

2024-06-06 Thread Mark Cave-Ayland


On 06/06/2024 15:43, Clément Chigot wrote:


The division has to be done with the signed denominator (b32) instead of
the unsigned value passed in argument (b).

Fixes: 1326010322d6 ("target/sparc: Remove CC_OP_DIV")
Signed-off-by: Clément Chigot 
---
  target/sparc/helper.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/sparc/helper.c b/target/sparc/helper.c
index 2247e243b5..7846ddd6f6 100644
--- a/target/sparc/helper.c
+++ b/target/sparc/helper.c
@@ -121,7 +121,7 @@ uint64_t helper_sdiv(CPUSPARCState *env, target_ulong a, 
target_ulong b)
  return (uint32_t)(b32 < 0 ? INT32_MAX : INT32_MIN) | (-1ull << 32);
  }
  
-a64 /= b;

+a64 /= b32;
  r = a64;
  if (unlikely(r != a64)) {
  return (uint32_t)(a64 < 0 ? INT32_MIN : INT32_MAX) | (-1ull << 32);


Thanks for the patch! I think this might also be:

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2319


ATB,

Mark.

Re: Historical QMP schema

2024-06-06 Thread John Snow

Adding upstream because I think there's little reason to keep this
discussion off-list.

The context of this mailing thread is that we are discussing how we might
want to store and record historical QMP API interface information such that
we can programmatically answer questions like "when was X command
introduced?", but possibly also to generate changelog QMP reports for
auditing QMP changes for each version during the release candidate window.

On Thu, Jun 6, 2024 at 6:25 AM Victor Toso  wrote:

> Hi John!
>
> On Wed, Jun 05, 2024 at 11:47:53AM GMT, John Snow wrote:
> > Hi,
> >
> > as part of my project to modernize QMP documentation I recently
> > forked the QAPI generator and modded it to be able to read
> > historical QAPI schema going all the way back to v1.0.
> >
> > We want to be able to use this ability to generate accurate
> > "since:" data for the docs.
>
> :)
>
> > My question for you is: do you have any input or preferences
> > for the format of "output" or "compiled" schemata?
>
> Not sure if I understand 'output' & 'compiled' reference here.
>

The idea is that even though I can modify the schema parser to cope with
historical versions of the schema, we should re-save the schema into a
unified format so that the gross hacks and kludges made to support old
versions in my forked QAPI parser don't need to be kept in-tree. In theory,
we only need to parse each historical schema precisely once.

(And then we'll only need to parse each new schema going forward with the
version of the QAPI parser that's already in-tree.)

Importantly, old versions of the schema aren't contained *entirely* within
the schema. Here's a timeline:

v0.12.0: QMP first introduced. Events are hardcoded, commands are defined
in qemu-monitor.hx. query commands are hard-coded in monitor.c.
v0.14.0: qemu-monitor.hx is forked into qmp-commands.hx and hmp-commands.hx
v1.0: First version which features qapi-schema.json; all query commands are
qapified but most other commands are not.
v1.1.0: A very large chunk of commands are QAPIfied.
v1.3.0: Most commands are now QAPIfied, but there are 2-3 remaining.
v2.1.0: events are now fully qapified; most are now defined in
qapi/events.json
v2.8.0: The remaining commands are fully qapified; qmp-commands.hx is
removed.

So what I mean by "compiled" schema is parsing historical revisions of the
schema, including descriptive schema definitions for items not-yet-qapified
(but nonetheless remain available in that version), and writing that
curated information back out to disk (and checking into qemu.git) for later
reference.

There are multiple choices for this output format:

(A) Just output in native qapi-schema format, just choose the "latest
version".
(B) Choose some other arbitrary format.
(C) Some secret, third thing.

I don't like the native qapi schema idea, for a few reasons:
- It changes over time
- It does not support nested structures except by reference, but
- Type names are not meant to be API visible
- Detecting data changes in nested, shared types is difficult

In my prototype thus far, I have used a JSON-Schema based format with a
type definition library that catalogues all of the command arg / command
return / event data types with *all fields* fully dereferenced and inlined,
except where impossible due to cyclical/recursive references (PCI and
BlockStats come to mind.)

A benefit of this "compilation" is that all commands have their arguments
and return values described solely by type (and for enums, values) and not
by type name - fully removing non-API information from the "compiled"
version.

A downside is that this is yet-another-format that differs from the
existing format that requires someone knowledgeable to manipulate in case
there are errors found in it, or worse - we decide to change or upgrade the
format in some way to support a new feature in the future and we must
yet-again revise our catalogue of historical schemata.

>
> My preference is for this metadata to be there when the generator
> parses the schema. It could be another external metadata file, but
> available to the generator.
>

Yes, ideally...

I'm thinking that I'd like to include a qapi/history/ directory which
contains a record of all compiled historical schemata, and the generator
can parse the directory and load them all up - and then compile a quick
lookup table that is able to answer basic questions about the data.

i.e. "when was X [command/event] first introduced?"

"when was Command.arguments.key first introduced / incompatibly modified?"

The syntax/API for how to ask the QAPISchema object these questions remains
unpondered, as does the question "how do we report 'since' information in
the rendered QMP documentation HTML for nested objects that we begrudgingly
refer to only by type?"
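
As a rough illustration of the lookup-table idea only (not a proposal for
the real QAPISchema API), here is a minimal sketch of answering "when was X
first introduced?" from a list of compiled snapshots ordered oldest first.
The struct, the function name and the sample versions and command names are
all placeholders.

```c
#include <stddef.h>
#include <string.h>

/* One compiled schema: a version label plus the command names it defines. */
struct schema_snapshot {
    const char *version;
    const char *const *commands;   /* NULL-terminated list */
};

/* Hypothetical sample data; versions and command names are placeholders. */
static const char *const sample_cmds_v1[] = { "query-foo", NULL };
static const char *const sample_cmds_v2[] = { "query-foo", "do-bar", NULL };

static const struct schema_snapshot sample_history[] = {
    { "v1", sample_cmds_v1 },      /* oldest snapshot first */
    { "v2", sample_cmds_v2 },
};

/* Return the oldest version that defines the command, or NULL if none does. */
const char *first_version_with(const struct schema_snapshot *snaps,
                               size_t n, const char *command)
{
    size_t i;
    const char *const *c;

    for (i = 0; i < n; i++) {
        for (c = snaps[i].commands; *c != NULL; c++) {
            if (strcmp(*c, command) == 0) {
                return snaps[i].version;
            }
        }
    }
    return NULL;
}
```

A real implementation would also need to cover events, argument keys and
incompatible changes; this only shows the "first seen" lookup.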

Even if QAPI type names are not API, and even if my new QMP documentation
project eliminates as many type names as it can by inlining inherited
structures, boxed arguments, and branches - there are still referenc

Re: [PATCH RFC] hw/arm/virt: Avoid unexpected warning from Linux guest on host with Fujitsu CPUs

2024-06-06 Thread Jonathan Cameron via

On Thu, 6 Jun 2024 12:56:59 +0100
Peter Maydell  wrote:

> On Thu, 6 Jun 2024 at 11:48, Zhenyu Zhang  wrote:
> >
> > Multiple warning messages and corresponding backtraces are observed when 
> > Linux
> > guest is booted on the host with Fujitsu CPUs. One of them is shown as 
> > below.
> >
> > [0.032443] [ cut here ]
> > [0.032446] uart-pl011 900.pl011: ARCH_DMA_MINALIGN smaller than 
> > CTR_EL0.CWG (128 < 256)
> > [0.032454] WARNING: CPU: 0 PID: 1 at arch/arm64/mm/dma-mapping.c:54 
> > arch_setup_dma_ops+0xbc/0xcc
> > [0.032470] Modules linked in:
> > [0.032475] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> > 5.14.0-452.el9.aarch64 #1
> > [0.032481] Hardware name: linux,dummy-virt (DT)
> > [0.032484] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS 
> > BTYPE=--)
> > [0.032490] pc : arch_setup_dma_ops+0xbc/0xcc
> > [0.032496] lr : arch_setup_dma_ops+0xbc/0xcc
> > [0.032501] sp : 80008003b860
> > [0.032503] x29: 80008003b860 x28:  x27: 
> > aae4b949049c
> > [0.032510] x26:  x25:  x24: 
> > 
> > [0.032517] x23: 0100 x22:  x21: 
> > 
> > [0.032523] x20: 0001 x19: 2f06c02ea400 x18: 
> > 
> > [0.032529] x17: 208a5f76 x16: 6589dbcb x15: 
> > aae4ba071c89
> > [0.032535] x14:  x13: aae4ba071c84 x12: 
> > 455f525443206e61
> > [0.032541] x11: 68742072656c6c61 x10: 0029 x9 : 
> > aae4b7d21da4
> > [0.032547] x8 : 0029 x7 : 4c414e494d5f414d x6 : 
> > 0029
> > [0.032553] x5 : 000f x4 : aae4b9617a00 x3 : 
> > 0001
> > [0.032558] x2 :  x1 :  x0 : 
> > 2f06c029be40
> > [0.032564] Call trace:
> > [0.032566]  arch_setup_dma_ops+0xbc/0xcc
> > [0.032572]  of_dma_configure_id+0x138/0x300
> > [0.032591]  amba_dma_configure+0x34/0xc0
> > [0.032600]  really_probe+0x78/0x3dc
> > [0.032614]  __driver_probe_device+0x108/0x160
> > [0.032619]  driver_probe_device+0x44/0x114
> > [0.032624]  __device_attach_driver+0xb8/0x14c
> > [0.032629]  bus_for_each_drv+0x88/0xe4
> > [0.032634]  __device_attach+0xb0/0x1e0
> > [0.032638]  device_initial_probe+0x18/0x20
> > [0.032643]  bus_probe_device+0xa8/0xb0
> > [0.032648]  device_add+0x4b4/0x6c0
> > [0.032652]  amba_device_try_add.part.0+0x48/0x360
> > [0.032657]  amba_device_add+0x104/0x144
> > [0.032662]  of_amba_device_create.isra.0+0x100/0x1c4
> > [0.032666]  of_platform_bus_create+0x294/0x35c
> > [0.032669]  of_platform_populate+0x5c/0x150
> > [0.032672]  of_platform_default_populate_init+0xd0/0xec
> > [0.032697]  do_one_initcall+0x4c/0x2e0
> > [0.032701]  do_initcalls+0x100/0x13c
> > [0.032707]  kernel_init_freeable+0x1c8/0x21c
> > [0.032712]  kernel_init+0x28/0x140
> > [0.032731]  ret_from_fork+0x10/0x20
> > [0.032735] ---[ end trace  ]---
> >
> > In Linux, a check is applied to every device which is exposed through 
> > device-tree
> > node. The warning message is raised when the device isn't DMA coherent and 
> > the
> > cache line size is larger than ARCH_DMA_MINALIGN (128 bytes). The cache 
> > line is
> > sourced from CTR_EL0[CWG], which corresponds to 256 bytes on the guest CPUs.
> > The DMA coherent capability is claimed through 'dma-coherent' in their
> > device-tree nodes.  
> 
> For QEMU emulated all our DMA is always coherent, so where we
> have DMA-capable devices we should definitely tell the kernel
> that that DMA is coherent.
> 
> Our pl011 does not do DMA, though (we do not set the dmas property), so
> it's kind of bogus for the kernel to complain about that.
> 
> So I think we should take these changes where they refer to DMA
> capable devices and ask the kernel folks to fix the warnings
> where they refer to devices that aren't doing DMA. Looking through
> the patch, though, my initial impression is that all these are
> in the latter category...

I was curious and have a very slow test running, so took a look.
of_dma_configure() is being passed force_dma = true.
https://elixir.bootlin.com/linux/v6.10-rc2/source/drivers/amba/bus.c#L361

There is a comment in of_dma_configure()
/*
 * For legacy reasons, we have to assume some devices need
 * DMA configuration regardless of whether "dma-ranges" is
 * correctly specified or not.
 */
So I think this is being triggered by a workaround for broken DT.
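
For reference, the condition behind the warning can be sketched as below.
This is a simplified standalone illustration, not the kernel source: it
assumes CWG sits in CTR_EL0 bits [27:24] and encodes the granule as
4 << CWG bytes, and the function names are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decode the Cache Writeback Granule from a CTR_EL0 value.
 * Assumption: CWG is bits [27:24] and the size is 4 << CWG bytes.
 * Returns 0 when the field is 0 (granule not provided).
 */
unsigned int cwg_bytes(uint64_t ctr_el0)
{
    unsigned int cwg = (unsigned int)((ctr_el0 >> 24) & 0xf);

    return cwg ? 4u << cwg : 0;
}

/*
 * The warning fires for a device that is not marked DMA coherent when the
 * cache line size exceeds the minimum DMA alignment.
 */
bool dma_minalign_warning(bool coherent, unsigned int cache_line,
                          unsigned int dma_minalign)
{
    return !coherent && cache_line > dma_minalign;
}
```

With CWG = 6 the granule is 256 bytes, so a non-coherent device with a
128-byte minimum alignment trips the check, while a device marked
dma-coherent does not.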

This was introduced by Robin Murphy +CC though you may need to ask on
kernel list because ARM / QEMU fun.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=723288836628b

Relevant comment from that patch description:

"Certain bus types have a general expec

Re: [PATCH] target/sparc: use signed denominator in sdiv helper

2024-06-06 Thread Richard Henderson


On 6/6/24 07:43, Clément Chigot wrote:

The result has to be done with the signed denominator (b32) instead of
the unsigned value passed in argument (b).

Fixes: 1326010322d6 ("target/sparc: Remove CC_OP_DIV")
Signed-off-by: Clément Chigot 
---
  target/sparc/helper.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/sparc/helper.c b/target/sparc/helper.c
index 2247e243b5..7846ddd6f6 100644
--- a/target/sparc/helper.c
+++ b/target/sparc/helper.c
@@ -121,7 +121,7 @@ uint64_t helper_sdiv(CPUSPARCState *env, target_ulong a, 
target_ulong b)
  return (uint32_t)(b32 < 0 ? INT32_MAX : INT32_MIN) | (-1ull << 32);
  }
  
-a64 /= b;

+a64 /= b32;
  r = a64;
  if (unlikely(r != a64)) {
  return (uint32_t)(a64 < 0 ? INT32_MIN : INT32_MAX) | (-1ull << 32);


Oops.

Reviewed-by: Richard Henderson 

r~

Re: [PATCH 0/6] host/i386: require x86-64-v2 ISA

2024-06-06 Thread Alexander Monakov

Hi,

On Fri, 31 May 2024, Paolo Bonzini wrote:

> x86-64-v2 processors were released in 2008, assume that we have one.
> This provides CMOV on 32-bit processors, and also POPCNT and various
> vector ISA extensions.

If my contributions to recent cleanups and speedups for buffer_is_zero
count for something, I'd like to ask you to reconsider. I do not see
what distribution maintainers (where there's no distro-wide switch to
x86_64-v2 baseline happening yet) are supposed to do with SIGILL reports
coming from affected users after this change.

I'm sure it's not "here's a nickel, kid...", but I'm honestly at a loss
what you'd suggest.

Looking at the patches, the gains appear to be so remarkably tiny, with
the exception of adding CMOV to baseline, that I question if it's worth
the friction. Is there something I'm not seeing?

I think basing the decision on when the earliest x86_64-v2 processors appeared
is not right.

Would you consider a reversal of the three patches that bump the baseline
beyond SSE2?

>   meson: assume x86-64-v2 baseline ISA
>   host/i386: assume presence of SSSE3
>   host/i386: assume presence of POPCNT

Thank you.
Alexander

Re: [PATCH v1 0/8] PRI support for VT-d

2024-06-06 Thread CLEMENT MATHIEU--DRIF

Hi,

Just adding Michael in Cc:

Thanks
 >cmd


On 30/05/2024 14:24, CLEMENT MATHIEU--DRIF wrote:
> This series belongs to a list of series that add SVM support for VT-d.
>
> Here we focus on the implementation of PRI support in the IOMMU and on a 
> PCI-level
> API for PRI to be used by virtual devices.
>
> This work is based on the VT-d specification version 4.1 (March 2023).
> Here is a link to a GitHub repository where you can find the following 
> elements :
>  - Qemu with all the patches for SVM
>  - ATS
>  - PRI
>  - Device IOTLB invalidations
>  - Requests with already translated addresses
>  - A demo device
>  - A simple driver for the demo device
>  - A userspace program (for testing and demonstration purposes)
>
> https://github.com/BullSequana/Qemu-in-guest-SVM-demo
>
> Clément Mathieu--Drif (8):
>pcie: add a helper to declare the PRI capability for a pcie device
>pcie: helper functions to check to check if PRI is enabled
>pcie: add a way to get the outstanding page request allocation (pri)
>  from the config space.
>pci: declare structures and IOMMU operation for PRI
>pci: add a PCI-level API for PRI
>intel_iommu: declare PRI constants and structures
>intel_iommu: declare registers for PRI
>intel_iommu: add PRI operations support
>
>   hw/i386/intel_iommu.c  | 302 +
>   hw/i386/intel_iommu_internal.h |  54 +-
>   hw/pci/pci.c   |  37 
>   hw/pci/pcie.c  |  42 +
>   include/exec/memory.h  |  65 +++
>   include/hw/pci/pci.h   |  45 +
>   include/hw/pci/pci_bus.h   |   1 +
>   include/hw/pci/pcie.h  |   7 +-
>   include/hw/pci/pcie_regs.h |   4 +
>   system/memory.c|  49 ++
>   10 files changed, 604 insertions(+), 2 deletions(-)
>

Re: [PATCH] target/i386: SEV: do not assume machine->cgs is SEV

2024-06-06 Thread Paolo Bonzini

On Thu, Jun 6, 2024 at 6:07 PM Xiaoyao Li  wrote:
>
> On 6/6/2024 6:44 AM, Paolo Bonzini wrote:
> > There can be other confidential computing classes that are not derived
> > from sev-common.  Avoid aborting when encountering them.
>
> I hit it today when rebasing TDX patches to latest QEMU master, which
> has the SEV-SNP series merged. (I didn't get time to review it before
> it got merged.)
>
> my approach is to guard with sev_enabled() when calling
> sev_es_set_reset_vector() in kvm_arch_reset_vcpu(), because calling sev*
> specific function in generic kvm code doesn't look reasonable to me.

On the other hand I would like to avoid too many sev/tdx conditionals
in common code.  Neither choice is great.

Another possibility is to make this a X86ConfidentialGuest method, if
the TDX code has anything similar.

Feel free to keep this patch, or anything that replaces it, in your TDX series.

Apart from this issue, I could rebase the previous TDX patches on top
of SEV-SNP without any problems.

Paolo

Re: [PATCH] target/i386: SEV: do not assume machine->cgs is SEV

2024-06-06 Thread Xiaoyao Li


On 6/6/2024 6:44 AM, Paolo Bonzini wrote:

There can be other confidential computing classes that are not derived
from sev-common.  Avoid aborting when encountering them.


I hit it today when rebasing TDX patches to latest QEMU master, which 
has the SEV-SNP series merged. (I didn't get time to review it before
it got merged.)



Signed-off-by: Paolo Bonzini 
---
  target/i386/sev.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/target/i386/sev.c b/target/i386/sev.c
index 004c667ac14..97e15f8b7a9 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -1710,7 +1710,9 @@ void sev_es_set_reset_vector(CPUState *cpu)


my approach is to guard with sev_enabled() when calling 
sev_es_set_reset_vector() in kvm_arch_reset_vcpu(), because calling sev* 
specific function in generic kvm code doesn't look reasonable to me.



  {
  X86CPU *x86;
  CPUX86State *env;
-SevCommonState *sev_common = SEV_COMMON(MACHINE(qdev_get_machine())->cgs);
+ConfidentialGuestSupport *cgs = MACHINE(qdev_get_machine())->cgs;
+SevCommonState *sev_common = SEV_COMMON(
+object_dynamic_cast(OBJECT(cgs), TYPE_SEV_COMMON));
  
  /* Only update if we have valid reset information */

  if (!sev_common || !sev_common->reset_data_valid) {
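
The idea behind the change, a checked downcast that yields NULL instead of
aborting when the object is of an unrelated class, can be sketched in a
standalone form. This is a toy model and not QOM: the structs, the
checked_cast() function and the type descriptors are all made up for the
illustration.

```c
#include <stddef.h>

/* Toy type descriptor with single inheritance. */
struct type_info {
    const char *name;
    const struct type_info *parent;
};

struct object {
    const struct type_info *type;
};

/* Hypothetical type hierarchy: two unrelated children of a common base. */
static const struct type_info cgs_type = { "cgs-base", NULL };
static const struct type_info sev_common_type = { "sev-common-like", &cgs_type };
static const struct type_info other_cgs_type = { "other-cgs", &cgs_type };

/*
 * Return obj if its type is target or derives from it, NULL otherwise.
 * The caller can then test the result instead of hitting an assertion.
 */
void *checked_cast(struct object *obj, const struct type_info *target)
{
    const struct type_info *t;

    if (obj == NULL) {
        return NULL;
    }
    for (t = obj->type; t != NULL; t = t->parent) {
        if (t == target) {
            return obj;
        }
    }
    return NULL;
}
```

With this shape, code that only makes sense for the SEV-like class can bail
out early on a NULL result when the machine uses a different confidential
guest class.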

Re: [PATCH v5 0/3] Fix MCE handling on AMD hosts

2024-06-06 Thread John Allen

On Thu, Jun 06, 2024 at 11:09:05AM +0200, Paolo Bonzini wrote:
> Queued, thanks.  I added a note to the commit message in the third patch:

Thanks, Paolo!

> 
> By the time the MCE reaches the guest, the overflow has been handled
> by the host and has not caused a shutdown, so include the bit 
> unconditionally.

I'm not sure I understand this additional comment. Is this talking about
the case where the host gets an overflow? If so, yes, if the host has
overflow recovery supported, it should handle the overflow and won't
require any overflow recovery on the part of the guest. For clarity, it
may be nice to prefix the above statement with something like:
"In the case of a host overflow, ..."

If we're going to bring up the host overflow case, it may be worth
clarifying further that host overflows should not propagate to the guest
and this patch is specifically intended to allow the guest to handle
overflows in the MCEs that are injected from qemu.

> 
> Advertising of SUCCOR and OVERFLOW_RECOV in KVM would still be nice. :)

Sure, I will send a series for this.

Thanks,
John

Re: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU

2024-06-06 Thread Cédric Le Goater


Hello Joao,

On 6/22/23 23:48, Joao Martins wrote:

Hey,

This series introduces support for vIOMMU with VFIO device migration,
particularly related to how we do the dirty page tracking.

Today vIOMMUs serve two purposes: 1) enable interrupt remapping 2)
provide dma translation services for guests to provide some form of
guest kernel managed DMA e.g. for nested virt based usage; (1) is especially
required for big VMs with VFs with more than 255 vcpus. We tackle both
and remove the migration blocker when vIOMMU is present provided the
conditions are met. I have both use-cases here in one series, but I am happy
to tackle them in separate series.

As I found out we don't necessarily need to expose the whole vIOMMU
functionality in order to just support interrupt remapping. x86 IOMMUs
on Windows Server 2018[2] and Linux >=5.10, with qemu 7.1+ (or really
Linux guests with commit c40c10 and since qemu commit 8646d9c773d8)
can instantiate an IOMMU just for interrupt remapping without needing to
be advertised/support DMA translation. AMD IOMMU in theory can provide
the same, but Linux doesn't quite support the IR-only part there yet,
only intel-iommu.

The series is organized as following:

Patches 1-5: Today we can't gather vIOMMU details before the guest
establishes their first DMA mapping via the vIOMMU. So these first four
patches add a way for vIOMMUs to be asked of their properties at start
of day. I choose the least churn possible way for now (as opposed to a
treewide conversion) and allow easy conversion a posteriori. As
suggested by Peter Xu[7], I have resurrected Yi's patches[5][6] which
allows us to fetch PCI backing vIOMMU attributes, without necessarily
tying the caller (VFIO or anyone else) to an IOMMU MR like I
was doing in v3.

Patches 6-8: Handle configs with vIOMMU interrupt remapping but without
DMA translation allowed. Today the 'dma-translation' attribute is
x86-iommu only, but the way this series is structured nothing stops
other vIOMMUs from supporting it too as long as they use
pci_setup_iommu_ops() and the necessary IOMMU MR get_attr attributes
are handled. The blocker is thus relaxed when vIOMMUs are able to
toggle/report the DMA_TRANSLATION attribute. With the patches up to this set,
we've then tackled item (1) of the second paragraph.

Patches 9-15: Simplified a lot from v2 (patch 9) to only track the complete
IOVA address space, leveraging the logic we use to compose the dirty ranges.
The blocker is once again relaxed for vIOMMUs that advertise their IOVA
addressing limits. This tackles item (2). So far I mainly use it with
intel-iommu, although I have a small set of patches for virtio-iommu per
Alex's suggestion in v2.

Comments, suggestions welcome. Thanks for the review!



I spent some time refreshing your series on upstream QEMU (See [1]) and
gave migration a try with CX-7 VF. LGTM. It doesn't seem we are far
from acceptance in QEMU 9.1. Are we ?

First, I will resend these with the changes I made :

  vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
  vfio/common: Move dirty tracking ranges update to helper()

I guess the PCIIOMMUOps::get_iommu_attr needs a close review. Is
IOMMU_ATTR_DMA_TRANSLATION a must have ?

The rest is mostly VFIO internals for dirty tracking.

Thanks,

C.

[1] https://github.com/legoater/qemu/commits/vfio-9.1




Regards,
Joao

Changes since v3[8]:
* Pick up Yi's patches[5][6], and rework the first four patches.
   These are a bit better split, and make the new iommu_ops *optional*
   as opposed to a treewide conversion. Rather than returning an IOMMU MR
   and let VFIO operate on it to fetch attributes, we instead let the
   underlying IOMMU driver fetch the desired IOMMU MR and ask for the
   desired IOMMU attribute. Callers only care about PCI Device backing
   vIOMMU attributes regardless of its topology/association. (Peter Xu)
   These patches are a bit better split compared to original ones,
   and I've kept all the same authorship and note the changes from
   original where applicable.
* Because of the rework of the first four patches, switch to
   individual attributes in the VFIOSpace that track dma_translation
   and the max_iova. All are expected to be unused when zero to retain
   the defaults of today in common code.
* Improve the migration blocker message of the last patch to be
   more obvious that vIOMMU migration blocker is added when no vIOMMU
   address space limits are advertised. (Patch 15)
* Cast to uintptr_t in IOMMUAttr data in intel-iommu (Philippe).
* Switch to MAKE_64BIT_MASK() instead of plain left shift (Philippe).
* Change diffstat of patches with scripts/git.orderfile (Philippe).

Changes since v2[3]:
* New patches 1-9 to be able to handle vIOMMUs without DMA translation, and
introduce ways to know various IOMMU model attributes via the IOMMU MR. This
is partly meant to address a comment in previous versions where we can't
access the IOMMU MR prior to the DMA mapping happening. Before t

[PATCH] hw/net: cadence_gem: fix: type2_compare_x_word_0 error

2024-06-06 Thread Andrew.Yuan

In the Cadence IP for Gigabit Ethernet MAC Part Number: IP7014 IP Rev: 
R1p12 - Doc Rev: 1.3 User Guide, the specification for the 
type2_compare_x_word_0 register is as follows:
The byte stored in bits [23:16] is compared against the byte in the 
received frame from the selected offset+0, and the byte stored in bits [31:24] 
is compared against the byte in
the received frame from the selected offset+1.

However, there is an implementation error in the cadence_gem model in 
qemu:
the byte stored in bits [31:24] is compared against the byte in the 
received frame from the selected offset+0

The erroneous code is:
rx_cmp = rxbuf_ptr[offset] << 8 | rxbuf_ptr[offset];

and needs to be corrected to:
rx_cmp = rxbuf_ptr[offset + 1] << 8 | rxbuf_ptr[offset];

Signed-off-by: Andrew.Yuan 
---
 hw/net/cadence_gem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/net/cadence_gem.c b/hw/net/cadence_gem.c
index ec7bf562e5..9c73ded0d3 100644
--- a/hw/net/cadence_gem.c
+++ b/hw/net/cadence_gem.c
@@ -946,7 +946,7 @@ static int get_queue_from_screen(CadenceGEMState *s, 
uint8_t *rxbuf_ptr,
 break;
 }
 
-rx_cmp = rxbuf_ptr[offset] << 8 | rxbuf_ptr[offset];
+rx_cmp = rxbuf_ptr[offset + 1] << 8 | rxbuf_ptr[offset];
 mask = FIELD_EX32(cr0, TYPE2_COMPARE_0_WORD_0, MASK_VALUE);
 compare = FIELD_EX32(cr0, TYPE2_COMPARE_0_WORD_0, COMPARE_VALUE);
 
-- 
2.37.0.windows.1
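
A small standalone illustration of the difference between the two
expressions follows. It is a sketch for checking the byte order only; the
function names are hypothetical and nothing here is taken from
cadence_gem.c beyond the two expressions quoted in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Old expression: the byte at offset+0 ends up in both halves. */
uint16_t rx_cmp_old(const uint8_t *rxbuf_ptr, size_t offset)
{
    return (uint16_t)(rxbuf_ptr[offset] << 8 | rxbuf_ptr[offset]);
}

/*
 * Fixed expression: byte at offset+0 in the low half, byte at offset+1
 * in the high half, matching the register description above.
 */
uint16_t rx_cmp_fixed(const uint8_t *rxbuf_ptr, size_t offset)
{
    return (uint16_t)(rxbuf_ptr[offset + 1] << 8 | rxbuf_ptr[offset]);
}
```

For frame bytes 0x12 0x34 at the selected offset, the old expression
yields 0x1212 and the fixed one yields 0x3412.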

Re: [PATCH v2 5/9] target/i386: Split out gdb-internal.h

2024-06-06 Thread Richard Henderson


On 6/5/24 23:51, Philippe Mathieu-Daudé wrote:

Shouldn't we remove the definitions from the source to
complete the "split"?


Gah, I thought I had done that.


r~

Re: [PATCH] target/s390x: Fix tracing header path in TCG mem_helper.c

2024-06-06 Thread Stefan Hajnoczi

On Thu, Jun 06, 2024 at 12:30:26PM +0200, Philippe Mathieu-Daudé wrote:
> Commit c9274b6bf0 ("target/s390x: start moving TCG-only code
> to tcg/") moved mem_helper.c, but the trace-events file is
> still in the parent directory, so is the generated trace.h.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
> Ideally we should only use trace events from current directory.

Yes, that would be cleaner. Is it possible to move the relevant trace
events to the trace-events file in target/s390x/tcg/?

> ---
>  target/s390x/tcg/mem_helper.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/target/s390x/tcg/mem_helper.c b/target/s390x/tcg/mem_helper.c
> index 6a308c5553..1fb6cbb6cf 100644
> --- a/target/s390x/tcg/mem_helper.c
> +++ b/target/s390x/tcg/mem_helper.c
> @@ -30,7 +30,7 @@
>  #include "hw/core/tcg-cpu-ops.h"
>  #include "qemu/int128.h"
>  #include "qemu/atomic128.h"
> -#include "trace.h"
> +#include "../trace.h"
>  
>  #if !defined(CONFIG_USER_ONLY)
>  #include "hw/s390x/storage-keys.h"
> -- 
> 2.41.0
> 



Re: kvm crash with virtiofs

2024-06-06 Thread German Maglione

On Thu, Jun 6, 2024 at 10:40 AM Miklos Szeredi  wrote:
>
> Hi,
>
> I get the below crash when running virtio-fs on fedora 39.
>
> Note: weirdly this makes chrome running on the host also crash.
>
> Eric Sandeen also reported some bad behavior of virtio-fs on fc39,
> which might be related.
>
> Versions:
> kernel-6.8.4-200.fc39.x86_64
> qemu-kvm-8.1.3-5.fc39.x86_64
> virtiofsd-1.10.1-1.fc39.x86_64
>
> Thanks,
> Miklos
>
> /usr/libexec/virtiofsd --socket-path=/tmp/vhostqemu --shared-dir /home &
>
> qemu-system-x86_64 -enable-kvm -s -serial none -parallel none -kernel
> /home/mszeredi/git/linux/arch/x86_64/boot/bzImage -drive
> format=raw,file=/home/mszeredi/root_fs,index=0,if=virtio -drive
> format=raw,file=/home/mszeredi/images/ubd1,index=1,if=virtio -chardev
> stdio,id=virtiocon0,signal=off -device virtio-serial -device
> virtconsole,chardev=virtiocon0 -cpu host -m 16G -smp 8 -object
> memory-backend-file,id=mem,size=16G,mem-path=/dev/shm,share=on -numa
> node,memdev=mem -net user -net nic,model=virtio-net-pci -fsdev
> local,security_model=none,id=fsdev0,path=/home -device virtio-rng-pci
> -chardev socket,id=char0,path=/tmp/vhostqemu -device
> vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs -device
> virtio-9p-pci,fsdev=fsdev0,mount_tag=hostshare -append "root=/dev/vda
> console=hvc0 "

Are you running virtiofsd inside a container? Could you try

-object memory-backend-memfd,id=mem,size=16G,share=on

instead of "-object memory-backend-file..."

> [...]
> root@kvm:~# time md5sum /host/mszeredi/images/ubd1
> error: kvm run failed Bad address
> RAX= RBX=888100044240 RCX=
> RDX=888420c59ff0
> RSI=0020 RDI=888420c59ff8 RBP=
> RSP=c900016d3898
> R8 =888420c59da8 R9 =0040 R10=00036140
> R11=0005
> R12=888420c59ff0 R13=000d R14=ea0010831600
> R15=888420c59da8
> RIP=82168d80 RFL=00010046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =   00c0
> CS =0010   00a09b00 DPL=0 CS64 [-RA]
> SS =0018   00c09300 DPL=0 DS   [-WA]
> DS =   00c0
> FS = 7fb83cea8740  00c0
> GS = 88842fd4  00c0
> LDT=   00c0
> TR =0040 fe12a000 4087 8b00 DPL=0 TSS64-busy
> GDT= fe128000 007f
> IDT= fe00 0fff
> CR0=80050033 CR2=7f2d3bd9b0f0 CR3=0001036ee005 CR4=00770ef0
> DR0= DR1= DR2=
> DR3=
> DR6=0ff0 DR7=0400
> EFER=0d01
> Code=90 90 90 90 48 c7 07 00 00 00 00 48 89 fa 48 8d 7f 08 31 c0 <48>
> c7 87 30 02 00 00 00 00 00 00 48 89 d1 48 83 e7 f8 48 29 f9 81 c1 40
> 02 00 00 c1 e9 03
>
>


-- 
German

Re: [PATCH v3 4/4] qdev: add device policy [RfC]

2024-06-06 Thread Peter Maydell

On Thu, 6 Jun 2024 at 15:31, Gerd Hoffmann  wrote:
>
> Add policies for devices which are deprecated or not secure.
> There are three options: allow, warn and deny.
>
> It's implemented for devices only.  Devices will probably be the main
> user of this.  Also object_new() can't fail as of today so it's a bit
> hard to implement policy checking at object level, especially the 'deny'
> part of it.
>
> TODO: add a command line option to actually set these policies.
>
> Comments are welcome.
>
> Signed-off-by: Gerd Hoffmann 

> @@ -162,14 +208,26 @@ DeviceState *qdev_new(const char *name)
>  error_report("unknown type '%s'", name);
>  abort();
>  }
> +
> +if (!qdev_class_check(name, oc)) {
> +exit(1);
> +}
> +
>  return DEVICE(object_new(name));
>  }
>
>  DeviceState *qdev_try_new(const char *name)
>  {
> -if (!module_object_class_by_name(name)) {
> +ObjectClass *oc = module_object_class_by_name(name);
> +
> +if (!oc) {
>  return NULL;
>  }
> +
> +if (!qdev_class_check(name, oc)) {
> +return NULL;
> +}
> +
>  return DEVICE(object_new(name));
>  }

It's valid to create a qdev device via object_new(), so
this doesn't work as a place to put the check. My suggestion
would be to restrict the deprecation handling to qdev only,
not to objects in general. Then you can do it in the
qdev device base class realize method, and fail realize
if it's not supported.
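
To illustrate the allow/warn/deny semantics independently of where the
check ends up living, here is a minimal standalone sketch. The enum, the
function name and the messages are hypothetical; in the realize-based
variant the false return would translate into a failed realize rather than
an exit().

```c
#include <stdbool.h>
#include <stdio.h>

enum device_policy {
    POLICY_ALLOW,
    POLICY_WARN,
    POLICY_DENY,
};

/*
 * Decide whether a device may be used.
 * flagged: the device class is marked (e.g. deprecated or not secure).
 * Returns true if use is permitted; warns or reports an error on stderr.
 */
bool device_policy_check(enum device_policy policy, bool flagged,
                         const char *name, const char *reason)
{
    if (!flagged || policy == POLICY_ALLOW) {
        return true;
    }
    if (policy == POLICY_WARN) {
        fprintf(stderr, "warning: device '%s' is %s\n", name, reason);
        return true;
    }
    fprintf(stderr, "error: device '%s' is %s\n", name, reason);
    return false;
}
```

Only a flagged device under the deny policy is rejected; unflagged devices
pass regardless of policy.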

(qdev_try_new() is one of those "we use this in just 4
places" APIs that always tempts me to wonder if we should
really have it...)

thanks
-- PMM

[PATCH] target/sparc: use signed denominator in sdiv helper

2024-06-06 Thread Clément Chigot

The result has to be done with the signed denominator (b32) instead of
the unsigned value passed in argument (b).

Fixes: 1326010322d6 ("target/sparc: Remove CC_OP_DIV")
Signed-off-by: Clément Chigot 
---
 target/sparc/helper.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/sparc/helper.c b/target/sparc/helper.c
index 2247e243b5..7846ddd6f6 100644
--- a/target/sparc/helper.c
+++ b/target/sparc/helper.c
@@ -121,7 +121,7 @@ uint64_t helper_sdiv(CPUSPARCState *env, target_ulong a, 
target_ulong b)
 return (uint32_t)(b32 < 0 ? INT32_MAX : INT32_MIN) | (-1ull << 32);
 }
 
-a64 /= b;
+a64 /= b32;
 r = a64;
 if (unlikely(r != a64)) {
 return (uint32_t)(a64 < 0 ? INT32_MIN : INT32_MAX) | (-1ull << 32);
-- 
2.25.1
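
For anyone wondering why the denominator's type matters, here is a small
standalone illustration. It is not the QEMU helper: it assumes a 64-bit
signed dividend and a 64-bit unsigned operand whose low 32 bits carry the
divisor, and both function names are hypothetical.

```c
#include <stdint.h>

/*
 * Dividing by the raw unsigned operand: the usual arithmetic conversions
 * turn the dividend into an unsigned value, so this is an unsigned divide.
 */
int64_t div_by_unsigned_operand(int64_t a64, uint64_t b)
{
    a64 /= b;
    return a64;
}

/*
 * Dividing by the operand truncated to a signed 32-bit value: the divide
 * stays signed, which is what a signed division instruction needs.
 */
int64_t div_by_signed_b32(int64_t a64, uint64_t b)
{
    int32_t b32 = (int32_t)b;

    a64 /= b32;
    return a64;
}
```

With a64 = -6 and b = 0xFFFFFFFE (that is, -2 as a 32-bit value), the
signed form gives 3, while the unsigned form divides 0xFFFFFFFFFFFFFFFA by
0xFFFFFFFE and gives 0x100000001.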

Re: [PATCH v3 2/4] usb/hub: mark as deprecated

2024-06-06 Thread Daniel P . Berrangé

On Thu, Jun 06, 2024 at 04:30:08PM +0200, Gerd Hoffmann wrote:
> The hub supports only USB 1.1.  When running out of usb ports it is in
> almost all cases the much better choice to add another usb host adapter
> (or increase the number of root ports when using xhci) instead of using
> the usb hub.

Is that actually a strong enough reason to delete this device though ?
This reads like its merely something we don't expect to be commonly
used, rather than something we would actively want to delete.

> 
> Signed-off-by: Gerd Hoffmann 
> ---
>  hw/usb/dev-hub.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/hw/usb/dev-hub.c b/hw/usb/dev-hub.c
> index 06e9537d0356..bc8d0ba4cfcf 100644
> --- a/hw/usb/dev-hub.c
> +++ b/hw/usb/dev-hub.c
> @@ -686,6 +686,7 @@ static void usb_hub_class_initfn(ObjectClass *klass, void 
> *data)
>  set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
>  dc->fw_name = "hub";
>  dc->vmsd = &vmstate_usb_hub;
> +klass->deprecated = true;
>  device_class_set_props(dc, usb_hub_properties);
>  }

Deprecations should also have an entry in docs/about/deprecated.rst to
warn users about the intent to delete the code in future.

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|

Re: [PATCH v3 1/4] qom: allow to mark objects as deprecated or not secure.

2024-06-06 Thread Daniel P . Berrangé

On Thu, Jun 06, 2024 at 04:30:07PM +0200, Gerd Hoffmann wrote:
> Add flags to ObjectClass for objects which are deprecated or not secure.
> Add 'deprecated' and 'not-secure' bools to ObjectTypeInfo, report in
> 'qom-list-types'.  Print the flags when listing devices via '-device
> help'.
> 
> Signed-off-by: Gerd Hoffmann 
> ---
>  include/qom/object.h  | 3 +++
>  qom/qom-qmp-cmds.c| 8 
>  system/qdev-monitor.c | 8 
>  qapi/qom.json | 8 +++-
>  4 files changed, 26 insertions(+), 1 deletion(-)

Reviewed-by: Daniel P. Berrangé 


With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|

Re: [PATCH v3 3/4] vga/cirrus: mark as not secure

2024-06-06 Thread Daniel P . Berrangé

On Thu, Jun 06, 2024 at 04:30:09PM +0200, Gerd Hoffmann wrote:

What's the justification for declaring cirrus to be insecure ?

It has shipped as a driver in RHEL for years, was the default graphics
adapter for most of this time, and bugs in it are considered CVEs
still.

> Signed-off-by: Gerd Hoffmann 
> ---
>  hw/display/cirrus_vga.c | 1 +
>  hw/display/cirrus_vga_isa.c | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/hw/display/cirrus_vga.c b/hw/display/cirrus_vga.c
> index 150883a97166..1f4c55b21415 100644
> --- a/hw/display/cirrus_vga.c
> +++ b/hw/display/cirrus_vga.c
> @@ -3007,6 +3007,7 @@ static void cirrus_vga_class_init(ObjectClass *klass, void *data)
>  dc->vmsd = &vmstate_pci_cirrus_vga;
>  device_class_set_props(dc, pci_vga_cirrus_properties);
>  dc->hotpluggable = false;
> +klass->not_secure = true;
>  }
>  
>  static const TypeInfo cirrus_vga_info = {
> diff --git a/hw/display/cirrus_vga_isa.c b/hw/display/cirrus_vga_isa.c
> index 84be51670ed8..535a631b4b09 100644
> --- a/hw/display/cirrus_vga_isa.c
> +++ b/hw/display/cirrus_vga_isa.c
> @@ -85,6 +85,7 @@ static void isa_cirrus_vga_class_init(ObjectClass *klass, void *data)
>  dc->realize = isa_cirrus_vga_realizefn;
>  device_class_set_props(dc, isa_cirrus_vga_properties);
>  set_bit(DEVICE_CATEGORY_DISPLAY, dc->categories);
> +klass->not_secure = true;
>  }
>  
>  static const TypeInfo isa_cirrus_vga_info = {
> -- 
> 2.45.2
> 

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|

[PATCH v3 1/4] qom: allow to mark objects as deprecated or not secure.

2024-06-06 Thread Gerd Hoffmann

Add flags to ObjectClass for objects which are deprecated or not secure.
Add 'deprecated' and 'not-secure' bools to ObjectTypeInfo, report in
'qom-list-types'.  Print the flags when listing devices via '-device
help'.

Signed-off-by: Gerd Hoffmann 
---
 include/qom/object.h  | 3 +++
 qom/qom-qmp-cmds.c| 8 
 system/qdev-monitor.c | 8 
 qapi/qom.json | 8 +++-
 4 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/qom/object.h b/include/qom/object.h
index 13d3a655ddf9..419bd9a4b219 100644
--- a/include/qom/object.h
+++ b/include/qom/object.h
@@ -136,6 +136,9 @@ struct ObjectClass
 ObjectUnparent *unparent;
 
 GHashTable *properties;
+
+bool deprecated;
+bool not_secure;
 };
 
 /**
diff --git a/qom/qom-qmp-cmds.c b/qom/qom-qmp-cmds.c
index e91a2353472a..325ff0ba2a25 100644
--- a/qom/qom-qmp-cmds.c
+++ b/qom/qom-qmp-cmds.c
@@ -101,6 +101,14 @@ static void qom_list_types_tramp(ObjectClass *klass, void *data)
 if (parent) {
 info->parent = g_strdup(object_class_get_name(parent));
 }
+if (klass->deprecated) {
+info->has_deprecated = true;
+info->deprecated = true;
+}
+if (klass->not_secure) {
+info->has_not_secure = true;
+info->not_secure = true;
+}
 
 QAPI_LIST_PREPEND(*pret, info);
 }
diff --git a/system/qdev-monitor.c b/system/qdev-monitor.c
index 6af6ef7d667f..effdc95d21d3 100644
--- a/system/qdev-monitor.c
+++ b/system/qdev-monitor.c
@@ -144,6 +144,8 @@ static bool qdev_class_has_alias(DeviceClass *dc)
 
 static void qdev_print_devinfo(DeviceClass *dc)
 {
+ObjectClass *klass = OBJECT_CLASS(dc);
+
 qemu_printf("name \"%s\"", object_class_get_name(OBJECT_CLASS(dc)));
 if (dc->bus_type) {
 qemu_printf(", bus %s", dc->bus_type);
@@ -157,6 +159,12 @@ static void qdev_print_devinfo(DeviceClass *dc)
 if (!dc->user_creatable) {
 qemu_printf(", no-user");
 }
+if (klass->deprecated) {
+qemu_printf(", deprecated");
+}
+if (klass->not_secure) {
+qemu_printf(", not-secure");
+}
 qemu_printf("\n");
 }
 
diff --git a/qapi/qom.json b/qapi/qom.json
index 8bd299265e39..3f20d4c6413b 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -163,10 +163,16 @@
 #
 # @parent: Name of parent type, if any (since 2.10)
 #
+# @deprecated: the type is deprecated (since 9.1)
+#
+# @not-secure: the type (typically a device) is not considered
+# a security boundary (since 9.1)
+#
 # Since: 1.1
 ##
 { 'struct': 'ObjectTypeInfo',
-  'data': { 'name': 'str', '*abstract': 'bool', '*parent': 'str' } }
+  'data': { 'name': 'str', '*abstract': 'bool', '*parent': 'str',
+'*deprecated': 'bool', '*not-secure': 'bool' } }
 
 ##
 # @qom-list-types:
-- 
2.45.2

[PATCH v3 4/4] qdev: add device policy [RfC]

2024-06-06 Thread Gerd Hoffmann

Add policies for devices which are deprecated or not secure.
There are three options: allow, warn and deny.

It's implemented for devices only.  Devices will probably be the main
user of this.  Also object_new() can't fail as of today so it's a bit
hard to implement policy checking at object level, especially the 'deny'
part of it.

TODO: add a command line option to actually set these policies.

Comments are welcome.

Signed-off-by: Gerd Hoffmann 
---
 hw/core/qdev.c | 60 +-
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index f3a996f57dee..0c4e5cec743c 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -43,6 +43,15 @@
 static bool qdev_hot_added = false;
 bool qdev_hot_removed = false;
 
+enum qdev_policy {
+QDEV_ALLOW = 0,
+QDEV_WARN  = 1,
+QDEV_DENY  = 2,
+};
+
+static enum qdev_policy qdev_deprecated_policy;
+static enum qdev_policy qdev_not_secure_policy;
+
 const VMStateDescription *qdev_get_vmsd(DeviceState *dev)
 {
 DeviceClass *dc = DEVICE_GET_CLASS(dev);
@@ -144,6 +153,43 @@ bool qdev_set_parent_bus(DeviceState *dev, BusState *bus, Error **errp)
 return true;
 }
 
+static bool qdev_class_check(const char *name, ObjectClass *oc)
+{
+bool allow = true;
+
+if (oc->deprecated) {
+switch (qdev_deprecated_policy) {
+case QDEV_WARN:
+warn_report("device \"%s\" is deprecated", name);
+break;
+case QDEV_DENY:
+error_report("device \"%s\" is deprecated", name);
+allow = false;
+break;
+default:
+/* nothing */
+break;
+}
+}
+
+if (oc->not_secure) {
+switch (qdev_not_secure_policy) {
+case QDEV_WARN:
+warn_report("device \"%s\" is not secure", name);
+break;
+case QDEV_DENY:
+error_report("device \"%s\" is not secure", name);
+allow = false;
+break;
+default:
+/* nothing */
+break;
+}
+}
+
+return allow;
+}
+
 DeviceState *qdev_new(const char *name)
 {
 ObjectClass *oc = object_class_by_name(name);
@@ -162,14 +208,26 @@ DeviceState *qdev_new(const char *name)
 error_report("unknown type '%s'", name);
 abort();
 }
+
+if (!qdev_class_check(name, oc)) {
+exit(1);
+}
+
 return DEVICE(object_new(name));
 }
 
 DeviceState *qdev_try_new(const char *name)
 {
-if (!module_object_class_by_name(name)) {
+ObjectClass *oc = module_object_class_by_name(name);
+
+if (!oc) {
 return NULL;
 }
+
+if (!qdev_class_check(name, oc)) {
+return NULL;
+}
+
 return DEVICE(object_new(name));
 }
 
-- 
2.45.2
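
For the TODO above, a command line option would need to turn the words "allow", "warn" and "deny" into policy values. The following is a standalone sketch under stated assumptions: the enum, policy_parse() and policy_allows() are hypothetical names made up for illustration and do not exist in QEMU. policy_allows() only mirrors the decision taken by qdev_class_check() in the patch, where a flagged class is rejected only under the deny policy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the three policies described in the patch. */
enum policy {
    POLICY_ALLOW = 0,
    POLICY_WARN  = 1,
    POLICY_DENY  = 2,
};

/*
 * Hypothetical helper: map option text to a policy value.
 * Returns false and leaves *out untouched for unknown text.
 */
static bool policy_parse(const char *text, enum policy *out)
{
    static const char *const names[] = { "allow", "warn", "deny" };

    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        if (strcmp(text, names[i]) == 0) {
            *out = (enum policy)i;
            return true;
        }
    }
    return false;
}

/*
 * Hypothetical helper: a class is refused only when it carries the
 * flag and the matching policy is "deny"; "warn" still allows it.
 */
static bool policy_allows(bool flagged, enum policy p)
{
    return !(flagged && p == POLICY_DENY);
}
```

Keeping the enum values in the same order as the name table lets the table index double as the policy value, at the cost of having to keep both in sync.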

[PATCH v3 0/4] allow to deprecate objects and devices

2024-06-06 Thread Gerd Hoffmann

Put some infrastructure in place to allow tagging objects (including
devices) as deprecated.  Use it to mark the ohci pci host adapter and
the usb hub as deprecated.

v3:
 - switch to two properties: 'deprecated' and 'not secure' flags.
 - add rfc patch implementing policies for devices with flags.

v2:
 - pick up reviews.
 - drop ohci patch.
 - add cirrus vga patch.

Gerd Hoffmann (4):
  qom: allow to mark objects as deprecated or not secure.
  usb/hub: mark as deprecated
  vga/cirrus: mark as not secure
  qdev: add device policy [RfC]

 include/qom/object.h|  3 ++
 hw/core/qdev.c  | 60 -
 hw/display/cirrus_vga.c |  1 +
 hw/display/cirrus_vga_isa.c |  1 +
 hw/usb/dev-hub.c|  1 +
 qom/qom-qmp-cmds.c  |  8 +
 system/qdev-monitor.c   |  8 +
 qapi/qom.json   |  8 -
 8 files changed, 88 insertions(+), 2 deletions(-)

-- 
2.45.2

[PATCH v3 3/4] vga/cirrus: mark as not secure

2024-06-06 Thread Gerd Hoffmann

Signed-off-by: Gerd Hoffmann 
---
 hw/display/cirrus_vga.c | 1 +
 hw/display/cirrus_vga_isa.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/hw/display/cirrus_vga.c b/hw/display/cirrus_vga.c
index 150883a97166..1f4c55b21415 100644
--- a/hw/display/cirrus_vga.c
+++ b/hw/display/cirrus_vga.c
@@ -3007,6 +3007,7 @@ static void cirrus_vga_class_init(ObjectClass *klass, void *data)
 dc->vmsd = &vmstate_pci_cirrus_vga;
 device_class_set_props(dc, pci_vga_cirrus_properties);
 dc->hotpluggable = false;
+klass->not_secure = true;
 }
 
 static const TypeInfo cirrus_vga_info = {
diff --git a/hw/display/cirrus_vga_isa.c b/hw/display/cirrus_vga_isa.c
index 84be51670ed8..535a631b4b09 100644
--- a/hw/display/cirrus_vga_isa.c
+++ b/hw/display/cirrus_vga_isa.c
@@ -85,6 +85,7 @@ static void isa_cirrus_vga_class_init(ObjectClass *klass, void *data)
 dc->realize = isa_cirrus_vga_realizefn;
 device_class_set_props(dc, isa_cirrus_vga_properties);
 set_bit(DEVICE_CATEGORY_DISPLAY, dc->categories);
+klass->not_secure = true;
 }
 
 static const TypeInfo isa_cirrus_vga_info = {
-- 
2.45.2

[PATCH v3 2/4] usb/hub: mark as deprecated

2024-06-06 Thread Gerd Hoffmann

The hub supports only USB 1.1.  When running out of usb ports it is in
almost all cases the much better choice to add another usb host adapter
(or increase the number of root ports when using xhci) instead of using
the usb hub.

Signed-off-by: Gerd Hoffmann 
---
 hw/usb/dev-hub.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/usb/dev-hub.c b/hw/usb/dev-hub.c
index 06e9537d0356..bc8d0ba4cfcf 100644
--- a/hw/usb/dev-hub.c
+++ b/hw/usb/dev-hub.c
@@ -686,6 +686,7 @@ static void usb_hub_class_initfn(ObjectClass *klass, void *data)
 set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
 dc->fw_name = "hub";
 dc->vmsd = &vmstate_usb_hub;
+klass->deprecated = true;
 device_class_set_props(dc, usb_hub_properties);
 }
 
-- 
2.45.2

RE: [PATCH v7 6/7] migration/multifd: implement qpl compression and decompression

2024-06-06 Thread Liu, Yuan1



> -Original Message-
> From: Fabiano Rosas 
> Sent: Thursday, June 6, 2024 9:52 PM
> To: Liu, Yuan1 ; pet...@redhat.com;
> pbonz...@redhat.com; marcandre.lur...@redhat.com; berra...@redhat.com;
> th...@redhat.com; phi...@linaro.org
> Cc: qemu-devel@nongnu.org; Zou, Nanhai ;
> shameerali.kolothum.th...@huawei.com
> Subject: RE: [PATCH v7 6/7] migration/multifd: implement qpl compression
> and decompression
> 
> "Liu, Yuan1"  writes:
> 
> >> -Original Message-
> >> From: Fabiano Rosas 
> >> Sent: Thursday, June 6, 2024 6:26 AM
> >> To: Liu, Yuan1 ; pet...@redhat.com;
> >> pbonz...@redhat.com; marcandre.lur...@redhat.com; berra...@redhat.com;
> >> th...@redhat.com; phi...@linaro.org
> >> Cc: qemu-devel@nongnu.org; Liu, Yuan1 ; Zou,
> Nanhai
> >> ; shameerali.kolothum.th...@huawei.com
> >> Subject: Re: [PATCH v7 6/7] migration/multifd: implement qpl
> compression
> >> and decompression
> >>
> >> Yuan Liu  writes:
> >>
> >> > QPL compression and decompression will use IAA hardware first.
> >> > If IAA hardware is not available, it will automatically fall
> >> > back to QPL software path, if the software job also fails,
> >> > the uncompressed page is sent directly.
> >> >
> >> > Signed-off-by: Yuan Liu 
> >> > Reviewed-by: Nanhai Zou 
> >> > ---
> >> >  migration/multifd-qpl.c | 412
> +++-
> >> >  1 file changed, 408 insertions(+), 4 deletions(-)
> >> >
> >> > diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
> >> > index 6791a204d5..18b3384bd5 100644
> >> > --- a/migration/multifd-qpl.c
> >> > +++ b/migration/multifd-qpl.c
> >> > @@ -13,9 +13,14 @@
> >> >  #include "qemu/osdep.h"
> >> >  #include "qemu/module.h"
> >> >  #include "qapi/error.h"
> >> > +#include "qapi/qapi-types-migration.h"
> >> > +#include "exec/ramblock.h"
> >> >  #include "multifd.h"
> >> >  #include "qpl/qpl.h"
> >> >
> >> > +/* Maximum number of retries to resubmit a job if IAA work queues
> are
> >> full */
> >> > +#define MAX_SUBMIT_RETRY_NUM (3)
> >> > +
> >> >  typedef struct {
> >> >  /* the QPL hardware path job */
> >> >  qpl_job *job;
> >> > @@ -260,6 +265,219 @@ static void
> >> multifd_qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
> >> >  p->iov = NULL;
> >> >  }
> >> >
> >> > +/**
> >> > + * multifd_qpl_prepare_job: prepare the job
> >> > + *
> >> > + * Set the QPL job parameters and properties.
> >> > + *
> >> > + * @job: pointer to the qpl_job structure
> >> > + * @is_compression: indicates compression and decompression
> >> > + * @input: pointer to the input data buffer
> >> > + * @input_len: the length of the input data
> >> > + * @output: pointer to the output data buffer
> >> > + * @output_len: the length of the output data
> >> > + */
> >> > +static void multifd_qpl_prepare_job(qpl_job *job, bool
> is_compression,
> >> > +uint8_t *input, uint32_t
> input_len,
> >> > +uint8_t *output, uint32_t
> >> output_len)
> >> > +{
> >> > +job->op = is_compression ? qpl_op_compress : qpl_op_decompress;
> >> > +job->next_in_ptr = input;
> >> > +job->next_out_ptr = output;
> >> > +job->available_in = input_len;
> >> > +job->available_out = output_len;
> >> > +job->flags = QPL_FLAG_FIRST | QPL_FLAG_LAST |
> QPL_FLAG_OMIT_VERIFY;
> >> > +/* only supports compression level 1 */
> >> > +job->level = 1;
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_prepare_job: prepare the compression job
> >>
> >> function name is wrong
> >
> > Thanks, I will fix this next version.
> >
> >> > + *
> >> > + * Set the compression job parameters and properties.
> >> > + *
> >> > + * @job: pointer to the qpl_job structure
> >> > + * @input: pointer to the input data buffer
> >> > + * @input_len: the length of the input data
> >> > + * @output: pointer to the output data buffer
> >> > + * @output_len: the length of the output data
> >> > + */
> >> > +static void multifd_qpl_prepare_comp_job(qpl_job *job, uint8_t
> *input,
> >> > + uint32_t input_len, uint8_t
> >> *output,
> >> > + uint32_t output_len)
> >> > +{
> >> > +multifd_qpl_prepare_job(job, true, input, input_len, output,
> >> output_len);
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_prepare_job: prepare the decompression job
> >
> > Thanks, I will fix this next version.
> >
> >> > + *
> >> > + * Set the decompression job parameters and properties.
> >> > + *
> >> > + * @job: pointer to the qpl_job structure
> >> > + * @input: pointer to the input data buffer
> >> > + * @input_len: the length of the input data
> >> > + * @output: pointer to the output data buffer
> >> > + * @output_len: the length of the output data
> >> > + */
> >> > +static void multifd_qpl_prepare_decomp_job(qpl_job *job, uint8_t
> >> *input,
> >> > +   uint32_t input_len,
> uint8_t
> >> *output,
> >> > +

Re: linux-user emulation hangs during fork

2024-06-06 Thread Richard Henderson


On 6/6/24 01:27, Andreas Schwab wrote:

Which ruby?

$ ruby --version
ruby 3.3.1 (2024-04-23 revision c56cd86388) [x86_64-linux-gnu]



ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [x86_64-linux-gnu]

That might have been handy to have with your original report.


r~

Re: [PATCH] tracetool: Remove unused vcpu.py script

2024-06-06 Thread Zhao Liu

On Thu, Jun 06, 2024 at 12:26:31PM +0200, Philippe Mathieu-Daudé wrote:
> Date: Thu,  6 Jun 2024 12:26:31 +0200
> From: Philippe Mathieu-Daudé 
> Subject: [PATCH] tracetool: Remove unused vcpu.py script
> X-Mailer: git-send-email 2.41.0
> 
> vcpu.py is pointless since commit 89aafcf2a7 ("trace:
> remove code that depends on setting vcpu"), remove it.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  meson.build   |  1 -
>  scripts/tracetool/__init__.py |  8 +
>  scripts/tracetool/vcpu.py | 59 ---
>  3 files changed, 1 insertion(+), 67 deletions(-)
>  delete mode 100644 scripts/tracetool/vcpu.py

Reviewed-by: Zhao Liu

Re: [PATCH qemu ] hw/acpi: Fix big endian host creation of Generic Port Affinity Structures

2024-06-06 Thread Igor Mammedov

On Wed, 5 Jun 2024 19:04:55 +0100
Jonathan Cameron  wrote:

> Treating the HID as an integer caused it to get bit reversed
> on big endian hosts running little endian guests.  Treat it
> as a character array instead.
> 
> Fixes hw/acpi: Generic Port Affinity Structure Support
> Tested-by: Richard Henderson 
> Signed-off-by: Jonathan Cameron 
> 
> ---
> Richard ran the version posted in the thread on an s390 instance.
> Thanks for the help!
> 
> Difference from version in thread:
> - Instantiate i in the for loop.
> 
> Sending out now so Michael can decide whether to fold this in, or
> drop the GP series for now from his pull request (in which case
> I'll do an updated version with this and Markus' docs feedback
> folded in.)
> 
> ---
>  include/hw/acpi/acpi_generic_initiator.h | 2 +-
>  hw/acpi/acpi_generic_initiator.c | 4 +++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/include/hw/acpi/acpi_generic_initiator.h b/include/hw/acpi/acpi_generic_initiator.h
> index 1a899af30f..5baefda33a 100644
> --- a/include/hw/acpi/acpi_generic_initiator.h
> +++ b/include/hw/acpi/acpi_generic_initiator.h
> @@ -61,7 +61,7 @@ typedef struct PCIDeviceHandle {
>  uint16_t bdf;
>  };
>  struct {
> -uint64_t hid;
> +char hid[8];
>  uint32_t uid;
>  };
>  };

Not sure on top of what this patch applies, but I have some generic comments wrt it.

Why is PCIDeviceHandle in a header file? Is there a plan for it
being used outside of acpi_generic_initiator.c?


> diff --git a/hw/acpi/acpi_generic_initiator.c b/hw/acpi/acpi_generic_initiator.c
> index 78b80dcf08..f064753b67 100644
> --- a/hw/acpi/acpi_generic_initiator.c
> +++ b/hw/acpi/acpi_generic_initiator.c
> @@ -151,7 +151,9 @@ build_srat_generic_node_affinity(GArray *table_data, int node,
>  build_append_int_noprefix(table_data, 0, 12);
>  } else {
>  /* Device Handle - ACPI */
> -build_append_int_noprefix(table_data, handle->hid, 8);
> +for (int i = 0; i < sizeof(handle->hid); i++) {
> +build_append_int_noprefix(table_data, handle->hid[i], 1);
> +}
>  build_append_int_noprefix(table_data, handle->uid, 4);
>  build_append_int_noprefix(table_data, 0, 4);

instead of open coding the structure

it might be better to introduce a helper in aml_build.c,
something like
  /* proper reference to spec as we do for other ACPI primitives */
  build_append_srat_acpi_device_handle(GArray *table_data, char *hid, uint32_t uid)
  assert(strlen(hid) ...
  for() {
build_append_byte()
  }  
  ...

the same applies to "Device Handle - PCI" structure

Also get rid of PCI deps in acpi_generic_initiator.c:
move build_all_acpi_generic_initiators/build_srat_generic_pci_initiator into
hw/acpi/pci.c file if it has to access PCI code/structures directly
(which I'm not convinced it should, can we get/expose what it needs as QOM properties?)

btw:
build_all_acpi_generic_initiators() name doesn't match what it's doing.
it composes only one initiator entry.

>  }
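
The byte-order problem the patch addresses can be shown with a standalone sketch. It assumes that an integer append helper writes the least significant byte first, which is how build_append_int_noprefix() is understood to behave; append_int_le(), load_as_big_endian_host() and append_hid_bytes() are hypothetical names used only here. On a big endian host the 8 HID characters loaded as one uint64_t put the first character in the most significant byte, so an LSB-first append emits them in reverse.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for an integer append helper: write 'size'
 * bytes of 'value' to 'out', least significant byte first.
 */
static void append_int_le(uint8_t *out, uint64_t value, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        out[i] = (uint8_t)(value >> (8 * i));
    }
}

/*
 * Hypothetical helper: the uint64_t a big endian host would see when
 * the 8 HID characters are read as a single integer.
 */
static uint64_t load_as_big_endian_host(const char hid[8])
{
    uint64_t v = 0;

    for (size_t i = 0; i < 8; i++) {
        v = (v << 8) | (uint8_t)hid[i];
    }
    return v;
}

/* Patched approach: append the HID one character at a time. */
static void append_hid_bytes(uint8_t *out, const char hid[8])
{
    for (size_t i = 0; i < 8; i++) {
        append_int_le(out + i, (uint8_t)hid[i], 1);
    }
}
```

Feeding the characters of "ACPI0016" through load_as_big_endian_host() and append_int_le() yields "6100IPCA" in the output buffer, while append_hid_bytes() preserves the original order on any host.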

Re: [PATCH] qapi: clarify that the default is backend dependent

2024-06-06 Thread Stefano Garzarella


On Tue, Jun 04, 2024 at 04:58:49PM GMT, Markus Armbruster wrote:

Stefano Garzarella  writes:


On Mon, Jun 03, 2024 at 11:34:10AM GMT, Markus Armbruster wrote:

Stefano Garzarella  writes:


The default value of the @share option of the @MemoryBackendProperties
really depends on the backend type, so let's document it explicitly and
add the default value where it was missing.

Cc: David Hildenbrand 
Suggested-by: Markus Armbruster 
Signed-off-by: Stefano Garzarella 
---
I followed how we document @share in memfd and epc, but I don't like it
very much, I just can't think of a better way, so if you have a suggestion
I can change them in all of them.

Thanks,
Stefano
---
 qapi/qom.json | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 38dde6d785..8463bd32a2 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -600,7 +600,7 @@

  ##
  # @MemoryBackendProperties:
  #
  # Properties for objects of classes derived from memory-backend.
  #

[...]


 # preallocation threads (default: none) (since 7.2)
 #
 # @share: if false, the memory is private to QEMU; if true, it is
-# shared (default: false)
+# shared (default depends on the backend type)


Note for later: the backends are the branches of ObjectOptions that use
MemoryBackendProperties as branch type or as base of their branch type.
These are

   memory-backend-epc (uses MemoryBackendEpcProperties)
   memory-backend-file (uses MemoryBackendFileProperties)
   memory-backend-memfd (uses MemoryBackendMemfdProperties)
   memory-backend-ram (uses MemoryBackendProperties)


 #
 # @reserve: if true, reserve swap space (or huge pages) if applicable
 # (default: true) (since 6.1)
@@ -639,6 +639,8 @@
 #
 # Properties for memory-backend-file objects.
 #
+# The @share boolean option is false by default with file.
+#
 # @align: the base address alignment when QEMU mmap(2)s @mem-path.
 # Some backend stores specified by @mem-path require an alignment
 # different than the default one used by QEMU, e.g. the device DAX


As stated in the commit message, this matches existing documentation in
memory-backend-epc

  # The @share boolean option is true by default with epc

and memory-backend-memfd

  # The @share boolean option is true by default with memfd.

I think "with FOO" could be clearer.  Perhaps something like "with
backend 'memory-backend-FOO'".


Ack, I'll do.



However, even with your patch, we're still missing memory-backend-ram.
I can see two solutions:

1. Create MemoryBackendRamProperties just to have a place for
documenting @share's default.

2. Document @share's default right where it's defined, roughly like
this:

  # @share: if false, the memory is private to QEMU; if true, it is
  # shared (default false for backends memory-backend-file and
  # memory-backend-ram, true for backends memory-backend-epc and
  # memory-backend-memfd)

CON: we need to remember to update this whenever we add another backend.

PRO: generated documentation is better, in my opinion.

Thoughts?



Maybe option 2 is slightly better and it's also clearer how to document the
default for other backends.

When I added a new backend, it was not clear to me how to define the default
for an inherited parameter.

I would go with 2 if you agree.


I actually like 2 better :)



Yeah, I'll do it ;-)

Thanks,
Stefano

[PATCH v3 5/6] Move tcg implementation of x86 get_physical_address into common helper code.

2024-06-06 Thread Don Porter

Signed-off-by: Don Porter 
---
 target/i386/cpu.h|  42 ++
 target/i386/helper.c | 515 +
 target/i386/tcg/sysemu/excp_helper.c | 555 +--
 3 files changed, 562 insertions(+), 550 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 1e463cc556..2c7cfe7901 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2136,6 +2136,43 @@ struct X86CPUClass {
 ResettablePhases parent_phases;
 };
 
+
+typedef struct X86TranslateParams {
+target_ulong addr;
+target_ulong cr3;
+int pg_mode;
+int mmu_idx;
+int ptw_idx;
+MMUAccessType access_type;
+} X86TranslateParams;
+
+typedef struct X86TranslateResult {
+hwaddr paddr;
+int prot;
+int page_size;
+} X86TranslateResult;
+
+typedef enum X86TranslateFaultStage2 {
+S2_NONE,
+S2_GPA,
+S2_GPT,
+} X86TranslateFaultStage2;
+
+typedef struct X86TranslateFault {
+int exception_index;
+int error_code;
+target_ulong cr2;
+X86TranslateFaultStage2 stage2;
+} X86TranslateFault;
+
+typedef struct X86PTETranslate {
+CPUX86State *env;
+X86TranslateFault *err;
+int ptw_idx;
+void *haddr;
+hwaddr gaddr;
+} X86PTETranslate;
+
 #ifndef CONFIG_USER_ONLY
 extern const VMStateDescription vmstate_x86_cpu;
 #endif
@@ -2180,6 +2217,11 @@ void x86_cpu_list(void);
 int cpu_x86_support_mca_broadcast(CPUX86State *env);
 
 #ifndef CONFIG_USER_ONLY
+bool x86_cpu_get_physical_address(CPUX86State *env, vaddr addr,
+  MMUAccessType access_type, int mmu_idx,
+  X86TranslateResult *out,
+  X86TranslateFault *err, uint64_t ra);
+
 hwaddr x86_cpu_get_phys_page_attrs_debug(CPUState *cpu, vaddr addr,
  MemTxAttrs *attrs);
 int cpu_get_pic_interrupt(CPUX86State *s);
diff --git a/target/i386/helper.c b/target/i386/helper.c
index f9d1381f90..746570a442 100644
--- a/target/i386/helper.c
+++ b/target/i386/helper.c
@@ -26,6 +26,7 @@
 #include "sysemu/hw_accel.h"
 #include "monitor/monitor.h"
 #include "kvm/kvm_i386.h"
+#include "exec/cpu_ldst.h"
 #endif
 #include "qemu/log.h"
 #ifdef CONFIG_TCG
@@ -231,6 +232,520 @@ void cpu_x86_update_cr4(CPUX86State *env, uint32_t new_cr4)
 }
 
 #if !defined(CONFIG_USER_ONLY)
+
+static inline uint32_t ptw_ldl(const X86PTETranslate *in, uint64_t ra)
+{
+if (likely(in->haddr)) {
+return ldl_p(in->haddr);
+}
+return cpu_ldl_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, ra);
+}
+
+static inline uint64_t ptw_ldq(const X86PTETranslate *in, uint64_t ra)
+{
+if (likely(in->haddr)) {
+return ldq_p(in->haddr);
+}
+return cpu_ldq_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, ra);
+}
+/*
+ * Note that we can use a 32-bit cmpxchg for all page table entries,
+ * even 64-bit ones, because PG_PRESENT_MASK, PG_ACCESSED_MASK and
+ * PG_DIRTY_MASK are all in the low 32 bits.
+ */
+static bool ptw_setl_slow(const X86PTETranslate *in, uint32_t old, uint32_t new)
+{
+uint32_t cmp;
+
+/* Does x86 really perform a rmw cycle on mmio for ptw? */
+start_exclusive();
+cmp = cpu_ldl_mmuidx_ra(in->env, in->gaddr, in->ptw_idx, 0);
+if (cmp == old) {
+cpu_stl_mmuidx_ra(in->env, in->gaddr, new, in->ptw_idx, 0);
+}
+end_exclusive();
+return cmp == old;
+}
+
+static inline bool ptw_setl(const X86PTETranslate *in, uint32_t old,
+uint32_t set)
+{
+if (set & ~old) {
+uint32_t new = old | set;
+if (likely(in->haddr)) {
+old = cpu_to_le32(old);
+new = cpu_to_le32(new);
+return qatomic_cmpxchg((uint32_t *)in->haddr, old, new) == old;
+}
+return ptw_setl_slow(in, old, new);
+}
+return true;
+}
+
+
+static bool ptw_translate(X86PTETranslate *inout, hwaddr addr, uint64_t ra)
+{
+CPUTLBEntryFull *full;
+int flags;
+
+inout->gaddr = addr;
+flags = probe_access_full(inout->env, addr, 0, MMU_DATA_STORE,
+  inout->ptw_idx, true, &inout->haddr, &full, ra);
+
+if (unlikely(flags & TLB_INVALID_MASK)) {
+X86TranslateFault *err = inout->err;
+
+assert(inout->ptw_idx == MMU_NESTED_IDX);
+*err = (X86TranslateFault){
+.error_code = inout->env->error_code,
+.cr2 = addr,
+.stage2 = S2_GPT,
+};
+return false;
+}
+return true;
+}
+
+static bool x86_mmu_translate(CPUX86State *env, const X86TranslateParams *in,
+  X86TranslateResult *out,
+  X86TranslateFault *err, uint64_t ra)
+{
+const target_ulong addr = in->addr;
+const int pg_mode = in->pg_mode;
+const bool is_user = is_mmu_index_user(in->mmu_idx);
+const MMUAccessType access_type = in->access_type;
+uint64_t ptep, pte, rsvd_mask;
+X86PTETranslate pte_trans = {
+.env = en
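
The fast path of ptw_setl() above relies on a compare-and-swap so that accessed/dirty bits are only written if the entry has not changed since it was read. The following is a minimal sketch of that idea using C11 <stdatomic.h> rather than QEMU's qatomic_cmpxchg(); set_bits_if_unchanged() is a hypothetical name and the sketch ignores the endianness conversion and the slow path in the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: OR 'set' into *pte, but only if *pte still
 * holds 'old'.  Returns true if nothing needed to change or the
 * update succeeded, false if another writer modified the entry
 * first (the caller would then restart the walk).
 */
static bool set_bits_if_unchanged(_Atomic uint32_t *pte, uint32_t old,
                                  uint32_t set)
{
    if ((set & ~old) == 0) {
        /* All requested bits are already present in 'old'. */
        return true;
    }

    uint32_t expected = old;
    return atomic_compare_exchange_strong(pte, &expected, old | set);
}
```

A 32-bit operation is enough here for the reason given in the patch comment: the present, accessed and dirty bits all live in the low 32 bits of the entry.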

[PATCH v3 6/6] Convert x86_mmu_translate() to use common code.

2024-06-06 Thread Don Porter

Signed-off-by: Don Porter 
---
 target/i386/arch_memory_mapping.c|  44 +++-
 target/i386/cpu.h|   5 +-
 target/i386/helper.c | 374 +++
 target/i386/tcg/sysemu/excp_helper.c |   2 +-
 4 files changed, 129 insertions(+), 296 deletions(-)

diff --git a/target/i386/arch_memory_mapping.c b/target/i386/arch_memory_mapping.c
index b52e98133c..bccd290b9f 100644
--- a/target/i386/arch_memory_mapping.c
+++ b/target/i386/arch_memory_mapping.c
@@ -228,9 +228,38 @@ static void _mmu_decode_va_parameters(CPUState *cs, int height,
 }
 
 /**
- * get_pte - Copy the contents of the page table entry at node[i] into pt_entry.
- *   Optionally, add the relevant bits to the virtual address in
- *   vaddr_pte.
+ * x86_virtual_to_pte_index - Given a virtual address and height in
+ *   the page table radix tree, return the index that should be
+ *   used to look up the next page table entry (pte) in
+ *   translating an address.
+ *
+ * @cs - CPU state
+ * @vaddr - The virtual address to translate
+ * @height - height of node within the tree (leaves are 1, not 0).
+ *
+ * Example: In 32-bit x86 page tables, the virtual address is split
+ * into 10 bits at height 2, 10 bits at height 1, and 12 offset bits.
+ * So a call with VA and height 2 would return the first 10 bits of va,
+ * right shifted by 22.
+ */
+
+int x86_virtual_to_pte_index(CPUState *cs, target_ulong vaddr, int height)
+{
+int shift = 0;
+int width = 0;
+int mask = 0;
+
+_mmu_decode_va_parameters(cs, height, &shift, &width);
+
+mask = (1 << width) - 1;
+
+return (vaddr >> shift) & mask;
+}
+
+/**
+ * x86_get_pte - Copy the contents of the page table entry at node[i]
+ *   into pt_entry.  Optionally, add the relevant bits to
+ *   the virtual address in vaddr_pte.
  *
  * @cs - CPU state
  * @node - physical address of the current page table node
@@ -249,7 +278,6 @@ void
 x86_get_pte(CPUState *cs, hwaddr node, int i, int height,
 PTE_t *pt_entry, vaddr vaddr_parent, vaddr *vaddr_pte,
 hwaddr *pte_paddr)
-
 {
 X86CPU *cpu = X86_CPU(cs);
 CPUX86State *env = &cpu->env;
@@ -282,8 +310,8 @@ x86_get_pte(CPUState *cs, hwaddr node, int i, int height,
 }
 
 
-static bool
-mmu_pte_check_bits(CPUState *cs, PTE_t *pte, int64_t mask)
+bool
+x86_pte_check_bits(CPUState *cs, PTE_t *pte, int64_t mask)
 {
 X86CPU *cpu = X86_CPU(cs);
 CPUX86State *env = &cpu->env;
@@ -300,7 +328,7 @@ mmu_pte_check_bits(CPUState *cs, PTE_t *pte, int64_t mask)
 bool
 x86_pte_present(CPUState *cs, PTE_t *pte)
 {
-return mmu_pte_check_bits(cs, pte, PG_PRESENT_MASK);
+return x86_pte_check_bits(cs, pte, PG_PRESENT_MASK);
 }
 
 /**
@@ -312,7 +340,7 @@ x86_pte_present(CPUState *cs, PTE_t *pte)
 bool
 x86_pte_leaf(CPUState *cs, int height, PTE_t *pte)
 {
-return height == 1 || mmu_pte_check_bits(cs, pte, PG_PSE_MASK);
+return height == 1 || x86_pte_check_bits(cs, pte, PG_PSE_MASK);
 }
 
 /**
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 2c7cfe7901..978841a624 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2198,6 +2198,8 @@ bool x86_pte_present(CPUState *cs, PTE_t *pte);
 bool x86_pte_leaf(CPUState *cs, int height, PTE_t *pte);
 hwaddr x86_pte_child(CPUState *cs, PTE_t *pte, int height);
 uint64_t x86_pte_flags(uint64_t pte);
+bool x86_pte_check_bits(CPUState *cs, PTE_t *pte, int64_t mask);
+int x86_virtual_to_pte_index(CPUState *cs, target_ulong vaddr, int height);
 bool x86_cpu_get_memory_mapping(CPUState *cpu, MemoryMappingList *list,
 Error **errp);
 bool x86_mon_init_page_table_iterator(Monitor *mon,
@@ -2220,7 +,8 @@ int cpu_x86_support_mca_broadcast(CPUX86State *env);
 bool x86_cpu_get_physical_address(CPUX86State *env, vaddr addr,
   MMUAccessType access_type, int mmu_idx,
   X86TranslateResult *out,
-  X86TranslateFault *err, uint64_t ra);
+  X86TranslateFault *err, uint64_t ra,
+  bool read_only);
 
 hwaddr x86_cpu_get_phys_page_attrs_debug(CPUState *cpu, vaddr addr,
  MemTxAttrs *attrs);
diff --git a/target/i386/helper.c b/target/i386/helper.c
index 746570a442..4e5467ee57 100644
--- a/target/i386/helper.c
+++ b/target/i386/helper.c
@@ -308,7 +308,8 @@ static bool ptw_translate(X86PTETranslate *inout, hwaddr addr, uint64_t ra)
 
 static bool x86_mmu_translate(CPUX86State *env, const X86TranslateParams *in,
   X86TranslateResult *out,
-  X86TranslateFault *err, uint64_t ra)
+  X86TranslateFault *err, uint64_t ra,
+  bool read_only)
 {
 const target_ulong addr = in->addr;
 const int pg_mode = in->pg_mode;
@@ -324,6 +325,10 @@ static bool

[PATCH v3 3/6] Convert 'info mem' to use generic iterator

2024-06-06 Thread Don Porter

Signed-off-by: Don Porter 
---
 include/hw/core/sysemu-cpu-ops.h |   6 +
 include/monitor/monitor.h|   4 +
 monitor/hmp-cmds-target.c|   5 +-
 target/i386/cpu.c|   1 +
 target/i386/cpu.h|   1 +
 target/i386/monitor.c| 354 ---
 6 files changed, 60 insertions(+), 311 deletions(-)

diff --git a/include/hw/core/sysemu-cpu-ops.h b/include/hw/core/sysemu-cpu-ops.h
index bf3de3e004..3bef129460 100644
--- a/include/hw/core/sysemu-cpu-ops.h
+++ b/include/hw/core/sysemu-cpu-ops.h
@@ -250,6 +250,12 @@ typedef struct SysemuCPUOps {
 void (*mon_print_pte) (Monitor *mon, CPUArchState *env, hwaddr addr,
hwaddr pte);
 
+/**
+ * @mon_print_mem: Hook called by the monitor to print a range
+ * of memory mappings in 'info mem'
+ */
+bool (*mon_print_mem)(CPUState *cs, struct mem_print_state *state);
+
 } SysemuCPUOps;
 
 #endif /* SYSEMU_CPU_OPS_H */
diff --git a/include/monitor/monitor.h b/include/monitor/monitor.h
index 965f5d5450..e954946ba0 100644
--- a/include/monitor/monitor.h
+++ b/include/monitor/monitor.h
@@ -5,6 +5,7 @@
 #include "qapi/qapi-types-misc.h"
 #include "qemu/readline.h"
 #include "exec/hwaddr.h"
+#include "hw/core/cpu.h"
 
 typedef struct MonitorHMP MonitorHMP;
 typedef struct MonitorOptions MonitorOptions;
@@ -63,4 +64,7 @@ void monitor_register_hmp_info_hrt(const char *name,
 int error_vprintf_unless_qmp(const char *fmt, va_list ap) G_GNUC_PRINTF(1, 0);
 int error_printf_unless_qmp(const char *fmt, ...) G_GNUC_PRINTF(1, 2);
 
+int compressing_iterator(CPUState *cs, void *data, PTE_t *pte, vaddr vaddr_in,
+ int height, int offset);
+
 #endif /* MONITOR_H */
diff --git a/monitor/hmp-cmds-target.c b/monitor/hmp-cmds-target.c
index 3393e5ad0b..8ce37d3187 100644
--- a/monitor/hmp-cmds-target.c
+++ b/monitor/hmp-cmds-target.c
@@ -122,9 +122,8 @@ void hmp_info_registers(Monitor *mon, const QDict *qdict)
 }
 
 /* Assume only called on present entries */
-static
-int compressing_iterator(CPUState *cs, void *data, PTE_t *pte,
- vaddr vaddr_in, int height, int offset)
+int compressing_iterator(CPUState *cs, void *data, PTE_t *pte, vaddr vaddr_in,
+ int height, int offset)
 {
 CPUClass *cc = CPU_GET_CLASS(cs);
 struct mem_print_state *state = (struct mem_print_state *) data;
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 8bd6164b68..046d75f6bb 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -8316,6 +8316,7 @@ static const struct SysemuCPUOps i386_sysemu_ops = {
 .mon_init_page_table_iterator = &x86_mon_init_page_table_iterator,
 .mon_info_pg_print_header = &x86_mon_info_pg_print_header,
 .mon_flush_page_print_state = &x86_mon_flush_print_pg_state,
+.mon_print_mem = &x86_mon_print_mem,
 };
 #endif
 
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 1346ec0033..1e463cc556 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2169,6 +2169,7 @@ void x86_mon_info_pg_print_header(Monitor *mon, struct 
mem_print_state *state);
 bool x86_mon_flush_print_pg_state(CPUState *cs, struct mem_print_state *state);
 void x86_mon_print_pte(Monitor *mon, CPUArchState *env, hwaddr addr,
hwaddr pte);
+bool x86_mon_print_mem(CPUState *cs, struct mem_print_state *state);
 
 void x86_cpu_dump_state(CPUState *cs, FILE *f, int flags);
 
diff --git a/target/i386/monitor.c b/target/i386/monitor.c
index ecde164857..215c018d1f 100644
--- a/target/i386/monitor.c
+++ b/target/i386/monitor.c
@@ -281,332 +281,70 @@ void hmp_info_tlb(Monitor *mon, const QDict *qdict)
 for_each_pte(cs, &mem_print_tlb, &state, false, false);
 }
 
-static void mem_print(Monitor *mon, CPUArchState *env,
-  hwaddr *pstart, int *plast_prot,
-  hwaddr end, int prot)
+bool x86_mon_print_mem(CPUState *cs, struct mem_print_state *state)
 {
-int prot1;
-prot1 = *plast_prot;
-if (prot != prot1) {
-if (*pstart != -1) {
-monitor_printf(mon, HWADDR_FMT_plx "-" HWADDR_FMT_plx " "
-   HWADDR_FMT_plx " %c%c%c\n",
-   addr_canonical(env, *pstart),
-   addr_canonical(env, end),
-   addr_canonical(env, end - *pstart),
-   prot1 & PG_USER_MASK ? 'u' : '-',
-   'r',
-   prot1 & PG_RW_MASK ? 'w' : '-');
-}
-if (prot != 0)
-*pstart = end;
-else
-*pstart = -1;
-*plast_prot = prot;
-}
-}
+CPUArchState *env = state->env;
+int i = 0;
 
-static void mem_info_32(Monitor *mon, CPUArchState *env)
-{
-unsigned int l1, l2;
-int prot, last_prot;
-uint32_t pgd, pde, pte;
-hwaddr start, end;
-
-pgd = env->cr[3] & ~0xfff;
-last_prot = 0;
-start = -1;
-for(l1 = 0; l1 < 1024; l1++

[PATCH v3 2/6] Convert 'info tlb' to use generic iterator

2024-06-06 Thread Don Porter

Signed-off-by: Don Porter 
---
 include/hw/core/sysemu-cpu-ops.h |   7 +
 monitor/hmp-cmds-target.c|   1 +
 target/i386/cpu.h|   2 +
 target/i386/monitor.c| 217 ++-
 4 files changed, 53 insertions(+), 174 deletions(-)

diff --git a/include/hw/core/sysemu-cpu-ops.h b/include/hw/core/sysemu-cpu-ops.h
index eb16a1c3e2..bf3de3e004 100644
--- a/include/hw/core/sysemu-cpu-ops.h
+++ b/include/hw/core/sysemu-cpu-ops.h
@@ -243,6 +243,13 @@ typedef struct SysemuCPUOps {
 bool (*mon_flush_page_print_state)(CPUState *cs,
struct mem_print_state *state);
 
+/**
+ * @mon_print_pte: Hook called by the monitor to print a page
+ * table entry at address addr, with contents pte.
+ */
+void (*mon_print_pte) (Monitor *mon, CPUArchState *env, hwaddr addr,
+   hwaddr pte);
+
 } SysemuCPUOps;
 
 #endif /* SYSEMU_CPU_OPS_H */
diff --git a/monitor/hmp-cmds-target.c b/monitor/hmp-cmds-target.c
index 60a8bd0c37..3393e5ad0b 100644
--- a/monitor/hmp-cmds-target.c
+++ b/monitor/hmp-cmds-target.c
@@ -318,6 +318,7 @@ void hmp_info_pg(Monitor *mon, const QDict *qdict)
 /* Print last entry, if one present */
 cc->sysemu_ops->mon_flush_page_print_state(cs, &state);
 }
+
 static void memory_dump(Monitor *mon, int count, int format, int wsize,
 hwaddr addr, int is_physical)
 {
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index cbb6f6fc4d..1346ec0033 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -2167,6 +2167,8 @@ bool x86_mon_init_page_table_iterator(Monitor *mon,
   struct mem_print_state *state);
 void x86_mon_info_pg_print_header(Monitor *mon, struct mem_print_state *state);
 bool x86_mon_flush_print_pg_state(CPUState *cs, struct mem_print_state *state);
+void x86_mon_print_pte(Monitor *mon, CPUArchState *env, hwaddr addr,
+   hwaddr pte);
 
 void x86_cpu_dump_state(CPUState *cs, FILE *f, int flags);
 
diff --git a/target/i386/monitor.c b/target/i386/monitor.c
index 65e82e73e8..ecde164857 100644
--- a/target/i386/monitor.c
+++ b/target/i386/monitor.c
@@ -214,202 +214,71 @@ static hwaddr addr_canonical(CPUArchState *env, hwaddr 
addr)
 return addr;
 }
 
-static void print_pte(Monitor *mon, CPUArchState *env, hwaddr addr,
-  hwaddr pte, hwaddr mask)
+void x86_mon_print_pte(Monitor *mon, CPUArchState *env, hwaddr addr,
+   hwaddr pte)
 {
+char buf[128];
+char *pos = buf, *end = buf + sizeof(buf);
+
 addr = addr_canonical(env, addr);
 
-monitor_printf(mon, HWADDR_FMT_plx ": " HWADDR_FMT_plx
-   " %c%c%c%c%c%c%c%c%c\n",
-   addr,
-   pte & mask,
-   pte & PG_NX_MASK ? 'X' : '-',
-   pte & PG_GLOBAL_MASK ? 'G' : '-',
-   pte & PG_PSE_MASK ? 'P' : '-',
-   pte & PG_DIRTY_MASK ? 'D' : '-',
-   pte & PG_ACCESSED_MASK ? 'A' : '-',
-   pte & PG_PCD_MASK ? 'C' : '-',
-   pte & PG_PWT_MASK ? 'T' : '-',
-   pte & PG_USER_MASK ? 'U' : '-',
-   pte & PG_RW_MASK ? 'W' : '-');
-}
+pos += snprintf(pos, end - pos, HWADDR_FMT_plx ": " HWADDR_FMT_plx " ",
+addr, (hwaddr) (pte & PG_ADDRESS_MASK));
 
-static void tlb_info_32(Monitor *mon, CPUArchState *env)
-{
-unsigned int l1, l2;
-uint32_t pgd, pde, pte;
+pos += snprintf(pos, end - pos, " %s", pg_bits(pte));
 
-pgd = env->cr[3] & ~0xfff;
-for(l1 = 0; l1 < 1024; l1++) {
-cpu_physical_memory_read(pgd + l1 * 4, &pde, 4);
-pde = le32_to_cpu(pde);
-if (pde & PG_PRESENT_MASK) {
-if ((pde & PG_PSE_MASK) && (env->cr[4] & CR4_PSE_MASK)) {
-/* 4M pages */
-print_pte(mon, env, (l1 << 22), pde, ~((1 << 21) - 1));
-} else {
-for(l2 = 0; l2 < 1024; l2++) {
-cpu_physical_memory_read((pde & ~0xfff) + l2 * 4, &pte, 4);
-pte = le32_to_cpu(pte);
-if (pte & PG_PRESENT_MASK) {
-print_pte(mon, env, (l1 << 22) + (l2 << 12),
-  pte & ~PG_PSE_MASK,
-  ~0xfff);
-}
-}
-}
-}
+/* Trim line to fit screen */
+if (pos - buf > 79) {
+strcpy(buf + 77, "..");
 }
-}
 
-static void tlb_info_pae32(Monitor *mon, CPUArchState *env)
-{
-unsigned int l1, l2, l3;
-uint64_t pdpe, pde, pte;
-uint64_t pdp_addr, pd_addr, pt_addr;
-
-pdp_addr = env->cr[3] & ~0x1f;
-for (l1 = 0; l1 < 4; l1++) {
-cpu_physical_memory_read(pdp_addr + l1 * 8, &pdpe, 8);
-pdpe = le64_to_cpu(pdpe);
-if (pdpe & PG_PRESENT_MASK) {
-pd_addr = pdpe & 0x3f000ULL;
-

[PATCH v3 4/6] Convert x86_cpu_get_memory_mapping() to use generic iterators

2024-06-06 Thread Don Porter

Signed-off-by: Don Porter 
---
 target/i386/arch_memory_mapping.c | 320 --
 1 file changed, 43 insertions(+), 277 deletions(-)

diff --git a/target/i386/arch_memory_mapping.c 
b/target/i386/arch_memory_mapping.c
index 562a00b5a7..b52e98133c 100644
--- a/target/i386/arch_memory_mapping.c
+++ b/target/i386/arch_memory_mapping.c
@@ -19,6 +19,7 @@
  ** code hook implementations for x86 ***
  */
 
+/* PAE Paging or IA-32e Paging */
 #define PML4_ADDR_MASK 0xff000ULL /* selects bits 51:12 */
 
 /**
@@ -365,301 +366,66 @@ x86_pte_child(CPUState *cs, PTE_t *pte, int height)
 return -1;
 }
 
-/* PAE Paging or IA-32e Paging */
-static void walk_pte(MemoryMappingList *list, AddressSpace *as,
- hwaddr pte_start_addr,
- int32_t a20_mask, target_ulong start_line_addr)
-{
-hwaddr pte_addr, start_paddr;
-uint64_t pte;
-target_ulong start_vaddr;
-int i;
-
-for (i = 0; i < 512; i++) {
-pte_addr = (pte_start_addr + i * 8) & a20_mask;
-pte = address_space_ldq(as, pte_addr, MEMTXATTRS_UNSPECIFIED, NULL);
-if (!(pte & PG_PRESENT_MASK)) {
-/* not present */
-continue;
-}
-
-start_paddr = (pte & ~0xfff) & ~(0x1ULL << 63);
-if (cpu_physical_memory_is_io(start_paddr)) {
-/* I/O region */
-continue;
-}
-
-start_vaddr = start_line_addr | ((i & 0x1ff) << 12);
-memory_mapping_list_add_merge_sorted(list, start_paddr,
- start_vaddr, 1 << 12);
-}
-}
-
-/* 32-bit Paging */
-static void walk_pte2(MemoryMappingList *list, AddressSpace *as,
-  hwaddr pte_start_addr, int32_t a20_mask,
-  target_ulong start_line_addr)
-{
-hwaddr pte_addr, start_paddr;
-uint32_t pte;
-target_ulong start_vaddr;
-int i;
-
-for (i = 0; i < 1024; i++) {
-pte_addr = (pte_start_addr + i * 4) & a20_mask;
-pte = address_space_ldl(as, pte_addr, MEMTXATTRS_UNSPECIFIED, NULL);
-if (!(pte & PG_PRESENT_MASK)) {
-/* not present */
-continue;
-}
-
-start_paddr = pte & ~0xfff;
-if (cpu_physical_memory_is_io(start_paddr)) {
-/* I/O region */
-continue;
-}
-
-start_vaddr = start_line_addr | ((i & 0x3ff) << 12);
-memory_mapping_list_add_merge_sorted(list, start_paddr,
- start_vaddr, 1 << 12);
-}
-}
-
-/* PAE Paging or IA-32e Paging */
-#define PLM4_ADDR_MASK 0xff000ULL /* selects bits 51:12 */
+/**
+ * Back to x86 hooks
+ */
+struct memory_mapping_data {
+MemoryMappingList *list;
+};
 
-static void walk_pde(MemoryMappingList *list, AddressSpace *as,
- hwaddr pde_start_addr,
- int32_t a20_mask, target_ulong start_line_addr)
+static int add_memory_mapping_to_list(CPUState *cs, void *data, PTE_t *pte,
+  vaddr vaddr_in, int height,
+  int offset)
 {
-hwaddr pde_addr, pte_start_addr, start_paddr;
-uint64_t pde;
-target_ulong line_addr, start_vaddr;
-int i;
-
-for (i = 0; i < 512; i++) {
-pde_addr = (pde_start_addr + i * 8) & a20_mask;
-pde = address_space_ldq(as, pde_addr, MEMTXATTRS_UNSPECIFIED, NULL);
-if (!(pde & PG_PRESENT_MASK)) {
-/* not present */
-continue;
-}
-
-line_addr = start_line_addr | ((i & 0x1ff) << 21);
-if (pde & PG_PSE_MASK) {
-/* 2 MB page */
-start_paddr = (pde & ~0x1f) & ~(0x1ULL << 63);
-if (cpu_physical_memory_is_io(start_paddr)) {
-/* I/O region */
-continue;
-}
-start_vaddr = line_addr;
-memory_mapping_list_add_merge_sorted(list, start_paddr,
- start_vaddr, 1 << 21);
-continue;
-}
+X86CPU *cpu = X86_CPU(cs);
+CPUX86State *env = &cpu->env;
 
-pte_start_addr = (pde & PLM4_ADDR_MASK) & a20_mask;
-walk_pte(list, as, pte_start_addr, a20_mask, line_addr);
-}
-}
+struct memory_mapping_data *mm_data = (struct memory_mapping_data *) data;
 
-/* 32-bit Paging */
-static void walk_pde2(MemoryMappingList *list, AddressSpace *as,
-  hwaddr pde_start_addr, int32_t a20_mask,
-  bool pse)
-{
-hwaddr pde_addr, pte_start_addr, start_paddr, high_paddr;
-uint32_t pde;
-target_ulong line_addr, start_vaddr;
-int i;
-
-for (i = 0; i < 1024; i++) {
-pde_addr = (pde_start_addr + i * 4) & a20_mask;
-pde = address_space_ldl(as, pde_addr, MEMTXATTRS_UNSPECIFIED, NULL);
-if (!(pde & PG_PRESENT_MASK)) {
-/* not present */
-continue;
+hwaddr start_paddr = 0

[PATCH v3 0/6] Rework x86 page table walks

2024-06-06 Thread Don Porter

This version of the 'info pg' command adopts Peter Maydell's request
to write guest-agnostic page table iterator and accessor code, along
with architecture-specific hooks.  The first patch in this series
contributes a generic page table iterator and an x86 instantiation.
As a client, we first introduce an 'info pg' monitor command, as well
as a compressing callback hook for creating succinct page table
representations.
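
The split between a generic iterator and architecture-specific hooks
can be sketched as follows.  This is a toy model only, not the code in
this series: the toy_* names, the four-entry nodes and the in-memory
node array are all assumptions made for illustration, and only the
leaf test mirrors x86_pte_leaf() from the patches.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_ENTRIES 4
#define TOY_PRESENT 0x1u
#define TOY_LARGE   0x2u   /* hypothetical "large page" bit */

typedef struct {
    uint32_t flags;
    int child;             /* child node index, or frame number in a leaf */
} toy_pte;

typedef struct {
    toy_pte entries[TOY_ENTRIES];
} toy_node;

/* Architecture hooks (hypothetical, loosely modelled on the series). */
typedef struct {
    int  (*entries_per_node)(int height);
    bool (*pte_present)(const toy_pte *pte);
    bool (*pte_leaf)(const toy_pte *pte, int height);
} toy_arch_ops;

typedef int (*toy_visit_fn)(void *data, const toy_pte *pte,
                            uint64_t index_path, int height);

/* Generic walker: knows nothing about the entry format, only the hooks. */
static bool toy_for_each_pte(const toy_arch_ops *ops, const toy_node *nodes,
                             int node, uint64_t prefix, int height,
                             bool visit_interior, toy_visit_fn fn, void *data)
{
    int n = ops->entries_per_node(height);

    for (int i = 0; i < n; i++) {
        const toy_pte *pte = &nodes[node].entries[i];
        uint64_t path = prefix * (uint64_t)n + (uint64_t)i;

        if (!ops->pte_present(pte)) {
            continue;                      /* skip not-present entries */
        }
        bool leaf = ops->pte_leaf(pte, height);
        if ((leaf || visit_interior) && fn(data, pte, path, height) != 0) {
            return false;                  /* callback asked to stop */
        }
        if (!leaf && !toy_for_each_pte(ops, nodes, pte->child, path,
                                       height - 1, visit_interior, fn, data)) {
            return false;
        }
    }
    return true;
}

/* One possible set of hooks for the toy format. */
static int toy_entries_per_node(int height)
{
    (void)height;
    return TOY_ENTRIES;
}

static bool toy_pte_present(const toy_pte *pte)
{
    return (pte->flags & TOY_PRESENT) != 0;
}

static bool toy_pte_leaf(const toy_pte *pte, int height)
{
    return height == 1 || (pte->flags & TOY_LARGE) != 0;
}

/* Callback that just counts how many entries it is shown. */
static int toy_count_cb(void *data, const toy_pte *pte,
                        uint64_t index_path, int height)
{
    (void)pte; (void)index_path; (void)height;
    ++*(int *)data;
    return 0;
}

/* Build a two-level table and count the entries the walker visits. */
static int toy_demo_count(bool visit_interior)
{
    static const toy_arch_ops ops = {
        toy_entries_per_node, toy_pte_present, toy_pte_leaf
    };
    toy_node nodes[2];
    int visits = 0;

    memset(nodes, 0, sizeof(nodes));
    nodes[0].entries[0] = (toy_pte){ TOY_PRESENT, 1 };             /* interior */
    nodes[0].entries[2] = (toy_pte){ TOY_PRESENT | TOY_LARGE, 7 }; /* large leaf */
    nodes[1].entries[1] = (toy_pte){ TOY_PRESENT, 8 };
    nodes[1].entries[3] = (toy_pte){ TOY_PRESENT, 9 };

    toy_for_each_pte(&ops, nodes, 0, 0, 2, visit_interior,
                     toy_count_cb, &visits);
    return visits;
}
```

A client such as a monitor command would supply only the callback; a
different architecture would supply only a different toy_arch_ops.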

After this, each successive patch replaces an existing x86 page table
walker with a use of common iterator code.

I could use advice on how to ensure this is sufficiently well tested.
I used 'make check' and 'make check-avocado', which both pass; what is
the typical standard for testing something like a page table related
change?

As for generality, I have only attempted this on x86, but I expect
the design would work for any similar radix-tree style page table.

I am still new enough to the code base that I wasn't certain about
where to put the generic code, as well as naming conventions.

Per David Gilbert's suggestion, I was careful to ensure that monitor
calls do not perturb TLB state (see the read-only flag in some
functions).

Version 3 of this patch series moves 'info pg' into common monitor
code and implements the architecture-specific code hooks.  I did not
do this with the 'info mem' and 'info tlb' commands, since they have
implementations on other ISAs.

Don Porter (6):
  Add an "info pg" command that prints the current page tables
  Convert 'info tlb' to use generic iterator
  Convert 'info mem' to use generic iterator
  Convert x86_cpu_get_memory_mapping() to use generic iterators
  Move tcg implementation of x86 get_physical_address into common helper
code.
  Convert x86_mmu_translate() to use common code.

 hmp-commands-info.hx |  13 +
 hw/core/cpu-sysemu.c | 140 ++
 include/hw/core/cpu.h|  34 +-
 include/hw/core/sysemu-cpu-ops.h | 169 +++
 include/monitor/hmp-target.h |   1 +
 include/monitor/monitor.h|   4 +
 monitor/hmp-cmds-target.c| 198 
 target/i386/arch_memory_mapping.c| 621 ++-
 target/i386/cpu.c|  12 +
 target/i386/cpu.h|  63 +++
 target/i386/helper.c | 523 +++
 target/i386/monitor.c| 724 +--
 target/i386/tcg/sysemu/excp_helper.c | 555 +---
 13 files changed, 1688 insertions(+), 1369 deletions(-)

--
2.34.1

[PATCH v3 1/6] Add an "info pg" command that prints the current page tables

2024-06-06 Thread Don Porter

The new "info pg" monitor command prints the current page table,
including virtual address ranges, flag bits, and snippets of physical
page numbers.  Completely filled regions of the page table with
compatible flags are "folded", with the result that the complete
output for a freshly booted x86-64 Linux VM can fit in a single
terminal window.  The output looks like this:

VPN range Entry Flags Physical page
[7f000-7f000] PML4[0fe] ---DA--UWP
  [7f28c-7f28f]  PDP[0a3] ---DA--UWP
[7f28c4600-7f28c47ff]  PDE[023] ---DA--UWP
  [7f28c4655-7f28c4656]  PTE[055-056] X--D---U-P 007f14-007f15
  [7f28c465b-7f28c465b]  PTE[05b] A--U-P 001cfc
...
[ff800-ff800] PML4[1ff] ---DA--UWP
  [8-b]  PDP[1fe] ---DA---WP
[81000-81dff]  PDE[008-00e] -GSDA---WP 001000-001dff
  [c-f]  PDP[1ff] ---DA--UWP
[ff400-ff5ff]  PDE[1fa] ---DA--UWP
  [ff5fb-ff5fc]  PTE[1fb-1fc] XG-DACT-WP 0fec00 0fee00
[ff600-ff7ff]  PDE[1fb] ---DA--UWP
  [ff600-ff600]  PTE[000] -G-DA--U-P 001467
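
The "folding" of adjacent entries can be sketched roughly as below.
This is only an illustration under stated assumptions: it treats
"compatible" as "identical flags" and works on a flat, sorted list of
(VPN, flags) pairs, whereas the patch folds while walking the tree.
All toy_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t vpn;
    uint32_t flags;
} toy_mapping;

typedef struct {
    uint64_t first_vpn;
    uint64_t last_vpn;
    uint32_t flags;
} toy_range;

/*
 * Fold mappings (sorted by vpn) into ranges.  A mapping extends the
 * current range only if it is the next consecutive page and carries
 * the same flags.  Returns the number of ranges written, at most cap.
 */
static size_t toy_fold(const toy_mapping *in, size_t n,
                       toy_range *out, size_t cap)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (count > 0 &&
            out[count - 1].flags == in[i].flags &&
            out[count - 1].last_vpn + 1 == in[i].vpn) {
            out[count - 1].last_vpn = in[i].vpn;   /* extend the range */
            continue;
        }
        if (count == cap) {
            break;                                 /* output is full */
        }
        out[count++] = (toy_range){ in[i].vpn, in[i].vpn, in[i].flags };
    }
    return count;
}
```

Each resulting range corresponds to one printed line, which is why a
densely mapped region collapses to a handful of rows.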

This draws heavy inspiration from Austin Clements' original patch.

This also adds a generic page table walker, which other monitor
and execution commands will be migrated to in subsequent patches.

Signed-off-by: Don Porter 
---
 hmp-commands-info.hx  |  13 ++
 hw/core/cpu-sysemu.c  | 140 
 include/hw/core/cpu.h |  34 ++-
 include/hw/core/sysemu-cpu-ops.h  | 156 +
 include/monitor/hmp-target.h  |   1 +
 monitor/hmp-cmds-target.c | 198 +
 target/i386/arch_memory_mapping.c | 351 +-
 target/i386/cpu.c |  11 +
 target/i386/cpu.h |  15 ++
 target/i386/monitor.c | 165 ++
 10 files changed, 1082 insertions(+), 2 deletions(-)

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index 20a9835ea8..a873841920 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -242,6 +242,19 @@ SRST
 Show memory tree.
 ERST
 
+{
+.name   = "pg",
+.args_type  = "",
+.params = "",
+.help   = "show the page table",
+.cmd= hmp_info_pg,
+},
+
+SRST
+  ``info pg``
+Show the active page table.
+ERST
+
 #if defined(CONFIG_TCG)
 {
 .name   = "jit",
diff --git a/hw/core/cpu-sysemu.c b/hw/core/cpu-sysemu.c
index 2a9a2a4eb5..fd936fa90c 100644
--- a/hw/core/cpu-sysemu.c
+++ b/hw/core/cpu-sysemu.c
@@ -142,3 +142,143 @@ GuestPanicInformation *cpu_get_crash_info(CPUState *cpu)
 }
 return res;
 }
+
+/**
+ * _for_each_pte - recursive helper function
+ *
+ * @cs - CPU state
+ * @fn(cs, data, pte, vaddr, height) - User-provided function to call on each
+ * pte.
+ *   * @cs - pass through cs
+ *   * @data - user-provided, opaque pointer
+ *   * @pte - current pte
+ *   * @vaddr_in - virtual address translated by pte
+ *   * @height - height in the tree of pte
+ * @data - user-provided, opaque pointer, passed to fn()
+ * @visit_interior_nodes - if true, call fn() on page table entries in
+ * interior nodes.  If false, only call fn() on page
+ * table entries in leaves.
+ * @visit_not_present - if true, call fn() on entries that are not present.
+ * if false, visit only present entries.
+ * @node - The physical address of the current page table radix tree node
+ * @vaddr_in - The virtual address bits translated in walking the page
+ *  table to node
+ * @height - The height of node in the radix tree
+ *
+ * height starts at the max and counts down.
+ * In a 4 level x86 page table, pml4e is level 4, pdpe is level 3,
+ *  pde is level 2, and pte is level 1
+ *
+ * Returns true on success, false on error.
+ */
+static bool
+_for_each_pte(CPUState *cs,
+  int (*fn)(CPUState *cs, void *data, PTE_t *pte,
+vaddr vaddr_in, int height, int offset),
+  void *data, bool visit_interior_nodes,
+  bool visit_not_present, hwaddr node,
+  vaddr vaddr_in, int height)
+{
+int ptes_per_node;
+int i;
+
+assert(height > 0);
+
+CPUClass *cc = CPU_GET_CLASS(cs);
+
+if ((!cc->sysemu_ops->page_table_entries_per_node)
+|| (!cc->sysemu_ops->get_pte)
+|| (!cc->sysemu_ops->pte_present)
+|| (!cc->sysemu_ops->pte_leaf)
+|| (!cc->sysemu_ops->pte_child)) {
+return false;
+}
+
+ptes_per_node = cc->sysemu_ops->page_table_entries_per_node(cs, height);
+
+for (i = 0; i < ptes_per_node; i++) {
+PTE_t pt_entry;
+vaddr vaddr_i;
+bool pte_present;
+
+cc->sysemu_ops->get_pte(cs, node, i, height, &pt_entry, vaddr_in,
+&vaddr_i,

Re: [PATCH] hw/net: cadence_gem: fix: type2_compare_x_word_0 error

2024-06-06 Thread Edgar E. Iglesias

On Thu, Jun 6, 2024 at 2:06 PM Peter Maydell 
wrote:

> On Thu, 6 Jun 2024 at 12:04, Edgar E. Iglesias 
> wrote:
> >
> > On Thu, Jun 6, 2024 at 12:00 PM Andrew.Yuan 
> wrote:
> >>
> >> In the Cadence IP for Gigabit Ethernet MAC Part Number: IP7014
> IP Rev: R1p12 - Doc Rev: 1.3 User Guide, the specification for the
> type2_compare_x_word_0 register is as follows:
> >> The byte stored in bits [23:16] is compared against the byte in
> the received frame from the selected offset+0, and the byte stored in bits
> [31:24] is compared against the byte in
> >> the received frame from the selected offset+1.
> >>
> >> However, there is an implementation error in the cadence_gem
> model in qemu：
> >> the byte stored in bits [31:24] is compared against the byte in
> the received frame from the selected offset+0
> >>
> >> Now, the error code is as follows:
> >> rx_cmp = rxbuf_ptr[offset] << 8 | rxbuf_ptr[offset];
> >>
> >> and needs to be corrected to：
> >> rx_cmp = rxbuf_ptr[offset + 1] << 8 | rxbuf_ptr[offset];
> >>
> >> Signed-off-by: Andrew.Yuan 
> >
> >
> >
> > LGTM:
> > Reviewed-by: Edgar E. Iglesias 
> >
> > At some point it would be nice to add the missing logic for the
> DISABLE_MASK bit that
> > extends the compare range from 16 to 32-bits.
>
> I had a look at this device's code, and I'm trying to
> figure out how we know at this point that there really are
> two bytes pointed to by rxbuf_ptr.
>  * The get_queue_from_screen() function takes a rxbufsize
>argument, but it never uses it...
>  * the callsite in gem_receive() will (in the "strip FCS" case)
>pass its buf argument as rxbuf_ptr, but it will use a
>rxbufsize argument which has been raised to at least
>GEM_DMACFG_RBUFSZ_MUL, even if the input size argument
>is smaller, so even if get_queue_from_screen() honoured
>its rxbufsize argument it wouldn't help
>
> Would somebody who understands the device like to have a look ?
>
>
Yes, I agree that it looks strange. The padding to minimum 60B seems wrong
since we're blindly extending buf from something less than 60B to 60B
and then potentially copying from it...

Cheers,
Edgar


> This is a separate issue from the incorrect array offset
> argument this patch fixes, though.
>
> thanks
> -- PMM
>
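
For reference, the corrected byte ordering from the patch, combined
with one possible answer to the length question raised above, might
look like the sketch below.  The bounds check is an assumption about
how the concern could be addressed, not something the patch does, and
toy_rx_cmp16() is a hypothetical helper, not a function in
cadence_gem.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Build the 16-bit compare value: the frame byte at offset+0 goes in
 * the low 8 bits and the byte at offset+1 in the high 8 bits, matching
 * the corrected expression in the patch.  Returns false if the frame
 * does not contain two bytes at the requested offset.
 */
static bool toy_rx_cmp16(const uint8_t *frame, size_t frame_len,
                         size_t offset, uint16_t *out)
{
    if (frame_len < 2 || offset > frame_len - 2) {
        return false;
    }
    *out = (uint16_t)((frame[offset + 1] << 8) | frame[offset]);
    return true;
}
```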

[PATCH] i386/apic: Add hint on boot failure because of disabling x2APIC

2024-06-06 Thread Zhao Liu

Currently, the Q35 supports up to 4096 vCPUs (since v9.0), but for TCG
cases, if x2APIC is not actively enabled to boot more than 255 vCPUs (
e.g., qemu-system-i386 -M pc-q35-9.0 -smp 666), the following error is
reported:

Unexpected error in apic_common_set_id() at ../hw/intc/apic_common.c:449:
qemu-system-i386: APIC ID 255 requires x2APIC feature in CPU
Aborted (core dumped)

This error can be resolved by setting x2apic=on in -cpu. In order to
better help users deal with this scenario, add the error hint to
instruct users on how to enable the x2apic feature. Then, the error
report becomes the following:

Unexpected error in apic_common_set_id() at ../hw/intc/apic_common.c:448:
qemu-system-i386: APIC ID 255 requires x2APIC feature in CPU
Try x2apic=on in -cpu.
Aborted (core dumped)

Note that since @errp is &error_abort, error_append_hint() can't be
applied to @errp directly. To keep the exact error message separate
from the (perhaps effective) hint, the hint still has to be added via
error_append_hint(). Therefore, introduce @local_err in
apic_common_set_id() to carry both the error message and the error
hint.

Suggested-by: Philippe Mathieu-Daudé 
Signed-off-by: Zhao Liu 
---
 hw/intc/apic_common.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/intc/apic_common.c b/hw/intc/apic_common.c
index d8fc1e2815fe..c13cdd79943d 100644
--- a/hw/intc/apic_common.c
+++ b/hw/intc/apic_common.c
@@ -433,6 +433,7 @@ static void apic_common_set_id(Object *obj, Visitor *v, 
const char *name,
 APICCommonState *s = APIC_COMMON(obj);
 DeviceState *dev = DEVICE(obj);
 uint32_t value;
+Error *local_err = NULL;
 
 if (dev->realized) {
 qdev_prop_set_after_realize(dev, name, errp);
@@ -444,7 +445,11 @@ static void apic_common_set_id(Object *obj, Visitor *v, 
const char *name,
 }
 
 if (value >= 255 && !cpu_has_x2apic_feature(&s->cpu->env)) {
-error_setg(errp, "APIC ID %d requires x2APIC feature in CPU", value);
+error_setg(&local_err,
+   "APIC ID %d requires x2APIC feature in CPU",
+   value);
+error_append_hint(&local_err, "Try x2apic=on in -cpu.\n");
+error_propagate(errp, local_err);
 return;
 }
 
-- 
2.34.1

RE: [PATCH v7 6/7] migration/multifd: implement qpl compression and decompression

2024-06-06 Thread Fabiano Rosas

"Liu, Yuan1"  writes:

>> -Original Message-
>> From: Fabiano Rosas 
>> Sent: Thursday, June 6, 2024 6:26 AM
>> To: Liu, Yuan1 ; pet...@redhat.com;
>> pbonz...@redhat.com; marcandre.lur...@redhat.com; berra...@redhat.com;
>> th...@redhat.com; phi...@linaro.org
>> Cc: qemu-devel@nongnu.org; Liu, Yuan1 ; Zou, Nanhai
>> ; shameerali.kolothum.th...@huawei.com
>> Subject: Re: [PATCH v7 6/7] migration/multifd: implement qpl compression
>> and decompression
>> 
>> Yuan Liu  writes:
>> 
>> > QPL compression and decompression will use IAA hardware first.
>> > If IAA hardware is not available, it will automatically fall
>> > back to QPL software path, if the software job also fails,
>> > the uncompressed page is sent directly.
>> >
>> > Signed-off-by: Yuan Liu 
>> > Reviewed-by: Nanhai Zou 
>> > ---
>> >  migration/multifd-qpl.c | 412 +++-
>> >  1 file changed, 408 insertions(+), 4 deletions(-)
>> >
>> > diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
>> > index 6791a204d5..18b3384bd5 100644
>> > --- a/migration/multifd-qpl.c
>> > +++ b/migration/multifd-qpl.c
>> > @@ -13,9 +13,14 @@
>> >  #include "qemu/osdep.h"
>> >  #include "qemu/module.h"
>> >  #include "qapi/error.h"
>> > +#include "qapi/qapi-types-migration.h"
>> > +#include "exec/ramblock.h"
>> >  #include "multifd.h"
>> >  #include "qpl/qpl.h"
>> >
>> > +/* Maximum number of retries to resubmit a job if IAA work queues are
>> full */
>> > +#define MAX_SUBMIT_RETRY_NUM (3)
>> > +
>> >  typedef struct {
>> >  /* the QPL hardware path job */
>> >  qpl_job *job;
>> > @@ -260,6 +265,219 @@ static void
>> multifd_qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
>> >  p->iov = NULL;
>> >  }
>> >
>> > +/**
>> > + * multifd_qpl_prepare_job: prepare the job
>> > + *
>> > + * Set the QPL job parameters and properties.
>> > + *
>> > + * @job: pointer to the qpl_job structure
>> > + * @is_compression: indicates compression and decompression
>> > + * @input: pointer to the input data buffer
>> > + * @input_len: the length of the input data
>> > + * @output: pointer to the output data buffer
>> > + * @output_len: the length of the output data
>> > + */
>> > +static void multifd_qpl_prepare_job(qpl_job *job, bool is_compression,
>> > +uint8_t *input, uint32_t input_len,
>> > +uint8_t *output, uint32_t
>> output_len)
>> > +{
>> > +job->op = is_compression ? qpl_op_compress : qpl_op_decompress;
>> > +job->next_in_ptr = input;
>> > +job->next_out_ptr = output;
>> > +job->available_in = input_len;
>> > +job->available_out = output_len;
>> > +job->flags = QPL_FLAG_FIRST | QPL_FLAG_LAST | QPL_FLAG_OMIT_VERIFY;
>> > +/* only supports compression level 1 */
>> > +job->level = 1;
>> > +}
>> > +
>> > +/**
>> > + * multifd_qpl_prepare_job: prepare the compression job
>> 
>> function name is wrong
>
> Thanks, I will fix this next version.
>  
>> > + *
>> > + * Set the compression job parameters and properties.
>> > + *
>> > + * @job: pointer to the qpl_job structure
>> > + * @input: pointer to the input data buffer
>> > + * @input_len: the length of the input data
>> > + * @output: pointer to the output data buffer
>> > + * @output_len: the length of the output data
>> > + */
>> > +static void multifd_qpl_prepare_comp_job(qpl_job *job, uint8_t *input,
>> > + uint32_t input_len, uint8_t
>> *output,
>> > + uint32_t output_len)
>> > +{
>> > +multifd_qpl_prepare_job(job, true, input, input_len, output,
>> output_len);
>> > +}
>> > +
>> > +/**
>> > + * multifd_qpl_prepare_job: prepare the decompression job
>
> Thanks, I will fix this next version.
>  
>> > + *
>> > + * Set the decompression job parameters and properties.
>> > + *
>> > + * @job: pointer to the qpl_job structure
>> > + * @input: pointer to the input data buffer
>> > + * @input_len: the length of the input data
>> > + * @output: pointer to the output data buffer
>> > + * @output_len: the length of the output data
>> > + */
>> > +static void multifd_qpl_prepare_decomp_job(qpl_job *job, uint8_t
>> *input,
>> > +   uint32_t input_len, uint8_t
>> *output,
>> > +   uint32_t output_len)
>> > +{
>> > +multifd_qpl_prepare_job(job, false, input, input_len, output,
>> output_len);
>> > +}
>> > +
>> > +/**
>> > + * multifd_qpl_fill_iov: fill in the IOV
>> > + *
>> > + * Fill in the QPL packet IOV
>> > + *
>> > + * @p: Params for the channel being used
>> > + * @data: pointer to the IOV data
>> > + * @len: The length of the IOV data
>> > + */
>> > +static void multifd_qpl_fill_iov(MultiFDSendParams *p, uint8_t *data,
>> > + uint32_t len)
>> > +{
>> > +p->iov[p->iovs_num].iov_base = data;
>> > +p->iov[p->iovs_num].iov_len = len;
>> > +p->iovs_num++;
>> >
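
The fallback order described in the commit message (hardware path,
then software path, then sending the page uncompressed) can be
sketched as below.  This is a toy model: it does not call the QPL
library, and toy_compress_fn, toy_path and the stub compressors are
hypothetical names introduced only to show the control flow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef bool (*toy_compress_fn)(const uint8_t *in, size_t in_len,
                                uint8_t *out, size_t out_cap,
                                size_t *out_len);

typedef enum { TOY_SENT_HW, TOY_SENT_SW, TOY_SENT_RAW } toy_path;

/*
 * Try the hardware compressor, then the software one; if both fail,
 * copy the page through unchanged.  out_cap must be at least in_len
 * so the raw fallback always fits.
 */
static toy_path toy_compress_page(toy_compress_fn hw, toy_compress_fn sw,
                                  const uint8_t *in, size_t in_len,
                                  uint8_t *out, size_t out_cap,
                                  size_t *out_len)
{
    assert(in_len <= out_cap);

    if (hw && hw(in, in_len, out, out_cap, out_len)) {
        return TOY_SENT_HW;
    }
    if (sw && sw(in, in_len, out, out_cap, out_len)) {
        return TOY_SENT_SW;
    }
    memcpy(out, in, in_len);        /* send the page uncompressed */
    *out_len = in_len;
    return TOY_SENT_RAW;
}

/* Stub that always fails, e.g. "device not available". */
static bool toy_stub_fail(const uint8_t *in, size_t in_len,
                          uint8_t *out, size_t out_cap, size_t *out_len)
{
    (void)in; (void)in_len; (void)out; (void)out_cap; (void)out_len;
    return false;
}

/* Stub that pretends to compress everything down to one byte. */
static bool toy_stub_ok(const uint8_t *in, size_t in_len,
                        uint8_t *out, size_t out_cap, size_t *out_len)
{
    if (in_len == 0 || out_cap == 0) {
        return false;
    }
    out[0] = in[0];
    *out_len = 1;
    return true;
}
```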

[PATCH v4 5/6] target/riscv: Reserve exception codes for sw-check and hw-err

2024-06-06 Thread Fea.Wang

Based on priv-1.13.0, add the exception codes for Software-check and
Hardware-error.

Signed-off-by: Fea.Wang 
Reviewed-by: Frank Chang 
Reviewed-by: LIU Zhiwei 
Reviewed-by: Alistair Francis 
---
 target/riscv/cpu_bits.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/target/riscv/cpu_bits.h b/target/riscv/cpu_bits.h
index 096a51b331..c257c5ed7d 100644
--- a/target/riscv/cpu_bits.h
+++ b/target/riscv/cpu_bits.h
@@ -673,6 +673,8 @@ typedef enum RISCVException {
 RISCV_EXCP_INST_PAGE_FAULT = 0xc, /* since: priv-1.10.0 */
 RISCV_EXCP_LOAD_PAGE_FAULT = 0xd, /* since: priv-1.10.0 */
 RISCV_EXCP_STORE_PAGE_FAULT = 0xf, /* since: priv-1.10.0 */
+RISCV_EXCP_SW_CHECK = 0x12, /* since: priv-1.13.0 */
+RISCV_EXCP_HW_ERR = 0x13, /* since: priv-1.13.0 */
 RISCV_EXCP_INST_GUEST_PAGE_FAULT = 0x14,
 RISCV_EXCP_LOAD_GUEST_ACCESS_FAULT = 0x15,
 RISCV_EXCP_VIRT_INSTRUCTION_FAULT = 0x16,
-- 
2.34.1

[PATCH v4 1/6] target/riscv: Reuse the conversion function of priv_spec

2024-06-06 Thread Fea.Wang

From: Jim Shu 

Make the priv_spec conversion function public in cpu.h so that
tcg-cpu.c can also use it.

Signed-off-by: Jim Shu 
Signed-off-by: Fea.Wang 
Reviewed-by: Frank Chang 
Reviewed-by: LIU Zhiwei 
Reviewed-by: Alistair Francis 
---
 target/riscv/cpu.c |  2 +-
 target/riscv/cpu.h |  1 +
 target/riscv/tcg/tcg-cpu.c | 13 -
 3 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index 69a08e8c2c..fd0f09c468 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -1790,7 +1790,7 @@ static int priv_spec_from_str(const char *priv_spec_str)
 return priv_version;
 }
 
-static const char *priv_spec_to_str(int priv_version)
+const char *priv_spec_to_str(int priv_version)
 {
 switch (priv_version) {
 case PRIV_VERSION_1_10_0:
diff --git a/target/riscv/cpu.h b/target/riscv/cpu.h
index 6fe0d712b4..b4c9e13774 100644
--- a/target/riscv/cpu.h
+++ b/target/riscv/cpu.h
@@ -830,4 +830,5 @@ const char *satp_mode_str(uint8_t satp_mode, bool 
is_32_bit);
 /* Implemented in th_csr.c */
 void th_register_custom_csrs(RISCVCPU *cpu);
 
+const char *priv_spec_to_str(int priv_version);
 #endif /* RISCV_CPU_H */
diff --git a/target/riscv/tcg/tcg-cpu.c b/target/riscv/tcg/tcg-cpu.c
index fa8a17cc60..4c6141f947 100644
--- a/target/riscv/tcg/tcg-cpu.c
+++ b/target/riscv/tcg/tcg-cpu.c
@@ -76,16 +76,11 @@ static void riscv_cpu_write_misa_bit(RISCVCPU *cpu, 
uint32_t bit,
 
 static const char *cpu_priv_ver_to_str(int priv_ver)
 {
-switch (priv_ver) {
-case PRIV_VERSION_1_10_0:
-return "v1.10.0";
-case PRIV_VERSION_1_11_0:
-return "v1.11.0";
-case PRIV_VERSION_1_12_0:
-return "v1.12.0";
-}
+const char *priv_spec_str = priv_spec_to_str(priv_ver);
 
-g_assert_not_reached();
+g_assert(priv_spec_str);
+
+return priv_spec_str;
 }
 
 static void riscv_cpu_synchronize_from_tb(CPUState *cs,
-- 
2.34.1

[PATCH v4 2/6] target/riscv: Define macros and variables for ss1p13

2024-06-06 Thread Fea.Wang

Add macros and variables for RISC-V privilege 1.13 support.

Signed-off-by: Fea.Wang 
Reviewed-by: Frank Chang 
Reviewed-by: Weiwei Li 
Reviewed-by: LIU Zhiwei 
Reviewed-by: Alistair Francis 
---
 target/riscv/cpu.h | 4 +++-
 target/riscv/cpu_cfg.h | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/target/riscv/cpu.h b/target/riscv/cpu.h
index b4c9e13774..90b8f1b08f 100644
--- a/target/riscv/cpu.h
+++ b/target/riscv/cpu.h
@@ -96,12 +96,14 @@ extern RISCVCPUProfile *riscv_profiles[];
 #define PRIV_VER_1_10_0_STR "v1.10.0"
 #define PRIV_VER_1_11_0_STR "v1.11.0"
 #define PRIV_VER_1_12_0_STR "v1.12.0"
+#define PRIV_VER_1_13_0_STR "v1.13.0"
 enum {
 PRIV_VERSION_1_10_0 = 0,
 PRIV_VERSION_1_11_0,
 PRIV_VERSION_1_12_0,
+PRIV_VERSION_1_13_0,
 
-PRIV_VERSION_LATEST = PRIV_VERSION_1_12_0,
+PRIV_VERSION_LATEST = PRIV_VERSION_1_13_0,
 };
 
 #define VEXT_VERSION_1_00_0 0x0001
diff --git a/target/riscv/cpu_cfg.h b/target/riscv/cpu_cfg.h
index e1e4f32698..fb7eebde52 100644
--- a/target/riscv/cpu_cfg.h
+++ b/target/riscv/cpu_cfg.h
@@ -136,6 +136,7 @@ struct RISCVCPUConfig {
  * TCG always implement/can't be user disabled,
  * based on spec version.
  */
+bool has_priv_1_13;
 bool has_priv_1_12;
 bool has_priv_1_11;
 
-- 
2.34.1

[PATCH v4 4/6] target/riscv: Add MEDELEGH, HEDELEGH csrs for RV32

2024-06-06 Thread Fea.Wang

According to privileged spec 1.13, RV32 must implement MEDELEGH and
HEDELEGH: exception codes 32-47 are reserved and exception codes 48-63
are designated for custom use. Add the CSR numbers; for now the
implementation reads as zero and ignores writes. In addition, access to
HEDELEGH is controlled by the 'P1P13' bit in mstateen0.

Signed-off-by: Fea.Wang 
Reviewed-by: Frank Chang 
Reviewed-by: LIU Zhiwei 
Reviewed-by: Alistair Francis 
---
 target/riscv/cpu_bits.h |  2 ++
 target/riscv/csr.c  | 31 +++
 2 files changed, 33 insertions(+)

diff --git a/target/riscv/cpu_bits.h b/target/riscv/cpu_bits.h
index c895aa0334..096a51b331 100644
--- a/target/riscv/cpu_bits.h
+++ b/target/riscv/cpu_bits.h
@@ -156,6 +156,8 @@
 
 /* 32-bit only */
 #define CSR_MSTATUSH0x310
+#define CSR_MEDELEGH0x312
+#define CSR_HEDELEGH0x612
 
 /* Machine Trap Handling */
 #define CSR_MSCRATCH0x340
diff --git a/target/riscv/csr.c b/target/riscv/csr.c
index a19e1afa1f..6f15612e76 100644
--- a/target/riscv/csr.c
+++ b/target/riscv/csr.c
@@ -3229,6 +3229,33 @@ static RISCVException write_hedeleg(CPURISCVState *env, int csrno,
 return RISCV_EXCP_NONE;
 }
 
+static RISCVException read_hedelegh(CPURISCVState *env, int csrno,
+   target_ulong *val)
+{
+RISCVException ret;
+ret = smstateen_acc_ok(env, 0, SMSTATEEN0_P1P13);
+if (ret != RISCV_EXCP_NONE) {
+return ret;
+}
+
+/* Reserved, now read zero */
+*val = 0;
+return RISCV_EXCP_NONE;
+}
+
+static RISCVException write_hedelegh(CPURISCVState *env, int csrno,
+target_ulong val)
+{
+RISCVException ret;
+ret = smstateen_acc_ok(env, 0, SMSTATEEN0_P1P13);
+if (ret != RISCV_EXCP_NONE) {
+return ret;
+}
+
+/* Reserved, now write ignore */
+return RISCV_EXCP_NONE;
+}
+
 static RISCVException rmw_hvien64(CPURISCVState *env, int csrno,
 uint64_t *ret_val,
 uint64_t new_val, uint64_t wr_mask)
@@ -4633,6 +4660,10 @@ riscv_csr_operations csr_ops[CSR_TABLE_SIZE] = {
 
 [CSR_MSTATUSH]= { "mstatush",   any32, read_mstatush,
   write_mstatush   },
+[CSR_MEDELEGH]= { "medelegh",   any32, read_zero, write_ignore,
+  .min_priv_ver = PRIV_VERSION_1_13_0  },
+[CSR_HEDELEGH]= { "hedelegh",   hmode32, read_hedelegh, write_hedelegh,
+  .min_priv_ver = PRIV_VERSION_1_13_0  },
 
 /* Machine Trap Handling */
 [CSR_MSCRATCH] = { "mscratch", any,  read_mscratch, write_mscratch,
-- 
2.34.1
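
The gating described in the commit message can be sketched in isolation.
This is a toy model, not the QEMU implementation: ToyEnv, ToyExcp and
toy_stateen_acc_ok() are hypothetical names, and the access check only
looks at the mstateen0 bit, whereas the real smstateen_acc_ok() helper
takes more state into account.

```c
#include <assert.h>
#include <stdint.h>

#define SMSTATEEN0_P1P13 (1ULL << 56)

/* Hypothetical stand-ins for RISCV_EXCP_NONE and an illegal-instruction trap. */
typedef enum {
    TOY_EXCP_NONE = -1,
    TOY_EXCP_ILLEGAL_INST = 2,
} ToyExcp;

/* Hypothetical, reduced CPU state: only mstateen0 matters here. */
typedef struct {
    uint64_t mstateen0;
} ToyEnv;

/* Assumption: access is allowed only when the given mstateen0 bit is set. */
static ToyExcp toy_stateen_acc_ok(const ToyEnv *env, uint64_t bit)
{
    return (env->mstateen0 & bit) ? TOY_EXCP_NONE : TOY_EXCP_ILLEGAL_INST;
}

/* Read: gated by P1P13, then always yields zero. */
ToyExcp toy_read_hedelegh(const ToyEnv *env, uint32_t *val)
{
    ToyExcp ret = toy_stateen_acc_ok(env, SMSTATEEN0_P1P13);

    if (ret != TOY_EXCP_NONE) {
        return ret;
    }
    *val = 0;
    return TOY_EXCP_NONE;
}

/* Write: gated by P1P13, then the value is silently dropped. */
ToyExcp toy_write_hedelegh(const ToyEnv *env, uint32_t val)
{
    ToyExcp ret = toy_stateen_acc_ok(env, SMSTATEEN0_P1P13);

    if (ret != TOY_EXCP_NONE) {
        return ret;
    }
    (void)val;
    return TOY_EXCP_NONE;
}
```

Note the asymmetry with MEDELEGH in the CSR table: it uses the generic
read_zero/write_ignore callbacks because it needs no stateen check.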

[PATCH v4 3/6] target/riscv: Add 'P1P13' bit in SMSTATEEN0

2024-06-06 Thread Fea.Wang

According to the privileged 1.13 spec, bit 56 of mstateen0 is 'P1P13',
which controls access to hedelegh.

Signed-off-by: Fea.Wang 
Reviewed-by: Frank Chang 
Reviewed-by: Weiwei Li 
Reviewed-by: Alistair Francis 
---
 target/riscv/cpu_bits.h | 1 +
 target/riscv/csr.c  | 8 
 2 files changed, 9 insertions(+)

diff --git a/target/riscv/cpu_bits.h b/target/riscv/cpu_bits.h
index a470fda9be..c895aa0334 100644
--- a/target/riscv/cpu_bits.h
+++ b/target/riscv/cpu_bits.h
@@ -315,6 +315,7 @@
 #define SMSTATEEN0_CS   (1ULL << 0)
 #define SMSTATEEN0_FCSR (1ULL << 1)
 #define SMSTATEEN0_JVT  (1ULL << 2)
+#define SMSTATEEN0_P1P13(1ULL << 56)
 #define SMSTATEEN0_HSCONTXT (1ULL << 57)
 #define SMSTATEEN0_IMSIC(1ULL << 58)
 #define SMSTATEEN0_AIA  (1ULL << 59)
diff --git a/target/riscv/csr.c b/target/riscv/csr.c
index ee33019b03..a19e1afa1f 100644
--- a/target/riscv/csr.c
+++ b/target/riscv/csr.c
@@ -2252,6 +2252,10 @@ static RISCVException write_mstateen0(CPURISCVState *env, int csrno,
 wr_mask |= SMSTATEEN0_FCSR;
 }
 
+if (env->priv_ver >= PRIV_VERSION_1_13_0) {
+wr_mask |= SMSTATEEN0_P1P13;
+}
+
 return write_mstateen(env, csrno, wr_mask, new_val);
 }
 
@@ -2287,6 +2291,10 @@ static RISCVException write_mstateen0h(CPURISCVState *env, int csrno,
 {
 uint64_t wr_mask = SMSTATEEN_STATEEN | SMSTATEEN0_HSENVCFG;
 
+if (env->priv_ver >= PRIV_VERSION_1_13_0) {
+wr_mask |= SMSTATEEN0_P1P13;
+}
+
 return write_mstateenh(env, csrno, wr_mask, new_val);
 }
 
-- 
2.34.1
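
To make the role of wr_mask concrete, here is a small standalone sketch
of a masked CSR write, including the RV32 case where the upper half is
written through a separate 32-bit CSR. The helper names
toy_masked_write64() and toy_masked_write_high32() are hypothetical and
are not the QEMU functions; they only illustrate why bit 56 has to be
added to the mask in both the full-width and the high-half write paths.

```c
#include <assert.h>
#include <stdint.h>

#define SMSTATEEN0_P1P13 (1ULL << 56)

/* Masked update of a 64-bit register: only bits set in wr_mask may change. */
uint64_t toy_masked_write64(uint64_t old, uint64_t wr_mask, uint64_t new_val)
{
    return (old & ~wr_mask) | (new_val & wr_mask);
}

/*
 * RV32 view: a 32-bit write to the high-half CSR lands in bits 63:32,
 * while the low half of the register is carried over unchanged.
 */
uint64_t toy_masked_write_high32(uint64_t old, uint64_t wr_mask,
                                 uint32_t new_val)
{
    uint64_t val = ((uint64_t)new_val << 32) | (old & 0xFFFFFFFFULL);

    return toy_masked_write64(old, wr_mask, val);
}
```

In the high-half view, bit 56 of the 64-bit register is bit 24 of the
32-bit value being written. Without SMSTATEEN0_P1P13 in the mask the bit
stays at zero regardless of what the guest writes.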

[PATCH v4 0/6] target/riscv: Support RISC-V privilege 1.13 spec

2024-06-06 Thread Fea.Wang

Based on the change log for the RISC-V privilege 1.13 spec, add
support for ss1p13.

base-commit: 7a2356147f3a5faebf95dba4140247ec6e5607b1

[v4]
* Reorder commits

[v3]
* Correct the mstateen0 for P1P13 in commit message
* Refactor commit by splitting to two commits

[v2]
* Check HEDELEGH by hmode32 instead of any32
* Remove unnecessary code
* Refine calling functions

[v1]

Ref:https://github.com/riscv/riscv-isa-manual/blob/a7d93c9/src/priv-preface.adoc?plain=1#L40-L72

The list below covers the functional changes only, leaving out
clarifications and document-format changes:
* Redefined misa.MXL to be read-only, making MXLEN a constant.
  (Skip, implementation ignored)
* Added the constraint that SXLEN≥UXLEN.
  (Skip, implementation ignored)
* Defined the misa.V field to reflect that the V extension has been
  implemented.
  (Skip, existed)
* Defined the RV32-only medelegh and hedelegh CSRs.
  (Done in these patches)
* Defined the misaligned atomicity granule PMA, superseding the proposed
  Zam extension.
  (Skip, implementation ignored)
* Allocated interrupt 13 for Sscofpmf LCOFI interrupt.
  (Skip, existed)
* Defined hardware error and software check exception codes.
  (Done in these patches)
* Specified synchronization requirements when changing the PBMTE fields
  in menvcfg and henvcfg.
  (Skip, implementation ignored)
* Incorporated Svade and Svadu extension specifications.
  (Skip, existed)

Fea.Wang (5):
  target/riscv: Define macros and variables for ss1p13
  target/riscv: Add 'P1P13' bit in SMSTATEEN0
  target/riscv: Add MEDELEGH, HEDELEGH csrs for RV32
  target/riscv: Reserve exception codes for sw-check and hw-err
  target/riscv: Support the version for ss1p13

Jim Shu (1):
  target/riscv: Reuse the conversion function of priv_spec

 target/riscv/cpu.c |  8 ++--
 target/riscv/cpu.h |  5 -
 target/riscv/cpu_bits.h|  5 +
 target/riscv/cpu_cfg.h |  1 +
 target/riscv/csr.c | 39 ++
 target/riscv/tcg/tcg-cpu.c | 17 -
 6 files changed, 63 insertions(+), 12 deletions(-)

-- 
2.34.1

[PATCH v4 6/6] target/riscv: Support the version for ss1p13

2024-06-06 Thread Fea.Wang

Add RISC-V privilege 1.13 support.

Signed-off-by: Fea.Wang 
Signed-off-by: Fea.Wang 
Reviewed-by: Frank Chang 
Reviewed-by: Weiwei Li 
Reviewed-by: LIU Zhiwei 
---
 target/riscv/cpu.c | 6 +-
 target/riscv/tcg/tcg-cpu.c | 4 
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index fd0f09c468..4760cb2cc1 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -1779,7 +1779,9 @@ static int priv_spec_from_str(const char *priv_spec_str)
 {
 int priv_version = -1;
 
-if (!g_strcmp0(priv_spec_str, PRIV_VER_1_12_0_STR)) {
+if (!g_strcmp0(priv_spec_str, PRIV_VER_1_13_0_STR)) {
+priv_version = PRIV_VERSION_1_13_0;
+} else if (!g_strcmp0(priv_spec_str, PRIV_VER_1_12_0_STR)) {
 priv_version = PRIV_VERSION_1_12_0;
 } else if (!g_strcmp0(priv_spec_str, PRIV_VER_1_11_0_STR)) {
 priv_version = PRIV_VERSION_1_11_0;
@@ -1799,6 +1801,8 @@ const char *priv_spec_to_str(int priv_version)
 return PRIV_VER_1_11_0_STR;
 case PRIV_VERSION_1_12_0:
 return PRIV_VER_1_12_0_STR;
+case PRIV_VERSION_1_13_0:
+return PRIV_VER_1_13_0_STR;
 default:
 return NULL;
 }
diff --git a/target/riscv/tcg/tcg-cpu.c b/target/riscv/tcg/tcg-cpu.c
index 4c6141f947..eb6f7b9d12 100644
--- a/target/riscv/tcg/tcg-cpu.c
+++ b/target/riscv/tcg/tcg-cpu.c
@@ -318,6 +318,10 @@ static void riscv_cpu_update_named_features(RISCVCPU *cpu)
 cpu->cfg.has_priv_1_12 = true;
 }
 
+if (cpu->env.priv_ver >= PRIV_VERSION_1_13_0) {
+cpu->cfg.has_priv_1_13 = true;
+}
+
 /* zic64b is 1.12 or later */
 cpu->cfg.ext_zic64b = cpu->cfg.cbom_blocksize == 64 &&
   cpu->cfg.cbop_blocksize == 64 &&
-- 
2.34.1
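
As a cross-check of what "supporting the version" means here, the sketch
below round-trips the version string and derives the named feature flags
from the enum ordering. It is a table-driven variant written for
illustration, not the if/else chain from the patch; the toy_ names and
the ToyNamedFeatures struct are hypothetical, and strcmp() with an
explicit NULL check stands in for g_strcmp0().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum {
    PRIV_VERSION_1_10_0 = 0,
    PRIV_VERSION_1_11_0,
    PRIV_VERSION_1_12_0,
    PRIV_VERSION_1_13_0,

    PRIV_VERSION_LATEST = PRIV_VERSION_1_13_0,
};

static const char *const toy_priv_ver_names[] = {
    [PRIV_VERSION_1_10_0] = "v1.10.0",
    [PRIV_VERSION_1_11_0] = "v1.11.0",
    [PRIV_VERSION_1_12_0] = "v1.12.0",
    [PRIV_VERSION_1_13_0] = "v1.13.0",
};

/* Version enum to string; NULL for an unknown version. */
const char *toy_priv_spec_to_str(int priv_version)
{
    if (priv_version < 0 || priv_version > PRIV_VERSION_LATEST) {
        return NULL;
    }
    return toy_priv_ver_names[priv_version];
}

/* String to version enum; -1 for an unknown or missing string. */
int toy_priv_spec_from_str(const char *priv_spec_str)
{
    if (priv_spec_str == NULL) {
        return -1;
    }
    for (int v = PRIV_VERSION_1_10_0; v <= PRIV_VERSION_LATEST; v++) {
        if (strcmp(priv_spec_str, toy_priv_ver_names[v]) == 0) {
            return v;
        }
    }
    return -1;
}

/* Hypothetical reduced view of the has_priv_* flags in RISCVCPUConfig. */
typedef struct {
    bool has_priv_1_11;
    bool has_priv_1_12;
    bool has_priv_1_13;
} ToyNamedFeatures;

/* Because the enum is ordered, each flag is a simple >= comparison. */
ToyNamedFeatures toy_named_features(int priv_ver)
{
    ToyNamedFeatures f = {
        .has_priv_1_11 = priv_ver >= PRIV_VERSION_1_11_0,
        .has_priv_1_12 = priv_ver >= PRIV_VERSION_1_12_0,
        .has_priv_1_13 = priv_ver >= PRIV_VERSION_1_13_0,
    };

    return f;
}
```

This also shows why the patch is ordered last in the series: once the
"v1.13.0" string parses, every ">= PRIV_VERSION_1_13_0" check added by
the earlier patches becomes reachable.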

Re: [PATCH v3 3/6] target/riscv: Support the version for ss1p13

2024-06-06 Thread Fea Wang

Sure, I will reorder the commits in the next patch series.
Thank you

Sincerely,
Fea

On Thu, Jun 6, 2024 at 7:58 AM Alistair Francis 
wrote:

> On Tue, Jun 4, 2024 at 4:23 PM Fea.Wang  wrote:
> >
> > Add RISC-V privilege 1.13 support.
> >
> > Signed-off-by: Fea.Wang 
> > Signed-off-by: Fea.Wang 
> > Reviewed-by: Frank Chang 
> > Reviewed-by: Weiwei Li 
> > Reviewed-by: LIU Zhiwei 
>
> This should be the last patch in the series. The idea is that we add
> support and then let users enable it.
>
> Alistair
>
> > ---
> >  target/riscv/cpu.c | 6 +-
> >  target/riscv/tcg/tcg-cpu.c | 4 
> >  2 files changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
> > index e9e69b9863..02c1e12a03 100644
> > --- a/target/riscv/cpu.c
> > +++ b/target/riscv/cpu.c
> > @@ -1775,7 +1775,9 @@ static int priv_spec_from_str(const char *priv_spec_str)
> >  {
> >  int priv_version = -1;
> >
> > -if (!g_strcmp0(priv_spec_str, PRIV_VER_1_12_0_STR)) {
> > +if (!g_strcmp0(priv_spec_str, PRIV_VER_1_13_0_STR)) {
> > +priv_version = PRIV_VERSION_1_13_0;
> > +} else if (!g_strcmp0(priv_spec_str, PRIV_VER_1_12_0_STR)) {
> >  priv_version = PRIV_VERSION_1_12_0;
> >  } else if (!g_strcmp0(priv_spec_str, PRIV_VER_1_11_0_STR)) {
> >  priv_version = PRIV_VERSION_1_11_0;
> > @@ -1795,6 +1797,8 @@ const char *priv_spec_to_str(int priv_version)
> >  return PRIV_VER_1_11_0_STR;
> >  case PRIV_VERSION_1_12_0:
> >  return PRIV_VER_1_12_0_STR;
> > +case PRIV_VERSION_1_13_0:
> > +return PRIV_VER_1_13_0_STR;
> >  default:
> >  return NULL;
> >  }
> > diff --git a/target/riscv/tcg/tcg-cpu.c b/target/riscv/tcg/tcg-cpu.c
> > index 60fe0fd060..595d3b5b8f 100644
> > --- a/target/riscv/tcg/tcg-cpu.c
> > +++ b/target/riscv/tcg/tcg-cpu.c
> > @@ -318,6 +318,10 @@ static void riscv_cpu_update_named_features(RISCVCPU *cpu)
> >  cpu->cfg.has_priv_1_12 = true;
> >  }
> >
> > +if (cpu->env.priv_ver >= PRIV_VERSION_1_13_0) {
> > +cpu->cfg.has_priv_1_13 = true;
> > +}
> > +
> >  /* zic64b is 1.12 or later */
> >  cpu->cfg.ext_zic64b = cpu->cfg.cbom_blocksize == 64 &&
> >cpu->cfg.cbop_blocksize == 64 &&
> > --
> > 2.34.1
> >
> >
>

Re: [PATCH v2 0/3] semihosting: Restrict to TCG

2024-06-06 Thread Anton Johansson via

On 06/06/24, Philippe Mathieu-Daudé wrote:
> Kind ping :)

I'm off today, I'll take a look tomorrow morning!:)
//Anton

Re: [PATCH 3/6] io/channel-rdma: support working in coroutine

2024-06-06 Thread Haris Iqbal

On Tue, Jun 4, 2024 at 2:14 PM Gonglei  wrote:
>
> From: Jialin Wang 
>
> It is not feasible to obtain RDMA completion queue notifications
> through poll/ppoll on the rsocket fd. Therefore, we create a thread
> named rpoller for each rsocket fd and two eventfds: pollin_eventfd
> and pollout_eventfd.
>
> When io_create_watch or io_set_aio_fd_handler waits for POLLIN
> or POLLOUT events, it will actually poll/ppoll on the pollin_eventfd
> and pollout_eventfd instead of the rsocket fd.
>
> The rpoller calls rpoll() on the rsocket fd to receive POLLIN and
> POLLOUT events.
> When a POLLIN event occurs, the rpoller writes the pollin_eventfd,
> and then poll/ppoll will return the POLLIN event.
> When a POLLOUT event occurs, the rpoller reads the pollout_eventfd,
> and then poll/ppoll will return the POLLOUT event.
>
> For a non-blocking rsocket fd, if rread/rwrite returns EAGAIN, it will
> read/write the pollin/pollout_eventfd, preventing poll/ppoll from
> returning POLLIN/POLLOUT events.
>
> Known limitations:
>
>   For a blocking rsocket fd, if we use io_create_watch to wait for
>   POLLIN or POLLOUT events, since the rsocket fd is blocking, we
>   cannot determine when it is not ready to read/write as we can with
>   non-blocking fds. Therefore, once an event occurs, it will keep
>   occurring, potentially leaving QEMU hanging. So we need to be cautious
>   to avoid hanging when using io_create_watch.
>
> Luckily, channel-rdma works well in coroutines :)
>
> Signed-off-by: Jialin Wang 
> Signed-off-by: Gonglei 
> ---
>  include/io/channel-rdma.h |  15 +-
>  io/channel-rdma.c | 363 +-
>  2 files changed, 376 insertions(+), 2 deletions(-)
>
> diff --git a/include/io/channel-rdma.h b/include/io/channel-rdma.h
> index 8cab2459e5..cb56127d76 100644
> --- a/include/io/channel-rdma.h
> +++ b/include/io/channel-rdma.h
> @@ -47,6 +47,18 @@ struct QIOChannelRDMA {
>  socklen_t localAddrLen;
>  struct sockaddr_storage remoteAddr;
>  socklen_t remoteAddrLen;
> +
> +/* private */
> +
> +/* qemu g_poll/ppoll() POLLIN event on it */
> +int pollin_eventfd;
> +/* qemu g_poll/ppoll() POLLOUT event on it */
> +int pollout_eventfd;
> +
> +/* the index in the rpoller's fds array */
> +int index;
> +/* rpoller will rpoll() rpoll_events on the rsocket fd */
> +short int rpoll_events;
>  };
>
>  /**
> @@ -147,6 +159,7 @@ void qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress *addr,
>   *
>   * Returns: the new client channel, or NULL on error
>   */
> -QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *ioc, Error **errp);
> +QIOChannelRDMA *coroutine_mixed_fn qio_channel_rdma_accept(QIOChannelRDMA *ioc,
> +   Error **errp);
>
>  #endif /* QIO_CHANNEL_RDMA_H */
> diff --git a/io/channel-rdma.c b/io/channel-rdma.c
> index 92c362df52..9792add5cf 100644
> --- a/io/channel-rdma.c
> +++ b/io/channel-rdma.c
> @@ -23,10 +23,15 @@
>
>  #include "qemu/osdep.h"
>  #include "io/channel-rdma.h"
> +#include "io/channel-util.h"
> +#include "io/channel-watch.h"
>  #include "io/channel.h"
>  #include "qapi/clone-visitor.h"
>  #include "qapi/error.h"
>  #include "qapi/qapi-visit-sockets.h"
> +#include "qemu/atomic.h"
> +#include "qemu/error-report.h"
> +#include "qemu/thread.h"
>  #include "trace.h"
>  #include 
>  #include 
> @@ -39,11 +44,274 @@
>  #include 
>  #include 
>
> +typedef enum {
> +CLEAR_POLLIN,
> +CLEAR_POLLOUT,
> +SET_POLLIN,
> +SET_POLLOUT,
> +} UpdateEvent;
> +
> +typedef enum {
> +RP_CMD_ADD_IOC,
> +RP_CMD_DEL_IOC,
> +RP_CMD_UPDATE,
> +} RpollerCMD;
> +
> +typedef struct {
> +RpollerCMD cmd;
> +QIOChannelRDMA *rioc;
> +} RpollerMsg;
> +
> +/*
> + * rpoll() on the rsocket fd with rpoll_events, when POLLIN/POLLOUT event
> + * occurs, it will write/read the pollin_eventfd/pollout_eventfd to allow
> + * qemu g_poll/ppoll() get the POLLIN/POLLOUT event
> + */
> +static struct Rpoller {
> +QemuThread thread;
> +bool is_running;
> +int sock[2];
> +int count; /* the number of rsocket fds being rpoll() */
> +int size; /* the size of fds/riocs */
> +struct pollfd *fds;
> +QIOChannelRDMA **riocs;
> +} rpoller;
> +
> +static void qio_channel_rdma_notify_rpoller(QIOChannelRDMA *rioc,
> +RpollerCMD cmd)
> +{
> +RpollerMsg msg;
> +int ret;
> +
> +msg.cmd = cmd;
> +msg.rioc = rioc;
> +
> +ret = RETRY_ON_EINTR(write(rpoller.sock[0], &msg, sizeof msg));
> +if (ret != sizeof msg) {
> +error_report("%s: failed to send msg, errno: %d", __func__, errno);
> +}
> +}
> +
> +static void qio_channel_rdma_update_poll_event(QIOChannelRDMA *rioc,
> +   UpdateEvent action,
> +   bool notify_rpoller)
> +{
> +/* An eventfd with the value of ULLONG_MAX - 1 is re

Re: [RFC PATCH v2 4/5] Revert "meson: Propagate gnutls dependency"

2024-06-06 Thread Philippe Mathieu-Daudé


On 27/5/24 12:49, Paolo Bonzini wrote:

From: Akihiko Odaki 

This reverts commit 3eacf70bb5a83e4775ad8003cbca63a40f70c8c2.

It was only needed because of duplicate objects caused by
declare_dependency(link_whole: ...), and can be dropped now
that meson.build specifies objects and dependencies separately
for the internal dependencies.

Signed-off-by: Akihiko Odaki 
Message-ID: <20240524-objects-v1-2-07cbbe961...@daynix.com>
Signed-off-by: Paolo Bonzini 


Reviewed-by: Philippe Mathieu-Daudé 


---
  meson.build| 4 ++--
  block/meson.build  | 2 +-
  io/meson.build | 2 +-
  storage-daemon/meson.build | 2 +-
  ui/meson.build | 2 +-
  5 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/meson.build b/meson.build
index 9772c145bdb..84dbd7fb371 100644
--- a/meson.build
+++ b/meson.build
@@ -3486,7 +3486,7 @@ if have_block
  'blockdev-nbd.c',
  'iothread.c',
  'job-qmp.c',
-  ), gnutls)
+  ))
  
# os-posix.c contains POSIX-specific functions used by qemu-storage-daemon,

# os-win32.c does not
@@ -4004,7 +4004,7 @@ if have_tools
   dependencies: [block, qemuutil], install: true)
qemu_nbd = executable('qemu-nbd', files('qemu-nbd.c'),
 link_args: '@block.syms', link_depends: block_syms,
-   dependencies: [blockdev, qemuutil, gnutls, selinux],
+   dependencies: [blockdev, qemuutil, selinux],
 install: true)
  
subdir('storage-daemon')

diff --git a/block/meson.build b/block/meson.build
index 158dc3b89db..f1262ec2ba8 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -39,7 +39,7 @@ block_ss.add(files(
'throttle.c',
'throttle-groups.c',
'write-threshold.c',
-), zstd, zlib, gnutls)
+), zstd, zlib)
  
  system_ss.add(when: 'CONFIG_TCG', if_true: files('blkreplay.c'))

  system_ss.add(files('block-ram-registrar.c'))
diff --git a/io/meson.build b/io/meson.build
index 283b9b2bdbd..1164812f912 100644
--- a/io/meson.build
+++ b/io/meson.build
@@ -13,4 +13,4 @@ io_ss.add(files(
'dns-resolver.c',
'net-listener.c',
'task.c',
-), gnutls)
+))
diff --git a/storage-daemon/meson.build b/storage-daemon/meson.build
index fd5e32f4b28..5e61a9d1bdf 100644
--- a/storage-daemon/meson.build
+++ b/storage-daemon/meson.build
@@ -1,6 +1,6 @@
  qsd_ss = ss.source_set()
  qsd_ss.add(files('qemu-storage-daemon.c'))
-qsd_ss.add(blockdev, chardev, qmp, qom, qemuutil, gnutls)
+qsd_ss.add(blockdev, chardev, qmp, qom, qemuutil)
  
  subdir('qapi')
  
diff --git a/ui/meson.build b/ui/meson.build
index cfbf29428df..28c7381dd10 100644
--- a/ui/meson.build
+++ b/ui/meson.build
@@ -44,7 +44,7 @@ vnc_ss.add(files(
'vnc-jobs.c',
'vnc-clipboard.c',
  ))
-vnc_ss.add(zlib, jpeg, gnutls)
+vnc_ss.add(zlib, jpeg)
  vnc_ss.add(when: sasl, if_true: files('vnc-auth-sasl.c'))
  system_ss.add_all(when: [vnc, pixman], if_true: vnc_ss)
  system_ss.add(when: vnc, if_false: files('vnc-stubs.c'))

Re: [RFC PATCH v2 3/5] meson: Pass objects and dependencies to declare_dependency()

2024-06-06 Thread Philippe Mathieu-Daudé


On 27/5/24 12:49, Paolo Bonzini wrote:

From: Akihiko Odaki 

We used to request declare_dependency() to link_whole static libraries.
If a static library is a thin archive, GNU ld keeps all object files
referenced by the archive open, and sometimes exceeds the open file limit.

Another problem with link_whole is its suboptimal handling of nested
dependencies.

link_whole by itself does not propagate dependencies. In particular,
gnutls, a dependency of crypto, is not propagated to its users, and we
currently work around the issue by declaring gnutls as a dependency for
each crypto user.  On the other hand, if you write something like

   libfoo = static_library('foo', 'foo.c', dependencies: gnutls)
   foo = declare_dependency(link_whole: libfoo)

   libbar = static_library('bar', 'bar.c', dependencies: foo)
   bar = declare_dependency(link_whole: libbar, dependencies: foo)
   executable('prog', sources: files('prog.c'), dependencies: [foo, bar])

hoping to propagate the gnutls dependency into bar.c, you'll see a
linking failure for "prog", because the foo.c.o object file is included in
libbar.a and therefore it is linked twice into "prog": once from libfoo.a
and once from libbar.a.  Here Meson does not see the duplication, it
just asks the linker to link all of libfoo.a and libbar.a into "prog".

Instead of using link_whole, extract objects included in static libraries
and pass them to declare_dependency(); and then the dependencies can be
added as well so that they are propagated, because object files on the
linker command line are always deduplicated.

This requires Meson 1.1.0 or later.

Signed-off-by: Akihiko Odaki 
Message-ID: <20240524-objects-v1-1-07cbbe961...@daynix.com>
Signed-off-by: Paolo Bonzini 
---
  docs/devel/build-system.rst|  3 ++-
  meson.build| 44 +++---
  gdbstub/meson.build|  4 ++--
  pythondeps.toml|  2 +-
  tcg/meson.build|  6 +++--
  tests/qtest/libqos/meson.build |  2 +-
  6 files changed, 35 insertions(+), 26 deletions(-)


Reviewed-by: Philippe Mathieu-Daudé

Re: [RFC PATCH v2 1/5] meson: move shared_module() calls where modules are already walked

2024-06-06 Thread Philippe Mathieu-Daudé


On 27/5/24 12:49, Paolo Bonzini wrote:

Signed-off-by: Paolo Bonzini 
---
  meson.build | 34 +++---
  1 file changed, 19 insertions(+), 15 deletions(-)




+  if emulator_modules.length() > 0
+alias_target('modules', emulator_modules)
+  endif
  endif
  
  nm = find_program('nm')

@@ -3745,19 +3762,6 @@ common_ss.add(hwcore)
  # Targets #
  ###
  
-emulator_modules = []

-foreach m : block_mods + system_mods
-  emulator_modules += shared_module(m.name(),
-build_by_default: true,
-name_prefix: '',
-link_whole: m,
-install: true,
-install_dir: qemu_moddir)
-endforeach
-if emulator_modules.length() > 0
-  alias_target('modules', emulator_modules)
-endif


In my experiment I moved this later after the qemu-system-FOO
meson targets, because I append libqemu-TARGET-softmmu objects;
but I guess this isn't a good start, and this patch LGTM.

Reviewed-by: Philippe Mathieu-Daudé

Re: [PATCH v2 0/3] semihosting: Restrict to TCG

2024-06-06 Thread Philippe Mathieu-Daudé


Kind ping :)

On 3/6/24 10:27, Philippe Mathieu-Daudé wrote:

On 30/5/24 16:53, Philippe Mathieu-Daudé wrote:

v2: Address Paolo's comment


Missing review: 1 & 2


Semihosting currently uses the TCG probe_access API,
so it is pointless to have it in the binary when TCG
isn't.

It could be implemented for other accelerators, but
work needs to be done. Meanwhile, do not enable it
unless TCG is available.

Philippe Mathieu-Daudé (3):
   target/mips: Restrict semihosting to TCG
   target/riscv: Restrict semihosting to TCG
   semihosting: Restrict to TCG

  semihosting/Kconfig  | 1 +
  target/mips/Kconfig  | 2 +-
  target/riscv/Kconfig | 4 ++--
  3 files changed, 4 insertions(+), 3 deletions(-)

[PATCH 2/3] plugins: Free CPUPluginState before destroying vCPU state

2024-06-06 Thread Philippe Mathieu-Daudé

cpu::plugin_state is allocated in cpu_common_initfn() when
the vCPU state is created. Release it in cpu_common_finalize()
when we are done.

Signed-off-by: Philippe Mathieu-Daudé 
---
 include/qemu/plugin.h | 3 +++
 hw/core/cpu-common.c  | 5 +
 2 files changed, 8 insertions(+)

diff --git a/include/qemu/plugin.h b/include/qemu/plugin.h
index bc5aef979e..af5f9db469 100644
--- a/include/qemu/plugin.h
+++ b/include/qemu/plugin.h
@@ -149,6 +149,9 @@ struct CPUPluginState {
 
 /**
  * qemu_plugin_create_vcpu_state: allocate plugin state
+ *
+ * The returned data must be released with g_free()
+ * when no longer required.
  */
 CPUPluginState *qemu_plugin_create_vcpu_state(void);
 
diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c
index bf1a7b8892..cd15402552 100644
--- a/hw/core/cpu-common.c
+++ b/hw/core/cpu-common.c
@@ -283,6 +283,11 @@ static void cpu_common_finalize(Object *obj)
 {
 CPUState *cpu = CPU(obj);
 
+#ifdef CONFIG_PLUGIN
+if (tcg_enabled()) {
+g_free(cpu->plugin_state);
+}
+#endif
 g_array_free(cpu->gdb_regs, TRUE);
 qemu_lockcnt_destroy(&cpu->in_ioctl_lock);
 qemu_mutex_destroy(&cpu->work_mutex);
-- 
2.41.0

[PATCH 3/3] accel/tcg: Move qemu_plugin_vcpu_init__async() to plugins/

2024-06-06 Thread Philippe Mathieu-Daudé

Calling qemu_plugin_vcpu_init__async() on the vCPU thread
is a detail of plugins, not relevant to TCG vCPU management.

Signed-off-by: Philippe Mathieu-Daudé 
Reviewed-by: Richard Henderson 
Reviewed-by: Pierrick Bouvier 
---
 hw/core/cpu-common.c | 9 +
 plugins/core.c   | 8 +++-
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c
index cd15402552..79fcc0b286 100644
--- a/hw/core/cpu-common.c
+++ b/hw/core/cpu-common.c
@@ -192,13 +192,6 @@ static void cpu_common_parse_features(const char *typename, char *features,
 }
 }
 
-#ifdef CONFIG_PLUGIN
-static void qemu_plugin_vcpu_init__async(CPUState *cpu, run_on_cpu_data unused)
-{
-qemu_plugin_vcpu_init_hook(cpu);
-}
-#endif
-
 static void cpu_common_realizefn(DeviceState *dev, Error **errp)
 {
 CPUState *cpu = CPU(dev);
@@ -274,7 +267,7 @@ static void cpu_common_initfn(Object *obj)
 #ifdef CONFIG_PLUGIN
 if (tcg_enabled()) {
 cpu->plugin_state = qemu_plugin_create_vcpu_state();
-async_run_on_cpu(cpu, qemu_plugin_vcpu_init__async, RUN_ON_CPU_NULL);
+qemu_plugin_vcpu_init_hook(cpu);
 }
 #endif
 }
diff --git a/plugins/core.c b/plugins/core.c
index d339b3db4d..3dec3556c3 100644
--- a/plugins/core.c
+++ b/plugins/core.c
@@ -241,7 +241,7 @@ static void plugin_grow_scoreboards__locked(CPUState *cpu)
 end_exclusive();
 }
 
-void qemu_plugin_vcpu_init_hook(CPUState *cpu)
+static void qemu_plugin_vcpu_init__async(CPUState *cpu, run_on_cpu_data unused)
 {
 bool success;
 
@@ -258,6 +258,12 @@ void qemu_plugin_vcpu_init_hook(CPUState *cpu)
 plugin_vcpu_cb__simple(cpu, QEMU_PLUGIN_EV_VCPU_INIT);
 }
 
+void qemu_plugin_vcpu_init_hook(CPUState *cpu)
+{
+/* Plugin initialization must wait until the vCPU starts executing code */
+async_run_on_cpu(cpu, qemu_plugin_vcpu_init__async, RUN_ON_CPU_NULL);
+}
+
 void qemu_plugin_vcpu_exit_hook(CPUState *cpu)
 {
 bool success;
-- 
2.41.0
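
The deferral this patch moves into plugins/core.c can be pictured with a
toy work queue. Everything below is hypothetical (ToyCPU, the fixed-size
queue, toy_async_run_on_cpu()); it only models the idea that the init
hook queues the real initialization and the vCPU thread runs it later,
which is what async_run_on_cpu() provides in QEMU.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_WORK 4

typedef struct ToyCPU ToyCPU;
typedef void (*toy_work_fn)(ToyCPU *cpu);

struct ToyCPU {
    toy_work_fn pending[TOY_MAX_WORK];  /* work queued for the vCPU thread */
    size_t n_pending;
    int plugin_inited;                  /* set once the deferred init ran */
};

/* Stand-in for async_run_on_cpu(): queue the work, do not run it yet. */
static void toy_async_run_on_cpu(ToyCPU *cpu, toy_work_fn fn)
{
    assert(cpu->n_pending < TOY_MAX_WORK);
    cpu->pending[cpu->n_pending++] = fn;
}

/* The actual initialization, private to the "plugins" side. */
static void toy_plugin_vcpu_init_async(ToyCPU *cpu)
{
    cpu->plugin_inited = 1;
}

/* Public hook: callers no longer need to know that the init is deferred. */
void toy_plugin_vcpu_init_hook(ToyCPU *cpu)
{
    toy_async_run_on_cpu(cpu, toy_plugin_vcpu_init_async);
}

/* What the vCPU thread would do when it starts running: drain the queue. */
void toy_process_queued_work(ToyCPU *cpu)
{
    for (size_t i = 0; i < cpu->n_pending; i++) {
        cpu->pending[i](cpu);
    }
    cpu->n_pending = 0;
}
```

With this shape, cpu_common_initfn() calls only the hook, and the
"run it asynchronously" detail lives next to the code that needs it.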

[PATCH 1/3] plugins: Ensure vCPU index is assigned in init/exit hooks

2024-06-06 Thread Philippe Mathieu-Daudé

Since vCPUs are hashed by their index, this index can't
be uninitialized (UNASSIGNED_CPU_INDEX).

Signed-off-by: Philippe Mathieu-Daudé 
---
 plugins/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/plugins/core.c b/plugins/core.c
index badede28cf..d339b3db4d 100644
--- a/plugins/core.c
+++ b/plugins/core.c
@@ -245,6 +245,7 @@ void qemu_plugin_vcpu_init_hook(CPUState *cpu)
 {
 bool success;
 
+assert(cpu->cpu_index != UNASSIGNED_CPU_INDEX);
 qemu_rec_mutex_lock(&plugin.lock);
 plugin.num_vcpus = MAX(plugin.num_vcpus, cpu->cpu_index + 1);
 plugin_cpu_update__locked(&cpu->cpu_index, NULL, NULL);
@@ -263,6 +264,7 @@ void qemu_plugin_vcpu_exit_hook(CPUState *cpu)
 
 plugin_vcpu_cb__simple(cpu, QEMU_PLUGIN_EV_VCPU_EXIT);
 
+assert(cpu->cpu_index != UNASSIGNED_CPU_INDEX);
 qemu_rec_mutex_lock(&plugin.lock);
 success = g_hash_table_remove(plugin.cpu_ht, &cpu->cpu_index);
 g_assert(success);
-- 
2.41.0
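
A toy model of why the assertion matters: when vCPUs are looked up by
index, registering one whose index was never assigned would either
corrupt the table or make the later removal fail. The names below are
hypothetical, a fixed array stands in for the hash table, and the value
-1 for the unassigned index is an assumption made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_UNASSIGNED_CPU_INDEX (-1)  /* assumed sentinel value */
#define TOY_MAX_VCPUS 8

typedef struct {
    int cpu_index;
} ToyCPU;

static ToyCPU *toy_cpu_table[TOY_MAX_VCPUS]; /* stand-in for the hash table */
static int toy_num_vcpus;

/* Init hook: the index must be valid before it is used as a key. */
void toy_vcpu_init_hook(ToyCPU *cpu)
{
    assert(cpu->cpu_index != TOY_UNASSIGNED_CPU_INDEX);
    assert(cpu->cpu_index >= 0 && cpu->cpu_index < TOY_MAX_VCPUS);

    if (cpu->cpu_index + 1 > toy_num_vcpus) {
        toy_num_vcpus = cpu->cpu_index + 1;
    }
    toy_cpu_table[cpu->cpu_index] = cpu;
}

/* Exit hook: returns 1 if the vCPU was registered under its index. */
int toy_vcpu_exit_hook(ToyCPU *cpu)
{
    int found;

    assert(cpu->cpu_index != TOY_UNASSIGNED_CPU_INDEX);
    assert(cpu->cpu_index >= 0 && cpu->cpu_index < TOY_MAX_VCPUS);

    found = toy_cpu_table[cpu->cpu_index] == cpu;
    toy_cpu_table[cpu->cpu_index] = NULL;
    return found;
}
```

Failing early with a clear assertion is easier to debug than a failed
removal from the table at exit time.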

[PATCH 0/3] plugins: Few debugging cleanups

2024-06-06 Thread Philippe Mathieu-Daudé

- Assert cpu_index is assigned in INIT/EXIT hooks
- Free cpu->plugin_state
- Restrict qemu_plugin_vcpu_init__async() to plugins/

Philippe Mathieu-Daudé (3):
  plugins: Ensure vCPU index is assigned in init/exit hooks
  plugins: Free CPUPluginState before destroying vCPU state
  accel/tcg: Move qemu_plugin_vcpu_init__async() to plugins/

 include/qemu/plugin.h |  3 +++
 hw/core/cpu-common.c  | 14 ++
 plugins/core.c| 10 +-
 3 files changed, 18 insertions(+), 9 deletions(-)

-- 
2.41.0

[PATCH v5 09/10] hw/nvme: add reservation protocol command

2024-06-06 Thread Changqi Lu

Add reservation acquire, reservation register,
reservation release and reservation report commands
in the nvme device layer.

By introducing these commands, this enables the nvme
device to perform reservation-related tasks, including
querying keys, querying reservation status, registering
reservation keys, initiating and releasing reservations,
as well as clearing and preempting reservations held by
other keys.

These commands are crucial for management and control of
shared storage resources in a persistent manner.

Signed-off-by: Changqi Lu 
Signed-off-by: zhenwei pi 
---
 hw/nvme/ctrl.c   | 323 ++-
 hw/nvme/nvme.h   |   4 +
 include/block/nvme.h |  37 +
 3 files changed, 363 insertions(+), 1 deletion(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 182307a48b..44e0bd5c63 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -294,6 +294,10 @@ static const uint32_t nvme_cse_iocs_nvm[256] = {
 [NVME_CMD_COMPARE]  = NVME_CMD_EFF_CSUPP,
 [NVME_CMD_IO_MGMT_RECV] = NVME_CMD_EFF_CSUPP,
 [NVME_CMD_IO_MGMT_SEND] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
+[NVME_CMD_RESV_REGISTER]= NVME_CMD_EFF_CSUPP,
+[NVME_CMD_RESV_REPORT]  = NVME_CMD_EFF_CSUPP,
+[NVME_CMD_RESV_ACQUIRE] = NVME_CMD_EFF_CSUPP,
+[NVME_CMD_RESV_RELEASE] = NVME_CMD_EFF_CSUPP,
 };
 
 static const uint32_t nvme_cse_iocs_zoned[256] = {
@@ -308,6 +312,10 @@ static const uint32_t nvme_cse_iocs_zoned[256] = {
 [NVME_CMD_ZONE_APPEND]  = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
 [NVME_CMD_ZONE_MGMT_SEND]   = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
 [NVME_CMD_ZONE_MGMT_RECV]   = NVME_CMD_EFF_CSUPP,
+[NVME_CMD_RESV_REGISTER]= NVME_CMD_EFF_CSUPP,
+[NVME_CMD_RESV_REPORT]  = NVME_CMD_EFF_CSUPP,
+[NVME_CMD_RESV_ACQUIRE] = NVME_CMD_EFF_CSUPP,
+[NVME_CMD_RESV_RELEASE] = NVME_CMD_EFF_CSUPP,
 };
 
 static void nvme_process_sq(void *opaque);
@@ -1745,6 +1753,7 @@ static void nvme_aio_err(NvmeRequest *req, int ret)
 
 switch (req->cmd.opcode) {
 case NVME_CMD_READ:
+case NVME_CMD_RESV_REPORT:
 status = NVME_UNRECOVERED_READ;
 break;
 case NVME_CMD_FLUSH:
@@ -1752,6 +1761,9 @@ static void nvme_aio_err(NvmeRequest *req, int ret)
 case NVME_CMD_WRITE_ZEROES:
 case NVME_CMD_ZONE_APPEND:
 case NVME_CMD_COPY:
+case NVME_CMD_RESV_REGISTER:
+case NVME_CMD_RESV_ACQUIRE:
+case NVME_CMD_RESV_RELEASE:
 status = NVME_WRITE_FAULT;
 break;
 default:
@@ -2127,7 +2139,10 @@ static inline bool nvme_is_write(NvmeRequest *req)
 
 return rw->opcode == NVME_CMD_WRITE ||
rw->opcode == NVME_CMD_ZONE_APPEND ||
-   rw->opcode == NVME_CMD_WRITE_ZEROES;
+   rw->opcode == NVME_CMD_WRITE_ZEROES ||
+   rw->opcode == NVME_CMD_RESV_REGISTER ||
+   rw->opcode == NVME_CMD_RESV_ACQUIRE ||
+   rw->opcode == NVME_CMD_RESV_RELEASE;
 }
 
 static void nvme_misc_cb(void *opaque, int ret)
@@ -2692,6 +2707,304 @@ static uint16_t nvme_verify(NvmeCtrl *n, NvmeRequest *req)
 return NVME_NO_COMPLETE;
 }
 
+typedef struct NvmeKeyInfo {
+uint64_t cr_key;
+uint64_t nr_key;
+} NvmeKeyInfo;
+
+static uint16_t nvme_resv_register(NvmeCtrl *n, NvmeRequest *req)
+{
+int ret;
+NvmeKeyInfo key_info;
+NvmeNamespace *ns = req->ns;
+uint32_t cdw10 = le32_to_cpu(req->cmd.cdw10);
+bool ignore_key = cdw10 >> 3 & 0x1;
+uint8_t action = cdw10 & 0x7;
+uint8_t ptpl = cdw10 >> 30 & 0x3;
+bool aptpl;
+
+switch (ptpl) {
+case NVME_RESV_PTPL_NO_CHANGE:
+aptpl = (ns->id_ns.rescap & NVME_PR_CAP_PTPL) ? true : false;
+break;
+case NVME_RESV_PTPL_DISABLE:
+aptpl = false;
+break;
+case NVME_RESV_PTPL_ENABLE:
+aptpl = true;
+break;
+default:
+return NVME_INVALID_FIELD;
+}
+
+ret = nvme_h2c(n, (uint8_t *)&key_info, sizeof(NvmeKeyInfo), req);
+if (ret) {
+return ret;
+}
+
+switch (action) {
+case NVME_RESV_REGISTER_ACTION_REGISTER:
+req->aiocb = blk_aio_pr_register(ns->blkconf.blk, 0,
+ key_info.nr_key, 0, aptpl,
+ ignore_key, nvme_misc_cb,
+ req);
+break;
+case NVME_RESV_REGISTER_ACTION_UNREGISTER:
+req->aiocb = blk_aio_pr_register(ns->blkconf.blk, key_info.cr_key, 0,
+ 0, aptpl, ignore_key,
+ nvme_misc_cb, req);
+break;
+case NVME_RESV_REGISTER_ACTION_REPLACE:
+req->aiocb = blk_aio_pr_register(ns->blkconf.blk, key_info.cr_key,
+ key_info.nr_key, 0, aptpl, ignore_key,
+ nvme_misc_cb, req);
+break;
+default:
+return

[PATCH v5 06/10] block/nvme: add reservation command protocol constants

2024-06-06 Thread Changqi Lu

Add constants for the NVMe persistent reservation command
protocol. The constants include the reservation command
opcodes and the reservation type values defined in section 7
of the NVMe 2.0 specification.

Signed-off-by: Changqi Lu 
Signed-off-by: zhenwei pi 
---
 include/block/nvme.h | 61 
 1 file changed, 61 insertions(+)

diff --git a/include/block/nvme.h b/include/block/nvme.h
index bb231d0b9a..da6ccb0f3b 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -633,6 +633,10 @@ enum NvmeIoCommands {
 NVME_CMD_WRITE_ZEROES   = 0x08,
 NVME_CMD_DSM= 0x09,
 NVME_CMD_VERIFY = 0x0c,
+NVME_CMD_RESV_REGISTER  = 0x0d,
+NVME_CMD_RESV_REPORT= 0x0e,
+NVME_CMD_RESV_ACQUIRE   = 0x11,
+NVME_CMD_RESV_RELEASE   = 0x15,
 NVME_CMD_IO_MGMT_RECV   = 0x12,
 NVME_CMD_COPY   = 0x19,
 NVME_CMD_IO_MGMT_SEND   = 0x1d,
@@ -641,6 +645,63 @@ enum NvmeIoCommands {
 NVME_CMD_ZONE_APPEND= 0x7d,
 };
 
+typedef enum {
+NVME_RESV_REGISTER_ACTION_REGISTER  = 0x00,
+NVME_RESV_REGISTER_ACTION_UNREGISTER= 0x01,
+NVME_RESV_REGISTER_ACTION_REPLACE   = 0x02,
+} NvmeReservationRegisterAction;
+
+typedef enum {
+NVME_RESV_RELEASE_ACTION_RELEASE= 0x00,
+NVME_RESV_RELEASE_ACTION_CLEAR  = 0x01,
+} NvmeReservationReleaseAction;
+
+typedef enum {
+NVME_RESV_ACQUIRE_ACTION_ACQUIRE= 0x00,
+NVME_RESV_ACQUIRE_ACTION_PREEMPT= 0x01,
+NVME_RESV_ACQUIRE_ACTION_PREEMPT_AND_ABORT  = 0x02,
+} NvmeReservationAcquireAction;
+
+typedef enum {
+NVME_RESV_WRITE_EXCLUSIVE   = 0x01,
+NVME_RESV_EXCLUSIVE_ACCESS  = 0x02,
+NVME_RESV_WRITE_EXCLUSIVE_REGS_ONLY = 0x03,
+NVME_RESV_EXCLUSIVE_ACCESS_REGS_ONLY= 0x04,
+NVME_RESV_WRITE_EXCLUSIVE_ALL_REGS  = 0x05,
+NVME_RESV_EXCLUSIVE_ACCESS_ALL_REGS = 0x06,
+} NvmeResvType;
+
+typedef enum {
+NVME_RESV_PTPL_NO_CHANGE = 0x00,
+NVME_RESV_PTPL_DISABLE   = 0x02,
+NVME_RESV_PTPL_ENABLE= 0x03,
+} NvmeResvPTPL;
+
+typedef enum NVMEPrCap {
+/* Persist Through Power Loss */
+NVME_PR_CAP_PTPL = 1 << 0,
+/* Write Exclusive reservation type */
+NVME_PR_CAP_WR_EX = 1 << 1,
+/* Exclusive Access reservation type */
+NVME_PR_CAP_EX_AC = 1 << 2,
+/* Write Exclusive Registrants Only reservation type */
+NVME_PR_CAP_WR_EX_RO = 1 << 3,
+/* Exclusive Access Registrants Only reservation type */
+NVME_PR_CAP_EX_AC_RO = 1 << 4,
+/* Write Exclusive All Registrants reservation type */
+NVME_PR_CAP_WR_EX_AR = 1 << 5,
+/* Exclusive Access All Registrants reservation type */
+NVME_PR_CAP_EX_AC_AR = 1 << 6,
+
+NVME_PR_CAP_ALL = (NVME_PR_CAP_PTPL |
+  NVME_PR_CAP_WR_EX |
+  NVME_PR_CAP_EX_AC |
+  NVME_PR_CAP_WR_EX_RO |
+  NVME_PR_CAP_EX_AC_RO |
+  NVME_PR_CAP_WR_EX_AR |
+  NVME_PR_CAP_EX_AC_AR),
+} NvmePrCap;
+
 typedef struct QEMU_PACKED NvmeDeleteQ {
 uint8_t opcode;
 uint8_t flags;
-- 
2.20.1
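
As a rough illustration of how these constants are consumed, the sketch below splits CDW10 of a Reservation Register command into its fields, using the same bit positions as the nvme_resv_register() hunk earlier in this series (action in bits 2:0, ignore-key in bit 3, PTPL change in bits 31:30). The struct and both helper names are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Field layout taken from the nvme_resv_register() hunk in this series. */
typedef struct ResvRegisterFields {
    uint8_t action;   /* bits 2:0, e.g. NVME_RESV_REGISTER_ACTION_REPLACE */
    bool ignore_key;  /* bit 3 */
    uint8_t cptpl;    /* bits 31:30, e.g. NVME_RESV_PTPL_ENABLE */
} ResvRegisterFields;

/* Hypothetical helper: split an already CPU-endian CDW10 into its fields. */
static ResvRegisterFields resv_register_decode(uint32_t cdw10)
{
    ResvRegisterFields f = {
        .action = cdw10 & 0x7,
        .ignore_key = (cdw10 >> 3) & 0x1,
        .cptpl = (cdw10 >> 30) & 0x3,
    };
    return f;
}

/* Hypothetical helper: 0x01 is the only encoding the NvmeResvPTPL enum omits. */
static bool resv_ptpl_is_valid(uint8_t cptpl)
{
    return cptpl == 0x00 || cptpl == 0x02 || cptpl == 0x03;
}
```

A caller would reject the command when resv_ptpl_is_valid() returns false, which corresponds to the default branch returning NVME_INVALID_FIELD in the patch.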

[PATCH v5 02/10] block/raw: add persistent reservation in/out driver

2024-06-06 Thread Changqi Lu

Add persistent reservation in/out operations for raw driver.
The following methods are implemented: bdrv_co_pr_read_keys,
bdrv_co_pr_read_reservation, bdrv_co_pr_register, bdrv_co_pr_reserve,
bdrv_co_pr_release, bdrv_co_pr_clear and bdrv_co_pr_preempt.

Signed-off-by: Changqi Lu 
Signed-off-by: zhenwei pi 
---
 block/raw-format.c | 56 ++
 1 file changed, 56 insertions(+)

diff --git a/block/raw-format.c b/block/raw-format.c
index ac7e8495f6..3746bc1bd3 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -454,6 +454,55 @@ raw_co_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
 return bdrv_co_ioctl(bs->file->bs, req, buf);
 }
 
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_read_keys(BlockDriverState *bs, uint32_t *generation,
+uint32_t num_keys, uint64_t *keys)
+{
+
+return bdrv_co_pr_read_keys(bs->file->bs, generation, num_keys, keys);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_read_reservation(BlockDriverState *bs, uint32_t *generation,
+   uint64_t *key, BlockPrType *type)
+{
+return bdrv_co_pr_read_reservation(bs->file->bs, generation, key, type);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_register(BlockDriverState *bs, uint64_t old_key,
+   uint64_t new_key, BlockPrType type,
+   bool ptpl, bool ignore_key)
+{
+return bdrv_co_pr_register(bs->file->bs, old_key, new_key,
+   type, ptpl, ignore_key);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_reserve(BlockDriverState *bs, uint64_t key, BlockPrType type)
+{
+return bdrv_co_pr_reserve(bs->file->bs, key, type);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_release(BlockDriverState *bs, uint64_t key, BlockPrType type)
+{
+return bdrv_co_pr_release(bs->file->bs, key, type);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_clear(BlockDriverState *bs, uint64_t key)
+{
+return bdrv_co_pr_clear(bs->file->bs, key);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_pr_preempt(BlockDriverState *bs, uint64_t old_key,
+  uint64_t new_key, BlockPrType type, bool abort)
+{
+return bdrv_co_pr_preempt(bs->file->bs, old_key, new_key, type, abort);
+}
+
 static int GRAPH_RDLOCK raw_has_zero_init(BlockDriverState *bs)
 {
 return bdrv_has_zero_init(bs->file->bs);
@@ -672,6 +721,13 @@ BlockDriver bdrv_raw = {
 .strong_runtime_opts  = raw_strong_runtime_opts,
 .mutable_opts = mutable_opts,
 .bdrv_cancel_in_flight = raw_cancel_in_flight,
+.bdrv_co_pr_read_keys= raw_co_pr_read_keys,
+.bdrv_co_pr_read_reservation = raw_co_pr_read_reservation,
+.bdrv_co_pr_register = raw_co_pr_register,
+.bdrv_co_pr_reserve  = raw_co_pr_reserve,
+.bdrv_co_pr_release  = raw_co_pr_release,
+.bdrv_co_pr_clear= raw_co_pr_clear,
+.bdrv_co_pr_preempt  = raw_co_pr_preempt,
 };
 
 static void bdrv_raw_init(void)
-- 
2.20.1
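
The raw driver callbacks above add no logic of their own; each one forwards to the child node. A minimal, self-contained sketch of that pass-through pattern follows. The Node and Driver types are heavily reduced stand-ins invented for this example, and returning -ENOTSUP for a driver without the callback is an assumption rather than something shown in this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Node Node;

/* Hypothetical, much reduced stand-in for QEMU's BlockDriver callback table. */
typedef struct Driver {
    int (*pr_clear)(Node *node, uint64_t key);
} Driver;

struct Node {
    const Driver *drv;
    Node *child;          /* plays the role of bs->file->bs */
    uint64_t cleared_key; /* records what the bottom node saw */
};

/* Generic entry point: dispatch to the node's driver if it implements the op. */
static int node_pr_clear(Node *node, uint64_t key)
{
    if (node == NULL || node->drv == NULL || node->drv->pr_clear == NULL) {
        return -ENOTSUP; /* assumed error code for "not implemented" */
    }
    return node->drv->pr_clear(node, key);
}

/* Format layer: adds nothing, forwards to the child like raw_co_pr_clear(). */
static int raw_pr_clear(Node *node, uint64_t key)
{
    return node_pr_clear(node->child, key);
}

/* Protocol layer: the node that would really talk to the device. */
static int proto_pr_clear(Node *node, uint64_t key)
{
    node->cleared_key = key;
    return 0;
}

static const Driver raw_driver = { .pr_clear = raw_pr_clear };
static const Driver proto_driver = { .pr_clear = proto_pr_clear };
static const Driver no_pr_driver = { .pr_clear = NULL };
```

Stacking a raw node on a protocol node delivers the key to the bottom node, while stacking it on a driver without the callback surfaces the error unchanged.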

[PATCH v5 10/10] block/iscsi: add persistent reservation in/out driver

2024-06-06 Thread Changqi Lu

Add persistent reservation in/out operations for iscsi driver.
The following methods are implemented: bdrv_co_pr_read_keys,
bdrv_co_pr_read_reservation, bdrv_co_pr_register, bdrv_co_pr_reserve,
bdrv_co_pr_release, bdrv_co_pr_clear and bdrv_co_pr_preempt.

Signed-off-by: Changqi Lu 
Signed-off-by: zhenwei pi 
---
 block/iscsi.c | 443 ++
 1 file changed, 443 insertions(+)

diff --git a/block/iscsi.c b/block/iscsi.c
index 2ff14b7472..d94ebe35bd 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -96,6 +96,7 @@ typedef struct IscsiLun {
 unsigned long *allocmap_valid;
 long allocmap_size;
 int cluster_size;
+uint8_t pr_cap;
 bool use_16_for_rw;
 bool write_protected;
 bool lbpme;
@@ -280,6 +281,8 @@ iscsi_co_generic_cb(struct iscsi_context *iscsi, int status,
 iTask->err_code = -error;
 iTask->err_str = g_strdup(iscsi_get_error(iscsi));
 }
+} else if (status == SCSI_STATUS_RESERVATION_CONFLICT) {
+iTask->err_code = -EBADE;
 }
 }
 }
@@ -1792,6 +1795,52 @@ static void iscsi_save_designator(IscsiLun *lun,
 }
 }
 
+static void iscsi_get_pr_cap_sync(IscsiLun *iscsilun, Error **errp)
+{
+struct scsi_task *task = NULL;
+struct scsi_persistent_reserve_in_report_capabilities *rc = NULL;
+int retries = ISCSI_CMD_RETRIES;
+int xferlen = sizeof(struct scsi_persistent_reserve_in_report_capabilities);
+
+do {
+if (task != NULL) {
+scsi_free_scsi_task(task);
+task = NULL;
+}
+
+task = iscsi_persistent_reserve_in_sync(iscsilun->iscsi,
+   iscsilun->lun, SCSI_PR_IN_REPORT_CAPABILITIES, xferlen);
+if (task != NULL && task->status == SCSI_STATUS_GOOD) {
+rc = scsi_datain_unmarshall(task);
+if (rc == NULL) {
+error_setg(errp,
+"iSCSI: Failed to unmarshall report capabilities data.");
+} else {
+iscsilun->pr_cap =
+scsi_pr_cap_to_block(rc->persistent_reservation_type_mask);
+iscsilun->pr_cap |= (rc->ptpl_a) ? BLK_PR_CAP_PTPL : 0;
+}
+break;
+}
+
+if (task != NULL && task->status == SCSI_STATUS_CHECK_CONDITION
+&& task->sense.key == SCSI_SENSE_UNIT_ATTENTION) {
+break;
+}
+
+} while (task != NULL && task->status == SCSI_STATUS_CHECK_CONDITION
+ && task->sense.key == SCSI_SENSE_UNIT_ATTENTION
+ && retries-- > 0);
+
+if (task == NULL || task->status != SCSI_STATUS_GOOD) {
+error_setg(errp, "iSCSI: failed to send report capabilities command");
+}
+
+if (task) {
+scsi_free_scsi_task(task);
+}
+}
+
 static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
   Error **errp)
 {
@@ -2024,6 +2073,11 @@ static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
 bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
 }
 
+iscsi_get_pr_cap_sync(iscsilun, &local_err);
+if (local_err != NULL) {
+error_propagate(errp, local_err);
+ret = -EINVAL;
+}
 out:
 qemu_opts_del(opts);
 g_free(initiator_name);
@@ -2110,6 +2164,8 @@ static void iscsi_refresh_limits(BlockDriverState *bs, Error **errp)
 bs->bl.opt_transfer = pow2floor(iscsilun->bl.opt_xfer_len *
 iscsilun->block_size);
 }
+
+bs->bl.pr_cap = iscsilun->pr_cap;
 }
 
 /* Note that this will not re-establish a connection with an iSCSI target - it
@@ -2408,6 +2464,385 @@ out_unlock:
 return r;
 }
 
+static int coroutine_fn
+iscsi_co_pr_read_keys(BlockDriverState *bs, uint32_t *generation,
+  uint32_t num_keys, uint64_t *keys)
+{
+IscsiLun *iscsilun = bs->opaque;
+QEMUIOVector qiov;
+struct IscsiTask iTask;
+int xferlen = sizeof(struct scsi_persistent_reserve_in_read_keys) +
+  sizeof(uint64_t) * num_keys;
+uint8_t *buf = g_malloc0(xferlen);
+int32_t num_collect_keys = 0;
+int r = 0;
+
+qemu_iovec_init_buf(&qiov, buf, xferlen);
+iscsi_co_init_iscsitask(iscsilun, &iTask);
+qemu_mutex_lock(&iscsilun->mutex);
+retry:
+iTask.task = iscsi_persistent_reserve_in_task(iscsilun->iscsi,
+ iscsilun->lun, SCSI_PR_IN_READ_KEYS, xferlen,
+ iscsi_co_generic_cb, &iTask);
+
+if (iTask.task == NULL) {
+qemu_mutex_unlock(&iscsilun->mutex);
+return -ENOMEM;
+}
+
+scsi_task_set_iov_in(iTask.task, (struct scsi_iovec *)qiov.iov, qiov.niov);
+iscsi_co_wait_for_task(&iTask, iscsilun);
+
+if (iTask.task != NULL) {
+scsi_free_scsi_task(iTask.task);
+iTask.task = NULL;
+}
+
+if (iTask.do_retry) {
+iTask.complete = 0;
+goto retry;
+}
+
+

[PATCH v5 05/10] hw/scsi: add persistent reservation in/out api for scsi device

2024-06-06 Thread Changqi Lu

Add persistent reservation in/out operations in the
SCSI device layer. The persistent reservation in/out
API lets the SCSI device perform reservation-related
tasks: querying keys, querying reservation status,
registering reservation keys, acquiring and releasing
reservations, and clearing and preempting reservations
held by other keys.

These operations are crucial for management and control of
shared storage resources in a persistent manner.
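
For readers unfamiliar with the READ KEYS response, the sketch below builds the parameter data in the same layout that scsi_pr_read_keys_complete() in this patch writes: a big-endian 32-bit generation at offset 0, a big-endian 32-bit additional length (key count times 8) at offset 4, and big-endian 64-bit keys from offset 8. The helper names are hypothetical. Like the patch, the sketch reports only the keys it actually copies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value in big-endian byte order. */
static void put_be32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++) {
        p[i] = (uint8_t)(v >> (24 - 8 * i));
    }
}

/* Store a 64-bit value in big-endian byte order. */
static void put_be64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        p[i] = (uint8_t)(v >> (56 - 8 * i));
    }
}

/*
 * Hypothetical helper: fill buf with READ KEYS parameter data.
 * Returns the number of bytes written, or 0 if the 8-byte header does not fit.
 * Keys that do not fit in buflen are dropped.
 */
static size_t pr_read_keys_fill(uint8_t *buf, size_t buflen,
                                uint32_t generation,
                                const uint64_t *keys, uint32_t num_keys)
{
    if (buflen < 8) {
        return 0;
    }

    size_t max_keys = (buflen - 8) / 8;
    if (num_keys > max_keys) {
        num_keys = (uint32_t)max_keys;
    }

    put_be32(&buf[0], generation);
    put_be32(&buf[4], num_keys * 8);
    for (uint32_t i = 0; i < num_keys; i++) {
        put_be64(&buf[8 + (size_t)i * 8], keys[i]);
    }
    return 8 + (size_t)num_keys * 8;
}
```

Writing the bytes explicitly avoids converting the caller's key array in place, which the patch does with cpu_to_be64() before its memcpy().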

Signed-off-by: Changqi Lu 
Signed-off-by: zhenwei pi 
---
 hw/scsi/scsi-disk.c | 352 
 1 file changed, 352 insertions(+)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 4bd7af9d0c..0e964dbd87 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -32,6 +32,7 @@
 #include "migration/vmstate.h"
 #include "hw/scsi/emulation.h"
 #include "scsi/constants.h"
+#include "scsi/utils.h"
 #include "sysemu/block-backend.h"
 #include "sysemu/blockdev.h"
 #include "hw/block/block.h"
@@ -42,6 +43,7 @@
 #include "qemu/cutils.h"
 #include "trace.h"
 #include "qom/object.h"
+#include "block/block_int.h"
 
 #ifdef __linux
 #include 
@@ -1474,6 +1476,346 @@ static void scsi_disk_emulate_read_data(SCSIRequest *req)
 scsi_req_complete(&r->req, GOOD);
 }
 
+typedef struct SCSIPrReadKeys {
+uint32_t generation;
+uint32_t num_keys;
+uint64_t *keys;
+void *req;
+} SCSIPrReadKeys;
+
+typedef struct SCSIPrReadReservation {
+uint32_t generation;
+uint64_t key;
+BlockPrType type;
+void *req;
+} SCSIPrReadReservation;
+
+static void scsi_pr_read_keys_complete(void *opaque, int ret)
+{
+int num_keys;
+uint8_t *buf;
+SCSIPrReadKeys *blk_keys = (SCSIPrReadKeys *)opaque;
+SCSIDiskReq *r = (SCSIDiskReq *)blk_keys->req;
+SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
+
+assert(blk_get_aio_context(s->qdev.conf.blk) ==
+qemu_get_current_aio_context());
+
+assert(r->req.aiocb != NULL);
+r->req.aiocb = NULL;
+
+if (scsi_disk_req_check_error(r, ret, true)) {
+goto done;
+}
+
+buf = scsi_req_get_buf(&r->req);
+num_keys = MIN(blk_keys->num_keys, ret);
+blk_keys->generation = cpu_to_be32(blk_keys->generation);
+memcpy(&buf[0], &blk_keys->generation, 4);
+for (int i = 0; i < num_keys; i++) {
+blk_keys->keys[i] = cpu_to_be64(blk_keys->keys[i]);
+memcpy(&buf[8 + i * 8], &blk_keys->keys[i], 8);
+}
+num_keys = cpu_to_be32(num_keys * 8);
+memcpy(&buf[4], &num_keys, 4);
+
+scsi_req_data(&r->req, r->buflen);
+done:
+scsi_req_unref(&r->req);
+g_free(blk_keys->keys);
+g_free(blk_keys);
+}
+
+static int scsi_disk_emulate_pr_read_keys(SCSIRequest *req)
+{
+SCSIPrReadKeys *blk_keys;
+SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req);
+SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev);
+int buflen = MIN(r->req.cmd.xfer, r->buflen);
+int num_keys = (buflen - sizeof(uint32_t) * 2) / sizeof(uint64_t);
+
+blk_keys = g_new0(SCSIPrReadKeys, 1);
+blk_keys->generation = 0;
+/* num_keys is the maximum number of keys that can be transmitted */
+blk_keys->num_keys = num_keys;
+blk_keys->keys = g_malloc(sizeof(uint64_t) * num_keys);
+blk_keys->req = r;
+
+/* The request is used as the AIO opaque value, so add a ref.  */
+scsi_req_ref(&r->req);
+r->req.aiocb = blk_aio_pr_read_keys(s->qdev.conf.blk, &blk_keys->generation,
+blk_keys->num_keys, blk_keys->keys,
+scsi_pr_read_keys_complete, blk_keys);
+return 0;
+}
+
+static void scsi_pr_read_reservation_complete(void *opaque, int ret)
+{
+uint8_t *buf;
+uint32_t additional_len = 0;
+SCSIPrReadReservation *blk_rsv = (SCSIPrReadReservation *)opaque;
+SCSIDiskReq *r = (SCSIDiskReq *)blk_rsv->req;
+SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
+
+assert(blk_get_aio_context(s->qdev.conf.blk) ==
+qemu_get_current_aio_context());
+
+assert(r->req.aiocb != NULL);
+r->req.aiocb = NULL;
+
+if (scsi_disk_req_check_error(r, ret, true)) {
+goto done;
+}
+
+buf = scsi_req_get_buf(&r->req);
+blk_rsv->generation = cpu_to_be32(blk_rsv->generation);
+memcpy(&buf[0], &blk_rsv->generation, 4);
+if (ret) {
+additional_len = cpu_to_be32(16);
+blk_rsv->key = cpu_to_be64(blk_rsv->key);
+memcpy(&buf[8], &blk_rsv->key, 8);
+buf[21] = block_pr_type_to_scsi(blk_rsv->type) & 0xf;
+} else {
+additional_len = cpu_to_be32(0);
+}
+
+memcpy(&buf[4], &additional_len, 4);
+scsi_req_data(&r->req, r->buflen);
+
+done:
+scsi_req_unref(&r->req);
+g_free(blk_rsv);
+}
+
+static int scsi_disk_emulate_pr_read_reservation(SCSIRequest *req)
+{
+SCSIPrReadReservation *blk_rsv;
+SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req);
+SCSIDiskS

[PATCH v5 08/10] hw/nvme: enable ONCS and rescap function

2024-06-06 Thread Changqi Lu

This commit sets the reservations bit in ONCS so that the
controller advertises reservation support. It also fills in
the namespace's rescap field from the reservation capabilities
reported by the backend driver.

Signed-off-by: Changqi Lu 
Signed-off-by: zhenwei pi 
---
 hw/nvme/ctrl.c | 3 ++-
 hw/nvme/ns.c   | 5 +
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 127c3d2383..182307a48b 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -8248,7 +8248,8 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
 id->nn = cpu_to_le32(NVME_MAX_NAMESPACES);
 id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROES | NVME_ONCS_TIMESTAMP |
NVME_ONCS_FEATURES | NVME_ONCS_DSM |
-   NVME_ONCS_COMPARE | NVME_ONCS_COPY);
+   NVME_ONCS_COMPARE | NVME_ONCS_COPY |
+   NVME_ONCS_RESRVATIONS);
 
 /*
  * NOTE: If this device ever supports a command set that does NOT use 0x0
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index ea8db175db..320c9bf658 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -20,6 +20,7 @@
 #include "qemu/bitops.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/block-backend.h"
+#include "block/block_int.h"
 
 #include "nvme.h"
 #include "trace.h"
@@ -33,6 +34,7 @@ void nvme_ns_init_format(NvmeNamespace *ns)
 BlockDriverInfo bdi;
 int npdg, ret;
 int64_t nlbas;
+uint8_t blk_pr_cap;
 
 ns->lbaf = id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
 ns->lbasz = 1 << ns->lbaf.ds;
@@ -55,6 +57,9 @@ void nvme_ns_init_format(NvmeNamespace *ns)
 }
 
 id_ns->npda = id_ns->npdg = npdg - 1;
+
+blk_pr_cap = blk_bs(ns->blkconf.blk)->file->bs->bl.pr_cap;
+id_ns->rescap = block_pr_cap_to_nvme(blk_pr_cap);
 }
 
 static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
-- 
2.20.1
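
The body of block_pr_cap_to_nvme() does not appear in this excerpt. The following table-driven sketch shows one way such a capability translation could look. The NVMe bits are copied from the NvmePrCap enum in patch 06/10; the DEMO_BLK_PR_CAP_* values and the function name are placeholders invented for this example and are not QEMU's actual BLK_PR_CAP_* definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RESCAP bits, copied from the NvmePrCap enum in patch 06/10. */
enum {
    NVME_PR_CAP_PTPL     = 1 << 0,
    NVME_PR_CAP_WR_EX    = 1 << 1,
    NVME_PR_CAP_EX_AC    = 1 << 2,
    NVME_PR_CAP_WR_EX_RO = 1 << 3,
    NVME_PR_CAP_EX_AC_RO = 1 << 4,
    NVME_PR_CAP_WR_EX_AR = 1 << 5,
    NVME_PR_CAP_EX_AC_AR = 1 << 6,
};

/* ASSUMED block-layer bits: placeholders, not QEMU's BLK_PR_CAP_* values. */
enum {
    DEMO_BLK_PR_CAP_WR_EX    = 1 << 0,
    DEMO_BLK_PR_CAP_EX_AC    = 1 << 1,
    DEMO_BLK_PR_CAP_WR_EX_RO = 1 << 2,
    DEMO_BLK_PR_CAP_EX_AC_RO = 1 << 3,
    DEMO_BLK_PR_CAP_WR_EX_AR = 1 << 4,
    DEMO_BLK_PR_CAP_EX_AC_AR = 1 << 5,
    DEMO_BLK_PR_CAP_PTPL     = 1 << 6,
};

/* One row per capability: block-layer bit on the left, NVMe bit on the right. */
static const struct {
    uint8_t blk;
    uint8_t nvme;
} demo_cap_map[] = {
    { DEMO_BLK_PR_CAP_PTPL,     NVME_PR_CAP_PTPL },
    { DEMO_BLK_PR_CAP_WR_EX,    NVME_PR_CAP_WR_EX },
    { DEMO_BLK_PR_CAP_EX_AC,    NVME_PR_CAP_EX_AC },
    { DEMO_BLK_PR_CAP_WR_EX_RO, NVME_PR_CAP_WR_EX_RO },
    { DEMO_BLK_PR_CAP_EX_AC_RO, NVME_PR_CAP_EX_AC_RO },
    { DEMO_BLK_PR_CAP_WR_EX_AR, NVME_PR_CAP_WR_EX_AR },
    { DEMO_BLK_PR_CAP_EX_AC_AR, NVME_PR_CAP_EX_AC_AR },
};

/* Hypothetical translation: set each NVMe bit whose block-layer bit is set. */
static uint8_t demo_block_pr_cap_to_nvme(uint8_t blk_pr_cap)
{
    uint8_t rescap = 0;

    for (size_t i = 0; i < sizeof(demo_cap_map) / sizeof(demo_cap_map[0]); i++) {
        if (blk_pr_cap & demo_cap_map[i].blk) {
            rescap |= demo_cap_map[i].nvme;
        }
    }
    return rescap;
}
```

A lookup table keeps the two bit layouts independent, so neither side has to match the other's numbering.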

[PATCH v5 01/10] block: add persistent reservation in/out api

2024-06-06 Thread Changqi Lu

Add persistent reservation in/out operations
at the block level. The following operations
are included:

- read_keys:        retrieves the list of registered keys.
- read_reservation: retrieves the current reservation status.
- register:         registers a new reservation key.
- reserve:          initiates a reservation for a specific key.
- release:          releases a reservation for a specific key.
- clear:            clears all existing reservations.
- preempt:          preempts a reservation held by another key.
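
To make the semantics of the operations listed above concrete, here is a toy in-memory model. It is a deliberately simplified sketch: it ignores reservation types, PTPL and the ignore_key flag, and every name and return code in it is hypothetical rather than taken from the QEMU API added by this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PR_MAX_KEYS 4

enum { PR_OK = 0, PR_ERR_CONFLICT = -1, PR_ERR_INVALID = -2 };

typedef struct PrState {
    uint64_t keys[PR_MAX_KEYS];
    size_t num_keys;
    uint64_t holder;     /* key holding the reservation, 0 if none */
    uint32_t generation; /* bumped whenever the registrant set changes */
} PrState;

static bool pr_is_registered(const PrState *s, uint64_t key)
{
    for (size_t i = 0; i < s->num_keys; i++) {
        if (s->keys[i] == key) {
            return true;
        }
    }
    return false;
}

/* read_keys: copy up to max registered keys, return how many were copied. */
static size_t pr_read_keys(const PrState *s, uint64_t *out, size_t max)
{
    size_t n = s->num_keys < max ? s->num_keys : max;
    for (size_t i = 0; i < n; i++) {
        out[i] = s->keys[i];
    }
    return n;
}

/* register: add a new, non-zero key. */
static int pr_register(PrState *s, uint64_t key)
{
    if (key == 0 || pr_is_registered(s, key) || s->num_keys == PR_MAX_KEYS) {
        return PR_ERR_INVALID;
    }
    s->keys[s->num_keys++] = key;
    s->generation++;
    return PR_OK;
}

/* reserve: only a registrant may reserve, and only if nobody else holds it. */
static int pr_reserve(PrState *s, uint64_t key)
{
    if (!pr_is_registered(s, key) || (s->holder != 0 && s->holder != key)) {
        return PR_ERR_CONFLICT;
    }
    s->holder = key;
    return PR_OK;
}

/* release: drop the reservation if this key holds it. */
static int pr_release(PrState *s, uint64_t key)
{
    if (!pr_is_registered(s, key)) {
        return PR_ERR_CONFLICT;
    }
    if (s->holder == key) {
        s->holder = 0;
    }
    return PR_OK;
}

/* clear: remove the reservation and every registration. */
static int pr_clear(PrState *s, uint64_t key)
{
    if (!pr_is_registered(s, key)) {
        return PR_ERR_CONFLICT;
    }
    s->num_keys = 0;
    s->holder = 0;
    s->generation++;
    return PR_OK;
}

/* preempt: unregister the victim and take over its reservation, if any. */
static int pr_preempt(PrState *s, uint64_t key, uint64_t victim)
{
    if (key == victim || !pr_is_registered(s, key) ||
        !pr_is_registered(s, victim)) {
        return PR_ERR_CONFLICT;
    }
    for (size_t i = 0; i < s->num_keys; i++) {
        if (s->keys[i] == victim) {
            s->keys[i] = s->keys[--s->num_keys];
            break;
        }
    }
    if (s->holder == victim) {
        s->holder = key;
    }
    s->generation++;
    return PR_OK;
}
```

In this model two hosts register keys, one reserves, and the other can only take the reservation by preempting the first key.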

Signed-off-by: Changqi Lu 
Signed-off-by: zhenwei pi 
---
 block/block-backend.c | 397 ++
 block/io.c| 163 
 include/block/block-common.h  |  40 +++
 include/block/block-io.h  |  20 ++
 include/block/block_int-common.h  |  84 +++
 include/sysemu/block-backend-io.h |  24 ++
 6 files changed, 728 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index db6f9b92a3..6707d94df7 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1770,6 +1770,403 @@ BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
 return blk_aio_prwv(blk, req, 0, buf, blk_aio_ioctl_entry, 0, cb, opaque);
 }
 
+typedef struct BlkPrInCo {
+BlockBackend *blk;
+uint32_t *generation;
+uint32_t num_keys;
+BlockPrType *type;
+uint64_t *keys;
+int ret;
+} BlkPrInCo;
+
+typedef struct BlkPrInCB {
+BlockAIOCB common;
+BlkPrInCo prco;
+bool has_returned;
+} BlkPrInCB;
+
+static const AIOCBInfo blk_pr_in_aiocb_info = {
+.aiocb_size = sizeof(BlkPrInCB),
+};
+
+static void blk_pr_in_complete(BlkPrInCB *acb)
+{
+if (acb->has_returned) {
+acb->common.cb(acb->common.opaque, acb->prco.ret);
+blk_dec_in_flight(acb->prco.blk);
+qemu_aio_unref(acb);
+}
+}
+
+static void blk_pr_in_complete_bh(void *opaque)
+{
+BlkPrInCB *acb = opaque;
+assert(acb->has_returned);
+blk_pr_in_complete(acb);
+}
+
+static BlockAIOCB *blk_aio_pr_in(BlockBackend *blk, uint32_t *generation,
+ uint32_t num_keys, BlockPrType *type,
+ uint64_t *keys, CoroutineEntry co_entry,
+ BlockCompletionFunc *cb, void *opaque)
+{
+BlkPrInCB *acb;
+Coroutine *co;
+
+blk_inc_in_flight(blk);
+acb = blk_aio_get(&blk_pr_in_aiocb_info, blk, cb, opaque);
+acb->prco = (BlkPrInCo) {
+.blk= blk,
+.generation = generation,
+.num_keys   = num_keys,
+.type   = type,
+.ret= NOT_DONE,
+.keys   = keys,
+};
+acb->has_returned = false;
+
+co = qemu_coroutine_create(co_entry, acb);
+aio_co_enter(qemu_get_current_aio_context(), co);
+
+acb->has_returned = true;
+if (acb->prco.ret != NOT_DONE) {
+replay_bh_schedule_oneshot_event(qemu_get_current_aio_context(),
+ blk_pr_in_complete_bh, acb);
+}
+
+return &acb->common;
+}
+
+/* To be called between exactly one pair of blk_inc/dec_in_flight() */
+static int coroutine_fn
+blk_aio_pr_do_read_keys(BlockBackend *blk, uint32_t *generation,
+uint32_t num_keys, uint64_t *keys)
+{
+IO_CODE();
+
+blk_wait_while_drained(blk);
+GRAPH_RDLOCK_GUARD();
+
+if (!blk_co_is_available(blk)) {
+return -ENOMEDIUM;
+}
+
+return bdrv_co_pr_read_keys(blk_bs(blk), generation, num_keys, keys);
+}
+
+static void coroutine_fn blk_aio_pr_read_keys_entry(void *opaque)
+{
+BlkPrInCB *acb = opaque;
+BlkPrInCo *prco = &acb->prco;
+
+prco->ret = blk_aio_pr_do_read_keys(prco->blk, prco->generation,
+prco->num_keys, prco->keys);
+blk_pr_in_complete(acb);
+}
+
+BlockAIOCB *blk_aio_pr_read_keys(BlockBackend *blk, uint32_t *generation,
+ uint32_t num_keys, uint64_t *keys,
+ BlockCompletionFunc *cb, void *opaque)
+{
+IO_CODE();
+return blk_aio_pr_in(blk, generation, num_keys, NULL, keys,
+ blk_aio_pr_read_keys_entry, cb, opaque);
+}
+
+/* To be called between exactly one pair of blk_inc/dec_in_flight() */
+static int coroutine_fn
+blk_aio_pr_do_read_reservation(BlockBackend *blk, uint32_t *generation,
+   uint64_t *key, BlockPrType *type)
+{
+IO_CODE();
+
+blk_wait_while_drained(blk);
+GRAPH_RDLOCK_GUARD();
+
+if (!blk_co_is_available(blk)) {
+return -ENOMEDIUM;
+}
+
+return bdrv_co_pr_read_reservation(blk_bs(blk), generation, key, type);
+}
+
+static void coroutine_fn blk_aio_pr_read_reservation_entry(void *opaque)
+{
+BlkPrInCB *acb = opaque;
+BlkPrInCo *prco = &acb->prco;
+
+prco->ret = blk_aio_pr_do_read_reservation(prco->blk, prco->generation,
+   prco->keys, prco->

[PATCH v5 04/10] scsi/util: add helper functions for persistent reservation types conversion

2024-06-06 Thread Changqi Lu

This commit introduces helper functions that convert
persistent reservation types and capability masks
between the representation used in the SCSI protocol
and the one used in the block layer.

Signed-off-by: Changqi Lu 
Signed-off-by: zhenwei pi 
---
 include/scsi/utils.h |  8 +
 scsi/utils.c | 81 
 2 files changed, 89 insertions(+)

diff --git a/include/scsi/utils.h b/include/scsi/utils.h
index d5c8efa16e..89a0b082fb 100644
--- a/include/scsi/utils.h
+++ b/include/scsi/utils.h
@@ -1,6 +1,8 @@
 #ifndef SCSI_UTILS_H
 #define SCSI_UTILS_H
 
+#include "block/block-common.h"
+#include "scsi/constants.h"
 #ifdef CONFIG_LINUX
 #include 
 #endif
@@ -135,6 +137,12 @@ uint32_t scsi_data_cdb_xfer(uint8_t *buf);
 uint32_t scsi_cdb_xfer(uint8_t *buf);
 int scsi_cdb_length(uint8_t *buf);
 
+BlockPrType scsi_pr_type_to_block(SCSIPrType type);
+SCSIPrType block_pr_type_to_scsi(BlockPrType type);
+
+uint8_t scsi_pr_cap_to_block(uint16_t scsi_pr_cap);
+uint16_t block_pr_cap_to_scsi(uint8_t block_pr_cap);
+
 /* Linux SG_IO interface.  */
 #ifdef CONFIG_LINUX
 #define SG_ERR_DRIVER_TIMEOUT  0x06
diff --git a/scsi/utils.c b/scsi/utils.c
index 357b036671..0dfdeb499d 100644
--- a/scsi/utils.c
+++ b/scsi/utils.c
@@ -658,3 +658,84 @@ int scsi_sense_from_host_status(uint8_t host_status,
 }
 return GOOD;
 }
+
+BlockPrType scsi_pr_type_to_block(SCSIPrType type)
+{
+switch (type) {
+case SCSI_PR_WRITE_EXCLUSIVE:
+return BLK_PR_WRITE_EXCLUSIVE;
+case SCSI_PR_EXCLUSIVE_ACCESS:
+return BLK_PR_EXCLUSIVE_ACCESS;
+case SCSI_PR_WRITE_EXCLUSIVE_REGS_ONLY:
+return BLK_PR_WRITE_EXCLUSIVE_REGS_ONLY;
+case SCSI_PR_EXCLUSIVE_ACCESS_REGS_ONLY:
+return BLK_PR_EXCLUSIVE_ACCESS_REGS_ONLY;
+case SCSI_PR_WRITE_EXCLUSIVE_ALL_REGS:
+return BLK_PR_WRITE_EXCLUSIVE_ALL_REGS;
+case SCSI_PR_EXCLUSIVE_ACCESS_ALL_REGS:
+return BLK_PR_EXCLUSIVE_ACCESS_ALL_REGS;
+}
+
+return 0;
+}
+
+SCSIPrType block_pr_type_to_scsi(BlockPrType type)
+{
+switch (type) {
+case BLK_PR_WRITE_EXCLUSIVE:
+return SCSI_PR_WRITE_EXCLUSIVE;
+case BLK_PR_EXCLUSIVE_ACCESS:
+return SCSI_PR_EXCLUSIVE_ACCESS;
+case BLK_PR_WRITE_EXCLUSIVE_REGS_ONLY:
+return SCSI_PR_WRITE_EXCLUSIVE_REGS_ONLY;
+case BLK_PR_EXCLUSIVE_ACCESS_REGS_ONLY:
+return SCSI_PR_EXCLUSIVE_ACCESS_REGS_ONLY;
+case BLK_PR_WRITE_EXCLUSIVE_ALL_REGS:
+return SCSI_PR_WRITE_EXCLUSIVE_ALL_REGS;
+case BLK_PR_EXCLUSIVE_ACCESS_ALL_REGS:
+return SCSI_PR_EXCLUSIVE_ACCESS_ALL_REGS;
+}
+
+return 0;
+}
+
+
+uint8_t scsi_pr_cap_to_block(uint16_t scsi_pr_cap)
+{
+uint8_t res = 0;
+
+res |= (scsi_pr_cap & SCSI_PR_CAP_WR_EX) ?
+   BLK_PR_CAP_WR_EX : 0;
+res |= (scsi_pr_cap & SCSI_PR_CAP_EX_AC) ?
+   BLK_PR_CAP_EX_AC : 0;
+res |= (scsi_pr_cap & SCSI_PR_CAP_WR_EX_RO) ?
+   BLK_PR_CAP_WR_EX_RO : 0;
+res |= (scsi_pr_cap & SCSI_PR_CAP_EX_AC_RO) ?
+   BLK_PR_CAP_EX_AC_RO : 0;
+res |= (scsi_pr_cap & SCSI_PR_CAP_WR_EX_AR) ?
+   BLK_PR_CAP_WR_EX_AR : 0;
+res |= (scsi_pr_cap & SCSI_PR_CAP_EX_AC_AR) ?
+   BLK_PR_CAP_EX_AC_AR : 0;
+
+return res;
+}
+
+uint16_t block_pr_cap_to_scsi(uint8_t block_pr_cap)
+{
+uint16_t res = 0;
+
+res |= (block_pr_cap & BLK_PR_CAP_WR_EX) ?
+  SCSI_PR_CAP_WR_EX : 0;
+res |= (block_pr_cap & BLK_PR_CAP_EX_AC) ?
+  SCSI_PR_CAP_EX_AC : 0;
+res |= (block_pr_cap & BLK_PR_CAP_WR_EX_RO) ?
+  SCSI_PR_CAP_WR_EX_RO : 0;
+res |= (block_pr_cap & BLK_PR_CAP_EX_AC_RO) ?
+  SCSI_PR_CAP_EX_AC_RO : 0;
+res |= (block_pr_cap & BLK_PR_CAP_WR_EX_AR) ?
+  SCSI_PR_CAP_WR_EX_AR : 0;
+res |= (block_pr_cap & BLK_PR_CAP_EX_AC_AR) ?
+  SCSI_PR_CAP_EX_AC_AR : 0;
+
+return res;
+}
-- 
2.20.1
