Re: [PATCH v1 0/2] A Fixup for "QEMU: CXL mailbox rework and features (Part 1)"

2023-11-27 Thread Michael S. Tsirkin
On Mon, Nov 27, 2023 at 07:58:28PM +0900, Hyeonggon Yoo wrote:
> Hi, this is a fixup for the recent patch series "QEMU: CXL mailbox rework and
> features (Part 1)" [1].


To clarify, do you plan a v2 of this?

> This fixes two problems:
> 
>  1. Media Status in memory device status register not being correctly
>     read as "Disabled" while sanitization is in progress.
> 
>  2. QEMU assertion failure when it issues an MSI-X interrupt
>     (indicating the completion of the sanitize command).
> 
> [1] 
> https://lore.kernel.org/linux-cxl/20231023160806.13206-1-jonathan.came...@huawei.com
> 
> Hyeonggon Yoo (2):
>   hw/cxl/device: read from register values in mdev_reg_read()
>   hw/mem/cxl_type3: allocate more vectors for MSI-X
> 
>  hw/cxl/cxl-device-utils.c   | 17 +++--
>  hw/mem/cxl_type3.c  |  2 +-
>  include/hw/cxl/cxl_device.h |  4 +++-
>  3 files changed, 15 insertions(+), 8 deletions(-)
> 
> -- 
> 2.39.1




Re: [PATCH v2] migration: free 'channel' and 'addr' after their use in migration.c

2023-11-27 Thread Markus Armbruster
Het Gala  writes:

> 'channel' and 'addr' in qmp_migrate() and qmp_migrate_incoming() are
> not auto-freed. migrate_uri_parse() allocates memory which is
> returned to 'channel', which is leaked because there is no code for
> freeing 'channel' or 'addr'.
> So, free addr and channel to avoid memory leak. 'addr' does shallow
> copying of channel->addr, hence free 'channel' itself and deep free
> contents of 'addr'
>
> Fixes: 5994024f ("migration: Implement MigrateChannelList to qmp
> migration flow")
> Signed-off-by: Het Gala 
> ---
>  migration/migration.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 28a34c9068..29efb51b62 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2004,6 +2004,8 @@ void qmp_migrate(const char *uri, bool has_channels,
>MIGRATION_STATUS_FAILED);
>  block_cleanup_parameters();
>  }
> +g_free(channel);
> +qapi_free_MigrationAddress(addr);
>  
>  if (local_err) {
>  if (!resume_requested) {

See my review of v1.




Re: [PATCH] 'channel' and 'addr' in qmp_migrate() are not auto-freed. migrate_uri_parse() allocates memory which is returned to 'channel', which is leaked because there is no code for freeing 'channel

2023-11-27 Thread Markus Armbruster
Your commit message is all in one line.  You need to format it like

 migration: Plug memory leak

'channel' and 'addr' in qmp_migrate() are not auto-freed.
migrate_uri_parse() allocates memory which is returned to 'channel',
which is leaked because there is no code for freeing 'channel' or
'addr'.  So, free addr and channel to avoid memory leak.  'addr'
does shallow copying of channel->addr, hence free 'channel' itself
and deep free contents of 'addr'.

Het Gala  writes:

> Fixes: 5994024f ("migration: Implement MigrateChannelList to qmp
> migration flow")
> Signed-off-by: Het Gala 
> ---
>  migration/migration.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 28a34c9068..29efb51b62 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2004,6 +2004,8 @@ void qmp_migrate(const char *uri, bool has_channels,
>MIGRATION_STATUS_FAILED);
>  block_cleanup_parameters();
>  }
> +g_free(channel);
> +qapi_free_MigrationAddress(addr);
>  
>  if (local_err) {
>  if (!resume_requested) {

What makes you think there's a memory leak?  I can't see it.

qmp_migrate() has two callers:

1. qmp_marshal_migrate(), in generated qapi-commands-migration.c

   qmp_marshal_migrate() deserializes @args into @arg with a visitor.
   This allocates @arg and its members, including arg.channels.  It
   passes all the members of @arg to qmp_migrate().  And then it frees
   @arg and its members including arg.channels.

2. hmp_migrate()

   hmp_migrate() allocates @channel with migrate_uri_parse(), adds it to
   list @caps, passes @caps to qmp_migrate(), then frees @caps with
   qapi_free_MigrationChannelList().
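
To make the ownership pattern described for the two callers concrete, here is a minimal standalone sketch. All names in it (DemoChannel, DemoChannelList, demo_migrate() and so on) are hypothetical stand-ins, not the QAPI-generated types or the real QEMU functions. It only models the caller-owned case: the caller allocates the list, the callee borrows it, and the caller frees it. It says nothing about allocations made inside qmp_migrate() itself.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the QAPI-generated channel types. */
typedef struct DemoChannel {
    char *addr;
} DemoChannel;

typedef struct DemoChannelList {
    DemoChannel *value;
    struct DemoChannelList *next;
} DemoChannelList;

/* Number of allocations that have not been freed yet. */
static int live_allocs;

static void *track_alloc(size_t n)
{
    void *p = calloc(1, n);

    if (!p) {
        abort();
    }
    live_allocs++;
    return p;
}

static void track_free(void *p)
{
    if (p) {
        live_allocs--;
        free(p);
    }
}

/* Deep free of the whole list, done by whoever owns it. */
static void demo_list_free(DemoChannelList *list)
{
    while (list) {
        DemoChannelList *next = list->next;

        if (list->value) {
            track_free(list->value->addr);
            track_free(list->value);
        }
        track_free(list);
        list = next;
    }
}

/* Callee: borrows the list and frees nothing. */
static const char *demo_migrate(const DemoChannelList *channels)
{
    return channels->value->addr;
}

/* Caller: allocates the list, passes it down, then frees it. */
static int demo_caller(void)
{
    DemoChannelList *list = track_alloc(sizeof(*list));
    int ok;

    list->value = track_alloc(sizeof(*list->value));
    list->value->addr = track_alloc(16);
    snprintf(list->value->addr, 16, "tcp:0:4444");

    ok = demo_migrate(list) != NULL;
    demo_list_free(list);
    return ok;
}
```

Calling demo_caller() from a main() and then checking live_allocs should show that nothing allocated by the caller is left outstanding.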




Re: [PATCH v2 for-8.2] ppc/amigaone: Allow running AmigaOS without firmware image

2023-11-27 Thread Cédric Le Goater

On 11/28/23 02:47, Nicholas Piggin wrote:

On Tue Nov 28, 2023 at 2:37 AM AEST, Cédric Le Goater wrote:



I'm not sure, I don't think it's necessary if your minimal patch works.

I'll do a PR for 8.2 for SLOF and Skiboot updates, so happy to include
this as well.


I think this is a bit late for 8.2 to change FW images, well, at least
SLOF and skiboot. Are the new versions fixing something critical ?


Ah okay. Well then I can put them in next instead.

SLOF has a fix for virtio console over reboots, pretty minimal.


I see that commit dd4d4ea0add9 has :

  Fixes: cf28264 ("virtio-serial: Rework shutdown sequence")

Looks good for 8.2


skiboot has some bug fixes but it's a bigger change and maybe not
so important for QEMU.

Could they be merged in next release


Yes, it seems skiboot should be merged with chiptod support in 9.0.


and SLOF tagged with stable?

I think this amigaone patch could still be merged since it's only
touching a new machine and it's fixing an issue of missing firmware.


ARM does something similar with roms. See hw/arm/boot.c file.

It will need a "Fixes" tag.

Thanks,

C.




Re: [PATCH-for-9.0 03/16] target/arm/kvm: Have kvm_arm_add_vcpu_properties take a ARMCPU argument

2023-11-27 Thread Philippe Mathieu-Daudé

On 27/11/23 05:05, Gavin Shan wrote:

Hi Phil,

On 11/24/23 05:35, Philippe Mathieu-Daudé wrote:

Unify the "kvm_arm.h" API: All functions related to ARM vCPUs
take an ARMCPU* argument. Use the CPU() QOM cast macro when
calling the generic vCPU API from "sysemu/kvm.h".

Signed-off-by: Philippe Mathieu-Daudé 
---
  target/arm/kvm_arm.h | 4 ++--
  target/arm/cpu.c | 2 +-
  target/arm/kvm.c | 4 ++--
  3 files changed, 5 insertions(+), 5 deletions(-)



With the following comments resolved:

Reviewed-by: Gavin Shan 


diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index 50967f4ae9..6fb8a5f67e 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -153,7 +153,7 @@ void kvm_arm_set_cpu_features_from_host(ARMCPU *cpu);
   * Add all KVM specific CPU properties to the CPU object. These
   * are the CPU properties with "kvm-" prefixed names.
   */
-void kvm_arm_add_vcpu_properties(Object *obj);
+void kvm_arm_add_vcpu_properties(ARMCPU *cpu);


The function's description needs to be modified since @obj has been
renamed to @cpu?

   /**
    * kvm_arm_add_vcpu_properties:
    * @obj: The CPU object to add the properties to
    *
    */


Oops thanks!



Re: [PATCH-for-9.0 05/16] target/arm/kvm: Have kvm_arm_sve_get_vls take a ARMCPU argument

2023-11-27 Thread Philippe Mathieu-Daudé

Hi Gavin,

On 27/11/23 05:12, Gavin Shan wrote:

Hi Phil,

On 11/24/23 05:35, Philippe Mathieu-Daudé wrote:

Unify the "kvm_arm.h" API: All functions related to ARM vCPUs
take an ARMCPU* argument. Use the CPU() QOM cast macro when
calling the generic vCPU API from "sysemu/kvm.h".

Signed-off-by: Philippe Mathieu-Daudé 
---
  target/arm/kvm_arm.h | 6 +++---
  target/arm/cpu64.c   | 2 +-
  target/arm/kvm.c | 2 +-
  3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index 6fb8a5f67e..84f87f5ed7 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -129,13 +129,13 @@ void kvm_arm_destroy_scratch_host_vcpu(int *fdarray);

  /**
   * kvm_arm_sve_get_vls:
- * @cs: CPUState
+ * @cpu: ARMCPU
   *
   * Get all the SVE vector lengths supported by the KVM host, setting
   * the bits corresponding to their length in quadwords minus one
   * (vq - 1) up to ARM_MAX_VQ.  Return the resulting map.
   */
-uint32_t kvm_arm_sve_get_vls(CPUState *cs);
+uint32_t kvm_arm_sve_get_vls(ARMCPU *cpu);


Neither @cs nor @cpu is dereferenced in kvm_arm_sve_get_vls(). So I guess
the argument can simply be dropped?


If KVM eventually supports heterogeneous vCPUs such as big.LITTLE, we'd
dereference it. But then we'd have a major rework of the code.

Peter, do you have a preference?

Thanks,

Phil.



[PATCH v2] migration: free 'channel' and 'addr' after their use in migration.c

2023-11-27 Thread Het Gala
'channel' and 'addr' in qmp_migrate() and qmp_migrate_incoming() are
not auto-freed. migrate_uri_parse() allocates memory which is
returned to 'channel', which is leaked because there is no code for
freeing 'channel' or 'addr'.
So, free addr and channel to avoid memory leak. 'addr' does shallow
copying of channel->addr, hence free 'channel' itself and deep free
contents of 'addr'

Fixes: 5994024f ("migration: Implement MigrateChannelList to qmp
migration flow")
Signed-off-by: Het Gala 
---
 migration/migration.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 28a34c9068..29efb51b62 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2004,6 +2004,8 @@ void qmp_migrate(const char *uri, bool has_channels,
   MIGRATION_STATUS_FAILED);
 block_cleanup_parameters();
 }
+g_free(channel);
+qapi_free_MigrationAddress(addr);
 
 if (local_err) {
 if (!resume_requested) {
-- 
2.22.3




Re: [PATCH v6 2/3] hw/ppc: Add N1 chiplet model

2023-11-27 Thread Cédric Le Goater

On 11/27/23 18:13, Chalapathi V wrote:

The N1 chiplet handles the high speed i/o traffic over PCIe and others.
The N1 chiplet consists of PowerBus Fabric controller,
nest Memory Management Unit, chiplet control unit and more.

This commit creates an N1 chiplet model and initializes and realizes the
pervasive chiplet model where chiplet control registers are implemented.

This commit also implements the read/write methods for the powerbus scom
registers.

Signed-off-by: Chalapathi V 
---
  include/hw/ppc/pnv_n1_chiplet.h |  35 +++
  include/hw/ppc/pnv_xscom.h  |   6 ++
  hw/ppc/pnv_n1_chiplet.c | 171 
  hw/ppc/meson.build  |   1 +
  4 files changed, 213 insertions(+)
  create mode 100644 include/hw/ppc/pnv_n1_chiplet.h
  create mode 100644 hw/ppc/pnv_n1_chiplet.c

diff --git a/include/hw/ppc/pnv_n1_chiplet.h b/include/hw/ppc/pnv_n1_chiplet.h
new file mode 100644
index 00..3c42ada7f4
--- /dev/null
+++ b/include/hw/ppc/pnv_n1_chiplet.h
@@ -0,0 +1,35 @@
+/*
+ * QEMU PowerPC N1 chiplet model
+ *
+ * Copyright (c) 2023, IBM Corporation.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef PPC_PNV_N1_CHIPLET_H
+#define PPC_PNV_N1_CHIPLET_H
+
+#include "hw/ppc/pnv_nest_pervasive.h"
+
+#define TYPE_PNV_N1_CHIPLET "pnv-N1-chiplet"
+#define PNV_N1_CHIPLET(obj) OBJECT_CHECK(PnvN1Chiplet, (obj), TYPE_PNV_N1_CHIPLET)
+
+typedef struct pb_scom {
+uint64_t mode;
+uint64_t hp_mode2_curr;
+} pb_scom;


Please use CamelCase coding style.



+
+typedef struct PnvN1Chiplet {
+DeviceState parent;
+MemoryRegion xscom_pb_eq_regs;
+MemoryRegion xscom_pb_es_regs;


The MemoryRegion fields are generally called _mr or _iomem.


+/* common pervasive chiplet unit */
+PnvNestChipletPervasive nest_pervasive;
+pb_scom eq[8];
+pb_scom es[4];


Are these arrays the registers?


+} PnvN1Chiplet;
+#endif /*PPC_PNV_N1_CHIPLET_H */
diff --git a/include/hw/ppc/pnv_xscom.h b/include/hw/ppc/pnv_xscom.h
index 3e15706dec..535ae1dab0 100644
--- a/include/hw/ppc/pnv_xscom.h
+++ b/include/hw/ppc/pnv_xscom.h
@@ -173,6 +173,12 @@ struct PnvXScomInterfaceClass {
  #define PNV10_XSCOM_N1_CHIPLET_CTRL_REGS_BASE  0x300
  #define PNV10_XSCOM_CHIPLET_CTRL_REGS_SIZE 0x400
  
+#define PNV10_XSCOM_N1_PB_SCOM_EQ_BASE  0x3011000

+#define PNV10_XSCOM_N1_PB_SCOM_EQ_SIZE  0x200
+
+#define PNV10_XSCOM_N1_PB_SCOM_ES_BASE  0x3011300
+#define PNV10_XSCOM_N1_PB_SCOM_ES_SIZE  0x100
+
  #define PNV10_XSCOM_PEC_NEST_BASE  0x3011800 /* index goes downwards ... */
  #define PNV10_XSCOM_PEC_NEST_SIZE  0x100
  
diff --git a/hw/ppc/pnv_n1_chiplet.c b/hw/ppc/pnv_n1_chiplet.c

new file mode 100644
index 00..8e4c21dbf6
--- /dev/null
+++ b/hw/ppc/pnv_n1_chiplet.c
@@ -0,0 +1,171 @@
+/*
+ * QEMU PowerPC N1 chiplet model
+ *
+ * Copyright (c) 2023, IBM Corporation.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "hw/qdev-properties.h"
+#include "hw/ppc/pnv.h"
+#include "hw/ppc/pnv_xscom.h"
+#include "hw/ppc/pnv_n1_chiplet.h"
+#include "hw/ppc/pnv_nest_pervasive.h"
+
+/*
+ * The n1 chiplet contains chiplet control unit,
+ * PowerBus/RaceTrack/Bridge logic, nest Memory Management Unit(nMMU)
+ * and more.
+ *
+ * In this model Nest1 chiplet control registers are modelled via common
+ * nest pervasive model and few PowerBus racetrack registers are modelled.
+ */
+
+#define PB_SCOM_EQ0_HP_MODE2_CURR  0xe
+#define PB_SCOM_ES3_MODE   0x8a
+
+static uint64_t pnv_n1_chiplet_pb_scom_eq_read(void *opaque, hwaddr addr,
+  unsigned size)
+{
+PnvN1Chiplet *n1_chiplet = PNV_N1_CHIPLET(opaque);
+int reg = addr >> 3;
+uint64_t val = ~0ull;
+
+switch (reg) {
+case PB_SCOM_EQ0_HP_MODE2_CURR:
+val = n1_chiplet->eq[0].hp_mode2_curr;
+break;
+default:
+qemu_log_mask(LOG_UNIMP, "%s: Invalid xscom read at 0x%" PRIx32 "\n",
+  __func__, reg);
+}
+return val;
+}
+
+static void pnv_n1_chiplet_pb_scom_eq_write(void *opaque, hwaddr addr,
+   uint64_t val, unsigned size)
+{
+PnvN1Chiplet *n1_chiplet = PNV_N1_CHIPLET(opaque);
+int reg = addr >> 3;
+
+switch (reg) {
+case PB_SCOM_EQ0_HP_MODE2_CURR:
+n1_chiplet->eq[0].hp_mode2_curr = val;
+break;
+default:
+qemu_log_mask(LOG_UNIMP, "%s: Invalid xscom write at 0x%" PRIx32 "\n",
+  __func__, reg);
+}
+}
+
+static const MemoryRegionOps pnv_n1_chiplet_pb_scom_eq_ops = {
+.read = pnv_n1_chiplet_pb_scom_eq_read,
+.write = pnv_n1_chiplet_pb_scom_eq_write,
+

Re: [PATCH-for-9.0 08/16] target/arm/kvm: Have kvm_arm_pmu_init take a ARMCPU argument

2023-11-27 Thread Philippe Mathieu-Daudé

On 27/11/23 05:20, Gavin Shan wrote:

Hi Phil,

On 11/24/23 05:35, Philippe Mathieu-Daudé wrote:

Unify the "kvm_arm.h" API: All functions related to ARM vCPUs
take an ARMCPU* argument. Use the CPU() QOM cast macro when
calling the generic vCPU API from "sysemu/kvm.h".

Signed-off-by: Philippe Mathieu-Daudé 
---
  target/arm/kvm_arm.h | 4 ++--
  hw/arm/virt.c    | 2 +-
  target/arm/kvm.c | 6 +++---
  3 files changed, 6 insertions(+), 6 deletions(-)



One nit below, but I guess it doesn't matter.

Reviewed-by: Gavin Shan 


diff --git a/target/arm/kvm_arm.h b/target/arm/kvm_arm.h
index 0e12a008ab..fde1c45609 100644
--- a/target/arm/kvm_arm.h
+++ b/target/arm/kvm_arm.h
@@ -200,8 +200,8 @@ int kvm_arm_get_max_vm_ipa_size(MachineState *ms, bool *fixed_ipa);

  int kvm_arm_vgic_probe(void);
+void kvm_arm_pmu_init(ARMCPU *cpu);
  void kvm_arm_pmu_set_irq(CPUState *cs, int irq);
-void kvm_arm_pmu_init(CPUState *cs);


Why is the order of the declarations changed? I guess the reason is that
kvm_arm_pmu_init() is called prior to kvm_arm_pmu_set_irq().


Yes, exactly. Not worth mentioning IMHO.

Thanks!

Phil.




[PATCH] 'channel' and 'addr' in qmp_migrate() are not auto-freed. migrate_uri_parse() allocates memory which is returned to 'channel', which is leaked because there is no code for freeing 'channel' or

2023-11-27 Thread Het Gala
Fixes: 5994024f ("migration: Implement MigrateChannelList to qmp
migration flow")
Signed-off-by: Het Gala 
---
 migration/migration.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 28a34c9068..29efb51b62 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2004,6 +2004,8 @@ void qmp_migrate(const char *uri, bool has_channels,
   MIGRATION_STATUS_FAILED);
 block_cleanup_parameters();
 }
+g_free(channel);
+qapi_free_MigrationAddress(addr);
 
 if (local_err) {
 if (!resume_requested) {
-- 
2.22.3




Re: [PATCH 1/2] vhost-user: fix the reconnect error

2023-11-27 Thread Raphael Norwitz


> On Nov 23, 2023, at 12:54 AM, Li Feng  wrote:
> 
> If an error occurs in vhost_dev_init(), s->connected has already been set
> to true in advance, so there is no chance to execute this function again
> in the future.
> 
> Signed-off-by: Li Feng 

Reviewed-by: Raphael Norwitz 

> ---
> hw/block/vhost-user-blk.c   | 8 +++-
> hw/scsi/vhost-user-scsi.c   | 3 ++-
> hw/virtio/vhost-user-gpio.c | 3 ++-
> 3 files changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index 818b833108..2863d80d15 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -326,7 +326,6 @@ static int vhost_user_blk_connect(DeviceState *dev, Error **errp)
> if (s->connected) {
> return 0;
> }
> -s->connected = true;
> 
> s->dev.num_queues = s->num_queues;
> s->dev.nvqs = s->num_queues;
> @@ -343,15 +342,14 @@ static int vhost_user_blk_connect(DeviceState *dev, Error **errp)
> return ret;
> }
> 
> +s->connected = true;
> +
> /* restore vhost state */
> if (virtio_device_started(vdev, vdev->status)) {
> ret = vhost_user_blk_start(vdev, errp);
> -if (ret < 0) {
> -return ret;
> -}
> }
> 
> -return 0;
> +return ret;
> }
> 
> static void vhost_user_blk_disconnect(DeviceState *dev)
> diff --git a/hw/scsi/vhost-user-scsi.c b/hw/scsi/vhost-user-scsi.c
> index 4486500cac..2060f9f94b 100644
> --- a/hw/scsi/vhost-user-scsi.c
> +++ b/hw/scsi/vhost-user-scsi.c
> @@ -147,7 +147,6 @@ static int vhost_user_scsi_connect(DeviceState *dev, Error **errp)
> if (s->connected) {
> return 0;
> }
> -s->connected = true;
> 
> vsc->dev.num_queues = vs->conf.num_queues;
> vsc->dev.nvqs = VIRTIO_SCSI_VQ_NUM_FIXED + vs->conf.num_queues;
> @@ -161,6 +160,8 @@ static int vhost_user_scsi_connect(DeviceState *dev, Error **errp)
> return ret;
> }
> 
> +s->connected = true;
> +
> /* restore vhost state */
> if (virtio_device_started(vdev, vdev->status)) {
> ret = vhost_user_scsi_start(s, errp);
> diff --git a/hw/virtio/vhost-user-gpio.c b/hw/virtio/vhost-user-gpio.c
> index aff2d7eff6..a83437a5da 100644
> --- a/hw/virtio/vhost-user-gpio.c
> +++ b/hw/virtio/vhost-user-gpio.c
> @@ -229,7 +229,6 @@ static int vu_gpio_connect(DeviceState *dev, Error **errp)
> if (gpio->connected) {
> return 0;
> }
> -gpio->connected = true;
> 
> vhost_dev_set_config_notifier(vhost_dev, _ops);
> gpio->vhost_user.supports_config = true;
> @@ -243,6 +242,8 @@ static int vu_gpio_connect(DeviceState *dev, Error **errp)
> return ret;
> }
> 
> +gpio->connected = true;
> +
> /* restore vhost state */
> if (virtio_device_started(vdev, vdev->status)) {
> vu_gpio_start(vdev);
> -- 
> 2.42.0
> 
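
For readers following the reordering in the quoted diff, the effect can be sketched in isolation. The code below is a simplified model under stated assumptions, not the vhost-user code: DemoDev, demo_backend_init() and both connect functions are hypothetical names, and demo_backend_init() merely stands in for vhost_dev_init() by failing a configurable number of times.

```c
#include <stdbool.h>

/* Hypothetical device state; only 'connected' mirrors the patch. */
typedef struct DemoDev {
    bool connected;
    int init_attempts;
    int fail_first_n;   /* the first N init attempts fail */
} DemoDev;

/* Stand-in for vhost_dev_init(): fails the first fail_first_n calls. */
static int demo_backend_init(DemoDev *d)
{
    d->init_attempts++;
    return d->init_attempts <= d->fail_first_n ? -1 : 0;
}

/* Old ordering: the flag is set before init, so a failure sticks. */
static int demo_connect_flag_first(DemoDev *d)
{
    if (d->connected) {
        return 0;
    }
    d->connected = true;
    return demo_backend_init(d);
}

/* Patched ordering: the flag is set only after init has succeeded. */
static int demo_connect_flag_after_init(DemoDev *d)
{
    int ret;

    if (d->connected) {
        return 0;
    }
    ret = demo_backend_init(d);
    if (ret < 0) {
        return ret;
    }
    d->connected = true;
    return 0;
}
```

With fail_first_n set to 1, a second call to demo_connect_flag_first() returns early without retrying the init, while demo_connect_flag_after_init() runs the init again and only then marks the device connected.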




Re: [PATCH 2/2] vhost-user-scsi: free the inflight area when reset

2023-11-27 Thread Raphael Norwitz



> On Nov 23, 2023, at 12:54 AM, Li Feng  wrote:
> 
> Keep it the same as vhost-user-blk.
> At the same time, fix the vhost_reset_device.
> 
> Signed-off-by: Li Feng 

Reviewed-by: Raphael Norwitz 

> ---
> hw/scsi/vhost-user-scsi.c | 16 
> hw/virtio/virtio.c|  2 +-
> 2 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/scsi/vhost-user-scsi.c b/hw/scsi/vhost-user-scsi.c
> index 2060f9f94b..780f10559d 100644
> --- a/hw/scsi/vhost-user-scsi.c
> +++ b/hw/scsi/vhost-user-scsi.c
> @@ -360,6 +360,20 @@ static Property vhost_user_scsi_properties[] = {
> DEFINE_PROP_END_OF_LIST(),
> };
> 
> +static void vhost_user_scsi_reset(VirtIODevice *vdev)
> +{
> +VHostUserSCSI *s = VHOST_USER_SCSI(vdev);
> +VHostSCSICommon *vsc = VHOST_SCSI_COMMON(s);
> +
> +vhost_dev_free_inflight(vsc->inflight);
> +}
> +
> +static struct vhost_dev *vhost_user_scsi_get_vhost(VirtIODevice *vdev)
> +{
> +VHostSCSICommon *vsc = VHOST_SCSI_COMMON(vdev);
> +return &vsc->dev;
> +}
> +
> static const VMStateDescription vmstate_vhost_scsi = {
> .name = "virtio-scsi",
> .minimum_version_id = 1,
> @@ -385,6 +399,8 @@ static void vhost_user_scsi_class_init(ObjectClass *klass, void *data)
> vdc->set_config = vhost_scsi_common_set_config;
> vdc->set_status = vhost_user_scsi_set_status;
> fwc->get_dev_path = vhost_scsi_common_get_fw_dev_path;
> +vdc->reset = vhost_user_scsi_reset;
> +vdc->get_vhost = vhost_user_scsi_get_vhost;
> }
> 
> static void vhost_user_scsi_instance_init(Object *obj)
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 4259fefeb6..d0a640af63 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -2137,7 +2137,7 @@ void virtio_reset(void *opaque)
> vdev->device_endian = virtio_default_endian();
> }
> 
> -if (vdev->vhost_started) {
> +if (vdev->vhost_started && k->get_vhost) {
> vhost_reset_device(k->get_vhost(vdev));
> }
> 
> -- 
> 2.42.0
> 
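
The virtio_reset() hunk quoted above guards an optional class callback before calling it. A minimal sketch of that pattern follows, assuming simplified hypothetical types (DemoClass, DemoDevice, DemoBackend) rather than the real VirtioDeviceClass and struct vhost_dev.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for the device and class structs. */
typedef struct DemoBackend {
    int resets;
} DemoBackend;

typedef struct DemoDevice DemoDevice;

typedef struct DemoClass {
    /* Optional hook: device types without a backend leave it NULL. */
    DemoBackend *(*get_vhost)(DemoDevice *dev);
} DemoClass;

struct DemoDevice {
    const DemoClass *klass;
    bool vhost_started;
    DemoBackend backend;
};

static DemoBackend *demo_get_backend(DemoDevice *dev)
{
    return &dev->backend;
}

/* Returns true if the backend was reset, false if the step was skipped. */
static bool demo_reset(DemoDevice *dev)
{
    const DemoClass *k = dev->klass;

    /* Only call the hook when the class actually provides one. */
    if (dev->vhost_started && k->get_vhost) {
        k->get_vhost(dev)->resets++;
        return true;
    }
    return false;
}
```

A class that leaves get_vhost NULL simply skips the backend reset instead of calling through a NULL function pointer.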




Re: [PATCH v1 1/2] hw/cxl/device: read from register values in mdev_reg_read()

2023-11-27 Thread Davidlohr Bueso

On Tue, 28 Nov 2023, Hyeonggon Yoo wrote:


All of them make sense to me. I will adjust, thanks!

But I'm not confident enough to write a single description for all the
changes, so I will split it into a few patches. May I add your Suggested-by
(or Signed-off-by) in v2, as it will contain some part of your idea and code?


Sure, feel free to split as you see fit.



Re: [PATCH v6 3/3] hw/ppc: N1 chiplet wiring

2023-11-27 Thread Nicholas Piggin
On Tue Nov 28, 2023 at 3:13 AM AEST, Chalapathi V wrote:
> This part of the patchset connects the nest1 chiplet model to p10 chip.
>
> Signed-off-by: Chalapathi V 
> ---
>  include/hw/ppc/pnv_chip.h |  2 ++
>  hw/ppc/pnv.c  | 15 +++
>  2 files changed, 17 insertions(+)
>
> diff --git a/include/hw/ppc/pnv_chip.h b/include/hw/ppc/pnv_chip.h
> index 0ab5c42308..9b06c8d87c 100644
> --- a/include/hw/ppc/pnv_chip.h
> +++ b/include/hw/ppc/pnv_chip.h
> @@ -4,6 +4,7 @@
>  #include "hw/pci-host/pnv_phb4.h"
>  #include "hw/ppc/pnv_core.h"
>  #include "hw/ppc/pnv_homer.h"
> +#include "hw/ppc/pnv_n1_chiplet.h"
>  #include "hw/ppc/pnv_lpc.h"
>  #include "hw/ppc/pnv_occ.h"
>  #include "hw/ppc/pnv_psi.h"
> @@ -113,6 +114,7 @@ struct Pnv10Chip {
>  PnvOCC   occ;
>  PnvSBE   sbe;
>  PnvHomer homer;
> +PnvN1Chiplet n1_chiplet;
>  
>  uint32_t nr_quads;
>  PnvQuad  *quads;
> diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
> index 0297871bdd..6cf1f3319f 100644
> --- a/hw/ppc/pnv.c
> +++ b/hw/ppc/pnv.c
> @@ -1680,6 +1680,8 @@ static void pnv_chip_power10_instance_init(Object *obj)
>  object_initialize_child(obj, "occ",  >occ, TYPE_PNV10_OCC);
>  object_initialize_child(obj, "sbe",  >sbe, TYPE_PNV10_SBE);
>  object_initialize_child(obj, "homer", >homer, TYPE_PNV10_HOMER);
> +object_initialize_child(obj, "n1_chiplet", >n1_chiplet,
> +TYPE_PNV_N1_CHIPLET);

Another very small nit: we seem to have a convention of hyphens rather than
underscores for these names, so n1-chiplet fits better.

Reviewed-by: Nicholas Piggin 

>  
>  chip->num_pecs = pcc->num_pecs;
>  
> @@ -1849,6 +1851,19 @@ static void pnv_chip_power10_realize(DeviceState *dev, Error **errp)
>  memory_region_add_subregion(get_system_memory(), PNV10_HOMER_BASE(chip),
>  >homer.regs);
>  
> +/* N1 chiplet */
> +if (!qdev_realize(DEVICE(>n1_chiplet), NULL, errp)) {
> +return;
> +}
> +pnv_xscom_add_subregion(chip, PNV10_XSCOM_N1_CHIPLET_CTRL_REGS_BASE,
> + >n1_chiplet.nest_pervasive.xscom_ctrl_regs);
> +
> +pnv_xscom_add_subregion(chip, PNV10_XSCOM_N1_PB_SCOM_EQ_BASE,
> +   >n1_chiplet.xscom_pb_eq_regs);
> +
> +pnv_xscom_add_subregion(chip, PNV10_XSCOM_N1_PB_SCOM_ES_BASE,
> +   >n1_chiplet.xscom_pb_es_regs);
> +
>  /* PHBs */
>  pnv_chip_power10_phb_realize(chip, _err);
>  if (local_err) {




Re: [PATCH v6 2/3] hw/ppc: Add N1 chiplet model

2023-11-27 Thread Nicholas Piggin
On Tue Nov 28, 2023 at 3:13 AM AEST, Chalapathi V wrote:
> The N1 chiplet handles the high speed i/o traffic over PCIe and others.
> The N1 chiplet consists of PowerBus Fabric controller,
> nest Memory Management Unit, chiplet control unit and more.
>
> This commit creates an N1 chiplet model and initializes and realizes the
> pervasive chiplet model where chiplet control registers are implemented.
>
> This commit also implements the read/write methods for the powerbus scom
> registers.
>
> Signed-off-by: Chalapathi V 
> ---
>  include/hw/ppc/pnv_n1_chiplet.h |  35 +++
>  include/hw/ppc/pnv_xscom.h  |   6 ++
>  hw/ppc/pnv_n1_chiplet.c | 171 
>  hw/ppc/meson.build  |   1 +
>  4 files changed, 213 insertions(+)
>  create mode 100644 include/hw/ppc/pnv_n1_chiplet.h
>  create mode 100644 hw/ppc/pnv_n1_chiplet.c
>
> diff --git a/include/hw/ppc/pnv_n1_chiplet.h b/include/hw/ppc/pnv_n1_chiplet.h
> new file mode 100644
> index 00..3c42ada7f4
> --- /dev/null
> +++ b/include/hw/ppc/pnv_n1_chiplet.h
> @@ -0,0 +1,35 @@
> +/*
> + * QEMU PowerPC N1 chiplet model
> + *
> + * Copyright (c) 2023, IBM Corporation.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.

Same question about the tag here and in the .c. Otherwise,

Reviewed-by: Nicholas Piggin 

Thanks,
Nick



Re: [PATCH v6 1/3] hw/ppc: Add pnv nest pervasive common chiplet model

2023-11-27 Thread Nicholas Piggin
On Tue Nov 28, 2023 at 3:13 AM AEST, Chalapathi V wrote:
> A POWER10 chip is divided into logical pieces called chiplets. Chiplets
> are broadly divided into "core chiplets" (with the processor cores) and
> "nest chiplets" (with everything else). Each chiplet has an attachment
> to the pervasive bus (PIB) and chiplet-specific registers. All nest
> chiplets have a common basic set of registers, and this model provides
> the register functionality for the common registers of nest chiplets
> (Pervasive Chiplet, PB Chiplet, PCI Chiplets, MC Chiplet, PAU Chiplets).
>
> This commit implements the read/write functions of chiplet control registers.
>
> Signed-off-by: Chalapathi V 
> ---
>  include/hw/ppc/pnv_nest_pervasive.h |  36 +
>  include/hw/ppc/pnv_xscom.h  |   3 +
>  hw/ppc/pnv_nest_pervasive.c | 219 
>  hw/ppc/meson.build  |   1 +
>  4 files changed, 259 insertions(+)
>  create mode 100644 include/hw/ppc/pnv_nest_pervasive.h
>  create mode 100644 hw/ppc/pnv_nest_pervasive.c
>
> diff --git a/include/hw/ppc/pnv_nest_pervasive.h b/include/hw/ppc/pnv_nest_pervasive.h
> new file mode 100644
> index 00..9f11531f52
> --- /dev/null
> +++ b/include/hw/ppc/pnv_nest_pervasive.h
> @@ -0,0 +1,36 @@
> +/*
> + * QEMU PowerPC nest pervasive common chiplet model
> + *
> + * Copyright (c) 2023, IBM Corporation.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.

Shouldn't need this line with the SPDX tag I think? There are a couple
of dozen files already in the tree that do have both, but hundreds that
only have the tag.

> + */
> +
> +#ifndef PPC_PNV_NEST_PERVASIVE_H
> +#define PPC_PNV_NEST_PERVASIVE_H
> +
> +#define TYPE_PNV_NEST_PERVASIVE "pnv-nest-chiplet-pervasive"
> +#define PNV_NEST_PERVASIVE(obj) OBJECT_CHECK(PnvNestChipletPervasive, (obj), TYPE_PNV_NEST_PERVASIVE)

_NEST_CHIPLET_PERVASIVE?

> +
> +typedef struct PnvPervasiveCtrlRegs {
> +#define CPLT_CTRL_SIZE 6
> +uint64_t cplt_ctrl[CPLT_CTRL_SIZE];
> +uint64_t cplt_cfg0;
> +uint64_t cplt_cfg1;
> +uint64_t cplt_stat0;
> +uint64_t cplt_mask0;
> +uint64_t ctrl_protect_mode;
> +uint64_t ctrl_atomic_lock;
> +} PnvPervasiveCtrlRegs;
> +
> +typedef struct PnvNestChipletPervasive {
> +DeviceState parent;
> +char *parent_obj_name;
> +MemoryRegion xscom_ctrl_regs;
> +PnvPervasiveCtrlRegs control_regs;
> +} PnvNestChipletPervasive;

The file name doesn't quite match the type name, but that's probably
okay. This could be a good place for other misc pervasive helpers,
so keeping the name more general is fine.


> +
> +#endif /*PPC_PNV_NEST_PERVASIVE_H */
> diff --git a/include/hw/ppc/pnv_xscom.h b/include/hw/ppc/pnv_xscom.h
> index f5becbab41..3e15706dec 100644
> --- a/include/hw/ppc/pnv_xscom.h
> +++ b/include/hw/ppc/pnv_xscom.h
> @@ -170,6 +170,9 @@ struct PnvXScomInterfaceClass {
>  #define PNV10_XSCOM_XIVE2_BASE 0x2010800
>  #define PNV10_XSCOM_XIVE2_SIZE 0x400
>  
> +#define PNV10_XSCOM_N1_CHIPLET_CTRL_REGS_BASE  0x300
> +#define PNV10_XSCOM_CHIPLET_CTRL_REGS_SIZE 0x400
> +
>  #define PNV10_XSCOM_PEC_NEST_BASE  0x3011800 /* index goes downwards ... */
>  #define PNV10_XSCOM_PEC_NEST_SIZE  0x100
>  
> diff --git a/hw/ppc/pnv_nest_pervasive.c b/hw/ppc/pnv_nest_pervasive.c
> new file mode 100644
> index 00..0575f87e8f
> --- /dev/null
> +++ b/hw/ppc/pnv_nest_pervasive.c
> @@ -0,0 +1,219 @@
> +/*
> + * QEMU PowerPC nest pervasive common chiplet model
> + *
> + * Copyright (c) 2023, IBM Corporation.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/log.h"
> +#include "hw/qdev-properties.h"
> +#include "hw/ppc/pnv.h"
> +#include "hw/ppc/pnv_xscom.h"
> +#include "hw/ppc/pnv_nest_pervasive.h"
> +
> +/*
> + * Status, configuration, and control units in POWER chips is provided
> + * by the pervasive subsystem, which connects registers to the SCOM bus,
> + * which can be programmed by processor cores, other units on the chip,
> + * BMCs, or other POWER chips.
> + *
> + * A POWER10 chip is divided into logical pieces called chiplets. Chiplets
> + * are broadly divided into "core chiplets" (with the processor cores) and
> + * "nest chiplets" (with everything else). Each chiplet has an attachment
> + * to the pervasive bus (PIB) and with chiplet-specific registers.
> + * All nest chiplets have a common basic set of registers.
> + *
> + * This model will provide the register functionality for common registers of
> + * nest unit (PB Chiplet, PCI Chiplets, MC Chiplet, PAU Chiplets)
> + *
> + * Currently this model provides the read/write functionality of chiplet control
> + * 

Re: [PATCH v3 0/4] Add BHRB Facility Support

2023-11-27 Thread Nicholas Piggin
On Tue Nov 28, 2023 at 6:51 AM AEST, Miles Glenn wrote:
> On Wed, 2023-10-18 at 10:59 -0500, Miles Glenn wrote:
> > On Thu, 2023-10-19 at 01:06 +1000, Nicholas Piggin wrote:
> > > On Tue Sep 26, 2023 at 3:43 AM AEST, Glenn Miles wrote:
> > > > This is a series of patches for adding support for the Branch History
> > > > Rolling Buffer (BHRB) facility.  This was added to the Power ISA
> > > > starting with version 2.07.  Changes were subsequently made in version
> > > > 3.1 to limit BHRB recording to instructions run in problem state only
> > > > and to add a control bit to disable recording (MMCRA[BHRBRD]).
> > > > 
> > > > Version 3 of this series disables branch recording on P8 and P9 due
> > > > to a drop in performance caused by recording branches outside of
> > > > problem state.
> > > 
> > > Thanks for these, they all look good to me.
> > > 
> > > With P10 CPU, Linux perf branch recording appears to work with this
> > > series, and I confirmed that Linux does disable BHRB in MMCRA at
> > > boot, so it should not take the performance hit.
> > > 
> > > It had a couple of compile bugs, no matter I fixed them, but I often
> > > trip over ppc32 and user-mode when working on TCG too, so building
> > > with --target-list including ppc64-softmmu,ppc-softmmu,
> > > ppc64-linux-user,ppc64le-linux-user,ppc-linux-user is good to catch
> > > those.
> > > Thanks,
> > > Nick
> > > 
> > 
> > Thanks, Nick.  I'll have to remember that for next time!
> > 
> > Glenn
> > 
>
> Hi Nick,
>
> Is there anything else you need me to do for this series to be merged?

Hey Glenn,

No, sorry I just missed this cycle. I have it queued up, it'll have to
go in after 8.2 release.

Thanks,
Nick



Re: [PATCH v2 for-8.2] ppc/amigaone: Allow running AmigaOS without firmware image

2023-11-27 Thread Nicholas Piggin
On Tue Nov 28, 2023 at 2:37 AM AEST, Cédric Le Goater wrote:
>
> > I'm not sure, I don't think it's necessary if your minimal patch works.
> > 
> > I'll do a PR for 8.2 for SLOF and Skiboot updates, so happy to include
> > this as well.
>
> I think this is a bit late for 8.2 to change FW images, well, at least
> SLOF and skiboot. Are the new versions fixing something critical ?

Ah okay. Well then I can put them in next instead.

SLOF has a fix for virtio console over reboots, pretty minimal.
skiboot has some bug fixes but it's a bigger change and maybe not
so important for QEMU.

Could they be merged in next release and SLOF tagged with stable?

I think this amigaone patch could still be merged since it's only
touching a new machine and it's fixing an issue of missing firmware.

Thanks,
Nick



[PATCH v3] ppc/amigaone: Allow running AmigaOS without firmware image

2023-11-27 Thread BALATON Zoltan
The machine uses a modified U-Boot under GPL license but the sources
of it are lost with only a binary available so it cannot be included
in QEMU. Allow running without the firmware image which can be used
when calling a boot loader directly and thus simplifying booting
guests. We need a small routine that AmigaOS calls from ROM which is
added in this case to allow booting AmigaOS without external firmware
image.

Signed-off-by: BALATON Zoltan 
---
v3: Instead of -bios none do this when no -bios option given, use
constants for address and rom_add_blob_fixed() to add dummy_fw.
This makes both code and usage a bit simpler.

 hw/ppc/amigaone.c | 35 +++
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/hw/ppc/amigaone.c b/hw/ppc/amigaone.c
index 992a55e632..ddfa09457a 100644
--- a/hw/ppc/amigaone.c
+++ b/hw/ppc/amigaone.c
@@ -36,10 +36,19 @@
  * -device VGA,romfile=VGABIOS-lgpl-latest.bin
  * from http://www.nongnu.org/vgabios/ instead.
  */
-#define PROM_FILENAME "u-boot-amigaone.bin"
 #define PROM_ADDR 0xfff0
 #define PROM_SIZE (512 * KiB)
 
+/* AmigaOS calls this routine from ROM, use this if no firmware loaded */
+static const char dummy_fw[] = {
+0x38, 0x00, 0x00, 0x08, /* li  r0,8 */
+0x7c, 0x09, 0x03, 0xa6, /* mtctr   r0 */
+0x54, 0x63, 0xf8, 0x7e, /* srwi    r3,r3,1 */
+0x42, 0x00, 0xff, 0xfc, /* bdnz    0x8 */
+0x7c, 0x63, 0x18, 0xf8, /* not r3,r3 */
+0x4e, 0x80, 0x00, 0x20, /* blr */
+};
+
 static void amigaone_cpu_reset(void *opaque)
 {
 PowerPCCPU *cpu = opaque;
@@ -60,8 +69,6 @@ static void amigaone_init(MachineState *machine)
 PowerPCCPU *cpu;
 CPUPPCState *env;
 MemoryRegion *rom, *pci_mem, *mr;
-const char *fwname = machine->firmware ?: PROM_FILENAME;
-char *filename;
 ssize_t sz;
 PCIBus *pci_bus;
 Object *via;
@@ -94,20 +101,24 @@ static void amigaone_init(MachineState *machine)
 }
 
 /* allocate and load firmware */
-filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, fwname);
-if (filename) {
-rom = g_new(MemoryRegion, 1);
-memory_region_init_rom(rom, NULL, "rom", PROM_SIZE, &error_fatal);
-memory_region_add_subregion(get_system_memory(), PROM_ADDR, rom);
+rom = g_new(MemoryRegion, 1);
+memory_region_init_rom(rom, NULL, "rom", PROM_SIZE, &error_fatal);
+memory_region_add_subregion(get_system_memory(), PROM_ADDR, rom);
+if (!machine->firmware) {
+rom_add_blob_fixed("dummy-fw", dummy_fw, sizeof(dummy_fw),
+   PROM_ADDR + PROM_SIZE - 0x80);
+} else {
+g_autofree char *filename = qemu_find_file(QEMU_FILE_TYPE_BIOS,
+   machine->firmware);
+if (!filename) {
+error_report("Could not find firmware '%s'", machine->firmware);
+exit(1);
+}
 sz = load_image_targphys(filename, PROM_ADDR, PROM_SIZE);
 if (sz <= 0 || sz > PROM_SIZE) {
 error_report("Could not load firmware '%s'", filename);
 exit(1);
 }
-g_free(filename);
-} else if (!qtest_enabled()) {
-error_report("Could not find firmware '%s'", fwname);
-exit(1);
 }
 
 /* Articia S */
-- 
2.30.9
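
For reference, the six-instruction dummy_fw stub in the patch above can be modelled in plain C. This is only a sketch read off the disassembly comments in the patch: the function name dummy_fw_emulate is hypothetical, and treating r3 as a 32-bit unsigned value is an assumption. What AmigaOS actually expects from the routine is not stated in the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical C model of the dummy_fw stub, assuming r3 is a 32-bit
 * unsigned argument/return register.
 */
static inline uint32_t dummy_fw_emulate(uint32_t r3)
{
    uint32_t ctr = 8;            /* li r0,8 ; mtctr r0 */

    do {
        r3 >>= 1;                /* srwi r3,r3,1 */
    } while (--ctr != 0);        /* bdnz: decrement CTR, loop while non-zero */

    return ~r3;                  /* not r3,r3 ; blr */
}

/* Self-check: eight one-bit shifts equal one shift by eight. */
static inline void dummy_fw_selfcheck(uint32_t value)
{
    assert(dummy_fw_emulate(value) == (uint32_t)~(value >> 8));
}
```

In other words, under these assumptions the stub returns the bitwise complement of its argument shifted right by eight bits.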




Re: [EXTERNAL] [PATCH v3 2/5] xen: backends: don't overwrite XenStore nodes created by toolstack

2023-11-27 Thread Volodymyr Babchuk
Hi David,

Thank you for the review.

David Woodhouse  writes:

> [[S/MIME Signed Part:Undecided]]
> On Fri, 2023-11-24 at 23:24 +, Volodymyr Babchuk wrote:
>> Xen PV devices in QEMU can be created in two ways: either by QEMU
>> itself, if they were passed via command line, or by Xen toolstack. In
>> the latter case, QEMU scans XenStore entries and configures devices
>> accordingly.
>> 
>> In the second case we don't want QEMU to write/delete front-end
>> entries for two reasons: it might have no access to those entries if
>> it is running in un-privileged domain and it is just incorrect to
>> overwrite entries already provided by Xen toolstack, because
>> toolstack
>> manages those nodes. For example, it might read backend- or frontend-
>> state to be sure that they are both disconnected and it is safe to
>> destroy a domain.
>> 
>> This patch checks presence of xendev->backend to check if Xen PV
>> device was configured by Xen toolstack to decide if it should touch
>> frontend entries in XenStore. Also, when we need to remove XenStore
>> entries during device teardown only if they weren't created by Xen
>> toolstack. If they were created by toolstack, then it is toolstack's
>> job to do proper clean-up.
>> 
>> Suggested-by: Paul Durrant 
>> Suggested-by: David Woodhouse 
>> Co-Authored-by: Oleksandr Tyshchenko 
>> Signed-off-by: Volodymyr Babchuk 
>
> Reviewed-by: David Woodhouse 
>
> ... albeit with a couple of suggestions... 
>
>> diff --git a/hw/char/xen_console.c b/hw/char/xen_console.c
>> index bef8a3a621..b52abf 100644
>> --- a/hw/char/xen_console.c
>> +++ b/hw/char/xen_console.c
>> @@ -450,7 +450,7 @@ static void xen_console_realize(XenDevice *xendev, Error 
>> **errp)
>> 
>>  trace_xen_console_realize(con->dev, object_get_typename(OBJECT(cs)));
>> 
>> -    if (CHARDEV_IS_PTY(cs)) {
>> +    if (CHARDEV_IS_PTY(cs) && !xendev->backend) {
>>  /* Strip the leading 'pty:' */
>>  xen_device_frontend_printf(xendev, "tty", "%s", cs->filename + 4);
>>  }
>
>
> It's kind of weird that that one is a frontend node at all; surely it
> should have been a backend node?

Yeah, AFAIK, console was the first PV driver, so it is a bit strange.
As I see, this frontend entry is used by "xl console" tool to find PTY
device to attach to. So yes, it should be in backend part of the
xenstore. But I don't believe we can fix this right now.

> But it is known only to QEMU once it
> actually opens /dev/ptmx and creates a new pty. It can't be populated
> by the toolstack in advance.
>
> So shouldn't the toolstack have made it writable by the driver domain?

That could lead to a weird situation where a user in Dom-0 tries to use
the "xl console" command to attach to a console that is absent in Dom-0,
because the "tty" entry points to a PTY in the driver domain.

> I think we should attempt to write this and just gracefully handle the
> failure if we can't. (In fact, xen_device_frontend_printf() will just
> use error_report_err() which is probably OK unless you feel strongly
> about silencing it).

Nope, I am fine with this approach. I'll remove this hunk in the next
version.

>
>> diff --git a/hw/net/xen_nic.c b/hw/net/xen_nic.c
>> index afa10c96e8..27442bef38 100644
>> --- a/hw/net/xen_nic.c
>> +++ b/hw/net/xen_nic.c
>> @@ -315,14 +315,16 @@ static void xen_netdev_realize(XenDevice *xendev, 
>> Error **errp)
>> 
>>  qemu_macaddr_default_if_unset(&netdev->conf.macaddr);
>> 
>> -    xen_device_frontend_printf(xendev, "mac", 
>> "%02x:%02x:%02x:%02x:%02x:%02x",
>> -   netdev->conf.macaddr.a[0],
>> -   netdev->conf.macaddr.a[1],
>> -   netdev->conf.macaddr.a[2],
>> -   netdev->conf.macaddr.a[3],
>> -   netdev->conf.macaddr.a[4],
>> -   netdev->conf.macaddr.a[5]);
>> -
>> +    if (!xendev->backend) {
>> +    xen_device_frontend_printf(xendev, "mac",
>> +   "%02x:%02x:%02x:%02x:%02x:%02x",
>> +   netdev->conf.macaddr.a[0],
>> +   netdev->conf.macaddr.a[1],
>> +   netdev->conf.macaddr.a[2],
>> +   netdev->conf.macaddr.a[3],
>> +   netdev->conf.macaddr.a[4],
>> +   netdev->conf.macaddr.a[5]);
>> +    }
>>  netdev->nic = qemu_new_nic(_xen_info, >conf,
>>     object_get_typename(OBJECT(xendev)),
>>     DEVICE(xendev)->id, netdev);
>
>
> Perhaps here you should create the "mac" node if it doesn't exist (or
> is that "if it doesn't match netdev->conf.macaddr"?) and just
> gracefully accept failure too?
>

I am not sure that I got this right. conf.macaddr can be set in two
ways: via xen_net_device_create(), which will fail if the toolstack didn't
provide a MAC address, or via 

Re: [PATCH v1 1/2] hw/cxl/device: read from register values in mdev_reg_read()

2023-11-27 Thread Hyeonggon Yoo
On Tue, Nov 28, 2023 at 5:27 AM Davidlohr Bueso  wrote:
>
> On Mon, 27 Nov 2023, Hyeonggon Yoo wrote:
>
> >In the current mdev_reg_read() implementation, it consistently returns
> >that the Media Status is Ready (01b). This was fine until commit
> >25a52959f99d ("hw/cxl: Add support for device sanitation") because the
> >media was presumed to be ready.
> >
> >However, as per the CXL 3.0 spec "8.2.9.8.5.1 Sanitize (Opcode 4400h)",
> >during sanitation, the Media State should be set to Disabled (11b). The
> >mentioned commit correctly sets it to Disabled, but mdev_reg_read()
> >still returns Media Status as Ready.
> >
> >To address this, update mdev_reg_read() to read register values instead
> >of returning dummy values.
> >
> >Fixes: commit 25a52959f99d ("hw/cxl: Add support for device sanitation")
> >Signed-off-by: Hyeonggon Yoo <42.hye...@gmail.com>
>
> Looks good, thanks.
>
> Reviewed-by: Davidlohr Bueso 
>
> In addition how about the following to further robustify?
>- disallow certain incoming cci cmd when media is disabled
>- deal with memory reads/writes when media is disabled
>- make __toggle_media() a nop when passed value is already set
>- play nice with arm64 uses little endian reads and writes (this
>  should be extended to all of mbox/cci of course).

All of them make sense to me. I will adjust, thanks!

But I'm not confident enough to write a single description for all the
changes, so I will split them into a few patches. May I add your
Suggested-by (or Signed-off-by) in v2, as it will contain some part of
your idea and code?

> 8<-
> diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c
> index 6eff56fb1b34..9bc5121215c9 100644
> --- a/hw/cxl/cxl-mailbox-utils.c
> +++ b/hw/cxl/cxl-mailbox-utils.c
> @@ -1314,6 +1314,7 @@ int cxl_process_cci_message(CXLCCI *cci, uint8_t set, 
> uint8_t cmd,
>   int ret;
>   const struct cxl_cmd *cxl_cmd;
>   opcode_handler h;
> +CXLType3Dev *ct3d = CXL_TYPE3(cci->d);
>
>   *len_out = 0;
>   cxl_cmd = &cci->cxl_cmd_set[set][cmd];
> @@ -1334,8 +1335,8 @@ int cxl_process_cci_message(CXLCCI *cci, uint8_t set, 
> uint8_t cmd,
>   return CXL_MBOX_BUSY;
>   }
>
> -/* forbid any selected commands while overwriting */
> -if (sanitize_running(cci)) {
> +/* forbid any selected commands when necessary */
> +if (sanitize_running(cci) || cxl_dev_media_disabled(&ct3d->cxl_dstate)) {
>   if (h == cmd_events_get_records ||
>   h == cmd_ccls_get_partition_info ||
>   h == cmd_ccls_set_lsa ||
> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index 72d93713473d..e0a164fde007 100644
> --- a/hw/mem/cxl_type3.c
> +++ b/hw/mem/cxl_type3.c
> @@ -899,7 +899,8 @@ MemTxResult cxl_type3_read(PCIDevice *d, hwaddr 
> host_addr, uint64_t *data,
>   return MEMTX_ERROR;
>   }
>
> -if (sanitize_running(&ct3d->cci)) {
> +if (sanitize_running(&ct3d->cci) ||
> +cxl_dev_media_disabled(&ct3d->cxl_dstate)) {
>   qemu_guest_getrandom_nofail(data, size);
>   return MEMTX_OK;
>   }
> @@ -925,6 +926,11 @@ MemTxResult cxl_type3_write(PCIDevice *d, hwaddr 
> host_addr, uint64_t data,
>   return MEMTX_OK;
>   }
>
> +/* memory writes to the device will have no effect */
> +if (cxl_dev_media_disabled(&ct3d->cxl_dstate)) {
> +return MEMTX_OK;
> +}
> +
>   return address_space_write(as, dpa_offset, attrs, &data, size);
>   }
>
> diff --git a/include/hw/cxl/cxl_device.h b/include/hw/cxl/cxl_device.h
> index 873e6d6ab159..007d4169df7c 100644
> --- a/include/hw/cxl/cxl_device.h
> +++ b/include/hw/cxl/cxl_device.h
> @@ -349,14 +349,26 @@ REG64(CXL_MEM_DEV_STS, 0)
>   FIELD(CXL_MEM_DEV_STS, MBOX_READY, 4, 1)
>   FIELD(CXL_MEM_DEV_STS, RESET_NEEDED, 5, 3)
>
> +static inline bool cxl_dev_media_disabled(CXLDeviceState *cxl_dstate)
> +{
> +uint64_t dev_status_reg;
> +
> +dev_status_reg = ldq_le_p(cxl_dstate->mbox_reg_state64 + 
> R_CXL_MEM_DEV_STS);
> +return FIELD_EX64(dev_status_reg, CXL_MEM_DEV_STS, MEDIA_STATUS) == 0x3;
> +}
> +
>   static inline void __toggle_media(CXLDeviceState *cxl_dstate, int val)
>   {
>   uint64_t dev_status_reg;
>
> -dev_status_reg = cxl_dstate->mbox_reg_state64[R_CXL_MEM_DEV_STS];
> +dev_status_reg = ldq_le_p(cxl_dstate->mbox_reg_state64 + 
> R_CXL_MEM_DEV_STS);
> +if (FIELD_EX64(dev_status_reg, CXL_MEM_DEV_STS, MEDIA_STATUS) == val) {
> +return;
> +}
> +
>   dev_status_reg = FIELD_DP64(dev_status_reg, CXL_MEM_DEV_STS, 
> MEDIA_STATUS,
>   val);
> -cxl_dstate->mbox_reg_state64[R_CXL_MEM_DEV_STS] = dev_status_reg;
> +stq_le_p(cxl_dstate->mbox_reg_state64 + R_CXL_MEM_DEV_STS, 
> dev_status_reg);
>   }
>   #define cxl_dev_disable_media(cxlds)\
>   do { __toggle_media((cxlds), 0x3); } while (0)
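
The register handling suggested in the diff above (little-endian access, a no-op when the value is already set, and a "media disabled" test) can be illustrated with a small stand-alone sketch. It does not use QEMU's FIELD_EX64/FIELD_DP64 or ldq_le_p/stq_le_p helpers; the names below are hypothetical, and the position of MEDIA_STATUS (2 bits starting at bit 2) is an assumption, since only the MBOX_READY and RESET_NEEDED fields appear in the quoted hunk. The values Ready (01b) and Disabled (11b) come from the patch description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: MEDIA_STATUS is a 2-bit field starting at bit 2. */
#define MEDIA_STATUS_SHIFT 2
#define MEDIA_STATUS_MASK  (UINT64_C(0x3) << MEDIA_STATUS_SHIFT)
#define MEDIA_READY        0x1u   /* 01b */
#define MEDIA_DISABLED     0x3u   /* 11b */

/* Byte-wise little-endian load, independent of host endianness. */
static inline uint64_t load_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Byte-wise little-endian store, independent of host endianness. */
static inline void store_le64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        p[i] = (uint8_t)(v & 0xff);
        v >>= 8;
    }
}

static inline unsigned media_status(const uint8_t *reg)
{
    return (unsigned)((load_le64(reg) & MEDIA_STATUS_MASK)
                      >> MEDIA_STATUS_SHIFT);
}

static inline bool media_disabled(const uint8_t *reg)
{
    return media_status(reg) == MEDIA_DISABLED;
}

/* Update only the MEDIA_STATUS bits; do nothing if already set. */
static inline void set_media_status(uint8_t *reg, unsigned val)
{
    uint64_t v = load_le64(reg);

    assert(val <= 0x3u);
    if (((v & MEDIA_STATUS_MASK) >> MEDIA_STATUS_SHIFT) == val) {
        return;
    }
    v = (v & ~MEDIA_STATUS_MASK) | ((uint64_t)val << MEDIA_STATUS_SHIFT);
    store_le64(reg, v);
}
```

Going through byte-wise helpers rather than a direct 64-bit array access keeps the stored register image little-endian regardless of the host, which is the point of the ldq_le_p()/stq_le_p() change in the diff.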



Re: [PATCH v1 2/2] hw/mem/cxl_type3: allocate more vectors for MSI-X

2023-11-27 Thread Hyeonggon Yoo
On Tue, Nov 28, 2023 at 2:53 AM Davidlohr Bueso  wrote:
>
> On Mon, 27 Nov 2023, Hyeonggon Yoo wrote:
>
> >commit 43efb0bfad2b ("hw/cxl/mbox: Wire up interrupts for background
> >completion") enables notifying background command completion via MSI-X
> >interrupt (vector number 9).
> >
> >However, the commit uses vector number 9 but the maximum number of
> >entries is less thus resulting in error below. Fix it by passing
> >nentries = 10 when calling msix_init_exclusive_bar().
>
> Hmm yeah this was already set to 10 in Jonathan's tree, thanks for reporting.

Oh, yeah, it's based on the mainline tree. I should have checked Jonathan's.

Hmm, it's already 10 there, but vector number 9 is already being used by
PCIe DOE. So I think it should be changed to msix_num = 11, with vector
number 10 used for the background command completion interrupt instead?

https://gitlab.com/jic23/qemu/-/commit/2823f19188664a6d48a965ea8170c9efa23cddab

Thanks!

--
Hyeonggon
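
The arithmetic behind the numbers discussed above is simply that MSI-X vectors are zero-based, so the table needs one more entry than the highest vector in use. A tiny sketch, with hypothetical helper names that are not QEMU APIs:

```c
#include <assert.h>
#include <stdbool.h>

/* A zero-based vector is usable only if it is below the entry count. */
static inline bool msix_vector_in_range(unsigned vector, unsigned nentries)
{
    return vector < nentries;
}

/* Smallest table size that can hold the given highest vector number. */
static inline unsigned msix_min_entries(unsigned highest_vector)
{
    unsigned nentries = highest_vector + 1;

    assert(msix_vector_in_range(highest_vector, nentries));
    return nentries;
}
```

With vector 9 taken by PCIe DOE, moving the background completion interrupt to vector 10 therefore requires at least 11 entries.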



Re: [PATCH v4 1/2] qom: new object to associate device to numa node

2023-11-27 Thread Alex Williamson
On Sun, 19 Nov 2023 18:31:10 +0530
 wrote:

> From: Ankit Agrawal 
> 
> NVIDIA GPUs support the MIG (Multi-Instance GPU) feature [1], which allows
> partitioning of the GPU device resources (including device memory) into
> several (up to 8) isolated instances. Each partitioned memory region needs
> a dedicated NUMA node to operate. The partitions are not fixed and they
> can be created/deleted at runtime.
> 
> Unfortunately Linux OS does not provide a means to dynamically create/destroy
> NUMA nodes and such feature implementation is not expected to be trivial. The
> nodes that OS discovers at the boot time while parsing SRAT remains fixed. So
> we utilize the Generic Initiator Affinity structures that allows association
> between nodes and devices. Multiple GI structures per BDF is possible,
> allowing creation of multiple nodes by exposing unique PXM in each of these
> structures.
> 
> Introduce a new acpi-generic-initiator object to allow host admin provide the
> device and the corresponding NUMA nodes. Qemu maintain this association and
> use this object to build the requisite GI Affinity Structure.
> 
> An admin can provide the range of nodes through a uint16 array host-nodes
> and link it to a device by providing its id. Currently, only PCI device is
> supported and an error is returned for acpi device. The following sample
> creates 8 nodes and link them to the PCI device dev0:
> 
> -numa node,nodeid=2 \
> -numa node,nodeid=3 \
> -numa node,nodeid=4 \
> -numa node,nodeid=5 \
> -numa node,nodeid=6 \
> -numa node,nodeid=7 \
> -numa node,nodeid=8 \
> -numa node,nodeid=9 \
> -device 
> vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> -object acpi-generic-initiator,id=gi0,pci-dev=dev0,host-nodes=2-9 \
> 
> [1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu
> 
> Signed-off-by: Ankit Agrawal 
> ---
>  hw/acpi/acpi-generic-initiator.c | 84 
>  hw/acpi/meson.build  |  1 +
>  include/hw/acpi/acpi-generic-initiator.h | 30 +
>  qapi/qom.json| 18 +
>  4 files changed, 133 insertions(+)
>  create mode 100644 hw/acpi/acpi-generic-initiator.c
>  create mode 100644 include/hw/acpi/acpi-generic-initiator.h
> 
> diff --git a/hw/acpi/acpi-generic-initiator.c 
> b/hw/acpi/acpi-generic-initiator.c
> new file mode 100644
> index 00..5ea51cb81e
> --- /dev/null
> +++ b/hw/acpi/acpi-generic-initiator.c
> @@ -0,0 +1,84 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include "qemu/osdep.h"
> +#include "hw/qdev-properties.h"
> +#include "qapi/error.h"
> +#include "qapi/qapi-builtin-visit.h"
> +#include "qapi/visitor.h"
> +#include "qom/object_interfaces.h"
> +#include "qom/object.h"
> +#include "hw/qdev-core.h"
> +#include "hw/vfio/vfio-common.h"
> +#include "hw/vfio/pci.h"

There's nothing related to vfio here except for the example use case,
surely you don't need the above two headers.

> +#include "hw/pci/pci_device.h"
> +#include "sysemu/numa.h"
> +#include "hw/acpi/acpi-generic-initiator.h"
> +
> +OBJECT_DEFINE_TYPE_WITH_INTERFACES(AcpiGenericInitiator, 
> acpi_generic_initiator,
> +   ACPI_GENERIC_INITIATOR, OBJECT,
> +   { TYPE_USER_CREATABLE },
> +   { NULL })
> +
> +OBJECT_DECLARE_SIMPLE_TYPE(AcpiGenericInitiator, ACPI_GENERIC_INITIATOR)
> +
> +static void acpi_generic_initiator_init(Object *obj)
> +{
> +AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +gi->device = NULL;
> +gi->nodelist = NULL;
> +}
> +
> +static void acpi_generic_initiator_finalize(Object *obj)
> +{
> +AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +
> +g_free(gi->device);
> +qapi_free_uint16List(gi->nodelist);
> +}
> +
> +static void acpi_generic_initiator_set_pci_device(Object *obj, const char 
> *val,
> +  Error **errp)
> +{
> +AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +
> +gi->device = g_strdup(val);
> +}
> +
> +static void acpi_generic_initiator_set_acpi_device(Object *obj, const char 
> *val,
> +   Error **errp)
> +{
> +error_setg(errp, "Generic Initiator ACPI device not supported");
> +}
> +
> +static void
> +acpi_generic_initiator_set_host_nodes(Object *obj, Visitor *v, const char 
> *name,
> +  void *opaque, Error **errp)
> +{
> +AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +uint16List *l;
> +
> +visit_type_uint16List(v, name, &(gi->nodelist), errp);
> +
> +for (l = gi->nodelist; l; l = l->next) {
> +if (l->value >= MAX_NODES) {
> +error_setg(errp, "Invalid host-nodes value: %d", l->value);
> +qapi_free_uint16List(gi->nodelist);
> +return;
> +}
> +}
> +}
> +
> +static void 

Re: [PATCH v4 2/2] hw/acpi: Implement the SRAT GI affinity structure

2023-11-27 Thread Alex Williamson
On Sun, 19 Nov 2023 18:31:11 +0530
 wrote:

> From: Ankit Agrawal 
> 
> ACPI spec provides a scheme to associate "Generic Initiators" [1]
> (e.g. heterogeneous processors and accelerators, GPUs, and I/O devices with
> integrated compute or DMA engines) with Proximity Domains. This is
> achieved using Generic Initiator Affinity Structure in SRAT. During bootup,
> the Linux kernel parses the ACPI SRAT to determine the PXM IDs and creates a
> NUMA node for each unique PXM ID encountered. QEMU currently does not
> implement these structures while building SRAT.
> 
> Add GI structures while building VM ACPI SRAT. The association between
> devices and nodes are stored using acpi-generic-initiator object. Lookup
> presence of all such objects and use them to build these structures.
> 
> The structure needs a PCI device handle [2] that consists of the device BDF.
> The vfio-pci device corresponding to the acpi-generic-initiator object is
> located to determine the BDF.
> 
> [1] ACPI Spec 6.3, Section 5.2.16.6
> [2] ACPI Spec 6.3, Table 5.80
> 
> Signed-off-by: Ankit Agrawal 
> ---
>  hw/acpi/acpi-generic-initiator.c | 100 +++
>  hw/arm/virt-acpi-build.c |   3 +
>  include/hw/acpi/acpi-generic-initiator.h |  26 ++
>  3 files changed, 129 insertions(+)
> 
> diff --git a/hw/acpi/acpi-generic-initiator.c 
> b/hw/acpi/acpi-generic-initiator.c
> index 5ea51cb81e..a9222438ec 100644
> --- a/hw/acpi/acpi-generic-initiator.c
> +++ b/hw/acpi/acpi-generic-initiator.c
> @@ -16,6 +16,7 @@
>  #include "hw/pci/pci_device.h"
>  #include "sysemu/numa.h"
>  #include "hw/acpi/acpi-generic-initiator.h"
> +#include "qemu/error-report.h"
>  
>  OBJECT_DEFINE_TYPE_WITH_INTERFACES(AcpiGenericInitiator, 
> acpi_generic_initiator,
> ACPI_GENERIC_INITIATOR, OBJECT,
> @@ -82,3 +83,102 @@ static void acpi_generic_initiator_class_init(ObjectClass 
> *oc, void *data)
>  acpi_generic_initiator_set_host_nodes,
>  NULL, NULL);
>  }
> +
> +static int acpi_generic_initiator_list(Object *obj, void *opaque)
> +{
> +GSList **list = opaque;
> +
> +if (object_dynamic_cast(obj, TYPE_ACPI_GENERIC_INITIATOR)) {
> +*list = g_slist_append(*list, ACPI_GENERIC_INITIATOR(obj));
> +}
> +
> +object_child_foreach(obj, acpi_generic_initiator_list, opaque);
> +return 0;
> +}
> +
> +/*
> + * Identify Generic Initiator objects and link them into the list which is
> + * returned to the caller.
> + *
> + * Note: it is the caller's responsibility to free the list to avoid
> + * memory leak.
> + */
> +static GSList *acpi_generic_initiator_get_list(void)
> +{
> +GSList *list = NULL;
> +
> +object_child_foreach(object_get_root(), acpi_generic_initiator_list, 
> &list);
> +return list;
> +}
> +
> +/*
> + * ACPI 6.3:
> + * Table 5-78 Generic Initiator Affinity Structure
> + */
> +static
> +void build_srat_generic_pci_initiator_affinity(GArray *table_data, int node,
> +   PCIDeviceHandle *handle)
> +{
> +uint8_t index;
> +
> +build_append_int_noprefix(table_data, 5, 1);  /* Type */
> +build_append_int_noprefix(table_data, 32, 1); /* Length */
> +build_append_int_noprefix(table_data, 0, 1);  /* Reserved */
> +build_append_int_noprefix(table_data, 1, 1);  /* Device Handle Type: PCI 
> */
> +build_append_int_noprefix(table_data, node, 4);  /* Proximity Domain */
> +
> +/* Device Handle - PCI */
> +build_append_int_noprefix(table_data, handle->segment, 2);
> +build_append_int_noprefix(table_data, handle->bdf, 2);
> +for (index = 0; index < 12; index++) {
> +build_append_int_noprefix(table_data, 0, 1);
> +}
> +
> +build_append_int_noprefix(table_data, GEN_AFFINITY_ENABLED, 4); /* Flags 
> */
> +build_append_int_noprefix(table_data, 0, 4); /* Reserved */
> +}
> +
> +void build_srat_generic_pci_initiator(GArray *table_data)
> +{
> +GSList *gi_list, *list = acpi_generic_initiator_get_list();
> +AcpiGenericInitiator *gi;
> +
> +for (gi_list = list; gi_list; gi_list = gi_list->next) {
> +Object *o;
> +uint16List *l;
> +PCIDevice *pci_dev;
> +bool node_specified = false;
> +
> +gi = gi_list->data;
> +
> +/* User fails to provide a device. */
> +g_assert(gi->device);
> +
> +o = object_resolve_path_type(gi->device, TYPE_PCI_DEVICE, NULL);
> +if (!o) {
> +error_printf("Specified device must be a PCI device.\n");
> +g_assert(o);
> +}
> +pci_dev = PCI_DEVICE(o);
> +
> +for (l = gi->nodelist; l; l = l->next) {
> +PCIDeviceHandle dev_handle;
> +dev_handle.segment = 0;
> +dev_handle.bdf = PCI_BUILD_BDF(pci_bus_num(pci_get_bus(pci_dev)),
> +   pci_dev->devfn);
> +build_srat_generic_pci_initiator_affinity(table_data,
> +
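
The 32-byte layout that build_srat_generic_pci_initiator_affinity() emits in the quoted patch can be sketched without QEMU's GArray helpers. The code below is an illustration only: put_le, build_bdf and build_gi_affinity are hypothetical names, the flag value 1 for "enabled" is an assumption (the value of GEN_AFFINITY_ENABLED is not shown in the patch), and build_bdf assumes the usual bus-in-the-high-byte, devfn-in-the-low-byte encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GI_STRUCT_LEN 32

/* Append 'size' bytes of 'val' in little-endian order; return new offset. */
static inline size_t put_le(uint8_t *buf, size_t off, uint64_t val, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        buf[off + i] = (uint8_t)(val & 0xff);
        val >>= 8;
    }
    return off + size;
}

/* Assumed BDF encoding: bus number in bits 15:8, devfn in bits 7:0. */
static inline uint16_t build_bdf(uint8_t bus, uint8_t devfn)
{
    return (uint16_t)((bus << 8) | devfn);
}

/* Fill one Generic Initiator Affinity Structure with a PCI device handle. */
static inline size_t build_gi_affinity(uint8_t out[GI_STRUCT_LEN],
                                       uint32_t node, uint16_t segment,
                                       uint16_t bdf)
{
    size_t off = 0;

    off = put_le(out, off, 5, 1);              /* Type */
    off = put_le(out, off, GI_STRUCT_LEN, 1);  /* Length */
    off = put_le(out, off, 0, 1);              /* Reserved */
    off = put_le(out, off, 1, 1);              /* Device Handle Type: PCI */
    off = put_le(out, off, node, 4);           /* Proximity Domain */
    off = put_le(out, off, segment, 2);        /* Device Handle: segment */
    off = put_le(out, off, bdf, 2);            /* Device Handle: BDF */
    off = put_le(out, off, 0, 12);             /* Device Handle: reserved */
    off = put_le(out, off, 1, 4);              /* Flags: enabled (assumed) */
    off = put_le(out, off, 0, 4);              /* Reserved */

    assert(off == GI_STRUCT_LEN);              /* 1+1+1+1+4+16+4+4 bytes */
    return off;
}
```

One such structure would be emitted per (device, node) pair, which matches the inner loop over gi->nodelist in the quoted code.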

Re: [PATCH v1 5/7] tests/qtest/migration: Print migration incoming errors

2023-11-27 Thread Peter Xu
On Mon, Nov 27, 2023 at 05:32:45PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Mon, Nov 27, 2023 at 12:52:38PM -0300, Fabiano Rosas wrote:
> >> >> @@ -118,6 +118,12 @@ void migrate_incoming_qmp(QTestState *to, const 
> >> >> char *uri, const char *fmt, ...)
> >> >>  
> >> >>  rsp = qtest_qmp(to, "{ 'execute': 'migrate-incoming', 'arguments': 
> >> >> %p}",
> >> >>  args);
> >> >> +
> >> >> +if (!qdict_haskey(rsp, "return")) {
> >> >> +g_autoptr(GString) s = qobject_to_json_pretty(QOBJECT(rsp), 
> >> >> true);
> >> >> +g_test_message("%s", s->str);
> >> >> +}
> >> >
> >> > This traps the "migrate-incoming" command only (which, afaiu, only setup
> >> > the listening), would this capture the incoming error?
> >> 
> >> This is about the migrate-incoming only. We could replace "incoming
> >> migration" with "qmp_migrate_incoming" in the commit message to clarify.
> >
> > Ah.. Did you ever see this failure in any of your runs in these tests?  I
> > think it means you hit the assertion right below this part, but I'm just
> > curious how, as the URIs in the test cases are pretty constant.
> 
> Yes, I don't remember what exactly, but we changed the code that parses
> the URIs in this release and I'm also working on
> file_start_incoming_migration.

OK then.

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v3 0/4] Add BHRB Facility Support

2023-11-27 Thread Miles Glenn
On Wed, 2023-10-18 at 10:59 -0500, Miles Glenn wrote:
> On Thu, 2023-10-19 at 01:06 +1000, Nicholas Piggin wrote:
> > On Tue Sep 26, 2023 at 3:43 AM AEST, Glenn Miles wrote:
> > > This is a series of patches for adding support for the Branch
> > > History
> > > Rolling Buffer (BHRB) facility.  This was added to the Power ISA
> > > starting with version 2.07.  Changes were subsequently made in
> > > version
> > > 3.1 to limit BHRB recording to instructions run in problem state
> > > only
> > > and to add a control bit to disable recording (MMCRA[BHRBRD]).
> > > 
> > > Version 3 of this series disables branch recording on P8 and P9
> > > due
> > > to a drop in performance caused by recording branches outside of
> > > problem state.
> > 
> > Thanks for these, they all look good to me.
> > 
> > With P10 CPU, Linux perf branch recording appears to work with this
> > series, and I confirmed that Linux does disable BHRB in MMCRA at
> > boot, so it should not take the performance hit.
> > 
> > It had a couple of compile bugs, no matter I fixed them, but I
> > often
> > trip over ppc32 and user-mode when working on TCG too, so building
> > with --target-list including ppc64-softmmu,ppc-softmmu,
> > ppc64-linux-user,ppc64le-linux-user,ppc-linux-user is good to catch
> > those.
> > 
> > Thanks,
> > Nick
> > 
> 
> Thanks, Nick.  I'll have to remember that for next time!
> 
> Glenn
> 

Hi Nick,

Is there anything else you need me to do for this series to be merged?

Thanks,

Glenn

> > > Glenn Miles (4):
> > >   target/ppc: Add new hflags to support BHRB
> > >   target/ppc: Add recording of taken branches to BHRB
> > >   target/ppc: Add clrbhrb and mfbhrbe instructions
> > >   target/ppc: Add migration support for BHRB
> > > 
> > >  target/ppc/cpu.h   |  24 ++
> > >  target/ppc/cpu_init.c  |  39 +-
> > >  target/ppc/helper.h|   5 ++
> > >  target/ppc/helper_regs.c   |  35 +
> > >  target/ppc/insn32.decode   |   8 ++
> > >  target/ppc/machine.c   |  23 +-
> > >  target/ppc/misc_helper.c   |  46 +++
> > >  target/ppc/power8-pmu-regs.c.inc   |   5 ++
> > >  target/ppc/power8-pmu.c|  48 +++-
> > >  target/ppc/power8-pmu.h|  11 ++-
> > >  target/ppc/spr_common.h|   1 +
> > >  target/ppc/translate.c | 101
> > > +++--
> > >  target/ppc/translate/bhrb-impl.c.inc   |  43 +++
> > >  target/ppc/translate/branch-impl.c.inc |   2 +-
> > >  14 files changed, 374 insertions(+), 17 deletions(-)
> > >  create mode 100644 target/ppc/translate/bhrb-impl.c.inc




Re: [PATCH v1 5/7] tests/qtest/migration: Print migration incoming errors

2023-11-27 Thread Fabiano Rosas
Peter Xu  writes:

> On Mon, Nov 27, 2023 at 12:52:38PM -0300, Fabiano Rosas wrote:
>> >> @@ -118,6 +118,12 @@ void migrate_incoming_qmp(QTestState *to, const char 
>> >> *uri, const char *fmt, ...)
>> >>  
>> >>  rsp = qtest_qmp(to, "{ 'execute': 'migrate-incoming', 'arguments': 
>> >> %p}",
>> >>  args);
>> >> +
>> >> +if (!qdict_haskey(rsp, "return")) {
>> >> +g_autoptr(GString) s = qobject_to_json_pretty(QOBJECT(rsp), 
>> >> true);
>> >> +g_test_message("%s", s->str);
>> >> +}
>> >
>> > This traps the "migrate-incoming" command only (which, afaiu, only setup
>> > the listening), would this capture the incoming error?
>> 
>> This is about the migrate-incoming only. We could replace "incoming
>> migration" with "qmp_migrate_incoming" in the commit message to clarify.
>
> Ah.. Did you ever see this failure in any of your runs in these tests?  I
> think it means you hit the assertion right below this part, but I'm just
> curious how, as the URIs in the test cases are pretty constant.

Yes, I don't remember what exactly, but we changed the code that parses
the URIs in this release and I'm also working on
file_start_incoming_migration.



[RFC PATCH v3 29/30] migration: Add support for fdset with multifd + file

2023-11-27 Thread Fabiano Rosas
Allow multifd to use an fdset when migrating to a file. This is useful
for the scenario where the management layer wants to have control over
the migration file.

By receiving the file descriptors directly, QEMU can delegate some
high level operating system operations to the management layer (such
as mandatory access control). The management layer might also want to
add its own headers before the migration stream.

Enable the "file:/dev/fdset/#" syntax for the multifd migration with
fixed-ram. The requirements for the fdset mechanism are:

On the migration source side:

- the fdset must contain two fds that are not duplicates between
  themselves;
- if direct-io is to be used, exactly one of the fds must have the
  O_DIRECT flag set;
- the file must be opened with WRONLY both times.

On the migration destination side:

- the fdset must contain one fd;
- the file must be opened with RDONLY.

Signed-off-by: Fabiano Rosas 
---
 docs/devel/migration.rst |  18 +++
 migration/file.c | 100 ---
 2 files changed, 112 insertions(+), 6 deletions(-)

diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
index 1488e5b2f9..096ef27ed7 100644
--- a/docs/devel/migration.rst
+++ b/docs/devel/migration.rst
@@ -46,6 +46,24 @@ over any transport.
   application to add its own metadata to the start of the file without
   QEMU interference.
 
+  The file migration also supports using a file that has already been
+  opened. A set of file descriptors is passed to QEMU via an "fdset"
+  (see add-fd QMP command documentation). This method allows a
+  management application to have control over the migration file
+  opening operation. There are, however, strict requirements to this
+  interface:
+
+  On the migration source side:
+- the fdset must contain two file descriptors that are not
+  duplicates between themselves;
+- if the direct-io capability is to be used, exactly one of the
+  file descriptors must have the O_DIRECT flag set;
+- the file must be opened with WRONLY both times.
+
+  On the migration destination side:
+- the fdset must contain one file descriptor;
+- the file must be opened with RDONLY.
+
 In addition, support is included for migration using RDMA, which
 transports the page data using ``RDMA``, where the hardware takes care of
 transporting the pages, and the load on the CPU is much lower.  While the
diff --git a/migration/file.c b/migration/file.c
index fc5c1a45f4..4b06335a8c 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -9,11 +9,13 @@
 #include "qemu/cutils.h"
 #include "qemu/error-report.h"
 #include "qapi/error.h"
+#include "qapi/qapi-commands-misc.h"
 #include "channel.h"
 #include "file.h"
 #include "migration.h"
 #include "io/channel-file.h"
 #include "io/channel-util.h"
+#include "monitor/monitor.h"
 #include "options.h"
 #include "trace.h"
 
@@ -21,6 +23,7 @@
 
 static struct FileOutgoingArgs {
 char *fname;
+int64_t fdset_id;
 } outgoing_args;
 
 /* Remove the offset option from @filespec and return it in @offsetp. */
@@ -42,6 +45,84 @@ int file_parse_offset(char *filespec, uint64_t *offsetp, 
Error **errp)
 return 0;
 }
 
+/*
+ * If the open flags and file status flags from the file descriptors
+ * in the fdset don't match what QEMU expects, errno gets set to
+ * EACCES. Let's provide a more user-friendly message.
+ */
+static void file_fdset_error(int flags, Error **errp)
+{
+ERRP_GUARD();
+
+if (errno == EACCES) {
+/* ditch the previous error */
+error_free(*errp);
+*errp = NULL;
+
+error_setg(errp, "Fdset is missing a file descriptor with flags: 0x%x",
+   flags);
+}
+}
+
+static void file_remove_fdset(void)
+{
+if (outgoing_args.fdset_id != -1) {
+qmp_remove_fd(outgoing_args.fdset_id, false, -1, NULL);
+outgoing_args.fdset_id = -1;
+}
+}
+
+/*
+ * Due to the behavior of the dup() system call, we need the fdset to
+ * have two non-duplicate fds so we can enable direct IO in the
+ * secondary channels without affecting the main channel.
+ */
+static bool file_parse_fdset(const char *filename, int64_t *fdset_id,
+ Error **errp)
+{
+FdsetInfoList *fds_info;
+FdsetFdInfoList *fd_info;
+const char *fdset_id_str;
+int nfds = 0;
+
+*fdset_id = -1;
+
+if (!strstart(filename, "/dev/fdset/", &fdset_id_str)) {
+return true;
+}
+
+if (!migrate_multifd()) {
+error_setg(errp, "fdset is only supported with multifd");
+return false;
+}
+
+*fdset_id = qemu_parse_fd(fdset_id_str);
+
+for (fds_info = qmp_query_fdsets(NULL); fds_info;
+ fds_info = fds_info->next) {
+
+if (*fdset_id != fds_info->value->fdset_id) {
+continue;
+}
+
+for (fd_info = fds_info->value->fds; fd_info; fd_info = fd_info->next) 
{
+if (nfds++ > 2) {
+break;
+}
+}
+}
+
+ 

[RFC PATCH v3 18/30] migration/multifd: Allow receiving pages without packets

2023-11-27 Thread Fabiano Rosas
Currently multifd does not need to have knowledge of pages on the
receiving side because all the information needed is within the
packets that come in the stream.

We're about to add support for fixed-ram migration, which cannot use
packets because it expects the ramblock section in the migration file
to contain only the guest pages data.

Add a data structure to transfer pages between the ram migration code
and the multifd receiving threads.

We don't want to reuse MultiFDPages_t for two reasons:

a) multifd threads don't really need to know about the data they're
   receiving.

b) the receiving side has to be stopped to load the pages, which means
   we can experiment with larger granularities than page size when
   transferring data.
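
To illustrate the hand-off such a structure allows, here is a minimal
single-threaded sketch. It is not the code from this patch: the Sketch*
types and sketch_hand_off() are made-up names, and the mutex and
semaphore that the real receive path needs are left out on purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the new structure: a buffer plus its file offset. */
typedef struct {
    void *opaque;
    size_t size;
    uint64_t file_offset;
} SketchRecvData;

/* Hypothetical per-channel state; locking omitted for brevity. */
typedef struct {
    SketchRecvData *data;
    int pending_job;
} SketchRecvChannel;

/*
 * Give a filled descriptor to an idle channel and take back the
 * channel's empty one, so the producer can fill that one next.
 */
static SketchRecvData *sketch_hand_off(SketchRecvChannel *ch,
                                       SketchRecvData *filled)
{
    SketchRecvData *empty = ch->data;

    assert(ch->pending_job == 0);
    assert(empty->size == 0);

    ch->data = filled;
    ch->pending_job++;

    return empty;
}
```

Swapping pointers rather than copying keeps the ram code and the
channel thread from ever touching the same descriptor at once.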

Signed-off-by: Fabiano Rosas 
---
- stopped using MultiFDPages_t and added a new structure which can
  take offset + size
---
 migration/multifd.c | 122 ++--
 migration/multifd.h |  20 
 2 files changed, 138 insertions(+), 4 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index c1381bdc21..7dfab2367a 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -142,17 +142,36 @@ static void nocomp_recv_cleanup(MultiFDRecvParams *p)
 static int nocomp_recv_data(MultiFDRecvParams *p, Error **errp)
 {
 uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
+ERRP_GUARD();
 
 if (flags != MULTIFD_FLAG_NOCOMP) {
 error_setg(errp, "multifd %u: flags received %x flags expected %x",
p->id, flags, MULTIFD_FLAG_NOCOMP);
 return -1;
 }
-for (int i = 0; i < p->normal_num; i++) {
-p->iov[i].iov_base = p->host + p->normal[i];
-p->iov[i].iov_len = p->page_size;
+
+if (!migrate_multifd_packets()) {
+MultiFDRecvData *data = p->data;
+size_t ret;
+
+ret = qio_channel_pread(p->c, (char *) data->opaque,
+data->size, data->file_offset, errp);
+if (ret != data->size) {
+error_prepend(errp,
+  "multifd recv (%u): read 0x%zx, expected 0x%zx",
+  p->id, ret, data->size);
+return -1;
+}
+
+return 0;
+} else {
+for (int i = 0; i < p->normal_num; i++) {
+p->iov[i].iov_base = p->host + p->normal[i];
+p->iov[i].iov_len = p->page_size;
+}
+
+return qio_channel_readv_all(p->c, p->iov, p->normal_num, errp);
 }
-return qio_channel_readv_all(p->c, p->iov, p->normal_num, errp);
 }
 
 static MultiFDMethods multifd_nocomp_ops = {
@@ -989,6 +1008,7 @@ int multifd_save_setup(Error **errp)
 
 struct {
 MultiFDRecvParams *params;
+MultiFDRecvData *data;
 /* number of created threads */
 int count;
 /* syncs main thread and channels */
@@ -999,6 +1019,49 @@ struct {
 MultiFDMethods *ops;
 } *multifd_recv_state;
 
+int multifd_recv(void)
+{
+int i;
+static int next_recv_channel;
+MultiFDRecvParams *p = NULL;
+MultiFDRecvData *data = multifd_recv_state->data;
+
+/*
+ * next_channel can remain from a previous migration that was
+ * using more channels, so ensure it doesn't overflow if the
+ * limit is lower now.
+ */
+next_recv_channel %= migrate_multifd_channels();
+for (i = next_recv_channel;; i = (i + 1) % migrate_multifd_channels()) {
+p = &multifd_recv_state->params[i];
+
+qemu_mutex_lock(&p->mutex);
+if (p->quit) {
+error_report("%s: channel %d has already quit!", __func__, i);
+qemu_mutex_unlock(&p->mutex);
+return -1;
+}
+if (!p->pending_job) {
+p->pending_job++;
+next_recv_channel = (i + 1) % migrate_multifd_channels();
+break;
+}
+qemu_mutex_unlock(&p->mutex);
+}
+assert(p->data->size == 0);
+multifd_recv_state->data = p->data;
+p->data = data;
+qemu_mutex_unlock(&p->mutex);
+qemu_sem_post(&p->sem);
+
+return 1;
+}
+
+MultiFDRecvData *multifd_get_recv_data(void)
+{
+return multifd_recv_state->data;
+}
+
 static void multifd_recv_terminate_threads(Error *err)
 {
 int i;
@@ -1020,6 +1083,7 @@ static void multifd_recv_terminate_threads(Error *err)
 
 qemu_mutex_lock(&p->mutex);
 p->quit = true;
+qemu_sem_post(&p->sem);
 /*
  * We could arrive here for two reasons:
  *  - normal quit, i.e. everything went fine, just finished
@@ -1069,6 +1133,7 @@ void multifd_load_cleanup(void)
 p->c = NULL;
 qemu_mutex_destroy(&p->mutex);
 qemu_sem_destroy(&p->sem_sync);
+qemu_sem_destroy(&p->sem);
 g_free(p->name);
 p->name = NULL;
 p->packet_len = 0;
@@ -1083,6 +1148,8 @@ void multifd_load_cleanup(void)
 qemu_sem_destroy(&multifd_recv_state->sem_sync);
 g_free(multifd_recv_state->params);
 multifd_recv_state->params = NULL;
+g_free(multifd_recv_state->data);
+

[RFC PATCH v3 21/30] migration/multifd: Support incoming fixed-ram stream format

2023-11-27 Thread Fabiano Rosas
For the incoming fixed-ram migration we need to read the ramblock
headers, get the pages bitmap and send the host address of each
non-zero page to the multifd channel thread for writing.

To read from the migration file we need a preadv function that can
read into the iovs in segments of contiguous pages because (as in the
writing case) the file offset applies to the entire iovec.

Usage on HMP is:

(qemu) migrate_set_capability multifd on
(qemu) migrate_set_capability fixed-ram on
(qemu) migrate_set_parameter max-bandwidth 0
(qemu) migrate_set_parameter multifd-channels 8
(qemu) migrate_incoming file:migfile
(qemu) info status
(qemu) c

Signed-off-by: Fabiano Rosas 
---
 migration/ram.c | 34 +++---
 1 file changed, 31 insertions(+), 3 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 385fe431bf..f5173755f0 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -111,6 +111,7 @@
  * pages region in the migration file at a time.
  */
 #define FIXED_RAM_LOAD_BUF_SIZE 0x10
+#define FIXED_RAM_MULTIFD_LOAD_BUF_SIZE 0x10
 
 XBZRLECacheStats xbzrle_counters;
 
@@ -3942,13 +3943,36 @@ void colo_flush_ram_cache(void)
 trace_colo_flush_ram_cache_end();
 }
 
+static size_t ram_load_multifd_pages(RAMBlock *block, ram_addr_t start_offset,
+ size_t size)
+{
+MultiFDRecvData *data = multifd_get_recv_data();
+
+/*
+ * Pointing the opaque directly to the host buffer, no
+ * preprocessing needed.
+ */
+data->opaque = block->host + start_offset;
+
+data->file_offset = block->pages_offset + start_offset;
+data->size = size;
+
+if (multifd_recv() < 0) {
+return -1;
+}
+
+return size;
+}
+
 static void read_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
 long num_pages, unsigned long *bitmap)
 {
 unsigned long set_bit_idx, clear_bit_idx;
 ram_addr_t offset;
 void *host;
-size_t read, unread, size, buf_size = FIXED_RAM_LOAD_BUF_SIZE;
+size_t read, unread, size;
+size_t buf_size = (migrate_multifd() ? FIXED_RAM_MULTIFD_LOAD_BUF_SIZE :
+   FIXED_RAM_LOAD_BUF_SIZE);
 
 for (set_bit_idx = find_first_bit(bitmap, num_pages);
  set_bit_idx < num_pages;
@@ -3963,8 +3987,12 @@ static void read_ramblock_fixed_ram(QEMUFile *f, 
RAMBlock *block,
 host = host_from_ram_block_offset(block, offset);
 size = MIN(unread, buf_size);
 
-read = qemu_get_buffer_at(f, host, size,
-  block->pages_offset + offset);
+if (migrate_multifd()) {
+read = ram_load_multifd_pages(block, offset, size);
+} else {
+read = qemu_get_buffer_at(f, host, size,
+  block->pages_offset + offset);
+}
 offset += read;
 unread -= read;
 }
-- 
2.35.3




[RFC PATCH v3 24/30] tests/qtest: Add a test for migration with direct-io and multifd

2023-11-27 Thread Fabiano Rosas
The test only runs on systems that define O_DIRECT and on
filesystems that support it.

Signed-off-by: Fabiano Rosas 
---
- added ifdefs for O_DIRECT and a probing function
---
 tests/qtest/migration-helpers.c | 39 +
 tests/qtest/migration-helpers.h |  1 +
 tests/qtest/migration-test.c| 35 +
 3 files changed, 75 insertions(+)

diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
index 24fb7b3525..02b92f0cb6 100644
--- a/tests/qtest/migration-helpers.c
+++ b/tests/qtest/migration-helpers.c
@@ -292,3 +292,42 @@ char *resolve_machine_version(const char *alias, const 
char *var1,
 
 return find_common_machine_version(machine_name, var1, var2);
 }
+
+#ifdef O_DIRECT
+/*
+ * Probe for O_DIRECT support on the filesystem. Since this is used
+ * for tests, be conservative, if anything fails, assume it's
+ * unsupported.
+ */
+bool probe_o_direct_support(const char *tmpfs)
+{
+g_autofree char *filename = g_strdup_printf("%s/probe-o-direct", tmpfs);
+int fd, flags = O_CREAT | O_RDWR | O_DIRECT;
+void *buf;
+ssize_t ret, len;
+uint64_t offset;
+
+fd = open(filename, flags, 0660);
+if (fd < 0) {
+unlink(filename);
+return false;
+}
+
+/*
+ * Assuming 4k should be enough to satisfy O_DIRECT alignment
+ * requirements. The migration code uses 1M to be conservative.
+ */
+len = 0x10;
+offset = 0x10;
+
+buf = g_malloc0(len);
+ret = pwrite(fd, buf, len, offset);
+unlink(filename);
+
+if (ret < 0) {
+return false;
+}
+
+return true;
+}
+#endif
diff --git a/tests/qtest/migration-helpers.h b/tests/qtest/migration-helpers.h
index e31dc85cc7..15df009d35 100644
--- a/tests/qtest/migration-helpers.h
+++ b/tests/qtest/migration-helpers.h
@@ -47,4 +47,5 @@ char *find_common_machine_version(const char *mtype, const 
char *var1,
   const char *var2);
 char *resolve_machine_version(const char *alias, const char *var1,
   const char *var2);
+bool probe_o_direct_support(const char *tmpfs);
 #endif /* MIGRATION_HELPERS_H */
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 5c5725687c..192b8ec993 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -,6 +,36 @@ static void test_multifd_file_fixed_ram(void)
 test_file_common(&args, true);
 }
 
+#ifdef O_DIRECT
+static void *migrate_multifd_fixed_ram_dio_start(QTestState *from,
+ QTestState *to)
+{
+migrate_multifd_fixed_ram_start(from, to);
+
+migrate_set_parameter_bool(from, "direct-io", true);
+migrate_set_parameter_bool(to, "direct-io", true);
+
+return NULL;
+}
+
+static void test_multifd_file_fixed_ram_dio(void)
+{
+g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
+   FILE_TEST_FILENAME);
+MigrateCommon args = {
+.connect_uri = uri,
+.listen_uri = "defer",
+.start_hook = migrate_multifd_fixed_ram_dio_start,
+};
+
+if (!probe_o_direct_support(tmpfs)) {
+g_test_skip("Filesystem does not support O_DIRECT");
+return;
+}
+
+test_file_common(&args, true);
+}
+#endif
 
 static void test_precopy_tcp_plain(void)
 {
@@ -3476,6 +3506,11 @@ int main(int argc, char **argv)
 qtest_add_func("/migration/multifd/file/fixed-ram/live",
test_multifd_file_fixed_ram_live);
 
+#ifdef O_DIRECT
+qtest_add_func("/migration/multifd/file/fixed-ram/dio",
+   test_multifd_file_fixed_ram_dio);
+#endif
+
 #ifdef CONFIG_GNUTLS
 qtest_add_func("/migration/precopy/unix/tls/psk",
test_precopy_unix_tls_psk);
-- 
2.35.3




[RFC PATCH v3 15/30] io: Add a pwritev/preadv version that takes a discontiguous iovec

2023-11-27 Thread Fabiano Rosas
To support the upcoming fixed-ram migration with multifd, we need
to be able to accept an iovec array with non-contiguous data.

Add a pwritev and preadv version that splits the array into contiguous
segments before writing. With that we can have the ram code continue
to add pages in any order and the multifd code continue to send large
arrays for reading and writing.
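
As a rough standalone illustration of the splitting idea (not the
function added by this patch), the sketch below counts how many
contiguous runs an iovec array contains. The name
sketch_count_slices() is made up for the example. Each run found this
way is what would be passed to a single pwritev/preadv call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: count the contiguous runs in an iovec array.
 * Two neighbouring elements belong to the same run when the first
 * one ends exactly where the second one begins.
 */
static size_t sketch_count_slices(const struct iovec *iov, size_t niov)
{
    size_t slices = 0;

    for (size_t i = 0; i < niov; i++) {
        uintptr_t end = (uintptr_t)iov[i].iov_base + iov[i].iov_len;

        /* Still inside a run: the next element starts where this one ends. */
        if (i + 1 < niov && end == (uintptr_t)iov[i + 1].iov_base) {
            continue;
        }

        /* Last element of a run (or of the array): close the slice. */
        slices++;
    }

    return slices;
}
```

An array of three elements where only the first two are adjacent in
memory therefore yields two slices, hence two system calls.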

Signed-off-by: Fabiano Rosas 
---
- split the API that was merged into a single function
- use uintptr_t for compatibility with 32-bit
---
 include/io/channel.h | 26 
 io/channel.c | 70 
 2 files changed, 96 insertions(+)

diff --git a/include/io/channel.h b/include/io/channel.h
index 7986c49c71..25383db5aa 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -559,6 +559,19 @@ int qio_channel_close(QIOChannel *ioc,
 ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
 size_t niov, off_t offset, Error **errp);
 
+/**
+ * qio_channel_pwritev_all:
+ * @ioc: the channel object
+ * @iov: the array of memory regions to write data from
+ * @niov: the length of the @iov array
+ * @offset: the iovec offset in the file where to write the data
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Returns: 0 if all bytes were written, or -1 on error
+ */
+int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
+size_t niov, off_t offset, Error **errp);
+
 /**
  * qio_channel_pwrite
  * @ioc: the channel object
@@ -595,6 +608,19 @@ ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, 
size_t buflen,
 ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
size_t niov, off_t offset, Error **errp);
 
+/**
+ * qio_channel_preadv_all:
+ * @ioc: the channel object
+ * @iov: the array of memory regions to read data to
+ * @niov: the length of the @iov array
+ * @offset: the iovec offset in the file from where to read the data
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Returns: 0 if all bytes were read, or -1 on error
+ */
+int qio_channel_preadv_all(QIOChannel *ioc, const struct iovec *iov,
+   size_t niov, off_t offset, Error **errp);
+
 /**
  * qio_channel_pread
  * @ioc: the channel object
diff --git a/io/channel.c b/io/channel.c
index a1f12f8e90..2f1745d052 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -472,6 +472,69 @@ ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct 
iovec *iov,
 return klass->io_pwritev(ioc, iov, niov, offset, errp);
 }
 
+static int qio_channel_preadv_pwritev_contiguous(QIOChannel *ioc,
+ const struct iovec *iov,
+ size_t niov, off_t offset,
+ bool is_write, Error **errp)
+{
+ssize_t ret = -1;
+int i, slice_idx, slice_num;
+uintptr_t base, next, file_offset;
+size_t len;
+
+slice_idx = 0;
+slice_num = 1;
+
+/*
+ * If the iov array doesn't have contiguous elements, we need to
+ * split it in slices because we only have one (file) 'offset' for
+ * the whole iov. Do this here so callers don't need to break the
+ * iov array themselves.
+ */
+for (i = 0; i < niov; i++, slice_num++) {
+base = (uintptr_t) iov[i].iov_base;
+
+if (i != niov - 1) {
+len = iov[i].iov_len;
+next = (uintptr_t) iov[i + 1].iov_base;
+
+if (base + len == next) {
+continue;
+}
+}
+
+/*
+ * Use the offset of the first element of the segment that
+ * we're sending.
+ */
+file_offset = offset + (uintptr_t) iov[slice_idx].iov_base;
+
+if (is_write) {
+ret = qio_channel_pwritev(ioc, &iov[slice_idx], slice_num,
+  file_offset, errp);
+} else {
+ret = qio_channel_preadv(ioc, &iov[slice_idx], slice_num,
+ file_offset, errp);
+}
+
+if (ret < 0) {
+break;
+}
+
+slice_idx += slice_num;
+slice_num = 0;
+}
+
+return (ret < 0) ? -1 : 0;
+}
+
+int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
+size_t niov, off_t offset, Error **errp)
+{
+return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
+ offset, true, errp);
+}
+
 ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
off_t offset, Error **errp)
 {
@@ -501,6 +564,13 @@ ssize_t qio_channel_preadv(QIOChannel *ioc, const struct 
iovec *iov,
 return klass->io_preadv(ioc, iov, niov, offset, errp);
 }
 
+int qio_channel_preadv_all(QIOChannel *ioc, const struct iovec *iov,
+   size_t niov, off_t offset, 

[RFC PATCH v3 23/30] migration: Add direct-io parameter

2023-11-27 Thread Fabiano Rosas
Add the direct-io migration parameter that tells the migration code to
use O_DIRECT when opening the migration stream file whenever possible.

This is currently only used with fixed-ram migration for the multifd
channels that transfer the RAM pages. Those channels only transfer the
pages and are guaranteed to perform aligned writes.

However the parameter could be made to affect other types of
file-based migrations in the future.

Signed-off-by: Fabiano Rosas 
---
- json formatting
- added checks for O_DIRECT support
---
 include/qemu/osdep.h   |  2 ++
 migration/file.c   | 22 --
 migration/migration-hmp-cmds.c | 11 +++
 migration/options.c| 30 ++
 migration/options.h|  1 +
 qapi/migration.json| 18 +++---
 util/osdep.c   |  9 +
 7 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 475a1c62ff..ea5d29ab9b 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -597,6 +597,8 @@ int qemu_lock_fd_test(int fd, int64_t start, int64_t len, 
bool exclusive);
 bool qemu_has_ofd_lock(void);
 #endif
 
+bool qemu_has_direct_io(void);
+
 #if defined(__HAIKU__) && defined(__i386__)
 #define FMT_pid "%ld"
 #elif defined(WIN64)
diff --git a/migration/file.c b/migration/file.c
index 62ba994109..fc5c1a45f4 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -61,12 +61,30 @@ int file_send_channel_destroy(QIOChannel *ioc)
 
 void file_send_channel_create(QIOTaskFunc f, void *data)
 {
-QIOChannelFile *ioc;
+QIOChannelFile *ioc = NULL;
 QIOTask *task;
 Error *err = NULL;
 int flags = O_WRONLY;
 
-ioc = qio_channel_file_new_path(outgoing_args.fname, flags, 0, &err);
+if (migrate_direct_io()) {
+#ifdef O_DIRECT
+/*
+ * Enable O_DIRECT for the secondary channels. These are used
+ * for sending ram pages and writes should be guaranteed to be
+ * aligned to at least page size.
+ */
+flags |= O_DIRECT;
+#else
+error_setg(&err, "System does not support O_DIRECT");
+error_append_hint(&err,
+  "Try disabling direct-io migration capability\n");
+/* errors are propagated through the qio_task below */
+#endif
+}
+
+if (!err) {
+ioc = qio_channel_file_new_path(outgoing_args.fname, flags, 0, &err);
+}
 
 task = qio_task_new(OBJECT(ioc), f, (gpointer)data, NULL);
 if (!ioc) {
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 86ae832176..5ad6b2788d 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -392,6 +392,13 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict 
*qdict)
 monitor_printf(mon, "%s: %s\n",
 MigrationParameter_str(MIGRATION_PARAMETER_MODE),
 qapi_enum_lookup(&MigMode_lookup, params->mode));
+
+if (params->has_direct_io) {
+monitor_printf(mon, "%s: %s\n",
+   MigrationParameter_str(
+   MIGRATION_PARAMETER_DIRECT_IO),
+   params->direct_io ? "on" : "off");
+}
 }
 
 qapi_free_MigrationParameters(params);
@@ -679,6 +686,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict 
*qdict)
 p->has_mode = true;
 visit_type_MigMode(v, param, &p->mode, &err);
 break;
+case MIGRATION_PARAMETER_DIRECT_IO:
+p->has_direct_io = true;
+visit_type_bool(v, param, &p->direct_io, &err);
+break;
 default:
 assert(0);
 }
diff --git a/migration/options.c b/migration/options.c
index 7f23881f51..6c100dff7a 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -835,6 +835,22 @@ int migrate_decompress_threads(void)
 return s->parameters.decompress_threads;
 }
 
+bool migrate_direct_io(void)
+{
+MigrationState *s = migrate_get_current();
+
+/* For now O_DIRECT is only supported with fixed-ram */
+if (!s->capabilities[MIGRATION_CAPABILITY_FIXED_RAM]) {
+return false;
+}
+
+if (s->parameters.has_direct_io) {
+return s->parameters.direct_io;
+}
+
+return false;
+}
+
 uint64_t migrate_downtime_limit(void)
 {
 MigrationState *s = migrate_get_current();
@@ -1052,6 +1068,11 @@ MigrationParameters *qmp_query_migrate_parameters(Error 
**errp)
 params->has_mode = true;
 params->mode = s->parameters.mode;
 
+if (s->parameters.has_direct_io) {
+params->has_direct_io = true;
+params->direct_io = s->parameters.direct_io;
+}
+
 return params;
 }
 
@@ -1087,6 +1108,7 @@ void migrate_params_init(MigrationParameters *params)
 params->has_x_vcpu_dirty_limit_period = true;
 params->has_vcpu_dirty_limit = true;
 params->has_mode = true;
+params->has_direct_io = qemu_has_direct_io();
 }
 
 /*
@@ -1388,6 +1410,10 @@ static void 

[RFC PATCH v3 09/30] migration/ram: Add incoming 'fixed-ram' migration

2023-11-27 Thread Fabiano Rosas
Add the necessary code to parse the format changes for the 'fixed-ram'
capability.

One of the more notable changes in behavior is that in the 'fixed-ram'
case ram pages are restored in one go rather than constantly looping
through the migration stream.
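
The "one go" restore amounts to walking the pages bitmap and loading
each run of present pages from its fixed place in the file. The
following is a minimal sketch of that walk only, under the assumption
of a plain unsigned long bitmap; sketch_bit_is_set() and
sketch_next_run() are made-up helpers, not QEMU's find_next_bit()
family.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: test bit 'nr' in a bitmap made of unsigned longs. */
static bool sketch_bit_is_set(const unsigned long *bitmap, size_t nr)
{
    size_t bits = sizeof(unsigned long) * CHAR_BIT;

    return (bitmap[nr / bits] >> (nr % bits)) & 1UL;
}

/*
 * Hypothetical helper: find the next run of set bits at or after
 * 'start'. On success, store the first page index and the run length
 * and return true. Return false when no set bit is left.
 */
static bool sketch_next_run(const unsigned long *bitmap, size_t num_pages,
                            size_t start, size_t *first, size_t *count)
{
    size_t i = start;

    while (i < num_pages && !sketch_bit_is_set(bitmap, i)) {
        i++;
    }
    if (i >= num_pages) {
        return false;
    }

    *first = i;
    while (i < num_pages && sketch_bit_is_set(bitmap, i)) {
        i++;
    }
    *count = i - *first;

    return true;
}
```

A caller would loop on sketch_next_run(), passing first + count as
the next start, and read count pages per iteration.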

Signed-off-by: Nikolay Borisov 
Signed-off-by: Fabiano Rosas 
---
- added sanity check for pages_offset alignment
- s/parsing/reading
- used Error
- fixed buffer size computation, now allowing an arbitrary limit
- fixed dereference of pointer to packed struct member in endianness
  conversion
---
 migration/ram.c | 119 
 1 file changed, 119 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index 4a0ab8105f..08604222f2 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -106,6 +106,12 @@
  */
 #define FIXED_RAM_FILE_OFFSET_ALIGNMENT 0x10
 
+/*
+ * When doing fixed-ram migration, this is the amount we read from the
+ * pages region in the migration file at a time.
+ */
+#define FIXED_RAM_LOAD_BUF_SIZE 0x10
+
 XBZRLECacheStats xbzrle_counters;
 
 /* used by the search for pages to send */
@@ -2996,6 +3002,35 @@ static void fixed_ram_insert_header(QEMUFile *file, 
RAMBlock *block)
 qemu_put_buffer(file, (uint8_t *) header, header_size);
 }
 
+static bool fixed_ram_read_header(QEMUFile *file, FixedRamHeader *header,
+  Error **errp)
+{
+size_t ret, header_size = sizeof(FixedRamHeader);
+
+ret = qemu_get_buffer(file, (uint8_t *)header, header_size);
+if (ret != header_size) {
+error_setg(errp, "Could not read whole fixed-ram migration header "
+   "(expected %zd, got %zd bytes)", header_size, ret);
+return false;
+}
+
+/* migration stream is big-endian */
+header->version = be32_to_cpu(header->version);
+
+if (header->version > FIXED_RAM_HDR_VERSION) {
+error_setg(errp, "Migration fixed-ram capability version mismatch "
+   "(expected %d, got %d)", FIXED_RAM_HDR_VERSION,
+   header->version);
+return false;
+}
+
+header->page_size = be64_to_cpu(header->page_size);
+header->bitmap_offset = be64_to_cpu(header->bitmap_offset);
+header->pages_offset = be64_to_cpu(header->pages_offset);
+
+return true;
+}
+
 /*
  * Each of ram_save_setup, ram_save_iterate and ram_save_complete has
  * long-running RCU critical section.  When rcu-reclaims in the code
@@ -3892,6 +3927,80 @@ void colo_flush_ram_cache(void)
 trace_colo_flush_ram_cache_end();
 }
 
+static void read_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
+long num_pages, unsigned long *bitmap)
+{
+unsigned long set_bit_idx, clear_bit_idx;
+ram_addr_t offset;
+void *host;
+size_t read, unread, size, buf_size = FIXED_RAM_LOAD_BUF_SIZE;
+
+for (set_bit_idx = find_first_bit(bitmap, num_pages);
+ set_bit_idx < num_pages;
+ set_bit_idx = find_next_bit(bitmap, num_pages, clear_bit_idx + 1)) {
+
+clear_bit_idx = find_next_zero_bit(bitmap, num_pages, set_bit_idx + 1);
+
+unread = TARGET_PAGE_SIZE * (clear_bit_idx - set_bit_idx);
+offset = set_bit_idx << TARGET_PAGE_BITS;
+
+while (unread > 0) {
+host = host_from_ram_block_offset(block, offset);
+size = MIN(unread, buf_size);
+
+read = qemu_get_buffer_at(f, host, size,
+  block->pages_offset + offset);
+offset += read;
+unread -= read;
+}
+}
+}
+
+static int parse_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
+ram_addr_t length, Error **errp)
+{
+g_autofree unsigned long *bitmap = NULL;
+FixedRamHeader header;
+size_t bitmap_size;
+long num_pages;
+
+if (!fixed_ram_read_header(f, &header, errp)) {
+return -EINVAL;
+}
+
+block->pages_offset = header.pages_offset;
+
+/*
+ * Check the alignment of the file region that contains pages. We
+ * don't enforce FIXED_RAM_FILE_OFFSET_ALIGNMENT to allow that
+ * value to change in the future. Do only a sanity check with page
+ * size alignment.
+ */
+if (!QEMU_IS_ALIGNED(block->pages_offset, TARGET_PAGE_SIZE)) {
+error_setg(errp,
+   "Error reading ramblock %s pages, region has bad alignment",
+   block->idstr);
+return -EINVAL;
+}
+
+num_pages = length / header.page_size;
+bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
+
+bitmap = g_malloc0(bitmap_size);
+if (qemu_get_buffer_at(f, (uint8_t *)bitmap, bitmap_size,
+   header.bitmap_offset) != bitmap_size) {
+error_setg(errp, "Error reading dirty bitmap");
+return -EINVAL;
+}
+
+read_ramblock_fixed_ram(f, block, num_pages, bitmap);
+
+/* Skip pages array */
+qemu_set_offset(f, block->pages_offset + length, 

[RFC PATCH v3 12/30] migration/multifd: Allow QIOTask error reporting without an object

2023-11-27 Thread Fabiano Rosas
The only way for the channel backend to report an error to the multifd
core during creation is by setting the QIOTask error. We must allow
the channel backend to set the error even if the QIOChannel has failed
to be created, which means the QIOTask source object would be NULL.

In multifd_new_send_channel_async(), defer the QOM cast of the
channel until after we have checked for the QIOTask error.

Signed-off-by: Fabiano Rosas 
---
context: When doing multifd + file, it's possible that we fail to open
the file. I'll use the empty QIOTask to report the error back to
multifd.
---
 migration/multifd.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 9625640d61..123ff0dec0 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -865,8 +865,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
 return true;
 }
 
-static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
- QIOChannel *ioc, Error *err)
+static void multifd_new_send_channel_cleanup(MultiFDSendParams *p, Error *err)
 {
  migrate_set_error(migrate_get_current(), err);
  /* Error happen, we need to tell who pay attention to me */
@@ -878,20 +877,20 @@ static void 
multifd_new_send_channel_cleanup(MultiFDSendParams *p,
   * its status.
   */
  p->quit = true;
- object_unref(OBJECT(ioc));
  error_free(err);
 }
 
 static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
 {
 MultiFDSendParams *p = opaque;
-QIOChannel *ioc = QIO_CHANNEL(qio_task_get_source(task));
+Object *obj = qio_task_get_source(task);
 Error *local_err = NULL;
 
 trace_multifd_new_send_channel_async(p->id);
 if (!qio_task_propagate_error(task, &local_err)) {
-p->c = ioc;
-qio_channel_set_delay(p->c, false);
+QIOChannel *ioc = QIO_CHANNEL(obj);
+
+qio_channel_set_delay(ioc, false);
 p->running = true;
 if (multifd_channel_connect(p, ioc, &local_err)) {
 return;
@@ -899,7 +898,8 @@ static void multifd_new_send_channel_async(QIOTask *task, 
gpointer opaque)
 }
 
 trace_multifd_new_send_channel_async_error(p->id, local_err);
-multifd_new_send_channel_cleanup(p, ioc, local_err);
+multifd_new_send_channel_cleanup(p, local_err);
+object_unref(obj);
 }
 
 static void multifd_new_send_channel_create(gpointer opaque)
-- 
2.35.3




[RFC PATCH v3 20/30] migration/multifd: Support outgoing fixed-ram stream format

2023-11-27 Thread Fabiano Rosas
The new fixed-ram stream format uses a file transport and puts ram
pages in the migration file at their respective offsets. The writes
can be done in parallel by using the pwritev system call, which takes
iovecs and an offset.
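
A minimal sketch of the positioned-write idea, using plain POSIX
pwrite() rather than the QIOChannel API. SKETCH_PAGE_SIZE and both
function names are assumptions made for the example, not values or
helpers taken from this series.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_PAGE_SIZE 4096   /* assumed page size, illustration only */

/* File offset of a page: start of the pages region plus the page's offset. */
static off_t sketch_page_file_offset(off_t pages_offset, uint64_t page_idx)
{
    assert(pages_offset >= 0);

    return pages_offset + (off_t)(page_idx * SKETCH_PAGE_SIZE);
}

/*
 * Write one page at its fixed location. pwrite() does not use or move
 * the file position, so several threads can call this on the same fd
 * without coordinating a shared cursor.
 */
static int sketch_write_page(int fd, off_t pages_offset, uint64_t page_idx,
                             const void *page)
{
    ssize_t ret = pwrite(fd, page, SKETCH_PAGE_SIZE,
                         sketch_page_file_offset(pages_offset, page_idx));

    return ret == SKETCH_PAGE_SIZE ? 0 : -1;
}
```

Because every page has exactly one destination in the file, pages can
be written in any order and rewriting a page overwrites the old copy.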

Add support for enabling the new format along with multifd to make use
of the threading and page handling already in place.

This requires multifd to stop sending headers and leave the stream
format to the fixed-ram code. When it comes time to write the data, we
need to call a version of qio_channel_write that can take an offset.

Usage on HMP is:

(qemu) stop
(qemu) migrate_set_capability multifd on
(qemu) migrate_set_capability fixed-ram on
(qemu) migrate_set_parameter max-bandwidth 0
(qemu) migrate_set_parameter multifd-channels 8
(qemu) migrate file:migfile

Signed-off-by: Fabiano Rosas 
---
- altered to call a separate qio_channel function for fixed-ram
---
 include/qemu/bitops.h | 13 +++
 migration/migration.c | 19 ++
 migration/multifd.c   | 81 ---
 migration/options.c   |  6 
 migration/ram.c   | 17 +++--
 migration/ram.h   |  1 +
 6 files changed, 110 insertions(+), 27 deletions(-)

diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
index cb3526d1f4..2c0a2fe751 100644
--- a/include/qemu/bitops.h
+++ b/include/qemu/bitops.h
@@ -67,6 +67,19 @@ static inline void clear_bit(long nr, unsigned long *addr)
 *p &= ~mask;
 }
 
+/**
+ * clear_bit_atomic - Clears a bit in memory atomically
+ * @nr: Bit to clear
+ * @addr: Address to start counting from
+ */
+static inline void clear_bit_atomic(long nr, unsigned long *addr)
+{
+unsigned long mask = BIT_MASK(nr);
+unsigned long *p = addr + BIT_WORD(nr);
+
+return qatomic_and(p, ~mask);
+}
+
 /**
  * change_bit - Toggle a bit in memory
  * @nr: Bit to change
diff --git a/migration/migration.c b/migration/migration.c
index 16689171ab..cc707b0223 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -128,11 +128,19 @@ static bool migration_needs_multiple_sockets(void)
 return migrate_multifd() || migrate_postcopy_preempt();
 }
 
-static bool transport_supports_multi_channels(SocketAddress *saddr)
+static bool transport_supports_multi_channels(MigrationAddress *addr)
 {
-return saddr->type == SOCKET_ADDRESS_TYPE_INET ||
-   saddr->type == SOCKET_ADDRESS_TYPE_UNIX ||
-   saddr->type == SOCKET_ADDRESS_TYPE_VSOCK;
+if (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) {
+SocketAddress *saddr = &addr->u.socket;
+
+return (saddr->type == SOCKET_ADDRESS_TYPE_INET ||
+saddr->type == SOCKET_ADDRESS_TYPE_UNIX ||
+saddr->type == SOCKET_ADDRESS_TYPE_VSOCK);
+} else if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
+return migrate_fixed_ram();
+} else {
+return false;
+}
 }
 
 static bool migration_needs_seekable_channel(void)
@@ -156,8 +164,7 @@ 
migration_channels_and_transport_compatible(MigrationAddress *addr,
 }
 
 if (migration_needs_multiple_sockets() &&
-(addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) &&
-!transport_supports_multi_channels(&addr->u.socket)) {
+!transport_supports_multi_channels(addr)) {
 error_setg(errp, "Migration requires multi-channel URIs (e.g. tcp)");
 return false;
 }
diff --git a/migration/multifd.c b/migration/multifd.c
index 7dfab2367a..8eae7de4de 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -278,6 +278,17 @@ static void multifd_pages_clear(MultiFDPages_t *pages)
 g_free(pages);
 }
 
+static void multifd_set_file_bitmap(MultiFDSendParams *p)
+{
+MultiFDPages_t *pages = p->pages;
+
+assert(pages->block);
+
+for (int i = 0; i < p->normal_num; i++) {
+ramblock_set_shadow_bmap_atomic(pages->block, pages->offset[i]);
+}
+}
+
 static void multifd_send_fill_packet(MultiFDSendParams *p)
 {
 MultiFDPacket_t *packet = p->packet;
@@ -624,6 +635,34 @@ int multifd_send_sync_main(QEMUFile *f)
 }
 }
 
+if (!migrate_multifd_packets()) {
+/*
+ * There's no sync packet to send. Just make sure the sending
+ * above has finished.
+ */
+for (i = 0; i < migrate_multifd_channels(); i++) {
+qemu_sem_wait(&multifd_send_state->channels_ready);
+}
+
+/* sanity check and release the channels */
+for (i = 0; i < migrate_multifd_channels(); i++) {
+MultiFDSendParams *p = &multifd_send_state->params[i];
+
+qemu_mutex_lock(&p->mutex);
+if (p->quit) {
+error_report("%s: channel %d has already quit!", __func__, i);
+qemu_mutex_unlock(&p->mutex);
+return -1;
+}
+assert(!p->pending_job);
+qemu_mutex_unlock(&p->mutex);
+
+qemu_sem_post(&p->sem);
+}
+
+return 0;
+}
+
 /*
  * When using zero-copy, it's necessary to flush the pages before any of
  

[RFC PATCH v3 26/30] monitor: Extract fdset fd flags comparison into a function

2023-11-27 Thread Fabiano Rosas
We're about to add one more condition to the flags comparison that
requires an ifdef. Move the code into a separate function now to make
it cleaner after the next patch.

Signed-off-by: Fabiano Rosas 
---
 monitor/fds.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/monitor/fds.c b/monitor/fds.c
index 4ec3b7eea9..9a28e4b72b 100644
--- a/monitor/fds.c
+++ b/monitor/fds.c
@@ -406,6 +406,19 @@ AddfdInfo *monitor_fdset_add_fd(int fd, bool has_fdset_id, 
int64_t fdset_id,
 return fdinfo;
 }
 
+#ifndef _WIN32
+static bool monitor_fdset_flags_match(int flags, int fd_flags)
+{
+bool match = false;
+
+if ((flags & O_ACCMODE) == (fd_flags & O_ACCMODE)) {
+match = true;
+}
+
+return match;
+}
+#endif
+
 int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
 {
 #ifdef _WIN32
@@ -431,7 +444,7 @@ int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
 return -1;
 }
 
-if ((flags & O_ACCMODE) == (mon_fd_flags & O_ACCMODE)) {
+if (monitor_fdset_flags_match(flags, mon_fd_flags)) {
 fd = mon_fdset_fd->fd;
 break;
 }
-- 
2.35.3
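
For readers unfamiliar with the O_ACCMODE idiom used in the helper above, here is a minimal stand-alone sketch. The function name access_modes_match() is made up for illustration and is not part of monitor/fds.c; only the masking technique is taken from the patch.

```c
#include <fcntl.h>
#include <stdbool.h>

/*
 * Hypothetical stand-alone helper (not the QEMU function).
 * O_ACCMODE masks out everything except the access mode
 * (O_RDONLY, O_WRONLY or O_RDWR), so status flags such as
 * O_NONBLOCK or O_APPEND do not influence the comparison.
 */
static bool access_modes_match(int requested_flags, int fd_flags)
{
    return (requested_flags & O_ACCMODE) == (fd_flags & O_ACCMODE);
}
```

With this, a request for O_RDONLY matches an fd opened with O_RDONLY | O_NONBLOCK, but not one opened with O_WRONLY.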




[RFC PATCH v3 11/30] migration/multifd: Allow multifd without packets

2023-11-27 Thread Fabiano Rosas
For the upcoming support for the new 'fixed-ram' migration stream
format, we cannot use multifd packets because each write into the
ramblock section in the migration file is expected to contain only the
guest pages. They are written at their respective offsets relative to
the ramblock section header.

There is no space for the packet information and the expected gains
from the new approach come partly from being able to write the pages
sequentially without extraneous data in between.

The new format also doesn't need the packets and all necessary
information can be taken from the standard migration headers with some
(future) changes to multifd code.

Use the presence of the fixed-ram capability to decide whether to send
packets. For now this has no effect as fixed-ram cannot yet be enabled
with multifd.
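
As a rough illustration of what "without packets" means for the iovec layout, the sketch below reserves slot 0 for a header only when packets are in use. It is a simplification under stated assumptions: build_iov() and its parameters are hypothetical names, and the zero-copy case handled by the real code is left out.

```c
#include <stddef.h>
#include <sys/uio.h>

/*
 * Hypothetical helper, not QEMU code. Fills @iov with an optional
 * header entry followed by one entry per page and returns the number
 * of entries used. @iov must have room for npages + 1 entries.
 */
static size_t build_iov(struct iovec *iov, int use_packets,
                        void *header, size_t header_len,
                        void **pages, size_t npages, size_t page_size)
{
    size_t n = 0;

    if (use_packets) {
        /* The packet header travels in the same writev as the pages. */
        iov[n].iov_base = header;
        iov[n].iov_len = header_len;
        n++;
    }

    for (size_t i = 0; i < npages; i++) {
        /* Without packets the iovec carries guest pages only. */
        iov[n].iov_base = pages[i];
        iov[n].iov_len = page_size;
        n++;
    }

    return n;
}
```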

Signed-off-by: Fabiano Rosas 
---
- moved more of the packet code under use_packets
---
 migration/multifd.c | 138 +++-
 migration/options.c |   5 ++
 migration/options.h |   1 +
 3 files changed, 91 insertions(+), 53 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index ec58c58082..9625640d61 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -654,18 +654,22 @@ static void *multifd_send_thread(void *opaque)
 Error *local_err = NULL;
 int ret = 0;
 bool use_zero_copy_send = migrate_zero_copy_send();
+bool use_packets = migrate_multifd_packets();
 
 thread = migration_threads_add(p->name, qemu_get_thread_id());
 
 trace_multifd_send_thread_start(p->id);
 rcu_register_thread();
 
-if (multifd_send_initial_packet(p, &local_err) < 0) {
-ret = -1;
-goto out;
+if (use_packets) {
+if (multifd_send_initial_packet(p, &local_err) < 0) {
+ret = -1;
+goto out;
+}
+
+/* initial packet */
+p->num_packets = 1;
 }
-/* initial packet */
-p->num_packets = 1;
 
 while (true) {
 qemu_sem_post(&multifd_send_state->channels_ready);
@@ -677,11 +681,10 @@ static void *multifd_send_thread(void *opaque)
 qemu_mutex_lock(&p->mutex);
 
 if (p->pending_job) {
-uint64_t packet_num = p->packet_num;
 uint32_t flags;
 p->normal_num = 0;
 
-if (use_zero_copy_send) {
+if (!use_packets || use_zero_copy_send) {
 p->iovs_num = 0;
 } else {
 p->iovs_num = 1;
@@ -699,16 +702,20 @@ static void *multifd_send_thread(void *opaque)
 break;
 }
 }
-multifd_send_fill_packet(p);
+
+if (use_packets) {
+multifd_send_fill_packet(p);
+p->num_packets++;
+}
+
 flags = p->flags;
 p->flags = 0;
-p->num_packets++;
 p->total_normal_pages += p->normal_num;
 p->pages->num = 0;
 p->pages->block = NULL;
 qemu_mutex_unlock(&p->mutex);
 
-trace_multifd_send(p->id, packet_num, p->normal_num, flags,
+trace_multifd_send(p->id, p->packet_num, p->normal_num, flags,
p->next_packet_size);
 
 if (use_zero_copy_send) {
@@ -718,7 +725,7 @@ static void *multifd_send_thread(void *opaque)
 if (ret != 0) {
 break;
 }
-} else {
+} else if (use_packets) {
 /* Send header using the same writev call */
 p->iov[0].iov_len = p->packet_len;
 p->iov[0].iov_base = p->packet;
@@ -904,6 +911,7 @@ int multifd_save_setup(Error **errp)
 {
 int thread_count;
 uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
+bool use_packets = migrate_multifd_packets();
 uint8_t i;
 
 if (!migrate_multifd()) {
@@ -928,14 +936,20 @@ int multifd_save_setup(Error **errp)
 p->pending_job = 0;
 p->id = i;
 p->pages = multifd_pages_init(page_count);
-p->packet_len = sizeof(MultiFDPacket_t)
-  + sizeof(uint64_t) * page_count;
-p->packet = g_malloc0(p->packet_len);
-p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
-p->packet->version = cpu_to_be32(MULTIFD_VERSION);
+
+if (use_packets) {
+p->packet_len = sizeof(MultiFDPacket_t)
+  + sizeof(uint64_t) * page_count;
+p->packet = g_malloc0(p->packet_len);
+p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
+p->packet->version = cpu_to_be32(MULTIFD_VERSION);
+
+/* We need one extra place for the packet header */
+p->iov = g_new0(struct iovec, page_count + 1);
+} else {
+p->iov = g_new0(struct iovec, page_count);
+}
 p->name = g_strdup_printf("multifdsend_%d", i);
-/* We need one extra place for the packet header */
-p->iov = g_new0(struct 

Re: [PATCH v1 1/2] hw/cxl/device: read from register values in mdev_reg_read()

2023-11-27 Thread Davidlohr Bueso

On Mon, 27 Nov 2023, Hyeonggon Yoo wrote:


In the current mdev_reg_read() implementation, it consistently returns
that the Media Status is Ready (01b). This was fine until commit
25a52959f99d ("hw/cxl: Add support for device sanitation") because the
media was presumed to be ready.

However, as per the CXL 3.0 spec "8.2.9.8.5.1 Sanitize (Opcode 4400h)",
during sanitation, the Media State should be set to Disabled (11b). The
mentioned commit correctly sets it to Disabled, but mdev_reg_read()
still returns Media Status as Ready.

To address this, update mdev_reg_read() to read register values instead
of returning dummy values.

Fixes: commit 25a52959f99d ("hw/cxl: Add support for device sanitation")
Signed-off-by: Hyeonggon Yoo <42.hye...@gmail.com>


Looks good, thanks.

Reviewed-by: Davidlohr Bueso 

In addition, how about the following to make this more robust?
  - disallow certain incoming cci cmd when media is disabled
  - deal with memory reads/writes when media is disabled
  - make __toggle_media() a nop when passed value is already set
  - play nice with arm64 by using little-endian reads and writes (this
should be extended to all of mbox/cci of course).

8<-
diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c
index 6eff56fb1b34..9bc5121215c9 100644
--- a/hw/cxl/cxl-mailbox-utils.c
+++ b/hw/cxl/cxl-mailbox-utils.c
@@ -1314,6 +1314,7 @@ int cxl_process_cci_message(CXLCCI *cci, uint8_t set, 
uint8_t cmd,
 int ret;
 const struct cxl_cmd *cxl_cmd;
 opcode_handler h;
+CXLType3Dev *ct3d = CXL_TYPE3(cci->d);
 
 *len_out = 0;

 cxl_cmd = &cci->cxl_cmd_set[set][cmd];
@@ -1334,8 +1335,8 @@ int cxl_process_cci_message(CXLCCI *cci, uint8_t set, 
uint8_t cmd,
 return CXL_MBOX_BUSY;
 }
 
-/* forbid any selected commands while overwriting */
-if (sanitize_running(cci)) {
+/* forbid any selected commands when necessary */
+if (sanitize_running(cci) || cxl_dev_media_disabled(&ct3d->cxl_dstate)) {
 if (h == cmd_events_get_records ||
 h == cmd_ccls_get_partition_info ||
 h == cmd_ccls_set_lsa ||
diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
index 72d93713473d..e0a164fde007 100644
--- a/hw/mem/cxl_type3.c
+++ b/hw/mem/cxl_type3.c
@@ -899,7 +899,8 @@ MemTxResult cxl_type3_read(PCIDevice *d, hwaddr host_addr, 
uint64_t *data,
 return MEMTX_ERROR;
 }
 
-if (sanitize_running(&ct3d->cci)) {
+if (sanitize_running(&ct3d->cci) ||
+cxl_dev_media_disabled(&ct3d->cxl_dstate)) {
 qemu_guest_getrandom_nofail(data, size);
 return MEMTX_OK;
 }
@@ -925,6 +926,11 @@ MemTxResult cxl_type3_write(PCIDevice *d, hwaddr 
host_addr, uint64_t data,
 return MEMTX_OK;
 }
 
+/* memory writes to the device will have no effect */
+if (cxl_dev_media_disabled(&ct3d->cxl_dstate)) {
+return MEMTX_OK;
+}
+
 return address_space_write(as, dpa_offset, attrs, &data, size);
 }
 
diff --git a/include/hw/cxl/cxl_device.h b/include/hw/cxl/cxl_device.h
index 873e6d6ab159..007d4169df7c 100644
--- a/include/hw/cxl/cxl_device.h
+++ b/include/hw/cxl/cxl_device.h
@@ -349,14 +349,26 @@ REG64(CXL_MEM_DEV_STS, 0)
 FIELD(CXL_MEM_DEV_STS, MBOX_READY, 4, 1)
 FIELD(CXL_MEM_DEV_STS, RESET_NEEDED, 5, 3)
 
+static inline bool cxl_dev_media_disabled(CXLDeviceState *cxl_dstate)
+{
+uint64_t dev_status_reg;
+
+dev_status_reg = ldq_le_p(cxl_dstate->mbox_reg_state64 + R_CXL_MEM_DEV_STS);
+return FIELD_EX64(dev_status_reg, CXL_MEM_DEV_STS, MEDIA_STATUS) == 0x3;
+}
+
 static inline void __toggle_media(CXLDeviceState *cxl_dstate, int val)
 {
 uint64_t dev_status_reg;
 
-dev_status_reg = cxl_dstate->mbox_reg_state64[R_CXL_MEM_DEV_STS];
+dev_status_reg = ldq_le_p(cxl_dstate->mbox_reg_state64 + R_CXL_MEM_DEV_STS);
+if (FIELD_EX64(dev_status_reg, CXL_MEM_DEV_STS, MEDIA_STATUS) == val) {
+return;
+}
+
 dev_status_reg = FIELD_DP64(dev_status_reg, CXL_MEM_DEV_STS, MEDIA_STATUS,
 val);
-cxl_dstate->mbox_reg_state64[R_CXL_MEM_DEV_STS] = dev_status_reg;
+stq_le_p(cxl_dstate->mbox_reg_state64 + R_CXL_MEM_DEV_STS, dev_status_reg);
 }
 #define cxl_dev_disable_media(cxlds)\
 do { __toggle_media((cxlds), 0x3); } while (0)
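
To make the little-endian access and the "nop when already set" idea in the diff above concrete outside of QEMU, here is a self-contained sketch. It is only an illustration: load_le64()/store_le64() stand in for ldq_le_p()/stq_le_p(), the function names are invented, and the Media Status field position (bits 3:2) is an assumption that is not visible in the quoted hunk.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed field layout: Media Status in bits 3:2 of the status register. */
#define MEDIA_STATUS_SHIFT    2
#define MEDIA_STATUS_MASK     0x3ULL
#define MEDIA_STATUS_DISABLED 0x3ULL

/* Host-endian-independent 64-bit little-endian load. */
static uint64_t load_le64(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 7; i >= 0; i--) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Host-endian-independent 64-bit little-endian store. */
static void store_le64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        p[i] = (uint8_t)(v & 0xff);
        v >>= 8;
    }
}

static bool media_disabled(const uint8_t *reg)
{
    uint64_t r = load_le64(reg);

    return ((r >> MEDIA_STATUS_SHIFT) & MEDIA_STATUS_MASK) ==
           MEDIA_STATUS_DISABLED;
}

/* Returns true if the register changed, false if it was a nop. */
static bool set_media_status(uint8_t *reg, unsigned val)
{
    uint64_t field = (uint64_t)val & MEDIA_STATUS_MASK;
    uint64_t r = load_le64(reg);

    if (((r >> MEDIA_STATUS_SHIFT) & MEDIA_STATUS_MASK) == field) {
        return false;
    }

    r &= ~(MEDIA_STATUS_MASK << MEDIA_STATUS_SHIFT);
    r |= field << MEDIA_STATUS_SHIFT;
    store_le64(reg, r);
    return true;
}
```

Because the register is always accessed byte by byte, the result is the same on little- and big-endian hosts.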



[RFC PATCH v3 25/30] monitor: Honor QMP request for fd removal immediately

2023-11-27 Thread Fabiano Rosas
We're currently only removing an fd from the fdset if the VM is
running. This causes a QMP call to "remove-fd" to not actually remove
the fd if the VM happens to be stopped.

While the fd would eventually be removed when monitor_fdset_cleanup()
is called again, the user request should be honored and the fd
actually removed. Calling remove-fd + query-fdset shows a recently
removed fd still present.

The runstate_is_running() check was introduced by commit ebe52b592d
("monitor: Prevent removing fd from set during init"), which, judging
by its shortlog, was trying to avoid removing a yet-unduplicated fd
too early.

I don't see why an fd explicitly removed with qmp_remove_fd() should
be under runstate_is_running(). I'm assuming this was a mistake when
adding the parentheses around the expression.

Move the runstate_is_running() check to apply only to the
QLIST_EMPTY(dup_fds) side of the expression and ignore it when
mon_fdset_fd->removed has been explicitly set.

Signed-off-by: Fabiano Rosas 
---
 monitor/fds.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/monitor/fds.c b/monitor/fds.c
index d86c2c674c..4ec3b7eea9 100644
--- a/monitor/fds.c
+++ b/monitor/fds.c
@@ -173,9 +173,9 @@ static void monitor_fdset_cleanup(MonFdset *mon_fdset)
 MonFdsetFd *mon_fdset_fd_next;
 
 QLIST_FOREACH_SAFE(mon_fdset_fd, &mon_fdset->fds, next, mon_fdset_fd_next) {
-if ((mon_fdset_fd->removed ||
-(QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0)) &&
-runstate_is_running()) {
+if (mon_fdset_fd->removed ||
+(QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0 &&
+ runstate_is_running())) {
 close(mon_fdset_fd->fd);
 g_free(mon_fdset_fd->opaque);
 QLIST_REMOVE(mon_fdset_fd, next);
-- 
2.35.3
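
The effect of moving the parentheses can be checked in isolation. The sketch below reduces the two conditions to plain booleans; the function and parameter names are invented for illustration and do not exist in monitor/fds.c.

```c
#include <stdbool.h>

/* Condition before the patch: nothing is closed while the VM is stopped. */
static bool should_close_old(bool removed, bool no_dup_fds,
                             bool no_monitors, bool vm_running)
{
    return (removed || (no_dup_fds && no_monitors)) && vm_running;
}

/*
 * Condition after the patch: an explicitly removed fd is always closed;
 * the runstate check only guards the "unused fd" case.
 */
static bool should_close_new(bool removed, bool no_dup_fds,
                             bool no_monitors, bool vm_running)
{
    return removed || (no_dup_fds && no_monitors && vm_running);
}
```

The two versions differ only when removed is true and the VM is not running, which is the remove-fd case described in the commit message.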




[RFC PATCH v3 22/30] tests/qtest: Add a multifd + fixed-ram migration test

2023-11-27 Thread Fabiano Rosas
Signed-off-by: Fabiano Rosas 
---
 tests/qtest/migration-test.c | 45 
 1 file changed, 45 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 96a6217af0..5c5725687c 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2183,6 +2183,46 @@ static void test_precopy_file_fixed_ram(void)
 test_file_common(&args, true);
 }
 
+static void *migrate_multifd_fixed_ram_start(QTestState *from, QTestState *to)
+{
+migrate_fixed_ram_start(from, to);
+
+migrate_set_parameter_int(from, "multifd-channels", 4);
+migrate_set_parameter_int(to, "multifd-channels", 4);
+
+migrate_set_capability(from, "multifd", true);
+migrate_set_capability(to, "multifd", true);
+
+return NULL;
+}
+
+static void test_multifd_file_fixed_ram_live(void)
+{
+g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
+   FILE_TEST_FILENAME);
+MigrateCommon args = {
+.connect_uri = uri,
+.listen_uri = "defer",
+.start_hook = migrate_multifd_fixed_ram_start,
+};
+
+test_file_common(&args, false);
+}
+
+static void test_multifd_file_fixed_ram(void)
+{
+g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
+   FILE_TEST_FILENAME);
+MigrateCommon args = {
+.connect_uri = uri,
+.listen_uri = "defer",
+.start_hook = migrate_multifd_fixed_ram_start,
+};
+
+test_file_common(&args, true);
+}
+
+
 static void test_precopy_tcp_plain(void)
 {
 MigrateCommon args = {
@@ -3431,6 +3471,11 @@ int main(int argc, char **argv)
 qtest_add_func("/migration/precopy/file/fixed-ram/live",
test_precopy_file_fixed_ram_live);
 
+qtest_add_func("/migration/multifd/file/fixed-ram",
+   test_multifd_file_fixed_ram);
+qtest_add_func("/migration/multifd/file/fixed-ram/live",
+   test_multifd_file_fixed_ram_live);
+
 #ifdef CONFIG_GNUTLS
 qtest_add_func("/migration/precopy/unix/tls/psk",
test_precopy_unix_tls_psk);
-- 
2.35.3




[RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram

2023-11-27 Thread Fabiano Rosas
Hi,

In this v3:

Added support for the "file:/dev/fdset/" syntax to receive multiple
file descriptors. This allows the management layer to open the
migration file beforehand and pass the file descriptors to QEMU. We
need more than one fd to be able to use O_DIRECT concurrently with
unaligned writes.

Dropped the auto-pause capability. That discussion was kind of
stuck. We can revisit optimizations for non-live scenarios once the
series is more mature/merged.

Changed the multifd incoming side to use a more generic data structure
instead of MultiFDPages_t. This allows multifd to restore the ram
using larger chunks.

The rest are minor changes, I have noted them in the patches
themselves.

Thanks

CI run: https://gitlab.com/farosas/qemu/-/pipelines/1086786947

v2:
https://lore.kernel.org/r/20231023203608.26370-1-faro...@suse.de
v1:
https://lore.kernel.org/r/20230330180336.2791-1-faro...@suse.de

Fabiano Rosas (24):
  io: fsync before closing a file channel
  migration/ram: Introduce 'fixed-ram' migration capability
  migration: Add fixed-ram URI compatibility check
  migration/ram: Add incoming 'fixed-ram' migration
  migration/multifd: Allow multifd without packets
  migration/multifd: Allow QIOTask error reporting without an object
  migration/multifd: Add outgoing QIOChannelFile support
  migration/multifd: Add incoming QIOChannelFile support
  io: Add a pwritev/preadv version that takes a discontiguous iovec
  multifd: Rename MultiFDSendParams::data to compress_data
  migration/multifd: Decouple recv method from pages
  migration/multifd: Allow receiving pages without packets
  migration/ram: Ignore multifd flush when doing fixed-ram migration
  migration/multifd: Support outgoing fixed-ram stream format
  migration/multifd: Support incoming fixed-ram stream format
  tests/qtest: Add a multifd + fixed-ram migration test
  migration: Add direct-io parameter
  tests/qtest: Add a test for migration with direct-io and multifd
  monitor: Honor QMP request for fd removal immediately
  monitor: Extract fdset fd flags comparison into a function
  monitor: fdset: Match against O_DIRECT
  docs/devel/migration.rst: Document the file transport
  migration: Add support for fdset with multifd + file
  tests/qtest: Add a test for fixed-ram with passing of fds

Nikolay Borisov (6):
  io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file
  io: Add generic pwritev/preadv interface
  io: implement io_pwritev/preadv for QIOChannelFile
  migration/qemu-file: add utility methods for working with seekable
channels
  migration/ram: Add outgoing 'fixed-ram' migration
  tests/qtest: migration-test: Add tests for fixed-ram file-based
migration

 docs/devel/migration.rst|  43 
 include/exec/ramblock.h |   8 +
 include/io/channel.h| 109 
 include/migration/qemu-file-types.h |   2 +
 include/qemu/bitops.h   |  13 +
 include/qemu/osdep.h|   2 +
 io/channel-file.c   |  69 +
 io/channel.c| 128 ++
 migration/file.c| 191 +-
 migration/file.h|   5 +
 migration/migration-hmp-cmds.c  |  11 +
 migration/migration.c   |  38 ++-
 migration/multifd-zlib.c|  22 +-
 migration/multifd-zstd.c|  22 +-
 migration/multifd.c | 376 ++--
 migration/multifd.h |  30 ++-
 migration/options.c |  70 ++
 migration/options.h |   4 +
 migration/qemu-file.c   |  82 ++
 migration/qemu-file.h   |   7 +-
 migration/ram.c | 291 -
 migration/ram.h |   1 +
 migration/savevm.c  |   1 +
 monitor/fds.c   |  27 +-
 qapi/migration.json |  24 +-
 tests/qtest/migration-helpers.c |  42 
 tests/qtest/migration-helpers.h |   1 +
 tests/qtest/migration-test.c| 206 +++
 util/osdep.c|   9 +
 29 files changed, 1686 insertions(+), 148 deletions(-)

-- 
2.35.3




[RFC PATCH v3 06/30] migration/ram: Introduce 'fixed-ram' migration capability

2023-11-27 Thread Fabiano Rosas
Add a new migration capability 'fixed-ram'.

The core of the feature is to ensure that each RAM page has a specific
offset in the resulting migration stream. The reasons why we'd want
such behavior are:

 - The resulting file will have a bounded size, since pages which are
   dirtied multiple times will always go to a fixed location in the
   file, rather than constantly being added to a sequential
   stream. This eliminates cases where a VM with, say, 1G of RAM can
   result in a migration file that's 10s of GBs, provided that the
   workload constantly redirties memory.

 - It paves the way to implement O_DIRECT-enabled save/restore of the
   migration stream as the pages are ensured to be written at aligned
   offsets.

 - It allows the usage of multifd so we can write RAM pages to the
   migration file in parallel.

For now, enabling the capability has no effect. The next couple of
patches implement the core functionality.
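
A minimal sketch of the "fixed offset" idea, assuming a 4 KiB page size and a hypothetical pages_start offset at which a RAMBlock's page area begins in the file (neither value is taken from this series). Rewriting a dirtied page lands on the same file offset, so the file does not grow.

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size, for illustration */

/* File offset of a page: start of the block's page area plus its offset. */
static off_t page_file_offset(off_t pages_start, uint64_t page_index)
{
    return pages_start + (off_t)(page_index * SKETCH_PAGE_SIZE);
}

/*
 * Hypothetical helper: write one page at its fixed location.
 * Returns 0 on success, -1 on error or short write.
 */
static int write_page_at_fixed_offset(int fd, off_t pages_start,
                                      uint64_t page_index, const void *page)
{
    ssize_t n = pwrite(fd, page, SKETCH_PAGE_SIZE,
                       page_file_offset(pages_start, page_index));

    return n == (ssize_t)SKETCH_PAGE_SIZE ? 0 : -1;
}
```

Since every page offset is a multiple of the page size relative to pages_start, an aligned pages_start also yields the aligned offsets that O_DIRECT needs.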

Signed-off-by: Fabiano Rosas 
---
- mentioned seeking on docs
---
 docs/devel/migration.rst | 21 +
 migration/options.c  | 34 ++
 migration/options.h  |  1 +
 migration/savevm.c   |  1 +
 qapi/migration.json  |  6 +-
 5 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
index ec55089b25..eeb4fec31f 100644
--- a/docs/devel/migration.rst
+++ b/docs/devel/migration.rst
@@ -572,6 +572,27 @@ Others (especially either older devices or system devices 
which for
 some reason don't have a bus concept) make use of the ``instance id``
 for otherwise identically named devices.
 
+Fixed-ram format
+----------------
+
+When the ``fixed-ram`` capability is enabled, a slightly different
+stream format is used for the RAM section. Instead of having a
+sequential stream of pages that follow the RAMBlock headers, the dirty
+pages for a RAMBlock follow its header. This ensures that each RAM
+page has a fixed offset in the resulting migration file.
+
+The ``fixed-ram`` capability must be enabled in both source and
+destination with:
+
+``migrate_set_capability fixed-ram on``
+
+Since pages are written to their relative offsets and out of order
+(due to the memory dirtying patterns), streaming channels such as
+sockets are not supported. A seekable channel such as a file is
+required. This can be verified in the QIOChannel by the presence of
+the QIO_CHANNEL_FEATURE_SEEKABLE. In more practical terms, this
+migration format requires the ``file:`` URI when migrating.
+
 Return path
 ---
 
diff --git a/migration/options.c b/migration/options.c
index 8d8ec73ad9..775428a8a5 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -204,6 +204,7 @@ Property migration_properties[] = {
 DEFINE_PROP_MIG_CAP("x-switchover-ack",
 MIGRATION_CAPABILITY_SWITCHOVER_ACK),
 DEFINE_PROP_MIG_CAP("x-dirty-limit", MIGRATION_CAPABILITY_DIRTY_LIMIT),
+DEFINE_PROP_MIG_CAP("x-fixed-ram", MIGRATION_CAPABILITY_FIXED_RAM),
 DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -263,6 +264,13 @@ bool migrate_events(void)
 return s->capabilities[MIGRATION_CAPABILITY_EVENTS];
 }
 
+bool migrate_fixed_ram(void)
+{
+MigrationState *s = migrate_get_current();
+
+return s->capabilities[MIGRATION_CAPABILITY_FIXED_RAM];
+}
+
 bool migrate_ignore_shared(void)
 {
 MigrationState *s = migrate_get_current();
@@ -645,6 +653,32 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, 
Error **errp)
 }
 }
 
+if (new_caps[MIGRATION_CAPABILITY_FIXED_RAM]) {
+if (new_caps[MIGRATION_CAPABILITY_MULTIFD]) {
+error_setg(errp,
+   "Fixed-ram migration is incompatible with multifd");
+return false;
+}
+
+if (new_caps[MIGRATION_CAPABILITY_XBZRLE]) {
+error_setg(errp,
+   "Fixed-ram migration is incompatible with xbzrle");
+return false;
+}
+
+if (new_caps[MIGRATION_CAPABILITY_COMPRESS]) {
+error_setg(errp,
+   "Fixed-ram migration is incompatible with compression");
+return false;
+}
+
+if (new_caps[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
+error_setg(errp,
+   "Fixed-ram migration is incompatible with postcopy ram");
+return false;
+}
+}
+
 return true;
 }
 
diff --git a/migration/options.h b/migration/options.h
index 246c160aee..8680a10b79 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -31,6 +31,7 @@ bool migrate_compress(void);
 bool migrate_dirty_bitmaps(void);
 bool migrate_dirty_limit(void);
 bool migrate_events(void);
+bool migrate_fixed_ram(void);
 bool migrate_ignore_shared(void);
 bool migrate_late_block_activate(void);
 bool migrate_multifd(void);
diff --git a/migration/savevm.c b/migration/savevm.c
index eec5503a42..48c37bd198 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ 

[RFC PATCH v3 04/30] io: fsync before closing a file channel

2023-11-27 Thread Fabiano Rosas
Make sure the data is flushed to disk before closing file
channels. This will ensure data is on disk at the end of a migration
to file.

Signed-off-by: Fabiano Rosas 
---
 io/channel-file.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/io/channel-file.c b/io/channel-file.c
index a6ad7770c6..d4706fa592 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -242,6 +242,11 @@ static int qio_channel_file_close(QIOChannel *ioc,
 {
 QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
 
+if (qemu_fdatasync(fioc->fd) < 0) {
+error_setg_errno(errp, errno,
+ "Unable to synchronize file data with storage device");
+return -1;
+}
 if (qemu_close(fioc->fd) < 0) {
 error_setg_errno(errp, errno,
  "Unable to close file");
-- 
2.35.3
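
Outside of QEMU, the same flush-then-close ordering can be sketched with plain POSIX calls. This is an illustration only: sync_and_close() is a made-up name, and it uses fsync() for portability where the patch calls qemu_fdatasync().

```c
#include <unistd.h>

/*
 * Hypothetical helper: flush file data to storage before closing.
 * Mirrors the patch in that a failed sync is reported and the
 * descriptor is left open for the caller to deal with.
 */
static int sync_and_close(int fd)
{
    if (fsync(fd) < 0) {
        return -1;
    }
    return close(fd);
}
```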




[RFC PATCH v3 02/30] io: Add generic pwritev/preadv interface

2023-11-27 Thread Fabiano Rosas
From: Nikolay Borisov 

Introduce basic pwritev/preadv support in the generic channel layer.
Specific implementation will follow for the file channel as this is
required in order to support migration streams with fixed location of
each ram page.
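
For context, the underlying system calls behave roughly as in the sketch below. It is not the QIOChannel code: checked_pwritev(), checked_preadv() and is_seekable() are hypothetical names, and the lseek() probe is only a stand-in for the QIO_CHANNEL_FEATURE_SEEKABLE check.

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* A descriptor is treated as seekable if lseek() works on it. */
static int is_seekable(int fd)
{
    return lseek(fd, 0, SEEK_CUR) != (off_t)-1;
}

/* Gather-write @niov buffers at @offset; fails on pipes and sockets. */
static ssize_t checked_pwritev(int fd, const struct iovec *iov, int niov,
                               off_t offset)
{
    if (!is_seekable(fd)) {
        return -1;
    }
    return pwritev(fd, iov, niov, offset);
}

/* Scatter-read into @niov buffers starting at @offset. */
static ssize_t checked_preadv(int fd, const struct iovec *iov, int niov,
                              off_t offset)
{
    if (!is_seekable(fd)) {
        return -1;
    }
    return preadv(fd, iov, niov, offset);
}
```

Neither call moves the file position, which is what allows several writers to target different offsets of the same file.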

Signed-off-by: Nikolay Borisov 
Signed-off-by: Fabiano Rosas 
---
- fixed naming: s/pwritev_full/pwritev
---
 include/io/channel.h | 82 
 io/channel.c | 58 +++
 2 files changed, 140 insertions(+)

diff --git a/include/io/channel.h b/include/io/channel.h
index fcb19fd672..7986c49c71 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -131,6 +131,16 @@ struct QIOChannelClass {
Error **errp);
 
 /* Optional callbacks */
+ssize_t (*io_pwritev)(QIOChannel *ioc,
+  const struct iovec *iov,
+  size_t niov,
+  off_t offset,
+  Error **errp);
+ssize_t (*io_preadv)(QIOChannel *ioc,
+ const struct iovec *iov,
+ size_t niov,
+ off_t offset,
+ Error **errp);
 int (*io_shutdown)(QIOChannel *ioc,
QIOChannelShutdown how,
Error **errp);
@@ -529,6 +539,78 @@ void qio_channel_set_follow_coroutine_ctx(QIOChannel *ioc, 
bool enabled);
 int qio_channel_close(QIOChannel *ioc,
   Error **errp);
 
+/**
+ * qio_channel_pwritev
+ * @ioc: the channel object
+ * @iov: the array of memory regions to write data from
+ * @niov: the length of the @iov array
+ * @offset: offset in the channel where writes should begin
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Not all implementations will support this facility, so may report
+ * an error. To avoid errors, the caller may check for the feature
+ * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
+ *
+ * Behaves as qio_channel_writev_full, apart from not supporting
+ * sending of file handles as well as beginning the write at the
+ * passed @offset
+ *
+ */
+ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
+size_t niov, off_t offset, Error **errp);
+
+/**
+ * qio_channel_pwrite
+ * @ioc: the channel object
+ * @buf: the memory region to write data from
+ * @buflen: the number of bytes in @buf
+ * @offset: offset in the channel where writes should begin
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Not all implementations will support this facility, so may report
+ * an error. To avoid errors, the caller may check for the feature
+ * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
+ *
+ */
+ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
+   off_t offset, Error **errp);
+
+/**
+ * qio_channel_preadv
+ * @ioc: the channel object
+ * @iov: the array of memory regions to read data into
+ * @niov: the length of the @iov array
+ * @offset: offset in the channel where reads should begin
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Not all implementations will support this facility, so may report
+ * an error.  To avoid errors, the caller may check for the feature
+ * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
+ *
+ * Behaves as qio_channel_readv_full, apart from not supporting
+ * receiving of file handles as well as beginning the read at the
+ * passed @offset
+ *
+ */
+ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
+   size_t niov, off_t offset, Error **errp);
+
+/**
+ * qio_channel_pread
+ * @ioc: the channel object
+ * @buf: the memory region to read data into
+ * @buflen: the number of bytes in @buf
+ * @offset: offset in the channel where reads should begin
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Not all implementations will support this facility, so may report
+ * an error.  To avoid errors, the caller may check for the feature
+ * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
+ *
+ */
+ssize_t qio_channel_pread(QIOChannel *ioc, char *buf, size_t buflen,
+  off_t offset, Error **errp);
+
 /**
  * qio_channel_shutdown:
  * @ioc: the channel object
diff --git a/io/channel.c b/io/channel.c
index 86c5834510..a1f12f8e90 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -454,6 +454,64 @@ GSource *qio_channel_add_watch_source(QIOChannel *ioc,
 }
 
 
+ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
+size_t niov, off_t offset, Error **errp)
+{
+QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+if (!klass->io_pwritev) {
+error_setg(errp, "Channel does not support pwritev");
+return -1;
+}
+
+if (!qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_SEEKABLE)) {
+

[RFC PATCH v3 10/30] tests/qtest: migration-test: Add tests for fixed-ram file-based migration

2023-11-27 Thread Fabiano Rosas
From: Nikolay Borisov 

Add basic tests for 'fixed-ram' migration.

Signed-off-by: Nikolay Borisov 
Signed-off-by: Fabiano Rosas 
---
 tests/qtest/migration-test.c | 39 
 1 file changed, 39 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 0fbaa6a90f..96a6217af0 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2135,6 +2135,14 @@ static void *test_mode_reboot_start(QTestState *from, 
QTestState *to)
 return NULL;
 }
 
+static void *migrate_fixed_ram_start(QTestState *from, QTestState *to)
+{
+migrate_set_capability(from, "fixed-ram", true);
+migrate_set_capability(to, "fixed-ram", true);
+
+return NULL;
+}
+
 static void test_mode_reboot(void)
 {
 g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
@@ -2149,6 +2157,32 @@ static void test_mode_reboot(void)
 test_file_common(&args, true);
 }
 
+static void test_precopy_file_fixed_ram_live(void)
+{
+g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
+   FILE_TEST_FILENAME);
+MigrateCommon args = {
+.connect_uri = uri,
+.listen_uri = "defer",
+.start_hook = migrate_fixed_ram_start,
+};
+
+test_file_common(&args, false);
+}
+
+static void test_precopy_file_fixed_ram(void)
+{
+g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
+   FILE_TEST_FILENAME);
+MigrateCommon args = {
+.connect_uri = uri,
+.listen_uri = "defer",
+.start_hook = migrate_fixed_ram_start,
+};
+
+test_file_common(&args, true);
+}
+
 static void test_precopy_tcp_plain(void)
 {
 MigrateCommon args = {
@@ -3392,6 +3426,11 @@ int main(int argc, char **argv)
 qtest_add_func("/migration/mode/reboot", test_mode_reboot);
 }
 
+qtest_add_func("/migration/precopy/file/fixed-ram",
+   test_precopy_file_fixed_ram);
+qtest_add_func("/migration/precopy/file/fixed-ram/live",
+   test_precopy_file_fixed_ram_live);
+
 #ifdef CONFIG_GNUTLS
 qtest_add_func("/migration/precopy/unix/tls/psk",
test_precopy_unix_tls_psk);
-- 
2.35.3




[RFC PATCH v3 05/30] migration/qemu-file: add utility methods for working with seekable channels

2023-11-27 Thread Fabiano Rosas
From: Nikolay Borisov 

Add utility methods that will be needed when implementing 'fixed-ram'
migration capability.

qemu_file_is_seekable
qemu_put_buffer_at
qemu_get_buffer_at
qemu_set_offset
qemu_get_offset

Signed-off-by: Nikolay Borisov 
Signed-off-by: Fabiano Rosas 
Reviewed-by: Daniel P. Berrangé 
---
 include/migration/qemu-file-types.h |  2 +
 migration/qemu-file.c   | 82 +
 migration/qemu-file.h   |  6 +++
 3 files changed, 90 insertions(+)

diff --git a/include/migration/qemu-file-types.h 
b/include/migration/qemu-file-types.h
index 9ba163f333..adec5abc07 100644
--- a/include/migration/qemu-file-types.h
+++ b/include/migration/qemu-file-types.h
@@ -50,6 +50,8 @@ unsigned int qemu_get_be16(QEMUFile *f);
 unsigned int qemu_get_be32(QEMUFile *f);
 uint64_t qemu_get_be64(QEMUFile *f);
 
+bool qemu_file_is_seekable(QEMUFile *f);
+
 static inline void qemu_put_be64s(QEMUFile *f, const uint64_t *pv)
 {
 qemu_put_be64(f, *pv);
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 94231ff295..faf6427b91 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -33,6 +33,7 @@
 #include "options.h"
 #include "qapi/error.h"
 #include "rdma.h"
+#include "io/channel-file.h"
 
 #define IO_BUF_SIZE 32768
 #define MAX_IOV_SIZE MIN_CONST(IOV_MAX, 64)
@@ -255,6 +256,10 @@ static void qemu_iovec_release_ram(QEMUFile *f)
 memset(f->may_free, 0, sizeof(f->may_free));
 }
 
+bool qemu_file_is_seekable(QEMUFile *f)
+{
+return qio_channel_has_feature(f->ioc, QIO_CHANNEL_FEATURE_SEEKABLE);
+}
 
 /**
  * Flushes QEMUFile buffer
@@ -447,6 +452,83 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, 
size_t size)
 }
 }
 
+void qemu_put_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
+off_t pos)
+{
+Error *err = NULL;
+
+if (f->last_error) {
+return;
+}
+
+qemu_fflush(f);
+qio_channel_pwrite(f->ioc, (char *)buf, buflen, pos, &err);
+
+if (err) {
+qemu_file_set_error_obj(f, -EIO, err);
+} else {
+stat64_add(&mig_stats.qemu_file_transferred, buflen);
+}
+
+return;
+}
+
+
+size_t qemu_get_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
+  off_t pos)
+{
+Error *err = NULL;
+ssize_t ret;
+
+if (f->last_error) {
+return 0;
+}
+
+ret = qio_channel_pread(f->ioc, (char *)buf, buflen, pos, &err);
+if (ret == -1 || err) {
+goto error;
+}
+
+return (size_t)ret;
+
+ error:
+qemu_file_set_error_obj(f, -EIO, err);
+return 0;
+}
+
+void qemu_set_offset(QEMUFile *f, off_t off, int whence)
+{
+Error *err = NULL;
+off_t ret;
+
+qemu_fflush(f);
+
+if (!qemu_file_is_writable(f)) {
+f->buf_index = 0;
+f->buf_size = 0;
+}
+
+ret = qio_channel_io_seek(f->ioc, off, whence, &err);
+if (ret == (off_t)-1) {
+qemu_file_set_error_obj(f, -EIO, err);
+}
+}
+
+off_t qemu_get_offset(QEMUFile *f)
+{
+Error *err = NULL;
+off_t ret;
+
+qemu_fflush(f);
+
+ret = qio_channel_io_seek(f->ioc, 0, SEEK_CUR, &err);
+if (ret == (off_t)-1) {
+qemu_file_set_error_obj(f, -EIO, err);
+}
+return ret;
+}
+
+
 void qemu_put_byte(QEMUFile *f, int v)
 {
 if (f->last_error) {
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 8aec9fabf7..32fd4a34fd 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -75,6 +75,12 @@ QEMUFile *qemu_file_get_return_path(QEMUFile *f);
 int qemu_fflush(QEMUFile *f);
 void qemu_file_set_blocking(QEMUFile *f, bool block);
 int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size);
+void qemu_set_offset(QEMUFile *f, off_t off, int whence);
+off_t qemu_get_offset(QEMUFile *f);
+void qemu_put_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
+off_t pos);
+size_t qemu_get_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
+  off_t pos);
 
 QIOChannel *qemu_file_get_ioc(QEMUFile *file);
 
-- 
2.35.3




[RFC PATCH v3 19/30] migration/ram: Ignore multifd flush when doing fixed-ram migration

2023-11-27 Thread Fabiano Rosas
Some functionalities of multifd are incompatible with the 'fixed-ram'
migration format.

The MULTIFD_FLUSH flag in particular is not used because in fixed-ram
there is no synchronicity between migration source and destination, so
there is no need for a sync packet. In fact, fixed-ram disables
packets in multifd as a whole.

However, we still need to sync the migration thread with the multifd
channels at key moments:

- between iterations, to avoid a slow channel being overrun by a fast
channel in the subsequent iteration;

- at ram_save_complete, to make sure all data has been transferred
  before finishing migration;

Make sure RAM_SAVE_FLAG_MULTIFD_FLUSH is only emitted for fixed-ram at
those key moments.
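
The conditions touched by the hunks below can be summarised in a small
standalone sketch. This is not QEMU code: struct mig_caps and both
helper names are hypothetical stand-ins for the migrate_*() capability
queries, and the predicates only mirror the boolean logic of the diff.

```c
#include <stdbool.h>

/* Hypothetical stand-in for QEMU's migrate_*() capability queries. */
struct mig_caps {
    bool multifd;
    bool fixed_ram;
    bool flush_after_each_section;
};

/*
 * Should RAM_SAVE_FLAG_MULTIFD_FLUSH be written into the stream?
 * With fixed-ram there are no multifd packets, so the flag is skipped.
 */
static bool emit_flush_flag(const struct mig_caps *c)
{
    return c->multifd && !c->fixed_ram && !c->flush_after_each_section;
}

/*
 * Should the migration thread wait for the multifd channels at the
 * end of an iteration? fixed-ram always syncs here, flag or no flag.
 */
static bool sync_after_iteration(const struct mig_caps *c)
{
    return c->multifd && (c->flush_after_each_section || c->fixed_ram);
}
```

With multifd and fixed-ram both enabled, emit_flush_flag() is false
while sync_after_iteration() is true, which is the behaviour the
commit message asks for.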

Signed-off-by: Fabiano Rosas 
---
 migration/ram.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 08604222f2..ad6abd1761 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1363,7 +1363,7 @@ static int find_dirty_block(RAMState *rs, 
PageSearchStatus *pss)
 pss->page = 0;
 pss->block = QLIST_NEXT_RCU(pss->block, next);
 if (!pss->block) {
-if (migrate_multifd() &&
+if (migrate_multifd() && !migrate_fixed_ram() &&
 !migrate_multifd_flush_after_each_section()) {
 QEMUFile *f = rs->pss[RAM_CHANNEL_PRECOPY].pss_channel;
 int ret = multifd_send_sync_main(f);
@@ -3112,7 +3112,8 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 return ret;
 }
 
-if (migrate_multifd() && !migrate_multifd_flush_after_each_section()) {
+if (migrate_multifd() && !migrate_multifd_flush_after_each_section()
+&& !migrate_fixed_ram()) {
 qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
 }
 
@@ -3242,8 +3243,11 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 out:
 if (ret >= 0
 && migration_is_setup_or_active(migrate_get_current()->state)) {
-if (migrate_multifd() && migrate_multifd_flush_after_each_section()) {
-ret = multifd_send_sync_main(rs->pss[RAM_CHANNEL_PRECOPY].pss_channel);
+if (migrate_multifd() &&
+(migrate_multifd_flush_after_each_section() ||
+ migrate_fixed_ram())) {
+ret = multifd_send_sync_main(
+rs->pss[RAM_CHANNEL_PRECOPY].pss_channel);
 if (ret < 0) {
 return ret;
 }
-- 
2.35.3




[RFC PATCH v3 30/30] tests/qtest: Add a test for fixed-ram with passing of fds

2023-11-27 Thread Fabiano Rosas
Add a multifd test for fixed-ram with passing of fds into QEMU. This
is how libvirt will consume the feature.

There are a couple of details to the fdset mechanism:

- multifd needs two distinct file descriptors (not duplicated with
  dup()) on the outgoing side so it can enable O_DIRECT only on the
  channels that write with alignment. The dup() system call creates
  file descriptors that share status flags, of which O_DIRECT is one.

  The incoming side doesn't set O_DIRECT, so it can dup() fds and
  therefore needs to receive only one fd in the fdset.

- the open() access mode flags used for the fds passed into QEMU need
  to match the flags QEMU uses to open the file. Currently O_WRONLY
  for src and O_RDONLY for dst.

O_DIRECT is not supported on all systems/filesystems, so run the fdset
test without O_DIRECT if that's the case. The migration code should
still work in that scenario.
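
The dup() behaviour described above can be observed outside of QEMU.
The sketch below is an illustration only: it uses O_APPEND instead of
O_DIRECT because O_APPEND can be toggled on any filesystem, and the
helper name and the /tmp path template are assumptions made for the
example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* mkstemp() declaration on glibc */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Set O_APPEND on one descriptor and report whether a second
 * descriptor for the same file observes it. The second descriptor
 * comes from dup() when use_dup is non-zero, otherwise from an
 * independent open(). Returns 1 if visible, 0 if not, -1 on error.
 */
static int second_fd_sees_flag(int use_dup)
{
    char path[] = "/tmp/fdflags-XXXXXX";
    int fd, fd2, fl, seen = -1;

    fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    fd2 = use_dup ? dup(fd) : open(path, O_WRONLY);
    unlink(path);
    if (fd2 < 0) {
        close(fd);
        return -1;
    }

    /* Change the status flags through the first descriptor only. */
    fl = fcntl(fd, F_GETFL);
    if (fl >= 0 && fcntl(fd, F_SETFL, fl | O_APPEND) == 0) {
        fl = fcntl(fd2, F_GETFL);
        if (fl >= 0) {
            seen = (fl & O_APPEND) != 0;
        }
    }

    close(fd2);
    close(fd);
    return seen;
}
```

A duplicated descriptor sees the flag and a separately opened one does
not, which is why the outgoing side needs two distinct fds in the
fdset.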

Signed-off-by: Fabiano Rosas 
---
 tests/qtest/migration-helpers.c |  7 ++-
 tests/qtest/migration-test.c| 87 +
 2 files changed, 92 insertions(+), 2 deletions(-)

diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
index 02b92f0cb6..3013094800 100644
--- a/tests/qtest/migration-helpers.c
+++ b/tests/qtest/migration-helpers.c
@@ -302,7 +302,7 @@ char *resolve_machine_version(const char *alias, const char 
*var1,
 bool probe_o_direct_support(const char *tmpfs)
 {
 g_autofree char *filename = g_strdup_printf("%s/probe-o-direct", tmpfs);
-int fd, flags = O_CREAT | O_RDWR | O_DIRECT;
+int fd, flags = O_CREAT | O_RDWR | O_TRUNC | O_DIRECT;
 void *buf;
 ssize_t ret, len;
 uint64_t offset;
@@ -320,9 +320,12 @@ bool probe_o_direct_support(const char *tmpfs)
 len = 0x10;
 offset = 0x10;
 
-buf = g_malloc0(len);
+buf = aligned_alloc(len, len);
+g_assert(buf);
+
 ret = pwrite(fd, buf, len, offset);
 unlink(filename);
+g_free(buf);
 
 if (ret < 0) {
 return false;
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 192b8ec993..bb2dd805fc 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2251,8 +2251,90 @@ static void test_multifd_file_fixed_ram_dio(void)
 
 test_file_common(&args, true);
 }
+
+static void migrate_multifd_fixed_ram_fdset_dio_end(QTestState *from,
+QTestState *to,
+void *opaque)
+{
+QDict *resp;
+QList *fdsets;
+
+/*
+ * Check that we removed the fdsets after migration, otherwise a
+ * second migration would fail due to too many fdsets.
+ */
+
+resp = qtest_qmp(from, "{'execute': 'query-fdsets', "
+ "'arguments': {}}");
+g_assert(qdict_haskey(resp, "return"));
+fdsets = qdict_get_qlist(resp, "return");
+g_assert(fdsets && qlist_empty(fdsets));
+}
+#endif /* O_DIRECT */
+
+#ifndef _WIN32
+static void *migrate_multifd_fixed_ram_fdset(QTestState *from, QTestState *to)
+{
+g_autofree char *file = g_strdup_printf("%s/%s", tmpfs, FILE_TEST_FILENAME);
+int fds[3];
+int src_flags = O_CREAT | O_WRONLY;
+int dst_flags = O_CREAT | O_RDONLY;
+
+/* main outgoing channel: no O_DIRECT */
+fds[0] = open(file, src_flags, 0660);
+assert(fds[0] != -1);
+
+#ifdef O_DIRECT
+src_flags |= O_DIRECT;
 #endif
 
+/* secondary outgoing channels */
+fds[1] = open(file, src_flags, 0660);
+assert(fds[1] != -1);
+
+qtest_qmp_fds_assert_success(from, &fds[0], 1, "{'execute': 'add-fd', "
+ "'arguments': {'fdset-id': 1}}");
+
+qtest_qmp_fds_assert_success(from, &fds[1], 1, "{'execute': 'add-fd', "
+ "'arguments': {'fdset-id': 1}}");
+
+/* incoming channel */
+fds[2] = open(file, dst_flags, 0660);
+assert(fds[2] != -1);
+
+qtest_qmp_fds_assert_success(to, &fds[2], 1, "{'execute': 'add-fd', "
+ "'arguments': {'fdset-id': 1}}");
+
+#ifdef O_DIRECT
+migrate_multifd_fixed_ram_dio_start(from, to);
+#else
+migrate_multifd_fixed_ram_start(from, to);
+#endif
+
+return NULL;
+}
+
+static void test_multifd_file_fixed_ram_fdset(void)
+{
+g_autofree char *uri = g_strdup_printf("file:/dev/fdset/1,offset=0x100");
+MigrateCommon args = {
+.connect_uri = uri,
+.listen_uri = "defer",
+.start_hook = migrate_multifd_fixed_ram_fdset,
+#ifdef O_DIRECT
+.finish_hook = migrate_multifd_fixed_ram_fdset_dio_end,
+#endif
+};
+
+if (!probe_o_direct_support(tmpfs)) {
+g_test_skip("Filesystem does not support O_DIRECT");
+return;
+}
+
+test_file_common(&args, true);
+}
+#endif /* _WIN32 */
+
 static void test_precopy_tcp_plain(void)
 {
 MigrateCommon args = {
@@ -3511,6 +3593,11 @@ int main(int argc, char **argv)
test_multifd_file_fixed_ram_dio);
 #endif
 
+#ifndef _WIN32
+

[RFC PATCH v3 27/30] monitor: fdset: Match against O_DIRECT

2023-11-27 Thread Fabiano Rosas
We're about to enable the use of O_DIRECT in the migration code and
due to the alignment restrictions imposed by filesystems we need to
make sure the flag is only used when doing aligned IO.

The migration will do parallel IO to different regions of a file, so
we need to use more than one file descriptor. Those cannot be obtained
by duplicating (dup()) since duplicated file descriptors share the
file status flags, including O_DIRECT. If one migration channel does
unaligned IO while another sets O_DIRECT to do aligned IO, the
filesystem would fail the unaligned operation.

The add-fd QMP command along with the fdset code are specifically
designed to allow the user to pass a set of file descriptors with
different access flags into QEMU to be later fetched by code that
needs to alternate between those flags when doing IO.

Extend the fdset matching function to behave the same with the
O_DIRECT flag.
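
As a rough standalone sketch of the matching rule (not the code in
monitor/fds.c): two sets of open flags match when the access mode is
equal and, where the platform has O_DIRECT, when both sides agree on
it. The names fd_flags_match and MATCH_DIRECT are made up for this
example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes O_DIRECT on Linux */
#endif
#include <fcntl.h>
#include <stdbool.h>

#ifdef O_DIRECT
#define MATCH_DIRECT O_DIRECT
#else
#define MATCH_DIRECT 0 /* no O_DIRECT here: nothing extra to compare */
#endif

/* Does an fd opened with fd_flags satisfy a request for wanted? */
static bool fd_flags_match(int wanted, int fd_flags)
{
    if ((wanted & O_ACCMODE) != (fd_flags & O_ACCMODE)) {
        return false;
    }
    /* Both sides must agree on O_DIRECT; other flags are ignored. */
    return (wanted & MATCH_DIRECT) == (fd_flags & MATCH_DIRECT);
}
```

A request for O_WRONLY | O_DIRECT is then only satisfied by the fd
that was opened with O_DIRECT, and a plain O_WRONLY request only by
the fd that was opened without it.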

Signed-off-by: Fabiano Rosas 
---
 monitor/fds.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/monitor/fds.c b/monitor/fds.c
index 9a28e4b72b..42bf3eb982 100644
--- a/monitor/fds.c
+++ b/monitor/fds.c
@@ -413,6 +413,12 @@ static bool monitor_fdset_flags_match(int flags, int 
fd_flags)
 
 if ((flags & O_ACCMODE) == (fd_flags & O_ACCMODE)) {
 match = true;
+
+#ifdef O_DIRECT
+if ((flags & O_DIRECT) != (fd_flags & O_DIRECT)) {
+match = false;
+}
+#endif
 }
 
 return match;
-- 
2.35.3




[RFC PATCH v3 13/30] migration/multifd: Add outgoing QIOChannelFile support

2023-11-27 Thread Fabiano Rosas
Allow multifd to open file-backed channels. This will be used when
enabling the fixed-ram migration stream format which expects a
seekable transport.

The QIOChannel read and write methods will use the preadv/pwritev
versions which don't update the file offset at each call so we can
reuse the fd without re-opening for every channel.

Note that this is just setup code and multifd cannot yet make use of
the file channels.
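
To illustrate why positional writes make it safe to share one file
between channels, here is a small standalone sketch (plain POSIX, not
QIOChannel code; the helper name and the /tmp path template are
assumptions). It writes at two unrelated offsets and then checks where
the descriptor's own file offset ended up.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* pwritev() and mkstemp() declarations on glibc */
#endif
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Write two buffers at unrelated offsets through one descriptor and
 * return the descriptor's file offset afterwards (-1 on error).
 */
static off_t offset_after_positional_writes(void)
{
    char path[] = "/tmp/pwritev-XXXXXX";
    char a[] = "channel-0", b[] = "channel-1";
    struct iovec iov_a = { .iov_base = a, .iov_len = strlen(a) };
    struct iovec iov_b = { .iov_base = b, .iov_len = strlen(b) };
    off_t pos = -1;
    int fd = mkstemp(path);

    if (fd < 0) {
        return -1;
    }
    unlink(path);

    /* Neither call moves the file offset kept by the descriptor. */
    if (pwritev(fd, &iov_a, 1, 4096) == (ssize_t)iov_a.iov_len &&
        pwritev(fd, &iov_b, 1, 0) == (ssize_t)iov_b.iov_len) {
        pos = lseek(fd, 0, SEEK_CUR);
    }

    close(fd);
    return pos;
}
```

The offset is still 0 after both writes, so writers that always pass
an explicit position do not disturb each other through the shared
offset.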

Signed-off-by: Fabiano Rosas 
---
- open multifd channels with O_WRONLY and no mode
- stop cancelling migration and propagate error via qio_task
---
 migration/file.c  | 47 +--
 migration/file.h  |  5 +
 migration/multifd.c   | 14 +++--
 migration/options.c   |  7 +++
 migration/options.h   |  1 +
 migration/qemu-file.h |  1 -
 6 files changed, 70 insertions(+), 5 deletions(-)

diff --git a/migration/file.c b/migration/file.c
index 5d4975f43e..67d6f42da7 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -17,6 +17,10 @@
 
 #define OFFSET_OPTION ",offset="
 
+static struct FileOutgoingArgs {
+char *fname;
+} outgoing_args;
+
 /* Remove the offset option from @filespec and return it in @offsetp. */
 
 int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp)
@@ -36,6 +40,42 @@ int file_parse_offset(char *filespec, uint64_t *offsetp, 
Error **errp)
 return 0;
 }
 
+static void qio_channel_file_connect_worker(QIOTask *task, gpointer opaque)
+{
+/* noop */
+}
+
+int file_send_channel_destroy(QIOChannel *ioc)
+{
+if (ioc) {
+qio_channel_close(ioc, NULL);
+object_unref(OBJECT(ioc));
+}
+g_free(outgoing_args.fname);
+outgoing_args.fname = NULL;
+
+return 0;
+}
+
+void file_send_channel_create(QIOTaskFunc f, void *data)
+{
+QIOChannelFile *ioc;
+QIOTask *task;
+Error *err = NULL;
+int flags = O_WRONLY;
+
+ioc = qio_channel_file_new_path(outgoing_args.fname, flags, 0, &err);
+
+task = qio_task_new(OBJECT(ioc), f, (gpointer)data, NULL);
+if (!ioc) {
+qio_task_set_error(task, err);
+return;
+}
+
+qio_task_run_in_thread(task, qio_channel_file_connect_worker,
+   (gpointer)data, NULL, NULL);
+}
+
 void file_start_outgoing_migration(MigrationState *s,
FileMigrationArgs *file_args, Error **errp)
 {
@@ -43,15 +83,18 @@ void file_start_outgoing_migration(MigrationState *s,
 g_autofree char *filename = g_strdup(file_args->filename);
 uint64_t offset = file_args->offset;
 QIOChannel *ioc;
+int flags = O_CREAT | O_TRUNC | O_WRONLY;
+mode_t mode = 0660;
 
 trace_migration_file_outgoing(filename);
 
-fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY | O_TRUNC,
- 0600, errp);
+fioc = qio_channel_file_new_path(filename, flags, mode, errp);
 if (!fioc) {
 return;
 }
 
+outgoing_args.fname = g_strdup(filename);
+
 ioc = QIO_CHANNEL(fioc);
 if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
 return;
diff --git a/migration/file.h b/migration/file.h
index 37d6a08bfc..511019b319 100644
--- a/migration/file.h
+++ b/migration/file.h
@@ -9,10 +9,15 @@
 #define QEMU_MIGRATION_FILE_H
 
 #include "qapi/qapi-types-migration.h"
+#include "io/task.h"
+#include "channel.h"
 
 void file_start_incoming_migration(FileMigrationArgs *file_args, Error **errp);
 
 void file_start_outgoing_migration(MigrationState *s,
FileMigrationArgs *file_args, Error **errp);
 int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp);
+
+void file_send_channel_create(QIOTaskFunc f, void *data);
+int file_send_channel_destroy(QIOChannel *ioc);
 #endif
diff --git a/migration/multifd.c b/migration/multifd.c
index 123ff0dec0..427740aab6 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -17,6 +17,7 @@
 #include "exec/ramblock.h"
 #include "qemu/error-report.h"
 #include "qapi/error.h"
+#include "file.h"
 #include "ram.h"
 #include "migration.h"
 #include "migration-stats.h"
@@ -28,6 +29,7 @@
 #include "threadinfo.h"
 #include "options.h"
 #include "qemu/yank.h"
+#include "io/channel-file.h"
 #include "io/channel-socket.h"
 #include "yank_functions.h"
 
@@ -511,7 +513,11 @@ static void multifd_send_terminate_threads(Error *err)
 
 static int multifd_send_channel_destroy(QIOChannel *send)
 {
-return socket_send_channel_destroy(send);
+if (migrate_to_file()) {
+return file_send_channel_destroy(send);
+} else {
+return socket_send_channel_destroy(send);
+}
 }
 
 void multifd_save_cleanup(void)
@@ -904,7 +910,11 @@ static void multifd_new_send_channel_async(QIOTask *task, 
gpointer opaque)
 
 static void multifd_new_send_channel_create(gpointer opaque)
 {
-socket_send_channel_create(multifd_new_send_channel_async, opaque);
+if (migrate_to_file()) {
+

[RFC PATCH v3 28/30] docs/devel/migration.rst: Document the file transport

2023-11-27 Thread Fabiano Rosas
When adding the support for file migration with the file: transport,
we missed adding documentation for it.

Signed-off-by: Fabiano Rosas 
---
 docs/devel/migration.rst | 4 
 1 file changed, 4 insertions(+)

diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
index eeb4fec31f..1488e5b2f9 100644
--- a/docs/devel/migration.rst
+++ b/docs/devel/migration.rst
@@ -41,6 +41,10 @@ over any transport.
 - exec migration: do the migration using the stdin/stdout through a process.
 - fd migration: do the migration using a file descriptor that is
   passed to QEMU.  QEMU doesn't care how this file descriptor is opened.
+- file migration: do the migration using a file that is passed to QEMU
+  by path. A file offset option is supported to allow a management
+  application to add its own metadata to the start of the file without
+  QEMU interference.
 
 In addition, support is included for migration using RDMA, which
 transports the page data using ``RDMA``, where the hardware takes care of
-- 
2.35.3




[RFC PATCH v3 16/30] multifd: Rename MultiFDSendParams::data to compress_data

2023-11-27 Thread Fabiano Rosas
Use a more specific name for the compression data so we can use the
generic name 'data' for the multifd core code.

Signed-off-by: Fabiano Rosas 
---
 migration/multifd-zlib.c | 20 ++--
 migration/multifd-zstd.c | 20 ++--
 migration/multifd.h  |  4 ++--
 3 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index 37ce48621e..fd94e79dd9 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -69,7 +69,7 @@ static int zlib_send_setup(MultiFDSendParams *p, Error **errp)
 err_msg = "out of memory for buf";
 goto err_free_zbuff;
 }
-p->data = z;
+p->compress_data = z;
 return 0;
 
 err_free_zbuff:
@@ -92,15 +92,15 @@ err_free_z:
  */
 static void zlib_send_cleanup(MultiFDSendParams *p, Error **errp)
 {
-struct zlib_data *z = p->data;
+struct zlib_data *z = p->compress_data;
 
 deflateEnd(&z->zs);
 g_free(z->zbuff);
 z->zbuff = NULL;
 g_free(z->buf);
 z->buf = NULL;
-g_free(p->data);
-p->data = NULL;
+g_free(p->compress_data);
+p->compress_data = NULL;
 }
 
 /**
@@ -116,7 +116,7 @@ static void zlib_send_cleanup(MultiFDSendParams *p, Error 
**errp)
  */
 static int zlib_send_prepare(MultiFDSendParams *p, Error **errp)
 {
-struct zlib_data *z = p->data;
+struct zlib_data *z = p->compress_data;
 z_stream *zs = &z->zs;
 uint32_t out_size = 0;
 int ret;
@@ -189,7 +189,7 @@ static int zlib_recv_setup(MultiFDRecvParams *p, Error 
**errp)
 struct zlib_data *z = g_new0(struct zlib_data, 1);
 z_stream *zs = &z->zs;
 
-p->data = z;
+p->compress_data = z;
 zs->zalloc = Z_NULL;
 zs->zfree = Z_NULL;
 zs->opaque = Z_NULL;
@@ -219,13 +219,13 @@ static int zlib_recv_setup(MultiFDRecvParams *p, Error 
**errp)
  */
 static void zlib_recv_cleanup(MultiFDRecvParams *p)
 {
-struct zlib_data *z = p->data;
+struct zlib_data *z = p->compress_data;
 
 inflateEnd(&z->zs);
 g_free(z->zbuff);
 z->zbuff = NULL;
-g_free(p->data);
-p->data = NULL;
+g_free(p->compress_data);
+p->compress_data = NULL;
 }
 
 /**
@@ -241,7 +241,7 @@ static void zlib_recv_cleanup(MultiFDRecvParams *p)
  */
 static int zlib_recv_pages(MultiFDRecvParams *p, Error **errp)
 {
-struct zlib_data *z = p->data;
+struct zlib_data *z = p->compress_data;
 z_stream *zs = &z->zs;
 uint32_t in_size = p->next_packet_size;
 /* we measure the change of total_out */
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index b471daadcd..238eebbf4b 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -52,7 +52,7 @@ static int zstd_send_setup(MultiFDSendParams *p, Error **errp)
 struct zstd_data *z = g_new0(struct zstd_data, 1);
 int res;
 
-p->data = z;
+p->compress_data = z;
 z->zcs = ZSTD_createCStream();
 if (!z->zcs) {
 g_free(z);
@@ -90,14 +90,14 @@ static int zstd_send_setup(MultiFDSendParams *p, Error 
**errp)
  */
 static void zstd_send_cleanup(MultiFDSendParams *p, Error **errp)
 {
-struct zstd_data *z = p->data;
+struct zstd_data *z = p->compress_data;
 
 ZSTD_freeCStream(z->zcs);
 z->zcs = NULL;
 g_free(z->zbuff);
 z->zbuff = NULL;
-g_free(p->data);
-p->data = NULL;
+g_free(p->compress_data);
+p->compress_data = NULL;
 }
 
 /**
@@ -113,7 +113,7 @@ static void zstd_send_cleanup(MultiFDSendParams *p, Error 
**errp)
  */
 static int zstd_send_prepare(MultiFDSendParams *p, Error **errp)
 {
-struct zstd_data *z = p->data;
+struct zstd_data *z = p->compress_data;
 int ret;
 uint32_t i;
 
@@ -178,7 +178,7 @@ static int zstd_recv_setup(MultiFDRecvParams *p, Error 
**errp)
 struct zstd_data *z = g_new0(struct zstd_data, 1);
 int ret;
 
-p->data = z;
+p->compress_data = z;
 z->zds = ZSTD_createDStream();
 if (!z->zds) {
 g_free(z);
@@ -216,14 +216,14 @@ static int zstd_recv_setup(MultiFDRecvParams *p, Error 
**errp)
  */
 static void zstd_recv_cleanup(MultiFDRecvParams *p)
 {
-struct zstd_data *z = p->data;
+struct zstd_data *z = p->compress_data;
 
 ZSTD_freeDStream(z->zds);
 z->zds = NULL;
 g_free(z->zbuff);
 z->zbuff = NULL;
-g_free(p->data);
-p->data = NULL;
+g_free(p->compress_data);
+p->compress_data = NULL;
 }
 
 /**
@@ -243,7 +243,7 @@ static int zstd_recv_pages(MultiFDRecvParams *p, Error 
**errp)
 uint32_t out_size = 0;
 uint32_t expected_size = p->normal_num * p->page_size;
 uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
-struct zstd_data *z = p->data;
+struct zstd_data *z = p->compress_data;
 int ret;
 int i;
 
diff --git a/migration/multifd.h b/migration/multifd.h
index a112ec7ac6..744b52762f 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -132,7 +132,7 @@ typedef struct {
 /* num of non zero pages */
 uint32_t normal_num;
 /* used for compression methods 

[RFC PATCH v3 03/30] io: implement io_pwritev/preadv for QIOChannelFile

2023-11-27 Thread Fabiano Rosas
From: Nikolay Borisov 

The upcoming 'fixed-ram' feature will require qemu to write data to
(and restore from) specific offsets of the migration file.

Add a minimal implementation of pwritev/preadv and expose them via the
io_pwritev and io_preadv interfaces.
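
For readers unfamiliar with the system calls, the following standalone
sketch shows the pattern in isolation: a positional read that is
retried on EINTR, plus a round trip that writes at an offset and reads
the same bytes back. It is not the QIOChannelFile implementation;
preadv_retry, roundtrip_at and the /tmp path template are names chosen
for the example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* preadv()/pwritev() declarations on glibc */
#endif
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* preadv() that restarts when interrupted by a signal. */
static ssize_t preadv_retry(int fd, const struct iovec *iov, int niov,
                            off_t offset)
{
    ssize_t ret;

    do {
        ret = preadv(fd, iov, niov, offset);
    } while (ret < 0 && errno == EINTR);

    return ret;
}

/* Write a marker at @offset and read it back. 0 on success, else -1. */
static int roundtrip_at(off_t offset)
{
    char path[] = "/tmp/preadv-XXXXXX";
    char out[] = "fixed-ram";
    char in[sizeof(out)] = { 0 };
    struct iovec wr = { .iov_base = out, .iov_len = sizeof(out) };
    struct iovec rd = { .iov_base = in, .iov_len = sizeof(in) };
    int ok = -1;
    int fd = mkstemp(path);

    if (fd < 0) {
        return -1;
    }
    unlink(path);

    if (pwritev(fd, &wr, 1, offset) == (ssize_t)sizeof(out) &&
        preadv_retry(fd, &rd, 1, offset) == (ssize_t)sizeof(in) &&
        memcmp(in, out, sizeof(out)) == 0) {
        ok = 0;
    }

    close(fd);
    return ok;
}
```

The round trip works for any offset, including one far past the
current end of the file, which is the property fixed-ram relies on.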

Signed-off-by: Nikolay Borisov 
Signed-off-by: Fabiano Rosas 
---
- check CONFIG_PREADV to avoid breaking Windows
---
 io/channel-file.c | 56 +++
 1 file changed, 56 insertions(+)

diff --git a/io/channel-file.c b/io/channel-file.c
index f91bf6db1c..a6ad7770c6 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -146,6 +146,58 @@ static ssize_t qio_channel_file_writev(QIOChannel *ioc,
 return ret;
 }
 
+#ifdef CONFIG_PREADV
+static ssize_t qio_channel_file_preadv(QIOChannel *ioc,
+   const struct iovec *iov,
+   size_t niov,
+   off_t offset,
+   Error **errp)
+{
+QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
+ssize_t ret;
+
+ retry:
+ret = preadv(fioc->fd, iov, niov, offset);
+if (ret < 0) {
+if (errno == EAGAIN) {
+return QIO_CHANNEL_ERR_BLOCK;
+}
+if (errno == EINTR) {
+goto retry;
+}
+
+error_setg_errno(errp, errno, "Unable to read from file");
+return -1;
+}
+
+return ret;
+}
+
+static ssize_t qio_channel_file_pwritev(QIOChannel *ioc,
+const struct iovec *iov,
+size_t niov,
+off_t offset,
+Error **errp)
+{
+QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
+ssize_t ret;
+
+ retry:
+ret = pwritev(fioc->fd, iov, niov, offset);
+if (ret <= 0) {
+if (errno == EAGAIN) {
+return QIO_CHANNEL_ERR_BLOCK;
+}
+if (errno == EINTR) {
+goto retry;
+}
+error_setg_errno(errp, errno, "Unable to write to file");
+return -1;
+}
+return ret;
+}
+#endif /* CONFIG_PREADV */
+
 static int qio_channel_file_set_blocking(QIOChannel *ioc,
  bool enabled,
  Error **errp)
@@ -231,6 +283,10 @@ static void qio_channel_file_class_init(ObjectClass *klass,
 ioc_klass->io_writev = qio_channel_file_writev;
 ioc_klass->io_readv = qio_channel_file_readv;
 ioc_klass->io_set_blocking = qio_channel_file_set_blocking;
+#ifdef CONFIG_PREADV
+ioc_klass->io_pwritev = qio_channel_file_pwritev;
+ioc_klass->io_preadv = qio_channel_file_preadv;
+#endif
 ioc_klass->io_seek = qio_channel_file_seek;
 ioc_klass->io_close = qio_channel_file_close;
 ioc_klass->io_create_watch = qio_channel_file_create_watch;
-- 
2.35.3




[RFC PATCH v3 17/30] migration/multifd: Decouple recv method from pages

2023-11-27 Thread Fabiano Rosas
The next patch will abstract the type of data being received by the
channels, so do some cleanup now to remove references to pages and
dependency on 'normal_num'.

Signed-off-by: Fabiano Rosas 
---
 migration/multifd-zlib.c |  2 +-
 migration/multifd-zstd.c |  2 +-
 migration/multifd.c  | 12 +++-
 migration/multifd.h  |  5 ++---
 4 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index fd94e79dd9..e019d2d74e 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -314,7 +314,7 @@ static MultiFDMethods multifd_zlib_ops = {
 .send_prepare = zlib_send_prepare,
 .recv_setup = zlib_recv_setup,
 .recv_cleanup = zlib_recv_cleanup,
-.recv_pages = zlib_recv_pages
+.recv_data = zlib_recv_pages
 };
 
 static void multifd_zlib_register(void)
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index 238eebbf4b..0b8414df5b 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -305,7 +305,7 @@ static MultiFDMethods multifd_zstd_ops = {
 .send_prepare = zstd_send_prepare,
 .recv_setup = zstd_recv_setup,
 .recv_cleanup = zstd_recv_cleanup,
-.recv_pages = zstd_recv_pages
+.recv_data = zstd_recv_pages
 };
 
 static void multifd_zstd_register(void)
diff --git a/migration/multifd.c b/migration/multifd.c
index 3476fac49f..c1381bdc21 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -130,7 +130,7 @@ static void nocomp_recv_cleanup(MultiFDRecvParams *p)
 }
 
 /**
- * nocomp_recv_pages: read the data from the channel into actual pages
+ * nocomp_recv_data: read the data from the channel
  *
  * For no compression we just need to read things into the correct place.
  *
@@ -139,7 +139,7 @@ static void nocomp_recv_cleanup(MultiFDRecvParams *p)
  * @p: Params for the channel that we are using
  * @errp: pointer to an error
  */
-static int nocomp_recv_pages(MultiFDRecvParams *p, Error **errp)
+static int nocomp_recv_data(MultiFDRecvParams *p, Error **errp)
 {
 uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
 
@@ -161,7 +161,7 @@ static MultiFDMethods multifd_nocomp_ops = {
 .send_prepare = nocomp_send_prepare,
 .recv_setup = nocomp_recv_setup,
 .recv_cleanup = nocomp_recv_cleanup,
-.recv_pages = nocomp_recv_pages
+.recv_data = nocomp_recv_data
 };
 
 static MultiFDMethods *multifd_ops[MULTIFD_COMPRESSION__MAX] = {
@@ -1126,6 +1126,7 @@ static void *multifd_recv_thread(void *opaque)
 
 while (true) {
 uint32_t flags = 0;
+bool has_data = false;
 p->normal_num = 0;
 
 if (p->quit) {
@@ -1154,12 +1155,13 @@ static void *multifd_recv_thread(void *opaque)
p->next_packet_size);
 
 p->total_normal_pages += p->normal_num;
+has_data = !!p->normal_num;
 }
 
 qemu_mutex_unlock(&p->mutex);
 
-if (p->normal_num) {
-ret = multifd_recv_state->ops->recv_pages(p, &local_err);
+if (has_data) {
+ret = multifd_recv_state->ops->recv_data(p, &local_err);
 if (ret != 0) {
 break;
 }
diff --git a/migration/multifd.h b/migration/multifd.h
index 744b52762f..406d42dbae 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -203,11 +203,10 @@ typedef struct {
 int (*recv_setup)(MultiFDRecvParams *p, Error **errp);
 /* Cleanup for receiving side */
 void (*recv_cleanup)(MultiFDRecvParams *p);
-/* Read all pages */
-int (*recv_pages)(MultiFDRecvParams *p, Error **errp);
+/* Read all data */
+int (*recv_data)(MultiFDRecvParams *p, Error **errp);
 } MultiFDMethods;
 
 void multifd_register_ops(int method, MultiFDMethods *ops);
 
 #endif
-
-- 
2.35.3




[RFC PATCH v3 07/30] migration: Add fixed-ram URI compatibility check

2023-11-27 Thread Fabiano Rosas
The fixed-ram migration format needs a channel that supports seeking
to be able to write each page to an arbitrary offset in the migration
stream.
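
The check added by this patch is made on the transport type. The
underlying OS-level property can be probed as in the sketch below,
which is only an illustration and not what QEMU does: fd_is_seekable
and the two probe helpers are hypothetical names, and the /tmp path
template is an assumption.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* mkstemp() declaration on glibc */
#endif
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* A descriptor is seekable if querying its offset succeeds. */
static bool fd_is_seekable(int fd)
{
    return lseek(fd, 0, SEEK_CUR) != (off_t)-1;
}

/* Pipes (like sockets) fail lseek() with ESPIPE. */
static bool probe_pipe(void)
{
    int p[2];
    bool seekable;

    if (pipe(p) < 0) {
        return false;
    }
    seekable = fd_is_seekable(p[0]);
    close(p[0]);
    close(p[1]);
    return seekable;
}

/* Regular files can be positioned freely. */
static bool probe_regular_file(void)
{
    char path[] = "/tmp/seekable-XXXXXX";
    bool seekable;
    int fd = mkstemp(path);

    if (fd < 0) {
        return false;
    }
    unlink(path);
    seekable = fd_is_seekable(fd);
    close(fd);
    return seekable;
}
```

Stream-like transports such as pipes and sockets fail the probe, which
is why only the file transport is accepted for fixed-ram.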

Signed-off-by: Fabiano Rosas 
Reviewed-by: Daniel P. Berrangé 
---
- avoided overwriting errp in compatibility check
---
 migration/migration.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 28a34c9068..897ed1db67 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -135,10 +135,26 @@ static bool 
transport_supports_multi_channels(SocketAddress *saddr)
saddr->type == SOCKET_ADDRESS_TYPE_VSOCK;
 }
 
+static bool migration_needs_seekable_channel(void)
+{
+return migrate_fixed_ram();
+}
+
+static bool transport_supports_seeking(MigrationAddress *addr)
+{
+return addr->transport == MIGRATION_ADDRESS_TYPE_FILE;
+}
+
 static bool
 migration_channels_and_transport_compatible(MigrationAddress *addr,
 Error **errp)
 {
+if (migration_needs_seekable_channel() &&
+!transport_supports_seeking(addr)) {
+error_setg(errp, "Migration requires seekable transport (e.g. file)");
+return false;
+}
+
 if (migration_needs_multiple_sockets() &&
 (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) &&
 !transport_supports_multi_channels(&addr->u.socket)) {
-- 
2.35.3




[RFC PATCH v3 14/30] migration/multifd: Add incoming QIOChannelFile support

2023-11-27 Thread Fabiano Rosas
On the receiving side we don't need to differentiate between main
channel and threads, so whichever channel is defined first gets to be
the main one. And since there are no packets, use the atomic channel
count to index into the params array.
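
The idea of using a channel counter as the array index can be sketched
with C11 atomics. This is an illustration only, not the multifd code:
the types, MAX_CHANNELS and claim_slot() are hypothetical, and unlike
the patch (which reads the count with qatomic_read() and increments it
elsewhere) the sketch folds read and increment into one fetch-and-add.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_CHANNELS 4 /* hypothetical channel limit */

struct recv_params {
    int id;
    int in_use;
};

struct recv_state {
    atomic_int count;
    struct recv_params params[MAX_CHANNELS];
};

/* Zero-initialised state used by the usage example. */
static struct recv_state demo_state;

/*
 * Hand out the next free params slot. There is no packet carrying a
 * channel id, so arrival order decides the index. Returns NULL once
 * all slots are taken.
 */
static struct recv_params *claim_slot(struct recv_state *s)
{
    int id = atomic_fetch_add(&s->count, 1);

    if (id >= MAX_CHANNELS) {
        return NULL;
    }
    s->params[id].id = id;
    s->params[id].in_use = 1;
    return &s->params[id];
}
```

Successive calls to claim_slot(&demo_state) return slots 0, 1, 2 and
3, then NULL.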

Signed-off-by: Fabiano Rosas 
---
- stop setting offset in secondary channels
- check for packets when peeking
---
 migration/file.c  | 36 
 migration/migration.c |  3 ++-
 migration/multifd.c   |  3 +--
 migration/multifd.h   |  1 +
 4 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/migration/file.c b/migration/file.c
index 67d6f42da7..62ba994109 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -7,12 +7,14 @@
 
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
+#include "qemu/error-report.h"
 #include "qapi/error.h"
 #include "channel.h"
 #include "file.h"
 #include "migration.h"
 #include "io/channel-file.h"
 #include "io/channel-util.h"
+#include "options.h"
 #include "trace.h"
 
 #define OFFSET_OPTION ",offset="
@@ -117,22 +119,40 @@ void file_start_incoming_migration(FileMigrationArgs 
*file_args, Error **errp)
 g_autofree char *filename = g_strdup(file_args->filename);
 QIOChannelFile *fioc = NULL;
 uint64_t offset = file_args->offset;
-QIOChannel *ioc;
+int channels = 1;
+int i = 0, fd;
 
 trace_migration_file_incoming(filename);
 
 fioc = qio_channel_file_new_path(filename, O_RDONLY, 0, errp);
 if (!fioc) {
+goto out;
+}
+
+if (offset &&
+qio_channel_io_seek(QIO_CHANNEL(fioc), offset, SEEK_SET, errp) < 0) {
 return;
 }
 
-ioc = QIO_CHANNEL(fioc);
-if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
+if (migrate_multifd()) {
+channels += migrate_multifd_channels();
+}
+
+fd = fioc->fd;
+
+do {
+QIOChannel *ioc = QIO_CHANNEL(fioc);
+
+qio_channel_set_name(ioc, "migration-file-incoming");
+qio_channel_add_watch_full(ioc, G_IO_IN,
+   file_accept_incoming_migration,
+   NULL, NULL,
+   g_main_context_get_thread_default());
+} while (++i < channels && (fioc = qio_channel_file_new_fd(fd)));
+
+out:
+if (!fioc) {
+error_setg(errp, "Error creating migration incoming channel");
 return;
 }
-qio_channel_set_name(QIO_CHANNEL(ioc), "migration-file-incoming");
-qio_channel_add_watch_full(ioc, G_IO_IN,
-   file_accept_incoming_migration,
-   NULL, NULL,
-   g_main_context_get_thread_default());
 }
diff --git a/migration/migration.c b/migration/migration.c
index 897ed1db67..16689171ab 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -838,7 +838,8 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error 
**errp)
 uint32_t channel_magic = 0;
 int ret = 0;
 
-if (migrate_multifd() && !migrate_postcopy_ram() &&
+if (migrate_multifd() && migrate_multifd_packets() &&
+!migrate_postcopy_ram() &&
 qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_READ_MSG_PEEK)) {
 /*
  * With multiple channels, it is possible that we receive channels
diff --git a/migration/multifd.c b/migration/multifd.c
index 427740aab6..3476fac49f 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1283,8 +1283,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error 
**errp)
 /* initial packet */
 num_packets = 1;
 } else {
-/* next patch gives this a meaningful value */
-id = 0;
+id = qatomic_read(&multifd_recv_state->count);
 }
 
 p = &multifd_recv_state->params[id];
diff --git a/migration/multifd.h b/migration/multifd.h
index a835643b48..a112ec7ac6 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -18,6 +18,7 @@ void multifd_save_cleanup(void);
 int multifd_load_setup(Error **errp);
 void multifd_load_cleanup(void);
 void multifd_load_shutdown(void);
+bool multifd_recv_first_channel(void);
 bool multifd_recv_all_channels_created(void);
 void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
 void multifd_recv_sync_main(void);
-- 
2.35.3




[RFC PATCH v3 08/30] migration/ram: Add outgoing 'fixed-ram' migration

2023-11-27 Thread Fabiano Rosas
From: Nikolay Borisov 

Implement the outgoing migration side for the 'fixed-ram' capability.

A bitmap is introduced to track which pages have been written in the
migration file. Pages are written at a fixed location for every
ramblock. Zero pages are not written, as they will read back as zero on
the destination as well.

The migration stream is altered to put the dirty pages for a ramblock
after its header instead of having a sequential stream of pages that
follow the ramblock headers. Since all pages have a fixed location,
RAM_SAVE_FLAG_EOS is no longer generated on every migration iteration.

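Without fixed-ram (current):        With fixed-ram (new):

 ---------------------               --------------------------------
 | ramblock 1 header |               | ramblock 1 header            |
 ---------------------               --------------------------------
 | ramblock 2 header |               | ramblock 1 fixed-ram header  |
 ---------------------               --------------------------------
 | ...               |               | padding to next 1MB boundary |
 ---------------------               | ...                          |
 | ramblock n header |               --------------------------------
 ---------------------               | ramblock 1 pages             |
 | RAM_SAVE_FLAG_EOS |               | ...                          |
 ---------------------               --------------------------------
 | stream of pages   |               | ramblock 2 header            |
 | (iter 1)          |               --------------------------------
 | ...               |               | ramblock 2 fixed-ram header  |
 ---------------------               --------------------------------
 | RAM_SAVE_FLAG_EOS |               | padding to next 1MB boundary |
 ---------------------               | ...                          |
 | stream of pages   |               --------------------------------
 | (iter 2)          |               | ramblock 2 pages             |
 | ...               |               | ...                          |
 ---------------------               --------------------------------
 | ...               |               | ...                          |
 ---------------------               --------------------------------
                                     | RAM_SAVE_FLAG_EOS            |
                                     --------------------------------
                                     | ...                          |
                                     --------------------------------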

where:
 - ramblock header: the generic information for a ramblock, such as
   idstr, used_len, etc.

 - ramblock fixed-ram header: the new information added by this
   feature: bitmap of pages written, bitmap size and offset of pages
   in the migration file.
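
The layout described above can be sketched as simple offset arithmetic. This is only an illustration under stated assumptions, not the code from the patch: the helper names (align_up, pages_region_start, page_file_offset) are hypothetical, and the 1 MiB alignment value is taken from the description of the padding.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 1 MiB, matching the "padding to next 1MB boundary" box. */
#define DEMO_FILE_OFFSET_ALIGNMENT 0x100000ULL

/* Round 'offset' up to the next multiple of 'align' (a power of two). */
static inline uint64_t align_up(uint64_t offset, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper: file offset where the pages of a ramblock start,
 * given the offset just past its header and fixed-ram header.
 */
static inline uint64_t pages_region_start(uint64_t end_of_headers)
{
    return align_up(end_of_headers, DEMO_FILE_OFFSET_ALIGNMENT);
}

/* Every page has a fixed slot: region start plus index times page size. */
static inline uint64_t page_file_offset(uint64_t pages_offset,
                                        uint64_t page_index,
                                        uint64_t page_size)
{
    return pages_offset + page_index * page_size;
}
```

Because the slot of a page depends only on its index, a page dirtied in a later iteration overwrites its earlier copy in place, which is why the per-iteration stream of pages is no longer needed.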

Signed-off-by: Nikolay Borisov 
Signed-off-by: Fabiano Rosas 
---
- used a macro for alignment value
- documented alignment assumptions
- moved shadow_bmap debug code to multifd patch
- did NOT use used_length for bmap, it breaks dirty page tracking somehow
- uncommented the capability enabling
- accounted for the bitmap size with ram_transferred_add()
---
 include/exec/ramblock.h |   8 +++
 migration/ram.c | 121 +---
 2 files changed, 120 insertions(+), 9 deletions(-)

diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 69c6a53902..e0e3f16852 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -44,6 +44,14 @@ struct RAMBlock {
 size_t page_size;
 /* dirty bitmap used during migration */
 unsigned long *bmap;
+/* shadow dirty bitmap used when migrating to a file */
+unsigned long *shadow_bmap;
+/*
+ * offset in the file pages belonging to this ramblock are saved,
+ * used only during migration to a file.
+ */
+off_t bitmap_offset;
+uint64_t pages_offset;
 /* bitmap of already received pages in postcopy */
 unsigned long *receivedmap;
 
diff --git a/migration/ram.c b/migration/ram.c
index 8c7886ab79..4a0ab8105f 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -94,6 +94,18 @@
 #define RAM_SAVE_FLAG_MULTIFD_FLUSH0x200
 /* We can't use any flag that is bigger than 0x200 */
 
+/*
+ * fixed-ram migration supports O_DIRECT, so we need to make sure the
+ * userspace buffer, the IO operation size and the file offset are
+ * aligned according to the underlying device's block size. The first
+ * two are already aligned to page size, but we need to add padding to
+ * the file to align the offset.  We cannot read the block size
+ * dynamically because the migration file can be moved between
+ * different systems, so use 1M to cover most block sizes and to keep
+ * the file offset aligned at page size as well.
+ */
+#define FIXED_RAM_FILE_OFFSET_ALIGNMENT 0x100000
+
 XBZRLECacheStats xbzrle_counters;
 
 /* used by the search for pages to send */
@@ -1127,12 +1139,18 @@ static int save_zero_page(RAMState *rs, 
PageSearchStatus 

[RFC PATCH v3 01/30] io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file

2023-11-27 Thread Fabiano Rosas
From: Nikolay Borisov 

Add a generic QIOChannel feature SEEKABLE which would be used by the
qemu_file* apis. For the time being this will be only implemented for
file channels.
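
A minimal sketch of the kind of probe a SEEKABLE feature can rest on, assuming a POSIX system: a no-op lseek() succeeds on regular files and fails on pipes and sockets. The function name fd_is_seekable is hypothetical and is not part of the QIOChannel API.

```c
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns true when the kernel accepts a no-op seek on 'fd'.
 * The file position is left unchanged because the offset is 0
 * relative to the current position.
 */
static inline bool fd_is_seekable(int fd)
{
    return lseek(fd, 0, SEEK_CUR) != (off_t)-1;
}
```

On a pipe the call fails (POSIX specifies ESPIPE), so a caller can decide once, at channel creation time, whether positioned I/O is available.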

Signed-off-by: Nikolay Borisov 
Signed-off-by: Fabiano Rosas 
Reviewed-by: Daniel P. Berrangé 
---
 include/io/channel.h | 1 +
 io/channel-file.c| 8 
 2 files changed, 9 insertions(+)

diff --git a/include/io/channel.h b/include/io/channel.h
index 5f9dbaab65..fcb19fd672 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -44,6 +44,7 @@ enum QIOChannelFeature {
 QIO_CHANNEL_FEATURE_LISTEN,
 QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY,
 QIO_CHANNEL_FEATURE_READ_MSG_PEEK,
+QIO_CHANNEL_FEATURE_SEEKABLE,
 };
 
 
diff --git a/io/channel-file.c b/io/channel-file.c
index 4a12c61886..f91bf6db1c 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -36,6 +36,10 @@ qio_channel_file_new_fd(int fd)
 
 ioc->fd = fd;
 
+if (lseek(fd, 0, SEEK_CUR) != (off_t)-1) {
+qio_channel_set_feature(QIO_CHANNEL(ioc), QIO_CHANNEL_FEATURE_SEEKABLE);
+}
+
 trace_qio_channel_file_new_fd(ioc, fd);
 
 return ioc;
@@ -60,6 +64,10 @@ qio_channel_file_new_path(const char *path,
 return NULL;
 }
 
+if (lseek(ioc->fd, 0, SEEK_CUR) != (off_t)-1) {
+qio_channel_set_feature(QIO_CHANNEL(ioc), QIO_CHANNEL_FEATURE_SEEKABLE);
+}
+
 trace_qio_channel_file_new_path(ioc, path, flags, mode, ioc->fd);
 
 return ioc;
-- 
2.35.3




Re: [PATCH v1 5/7] tests/qtest/migration: Print migration incoming errors

2023-11-27 Thread Peter Xu
On Mon, Nov 27, 2023 at 12:52:38PM -0300, Fabiano Rosas wrote:
> >> @@ -118,6 +118,12 @@ void migrate_incoming_qmp(QTestState *to, const char *uri, const char *fmt, ...)
> >>  
> >>  rsp = qtest_qmp(to, "{ 'execute': 'migrate-incoming', 'arguments': %p}",
> >>  args);
> >> +
> >> +if (!qdict_haskey(rsp, "return")) {
> >> +g_autoptr(GString) s = qobject_to_json_pretty(QOBJECT(rsp), true);
> >> +g_test_message("%s", s->str);
> >> +}
> >
> > This traps the "migrate-incoming" command only (which, afaiu, only setup
> > the listening), would this capture the incoming error?
> 
> This is about the migrate-incoming only. We could replace "incoming
> migration" with "qmp_migrate_incoming" in the commit message to clarify.

Ah.. Did you ever see this failure in any of your runs in these tests?  I
think it means you hit the assertion right below this part, but I'm just
curious how, as the URIs in the test cases are pretty constant.

-- 
Peter Xu




Re: [RFC PATCH v2 17/19] heki: x86: Update permissions counters during text patching

2023-11-27 Thread Peter Zijlstra
On Mon, Nov 27, 2023 at 10:48:29AM -0600, Madhavan T. Venkataraman wrote:
> Apologies for the late reply. I was on vacation. Please see my response below:
> 
> On 11/13/23 02:19, Peter Zijlstra wrote:
> > On Sun, Nov 12, 2023 at 09:23:24PM -0500, Mickaël Salaün wrote:
> >> From: Madhavan T. Venkataraman 
> >>
> >> X86 uses a function called __text_poke() to modify executable code. This
> >> patching function is used by many features such as KProbes and FTrace.
> >>
> >> Update the permissions counters for the text page so that write
> >> permissions can be temporarily established in the EPT to modify the
> >> instructions in that page.
> >>
> >> Cc: Borislav Petkov 
> >> Cc: Dave Hansen 
> >> Cc: H. Peter Anvin 
> >> Cc: Ingo Molnar 
> >> Cc: Kees Cook 
> >> Cc: Madhavan T. Venkataraman 
> >> Cc: Mickaël Salaün 
> >> Cc: Paolo Bonzini 
> >> Cc: Sean Christopherson 
> >> Cc: Thomas Gleixner 
> >> Cc: Vitaly Kuznetsov 
> >> Cc: Wanpeng Li 
> >> Signed-off-by: Madhavan T. Venkataraman 
> >> ---
> >>
> >> Changes since v1:
> >> * New patch
> >> ---
> >>  arch/x86/kernel/alternative.c |  5 
> >>  arch/x86/mm/heki.c| 49 +++
> >>  include/linux/heki.h  | 14 ++
> >>  3 files changed, 68 insertions(+)
> >>
> >> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> >> index 517ee01503be..64fd8757ba5c 100644
> >> --- a/arch/x86/kernel/alternative.c
> >> +++ b/arch/x86/kernel/alternative.c
> >> @@ -18,6 +18,7 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include 
> >>  #include 
> >>  #include 
> >>  #include 
> >> @@ -1801,6 +1802,7 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
> >> */
> >>pgprot = __pgprot(pgprot_val(PAGE_KERNEL) & ~_PAGE_GLOBAL);
> >>  
> >> +  heki_text_poke_start(pages, cross_page_boundary ? 2 : 1, pgprot);
> >>/*
> >> * The lock is not really needed, but this allows to avoid open-coding.
> >> */
> >> @@ -1865,7 +1867,10 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
> >>}
> >>  
> >>local_irq_restore(flags);
> >> +
> >>pte_unmap_unlock(ptep, ptl);
> >> +  heki_text_poke_end(pages, cross_page_boundary ? 2 : 1, pgprot);
> >> +
> >>return addr;
> >>  }
> > 
> > This makes no sense, we already use a custom CR3 with userspace alias
> > for the actual pages to write to, why are you then frobbing permissions
> > on that *again* ?
> 
> Today, the permissions for a guest page in the extended page table
> (EPT) are RWX (unless permissions are restricted for some specific
> reason like for shadow page table pages). In this Heki feature, we
> don't allow RWX by default in the EPT. We only allow those permissions
> in the EPT that the guest page actually needs.  E.g., for a text page,
> it is R_X in both the guest page table and the EPT.

To what end? If you always mirror what the guest does, you've not
actually gained anything.

> For text patching, the above code establishes an alternate mapping in
> the guest page table that is RW_ so that the text can be patched. That
> needs to be reflected in the EPT so that the EPT permissions will
> change from R_X to RWX. In other words, RWX is allowed only as
> necessary. At the end of patching, the EPT permissions are restored to
> R_X.
> 
> Does that address your comment?

No, if you want to mirror the native PTEs why don't you hook into the
paravirt page-table muck and get all that for free?

Also, this is the user range, are you saying you're also playing these
daft games with user maps?



Re: [RFC PATCH v2 18/19] heki: x86: Protect guest kernel memory using the KVM hypervisor

2023-11-27 Thread Peter Zijlstra
On Mon, Nov 27, 2023 at 11:05:23AM -0600, Madhavan T. Venkataraman wrote:
> Apologies for the late reply. I was on vacation. Please see my response below:
> 
> On 11/13/23 02:54, Peter Zijlstra wrote:
> > On Sun, Nov 12, 2023 at 09:23:25PM -0500, Mickaël Salaün wrote:
> >> From: Madhavan T. Venkataraman 
> >>
> >> Implement a hypervisor function, kvm_protect_memory() that calls the
> >> KVM_HC_PROTECT_MEMORY hypercall to request the KVM hypervisor to
> >> set specified permissions on a list of guest pages.
> >>
> >> Using the protect_memory() function, set proper EPT permissions for all
> >> guest pages.
> >>
> >> Use the MEM_ATTR_IMMUTABLE property to protect the kernel static
> >> sections and the boot-time read-only sections. This ensures that a
> >> compromised guest cannot change its main physical memory page
> >> permissions. However, this also disables any feature that may change
> >> the kernel's text section (e.g., ftrace, Kprobes), but they can still be
> >> used on kernel modules.
> >>
> >> Module loading/unloading, and eBPF JIT is allowed without restrictions
> >> for now, but we'll need a way to authenticate these code changes to
> >> really improve the guests' security. We plan to use module signatures,
> >> but there is no solution yet to authenticate eBPF programs.
> >>
> >> Being able to use ftrace and Kprobes in a secure way is a challenge not
> >> solved yet. We're looking for ideas to make this work.
> >>
> >> Likewise, the JUMP_LABEL feature cannot work because the kernel's text
> >> section is read-only.
> > 
> > What is the actual problem? As is the kernel text map is already RO and
> > never changed.
> 
> For the JUMP_LABEL optimization, the text needs to be patched at some point.
> That patching requires a writable mapping of the text page at the time of
> patching.
> 
> In this Heki feature, we currently lock down the kernel text at the end of
> kernel boot just before kicking off the init process. The lockdown is
> implemented by setting the permissions of a text page to R_X in the extended
> page table and not allowing write permissions in the EPT after that. So, jump
> label patching during kernel boot is not a problem. But doing it after kernel
> boot is a problem.

But you see, that's exactly what the kernel already does with the normal
permissions. They get set to RX after init and are never changed.

See the previous patch, we establish a read-write alias and write there.

You seem to lack basic understanding of how the kernel works in this
regard, which makes me very nervous about you touching any of this.

I must also say I really dislike your extra/random permission calls all
over the place. They don't really get us anything afaict. Why can't you
plumb into the existing set_memory_*() family?
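
For readers unfamiliar with the read-write alias idea mentioned above, here is a user-space analogy only, assuming a POSIX system. It is not how __text_poke() is implemented: the same backing page is mapped twice, once read-only and once writable, and the write goes through the second mapping. The function name demo_rw_alias is hypothetical.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one page of a temporary file twice: 'ro' stands in for the
 * protected view, 'rw' for the temporary writable alias.
 * Returns 0 when data written through 'rw' is visible through 'ro'.
 */
static inline int demo_rw_alias(void)
{
    static const char patch[] = "patched";
    long page = sysconf(_SC_PAGESIZE);
    FILE *f = tmpfile();
    char *ro = MAP_FAILED;
    char *rw = MAP_FAILED;
    int rc = -1;

    if (!f) {
        return -1;
    }
    if (page <= 0 || ftruncate(fileno(f), page) != 0) {
        goto out;
    }

    ro = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fileno(f), 0);
    rw = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED,
              fileno(f), 0);
    if (ro == MAP_FAILED || rw == MAP_FAILED) {
        goto out;
    }

    /* The read-only view never becomes writable; only the alias is used. */
    memcpy(rw, patch, sizeof(patch));
    rc = memcmp(ro, patch, sizeof(patch)) == 0 ? 0 : -1;

out:
    if (ro != MAP_FAILED) {
        munmap(ro, (size_t)page);
    }
    if (rw != MAP_FAILED) {
        munmap(rw, (size_t)page);
    }
    fclose(f);
    return rc;
}
```

The point of the analogy is that the permissions of the primary mapping do not need to change at any time for the content to be updated.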



[PATCH] accel/kvm: Turn DPRINTF macro use into tracepoints

2023-11-27 Thread Jai Arora
Remove the DPRINTF macros and use tracepoints
for logging instead.
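
As a rough sketch of the difference (not QEMU's generated trace code), the snippet below contrasts a compile-time DPRINTF with an event that is switched at run time and carries a typed argument. All demo_* names and DEBUG_DEMO are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* Compile-time switch: the message vanishes unless DEBUG_DEMO is defined. */
#ifdef DEBUG_DEMO
#define DPRINTF(fmt, ...) \
    do { fprintf(stderr, fmt, ## __VA_ARGS__); } while (0)
#else
#define DPRINTF(fmt, ...) \
    do { } while (0)
#endif

/* Run-time switch standing in for the enabled state of one trace event. */
static bool demo_trace_enabled;
static unsigned demo_trace_hits;

/* A dedicated event with a typed argument instead of a free-form string. */
static inline void demo_trace_vcpu_destroy(int cpu_index)
{
    if (!demo_trace_enabled) {
        return;
    }
    demo_trace_hits++;
    fprintf(stderr, "demo_trace_vcpu_destroy cpu_index=%d\n", cpu_index);
}

/* Old style call site: needs a rebuild with -DDEBUG_DEMO to print. */
static inline void demo_legacy_call_site(void)
{
    DPRINTF("kvm_destroy_vcpu\n");
}
```

The practical gain is that the event can be toggled without rebuilding, and a per-site event with arguments is easier to filter than a single catch-all string event.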

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1827

Signed-off-by: Jai Arora 
---
 accel/kvm/kvm-all.c| 32 ++--
 accel/kvm/trace-events |  2 +-
 2 files changed, 11 insertions(+), 23 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index e39a810a4e..d0dd7e54c3 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -69,16 +69,6 @@
 #define KVM_GUESTDBG_BLOCKIRQ 0
 #endif
 
-//#define DEBUG_KVM
-
-#ifdef DEBUG_KVM
-#define DPRINTF(fmt, ...) \
-do { fprintf(stderr, fmt, ## __VA_ARGS__); } while (0)
-#else
-#define DPRINTF(fmt, ...) \
-do { } while (0)
-#endif
-
 struct KVMParkedVcpu {
 unsigned long vcpu_id;
 int kvm_fd;
@@ -331,7 +321,7 @@ static int do_kvm_destroy_vcpu(CPUState *cpu)
 struct KVMParkedVcpu *vcpu = NULL;
 int ret = 0;
 
-DPRINTF("kvm_destroy_vcpu\n");
+trace_kvm_dprintf("kvm_destroy_vcpu\n");
 
 ret = kvm_arch_destroy_vcpu(cpu);
 if (ret < 0) {
@@ -341,7 +331,7 @@ static int do_kvm_destroy_vcpu(CPUState *cpu)
 mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);
 if (mmap_size < 0) {
 ret = mmap_size;
-DPRINTF("KVM_GET_VCPU_MMAP_SIZE failed\n");
+trace_kvm_dprintf("KVM_GET_VCPU_MMAP_SIZE failed\n");
 goto err;
 }
 
@@ -443,7 +433,6 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
PAGE_SIZE * KVM_DIRTY_LOG_PAGE_OFFSET);
 if (cpu->kvm_dirty_gfns == MAP_FAILED) {
 ret = -errno;
-DPRINTF("mmap'ing vcpu dirty gfns failed: %d\n", ret);
 goto err;
 }
 }
@@ -2821,7 +2810,7 @@ int kvm_cpu_exec(CPUState *cpu)
 struct kvm_run *run = cpu->kvm_run;
 int ret, run_ret;
 
-DPRINTF("kvm_cpu_exec()\n");
+trace_kvm_dprintf("kvm_cpu_exec()\n");
 
 if (kvm_arch_process_async_events(cpu)) {
 qatomic_set(&cpu->exit_request, 0);
@@ -2848,7 +2837,7 @@ int kvm_cpu_exec(CPUState *cpu)
 
 kvm_arch_pre_run(cpu, run);
 if (qatomic_read(&cpu->exit_request)) {
-DPRINTF("interrupt exit requested\n");
+   trace_kvm_dprintf("interrupt exit requested\n");
 /*
  * KVM requires us to reenter the kernel after IO exits to complete
  * instruction emulation. This self-signal will ensure that we
@@ -2878,7 +2867,7 @@ int kvm_cpu_exec(CPUState *cpu)
 
 if (run_ret < 0) {
 if (run_ret == -EINTR || run_ret == -EAGAIN) {
-DPRINTF("io window exit\n");
+   trace_kvm_dprintf("io window exit\n");
 kvm_eat_signals(cpu);
 ret = EXCP_INTERRUPT;
 break;
@@ -2900,7 +2889,7 @@ int kvm_cpu_exec(CPUState *cpu)
 trace_kvm_run_exit(cpu->cpu_index, run->exit_reason);
 switch (run->exit_reason) {
 case KVM_EXIT_IO:
-DPRINTF("handle_io\n");
+trace_kvm_dprintf("handle_io\n");
 /* Called outside BQL */
 kvm_handle_io(run->io.port, attrs,
   (uint8_t *)run + run->io.data_offset,
@@ -2910,7 +2899,7 @@ int kvm_cpu_exec(CPUState *cpu)
 ret = 0;
 break;
 case KVM_EXIT_MMIO:
-DPRINTF("handle_mmio\n");
+trace_kvm_dprintf("handle_mmio\n");
 /* Called outside BQL */
 address_space_rw(&address_space_memory,
  run->mmio.phys_addr, attrs,
@@ -2920,11 +2909,11 @@ int kvm_cpu_exec(CPUState *cpu)
 ret = 0;
 break;
 case KVM_EXIT_IRQ_WINDOW_OPEN:
-DPRINTF("irq_window_open\n");
+trace_kvm_dprintf("irq_window_open\n");
 ret = EXCP_INTERRUPT;
 break;
 case KVM_EXIT_SHUTDOWN:
-DPRINTF("shutdown\n");
+trace_kvm_dprintf("shutdown\n");
 qemu_system_reset_request(SHUTDOWN_CAUSE_GUEST_RESET);
 ret = EXCP_INTERRUPT;
 break;
@@ -2976,13 +2965,12 @@ int kvm_cpu_exec(CPUState *cpu)
 ret = 0;
 break;
 default:
-DPRINTF("kvm_arch_handle_exit\n");
+trace_kvm_dprintf("kvm_arch_handle_exit\n");
 ret = kvm_arch_handle_exit(cpu, run);
 break;
 }
 break;
 default:
-DPRINTF("kvm_arch_handle_exit\n");
 ret = kvm_arch_handle_exit(cpu, run);
 break;
 }
diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
index 399aaeb0ec..1754909efe 100644
--- a/accel/kvm/trace-events
+++ b/accel/kvm/trace-events
@@ -25,4 +25,4 @@ kvm_dirty_ring_reaper(const char *s) "%s"
 kvm_dirty_ring_reap(uint64_t count, int64_t t) "reaped %"PRIu64" pages (took %"PRIi64" us)"
 kvm_dirty_ring_reaper_kick(const char *reason) "%s"
 kvm_dirty_ring_flush(int finished) "%d"
-
+kvm_dprintf(const char *s) "from KVM: %s"
-- 

Re: [PATCH v2] avr: Fix wrong initial value of stack pointer

2023-11-27 Thread Philippe Mathieu-Daudé

Hi Gihun,

On 27/11/23 03:54, Gihun Nam wrote:

The current implementation initializes the stack pointer of AVR devices
to 0. Although older AVR devices used to be like that, newer ones set
it to RAMEND.
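
To make the RAMEND computation concrete, here is a small sketch. The helper name avr_ramend is hypothetical; the example sizes in the usage note come from the ATmega328P and ATmega2560 datasheets, not from this patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * RAMEND is the last valid data-space address: the register/I/O area
 * comes first, SRAM follows, so the last byte is at io + sram - 1.
 */
static inline uint16_t avr_ramend(uint32_t io_size, uint32_t sram_size)
{
    uint32_t end = io_size + sram_size - 1;

    /* The AVR stack pointer is 16 bits wide, so the result must fit. */
    assert(io_size + sram_size > 0 && end <= UINT16_MAX);
    return (uint16_t)end;
}
```

For example, 0x100 bytes of registers and I/O plus 2 KiB of SRAM (ATmega328P) gives 0x08FF, and 0x200 plus 8 KiB (ATmega2560) gives 0x21FF.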

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1525
Signed-off-by: Gihun Nam 
---
Edit code to use QOM property and add more description to commit message
about the changes

Thanks for the detailed help, Mr. Peter!

P.S. I don't understand how replies work with git send-email, so
  if I've done something wrong, please bear with me.

  hw/avr/atmega.c  |  4 
  target/avr/cpu.c | 10 +-
  target/avr/cpu.h |  3 +++
  3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/hw/avr/atmega.c b/hw/avr/atmega.c
index a34803e642..31c8992d75 100644
--- a/hw/avr/atmega.c
+++ b/hw/avr/atmega.c
@@ -233,6 +233,10 @@ static void atmega_realize(DeviceState *dev, Error **errp)
  
  /* CPU */

  object_initialize_child(OBJECT(dev), "cpu", &s->cpu, mc->cpu_type);
+
+object_property_set_uint(OBJECT(&s->cpu), "init-sp",
+ mc->io_size + mc->sram_size - 1, &error_abort);


Since the CPU implements the QDev interface, you can use:

   qdev_prop_set_uint32(DEVICE(&s->cpu), "init-sp",
mc->io_size + mc->sram_size - 1);


  qdev_realize(DEVICE(&s->cpu), NULL, &error_abort);
  cpudev = DEVICE(&s->cpu);
  
diff --git a/target/avr/cpu.c b/target/avr/cpu.c

index 44de1e18d1..999c010ded 100644
--- a/target/avr/cpu.c
+++ b/target/avr/cpu.c
@@ -25,6 +25,7 @@
  #include "cpu.h"
  #include "disas/dis-asm.h"
  #include "tcg/debug-assert.h"
+#include "hw/qdev-properties.h"
  
  static void avr_cpu_set_pc(CPUState *cs, vaddr value)

  {
@@ -95,7 +96,7 @@ static void avr_cpu_reset_hold(Object *obj)
  env->rampY = 0;
  env->rampZ = 0;
  env->eind = 0;
-env->sp = 0;
+env->sp = cpu->init_sp;
  
  env->skip = 0;
  
@@ -152,6 +153,11 @@ static void avr_cpu_initfn(Object *obj)

sizeof(cpu->env.intsrc) * 8);
  }
  
+static Property avr_cpu_properties[] = {

+DEFINE_PROP_UINT32("init-sp", AVRCPU, init_sp, 0),
+DEFINE_PROP_END_OF_LIST()
+};
+
  static ObjectClass *avr_cpu_class_by_name(const char *cpu_model)
  {
  ObjectClass *oc;
@@ -228,6 +234,8 @@ static void avr_cpu_class_init(ObjectClass *oc, void *data)
  
  device_class_set_parent_realize(dc, avr_cpu_realizefn, &mcc->parent_realize);
  
+device_class_set_props(dc, avr_cpu_properties);

+
  resettable_class_set_parent_phases(rc, NULL, avr_cpu_reset_hold, NULL,
 &mcc->parent_phases);
  
diff --git a/target/avr/cpu.h b/target/avr/cpu.h

index 8a17862737..7960c5c57a 100644
--- a/target/avr/cpu.h
+++ b/target/avr/cpu.h
@@ -145,6 +145,9 @@ struct ArchCPU {
  CPUState parent_obj;
  
  CPUAVRState env;

+
+/* Initial value of stack pointer */
+uint32_t init_sp;


Hmm the stack is 16-bit wide. I suppose AVRCPU::sp is 32-bit
wide because tcg_global_mem_new_i32() forces us to (the smaller
TCG register is 16-bit).

Preferably using uint16_t/DEFINE_PROP_UINT16/qdev_prop_set_uint16:

Reviewed-by: Philippe Mathieu-Daudé 


  };
  
  /**





Re: [PATCH] tests: bios-tables-test: Rename smbios type 4 related test functions

2023-11-27 Thread Philippe Mathieu-Daudé

On 27/11/23 17:02, Zhao Liu wrote:

From: Zhao Liu 

In fact, type4-count, core-count, core-count2, thread-count and
thread-count2 are tested with KVM not TCG.

Rename these test functions to reflect KVM base instead of TCG.

Signed-off-by: Zhao Liu 
---
  tests/qtest/bios-tables-test.c | 20 ++--
  1 file changed, 10 insertions(+), 10 deletions(-)


Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH 4/4] dma-helpers: don't lock AioContext in dma_blk_cb()

2023-11-27 Thread Eric Blake
On Thu, Nov 23, 2023 at 02:49:31PM -0500, Stefan Hajnoczi wrote:
> Commit abfcd2760b3e ("dma-helpers: prevent dma_blk_cb() vs
> dma_aio_cancel() race") acquired the AioContext lock inside dma_blk_cb()
> to avoid a race with scsi_device_purge_requests() running in the main
> loop thread.
> 
> The SCSI code no longer calls dma_aio_cancel() from the main loop thread
> while I/O is running in the IOThread AioContext. Therefore it is no
> longer necessary to take this lock to protect DMAAIOCB fields. The
> ->cb() function also does not require the lock because blk_aio_*() and
> friends do not need the AioContext lock.
> 
> Both hw/ide/core.c and hw/ide/macio.c also call dma_blk_io() but don't
> rely on it taking the AioContext lock, so this change is safe.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  system/dma-helpers.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org




Re: [PATCH for-8.2] ide/via: Fix BAR4 value in legacy mode

2023-11-27 Thread Kevin Wolf
Am 25.11.2023 um 15:01 hat BALATON Zoltan geschrieben:
> Return default value in legacy mode for BAR4 when unset. This can't be
> set in reset method because BARs are cleared on reset so we return it
> instead when BARs are read in legacy mode.
> 
> Signed-off-by: BALATON Zoltan 
> ---
> This fixes UDMA on amigaone with AmigaOS and I'd like to include for
> 8.2 release.

Thanks, applied to the block branch.

Kevin




Re: [PATCH] vmdk: Don't corrupt desc file in vmdk_write_cid

2023-11-27 Thread Kevin Wolf
Am 27.11.2023 um 16:18 hat Eric Blake geschrieben:
> On Fri, Nov 24, 2023 at 11:56:54AM +, Fam wrote:
> > From: Fam Zheng 
> > 
> > If the text description file is larger than DESC_SIZE, we force the last
> > byte in the buffer to be 0 and write it out.
> > 
> > This results in a corruption.
> > 
> > Try to allocate a big buffer in this case.
> > 
> > Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1923
> > 
> > Signed-off-by: Fam Zheng 
> > ---
> >  block/vmdk.c   | 28 
> >  tests/qemu-iotests/059 |  2 ++
> >  tests/qemu-iotests/059.out |  4 
> >  3 files changed, 26 insertions(+), 8 deletions(-)
> 
> Reviewed-by: Eric Blake 
> 
> Are we trying to get this into 8.2, since it is a data corruption?

Yes, I've queued it for -rc2.

Kevin




[PATCH v6 9/9] ppc/pnv: Test pnv i2c master and connected devices

2023-11-27 Thread Glenn Miles
Tests the following for both P9 and P10:
  - I2C master POR status
  - I2C master status after immediate reset

Tests the following for powernv10-rainier only:
  - Configure the pca9552 hotplug device pins as inputs, then
    read the INPUT0/1 registers to verify that all pins are high
  - Connected GPIO pin tests of the P10 PCA9552 device.  Tests that
    the outputs of pins 0-4 affect the inputs of pins 5-9 respectively.
  - PCA9554 GPIO pins test.  Tests input and output functionality.

Signed-off-by: Glenn Miles 
---

No change from previous version

 tests/qtest/meson.build |   1 +
 tests/qtest/pnv-host-i2c-test.c | 650 
 2 files changed, 651 insertions(+)
 create mode 100644 tests/qtest/pnv-host-i2c-test.c

diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
index 47dabf91d0..fbb0bd204c 100644
--- a/tests/qtest/meson.build
+++ b/tests/qtest/meson.build
@@ -163,6 +163,7 @@ qtests_ppc64 = \
   qtests_ppc + \
   (config_all_devices.has_key('CONFIG_PSERIES') ? ['device-plug-test'] : []) + \
   (config_all_devices.has_key('CONFIG_POWERNV') ? ['pnv-xscom-test'] : []) + \
+  (config_all_devices.has_key('CONFIG_POWERNV') ? ['pnv-host-i2c-test'] : []) + \
   (config_all_devices.has_key('CONFIG_PSERIES') ? ['rtas-test'] : []) + \
   (slirp.found() ? ['pxe-test'] : []) + \
   (config_all_devices.has_key('CONFIG_USB_UHCI') ? ['usb-hcd-uhci-test'] : []) + \
diff --git a/tests/qtest/pnv-host-i2c-test.c b/tests/qtest/pnv-host-i2c-test.c
new file mode 100644
index 0000000000..377525e458
--- /dev/null
+++ b/tests/qtest/pnv-host-i2c-test.c
@@ -0,0 +1,650 @@
+/*
+ * QTest testcase for PowerNV 10 Host I2C Communications
+ *
+ * Copyright (c) 2023, IBM Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later. See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include "libqtest.h"
+#include "hw/misc/pca9554_regs.h"
+#include "hw/misc/pca9552_regs.h"
+
+#define PPC_BIT(bit)            (0x8000000000000000ULL >> (bit))
+#define PPC_BIT32(bit)          (0x80000000 >> (bit))
+#define PPC_BIT8(bit)           (0x80 >> (bit))
+#define PPC_BITMASK(bs, be)     ((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))
+#define PPC_BITMASK32(bs, be)   ((PPC_BIT32(bs) - PPC_BIT32(be)) | \
+                                 PPC_BIT32(bs))
+
+#define MASK_TO_LSH(m)          (__builtin_ffsll(m) - 1)
+#define GETFIELD(m, v)          (((v) & (m)) >> MASK_TO_LSH(m))
+#define SETFIELD(m, v, val) \
+        (((v) & ~(m)) | ((((typeof(v))(val)) << MASK_TO_LSH(m)) & (m)))
+
+#define P10_XSCOM_BASE  0x000603fcull
+#define PNV10_CHIP_MAX_I2C  5
+#define PNV10_XSCOM_I2CM_BASE   0xa
+#define PNV10_XSCOM_I2CM_SIZE   0x1000
+
+/* I2C FIFO register */
+#define I2C_FIFO_REG            0x4
+#define I2C_FIFO                PPC_BITMASK(0, 7)
+
+/* I2C command register */
+#define I2C_CMD_REG 0x5
+#define I2C_CMD_WITH_START  PPC_BIT(0)
+#define I2C_CMD_WITH_ADDR   PPC_BIT(1)
+#define I2C_CMD_READ_CONT   PPC_BIT(2)
+#define I2C_CMD_WITH_STOP   PPC_BIT(3)
+#define I2C_CMD_INTR_STEERING   PPC_BITMASK(6, 7) /* P9 */
+#define   I2C_CMD_INTR_STEER_HOST   1
+#define   I2C_CMD_INTR_STEER_OCC    2
+#define I2C_CMD_DEV_ADDR        PPC_BITMASK(8, 14)
+#define I2C_CMD_READ_NOT_WRITE  PPC_BIT(15)
+#define I2C_CMD_LEN_BYTES   PPC_BITMASK(16, 31)
+#define I2C_MAX_TFR_LEN 0xfff0ull
+
+/* I2C mode register */
+#define I2C_MODE_REG            0x6
+#define I2C_MODE_BIT_RATE_DIV   PPC_BITMASK(0, 15)
+#define I2C_MODE_PORT_NUM   PPC_BITMASK(16, 21)
+#define I2C_MODE_ENHANCED   PPC_BIT(28)
+#define I2C_MODE_DIAGNOSTIC PPC_BIT(29)
+#define I2C_MODE_PACING_ALLOW   PPC_BIT(30)
+#define I2C_MODE_WRAP   PPC_BIT(31)
+
+/* I2C watermark register */
+#define I2C_WATERMARK_REG   0x7
+#define I2C_WATERMARK_HIGH  PPC_BITMASK(16, 19)
+#define I2C_WATERMARK_LOW   PPC_BITMASK(24, 27)
+
+/*
+ * I2C interrupt mask and condition registers
+ *
+ * NB: The function of 0x9 and 0xa changes depending on whether you're reading
+ * or writing to them. When read they return the interrupt condition bits
+ * and on writes they update the interrupt mask register.
+ *
+ *  The bit definitions are the same for all the interrupt registers.
+ */
+#define I2C_INTR_MASK_REG   0x8
+
+#define I2C_INTR_RAW_COND_REG   0x9 /* read */
+#define I2C_INTR_MASK_OR_REG    0x9 /* write */
+
+#define I2C_INTR_COND_REG   0xa /* read */
+#define I2C_INTR_MASK_AND_REG   0xa /* write */
+
+#define I2C_INTR_ALL            PPC_BITMASK(16, 31)
+#define I2C_INTR_INVALID_CMD    PPC_BIT(16)
+#define I2C_INTR_LBUS_PARITY_ERR  

[PATCH v6 7/9] misc: Add a pca9554 GPIO device model

2023-11-27 Thread Glenn Miles
Specs are available here:

https://www.nxp.com/docs/en/data-sheet/PCA9554_9554A.pdf

This is a simple model supporting the basic registers for GPIO
mode.  The device also supports an interrupt output line but the
model does not yet support this.
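
As a rough illustration of the register behaviour such a model needs in GPIO mode, the sketch below derives the pin levels from the configuration and output registers and applies polarity inversion on reads. The function and parameter names are hypothetical, and the semantics (a pin is driven low only when it is configured as an output with its output bit cleared, otherwise it floats high unless pulled low externally) are assumed from the description in this series.

```c
#include <stdint.h>

/*
 * Compute the level of all 8 pins at once.
 *   config:   1 = pin is an input (high impedance), 0 = output
 *   output:   value driven when the pin is an output
 *   ext_high: 1 = no external device pulls the pin low
 */
static inline uint8_t demo_pca9554_pin_levels(uint8_t config, uint8_t output,
                                              uint8_t ext_high)
{
    /* A pin is actively driven low only when config=0 and output=0. */
    uint8_t not_driven_low = config | output;

    return not_driven_low & ext_high;
}

/* Reads of the input register are inverted per bit by the polarity register. */
static inline uint8_t demo_pca9554_read_input(uint8_t levels, uint8_t polarity)
{
    return levels ^ polarity;
}
```

With all pins configured as inputs and nothing pulling them low, the levels read back as 0xFF, which the pull-ups would give on real hardware.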

Reviewed-by: Cédric Le Goater 
Signed-off-by: Glenn Miles 
---

No change from previous version

 MAINTAINERS|  10 +-
 hw/misc/pca9554.c  | 328 +
 include/hw/misc/pca9554.h  |  36 
 include/hw/misc/pca9554_regs.h |  19 ++
 4 files changed, 391 insertions(+), 2 deletions(-)
 create mode 100644 hw/misc/pca9554.c
 create mode 100644 include/hw/misc/pca9554.h
 create mode 100644 include/hw/misc/pca9554_regs.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 695e0bd34f..4d1c991691 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1155,9 +1155,7 @@ R: Joel Stanley 
 L: qemu-...@nongnu.org
 S: Maintained
 F: hw/*/*aspeed*
-F: hw/misc/pca9552.c
 F: include/hw/*/*aspeed*
-F: include/hw/misc/pca9552*.h
 F: hw/net/ftgmac100.c
 F: include/hw/net/ftgmac100.h
 F: docs/system/arm/aspeed.rst
@@ -1526,6 +1524,14 @@ F: include/hw/pci-host/pnv*
 F: pc-bios/skiboot.lid
 F: tests/qtest/pnv*
 
+pca955x
+M: Glenn Miles 
+L: qemu-...@nongnu.org
+L: qemu-...@nongnu.org
+S: Odd Fixes
+F: hw/misc/pca955*.c
+F: include/hw/misc/pca955*.h
+
 virtex_ml507
 M: Edgar E. Iglesias 
 L: qemu-...@nongnu.org
diff --git a/hw/misc/pca9554.c b/hw/misc/pca9554.c
new file mode 100644
index 0000000000..778b32e443
--- /dev/null
+++ b/hw/misc/pca9554.c
@@ -0,0 +1,328 @@
+/*
+ * PCA9554 I/O port
+ *
+ * Copyright (c) 2023, IBM Corporation.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qemu/module.h"
+#include "qemu/bitops.h"
+#include "hw/qdev-properties.h"
+#include "hw/misc/pca9554.h"
+#include "hw/misc/pca9554_regs.h"
+#include "hw/irq.h"
+#include "migration/vmstate.h"
+#include "qapi/error.h"
+#include "qapi/visitor.h"
+#include "trace.h"
+#include "qom/object.h"
+
+struct PCA9554Class {
+/*< private >*/
+I2CSlaveClass parent_class;
+/*< public >*/
+};
+typedef struct PCA9554Class PCA9554Class;
+
+DECLARE_CLASS_CHECKERS(PCA9554Class, PCA9554,
+   TYPE_PCA9554)
+
+#define PCA9554_PIN_LOW  0x0
+#define PCA9554_PIN_HIZ  0x1
+
+static const char *pin_state[] = {"low", "high"};
+
+static void pca9554_update_pin_input(PCA9554State *s)
+{
+int i;
+uint8_t config = s->regs[PCA9554_CONFIG];
+uint8_t output = s->regs[PCA9554_OUTPUT];
+uint8_t internal_state = config | output;
+
+for (i = 0; i < PCA9554_PIN_COUNT; i++) {
+uint8_t bit_mask = 1 << i;
+uint8_t internal_pin_state = (internal_state >> i) & 0x1;
+uint8_t old_value = s->regs[PCA9554_INPUT] & bit_mask;
+uint8_t new_value;
+
+switch (internal_pin_state) {
+case PCA9554_PIN_LOW:
+s->regs[PCA9554_INPUT] &= ~bit_mask;
+break;
+case PCA9554_PIN_HIZ:
+/*
+ * pullup sets it to a logical 1 unless
+ * external device drives it low.
+ */
+if (s->ext_state[i] == PCA9554_PIN_LOW) {
+s->regs[PCA9554_INPUT] &= ~bit_mask;
+} else {
+s->regs[PCA9554_INPUT] |=  bit_mask;
+}
+break;
+default:
+break;
+}
+
+/* update irq state only if pin state changed */
+new_value = s->regs[PCA9554_INPUT] & bit_mask;
+if (new_value != old_value) {
+if (new_value) {
+/* changed from 0 to 1 */
+qemu_set_irq(s->gpio_out[i], 1);
+} else {
+/* changed from 1 to 0 */
+qemu_set_irq(s->gpio_out[i], 0);
+}
+}
+}
+}
+
+static uint8_t pca9554_read(PCA9554State *s, uint8_t reg)
+{
+switch (reg) {
+case PCA9554_INPUT:
+return s->regs[PCA9554_INPUT] ^ s->regs[PCA9554_POLARITY];
+case PCA9554_OUTPUT:
+case PCA9554_POLARITY:
+case PCA9554_CONFIG:
+return s->regs[reg];
+default:
+qemu_log_mask(LOG_GUEST_ERROR, "%s: unexpected read to register %d\n",
+  __func__, reg);
+return 0xFF;
+}
+}
+
+static void pca9554_write(PCA9554State *s, uint8_t reg, uint8_t data)
+{
+switch (reg) {
+case PCA9554_OUTPUT:
+case PCA9554_CONFIG:
+s->regs[reg] = data;
+pca9554_update_pin_input(s);
+break;
+case PCA9554_POLARITY:
+s->regs[reg] = data;
+break;
+case PCA9554_INPUT:
+default:
+qemu_log_mask(LOG_GUEST_ERROR, "%s: unexpected write to register %d\n",
+  __func__, reg);
+}
+}
+
+static uint8_t pca9554_recv(I2CSlave *i2c)
+{
+PCA9554State *s = PCA9554(i2c);
+uint8_t ret;
+
+ret = pca9554_read(s, s->pointer & 0x3);
+
+return ret;
+}
+
+static int pca9554_send(I2CSlave *i2c, uint8_t 

[PATCH v6 2/9] misc/pca9552: Let external devices set pca9552 inputs

2023-11-27 Thread Glenn Miles
Allow external devices to drive pca9552 input pins by adding
input GPIO's to the model.  This allows a device to connect
its output GPIO's to the pca9552 input GPIO's.

In order for an external device to set the state of a pca9552
pin, the pin must first be configured for high impedance (LED
is off).  If the pca9552 pin is configured to drive the pin low
(LED is on), then external input will be ignored.

Here is a table describing the logical state of a pca9552 pin
given the state being driven by the pca9552 and an external device:

                   PCA9552
                   Configured
                   State

                  | Hi-Z | Low |
            ------+------+-----+
  External   Hi-Z |  Hi  | Low |
  Device    ------+------+-----+
  State      Low  |  Low | Low |
            ------+------+-----+
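
The table amounts to a wired-AND: the pin reads high only when neither side drives it low. A minimal sketch, with hypothetical names (demo_pin_drive, demo_pin_is_high) that are not taken from the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* The two drive states either side of the pin can be in. */
enum demo_pin_drive {
    DEMO_PIN_LOW = 0,
    DEMO_PIN_HIZ = 1,
};

/*
 * Resolve the logical level of a pin with a pull-up.
 * Returns true for logical high, false for logical low.
 */
static inline bool demo_pin_is_high(enum demo_pin_drive chip,
                                    enum demo_pin_drive external)
{
    assert(chip == DEMO_PIN_LOW || chip == DEMO_PIN_HIZ);
    assert(external == DEMO_PIN_LOW || external == DEMO_PIN_HIZ);

    /* The pull-up wins only if both sides are high impedance. */
    return chip == DEMO_PIN_HIZ && external == DEMO_PIN_HIZ;
}
```

This also shows why external input is ignored while the LED is on: once the chip side drives low, the result is low regardless of the external state.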

Reviewed-by: Andrew Jeffery 
Signed-off-by: Glenn Miles 
---

No change from previous version

 hw/misc/pca9552.c | 50 +--
 include/hw/misc/pca9552.h |  3 ++-
 2 files changed, 45 insertions(+), 8 deletions(-)

diff --git a/hw/misc/pca9552.c b/hw/misc/pca9552.c
index 445f56a9e8..fe876471c8 100644
--- a/hw/misc/pca9552.c
+++ b/hw/misc/pca9552.c
@@ -44,6 +44,8 @@ DECLARE_CLASS_CHECKERS(PCA955xClass, PCA955X,
 #define PCA9552_LED_OFF  0x1
 #define PCA9552_LED_PWM0 0x2
 #define PCA9552_LED_PWM1 0x3
+#define PCA9552_PIN_LOW  0x0
+#define PCA9552_PIN_HIZ  0x1
 
 static const char *led_state[] = {"on", "off", "pwm0", "pwm1"};
 
@@ -110,22 +112,27 @@ static void pca955x_update_pin_input(PCA955xState *s)
 
 for (i = 0; i < k->pin_count; i++) {
 uint8_t input_reg = PCA9552_INPUT0 + (i / 8);
-uint8_t input_shift = (i % 8);
+uint8_t bit_mask = 1 << (i % 8);
 uint8_t config = pca955x_pin_get_config(s, i);
+uint8_t old_value = s->regs[input_reg] & bit_mask;
+uint8_t new_value;
 
 switch (config) {
 case PCA9552_LED_ON:
 /* Pin is set to 0V to turn on LED */
-qemu_set_irq(s->gpio[i], 0);
-s->regs[input_reg] &= ~(1 << input_shift);
+s->regs[input_reg] &= ~bit_mask;
 break;
 case PCA9552_LED_OFF:
 /*
  * Pin is set to Hi-Z to turn off LED and
- * pullup sets it to a logical 1.
+ * pullup sets it to a logical 1 unless
+ * external device drives it low.
  */
-qemu_set_irq(s->gpio[i], 1);
-s->regs[input_reg] |= 1 << input_shift;
+if (s->ext_state[i] == PCA9552_PIN_LOW) {
+s->regs[input_reg] &= ~bit_mask;
+} else {
+s->regs[input_reg] |=  bit_mask;
+}
 break;
 case PCA9552_LED_PWM0:
 case PCA9552_LED_PWM1:
@@ -133,6 +140,12 @@ static void pca955x_update_pin_input(PCA955xState *s)
 default:
 break;
 }
+
+/* update irq state only if pin state changed */
+new_value = s->regs[input_reg] & bit_mask;
+if (new_value != old_value) {
+qemu_set_irq(s->gpio_out[i], !!new_value);
+}
 }
 }
 
@@ -340,6 +353,7 @@ static const VMStateDescription pca9552_vmstate = {
 VMSTATE_UINT8(len, PCA955xState),
 VMSTATE_UINT8(pointer, PCA955xState),
 VMSTATE_UINT8_ARRAY(regs, PCA955xState, PCA955X_NR_REGS),
+VMSTATE_UINT8_ARRAY(ext_state, PCA955xState, PCA955X_PIN_COUNT_MAX),
 VMSTATE_I2C_SLAVE(i2c, PCA955xState),
 VMSTATE_END_OF_LIST()
 }
@@ -358,6 +372,7 @@ static void pca9552_reset(DeviceState *dev)
 s->regs[PCA9552_LS2] = 0x55;
 s->regs[PCA9552_LS3] = 0x55;
 
+memset(s->ext_state, PCA9552_PIN_HIZ, PCA955X_PIN_COUNT_MAX);
 pca955x_update_pin_input(s);
 
 s->pointer = 0xFF;
@@ -380,6 +395,26 @@ static void pca955x_initfn(Object *obj)
 }
 }
 
+static void pca955x_set_ext_state(PCA955xState *s, int pin, int level)
+{
+if (s->ext_state[pin] != level) {
+uint16_t pins_status = pca955x_pins_get_status(s);
+s->ext_state[pin] = level;
+pca955x_update_pin_input(s);
+pca955x_display_pins_status(s, pins_status);
+}
+}
+
+static void pca955x_gpio_in_handler(void *opaque, int pin, int level)
+{
+
+PCA955xState *s = PCA955X(opaque);
+PCA955xClass *k = PCA955X_GET_CLASS(s);
+
+assert((pin >= 0) && (pin < k->pin_count));
+pca955x_set_ext_state(s, pin, level);
+}
+
 static void pca955x_realize(DeviceState *dev, Error **errp)
 {
 PCA955xClass *k = PCA955X_GET_CLASS(dev);
@@ -389,7 +424,8 @@ static void pca955x_realize(DeviceState *dev, Error **errp)
 s->description = g_strdup("pca-unspecified");
 }
 
-qdev_init_gpio_out(dev, s->gpio, k->pin_count);
+qdev_init_gpio_out(dev, s->gpio_out, k->pin_count);
+qdev_init_gpio_in(dev, pca955x_gpio_in_handler, k->pin_count);
 }
 
 static Property pca955x_properties[] = {
diff --git a/include/hw/misc/pca9552.h 

[PATCH v6 6/9] ppc/pnv: Use resettable interface to reset child I2C buses

2023-11-27 Thread Glenn Miles
The QEMU I2C buses and devices use the resettable
interface for resetting while the PNV I2C controller
and parent buses and devices have not yet transitioned
to this new interface and use the old reset strategy.
This was preventing the I2C buses and devices wired
to the PNV I2C controller from being reset.

The short term fix for this is to have the PNV I2C
Controller's reset function explicitly call the resettable
interface function, bus_cold_reset(), on all child
I2C buses.

The long term fix should be to transition all PNV parent
devices and buses to use the resettable interface so that
all child buses and devices are automatically reset.
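
As a rough illustration of the short term fix, here is a toy model that
does not use any QEMU types.  ToyDev, ToyBus, ToyCtrl and the toy_*
functions are made-up stand-ins; the point is only that the controller's
legacy reset hook has to walk its child buses itself.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEVS 4

/* Hypothetical stand-ins for a device, a bus and an I2C controller. */
typedef struct ToyDev { int state; } ToyDev;
typedef struct ToyBus { ToyDev *devs[MAX_DEVS]; size_t ndevs; } ToyBus;
typedef struct ToyCtrl {
    int regs;
    ToyBus *busses;
    size_t num_busses;
} ToyCtrl;

/* Stand-in for a bus-level cold reset: resets every device on the bus. */
static void toy_bus_cold_reset(ToyBus *bus)
{
    for (size_t i = 0; i < bus->ndevs; i++) {
        bus->devs[i]->state = 0;
    }
}

/* Old-style reset: only touches the controller's own state. */
static void toy_ctrl_legacy_reset(ToyCtrl *c)
{
    c->regs = 0;
}

/* Wrapper in the spirit of the fix: also reset all child buses. */
static void toy_ctrl_sys_reset(ToyCtrl *c)
{
    toy_ctrl_legacy_reset(c);
    for (size_t port = 0; port < c->num_busses; port++) {
        toy_bus_cold_reset(&c->busses[port]);
    }
}
```

With only toy_ctrl_legacy_reset() registered, devices on the child buses
would keep their old state across a reset, which is the problem described
above.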

Reviewed-by: Cédric Le Goater 
Signed-off-by: Glenn Miles 
---

No change from previous version

 hw/ppc/pnv_i2c.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/pnv_i2c.c b/hw/ppc/pnv_i2c.c
index 656a48eebe..774946d6b2 100644
--- a/hw/ppc/pnv_i2c.c
+++ b/hw/ppc/pnv_i2c.c
@@ -629,6 +629,19 @@ static int pnv_i2c_dt_xscom(PnvXScomInterface *dev, void *fdt,
 return 0;
 }
 
+static void pnv_i2c_sys_reset(void *dev)
+{
+int port;
+PnvI2C *i2c = PNV_I2C(dev);
+
+pnv_i2c_reset(dev);
+
+/* reset all buses connected to this i2c controller */
+for (port = 0; port < i2c->num_busses; port++) {
+bus_cold_reset(BUS(i2c->busses[port]));
+}
+}
+
 static void pnv_i2c_realize(DeviceState *dev, Error **errp)
 {
 PnvI2C *i2c = PNV_I2C(dev);
@@ -654,7 +667,7 @@ static void pnv_i2c_realize(DeviceState *dev, Error **errp)
 
 fifo8_create(&i2c->fifo, PNV_I2C_FIFO_SIZE);
 
-qemu_register_reset(pnv_i2c_reset, dev);
+qemu_register_reset(pnv_i2c_sys_reset, dev);
 
 qdev_init_gpio_out(DEVICE(dev), &i2c->psi_irq, 1);
 }
-- 
2.31.1




[PATCH v6 3/9] ppc/pnv: New powernv10-rainier machine type

2023-11-27 Thread Glenn Miles
Create a new powernv machine type, powernv10-rainier, that
will contain rainier-specific devices.

Reviewed-by: Cédric Le Goater 
Signed-off-by: Glenn Miles 
---

No change from previous version

 hw/ppc/pnv.c | 24 ++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index 0297871bdd..08704ce695 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -2251,7 +2251,7 @@ static void pnv_machine_power9_class_init(ObjectClass *oc, void *data)
 machine_class_allow_dynamic_sysbus_dev(mc, TYPE_PNV_PHB);
 }
 
-static void pnv_machine_power10_class_init(ObjectClass *oc, void *data)
+static void pnv_machine_p10_common_class_init(ObjectClass *oc, void *data)
 {
 MachineClass *mc = MACHINE_CLASS(oc);
 PnvMachineClass *pmc = PNV_MACHINE_CLASS(oc);
@@ -2263,7 +2263,6 @@ static void pnv_machine_power10_class_init(ObjectClass *oc, void *data)
 { TYPE_PNV_PHB_ROOT_PORT, "version", "5" },
 };
 
-mc->desc = "IBM PowerNV (Non-Virtualized) POWER10";
 mc->default_cpu_type = POWERPC_CPU_TYPE_NAME("power10_v2.0");
 compat_props_add(mc->compat_props, phb_compat, G_N_ELEMENTS(phb_compat));
 
@@ -2276,6 +2275,22 @@ static void pnv_machine_power10_class_init(ObjectClass *oc, void *data)
 machine_class_allow_dynamic_sysbus_dev(mc, TYPE_PNV_PHB);
 }
 
+static void pnv_machine_power10_class_init(ObjectClass *oc, void *data)
+{
+MachineClass *mc = MACHINE_CLASS(oc);
+
+pnv_machine_p10_common_class_init(oc, data);
+mc->desc = "IBM PowerNV (Non-Virtualized) POWER10";
+}
+
+static void pnv_machine_p10_rainier_class_init(ObjectClass *oc, void *data)
+{
+MachineClass *mc = MACHINE_CLASS(oc);
+
+pnv_machine_p10_common_class_init(oc, data);
+mc->desc = "IBM PowerNV (Non-Virtualized) POWER10 Rainier";
+}
+
 static bool pnv_machine_get_hb(Object *obj, Error **errp)
 {
 PnvMachineState *pnv = PNV_MACHINE(obj);
@@ -2381,6 +2396,11 @@ static void pnv_machine_class_init(ObjectClass *oc, void *data)
 }
 
 static const TypeInfo types[] = {
+{
+.name  = MACHINE_TYPE_NAME("powernv10-rainier"),
+.parent= MACHINE_TYPE_NAME("powernv10"),
+.class_init= pnv_machine_p10_rainier_class_init,
+},
 {
 .name  = MACHINE_TYPE_NAME("powernv10"),
 .parent= TYPE_PNV_MACHINE,
-- 
2.31.1




[PATCH v6 8/9] ppc/pnv: Add a pca9554 I2C device to powernv10-rainier

2023-11-27 Thread Glenn Miles
For powernv10-rainier, the Power Hypervisor code expects to see a
pca9554 device connected to the 3rd PNV I2C engine on port 1 at I2C
address 0x25 (or left-justified address of 0x4A).  This is used by
the hypervisor code to detect if a "Cable Card" is present.
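
For readers unfamiliar with the two address notations: the left-justified
form is the 7-bit address shifted up by one bit, leaving bit 0 free for
the R/W flag.  A small sketch (i2c_left_justify() is a hypothetical
helper name, not a QEMU function):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a 7-bit I2C address into its
 * left-justified 8-bit form, with the R/W bit (bit 0) left at 0.
 */
static uint8_t i2c_left_justify(uint8_t addr7)
{
    assert(addr7 <= 0x7F);
    return (uint8_t)(addr7 << 1);
}
```

With this, 0x25 maps to 0x4A as stated above.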

Reviewed-by: Cédric Le Goater 
Signed-off-by: Glenn Miles 
---

No change from previous version

 hw/misc/Kconfig | 4 
 hw/misc/meson.build | 1 +
 hw/ppc/Kconfig  | 1 +
 hw/ppc/pnv.c| 6 ++
 4 files changed, 12 insertions(+)

diff --git a/hw/misc/Kconfig b/hw/misc/Kconfig
index cc8a8c1418..c347a132c2 100644
--- a/hw/misc/Kconfig
+++ b/hw/misc/Kconfig
@@ -34,6 +34,10 @@ config PCA9552
 bool
 depends on I2C
 
+config PCA9554
+bool
+depends on I2C
+
 config I2C_ECHO
 bool
 default y if TEST_DEVICES
diff --git a/hw/misc/meson.build b/hw/misc/meson.build
index 36c20d5637..c39410e4a7 100644
--- a/hw/misc/meson.build
+++ b/hw/misc/meson.build
@@ -4,6 +4,7 @@ system_ss.add(when: 'CONFIG_FW_CFG_DMA', if_true: files('vmcoreinfo.c'))
 system_ss.add(when: 'CONFIG_ISA_DEBUG', if_true: files('debugexit.c'))
 system_ss.add(when: 'CONFIG_ISA_TESTDEV', if_true: files('pc-testdev.c'))
 system_ss.add(when: 'CONFIG_PCA9552', if_true: files('pca9552.c'))
+system_ss.add(when: 'CONFIG_PCA9554', if_true: files('pca9554.c'))
 system_ss.add(when: 'CONFIG_PCI_TESTDEV', if_true: files('pci-testdev.c'))
 system_ss.add(when: 'CONFIG_UNIMP', if_true: files('unimp.c'))
 system_ss.add(when: 'CONFIG_EMPTY_SLOT', if_true: files('empty_slot.c'))
diff --git a/hw/ppc/Kconfig b/hw/ppc/Kconfig
index f77ca773cf..2302778265 100644
--- a/hw/ppc/Kconfig
+++ b/hw/ppc/Kconfig
@@ -33,6 +33,7 @@ config POWERNV
 select FDT_PPC
 select PCI_POWERNV
 select PCA9552
+select PCA9554
 
 config PPC405
 bool
diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index 42105211f5..40f6b8fea0 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -1913,6 +1913,12 @@ static void pnv_rainier_i2c_init(PnvMachineState *pnv)
 qdev_connect_gpio_out(DEVICE(dev), 2, qdev_get_gpio_in(DEVICE(dev), 7));
 qdev_connect_gpio_out(DEVICE(dev), 3, qdev_get_gpio_in(DEVICE(dev), 8));
 qdev_connect_gpio_out(DEVICE(dev), 4, qdev_get_gpio_in(DEVICE(dev), 9));
+
+/*
+ * Add a PCA9554 I2C device for cable card presence detection
+ * to engine 2, bus 1, address 0x25
+ */
+i2c_slave_create_simple(chip10->i2c[2].busses[1], "pca9554", 0x25);
 }
 }
 
-- 
2.31.1




[PATCH v6 5/9] ppc/pnv: Wire up pca9552 GPIO pins for PCIe hotplug power control

2023-11-27 Thread Glenn Miles
For powernv10-rainier, a pca9552 device is used for PCIe slot hotplug
power control by the Power Hypervisor code.  The code expects that
some time after it enables power to a PCIe slot by asserting one of
the pca9552 GPIO pins 0-4, it should see a "power good" signal asserted
on one of pca9552 GPIO pins 5-9.

To simulate this behavior, we simply connect the GPIO outputs for
pins 0-4 to the GPIO inputs for pins 5-9.

Each PCIe slot is assigned 3 GPIO pins on the pca9552 device, for
control of up to 5 PCIe slots.  The per-slot signal names are:

   SLOTx_EN.......PHYP uses this as an output to enable
                  slot power.  We connect this to the
                  SLOTx_PG pin to simulate a PGOOD signal.
   SLOTx_PG.......PHYP uses this as an input to detect
                  PGOOD for the slot.  For our purposes
                  we just connect this to the SLOTx_EN
                  output.
   SLOTx_Control..PHYP uses this as an output to prevent
                  a race condition in the real hotplug
                  circuitry, but we can ignore this output
                  for simulation.
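
The loopback can be pictured with a toy expander that has nothing to do
with the real QEMU GPIO plumbing.  ToyExpander and the toy_* functions
below are hypothetical; the only facts taken from this patch are that
pins 0-4 (SLOTx_EN) feed pins 5-9 (SLOTx_PG).

```c
#include <assert.h>

#define NPINS   16
#define NSLOTS  5
#define NO_LINK (-1)

/* Hypothetical toy expander, one entry per pin. */
typedef struct ToyExpander {
    int driven_low[NPINS]; /* 1 if the chip itself drives the pin low */
    int ext_low[NPINS];    /* 1 if an external source drives it low */
    int link[NPINS];       /* output pin -> input pin, or NO_LINK */
} ToyExpander;

/* A pin reads high unless the chip or an external source pulls it low. */
static int toy_pin_level(const ToyExpander *e, int pin)
{
    return !(e->driven_low[pin] || e->ext_low[pin]);
}

static void toy_init(ToyExpander *e)
{
    for (int i = 0; i < NPINS; i++) {
        e->driven_low[i] = 0;
        e->ext_low[i] = 0;
        e->link[i] = NO_LINK;
    }
    /* SLOTx_EN (pins 0-4) -> SLOTx_PG (pins 5-9) */
    for (int slot = 0; slot < NSLOTS; slot++) {
        e->link[slot] = slot + NSLOTS;
    }
}

/* Drive a pin low (low=1) or release it (low=0), then update its peer. */
static void toy_drive(ToyExpander *e, int pin, int low)
{
    assert(pin >= 0 && pin < NPINS);
    e->driven_low[pin] = low;
    if (e->link[pin] != NO_LINK) {
        e->ext_low[e->link[pin]] = !toy_pin_level(e, pin);
    }
}
```

In this sketch a SLOTx_PG pin simply follows its SLOTx_EN pin, with only
one level of propagation and no delay.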

Reviewed-by: Cédric Le Goater 
Signed-off-by: Glenn Miles 
---

Changes from previous version:
  - Changed 'hotplug' variable name to 'dev'

 hw/ppc/pnv.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index d8d19fb065..42105211f5 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -1900,7 +1900,19 @@ static void pnv_rainier_i2c_init(PnvMachineState *pnv)
  * Add a PCA9552 I2C device for PCIe hotplug control
  * to engine 2, bus 1, address 0x63
  */
-i2c_slave_create_simple(chip10->i2c[2].busses[1], "pca9552", 0x63);
+I2CSlave *dev = i2c_slave_create_simple(chip10->i2c[2].busses[1],
+"pca9552", 0x63);
+
+/*
+ * Connect PCA9552 GPIO pins 0-4 (SLOTx_EN) outputs to GPIO pins 5-9
+ * (SLOTx_PG) inputs in order to fake the pgood state of PCIe slots
+ * after hypervisor code sets a SLOTx_EN pin high.
+ */
+qdev_connect_gpio_out(DEVICE(dev), 0, qdev_get_gpio_in(DEVICE(dev), 5));
+qdev_connect_gpio_out(DEVICE(dev), 1, qdev_get_gpio_in(DEVICE(dev), 6));
+qdev_connect_gpio_out(DEVICE(dev), 2, qdev_get_gpio_in(DEVICE(dev), 7));
+qdev_connect_gpio_out(DEVICE(dev), 3, qdev_get_gpio_in(DEVICE(dev), 8));
+qdev_connect_gpio_out(DEVICE(dev), 4, qdev_get_gpio_in(DEVICE(dev), 9));
 }
 }
 
-- 
2.31.1




[PATCH v6 4/9] ppc/pnv: Add pca9552 to powernv10-rainier for PCIe hotplug power control

2023-11-27 Thread Glenn Miles
The Power Hypervisor code expects to see a pca9552 device connected
to the 3rd PNV I2C engine on port 1 at I2C address 0x63 (or left-
justified address of 0xC6).  This is used by hypervisor code to
control PCIe slot power during hotplug events.

Reviewed-by: Cédric Le Goater 
Signed-off-by: Glenn Miles 
---

No change from previous version

 hw/ppc/Kconfig   |  1 +
 hw/ppc/pnv.c | 25 +
 include/hw/ppc/pnv.h |  1 +
 3 files changed, 27 insertions(+)

diff --git a/hw/ppc/Kconfig b/hw/ppc/Kconfig
index 56f0475a8e..f77ca773cf 100644
--- a/hw/ppc/Kconfig
+++ b/hw/ppc/Kconfig
@@ -32,6 +32,7 @@ config POWERNV
 select XIVE
 select FDT_PPC
 select PCI_POWERNV
+select PCA9552
 
 config PPC405
 bool
diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index 08704ce695..d8d19fb065 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -790,6 +790,7 @@ static void pnv_init(MachineState *machine)
 const char *bios_name = machine->firmware ?: FW_FILE_NAME;
 PnvMachineState *pnv = PNV_MACHINE(machine);
 MachineClass *mc = MACHINE_GET_CLASS(machine);
+PnvMachineClass *pmc = PNV_MACHINE_GET_CLASS(machine);
 char *fw_filename;
 long fw_size;
 uint64_t chip_ram_start = 0;
@@ -979,6 +980,13 @@ static void pnv_init(MachineState *machine)
  */
 pnv->powerdown_notifier.notify = pnv_powerdown_notify;
 qemu_register_powerdown_notifier(&pnv->powerdown_notifier);
+
+/*
+ * Create/Connect any machine-specific I2C devices
+ */
+if (pmc->i2c_init) {
+pmc->i2c_init(pnv);
+}
 }
 
 /*
@@ -1879,6 +1887,21 @@ static void pnv_chip_power10_realize(DeviceState *dev, Error **errp)
   qdev_get_gpio_in(DEVICE(&chip10->psi),
PSIHB9_IRQ_SBE_I2C));
 }
+
+}
+
+static void pnv_rainier_i2c_init(PnvMachineState *pnv)
+{
+int i;
+for (i = 0; i < pnv->num_chips; i++) {
+Pnv10Chip *chip10 = PNV10_CHIP(pnv->chips[i]);
+
+/*
+ * Add a PCA9552 I2C device for PCIe hotplug control
+ * to engine 2, bus 1, address 0x63
+ */
+i2c_slave_create_simple(chip10->i2c[2].busses[1], "pca9552", 0x63);
+}
 }
 
 static uint32_t pnv_chip_power10_xscom_pcba(PnvChip *chip, uint64_t addr)
@@ -2286,9 +2309,11 @@ static void pnv_machine_power10_class_init(ObjectClass *oc, void *data)
 static void pnv_machine_p10_rainier_class_init(ObjectClass *oc, void *data)
 {
 MachineClass *mc = MACHINE_CLASS(oc);
+PnvMachineClass *pmc = PNV_MACHINE_CLASS(oc);
 
 pnv_machine_p10_common_class_init(oc, data);
 mc->desc = "IBM PowerNV (Non-Virtualized) POWER10 Rainier";
+pmc->i2c_init = pnv_rainier_i2c_init;
 }
 
 static bool pnv_machine_get_hb(Object *obj, Error **errp)
diff --git a/include/hw/ppc/pnv.h b/include/hw/ppc/pnv.h
index 7e5fef7c43..110ac9aace 100644
--- a/include/hw/ppc/pnv.h
+++ b/include/hw/ppc/pnv.h
@@ -76,6 +76,7 @@ struct PnvMachineClass {
 int compat_size;
 
 void (*dt_power_mgt)(PnvMachineState *pnv, void *fdt);
+void (*i2c_init)(PnvMachineState *pnv);
 };
 
 struct PnvMachineState {
-- 
2.31.1




[PATCH v6 0/9] Add powernv10 I2C devices and tests

2023-11-27 Thread Glenn Miles
This series of patches includes support, tests and fixes for
adding PCA9552 and PCA9554 I2C devices to the powernv10 chip.

The PCA9552 device is used for PCIe slot hotplug power control
and monitoring, while the PCA9554 device is used for presence
detection of IBM CableCard devices.  Both devices are required
by the Power Hypervisor Firmware on the Power10 Rainier platform.

Changes from previous version:
  - Changed variable name, 'hotplug', to 'dev' in 5/9

Glenn Miles (9):
  misc/pca9552: Fix inverted input status
  misc/pca9552: Let external devices set pca9552 inputs
  ppc/pnv: New powernv10-rainier machine type
  ppc/pnv: Add pca9552 to powernv10-rainier for PCIe hotplug power
control
  ppc/pnv: Wire up pca9552 GPIO pins for PCIe hotplug power control
  ppc/pnv: Use resettable interface to reset child I2C buses
  misc: Add a pca9554 GPIO device model
  ppc/pnv: Add a pca9554 I2C device to powernv10-rainier
  ppc/pnv: Test pnv i2c master and connected devices

 MAINTAINERS |  10 +-
 hw/misc/Kconfig |   4 +
 hw/misc/meson.build |   1 +
 hw/misc/pca9552.c   |  58 ++-
 hw/misc/pca9554.c   | 328 
 hw/ppc/Kconfig  |   2 +
 hw/ppc/pnv.c|  67 +++-
 hw/ppc/pnv_i2c.c|  15 +-
 include/hw/misc/pca9552.h   |   3 +-
 include/hw/misc/pca9554.h   |  36 ++
 include/hw/misc/pca9554_regs.h  |  19 +
 include/hw/ppc/pnv.h|   1 +
 tests/qtest/meson.build |   1 +
 tests/qtest/pca9552-test.c  |   6 +-
 tests/qtest/pnv-host-i2c-test.c | 650 
 15 files changed, 1185 insertions(+), 16 deletions(-)
 create mode 100644 hw/misc/pca9554.c
 create mode 100644 include/hw/misc/pca9554.h
 create mode 100644 include/hw/misc/pca9554_regs.h
 create mode 100644 tests/qtest/pnv-host-i2c-test.c

-- 
2.31.1




[PATCH v6 1/9] misc/pca9552: Fix inverted input status

2023-11-27 Thread Glenn Miles
The pca9552 INPUT0 and INPUT1 registers are supposed to
hold the logical values of the LED pins.  A logical 0
should be seen in the INPUT0/1 registers for a pin when
its corresponding LSn bits are set to 0, which is also
the state needed for turning on an LED in a typical
usage scenario.  Existing code was doing the opposite
and setting INPUT0/1 bit to a 1 when the LSn bit was
set to 0, so this commit fixes that.
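
To make the intended mapping concrete, here is a standalone sketch that
derives one INPUT register from the two LS registers covering the same
eight pins.  It assumes the usual pca9552 layout of 2 configuration bits
per pin and 4 pins per LS register; input_from_ls() is a hypothetical
helper, and PWM configurations leave the bit untouched since the model
does not implement them yet.

```c
#include <assert.h>
#include <stdint.h>

/* Same encoding as the PCA9552_LED_* defines in the patch. */
enum { LED_ON = 0x0, LED_OFF = 0x1, LED_PWM0 = 0x2, LED_PWM1 = 0x3 };

/*
 * Hypothetical helper: compute an 8-bit INPUT register from the two
 * LS registers that configure the same eight pins.
 */
static uint8_t input_from_ls(uint8_t ls_lo, uint8_t ls_hi,
                             uint8_t prev_input)
{
    uint8_t input = prev_input;

    for (int pin = 0; pin < 8; pin++) {
        uint8_t ls = (pin < 4) ? ls_lo : ls_hi;
        uint8_t config = (ls >> ((pin % 4) * 2)) & 0x3;
        uint8_t mask = (uint8_t)(1u << pin);

        if (config == LED_ON) {
            input &= (uint8_t)~mask;  /* pin driven to 0V */
        } else if (config == LED_OFF) {
            input |= mask;            /* Hi-Z, pull-up reads as 1 */
        }
        /* LED_PWM0 / LED_PWM1: bit left unchanged in this sketch */
    }
    return input;
}
```

With both LS registers at their reset value of 0x55 every pin is off and
the INPUT register reads 0xFF; turning on the first pin (LS = 0x54) clears
bit 0, giving 0xFE, which matches the updated qtest expectations.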

Reviewed-by: Andrew Jeffery 
Signed-off-by: Glenn Miles 
---

No change from previous version

 hw/misc/pca9552.c  | 18 +-
 tests/qtest/pca9552-test.c |  6 +++---
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/hw/misc/pca9552.c b/hw/misc/pca9552.c
index fff19e369a..445f56a9e8 100644
--- a/hw/misc/pca9552.c
+++ b/hw/misc/pca9552.c
@@ -36,7 +36,10 @@ typedef struct PCA955xClass PCA955xClass;
 
 DECLARE_CLASS_CHECKERS(PCA955xClass, PCA955X,
TYPE_PCA955X)
-
+/*
+ * Note:  The LED_ON and LED_OFF configuration values for the PCA955X
+ *chips are the reverse of the PCA953X family of chips.
+ */
 #define PCA9552_LED_ON   0x0
 #define PCA9552_LED_OFF  0x1
 #define PCA9552_LED_PWM0 0x2
@@ -112,13 +115,18 @@ static void pca955x_update_pin_input(PCA955xState *s)
 
 switch (config) {
 case PCA9552_LED_ON:
-qemu_set_irq(s->gpio[i], 1);
-s->regs[input_reg] |= 1 << input_shift;
-break;
-case PCA9552_LED_OFF:
+/* Pin is set to 0V to turn on LED */
 qemu_set_irq(s->gpio[i], 0);
 s->regs[input_reg] &= ~(1 << input_shift);
 break;
+case PCA9552_LED_OFF:
+/*
+ * Pin is set to Hi-Z to turn off LED and
+ * pullup sets it to a logical 1.
+ */
+qemu_set_irq(s->gpio[i], 1);
+s->regs[input_reg] |= 1 << input_shift;
+break;
 case PCA9552_LED_PWM0:
 case PCA9552_LED_PWM1:
 /* TODO */
diff --git a/tests/qtest/pca9552-test.c b/tests/qtest/pca9552-test.c
index d80ed93cd3..ccca2b3d91 100644
--- a/tests/qtest/pca9552-test.c
+++ b/tests/qtest/pca9552-test.c
@@ -60,7 +60,7 @@ static void send_and_receive(void *obj, void *data, QGuestAllocator *alloc)
 g_assert_cmphex(value, ==, 0x55);
 
 value = i2c_get8(i2cdev, PCA9552_INPUT0);
-g_assert_cmphex(value, ==, 0x0);
+g_assert_cmphex(value, ==, 0xFF);
 
 pca9552_init(i2cdev);
 
@@ -68,13 +68,13 @@ static void send_and_receive(void *obj, void *data, QGuestAllocator *alloc)
 g_assert_cmphex(value, ==, 0x54);
 
 value = i2c_get8(i2cdev, PCA9552_INPUT0);
-g_assert_cmphex(value, ==, 0x01);
+g_assert_cmphex(value, ==, 0xFE);
 
 value = i2c_get8(i2cdev, PCA9552_LS3);
 g_assert_cmphex(value, ==, 0x54);
 
 value = i2c_get8(i2cdev, PCA9552_INPUT1);
-g_assert_cmphex(value, ==, 0x10);
+g_assert_cmphex(value, ==, 0xEF);
 }
 
 static void pca9552_register_nodes(void)
-- 
2.31.1




Re: [PATCH v1 2/2] hw/mem/cxl_type3: allocate more vectors for MSI-X

2023-11-27 Thread Davidlohr Bueso

On Mon, 27 Nov 2023, Hyeonggon Yoo wrote:


> commit 43efb0bfad2b ("hw/cxl/mbox: Wire up interrupts for background
> completion") enables notifying background command completion via MSI-X
> interrupt (vector number 9).
>
> However, the commit uses vector number 9 but the maximum number of
> entries is less thus resulting in error below. Fix it by passing
> nentries = 10 when calling msix_init_exclusive_bar().


Hmm yeah this was already set to 10 in Jonathan's tree, thanks for reporting.



[PATCH for-8.2] target/arm: Disable SME if SVE is disabled

2023-11-27 Thread Peter Maydell
There is no architectural requirement that SME implies SVE, but
our implementation currently assumes it. (FEAT_SME_FA64 does
imply SVE.) So if you try to run a CPU with eg "-cpu max,sve=off"
you quickly run into an assert when the guest tries to write to
SMCR_EL1:

#6  0x74b38e96 in __GI___assert_fail
(assertion=0x566e69cb "sm", file=0x566e5b24 
"../../target/arm/helper.c", line=6865, function=0x566e82f0 
<__PRETTY_FUNCTION__.31> "sve_vqm1_for_el_sm") at ./assert/assert.c:101
#7  0x55ee33aa in sve_vqm1_for_el_sm (env=0x57d291f0, el=2, 
sm=false) at ../../target/arm/helper.c:6865
#8  0x55ee3407 in sve_vqm1_for_el (env=0x57d291f0, el=2) at 
../../target/arm/helper.c:6871
#9  0x55ee3724 in smcr_write (env=0x57d291f0, ri=0x57da23b0, 
value=2147483663) at ../../target/arm/helper.c:6995
#10 0x55fd1dba in helper_set_cp_reg64 (env=0x57d291f0, 
rip=0x57da23b0, value=2147483663) at ../../target/arm/tcg/op_helper.c:839
#11 0x7fff60056781 in code_gen_buffer ()

Avoid this unsupported and slightly odd combination by
disabling SME when SVE is not present.

Cc: qemu-sta...@nongnu.org
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2005
Signed-off-by: Peter Maydell 
---
'-cpu sve=off,sme=on,sme_fa64=off' crashes in the same way, so just
turning off FA64 isn't sufficient.  Maybe we should support
SME-no-SVE, but for 8.2 at least turning off SME is better than
letting users hit an assertion.
---
 target/arm/cpu.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index 25e9d2ae7b8..0fe268ac785 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -1743,6 +1743,15 @@ void arm_cpu_finalize_features(ARMCPU *cpu, Error **errp)
 return;
 }
 
+/*
+ * FEAT_SME is not architecturally dependent on FEAT_SVE (unless
+ * FEAT_SME_FA64 is present). However our implementation currently
+ * assumes it, so if the user asked for sve=off then turn off SME also.
+ */
+if (cpu_isar_feature(aa64_sme, cpu) && !cpu_isar_feature(aa64_sve, cpu)) {
+object_property_set_bool(OBJECT(cpu), "sme", false, &error_abort);
+}
+
 arm_cpu_sme_finalize(cpu, &local_err);
 if (local_err != NULL) {
 error_propagate(errp, local_err);
-- 
2.34.1




[PATCH v6 1/3] hw/ppc: Add pnv nest pervasive common chiplet model

2023-11-27 Thread Chalapathi V
A POWER10 chip is divided into logical pieces called chiplets. Chiplets
are broadly divided into "core chiplets" (with the processor cores) and
"nest chiplets" (with everything else). Each chiplet has an attachment
to the pervasive bus (PIB) and has chiplet-specific registers. All nest
chiplets share a common basic set of registers. This model provides
the functionality of those common registers for the nest chiplets (Pervasive
Chiplet, PB Chiplet, PCI Chiplets, MC Chiplet, PAU Chiplets).

This commit implements the read/write functions of the chiplet control registers.
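
As a sketch of how such a control register block might be decoded, the
fragment below uses the CPLT_CONF0 offsets from the patch and the same
addr >> 3 conversion from byte address to register index.  It assumes
that the _OR and _CLEAR offsets act as set-bits and clear-bits aliases of
the base register, which the names suggest; ToyCtrlRegs and
toy_ctrl_write() are hypothetical names and this is not the patch's
implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets copied from the patch's CPLT_CONF0* definitions. */
#define CONF0        0x08
#define CONF0_OR     0x18
#define CONF0_CLEAR  0x28

/* Hypothetical register file holding a single config register. */
typedef struct ToyCtrlRegs { uint64_t cfg0; } ToyCtrlRegs;

/* Returns 0 on success, -1 for an offset this sketch does not model. */
static int toy_ctrl_write(ToyCtrlRegs *r, uint64_t addr, uint64_t val)
{
    switch (addr >> 3) {            /* byte address -> register index */
    case CONF0:
        r->cfg0 = val;              /* assumption: plain write */
        return 0;
    case CONF0_OR:
        r->cfg0 |= val;             /* assumption: set the given bits */
        return 0;
    case CONF0_CLEAR:
        r->cfg0 &= ~val;            /* assumption: clear the given bits */
        return 0;
    default:
        return -1;
    }
}
```

A caller would pass the byte offset within the region, for example
CONF0_OR << 3 to set bits without a read-modify-write cycle.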

Signed-off-by: Chalapathi V 
---
 include/hw/ppc/pnv_nest_pervasive.h |  36 +
 include/hw/ppc/pnv_xscom.h  |   3 +
 hw/ppc/pnv_nest_pervasive.c | 219 
 hw/ppc/meson.build  |   1 +
 4 files changed, 259 insertions(+)
 create mode 100644 include/hw/ppc/pnv_nest_pervasive.h
 create mode 100644 hw/ppc/pnv_nest_pervasive.c

diff --git a/include/hw/ppc/pnv_nest_pervasive.h 
b/include/hw/ppc/pnv_nest_pervasive.h
new file mode 100644
index 00..9f11531f52
--- /dev/null
+++ b/include/hw/ppc/pnv_nest_pervasive.h
@@ -0,0 +1,36 @@
+/*
+ * QEMU PowerPC nest pervasive common chiplet model
+ *
+ * Copyright (c) 2023, IBM Corporation.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef PPC_PNV_NEST_PERVASIVE_H
+#define PPC_PNV_NEST_PERVASIVE_H
+
+#define TYPE_PNV_NEST_PERVASIVE "pnv-nest-chiplet-pervasive"
+#define PNV_NEST_PERVASIVE(obj) OBJECT_CHECK(PnvNestChipletPervasive, (obj), TYPE_PNV_NEST_PERVASIVE)
+
+typedef struct PnvPervasiveCtrlRegs {
+#define CPLT_CTRL_SIZE 6
+uint64_t cplt_ctrl[CPLT_CTRL_SIZE];
+uint64_t cplt_cfg0;
+uint64_t cplt_cfg1;
+uint64_t cplt_stat0;
+uint64_t cplt_mask0;
+uint64_t ctrl_protect_mode;
+uint64_t ctrl_atomic_lock;
+} PnvPervasiveCtrlRegs;
+
+typedef struct PnvNestChipletPervasive {
+DeviceState parent;
+char *parent_obj_name;
+MemoryRegion xscom_ctrl_regs;
+PnvPervasiveCtrlRegs control_regs;
+} PnvNestChipletPervasive;
+
+#endif /*PPC_PNV_NEST_PERVASIVE_H */
diff --git a/include/hw/ppc/pnv_xscom.h b/include/hw/ppc/pnv_xscom.h
index f5becbab41..3e15706dec 100644
--- a/include/hw/ppc/pnv_xscom.h
+++ b/include/hw/ppc/pnv_xscom.h
@@ -170,6 +170,9 @@ struct PnvXScomInterfaceClass {
 #define PNV10_XSCOM_XIVE2_BASE 0x2010800
 #define PNV10_XSCOM_XIVE2_SIZE 0x400
 
+#define PNV10_XSCOM_N1_CHIPLET_CTRL_REGS_BASE  0x300
+#define PNV10_XSCOM_CHIPLET_CTRL_REGS_SIZE 0x400
+
 #define PNV10_XSCOM_PEC_NEST_BASE  0x3011800 /* index goes downwards ... */
 #define PNV10_XSCOM_PEC_NEST_SIZE  0x100
 
diff --git a/hw/ppc/pnv_nest_pervasive.c b/hw/ppc/pnv_nest_pervasive.c
new file mode 100644
index 00..0575f87e8f
--- /dev/null
+++ b/hw/ppc/pnv_nest_pervasive.c
@@ -0,0 +1,219 @@
+/*
+ * QEMU PowerPC nest pervasive common chiplet model
+ *
+ * Copyright (c) 2023, IBM Corporation.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "hw/qdev-properties.h"
+#include "hw/ppc/pnv.h"
+#include "hw/ppc/pnv_xscom.h"
+#include "hw/ppc/pnv_nest_pervasive.h"
+
+/*
+ * Status, configuration, and control of units in POWER chips is provided
+ * by the pervasive subsystem, which connects registers to the SCOM bus,
+ * which can be programmed by processor cores, other units on the chip,
+ * BMCs, or other POWER chips.
+ *
+ * A POWER10 chip is divided into logical pieces called chiplets. Chiplets
+ * are broadly divided into "core chiplets" (with the processor cores) and
+ * "nest chiplets" (with everything else). Each chiplet has an attachment
+ * to the pervasive bus (PIB) and has chiplet-specific registers.
+ * All nest chiplets have a common basic set of registers.
+ *
+ * This model provides the functionality of the common registers of the
+ * nest units (PB Chiplet, PCI Chiplets, MC Chiplet, PAU Chiplets).
+ *
+ * Currently this model provides the read/write functionality of the chiplet
+ * control scom registers.
+ */
+
+#define CPLT_CONF0               0x08
+#define CPLT_CONF0_OR            0x18
+#define CPLT_CONF0_CLEAR         0x28
+#define CPLT_CONF1               0x09
+#define CPLT_CONF1_OR            0x19
+#define CPLT_CONF1_CLEAR         0x29
+#define CPLT_STAT0               0x100
+#define CPLT_MASK0               0x101
+#define CPLT_PROTECT_MODE        0x3FE
+#define CPLT_ATOMIC_CLOCK        0x3FF
+
+static uint64_t pnv_chiplet_ctrl_read(void *opaque, hwaddr addr, unsigned size)
+{
+PnvNestChipletPervasive *nest_pervasive = PNV_NEST_PERVASIVE(opaque);
+int reg = addr >> 3;
+uint64_t val = ~0ull;
+
+/* CPLT_CTRL0 to CPLT_CTRL5 

[PATCH v6 2/3] hw/ppc: Add N1 chiplet model

2023-11-27 Thread Chalapathi V
The N1 chiplet handles the high-speed I/O traffic over PCIe and others.
The N1 chiplet consists of the PowerBus Fabric controller, the
nest Memory Management Unit, the chiplet control unit and more.

This commit creates an N1 chiplet model and initializes and realizes the
pervasive chiplet model where the chiplet control registers are implemented.

This commit also implements the read/write methods for the PowerBus scom
registers.

Signed-off-by: Chalapathi V 
---
 include/hw/ppc/pnv_n1_chiplet.h |  35 +++
 include/hw/ppc/pnv_xscom.h  |   6 ++
 hw/ppc/pnv_n1_chiplet.c | 171 
 hw/ppc/meson.build  |   1 +
 4 files changed, 213 insertions(+)
 create mode 100644 include/hw/ppc/pnv_n1_chiplet.h
 create mode 100644 hw/ppc/pnv_n1_chiplet.c

diff --git a/include/hw/ppc/pnv_n1_chiplet.h b/include/hw/ppc/pnv_n1_chiplet.h
new file mode 100644
index 00..3c42ada7f4
--- /dev/null
+++ b/include/hw/ppc/pnv_n1_chiplet.h
@@ -0,0 +1,35 @@
+/*
+ * QEMU PowerPC N1 chiplet model
+ *
+ * Copyright (c) 2023, IBM Corporation.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef PPC_PNV_N1_CHIPLET_H
+#define PPC_PNV_N1_CHIPLET_H
+
+#include "hw/ppc/pnv_nest_pervasive.h"
+
+#define TYPE_PNV_N1_CHIPLET "pnv-N1-chiplet"
+#define PNV_N1_CHIPLET(obj) OBJECT_CHECK(PnvN1Chiplet, (obj), TYPE_PNV_N1_CHIPLET)
+
+typedef struct pb_scom {
+uint64_t mode;
+uint64_t hp_mode2_curr;
+} pb_scom;
+
+typedef struct PnvN1Chiplet {
+DeviceState parent;
+MemoryRegion xscom_pb_eq_regs;
+MemoryRegion xscom_pb_es_regs;
+/* common pervasive chiplet unit */
+PnvNestChipletPervasive nest_pervasive;
+pb_scom eq[8];
+pb_scom es[4];
+} PnvN1Chiplet;
+#endif /*PPC_PNV_N1_CHIPLET_H */
diff --git a/include/hw/ppc/pnv_xscom.h b/include/hw/ppc/pnv_xscom.h
index 3e15706dec..535ae1dab0 100644
--- a/include/hw/ppc/pnv_xscom.h
+++ b/include/hw/ppc/pnv_xscom.h
@@ -173,6 +173,12 @@ struct PnvXScomInterfaceClass {
 #define PNV10_XSCOM_N1_CHIPLET_CTRL_REGS_BASE  0x300
 #define PNV10_XSCOM_CHIPLET_CTRL_REGS_SIZE 0x400
 
+#define PNV10_XSCOM_N1_PB_SCOM_EQ_BASE  0x3011000
+#define PNV10_XSCOM_N1_PB_SCOM_EQ_SIZE  0x200
+
+#define PNV10_XSCOM_N1_PB_SCOM_ES_BASE  0x3011300
+#define PNV10_XSCOM_N1_PB_SCOM_ES_SIZE  0x100
+
 #define PNV10_XSCOM_PEC_NEST_BASE  0x3011800 /* index goes downwards ... */
 #define PNV10_XSCOM_PEC_NEST_SIZE  0x100
 
diff --git a/hw/ppc/pnv_n1_chiplet.c b/hw/ppc/pnv_n1_chiplet.c
new file mode 100644
index 00..8e4c21dbf6
--- /dev/null
+++ b/hw/ppc/pnv_n1_chiplet.c
@@ -0,0 +1,171 @@
+/*
+ * QEMU PowerPC N1 chiplet model
+ *
+ * Copyright (c) 2023, IBM Corporation.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "hw/qdev-properties.h"
+#include "hw/ppc/pnv.h"
+#include "hw/ppc/pnv_xscom.h"
+#include "hw/ppc/pnv_n1_chiplet.h"
+#include "hw/ppc/pnv_nest_pervasive.h"
+
+/*
+ * The n1 chiplet contains chiplet control unit,
+ * PowerBus/RaceTrack/Bridge logic, nest Memory Management Unit(nMMU)
+ * and more.
+ *
+ * In this model Nest1 chiplet control registers are modelled via common
+ * nest pervasive model and few PowerBus racetrack registers are modelled.
+ */
+
+#define PB_SCOM_EQ0_HP_MODE2_CURR  0xe
+#define PB_SCOM_ES3_MODE   0x8a
+
+static uint64_t pnv_n1_chiplet_pb_scom_eq_read(void *opaque, hwaddr addr,
+  unsigned size)
+{
+PnvN1Chiplet *n1_chiplet = PNV_N1_CHIPLET(opaque);
+int reg = addr >> 3;
+uint64_t val = ~0ull;
+
+switch (reg) {
+case PB_SCOM_EQ0_HP_MODE2_CURR:
+val = n1_chiplet->eq[0].hp_mode2_curr;
+break;
+default:
+qemu_log_mask(LOG_UNIMP, "%s: Invalid xscom read at 0x%" PRIx32 "\n",
+  __func__, reg);
+}
+return val;
+}
+
+static void pnv_n1_chiplet_pb_scom_eq_write(void *opaque, hwaddr addr,
+   uint64_t val, unsigned size)
+{
+PnvN1Chiplet *n1_chiplet = PNV_N1_CHIPLET(opaque);
+int reg = addr >> 3;
+
+switch (reg) {
+case PB_SCOM_EQ0_HP_MODE2_CURR:
+n1_chiplet->eq[0].hp_mode2_curr = val;
+break;
+default:
+qemu_log_mask(LOG_UNIMP, "%s: Invalid xscom write at 0x%" PRIx32 "\n",
+  __func__, reg);
+}
+}
+
+static const MemoryRegionOps pnv_n1_chiplet_pb_scom_eq_ops = {
+.read = pnv_n1_chiplet_pb_scom_eq_read,
+.write = pnv_n1_chiplet_pb_scom_eq_write,
+.valid.min_access_size = 8,
+.valid.max_access_size = 8,
+.impl.min_access_size = 8,
+.impl.max_access_size = 8,
+.endianness = DEVICE_BIG_ENDIAN,
+};
+
+static uint64_t 

[PATCH v6 3/3] hw/ppc: N1 chiplet wiring

2023-11-27 Thread Chalapathi V
This part of the patchset connects the nest1 chiplet model to the p10 chip.

Signed-off-by: Chalapathi V 
---
 include/hw/ppc/pnv_chip.h |  2 ++
 hw/ppc/pnv.c  | 15 +++
 2 files changed, 17 insertions(+)

diff --git a/include/hw/ppc/pnv_chip.h b/include/hw/ppc/pnv_chip.h
index 0ab5c42308..9b06c8d87c 100644
--- a/include/hw/ppc/pnv_chip.h
+++ b/include/hw/ppc/pnv_chip.h
@@ -4,6 +4,7 @@
 #include "hw/pci-host/pnv_phb4.h"
 #include "hw/ppc/pnv_core.h"
 #include "hw/ppc/pnv_homer.h"
+#include "hw/ppc/pnv_n1_chiplet.h"
 #include "hw/ppc/pnv_lpc.h"
 #include "hw/ppc/pnv_occ.h"
 #include "hw/ppc/pnv_psi.h"
@@ -113,6 +114,7 @@ struct Pnv10Chip {
 PnvOCC   occ;
 PnvSBE   sbe;
 PnvHomer homer;
+PnvN1Chiplet n1_chiplet;
 
 uint32_t nr_quads;
 PnvQuad  *quads;
diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index 0297871bdd..6cf1f3319f 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -1680,6 +1680,8 @@ static void pnv_chip_power10_instance_init(Object *obj)
 object_initialize_child(obj, "occ",  &chip10->occ, TYPE_PNV10_OCC);
 object_initialize_child(obj, "sbe",  &chip10->sbe, TYPE_PNV10_SBE);
 object_initialize_child(obj, "homer", &chip10->homer, TYPE_PNV10_HOMER);
+object_initialize_child(obj, "n1_chiplet", &chip10->n1_chiplet,
+TYPE_PNV_N1_CHIPLET);
 
 chip->num_pecs = pcc->num_pecs;
 
@@ -1849,6 +1851,19 @@ static void pnv_chip_power10_realize(DeviceState *dev, 
Error **errp)
 memory_region_add_subregion(get_system_memory(), PNV10_HOMER_BASE(chip),
 &chip10->homer.regs);
 
+/* N1 chiplet */
+if (!qdev_realize(DEVICE(&chip10->n1_chiplet), NULL, errp)) {
+return;
+}
+pnv_xscom_add_subregion(chip, PNV10_XSCOM_N1_CHIPLET_CTRL_REGS_BASE,
+ &chip10->n1_chiplet.nest_pervasive.xscom_ctrl_regs);
+
+pnv_xscom_add_subregion(chip, PNV10_XSCOM_N1_PB_SCOM_EQ_BASE,
+   &chip10->n1_chiplet.xscom_pb_eq_regs);
+
+pnv_xscom_add_subregion(chip, PNV10_XSCOM_N1_PB_SCOM_ES_BASE,
+   &chip10->n1_chiplet.xscom_pb_es_regs);
+
 /* PHBs */
 pnv_chip_power10_phb_realize(chip, &local_err);
 if (local_err) {
-- 
2.31.1




[PATCH v6 0/3] pnv nest1 chiplet model

2023-11-27 Thread Chalapathi V
Hello,

Thank you for the review and suggestions on V5.

The suggestions and changes requested from V5 are addressed in V6.

Updates in Version 6 of this series are: 
1. The device-tree node added in QEMU is removed, as skiboot defines the
   device tree and QEMU should just follow it.
2. Renamed PnvPerv to PnvNestChipletPervasive in PATCH1 as the model provides
   the common pervasive registers of all nest chiplets.
3. Nest1_chiplet model in PATCH2 is renamed to N1_chiplet to avoid
   confusion that may come up later.

Hence the new qom-tree looks like below.
(qemu) info qom-tree 
/machine (powernv10-machine)
  /chip[0] (power10_v2.0-pnv-chip)
/n1_chiplet (pnv-N1-chiplet)
  /nest_pervasive_common (pnv-nest-chiplet-pervasive)
/xscom-n1_chiplet-control-regs[0] (memory-region)
  /xscom-n1_chiplet-pb-scom-eq-regs[0] (memory-region)
  /xscom-n1_chiplet-pb-scom-es-regs[0] (memory-region)

Patches overview in V6.
PATCH1: Create a common nest pervasive chiplet model with control chiplet scom
registers.
PATCH2: Create a N1 chiplet model and implement powerbus scom registers.
Connect common nest pervasive model to N1 chiplet model to define
chiplet control scoms for N1 chiplet.
PATCH3: Connect N1 chiplet model to p10 chip.

Test covered:
These changes were tested on single-socket and two-socket P10 machines.

Thank You,
Chalapathi

Chalapathi V (3):
  hw/ppc: Add pnv nest pervasive common chiplet model
  hw/ppc: Add N1 chiplet model
  hw/ppc: N1 chiplet wiring

 include/hw/ppc/pnv_chip.h   |   2 +
 include/hw/ppc/pnv_n1_chiplet.h |  35 +
 include/hw/ppc/pnv_nest_pervasive.h |  36 +
 include/hw/ppc/pnv_xscom.h  |   9 ++
 hw/ppc/pnv.c|  15 ++
 hw/ppc/pnv_n1_chiplet.c | 171 ++
 hw/ppc/pnv_nest_pervasive.c | 219 
 hw/ppc/meson.build  |   2 +
 8 files changed, 489 insertions(+)
 create mode 100644 include/hw/ppc/pnv_n1_chiplet.h
 create mode 100644 include/hw/ppc/pnv_nest_pervasive.h
 create mode 100644 hw/ppc/pnv_n1_chiplet.c
 create mode 100644 hw/ppc/pnv_nest_pervasive.c

-- 
2.31.1




[PULL 02/13] target/arm: Handle overflow in calculation of next timer tick

2023-11-27 Thread Peter Maydell
In commit edac4d8a168 back in 2015 when we added support for
the virtual timer offset CNTVOFF_EL2, we didn't correctly update
the timer-recalculation code that figures out when the timer
interrupt is next going to change state. We got it wrong in
two ways:
 * for the 0->1 transition, we didn't notice that gt->cval + offset
   can overflow a uint64_t
 * for the 1->0 transition, we didn't notice that the transition
   might now happen before the count rolls over, if offset > count

In the former case, we end up trying to set the next interrupt
for a time in the past, which results in QEMU hanging as the
timer fires continuously.

In the latter case, we would fail to update the interrupt
status when we are supposed to.

Fix the calculations in both cases.

The test case is Alex Bennée's from the bug report, and tests
the 0->1 transition overflow case.
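
As an illustration of the two cases above, here is a minimal stand-alone
sketch of the next-tick decision. It is not the QEMU code: the function
name next_transition() and its parameters are hypothetical, and the
overflow test is written out by hand instead of using a helper.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: return the counter value at which the timer
 * interrupt status next changes, or UINT64_MAX for "as far in the
 * future as possible".
 */
static uint64_t next_transition(bool istatus, uint64_t count,
                                uint64_t offset, uint64_t cval)
{
    uint64_t nexttick;

    if (istatus) {
        /*
         * 1->0 transition: (count - offset) rolls over to 0 when count
         * reaches offset, which is still ahead of us only if
         * offset > count.
         */
        return offset > count ? offset : UINT64_MAX;
    }

    /*
     * 0->1 transition: fires when count == cval + offset. Unsigned
     * addition wraps, so a sum smaller than an operand means overflow.
     */
    nexttick = cval + offset;
    if (nexttick < cval) {
        nexttick = UINT64_MAX;
    }
    return nexttick;
}
```

With offset = 1 and cval = UINT64_MAX (the values the test case programs)
the sum wraps to 0, which is the "time in the past" the old code ended up
scheduling; the sketch clamps it to UINT64_MAX instead.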

Fixes: edac4d8a168 ("target-arm: Add CNTVOFF_EL2")
Cc: qemu-sta...@nongnu.org
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/60
Signed-off-by: Alex Bennée 
Signed-off-by: Peter Maydell 
Reviewed-by: Richard Henderson 
Message-id: 20231120173506.3729884-1-peter.mayd...@linaro.org
Reviewed-by: Peter Maydell 
---
 target/arm/helper.c   | 25 ++--
 tests/tcg/aarch64/system/vtimer.c | 48 +++
 tests/tcg/aarch64/Makefile.softmmu-target |  7 +++-
 3 files changed, 75 insertions(+), 5 deletions(-)
 create mode 100644 tests/tcg/aarch64/system/vtimer.c

diff --git a/target/arm/helper.c b/target/arm/helper.c
index ff1970981ee..2746d3fdac8 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -2646,11 +2646,28 @@ static void gt_recalc_timer(ARMCPU *cpu, int timeridx)
 gt->ctl = deposit32(gt->ctl, 2, 1, istatus);
 
 if (istatus) {
-/* Next transition is when count rolls back over to zero */
-nexttick = UINT64_MAX;
+/*
+ * Next transition is when (count - offset) rolls back over to 0.
+ * If offset > count then this is when count == offset;
+ * if offset <= count then this is when count == offset + 2^64
+ * For the latter case we set nexttick to an "as far in future
+ * as possible" value and let the code below handle it.
+ */
+if (offset > count) {
+nexttick = offset;
+} else {
+nexttick = UINT64_MAX;
+}
 } else {
-/* Next transition is when we hit cval */
-nexttick = gt->cval + offset;
+/*
+ * Next transition is when (count - offset) == cval, i.e.
+ * when count == (cval + offset).
+ * If that would overflow, then again we set up the next interrupt
+ * for "as far in the future as possible" for the code below.
+ */
+if (uadd64_overflow(gt->cval, offset, &nexttick)) {
+nexttick = UINT64_MAX;
+}
 }
 /*
  * Note that the desired next expiry time might be beyond the
diff --git a/tests/tcg/aarch64/system/vtimer.c 
b/tests/tcg/aarch64/system/vtimer.c
new file mode 100644
index 000..42f2f7796c7
--- /dev/null
+++ b/tests/tcg/aarch64/system/vtimer.c
@@ -0,0 +1,48 @@
+/*
+ * Simple Virtual Timer Test
+ *
+ * Copyright (c) 2020 Linaro Ltd
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include <inttypes.h>
+#include <minilib.h>
+
+/* grabbed from Linux */
+#define __stringify_1(x...) #x
+#define __stringify(x...)   __stringify_1(x)
+
+#define read_sysreg(r) ({   \
+uint64_t __val; \
+asm volatile("mrs %0, " __stringify(r) : "=r" (__val)); \
+__val;  \
+})
+
+#define write_sysreg(r, v) do { \
+uint64_t __val = (uint64_t)(v); \
+asm volatile("msr " __stringify(r) ", %x0"  \
+ : : "rZ" (__val)); \
+} while (0)
+
+int main(void)
+{
+int i;
+
+ml_printf("VTimer Test\n");
+
+write_sysreg(cntvoff_el2, 1);
+write_sysreg(cntv_cval_el0, -1);
+write_sysreg(cntv_ctl_el0, 1);
+
+ml_printf("cntvoff_el2=%lx\n", read_sysreg(cntvoff_el2));
+ml_printf("cntv_cval_el0=%lx\n", read_sysreg(cntv_cval_el0));
+ml_printf("cntv_ctl_el0=%lx\n", read_sysreg(cntv_ctl_el0));
+
+/* Now read cval a few times */
+for (i = 0; i < 10; i++) {
+ml_printf("%d: cntv_cval_el0=%lx\n", i, read_sysreg(cntv_cval_el0));
+}
+
+return 0;
+}
diff --git a/tests/tcg/aarch64/Makefile.softmmu-target 
b/tests/tcg/aarch64/Makefile.softmmu-target
index 77c5018e02a..4b03ef602ea 100644
--- a/tests/tcg/aarch64/Makefile.softmmu-target
+++ b/tests/tcg/aarch64/Makefile.softmmu-target
@@ -45,7 +45,8 @@ TESTS+=memory-sve
 
 # Running
 QEMU_BASE_MACHINE=-M virt -cpu max -display none
-QEMU_OPTS+=$(QEMU_BASE_MACHINE) -semihosting-config 

[PULL 11/13] hw/ssi/xilinx_spips: fix an out of bound access

2023-11-27 Thread Peter Maydell
From: Frederic Konrad 

The spips, qspips, and zynqmp-qspips share the same realize function
(xilinx_spips_realize) and initialize their io memory region with different
mmio_ops passed through the class.  The size of the memory region is set to
the largest area (0x200 bytes for zynqmp-qspips) thus it is possible to write
out of s->regs[addr] in xilinx_spips_write for spips and qspips.

This fixes that wrong behavior.
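
To make the problem concrete, here is a small self-contained sketch of the
idea, assuming a per-class region size like the reg_size field added by
this patch. ModelClass, Model and model_write() are hypothetical stand-ins
and not the QEMU types; in QEMU the memory core does the range check based
on the size given to memory_region_init_io().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SPIPS_R_MAX        (0x100 / 4)   /* SPIPS, QSPIPS */
#define ZYNQMP_SPIPS_R_MAX (0x200 / 4)   /* ZynqMP QSPIPS */

/* Hypothetical stand-in for the device class. */
typedef struct {
    uint64_t reg_size;   /* bytes of MMIO space this variant exposes */
} ModelClass;

/* Hypothetical stand-in for the smaller device's state. */
typedef struct {
    const ModelClass *klass;
    uint32_t regs[SPIPS_R_MAX];
} Model;

/*
 * Accept a 32-bit register write only if it falls inside the region
 * size declared by the class. With the region sized for the largest
 * variant (0x200), offsets 0x100..0x1ff would index past regs[].
 */
static bool model_write(Model *m, uint64_t addr, uint32_t value)
{
    if (addr >= m->klass->reg_size) {
        return false;
    }
    m->regs[addr >> 2] = value;
    return true;
}
```

Sizing the region per class means the out-of-range offsets never reach the
write handler at all, so the assert in xilinx_spips_write() only documents
the invariant.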

Reviewed-by: Luc Michel 
Signed-off-by: Frederic Konrad 
Reviewed-by: Francisco Iglesias 
Message-id: 20231124143505.1493184-2-fkon...@amd.com
Signed-off-by: Peter Maydell 
---
 include/hw/ssi/xilinx_spips.h | 3 +++
 hw/ssi/xilinx_spips.c | 7 ++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/hw/ssi/xilinx_spips.h b/include/hw/ssi/xilinx_spips.h
index 1386d5ac8fe..7a754bf67a2 100644
--- a/include/hw/ssi/xilinx_spips.h
+++ b/include/hw/ssi/xilinx_spips.h
@@ -33,7 +33,9 @@
 
 typedef struct XilinxSPIPS XilinxSPIPS;
 
+/* For SPIPS, QSPIPS.  */
 #define XLNX_SPIPS_R_MAX(0x100 / 4)
+/* For ZYNQMP_QSPIPS.  */
 #define XLNX_ZYNQMP_SPIPS_R_MAX (0x200 / 4)
 
 /* Bite off 4k chunks at a time */
@@ -125,6 +127,7 @@ struct XilinxSPIPSClass {
 SysBusDeviceClass parent_class;
 
 const MemoryRegionOps *reg_ops;
+uint64_t reg_size;
 
 uint32_t rx_fifo_size;
 uint32_t tx_fifo_size;
diff --git a/hw/ssi/xilinx_spips.c b/hw/ssi/xilinx_spips.c
index a3955c6c50c..0bdfad7e2e5 100644
--- a/hw/ssi/xilinx_spips.c
+++ b/hw/ssi/xilinx_spips.c
@@ -973,6 +973,8 @@ static void xilinx_spips_write(void *opaque, hwaddr addr,
 
 DB_PRINT_L(0, "addr=" HWADDR_FMT_plx " = %x\n", addr, (unsigned)value);
 addr >>= 2;
+assert(addr < XLNX_SPIPS_R_MAX);
+
 switch (addr) {
 case R_CONFIG:
 mask = ~(R_CONFIG_RSVD | MAN_START_COM);
@@ -1299,7 +1301,7 @@ static void xilinx_spips_realize(DeviceState *dev, Error 
**errp)
 }
 
 memory_region_init_io(&s->iomem, OBJECT(s), xsc->reg_ops, s,
-  "spi", XLNX_ZYNQMP_SPIPS_R_MAX * 4);
+  "spi", xsc->reg_size);
 sysbus_init_mmio(sbd, &s->iomem);
 
 s->irqline = -1;
@@ -1435,6 +1437,7 @@ static void xilinx_qspips_class_init(ObjectClass *klass, 
void * data)
 
 dc->realize = xilinx_qspips_realize;
 xsc->reg_ops = _ops;
+xsc->reg_size = XLNX_SPIPS_R_MAX * 4;
 xsc->rx_fifo_size = RXFF_A_Q;
 xsc->tx_fifo_size = TXFF_A_Q;
 }
@@ -1450,6 +1453,7 @@ static void xilinx_spips_class_init(ObjectClass *klass, 
void *data)
 dc->vmsd = _xilinx_spips;
 
 xsc->reg_ops = _ops;
+xsc->reg_size = XLNX_SPIPS_R_MAX * 4;
 xsc->rx_fifo_size = RXFF_A;
 xsc->tx_fifo_size = TXFF_A;
 }
@@ -1464,6 +1468,7 @@ static void xlnx_zynqmp_qspips_class_init(ObjectClass 
*klass, void * data)
 dc->vmsd = _xlnx_zynqmp_qspips;
 device_class_set_props(dc, xilinx_zynqmp_qspips_properties);
 xsc->reg_ops = _zynqmp_qspips_ops;
+xsc->reg_size = XLNX_ZYNQMP_SPIPS_R_MAX * 4;
 xsc->rx_fifo_size = RXFF_A_Q;
 xsc->tx_fifo_size = TXFF_A_Q;
 }
-- 
2.34.1




Re: [PATCH] RISC-V: Increase max vlen to 4096

2023-11-27 Thread Patrick O'Neill

Hi Phil,

On 11/23/23 02:21, Philippe Mathieu-Daudé wrote:

Hi Patrick,

On 23/11/23 01:17, Patrick O'Neill wrote:

QEMU currently limits the max vlenb to 1024. GCC sets the upper bound
to 4096 [1]. There doesn't seem to be an upper bound set by the spec [2]
so this patch just changes QEMU to match GCC's upper bound.

[1] 
https://github.com/gcc-mirror/gcc/blob/5d2a360f0a541646abb11efdbabc33c6a04de7ee/gcc/testsuite/gcc.target/riscv/rvv/base/zvl-unimplemented-2.c#L4

[2] https://github.com/riscv/riscv-v-spec/issues/204

Signed-off-by: Patrick O'Neill 
---
Tested by applying to QEMU v8.1.2 and running the GCC testsuite in QEMU
user mode with rv64gcv_zvl4096b. Failures are somewhat reasonable and on
first inspection appear to be in the same ballpark as failures for
rv64gcv_zvl1024b. Since I used tip-of-tree GCC I'm expecting those
failures to be GCC-caused & from skimming the debug log they appear to
be.
---
  target/riscv/cpu.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/riscv/cpu.h b/target/riscv/cpu.h
index 6ea22e0eea..2ff3a72fc0 100644
--- a/target/riscv/cpu.h
+++ b/target/riscv/cpu.h
@@ -97,7 +97,7 @@ typedef enum {
  #include "debug.h"
  #endif

-#define RV_VLEN_MAX 1024
+#define RV_VLEN_MAX 4096


This seems to break the "cpu/vector" field migration. Maybe we don't
care, but this should be clarified in the commit description.

I wasn't aware of that (this is actually my first patch to qemu!). Do
you have a pointer to more information about the migration so I can
write an appropriate blurb/understand what the migration was/did?

Thanks,
Patrick


Regards,

Phil.





[PULL 05/13] hw/virtio: Add VirtioPCIDeviceTypeInfo::instance_finalize field

2023-11-27 Thread Peter Maydell
From: Philippe Mathieu-Daudé 

The VirtioPCIDeviceTypeInfo structure, added in commit a4ee4c8baa
("virtio: Helper for registering virtio device types") got extended
in commit 8ea90ee690 ("virtio: add class_size") with the @class_size
field. Do similarly with the @instance_finalize field.

Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20231121174051.63038-2-phi...@linaro.org
Reviewed-by: Peter Maydell 
Signed-off-by: Peter Maydell 
---
 include/hw/virtio/virtio-pci.h | 1 +
 hw/virtio/virtio-pci.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/hw/virtio/virtio-pci.h b/include/hw/virtio/virtio-pci.h
index 5a3f182f998..59d88018c16 100644
--- a/include/hw/virtio/virtio-pci.h
+++ b/include/hw/virtio/virtio-pci.h
@@ -246,6 +246,7 @@ typedef struct VirtioPCIDeviceTypeInfo {
 size_t instance_size;
 size_t class_size;
 void (*instance_init)(Object *obj);
+void (*instance_finalize)(Object *obj);
 void (*class_init)(ObjectClass *klass, void *data);
 InterfaceInfo *interfaces;
 } VirtioPCIDeviceTypeInfo;
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index 205dbf24fb1..e4338795423 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -2391,6 +2391,7 @@ void virtio_pci_types_register(const 
VirtioPCIDeviceTypeInfo *t)
 .parent= t->parent ? t->parent : TYPE_VIRTIO_PCI,
 .instance_size = t->instance_size,
 .instance_init = t->instance_init,
+.instance_finalize = t->instance_finalize,
 .class_size= t->class_size,
 .abstract  = true,
 .interfaces= t->interfaces,
-- 
2.34.1




[PULL 13/13] hw/dma/xlnx_csu_dma: don't throw guest errors when stopping the SRC DMA

2023-11-27 Thread Peter Maydell
From: Frederic Konrad 

UG1087 states for the source channel that: if SIZE is programmed to 0, and the
DMA is started, the interrupts DONE and MEM_DONE will be asserted.

This implies that it is allowed for the guest to stop the source DMA by writing
a size of 0 to the SIZE register, so remove the LOG_GUEST_ERROR in that case.

While at it remove the comment marking the SIZE register as write-only.
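
The resulting condition can be sketched as a pure function. This is only
an illustration: restart_is_guest_error() is a hypothetical name, and
SIZE_MASK is derived here from FIELD(SIZE, SIZE, 2, 27), i.e. 27 bits
starting at bit 2.

```c
#include <stdbool.h>
#include <stdint.h>

/* 27-bit field starting at bit 2: ((1 << 27) - 1) << 2 */
#define SIZE_MASK 0x1ffffffcu

/*
 * A write to SIZE while a transfer is pending is only reported when it
 * is not a "stop" request: a size of 0 on the source channel is allowed.
 */
static bool restart_is_guest_error(uint32_t cur_size, uint64_t val,
                                   bool is_dst)
{
    uint64_t size = val & SIZE_MASK;

    return cur_size != 0 && (size != 0 || is_dst);
}
```

Writing 0 on the destination channel while it is running is still
reported, since the UG1087 statement quoted above is about the source
channel only.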

See: 
https://docs.xilinx.com/r/en-US/ug1087-zynq-ultrascale-registers/CSUDMA_SRC_SIZE-CSUDMA-Register

Signed-off-by: Frederic Konrad 
Reviewed-by: Francisco Iglesias 
Message-id: 20231124143505.1493184-4-fkon...@amd.com
Signed-off-by: Peter Maydell 
---
 hw/dma/xlnx_csu_dma.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/hw/dma/xlnx_csu_dma.c b/hw/dma/xlnx_csu_dma.c
index 531013f35aa..bc1505aade7 100644
--- a/hw/dma/xlnx_csu_dma.c
+++ b/hw/dma/xlnx_csu_dma.c
@@ -39,7 +39,7 @@
 REG32(ADDR, 0x0)
 FIELD(ADDR, ADDR, 2, 30) /* wo */
 REG32(SIZE, 0x4)
-FIELD(SIZE, SIZE, 2, 27) /* wo */
+FIELD(SIZE, SIZE, 2, 27)
 FIELD(SIZE, LAST_WORD, 0, 1) /* rw, only exists in SRC */
 REG32(STATUS, 0x8)
 FIELD(STATUS, DONE_CNT, 13, 3) /* wtc */
@@ -335,10 +335,14 @@ static uint64_t addr_pre_write(RegisterInfo *reg, 
uint64_t val)
 static uint64_t size_pre_write(RegisterInfo *reg, uint64_t val)
 {
 XlnxCSUDMA *s = XLNX_CSU_DMA(reg->opaque);
+uint64_t size = val & R_SIZE_SIZE_MASK;
 
 if (s->regs[R_SIZE] != 0) {
-qemu_log_mask(LOG_GUEST_ERROR,
-  "%s: Starting DMA while already running.\n", __func__);
+if (size || s->is_dst) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: Starting DMA while already running.\n",
+  __func__);
+}
 }
 
 if (!s->is_dst) {
@@ -346,7 +350,7 @@ static uint64_t size_pre_write(RegisterInfo *reg, uint64_t 
val)
 }
 
 /* Size is word aligned */
-return val & R_SIZE_SIZE_MASK;
+return size;
 }
 
 static uint64_t size_post_read(RegisterInfo *reg, uint64_t val)
-- 
2.34.1




[PULL 04/13] hw/net/can/xlnx-zynqmp: Avoid underflow while popping RX FIFO

2023-11-27 Thread Peter Maydell
From: Philippe Mathieu-Daudé 

Per https://docs.xilinx.com/r/en-US/ug1085-zynq-ultrascale-trm/Message-Format

  Message Format

  The same message format is used for RXFIFO, TXFIFO, and TXHPB.
  Each message includes four words (16 bytes). Software must read
  and write all four words regardless of the actual number of data
  bytes and valid fields in the message.

There is no mention in this reference manual about what the
hardware does when not all four words are read. To fix the
reported underflow behavior, I chose to fill the 4 frame data
registers when the first register (ID) is accessed, which is how
I expect the hardware to behave.
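
A minimal model of that behavior, for illustration only. RxModel and the
rx_*() helpers are hypothetical and use a plain ring buffer rather than
QEMU's Fifo32; the FIFO depth is arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>

#define FRAME_WORDS 4    /* ID, DLC, DATA1, DATA2 */
#define FIFO_WORDS  64   /* arbitrary depth for this sketch */

typedef struct {
    uint32_t fifo[FIFO_WORDS];
    unsigned head;                 /* index of the oldest word */
    unsigned used;                 /* number of queued words */
    uint32_t frame[FRAME_WORDS];   /* latched ID/DLC/DATA1/DATA2 */
    bool underflow;                /* models the RXUFLW status bit */
} RxModel;

static void rx_push(RxModel *m, uint32_t word)
{
    if (m->used < FIFO_WORDS) {
        m->fifo[(m->head + m->used) % FIFO_WORDS] = word;
        m->used++;
    }
}

static uint32_t rx_pop(RxModel *m)
{
    uint32_t word = m->fifo[m->head];

    m->head = (m->head + 1) % FIFO_WORDS;
    m->used--;
    return word;
}

/* Read of the ID register: latch a whole frame, or flag underflow. */
static uint32_t rx_read_id(RxModel *m, uint32_t val)
{
    if (m->used < FRAME_WORDS) {
        m->underflow = true;
        return val;   /* latched registers keep their old contents */
    }
    for (unsigned i = 0; i < FRAME_WORDS; i++) {
        m->frame[i] = rx_pop(m);
    }
    return m->frame[0];
}
```

Because the FIFO is only popped in units of four words, a guest that reads
DLC or DATA registers on their own can no longer drain it below zero.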

Reported-by: Qiang Liu 
Signed-off-by: Philippe Mathieu-Daudé 
Reviewed-by: Francisco Iglesias 
Reviewed-by: Vikram Garhwal 
Message-id: 20231124183325.95392-3-phi...@linaro.org
Fixes: 98e5d7a2b7 ("hw/net/can: Introduce Xilinx ZynqMP CAN controller")
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1427
Signed-off-by: Peter Maydell 
Reviewed-by: Peter Maydell 
---
 hw/net/can/xlnx-zynqmp-can.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/hw/net/can/xlnx-zynqmp-can.c b/hw/net/can/xlnx-zynqmp-can.c
index 1f1c686479c..f60e480c3ab 100644
--- a/hw/net/can/xlnx-zynqmp-can.c
+++ b/hw/net/can/xlnx-zynqmp-can.c
@@ -778,14 +778,18 @@ static void update_rx_fifo(XlnxZynqMPCANState *s, const 
qemu_can_frame *frame)
 }
 }
 
-static uint64_t can_rxfifo_pre_read(RegisterInfo *reg, uint64_t val)
+static uint64_t can_rxfifo_post_read_id(RegisterInfo *reg, uint64_t val)
 {
 XlnxZynqMPCANState *s = XLNX_ZYNQMP_CAN(reg->opaque);
+unsigned used = fifo32_num_used(&s->rx_fifo);
 
-if (!fifo32_is_empty(&s->rx_fifo)) {
-val = fifo32_pop(&s->rx_fifo);
-} else {
+if (used < CAN_FRAME_SIZE) {
 ARRAY_FIELD_DP32(s->regs, INTERRUPT_STATUS_REGISTER, RXUFLW, 1);
+} else {
+val = s->regs[R_RXFIFO_ID] = fifo32_pop(&s->rx_fifo);
+s->regs[R_RXFIFO_DLC] = fifo32_pop(&s->rx_fifo);
+s->regs[R_RXFIFO_DATA1] = fifo32_pop(&s->rx_fifo);
+s->regs[R_RXFIFO_DATA2] = fifo32_pop(&s->rx_fifo);
 }
 
 can_update_irq(s);
@@ -946,14 +950,11 @@ static const RegisterAccessInfo can_regs_info[] = {
 .post_write = can_tx_post_write,
 },{ .name = "RXFIFO_ID",  .addr = A_RXFIFO_ID,
 .ro = 0x,
-.post_read = can_rxfifo_pre_read,
+.post_read = can_rxfifo_post_read_id,
 },{ .name = "RXFIFO_DLC",  .addr = A_RXFIFO_DLC,
 .rsvd = 0xfff,
-.post_read = can_rxfifo_pre_read,
 },{ .name = "RXFIFO_DATA1",  .addr = A_RXFIFO_DATA1,
-.post_read = can_rxfifo_pre_read,
 },{ .name = "RXFIFO_DATA2",  .addr = A_RXFIFO_DATA2,
-.post_read = can_rxfifo_pre_read,
 },{ .name = "AFR",  .addr = A_AFR,
 .rsvd = 0xfff0,
 .post_write = can_filter_enable_post_write,
-- 
2.34.1




[PULL 06/13] hw/virtio: Free VirtIOIOMMUPCI::vdev.reserved_regions[] on finalize()

2023-11-27 Thread Peter Maydell
From: Philippe Mathieu-Daudé 

Commit 0be6bfac62 ("qdev: Implement variable length array properties")
added the DEFINE_PROP_ARRAY() macro with the following comment:

  * It is the responsibility of the device deinit code to free the
  * @_arrayfield memory.

Commit 8077b8e549 added:

  DEFINE_PROP_ARRAY("reserved-regions", VirtIOIOMMUPCI,
vdev.nb_reserved_regions, vdev.reserved_regions,
qdev_prop_reserved_region, ReservedRegion),

but forgot to free the 'vdev.reserved_regions' array. Do it in the
instance_finalize() handler.
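
The ownership rule can be sketched outside QEMU as follows. This is an
assumption-laden illustration: Gamepad and the gamepad_*() functions are
hypothetical, and plain malloc()/free() stand in for the GLib allocator
that the real property code uses.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical device state with a variable-length array property. */
typedef struct {
    uint32_t num_buttons;   /* element count, set with the array */
    uint32_t *keycodes;     /* heap array owned by the device */
} Gamepad;

/* Stands in for the property setter: allocates the array. */
static int gamepad_set_keycodes(Gamepad *g, const uint32_t *codes,
                                uint32_t n)
{
    uint32_t *copy = malloc(n * sizeof(*copy));

    if (!copy) {
        return -1;
    }
    memcpy(copy, codes, n * sizeof(*copy));
    free(g->keycodes);
    g->keycodes = copy;
    g->num_buttons = n;
    return 0;
}

/* Stands in for instance_finalize: the device frees what it owns. */
static void gamepad_finalize(Gamepad *g)
{
    free(g->keycodes);
    g->keycodes = NULL;
    g->num_buttons = 0;
}
```

The allocation happens in generic property code, but nothing generic knows
when the device is done with the array, which is why the free has to live
in the device's own finalize handler.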

Cc: qemu-sta...@nongnu.org
Fixes: 8077b8e549 ("virtio-iommu-pci: Add array of Interval properties") # 
v5.1.0+
Signed-off-by: Philippe Mathieu-Daudé 
Reviewed-by: Eric Auger 
Message-id: 20231121174051.63038-3-phi...@linaro.org
Reviewed-by: Peter Maydell 
Signed-off-by: Peter Maydell 
---
 hw/virtio/virtio-iommu-pci.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/hw/virtio/virtio-iommu-pci.c b/hw/virtio/virtio-iommu-pci.c
index 9459fbf6edf..cbdfe4c591c 100644
--- a/hw/virtio/virtio-iommu-pci.c
+++ b/hw/virtio/virtio-iommu-pci.c
@@ -95,10 +95,18 @@ static void virtio_iommu_pci_instance_init(Object *obj)
 TYPE_VIRTIO_IOMMU);
 }
 
+static void virtio_iommu_pci_instance_finalize(Object *obj)
+{
+VirtIOIOMMUPCI *dev = VIRTIO_IOMMU_PCI(obj);
+
+g_free(dev->vdev.prop_resv_regions);
+}
+
 static const VirtioPCIDeviceTypeInfo virtio_iommu_pci_info = {
 .generic_name  = TYPE_VIRTIO_IOMMU_PCI,
 .instance_size = sizeof(VirtIOIOMMUPCI),
 .instance_init = virtio_iommu_pci_instance_init,
+.instance_finalize = virtio_iommu_pci_instance_finalize,
 .class_init= virtio_iommu_pci_class_init,
 };
 
-- 
2.34.1




[PULL 00/13] target-arm queue

2023-11-27 Thread Peter Maydell
Hi; here are some more arm bug fixes for rc2. Nothing
earth-shakingly important here, I think.

thanks
-- PMM

The following changes since commit 4705fc0c8511d073bee4751c3c974aab2b10a970:

  Merge tag 'pull-for-8.2-fixes-231123-1' of https://gitlab.com/stsquad/qemu 
into staging (2023-11-24 08:00:18 -0500)

are available in the Git repository at:

  https://git.linaro.org/people/pmaydell/qemu-arm.git 
tags/pull-target-arm-20231127

for you to fetch changes up to 1ee80592bf24eabef77e2260a86d9358b54c08fd:

  hw/dma/xlnx_csu_dma: don't throw guest errors when stopping the SRC DMA 
(2023-11-27 17:02:04 +)


target-arm queue:
 * Set IL bit for pauth, SVE access, BTI trap syndromes
 * Handle overflow in calculation of next timer tick
 * hw/net/can/xlnx-zynqmp: Avoid underflow when popping FIFOs
 * Various devices: Free array property memory on device finalize
 * hw/ssi/xilinx_spips: fix an out of bound access
 * hw/misc, hw/ssi: Fix some URLs for AMD / Xilinx models
 * hw/dma/xlnx_csu_dma: don't throw guest errors when stopping the SRC DMA


Frederic Konrad (3):
  hw/ssi/xilinx_spips: fix an out of bound access
  hw/misc, hw/ssi: Fix some URLs for AMD / Xilinx models
  hw/dma/xlnx_csu_dma: don't throw guest errors when stopping the SRC DMA

Peter Maydell (2):
  target/arm: Set IL bit for pauth, SVE access, BTI trap syndromes
  target/arm: Handle overflow in calculation of next timer tick

Philippe Mathieu-Daudé (8):
  hw/net/can/xlnx-zynqmp: Avoid underflow while popping TX FIFOs
  hw/net/can/xlnx-zynqmp: Avoid underflow while popping RX FIFO
  hw/virtio: Add VirtioPCIDeviceTypeInfo::instance_finalize field
  hw/virtio: Free VirtIOIOMMUPCI::vdev.reserved_regions[] on finalize()
  hw/misc/mps2-scc: Free MPS2SCC::oscclk[] array on finalize()
  hw/nvram/xlnx-efuse: Free XlnxEFuse::ro_bits[] array on finalize()
  hw/nvram/xlnx-efuse-ctrl: Free XlnxVersalEFuseCtrl[] "pg0-lock" array
  hw/input/stellaris_gamepad: Free StellarisGamepad::keycodes[] array

 include/hw/misc/xlnx-versal-cframe-reg.h   |  2 +-
 include/hw/misc/xlnx-versal-cfu.h  |  2 +-
 include/hw/misc/xlnx-versal-pmc-iou-slcr.h |  2 +-
 include/hw/ssi/xilinx_spips.h  |  3 ++
 include/hw/ssi/xlnx-versal-ospi.h  |  2 +-
 include/hw/virtio/virtio-pci.h |  1 +
 target/arm/syndrome.h  |  6 +--
 hw/dma/xlnx_csu_dma.c  | 14 ---
 hw/input/stellaris_gamepad.c   |  8 
 hw/misc/mps2-scc.c |  8 
 hw/net/can/xlnx-zynqmp-can.c   | 67 +-
 hw/nvram/xlnx-efuse.c  |  8 
 hw/nvram/xlnx-versal-efuse-ctrl.c  |  8 
 hw/ssi/xilinx_spips.c  |  7 +++-
 hw/virtio/virtio-iommu-pci.c   |  8 
 hw/virtio/virtio-pci.c |  1 +
 target/arm/helper.c| 25 +--
 tests/tcg/aarch64/system/vtimer.c  | 48 +
 tests/tcg/aarch64/Makefile.softmmu-target  |  7 +++-
 19 files changed, 198 insertions(+), 29 deletions(-)
 create mode 100644 tests/tcg/aarch64/system/vtimer.c



[PULL 12/13] hw/misc, hw/ssi: Fix some URLs for AMD / Xilinx models

2023-11-27 Thread Peter Maydell
From: Frederic Konrad 

It seems that the URLs changed a bit, and the old ones now trigger an error.
Fix the URLs so the documentation can be reached again.

Signed-off-by: Frederic Konrad 
Reviewed-by: Francisco Iglesias 
Message-id: 20231124143505.1493184-3-fkon...@amd.com
Signed-off-by: Peter Maydell 
---
 include/hw/misc/xlnx-versal-cframe-reg.h   | 2 +-
 include/hw/misc/xlnx-versal-cfu.h  | 2 +-
 include/hw/misc/xlnx-versal-pmc-iou-slcr.h | 2 +-
 include/hw/ssi/xlnx-versal-ospi.h  | 2 +-
 hw/dma/xlnx_csu_dma.c  | 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/hw/misc/xlnx-versal-cframe-reg.h 
b/include/hw/misc/xlnx-versal-cframe-reg.h
index a14fbd7fe45..0091505246f 100644
--- a/include/hw/misc/xlnx-versal-cframe-reg.h
+++ b/include/hw/misc/xlnx-versal-cframe-reg.h
@@ -12,7 +12,7 @@
  * 
https://www.xilinx.com/support/documentation/architecture-manuals/am011-versal-acap-trm.pdf
  *
  * [2] Versal ACAP Register Reference,
- * 
https://www.xilinx.com/htmldocs/registers/am012/am012-versal-register-reference.html
+ * 
https://docs.xilinx.com/r/en-US/am012-versal-register-reference/CFRAME_REG-Module
  */
 #ifndef HW_MISC_XLNX_VERSAL_CFRAME_REG_H
 #define HW_MISC_XLNX_VERSAL_CFRAME_REG_H
diff --git a/include/hw/misc/xlnx-versal-cfu.h 
b/include/hw/misc/xlnx-versal-cfu.h
index 86fb8410538..be62bab8c8c 100644
--- a/include/hw/misc/xlnx-versal-cfu.h
+++ b/include/hw/misc/xlnx-versal-cfu.h
@@ -12,7 +12,7 @@
  * 
https://www.xilinx.com/support/documentation/architecture-manuals/am011-versal-acap-trm.pdf
  *
  * [2] Versal ACAP Register Reference,
- * 
https://www.xilinx.com/htmldocs/registers/am012/am012-versal-register-reference.html
+ * 
https://docs.xilinx.com/r/en-US/am012-versal-register-reference/CFU_CSR-Module
  */
 #ifndef HW_MISC_XLNX_VERSAL_CFU_APB_H
 #define HW_MISC_XLNX_VERSAL_CFU_APB_H
diff --git a/include/hw/misc/xlnx-versal-pmc-iou-slcr.h 
b/include/hw/misc/xlnx-versal-pmc-iou-slcr.h
index f7d24c93c41..0c4a4fd66d9 100644
--- a/include/hw/misc/xlnx-versal-pmc-iou-slcr.h
+++ b/include/hw/misc/xlnx-versal-pmc-iou-slcr.h
@@ -34,7 +34,7 @@
  * 
https://www.xilinx.com/support/documentation/architecture-manuals/am011-versal-acap-trm.pdf
  *
  * [2] Versal ACAP Register Reference,
- * 
https://www.xilinx.com/html_docs/registers/am012/am012-versal-register-reference.html#mod___pmc_iop_slcr.html
+ * 
https://docs.xilinx.com/r/en-US/am012-versal-register-reference/PMC_IOP_SLCR-Module
  *
  * QEMU interface:
  * + sysbus MMIO region 0: MemoryRegion for the device's registers
diff --git a/include/hw/ssi/xlnx-versal-ospi.h 
b/include/hw/ssi/xlnx-versal-ospi.h
index 5d131d351d2..4ac975aa2fd 100644
--- a/include/hw/ssi/xlnx-versal-ospi.h
+++ b/include/hw/ssi/xlnx-versal-ospi.h
@@ -34,7 +34,7 @@
  * 
https://www.xilinx.com/support/documentation/architecture-manuals/am011-versal-acap-trm.pdf
  *
  * [2] Versal ACAP Register Reference,
- * 
https://www.xilinx.com/html_docs/registers/am012/am012-versal-register-reference.html#mod___ospi.html
+ * 
https://docs.xilinx.com/r/en-US/am012-versal-register-reference/OSPI-Module
  *
  *
  * QEMU interface:
diff --git a/hw/dma/xlnx_csu_dma.c b/hw/dma/xlnx_csu_dma.c
index e89089821a3..531013f35aa 100644
--- a/hw/dma/xlnx_csu_dma.c
+++ b/hw/dma/xlnx_csu_dma.c
@@ -33,7 +33,7 @@
 
 /*
  * Ref: UG1087 (v1.7) February 8, 2019
- * 
https://www.xilinx.com/html_docs/registers/ug1087/ug1087-zynq-ultrascale-registers.html
+ * 
https://www.xilinx.com/html_docs/registers/ug1087/ug1087-zynq-ultrascale-registers
  * CSUDMA Module section
  */
 REG32(ADDR, 0x0)
-- 
2.34.1




[PULL 01/13] target/arm: Set IL bit for pauth, SVE access, BTI trap syndromes

2023-11-27 Thread Peter Maydell
The syndrome register value always has an IL field at bit 25, which
is 0 for a trap on a 16 bit instruction, and 1 for a trap on a 32
bit instruction (or for exceptions which aren't traps on a known
instruction, like PC alignment faults). This means that our
syn_*() functions should always either take an is_16bit argument to
determine whether to set the IL bit, or else unconditionally set it.

We missed setting the IL bit for the syndrome for three kinds of trap:
 * an SVE access exception
 * a pointer authentication check failure
 * a BTI (branch target identification) check failure

All of these traps are AArch64 only, and so the instruction causing
the trap is always 32 bit. This means we can unconditionally set
the IL bit in the syn_*() function.
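
For illustration, a stand-alone sketch of how such a syndrome value is
composed. The macro and function names are hypothetical, and the EC
numbers (0x09, 0x0d, 0x19) and the EC field position at bit 26 are
assumptions taken from the Arm ESR_ELx encoding, not from this patch; only
the IL position at bit 25 is stated above.

```c
#include <stdint.h>

#define SKETCH_EC_SHIFT 26            /* assumed EC field position */
#define SKETCH_IL       (1u << 25)    /* IL: 1 = 32-bit instruction */

/* Assumed exception class values for the three traps. */
enum {
    SKETCH_EC_PACTRAP       = 0x09,
    SKETCH_EC_BTITRAP       = 0x0d,
    SKETCH_EC_SVEACCESSTRAP = 0x19,
};

/* AArch64-only traps: the trapping instruction is always 32 bits wide. */
static inline uint32_t sketch_syn_a64_trap(uint32_t ec, uint32_t iss)
{
    return (ec << SKETCH_EC_SHIFT) | SKETCH_IL | iss;
}

static inline uint32_t sketch_syn_sve_access_trap(void)
{
    return sketch_syn_a64_trap(SKETCH_EC_SVEACCESSTRAP, 0);
}

static inline uint32_t sketch_syn_pactrap(void)
{
    return sketch_syn_a64_trap(SKETCH_EC_PACTRAP, 0);
}

static inline uint32_t sketch_syn_btitrap(uint32_t btype)
{
    return sketch_syn_a64_trap(SKETCH_EC_BTITRAP, btype);
}
```

Funnelling the AArch64-only cases through one helper that always ORs in IL
is one way to avoid forgetting the bit; the patch itself simply adds
ARM_EL_IL to each of the three functions.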

Signed-off-by: Peter Maydell 
Reviewed-by: Richard Henderson 
Message-id: 20231120150121.3458408-1-peter.mayd...@linaro.org
Cc: qemu-sta...@nongnu.org
Reviewed-by: Peter Maydell 
---
 target/arm/syndrome.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/target/arm/syndrome.h b/target/arm/syndrome.h
index 5d34755508d..95454b5b3bb 100644
--- a/target/arm/syndrome.h
+++ b/target/arm/syndrome.h
@@ -216,7 +216,7 @@ static inline uint32_t syn_simd_access_trap(int cv, int 
cond, bool is_16bit)
 
 static inline uint32_t syn_sve_access_trap(void)
 {
-return EC_SVEACCESSTRAP << ARM_EL_EC_SHIFT;
+return (EC_SVEACCESSTRAP << ARM_EL_EC_SHIFT) | ARM_EL_IL;
 }
 
 /*
@@ -242,12 +242,12 @@ static inline uint32_t syn_pacfail(bool data, int 
keynumber)
 
 static inline uint32_t syn_pactrap(void)
 {
-return EC_PACTRAP << ARM_EL_EC_SHIFT;
+return (EC_PACTRAP << ARM_EL_EC_SHIFT) | ARM_EL_IL;
 }
 
 static inline uint32_t syn_btitrap(int btype)
 {
-return (EC_BTITRAP << ARM_EL_EC_SHIFT) | btype;
+return (EC_BTITRAP << ARM_EL_EC_SHIFT) | ARM_EL_IL | btype;
 }
 
 static inline uint32_t syn_bxjtrap(int cv, int cond, int rm)
-- 
2.34.1




[PULL 10/13] hw/input/stellaris_gamepad: Free StellarisGamepad::keycodes[] array

2023-11-27 Thread Peter Maydell
From: Philippe Mathieu-Daudé 

Commit 0be6bfac62 ("qdev: Implement variable length array properties")
added the DEFINE_PROP_ARRAY() macro with the following comment:

  * It is the responsibility of the device deinit code to free the
  * @_arrayfield memory.

Commit a75f336b97 added:

  DEFINE_PROP_ARRAY("keycodes", StellarisGamepad, num_buttons,
keycodes, qdev_prop_uint32, uint32_t),

but forgot to free the 'keycodes' array. Do it in the instance_finalize
handler.

Fixes: a75f336b97 ("hw/input/stellaris_input: Convert to qdev")
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20231121174051.63038-7-phi...@linaro.org
Reviewed-by: Peter Maydell 
Signed-off-by: Peter Maydell 
---
 hw/input/stellaris_gamepad.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/hw/input/stellaris_gamepad.c b/hw/input/stellaris_gamepad.c
index 06a0c0ce839..9dfa620e29a 100644
--- a/hw/input/stellaris_gamepad.c
+++ b/hw/input/stellaris_gamepad.c
@@ -63,6 +63,13 @@ static void stellaris_gamepad_realize(DeviceState *dev, 
Error **errp)
 qemu_input_handler_register(dev, _gamepad_handler);
 }
 
+static void stellaris_gamepad_finalize(Object *obj)
+{
+StellarisGamepad *s = STELLARIS_GAMEPAD(obj);
+
+g_free(s->keycodes);
+}
+
 static void stellaris_gamepad_reset_enter(Object *obj, ResetType type)
 {
 StellarisGamepad *s = STELLARIS_GAMEPAD(obj);
@@ -92,6 +99,7 @@ static const TypeInfo stellaris_gamepad_info[] = {
 .name = TYPE_STELLARIS_GAMEPAD,
 .parent = TYPE_SYS_BUS_DEVICE,
 .instance_size = sizeof(StellarisGamepad),
+.instance_finalize = stellaris_gamepad_finalize,
 .class_init = stellaris_gamepad_class_init,
 },
 };
-- 
2.34.1




[PULL 08/13] hw/nvram/xlnx-efuse: Free XlnxEFuse::ro_bits[] array on finalize()

2023-11-27 Thread Peter Maydell
From: Philippe Mathieu-Daudé 

Commit 0be6bfac62 ("qdev: Implement variable length array properties")
added the DEFINE_PROP_ARRAY() macro with the following comment:

  * It is the responsibility of the device deinit code to free the
  * @_arrayfield memory.

Commit 68fbcc344e added:

  DEFINE_PROP_ARRAY("read-only", XlnxEFuse, ro_bits_cnt, ro_bits,
qdev_prop_uint32, uint32_t),

but forgot to free the 'ro_bits' array. Do it in the instance_finalize
handler.

Cc: qemu-sta...@nongnu.org
Fixes: 68fbcc344e ("hw/nvram: Introduce Xilinx eFuse QOM") # v6.2.0+
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20231121174051.63038-5-phi...@linaro.org
Reviewed-by: Peter Maydell 
Signed-off-by: Peter Maydell 
---
 hw/nvram/xlnx-efuse.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/hw/nvram/xlnx-efuse.c b/hw/nvram/xlnx-efuse.c
index 655c40b8d1e..f7b849f7de4 100644
--- a/hw/nvram/xlnx-efuse.c
+++ b/hw/nvram/xlnx-efuse.c
@@ -224,6 +224,13 @@ static void efuse_realize(DeviceState *dev, Error **errp)
 }
 }
 
+static void efuse_finalize(Object *obj)
+{
+XlnxEFuse *s = XLNX_EFUSE(obj);
+
+g_free(s->ro_bits);
+}
+
 static void efuse_prop_set_drive(Object *obj, Visitor *v, const char *name,
  void *opaque, Error **errp)
 {
@@ -280,6 +287,7 @@ static const TypeInfo efuse_info = {
 .name  = TYPE_XLNX_EFUSE,
 .parent= TYPE_DEVICE,
 .instance_size = sizeof(XlnxEFuse),
+.instance_finalize = efuse_finalize,
 .class_init= efuse_class_init,
 };
 
-- 
2.34.1




[PULL 09/13] hw/nvram/xlnx-efuse-ctrl: Free XlnxVersalEFuseCtrl[] "pg0-lock" array

2023-11-27 Thread Peter Maydell
From: Philippe Mathieu-Daudé 

Commit 0be6bfac62 ("qdev: Implement variable length array properties")
added the DEFINE_PROP_ARRAY() macro with the following comment:

  * It is the responsibility of the device deinit code to free the
  * @_arrayfield memory.

Commit 9e4aa1fafe added:

  DEFINE_PROP_ARRAY("pg0-lock",
XlnxVersalEFuseCtrl, extra_pg0_lock_n16,
extra_pg0_lock_spec, qdev_prop_uint16, uint16_t),

but forgot to free the 'extra_pg0_lock_spec' array. Do it in the
instance_finalize() handler.

Cc: qemu-sta...@nongnu.org
Fixes: 9e4aa1fafe ("hw/nvram: Xilinx Versal eFuse device") # v6.2.0+
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20231121174051.63038-6-phi...@linaro.org
Reviewed-by: Peter Maydell 
Signed-off-by: Peter Maydell 
---
 hw/nvram/xlnx-versal-efuse-ctrl.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/hw/nvram/xlnx-versal-efuse-ctrl.c b/hw/nvram/xlnx-versal-efuse-ctrl.c
index beb5661c35f..2480af35e1b 100644
--- a/hw/nvram/xlnx-versal-efuse-ctrl.c
+++ b/hw/nvram/xlnx-versal-efuse-ctrl.c
@@ -726,6 +726,13 @@ static void efuse_ctrl_init(Object *obj)
 sysbus_init_irq(sbd, &s->irq_efuse_imr);
 }
 
+static void efuse_ctrl_finalize(Object *obj)
+{
+XlnxVersalEFuseCtrl *s = XLNX_VERSAL_EFUSE_CTRL(obj);
+
+g_free(s->extra_pg0_lock_spec);
+}
+
 static const VMStateDescription vmstate_efuse_ctrl = {
 .name = TYPE_XLNX_VERSAL_EFUSE_CTRL,
 .version_id = 1,
@@ -764,6 +771,7 @@ static const TypeInfo efuse_ctrl_info = {
 .instance_size = sizeof(XlnxVersalEFuseCtrl),
 .class_init= efuse_ctrl_class_init,
 .instance_init = efuse_ctrl_init,
+.instance_finalize = efuse_ctrl_finalize,
 };
 
 static void efuse_ctrl_register_types(void)
-- 
2.34.1




[PULL 07/13] hw/misc/mps2-scc: Free MPS2SCC::oscclk[] array on finalize()

2023-11-27 Thread Peter Maydell
From: Philippe Mathieu-Daudé 

Commit 0be6bfac62 ("qdev: Implement variable length array properties")
added the DEFINE_PROP_ARRAY() macro with the following comment:

  * It is the responsibility of the device deinit code to free the
  * @_arrayfield memory.

Commit 4fb013afcc added:

  DEFINE_PROP_ARRAY("oscclk", MPS2SCC, num_oscclk, oscclk_reset,
qdev_prop_uint32, uint32_t),

but forgot to free the 'oscclk_reset' array. Do it in the
instance_finalize() handler.

Cc: qemu-sta...@nongnu.org
Fixes: 4fb013afcc ("hw/misc/mps2-scc: Support configurable number of OSCCLK values") # v6.0.0+
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20231121174051.63038-4-phi...@linaro.org
Reviewed-by: Peter Maydell 
Signed-off-by: Peter Maydell 
---
 hw/misc/mps2-scc.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/hw/misc/mps2-scc.c b/hw/misc/mps2-scc.c
index b3b42a792cd..fe5034db140 100644
--- a/hw/misc/mps2-scc.c
+++ b/hw/misc/mps2-scc.c
@@ -329,6 +329,13 @@ static void mps2_scc_realize(DeviceState *dev, Error **errp)
 s->oscclk = g_new0(uint32_t, s->num_oscclk);
 }
 
+static void mps2_scc_finalize(Object *obj)
+{
+MPS2SCC *s = MPS2_SCC(obj);
+
+g_free(s->oscclk_reset);
+}
+
 static const VMStateDescription mps2_scc_vmstate = {
 .name = "mps2-scc",
 .version_id = 3,
@@ -385,6 +392,7 @@ static const TypeInfo mps2_scc_info = {
 .parent = TYPE_SYS_BUS_DEVICE,
 .instance_size = sizeof(MPS2SCC),
 .instance_init = mps2_scc_init,
+.instance_finalize = mps2_scc_finalize,
 .class_init = mps2_scc_class_init,
 };
 
-- 
2.34.1




[PULL 03/13] hw/net/can/xlnx-zynqmp: Avoid underflow while popping TX FIFOs

2023-11-27 Thread Peter Maydell
From: Philippe Mathieu-Daudé 

Per https://docs.xilinx.com/r/en-US/ug1085-zynq-ultrascale-trm/Message-Format

  Message Format

  The same message format is used for RXFIFO, TXFIFO, and TXHPB.
  Each message includes four words (16 bytes). Software must read
  and write all four words regardless of the actual number of data
  bytes and valid fields in the message.

This reference manual does not say what the hardware does when
not all four words are written. To fix the reported underflow
when the DATA2 register is written, I chose to fill the missing
words with the previous content of the ID / DLC / DATA1
registers, which is what I expect the hardware does.

Note there is no hardware flag raised under such condition.

Reported-by: Qiang Liu 
Reviewed-by: Francisco Iglesias 
Reviewed-by: Vikram Garhwal 
Signed-off-by: Philippe Mathieu-Daudé 
Message-id: 20231124183325.95392-2-phi...@linaro.org
Fixes: 98e5d7a2b7 ("hw/net/can: Introduce Xilinx ZynqMP CAN controller")
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1425
Reviewed-by: Francisco Iglesias 
Reviewed-by: Vikram Garhwal 
Signed-off-by: Philippe Mathieu-Daudé 
Signed-off-by: Peter Maydell 
Reviewed-by: Peter Maydell 
---
 hw/net/can/xlnx-zynqmp-can.c | 50 +---
 1 file changed, 47 insertions(+), 3 deletions(-)

diff --git a/hw/net/can/xlnx-zynqmp-can.c b/hw/net/can/xlnx-zynqmp-can.c
index e93e6c5e194..1f1c686479c 100644
--- a/hw/net/can/xlnx-zynqmp-can.c
+++ b/hw/net/can/xlnx-zynqmp-can.c
@@ -434,6 +434,52 @@ static bool tx_ready_check(XlnxZynqMPCANState *s)
 return true;
 }
 
+static void read_tx_frame(XlnxZynqMPCANState *s, Fifo32 *fifo, uint32_t *data)
+{
+unsigned used = fifo32_num_used(fifo);
+bool is_txhpb = fifo == &s->txhpb_fifo;
+
+assert(used > 0);
+used %= CAN_FRAME_SIZE;
+
+/*
+ * Frame Message Format
+ *
+ * Each frame includes four words (16 bytes). Software must read and write
+ * all four words regardless of the actual number of data bytes and valid
+ * fields in the message.
+ * If software misbehave (not writing all four words), we use the previous
+ * registers content to initialize each missing word.
+ *
+ * If used is 1 then ID, DLC and DATA1 are missing.
+ * if used is 2 then ID and DLC are missing.
+ * if used is 3 then only ID is missing.
+ */
+if (used > 0) {
+data[0] = s->regs[is_txhpb ? R_TXHPB_ID : R_TXFIFO_ID];
+} else {
+data[0] = fifo32_pop(fifo);
+}
+if (used == 1 || used == 2) {
+data[1] = s->regs[is_txhpb ? R_TXHPB_DLC : R_TXFIFO_DLC];
+} else {
+data[1] = fifo32_pop(fifo);
+}
+if (used == 1) {
+data[2] = s->regs[is_txhpb ? R_TXHPB_DATA1 : R_TXFIFO_DATA1];
+} else {
+data[2] = fifo32_pop(fifo);
+}
+/* DATA2 triggered the transfer thus is always available */
+data[3] = fifo32_pop(fifo);
+
+if (used) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: Incomplete CAN frame (only %u/%u slots used)\n",
+  TYPE_XLNX_ZYNQMP_CAN, used, CAN_FRAME_SIZE);
+}
+}
+
 static void transfer_fifo(XlnxZynqMPCANState *s, Fifo32 *fifo)
 {
 qemu_can_frame frame;
@@ -451,9 +497,7 @@ static void transfer_fifo(XlnxZynqMPCANState *s, Fifo32 *fifo)
 }
 
 while (!fifo32_is_empty(fifo)) {
-for (i = 0; i < CAN_FRAME_SIZE; i++) {
-data[i] = fifo32_pop(fifo);
-}
+read_tx_frame(s, fifo, data);
 
 if (ARRAY_FIELD_EX32(s->regs, STATUS_REGISTER, LBACK)) {
 /*
-- 
2.34.1
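
The fill rule used by read_tx_frame() can be tried outside QEMU. The sketch
below is a hedged standalone model, not the QEMU code: ToyFifo, toy_push(),
toy_pop() and toy_read_frame() are hypothetical names, and the last[] array
stands in for the previous ID / DLC / DATA1 register values.

```c
#include <assert.h>
#include <stdint.h>

#define FRAME_WORDS 4u   /* ID, DLC, DATA1, DATA2 */
#define FIFO_CAP    16u

/* Hypothetical minimal word FIFO; this is not QEMU's Fifo32. */
typedef struct ToyFifo {
    uint32_t buf[FIFO_CAP];
    unsigned head;
    unsigned used;
} ToyFifo;

static void toy_push(ToyFifo *f, uint32_t w)
{
    assert(f->used < FIFO_CAP);
    f->buf[(f->head + f->used) % FIFO_CAP] = w;
    f->used++;
}

static uint32_t toy_pop(ToyFifo *f)
{
    uint32_t w;

    assert(f->used > 0);
    w = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->used--;
    return w;
}

/*
 * Pop one frame into out[]. If the FIFO holds a partial frame, the leading
 * words (ID first, then DLC, then DATA1) are taken from last[] instead of
 * being popped, so the FIFO never underflows.
 * Returns the number of words that were missing (0 for a complete frame).
 */
static unsigned toy_read_frame(ToyFifo *f, const uint32_t last[3],
                               uint32_t out[FRAME_WORDS])
{
    unsigned rem, missing;

    assert(f->used > 0);
    rem = f->used % FRAME_WORDS;
    missing = rem ? FRAME_WORDS - rem : 0;

    for (unsigned i = 0; i < FRAME_WORDS - 1; i++) {
        out[i] = (i < missing) ? last[i] : toy_pop(f);
    }
    /* The last word (DATA2) triggered the transfer, so it is always there. */
    out[FRAME_WORDS - 1] = toy_pop(f);
    return missing;
}
```

With two words queued, the model reports two missing words, takes ID and DLC
from last[], and pops only DATA1 and DATA2, which matches the "used is 2"
case in the patch comment.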




Re: [RFC PATCH v2 18/19] heki: x86: Protect guest kernel memory using the KVM hypervisor

2023-11-27 Thread Madhavan T. Venkataraman
Apologies for the late reply. I was on vacation. Please see my response below:

On 11/13/23 02:54, Peter Zijlstra wrote:
> On Sun, Nov 12, 2023 at 09:23:25PM -0500, Mickaël Salaün wrote:
>> From: Madhavan T. Venkataraman 
>>
>> Implement a hypervisor function, kvm_protect_memory() that calls the
>> KVM_HC_PROTECT_MEMORY hypercall to request the KVM hypervisor to
>> set specified permissions on a list of guest pages.
>>
>> Using the protect_memory() function, set proper EPT permissions for all
>> guest pages.
>>
>> Use the MEM_ATTR_IMMUTABLE property to protect the kernel static
>> sections and the boot-time read-only sections. This enables to make sure
>> a compromised guest will not be able to change its main physical memory
>> page permissions. However, this also disable any feature that may change
>> the kernel's text section (e.g., ftrace, Kprobes), but they can still be
>> used on kernel modules.
>>
>> Module loading/unloading, and eBPF JIT is allowed without restrictions
>> for now, but we'll need a way to authenticate these code changes to
>> really improve the guests' security. We plan to use module signatures,
>> but there is no solution yet to authenticate eBPF programs.
>>
>> Being able to use ftrace and Kprobes in a secure way is a challenge not
>> solved yet. We're looking for ideas to make this work.
>>
>> Likewise, the JUMP_LABEL feature cannot work because the kernel's text
>> section is read-only.
> 
> What is the actual problem? As is the kernel text map is already RO and
> never changed.

For the JUMP_LABEL optimization, the text needs to be patched at some point.
That patching requires a writable mapping of the text page at the time of
patching.

In this Heki feature, we currently lock down the kernel text at the end of
kernel boot just before kicking off the init process. The lockdown is
implemented by setting the permissions of a text page to R_X in the extended
page table and not allowing write permissions in the EPT after that. So, jump
label patching during kernel boot is not a problem. But doing it after kernel
boot is a problem.

The lockdown is just for the current Heki implementation. In the future, we plan
to have a way of authenticating guest requests to change permissions on a text
page. Once that is in place, permissions on text pages can be changed on the fly
to support features that depend on text patching - FTrace, KProbes, etc.

Madhavan
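
The interaction described above (text patching needs a writable mapping, and
the lockdown refuses write permission after boot) can be modelled in a few
lines. This is only a toy sketch under stated assumptions: ToyPage,
toy_set_perms(), toy_lockdown_text() and toy_patch_text() are hypothetical
names and do not correspond to Heki, KVM or kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { PERM_R = 1, PERM_W = 2, PERM_X = 4 };

/* Hypothetical model of one text page's second-level permissions. */
typedef struct ToyPage {
    unsigned perms;
    bool immutable;   /* set once the page has been locked down */
} ToyPage;

/* Request new permissions; refused once the page is locked down. */
static bool toy_set_perms(ToyPage *p, unsigned perms)
{
    assert(p != NULL);
    if (p->immutable) {
        return false;
    }
    p->perms = perms;
    return true;
}

/* Models the end-of-boot lockdown: R_X only, no further changes. */
static void toy_lockdown_text(ToyPage *p)
{
    assert(p != NULL);
    p->perms = PERM_R | PERM_X;
    p->immutable = true;
}

/*
 * Models a text patch such as a jump label update: it needs the page to be
 * writable for the duration of the patch, then restores R_X.
 */
static bool toy_patch_text(ToyPage *p)
{
    if (!toy_set_perms(p, p->perms | PERM_W)) {
        return false;   /* locked down: the patch cannot proceed */
    }
    /* ... the instruction bytes would be rewritten here ... */
    toy_set_perms(p, p->perms & ~(unsigned)PERM_W);
    return true;
}
```

In this model a patch succeeds before toy_lockdown_text() and fails after it,
which mirrors why jump label patching works during boot but not afterwards.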


