date:20211101

Re: [RFC PATCH v5 12/26] vhost: Route guest->host notification through shadow virtqueue

2021-11-01 Thread Jason Wang




On 2021/10/30 2:35 AM, Eugenio Pérez wrote:

+/**
+ * Enable or disable shadow virtqueue in a vhost vdpa device.
+ *
+ * This function is idempotent, to call it many times with the same value for
+ * enable_svq will simply return success.
+ *
+ * @v   Vhost vdpa device
+ * @enable  True to set SVQ mode
+ * @errp  Error pointer
+ */
+void vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable, Error **errp)
+{



What happens if vhost_vdpa is not started when we try to enable SVQ? 
Another note is that the vhost device could be stopped and started 
after SVQ is enabled/disabled. We need to deal with both cases.


Thanks
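
To make the two cases above concrete, here is a minimal sketch, assuming a
hypothetical toy_vdpa struct (it is not the real struct vhost_vdpa, and none
of these helpers exist in QEMU). The request is recorded even when the device
is stopped and is applied on the next start, so repeated calls with the same
value stay idempotent.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device state; not the real struct vhost_vdpa. */
struct toy_vdpa {
    bool started;        /* device is currently running */
    bool svq_requested;  /* last mode requested by the user */
    bool svq_active;     /* mode actually applied to the running device */
};

/* Record the request; apply it right away only if the device is running. */
static void toy_enable_svq(struct toy_vdpa *v, bool enable)
{
    v->svq_requested = enable;
    if (v->started) {
        v->svq_active = enable;
    }
}

/* On (re)start, apply whatever mode was requested while stopped. */
static void toy_start(struct toy_vdpa *v)
{
    v->started = true;
    v->svq_active = v->svq_requested;
}

/* On stop, nothing is active, but the request is remembered. */
static void toy_stop(struct toy_vdpa *v)
{
    v->started = false;
    v->svq_active = false;
}
```

With this shape, enabling SVQ on a stopped device is not an error; the mode
simply takes effect at the next start and survives a stop/start cycle.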

Re: [RFC PATCH v5 21/26] vhost: Add vhost_svq_valid_guest_features to shadow vq

2021-11-01 Thread Jason Wang

On Sat, Oct 30, 2021 at 2:44 AM Eugenio Pérez  wrote:
>
> This allows it to test if the guest has acknowledged an invalid transport
> feature for SVQ. This will include packed vq layout or event_idx,
> where the VirtIO device needs help from SVQ.
>
> This is not needed at this moment, but since SVQ will not re-negotiate
> features with the guest, a failure to acknowledge them is fatal
> for SVQ.
>

It's not clear to me why we need this. Maybe you can give me an
example. E.g isn't it sufficient to filter out the device with
event_idx?

Thanks
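
As an illustration of the filtering alternative, a minimal sketch follows. The
bit numbers are the ones defined by the VirtIO specification; the function
name and the choice of which features count as unsupported are assumptions
made for this example, not the code in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature bit numbers as defined by the VirtIO specification. */
#define VIRTIO_RING_F_EVENT_IDX 29
#define VIRTIO_F_RING_PACKED    34

/* Features this sketch assumes SVQ cannot handle (assumption, see lead-in). */
static const uint64_t toy_svq_unsupported =
    (1ULL << VIRTIO_RING_F_EVENT_IDX) | (1ULL << VIRTIO_F_RING_PACKED);

/*
 * Clear the unsupported bits in place before offering the features to the
 * guest. Returns true if nothing had to be removed.
 */
static bool toy_svq_filter_features(uint64_t *features)
{
    bool clean = (*features & toy_svq_unsupported) == 0;

    *features &= ~toy_svq_unsupported;
    return clean;
}
```

If the filtered set is what the guest is offered, the guest can never
acknowledge a bit that SVQ does not handle, which is the situation the
question above is probing.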

> Signed-off-by: Eugenio Pérez 
> ---
>  hw/virtio/vhost-shadow-virtqueue.h | 1 +
>  hw/virtio/vhost-shadow-virtqueue.c | 6 ++
>  2 files changed, 7 insertions(+)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.h 
> b/hw/virtio/vhost-shadow-virtqueue.h
> index 946b2c6295..ac55588009 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.h
> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> @@ -16,6 +16,7 @@
>  typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
>
>  bool vhost_svq_valid_device_features(uint64_t *features);
> +bool vhost_svq_valid_guest_features(uint64_t *features);
>
>  void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
>  void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int 
> call_fd);
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
> b/hw/virtio/vhost-shadow-virtqueue.c
> index 6e0508a231..cb9ffcb015 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -62,6 +62,12 @@ bool vhost_svq_valid_device_features(uint64_t 
> *dev_features)
>  return true;
>  }
>
> +/* If the guest is using some of these, SVQ cannot communicate */
> +bool vhost_svq_valid_guest_features(uint64_t *guest_features)
> +{
> +return true;
> +}
> +
>  /* Forward guest notifications */
>  static void vhost_handle_guest_kick(EventNotifier *n)
>  {
> --
> 2.27.0
>

Re: [RFC PATCH v5 03/26] virtio: Add VIRTIO_F_QUEUE_STATE

2021-11-01 Thread Jason Wang

On Sat, Oct 30, 2021 at 2:36 AM Eugenio Pérez  wrote:
>
> Implementation of RFC of device state capability:
> https://lists.oasis-open.org/archives/virtio-comment/202012/msg5.html

Considering this still requires time to be done, we need to think of a
way to go without this.

Thanks



>
> With this capability, vdpa device can reset its index so it can start
> consuming from guest after disabling shadow virtqueue (SVQ), with state
> not 0.
>
> The use case is to test SVQ with virtio-pci vdpa (vp_vdpa) with nested
> virtualization. Spawning a L0 qemu with a virtio-net device, use
> vp_vdpa driver to handle it in the guest, and then spawn a L1 qemu using
> that vdpa device. When L1 qemu calls device to set a new state through
> vdpa ioctl, vp_vdpa should set each queue state through virtio
> VIRTIO_PCI_COMMON_Q_AVAIL_STATE.
>
> Since this is only for testing vhost-vdpa, it's added here before
> proposing it for kernel code. No effort is made to check that the device
> can actually change its state, its layout, or if the device even
> supports changing state at all. These will be added in the future.
>
> Also, a modified version of vp_vdpa that allows setting these in PCI
> config is needed.
>
> TODO: Check for feature enabled and split in virtio pci config
>
> Signed-off-by: Eugenio Pérez 
> ---
>  hw/virtio/virtio-pci.h | 1 +
>  include/hw/virtio/virtio.h | 4 +++-
>  include/standard-headers/linux/virtio_config.h | 3 +++
>  include/standard-headers/linux/virtio_pci.h| 2 ++
>  hw/virtio/virtio-pci.c | 9 +
>  5 files changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/virtio-pci.h b/hw/virtio/virtio-pci.h
> index 2446dcd9ae..019badbd7c 100644
> --- a/hw/virtio/virtio-pci.h
> +++ b/hw/virtio/virtio-pci.h
> @@ -120,6 +120,7 @@ typedef struct VirtIOPCIQueue {
>uint32_t desc[2];
>uint32_t avail[2];
>uint32_t used[2];
> +  uint16_t state;
>  } VirtIOPCIQueue;
>
>  struct VirtIOPCIProxy {
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index 8bab9cfb75..5fe575b8f0 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -289,7 +289,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
>  DEFINE_PROP_BIT64("iommu_platform", _state, _field, \
>VIRTIO_F_IOMMU_PLATFORM, false), \
>  DEFINE_PROP_BIT64("packed", _state, _field, \
> -  VIRTIO_F_RING_PACKED, false)
> +  VIRTIO_F_RING_PACKED, false), \
> +DEFINE_PROP_BIT64("save_restore_q_state", _state, _field, \
> +  VIRTIO_F_QUEUE_STATE, true)
>
>  hwaddr virtio_queue_get_desc_addr(VirtIODevice *vdev, int n);
>  bool virtio_queue_enabled_legacy(VirtIODevice *vdev, int n);
> diff --git a/include/standard-headers/linux/virtio_config.h 
> b/include/standard-headers/linux/virtio_config.h
> index 22e3a85f67..59fad3eb45 100644
> --- a/include/standard-headers/linux/virtio_config.h
> +++ b/include/standard-headers/linux/virtio_config.h
> @@ -90,4 +90,7 @@
>   * Does the device support Single Root I/O Virtualization?
>   */
>  #define VIRTIO_F_SR_IOV37
> +
> +/* Device support save and restore virtqueue state */
> +#define VIRTIO_F_QUEUE_STATE40
>  #endif /* _LINUX_VIRTIO_CONFIG_H */
> diff --git a/include/standard-headers/linux/virtio_pci.h 
> b/include/standard-headers/linux/virtio_pci.h
> index db7a8e2fcb..c8d9802a87 100644
> --- a/include/standard-headers/linux/virtio_pci.h
> +++ b/include/standard-headers/linux/virtio_pci.h
> @@ -164,6 +164,7 @@ struct virtio_pci_common_cfg {
> uint32_t queue_avail_hi;/* read-write */
> uint32_t queue_used_lo; /* read-write */
> uint32_t queue_used_hi; /* read-write */
> +   uint16_t queue_avail_state; /* read-write */
>  };
>
>  /* Fields in VIRTIO_PCI_CAP_PCI_CFG: */
> @@ -202,6 +203,7 @@ struct virtio_pci_cfg_cap {
>  #define VIRTIO_PCI_COMMON_Q_AVAILHI44
>  #define VIRTIO_PCI_COMMON_Q_USEDLO 48
>  #define VIRTIO_PCI_COMMON_Q_USEDHI 52
> +#define VIRTIO_PCI_COMMON_Q_AVAIL_STATE56
>
>  #endif /* VIRTIO_PCI_NO_MODERN */
>
> diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
> index 750aa47ec1..d7bb549033 100644
> --- a/hw/virtio/virtio-pci.c
> +++ b/hw/virtio/virtio-pci.c
> @@ -1244,6 +1244,9 @@ static uint64_t virtio_pci_common_read(void *opaque, 
> hwaddr addr,
>  case VIRTIO_PCI_COMMON_Q_USEDHI:
>  val = proxy->vqs[vdev->queue_sel].used[1];
>  break;
> +case VIRTIO_PCI_COMMON_Q_AVAIL_STATE:
> +val = virtio_queue_get_last_avail_idx(vdev, vdev->queue_sel);
> +break;
>  default:
>  val = 0;
>  }
> @@ -1330,6 +1333,8 @@ static void virtio_pci_common_write(void *opaque, 
> hwaddr addr,
> proxy->vqs[vdev->queue_sel].avail[0],
>

Re: [RFC PATCH v5 00/26] vDPA shadow virtqueue

2021-11-01 Thread Jason Wang




On 2021/10/30 2:34 AM, Eugenio Pérez wrote:

This series enables shadow virtqueue (SVQ) for vhost-vdpa devices. This
is intended as a new method of tracking the memory the devices touch
during a migration process: Instead of relying on the vhost device's dirty
logging capability, SVQ intercepts the VQ dataplane, forwarding the
descriptors between VM and device. This way qemu is the effective
writer of the guest's memory, like in qemu's virtio device operation.

When SVQ is enabled, qemu offers a new virtual address space to the
device to read and write into, and it maps new vrings and the guest
memory in it. SVQ also intercepts kicks and calls between the device
and the guest. Relaying used buffers causes dirty memory to be
tracked, but in this RFC SVQ is not enabled automatically on migration.

Thanks to being a buffer relay system, SVQ can also be used to
connect devices and drivers with different capabilities, like
devices that only support packed vring and not split, and old guests
with no packed support in the driver.

It is based on the ideas of DPDK SW assisted LM, in the series of
DPDK's https://patchwork.dpdk.org/cover/48370/ . However, this series
does not map the shadow vq in the guest's VA, but in qemu's.

For qemu to use shadow virtqueues the guest virtio driver must not use
features like event_idx.

SVQ needs to be enabled with QMP command:

{ "execute": "x-vhost-set-shadow-vq",
   "arguments": { "name": "vhost-vdpa0", "enable": true } }

This series includes some patches to delete in the final version that
help with its testing. The first two of the series have been sent
separately but they haven't been included in qemu main branch.

The two after them add the feature to stop the device and be able to
set and get its status. It's intended to be used with the vp_vdpa driver in
a nested environment, so they are also external to this series. The
vp_vdpa driver also needs modifications to forward the new status bit;
they will be proposed separately.

Patches 5-12 prepare the SVQ and QMP command to support guest to host
notification forwarding. If the SVQ is enabled with these ones
applied and the device supports it, that part can be tested in
isolation (for example, with networking), hopping through SVQ.

Same thing is true with patches 13-17, but with device to guest
notifications.

Based on them, patches from 18 to 22 implement the actual buffer
forwarding, using some features already introduced in previous patches.
However, they will need a host device with no iommu, something that
is not available at the moment.

The last part of the series uses the host iommu properly, so the driver
can access the newly created virtual address space.

Comments are welcome.



I think we need to do some benchmarking to see the performance impact.

Thanks




TODO:
* Event, indirect, packed, and other features of virtio.
* Separate buffer forwarding into its own AIO context, so we can
   throw more threads to that task and we don't need to stop the main
   event loop.
* Support multiqueue virtio-net vdpa.
* Proper documentation.

Changes from v4 RFC:
* Support of allocating / freeing iova ranges in IOVA tree. Extending
   already present iova-tree for that.
* Proper validation of guest features. Now SVQ can negotiate a
   different set of features with the device when enabled.
* Support of host notifiers memory regions
* Handling of SVQ full queue in case guest's descriptors span to
   different memory regions (qemu's VA chunks).
* Flush pending used buffers at end of SVQ operation.
* QMP command now looks up by NetClientState name. Other devices will need
   to implement their own way to enable vdpa.
* Rename QMP command to set, so it looks more like a way of working
* Better use of qemu error system
* Make a few assertions proper error-handling paths.
* Add more documentation
* Less coupling of virtio / vhost, that could cause friction on changes
* Addressed many other small comments and small fixes.

Changes from v3 RFC:
   * Move everything to vhost-vdpa backend. A big change, this allowed
 some cleanup but more code has been added in other places.
   * More use of glib utilities, especially to manage memory.
v3 link:
https://lists.nongnu.org/archive/html/qemu-devel/2021-05/msg06032.html

Changes from v2 RFC:
   * Adding vhost-vdpa devices support
   * Fixed some memory leaks pointed out by different comments
v2 link:
https://lists.nongnu.org/archive/html/qemu-devel/2021-03/msg05600.html

Changes from v1 RFC:
   * Use QMP instead of migration to start SVQ mode.
   * Only accepting IOMMU devices, closer behavior with target devices
 (vDPA)
   * Fix invalid masking/unmasking of vhost call fd.
   * Use of proper methods for synchronization.
   * No need to modify VirtIO device code, all of the changes are
 contained in vhost code.
   * Delete superfluous code.
   * An intermediate RFC was sent with only the notifications forwarding
 changes. It can be seen in
 https://patchew.org/QEMU/20210129205415.876290-1-epere...@redhat.com/
v1

Re: [PATCH] vhost: Fix last queue index of devices with no cvq

2021-11-01 Thread Jason Wang

On Mon, Nov 1, 2021 at 4:59 PM Eugenio Perez Martin  wrote:
>
> On Mon, Nov 1, 2021 at 4:34 AM Jason Wang  wrote:
> >
> > On Fri, Oct 29, 2021 at 10:16 PM Eugenio Pérez  wrote:
> > >
> > > > The -1 assumes that all devices with no cvq have a spare vq allocated
> > > > for them, but with no offer of VIRTIO_NET_F_CTRL_VQ. This may not be the
> > > > case, and the device may have an even number of queues.
> > >
> > > To fix this, just resort to the lower even number of queues.
> > >
> > > Fixes: 049eb15b5fc9 ("vhost: record the last virtqueue index for the 
> > > virtio device")
> > > Signed-off-by: Eugenio Pérez 
> > > ---
> > >  hw/net/vhost_net.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> > > index 0d888f29a6..edf56a597f 100644
> > > --- a/hw/net/vhost_net.c
> > > +++ b/hw/net/vhost_net.c
> > > @@ -330,7 +330,7 @@ int vhost_net_start(VirtIODevice *dev, NetClientState 
> > > *ncs,
> > >  NetClientState *peer;
> > >
> > >  if (!cvq) {
> > > -last_index -= 1;
> > > +last_index &= ~1ULL;
> > >  }
> >
> > The math here looks correct but we need to fix vhost_vdpa_dev_start() 
> > instead?
> >
> > if (dev->vq_index + dev->nvqs - 1 != dev->last_index) {
> > ...
> > }
> >
>
> If we just do that, devices that offer an odd number of queues but do
> not offer ctrl vq would never enable the last vq pair, isn't it?

For vq pair, you assume that it's a networking device, so the device
you described here violates the spec.

>
> Also, I would say that the right place for the solution of this
> problem should not be virtio/vhost-vdpa: This is highly dependent on
> having cvq, and this implies a knowledge about the use of each
> virtqueue. Another kind of device could have an odd number of
> virtqueues naturally, and that (-1) would not work for them, isn't it?

It actually depends on how multiqueue is modeled for each specific
type of device. They need to initialize the vq_index and nvqs
correctly:

E.g if we had a device with 3 queues, we could model it with the following:

vhost_dev 1, vq_index = 0, nvqs = 2
vhost_dev 2, vq_index = 2, nvqs = 1

In this case the last_index should be initialized to 2, then we know
all the vhost_devs are initialized and we can start the hardware.
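
The three-queue model above can be checked with a small sketch. The struct
below is a hypothetical subset holding only the three fields named in this
thread (it is not the real struct vhost_dev), and the comparison mirrors the
vhost_vdpa_dev_start() condition quoted earlier.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of a vhost device: only the fields used in this thread. */
struct toy_vhost_dev {
    int vq_index;    /* index of the first virtqueue owned by this vhost_dev */
    int nvqs;        /* number of virtqueues owned by this vhost_dev */
    int last_index;  /* index of the last virtqueue of the whole device */
};

/* True when this vhost_dev owns the last virtqueue, so the hardware can start. */
static bool toy_is_last_dev(const struct toy_vhost_dev *dev)
{
    return dev->vq_index + dev->nvqs - 1 == dev->last_index;
}
```

For vhost_dev 1 (vq_index = 0, nvqs = 2) the check is false with last_index
set to 2, and for vhost_dev 2 (vq_index = 2, nvqs = 1) it is true, so the
hardware is started only after both have been initialized.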

Thanks

>
> Thanks!
>
> > Thanks
> >
> > >
> > >  if (!k->set_guest_notifiers) {
> > > --
> > > 2.27.0
> > >
> >
>

Re: [PATCH v8 28/29] accel/tcg/user-exec: Implement CPU-specific signal handler for loongarch64 hosts

2021-11-01 Thread WANG Xuerui

Hi,

On 2021/11/1 19:21, gaosong wrote:
> Hi Xuerui,
>
> On 2021/11/1 6:45 PM, WANG Xuerui wrote:
>> While I can see this patch and the next one are clearly from me, my
>> author info is lost as I didn't spot any "From:" line in the mail body?
>> Also I don't remember seeing "Base-on" tags in QEMU either.
>
> Sorry,  I refer to the commit 35f171a2eb25fcdf1b719c58a61a7da15b4fe078
>
> It seems that the reference is wrong.  I 'll correct it.
My patch series hasn't gone into upstream yet, so I'm pretty sure this
commit hash would change in the final merged version. I think basing
your whole series on top of mine should be okay; mine has been
completely reviewed and IIUC only waiting for a test-purpose Docker
builder image before it can be merged, so the code should be fairly
stable and friendly for rebases.
>
>> I think you're meaning to include the "Based-on" tags in your cover
>> letter instead?
>
> I should take this way,  Sorry Again,
>
Never mind; you could of course use more caution when it comes to Git
operations later.

Re: [PATCH 02/13] target/riscv: Extend pc for runtime pc write

2021-11-01 Thread LIU Zhiwei




On 2021/11/1 6:33 PM, Richard Henderson wrote:

On 11/1/21 6:01 AM, LIU Zhiwei wrote:
In some cases, we must restore the guest PC to the address of the start of
the TB, such as when the instruction counter hits zero. So extend the pc
register according to the current xlen for these cases.

Signed-off-by: LIU Zhiwei 
---
  target/riscv/cpu.c    | 20 +---
  target/riscv/cpu.h    |  2 ++
  target/riscv/cpu_helper.c |  2 +-
  3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index 7d53125dbc..7eefd4f6a6 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -319,7 +319,12 @@ static void riscv_cpu_set_pc(CPUState *cs, vaddr 
value)

  {
  RISCVCPU *cpu = RISCV_CPU(cs);
  CPURISCVState *env = &cpu->env;
-    env->pc = value;
+
+    if (cpu_get_xl(env) == MXL_RV32) {
+    env->pc = (int32_t)value;
+    } else {
+    env->pc = value;
+    }
  }


Good.


  static void riscv_cpu_synchronize_from_tb(CPUState *cs,
@@ -327,7 +332,12 @@ static void 
riscv_cpu_synchronize_from_tb(CPUState *cs,

  {
  RISCVCPU *cpu = RISCV_CPU(cs);
  CPURISCVState *env = &cpu->env;
-    env->pc = tb->pc;
+
+    if (cpu_get_xl(env) == MXL_RV32) {
+    env->pc = (int32_t)tb->pc;
+    } else {
+    env->pc = tb->pc;
+    }


Bad, since TB->PC should be extended properly.
Though this waits on a change to cpu_get_tb_cpu_state.


Should env->pc always hold the sign-extended result? In 
cpu_get_tb_cpu_state, we just truncate to XLEN bits.
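
To show why the two conventions disagree, here is a minimal sketch. The helper
names are made up for illustration; the first mirrors the (int32_t) cast used
in the patch, the second the truncation mentioned for cpu_get_tb_cpu_state.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sign-extend the low 32 bits into a 64-bit value, as the (int32_t) cast in
 * the patch does. The narrowing conversion is implementation-defined in ISO C
 * for values above INT32_MAX; GCC and Clang wrap modulo 2^32.
 */
static uint64_t toy_pc_sign_extend(uint64_t pc)
{
    return (uint64_t)(int64_t)(int32_t)(uint32_t)pc;
}

/* Keep only the low 32 bits, zero-filling the upper half. */
static uint64_t toy_pc_truncate(uint64_t pc)
{
    return (uint32_t)pc;
}
```

The two agree for addresses below 0x80000000 and differ above it: for
0x80001000 the first yields 0xFFFFFFFF80001000 and the second 0x80001000, so a
comparison between a sign-extended env->pc and a truncated TB pc would fail.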


Thanks,
Zhiwei




@@ -348,7 +358,11 @@ static bool riscv_cpu_has_work(CPUState *cs)
  void restore_state_to_opc(CPURISCVState *env, TranslationBlock *tb,
    target_ulong *data)
  {
-    env->pc = data[0];
+   if (cpu_get_xl(env) == MXL_RV32) {
+    env->pc = (int32_t)data[0];
+    } else {
+    env->pc = data[0];
+    }


Likewise.


r~

Re: [PATCH v2 0/5] pci/iommu: Fail early if vfio-pci detected before vIOMMU

2021-11-01 Thread Peter Xu

On Thu, Oct 28, 2021 at 12:31:24PM +0800, Peter Xu wrote:
> Note that patch 1-4 are cleanups for pci subsystem, and patch 5 is a fix to
> fail early for mis-ordered qemu cmdline on vfio and vIOMMU.  Logically they
> should be posted separately and they're not directly related, however to make
> it still correlated to v1 I kept them in the same patchset.
> 
> In this version I used pre_plug() hook for q35 to detect the ordering issue as
> Igor suggested, meanwhile it's done via object_resolve_path_type() rather than
> scanning the pci bus as Michael suggested.
> 
> Please review, thanks.

Michael,

Would you consider reviewing/picking patches 1-4 first?  The last patch needs further
discussion, and I would like to address it separately in the future.

Thanks,

-- 
Peter Xu

[PATCH v5 09/10] target/ppc: PMU Event-Based exception support

2021-11-01 Thread Daniel Henrique Barboza

From: Gustavo Romero 

Following up the rfebb implementation, this patch adds support for EBB
exceptions that are triggered by Performance Monitor alerts. This exception
occurs when an enabled PMU condition or event happens and both MMCR0_EBE
and BESCR_PME are set.

The supported PM alerts will consist of counter negative conditions of
the PMU counters. This will be achieved by a timer mechanism that will
predict when a counter becomes negative. The PMU timer callback will set
the appropriate bits in MMCR0 and fire a PMC interrupt. The EBB
exception code will then set the appropriate BESCR bits, set the next
instruction pointer to the address pointed by the return register
(SPR_EBBRR), and redirect execution to the handler (pointed by
SPR_EBBHR).
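
The BESCR handling described above can be sketched on a plain struct. This is
an illustration under stated assumptions: toy_ebb_state and
toy_deliver_pm_ebb are hypothetical names, the registers are ordinary fields
rather than env->spr[] entries, and PPC_BIT is redefined locally using IBM bit
numbering (bit 0 is the most significant bit).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of a 64-bit value. */
#define PPC_BIT(bit) (0x8000000000000000ULL >> (bit))
#define BESCR_GE     PPC_BIT(0)   /* Global Enable */
#define BESCR_PME    PPC_BIT(31)  /* Performance Monitor EBB Enable */
#define BESCR_PMEO   PPC_BIT(63)  /* Performance Monitor EBB Occurred */

/* Hypothetical register file holding only what the EBB path touches. */
struct toy_ebb_state {
    uint64_t bescr;
    uint64_t ebbrr;  /* return address for rfebb */
    uint64_t ebbhr;  /* handler address */
    uint64_t nip;    /* next instruction pointer */
};

/* Deliver a Performance Monitor EBB; returns true if the branch was taken. */
static bool toy_deliver_pm_ebb(struct toy_ebb_state *s)
{
    if (!(s->bescr & BESCR_GE) || !(s->bescr & BESCR_PME)) {
        return false;
    }
    s->bescr &= ~BESCR_GE;   /* no further EBBs until rfebb re-enables them */
    s->bescr |= BESCR_PMEO;  /* tell the handler which event occurred */
    s->ebbrr = s->nip;       /* save where to return to */
    s->nip = s->ebbhr;       /* continue at the handler */
    return true;
}
```

Because GE is cleared on delivery, a second alert arriving while the handler
runs is not taken until the handler re-enables it.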

CC: Gustavo Romero 
Signed-off-by: Gustavo Romero 
Signed-off-by: Daniel Henrique Barboza 
---
 target/ppc/cpu.h |  5 -
 target/ppc/excp_helper.c | 28 
 target/ppc/power8-pmu.c  | 26 --
 3 files changed, 56 insertions(+), 3 deletions(-)

diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index 8f545ff482..592031ce54 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -129,8 +129,10 @@ enum {
 /* ISA 3.00 additions */
 POWERPC_EXCP_HVIRT= 101,
 POWERPC_EXCP_SYSCALL_VECTORED = 102, /* scv exception 
*/
+POWERPC_EXCP_EBB = 103, /* Event-based branch exception  */
+
 /* EOL   */
-POWERPC_EXCP_NB   = 103,
+POWERPC_EXCP_NB   = 104,
 /* QEMU exceptions: special cases we want to stop translation*/
 POWERPC_EXCP_SYSCALL_USER = 0x203, /* System call in user mode only  */
 };
@@ -2455,6 +2457,7 @@ enum {
 PPC_INTERRUPT_HMI,/* Hypervisor Maintenance interrupt*/
 PPC_INTERRUPT_HDOORBELL,  /* Hypervisor Doorbell interrupt*/
 PPC_INTERRUPT_HVIRT,  /* Hypervisor virtualization interrupt  */
+PPC_INTERRUPT_PMC,/* Hypervisor virtualization interrupt  */
 };
 
 /* Processor Compatibility mask (PCR) */
diff --git a/target/ppc/excp_helper.c b/target/ppc/excp_helper.c
index 7be334e007..88aa0a84f8 100644
--- a/target/ppc/excp_helper.c
+++ b/target/ppc/excp_helper.c
@@ -797,6 +797,22 @@ static inline void powerpc_excp(PowerPCCPU *cpu, int 
excp_model, int excp)
 cpu_abort(cs, "Non maskable external exception "
   "is not implemented yet !\n");
 break;
+case POWERPC_EXCP_EBB:   /* Event-based branch exception */
+if ((env->spr[SPR_BESCR] & BESCR_GE) &&
+(env->spr[SPR_BESCR] & BESCR_PME)) {
+target_ulong nip;
+
+env->spr[SPR_BESCR] &= ~BESCR_GE;   /* Clear GE */
+env->spr[SPR_BESCR] |= BESCR_PMEO;  /* Set PMEO */
+env->spr[SPR_EBBRR] = env->nip; /* Save NIP for rfebb insn */
+nip = env->spr[SPR_EBBHR];  /* EBB handler */
+powerpc_set_excp_state(cpu, nip, env->msr);
+}
+/*
+ * This interrupt is handled by userspace. No need
+ * to proceed.
+ */
+return;
 default:
 excp_invalid:
 cpu_abort(cs, "Invalid PowerPC exception %d. Aborting\n", excp);
@@ -1044,6 +1060,18 @@ static void ppc_hw_interrupt(CPUPPCState *env)
 powerpc_excp(cpu, env->excp_model, POWERPC_EXCP_THERM);
 return;
 }
+/* PMC -> Event-based branch exception */
+if (env->pending_interrupts & (1 << PPC_INTERRUPT_PMC)) {
+/*
+ * Performance Monitor event-based exception can only
+ * occur in problem state.
+ */
+if (msr_pr == 1) {
+env->pending_interrupts &= ~(1 << PPC_INTERRUPT_PMC);
+powerpc_excp(cpu, env->excp_model, POWERPC_EXCP_EBB);
+return;
+}
+}
 }
 
 if (env->resume_as_sreset) {
diff --git a/target/ppc/power8-pmu.c b/target/ppc/power8-pmu.c
index aa10233b29..ca3954ff0e 100644
--- a/target/ppc/power8-pmu.c
+++ b/target/ppc/power8-pmu.c
@@ -323,8 +323,30 @@ static void fire_PMC_interrupt(PowerPCCPU *cpu)
 return;
 }
 
-/* PMC interrupt not implemented yet */
-return;
+if (env->spr[SPR_POWER_MMCR0] & MMCR0_FCECE) {
+env->spr[SPR_POWER_MMCR0] &= ~MMCR0_FCECE;
+env->spr[SPR_POWER_MMCR0] |= MMCR0_FC;
+
+/* Changing MMCR0_FC demands a new hflags compute */
+hreg_compute_hflags(env);
+
+/*
+ * Delete all pending timers if we need to freeze
+ * the PMC. We'll restart them when the PMC starts
+ * running again.
+ */
+pmu_delete_timers(env);
+}
+
+pmu_update_cycles(env);
+
+if (env->spr[SPR_POWER_MMCR0] & MMCR0_PMAE) {
+env->spr[SPR_POWER_MMCR0] &= ~MMCR0_PMAE;
+env->spr[SPR_POWER_MMCR0] |= MMCR0_PMAO;
+}
+
+/*

[PATCH v5 06/10] target/ppc: PMU: handle setting of PMCs while running

2021-11-01 Thread Daniel Henrique Barboza

The initial PMU support was written under the assumption that the counters
would be set before running the PMU and read after either freezing the
PMU manually or via a performance monitor alert.

Turns out that some EBB powerpc kernel tests set the counters after
unfreezing the counters. Setting a PMC value when the PMU is running
means that, at that moment, the baseline for calculating cycle
events needs to be updated. Updating this baseline means that we need
to update all the PMCs with their actual value at that moment. Any
existing counter negative timer needs to be discarded and a new one,
with the updated values, must be set again.

This patch does that via a new 'helper_store_pmc()' that is called in
the mtspr() callbacks of PMU counters.

Signed-off-by: Daniel Henrique Barboza 
---
 target/ppc/cpu_init.c| 12 ++--
 target/ppc/helper.h  |  1 +
 target/ppc/power8-pmu-regs.c.inc | 16 +++-
 target/ppc/power8-pmu.c  | 18 ++
 target/ppc/spr_tcg.h |  1 +
 5 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/target/ppc/cpu_init.c b/target/ppc/cpu_init.c
index e6f3ff9b96..33b4df3b99 100644
--- a/target/ppc/cpu_init.c
+++ b/target/ppc/cpu_init.c
@@ -6832,27 +6832,27 @@ static void register_book3s_pmu_sup_sprs(CPUPPCState 
*env)
  KVM_REG_PPC_MMCRA, 0x);
 spr_register_kvm(env, SPR_POWER_PMC1, "PMC1",
  SPR_NOACCESS, SPR_NOACCESS,
- &spr_read_generic, &spr_write_generic,
+ &spr_read_generic, &spr_write_PMC,
  KVM_REG_PPC_PMC1, 0x);
 spr_register_kvm(env, SPR_POWER_PMC2, "PMC2",
  SPR_NOACCESS, SPR_NOACCESS,
- &spr_read_generic, &spr_write_generic,
+ &spr_read_generic, &spr_write_PMC,
  KVM_REG_PPC_PMC2, 0x);
 spr_register_kvm(env, SPR_POWER_PMC3, "PMC3",
  SPR_NOACCESS, SPR_NOACCESS,
- &spr_read_generic, &spr_write_generic,
+ &spr_read_generic, &spr_write_PMC,
  KVM_REG_PPC_PMC3, 0x);
 spr_register_kvm(env, SPR_POWER_PMC4, "PMC4",
  SPR_NOACCESS, SPR_NOACCESS,
- &spr_read_generic, &spr_write_generic,
+ &spr_read_generic, &spr_write_PMC,
  KVM_REG_PPC_PMC4, 0x);
 spr_register_kvm(env, SPR_POWER_PMC5, "PMC5",
  SPR_NOACCESS, SPR_NOACCESS,
- &spr_read_generic, &spr_write_generic,
+ &spr_read_generic, &spr_write_PMC,
  KVM_REG_PPC_PMC5, 0x);
 spr_register_kvm(env, SPR_POWER_PMC6, "PMC6",
  SPR_NOACCESS, SPR_NOACCESS,
- &spr_read_generic, &spr_write_generic,
+ &spr_read_generic, &spr_write_PMC,
  KVM_REG_PPC_PMC6, 0x);
 spr_register_kvm(env, SPR_POWER_SIAR, "SIAR",
  SPR_NOACCESS, SPR_NOACCESS,
diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index b8a89f02f4..373326203b 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -21,6 +21,7 @@ DEF_HELPER_1(hrfid, void, env)
 DEF_HELPER_2(store_lpcr, void, env, tl)
 DEF_HELPER_2(store_pcr, void, env, tl)
 DEF_HELPER_2(store_mmcr0, void, env, tl)
+DEF_HELPER_3(store_pmc, void, env, i32, i64)
 DEF_HELPER_2(insns_inc, void, env, i32)
 #endif
 DEF_HELPER_1(check_tlb_flush_local, void, env)
diff --git a/target/ppc/power8-pmu-regs.c.inc b/target/ppc/power8-pmu-regs.c.inc
index a92437b0c4..3406649130 100644
--- a/target/ppc/power8-pmu-regs.c.inc
+++ b/target/ppc/power8-pmu-regs.c.inc
@@ -212,13 +212,23 @@ void spr_read_PMC56_ureg(DisasContext *ctx, int gprn, int 
sprn)
 spr_read_PMC14_ureg(ctx, gprn, sprn);
 }
 
+void spr_write_PMC(DisasContext *ctx, int sprn, int gprn)
+{
+TCGv_i32 t_sprn = tcg_const_i32(sprn);
+
+gen_icount_io_start(ctx);
+gen_helper_store_pmc(cpu_env, t_sprn, cpu_gpr[gprn]);
+
+tcg_temp_free_i32(t_sprn);
+}
+
 void spr_write_PMC14_ureg(DisasContext *ctx, int sprn, int gprn)
 {
 if (!spr_groupA_write_allowed(ctx)) {
 return;
 }
 
-spr_write_ureg(ctx, sprn, gprn);
+spr_write_PMC(ctx, sprn + 0x10, gprn);
 }
 
 void spr_write_PMC56_ureg(DisasContext *ctx, int sprn, int gprn)
@@ -286,4 +296,8 @@ void spr_write_MMCR0(DisasContext *ctx, int sprn, int gprn)
 {
 spr_write_generic(ctx, sprn, gprn);
 }
+void spr_write_PMC(DisasContext *ctx, int sprn, int gprn)
+{
+spr_write_generic(ctx, sprn, gprn);
+}
 #endif /* defined(TARGET_PPC64) && !defined(CONFIG_USER_ONLY) */
diff --git a/target/ppc/power8-pmu.c b/target/ppc/power8-pmu.c
index 3751b6de55..d66266829f 100644
--- a/target/ppc/power8-pmu.c
+++ b/target/ppc/power8-pmu.c
@@ -336,4 +336,22 @@ void cpu_ppc_pmu_init(CPUPPCState *env)
 }
 }
 
+void helper_store_pmc(CPUPPCState *env, uint32_t sprn, uint64_t value)
+{
+bool pmu_frozen = env->spr[SPR_POWER_MMCR0] &

[PATCH v5 08/10] PPC64/TCG: Implement 'rfebb' instruction

2021-11-01 Thread Daniel Henrique Barboza

An Event-Based Branch (EBB) allows applications to change the NIA when an
event-based exception occurs. Event-based exceptions are enabled by
setting the Branch Event Status and Control Register (BESCR). If the
event-based exception is enabled when the exception occurs, an EBB
happens.

The following operations happen during an EBB:

- Global Enable (GE) bit of BESCR is set to 0;
- bits 0-61 of the Event-Based Branch Return Register (EBBRR) are set
to the effective address of the NIA that would have executed if the EBB
didn't happen;
- Instruction fetch and execution will continue in the effective address
contained in the Event-Based Branch Handler Register (EBBHR).

The EBB Handler will process the event and then execute the Return From
Event-Based Branch (rfebb) instruction. rfebb sets BESCR_GE and then
redirects execution to the address pointed in EBBRR. This process is
described in the PowerISA v3.1, Book II, Chapter 6 [1].
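
As a companion to the description above, a minimal sketch of the return path
follows. It operates on a hypothetical toy_cpu struct instead of CPUPPCState,
and it reports the invalid-form case through a return value, whereas the
helper in this patch raises a program exception. PPC_BIT is redefined locally
with IBM bit numbering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of a 64-bit value. */
#define PPC_BIT(bit)  (0x8000000000000000ULL >> (bit))
#define BESCR_GE      PPC_BIT(0)
/* Bits 32:33 must be zero, otherwise the instruction form is invalid. */
#define BESCR_INVALID (PPC_BIT(32) | PPC_BIT(33))

/* Hypothetical CPU state holding only what rfebb touches. */
struct toy_cpu {
    uint64_t bescr;
    uint64_t ebbrr;
    uint64_t nip;
};

/*
 * Return from an event-based branch. 's' is the instruction's S field.
 * Returns false (leaving state untouched) when BESCR 32:33 != 0b00.
 */
static bool toy_rfebb(struct toy_cpu *c, bool s, bool is_64bit)
{
    if (c->bescr & BESCR_INVALID) {
        return false;
    }
    /* In 32-bit mode only the low word of the return address is used. */
    c->nip = is_64bit ? c->ebbrr : (uint32_t)c->ebbrr;
    if (s) {
        c->bescr |= BESCR_GE;
    } else {
        c->bescr &= ~BESCR_GE;
    }
    return true;
}
```

A handler that wants further EBBs to be delivered returns with S = 1; with
S = 0 the Global Enable bit stays cleared.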

This patch implements the rfebb instruction. Descriptions of all
relevant BESCR bits are also added - this patch is only using BESCR_GE,
but the next patches will use the remaining bits.

[1] https://wiki.raptorcs.com/w/images/f/f5/PowerISA_public.v3.1.pdf

Reviewed-by: Matheus Ferst 
Signed-off-by: Daniel Henrique Barboza 
---
 target/ppc/cpu.h   | 13 ++
 target/ppc/excp_helper.c   | 31 
 target/ppc/helper.h|  1 +
 target/ppc/insn32.decode   |  5 
 target/ppc/translate.c |  2 ++
 target/ppc/translate/branch-impl.c.inc | 33 ++
 6 files changed, 85 insertions(+)
 create mode 100644 target/ppc/translate/branch-impl.c.inc

diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index 6c281a4ef4..8f545ff482 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -392,6 +392,19 @@ typedef enum {
 /* PMU uses CTRL_RUN to sample PM_RUN_INST_CMPL */
 #define CTRL_RUN PPC_BIT(63)
 
+/* EBB/BESCR bits */
+/* Global Enable */
+#define BESCR_GE PPC_BIT(0)
+/* External Event-based Exception Enable */
+#define BESCR_EE PPC_BIT(30)
+/* Performance Monitor Event-based Exception Enable */
+#define BESCR_PME PPC_BIT(31)
+/* External Event-based Exception Occurred */
+#define BESCR_EEO PPC_BIT(62)
+/* Performance Monitor Event-based Exception Occurred */
+#define BESCR_PMEO PPC_BIT(63)
+#define BESCR_INVALID PPC_BITMASK(32, 33)
+
 /* LPCR bits */
 #define LPCR_VPM0 PPC_BIT(0)
 #define LPCR_VPM1 PPC_BIT(1)
diff --git a/target/ppc/excp_helper.c b/target/ppc/excp_helper.c
index b7d1767920..7be334e007 100644
--- a/target/ppc/excp_helper.c
+++ b/target/ppc/excp_helper.c
@@ -1248,6 +1248,37 @@ void helper_hrfid(CPUPPCState *env)
 }
 #endif
 
+#if defined(TARGET_PPC64) && !defined(CONFIG_USER_ONLY)
+void helper_rfebb(CPUPPCState *env, target_ulong s)
+{
+target_ulong msr = env->msr;
+
+/*
+ * Handling of BESCR bits 32:33 according to PowerISA v3.1:
+ *
+ * "If BESCR 32:33 != 0b00 the instruction is treated as if
+ *  the instruction form were invalid."
+ */
+if (env->spr[SPR_BESCR] & BESCR_INVALID) {
+raise_exception_err(env, POWERPC_EXCP_PROGRAM,
+POWERPC_EXCP_INVAL | POWERPC_EXCP_INVAL_INVAL);
+}
+
+env->nip = env->spr[SPR_EBBRR];
+
+/* Switching to 32-bit ? Crop the nip */
+if (!msr_is_64bit(env, msr)) {
+env->nip = (uint32_t)env->spr[SPR_EBBRR];
+}
+
+if (s) {
+env->spr[SPR_BESCR] |= BESCR_GE;
+} else {
+env->spr[SPR_BESCR] &= ~BESCR_GE;
+}
+}
+#endif
+
 /*/
 /* Embedded PowerPC specific helpers */
 void helper_40x_rfci(CPUPPCState *env)
diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index 373326203b..593b420f78 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -18,6 +18,7 @@ DEF_HELPER_2(pminsn, void, env, i32)
 DEF_HELPER_1(rfid, void, env)
 DEF_HELPER_1(rfscv, void, env)
 DEF_HELPER_1(hrfid, void, env)
+DEF_HELPER_2(rfebb, void, env, tl)
 DEF_HELPER_2(store_lpcr, void, env, tl)
 DEF_HELPER_2(store_pcr, void, env, tl)
 DEF_HELPER_2(store_mmcr0, void, env, tl)
diff --git a/target/ppc/insn32.decode b/target/ppc/insn32.decode
index 65075f0d03..6edb748940 100644
--- a/target/ppc/insn32.decode
+++ b/target/ppc/insn32.decode
@@ -334,3 +334,8 @@ DSCRIQ  11 . . .. 001100010 .   
@Z22_tap_sh_rc
 ## Vector Bit Manipulation Instruction
 
 VCFUGED 000100 . . . 10101001101@VX
+
+### rfebb
+_s   s:uint8_t
+@XL_s   ..-- s:1 .. -   _s
+RFEBB   010011-- .   0010010010 -   @XL_s
diff --git a/target/ppc/translate.c b/target/ppc/translate.c
index ac91bd4fba..632e064813 100644
--- a/target/ppc/translate.c
+++ b/target/ppc/translate.c
@@ -7460,6 +7460,8 @@ static bool resolve_PLS_D(DisasContext *ctx, arg_D *d, 
arg_PLS_D *a)
 
 #include

[PATCH v5 02/10] target/ppc: PMU basic cycle count for pseries TCG

2021-11-01 Thread Daniel Henrique Barboza

This patch adds the barebones of the PMU logic by enabling cycle
counting. The overall logic goes as follows:

- a helper is added to control the PMU state on each MMCR0 write. This
allows for the PMU to start/stop as the frozen counter bit (MMCR0_FC)
is cleared or set;

- the MMCR0 register's initial value is set to 0x80000000 (MMCR0_FC set) to
avoid having to spin the PMU right at system init;

- to retrieve the events that are being profiled, getPMUEventType() will
check the current MMCR1 value and return the appropriate PMUEventType.
For PMCs 1-4, event 0x2 is the implementation dependent value of
PMU_EVENT_INSTRUCTIONS and event 0x1E is the implementation dependent
value of PMU_EVENT_CYCLES. These events are supported by IBM Power chips
since Power8, at least, and the Linux Perf driver makes use of these
events until kernel v5.15. For PMC1, event 0xF0 is the architected
PowerISA event for cycles. Event 0xFE is the architected PowerISA event
for instructions;

- the intended usage is to freeze the counters by setting MMCR0_FC, do
any additional setting of events to be counted via MMCR1 and enable
the PMU by zeroing MMCR0_FC. Software must freeze counters to read the
results - on the fly reading of the PMCs will return the starting value
of each one. This act of unfreezing the PMU, counting cycles and then
freezing the PMU again is being called a cycle count session.
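
As a rough illustration of the event lookup described in the list above,
the sketch below extracts a PMCnSEL field from MMCR1 and maps it to an
event type. The names mmcr1_pmc_sel() and pmu_event_type() are
hypothetical and are not the helpers from this series; the field layout
assumes PMC1SEL..PMC4SEL occupy IBM bits 32..63, eight bits each, as in
the MMCR1_PMCnSEL_START defines added to cpu.h.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    PMU_EVENT_INVALID = 0,
    PMU_EVENT_CYCLES,
    PMU_EVENT_INSTRUCTIONS,
} PMUEventType;

#define MMCR1_EVT_SIZE 8

/* Hypothetical helper: 8-bit event selector of PMC 'pmc' (1..4). */
static uint8_t mmcr1_pmc_sel(uint64_t mmcr1, int pmc)
{
    assert(pmc >= 1 && pmc <= 4);

    /* IBM numbering: bit 0 is the MSB, PMC1SEL starts at bit 32 */
    int start = 32 + (pmc - 1) * MMCR1_EVT_SIZE;
    /* Convert to a right shift counted from the LSB */
    int shift = 64 - start - MMCR1_EVT_SIZE;

    return (uint8_t)((mmcr1 >> shift) & 0xFF);
}

/* Hypothetical helper: map the selector to the events named above. */
static PMUEventType pmu_event_type(uint64_t mmcr1, int pmc)
{
    switch (mmcr1_pmc_sel(mmcr1, pmc)) {
    case 0x02: /* implementation dependent, PMCs 1-4 */
        return PMU_EVENT_INSTRUCTIONS;
    case 0x1E: /* implementation dependent, PMCs 1-4 */
        return PMU_EVENT_CYCLES;
    case 0xF0: /* architected cycles event, PMC1 only */
        return pmc == 1 ? PMU_EVENT_CYCLES : PMU_EVENT_INVALID;
    case 0xFE: /* architected instructions event, PMC1 only */
        return pmc == 1 ? PMU_EVENT_INSTRUCTIONS : PMU_EVENT_INVALID;
    default:
        return PMU_EVENT_INVALID;
    }
}
```

Because the lookup is done on demand from the current MMCR1 value, no
state has to be refreshed when MMCR1 is written.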

Given that the base CPU frequency is fixed at 1 GHz for both the powernv and
pseries clocks, cycle calculation assumes that 1 nanosecond equals 1 CPU
cycle. The cycle count is then calculated by subtracting the time at which
the PMU started spinning from the time at which the PMU was frozen.

The counter-specific freeze bits MMCR0_FC14 and MMCR0_FC56 were also
added as a means to further control which PMCs are supposed to be
counting cycles during the session.
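
A minimal sketch of the two rules above, under stated assumptions: the
helper names pmc_counts() and session_cycles() are made up for
illustration, PPC_BIT() is written out in IBM bit numbering (bit 0 is
the MSB), and MMCR0_FC is assumed to be PPC_BIT(32). The FC14 and FC56
positions are the ones added to cpu.h by this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PPC_BIT(n)  (0x8000000000000000ULL >> (n))
#define MMCR0_FC    PPC_BIT(32) /* assumption: freeze all counters */
#define MMCR0_FC14  PPC_BIT(58) /* freeze PMC1-4 */
#define MMCR0_FC56  PPC_BIT(59) /* freeze PMC5-6 */

/* Hypothetical helper: is PMC 'pmc' (1..6) counting for this MMCR0? */
static bool pmc_counts(uint64_t mmcr0, int pmc)
{
    assert(pmc >= 1 && pmc <= 6);

    if (mmcr0 & MMCR0_FC) {
        return false;
    }
    if (pmc <= 4) {
        return !(mmcr0 & MMCR0_FC14);
    }
    return !(mmcr0 & MMCR0_FC56);
}

/*
 * Hypothetical helper: cycles elapsed in one cycle count session.
 * With a fixed 1 GHz clock, 1 ns of virtual time is 1 cycle.
 */
static uint64_t session_cycles(uint64_t start_ns, uint64_t freeze_ns)
{
    assert(freeze_ns >= start_ns);
    return freeze_ns - start_ns;
}
```

For example, a session that starts at 1000 ns and is frozen at 4500 ns
adds 3500 cycles to every PMC that was both active and set to a cycle
event.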

Signed-off-by: Daniel Henrique Barboza 
---
 target/ppc/cpu.h |  20 +
 target/ppc/cpu_init.c|   6 +-
 target/ppc/helper.h  |   1 +
 target/ppc/power8-pmu-regs.c.inc |  23 -
 target/ppc/power8-pmu.c  | 142 +++
 target/ppc/spr_tcg.h |   1 +
 6 files changed, 189 insertions(+), 4 deletions(-)

diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index 5aeaee8a9c..6c4643044b 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -360,6 +360,9 @@ typedef enum {
 #define MMCR0_FCECE  PPC_BIT(38) /* FC on Enabled Cond or Event */
 #define MMCR0_PMCC0  PPC_BIT(44) /* PMC Control bit 0 */
 #define MMCR0_PMCC1  PPC_BIT(45) /* PMC Control bit 1 */
+#define MMCR0_PMCC   PPC_BITMASK(44, 45) /* PMC Control */
+#define MMCR0_FC14   PPC_BIT(58) /* PMC Freeze Counters 1-4 bit */
+#define MMCR0_FC56   PPC_BIT(59) /* PMC Freeze Counters 5-6 bit */
 /* MMCR0 userspace r/w mask */
 #define MMCR0_UREG_MASK (MMCR0_FC | MMCR0_PMAO | MMCR0_PMAE)
 /* MMCR2 userspace r/w mask */
@@ -372,6 +375,17 @@ typedef enum {
 #define MMCR2_UREG_MASK (MMCR2_FC1P0 | MMCR2_FC2P0 | MMCR2_FC3P0 | \
  MMCR2_FC4P0 | MMCR2_FC5P0 | MMCR2_FC6P0)
 
+#define MMCR1_EVT_SIZE 8
+/* extract64() does a right shift before extracting */
+#define MMCR1_PMC1SEL_START 32
+#define MMCR1_PMC1EVT_EXTR (64 - MMCR1_PMC1SEL_START - MMCR1_EVT_SIZE)
+#define MMCR1_PMC2SEL_START 40
+#define MMCR1_PMC2EVT_EXTR (64 - MMCR1_PMC2SEL_START - MMCR1_EVT_SIZE)
+#define MMCR1_PMC3SEL_START 48
+#define MMCR1_PMC3EVT_EXTR (64 - MMCR1_PMC3SEL_START - MMCR1_EVT_SIZE)
+#define MMCR1_PMC4SEL_START 56
+#define MMCR1_PMC4EVT_EXTR (64 - MMCR1_PMC4SEL_START - MMCR1_EVT_SIZE)
+
 /* LPCR bits */
 #define LPCR_VPM0 PPC_BIT(0)
 #define LPCR_VPM1 PPC_BIT(1)
@@ -1206,6 +1220,12 @@ struct CPUPPCState {
  * when counting cycles.
  */
 QEMUTimer *pmu_cyc_overflow_timers[PMU_TIMERS_NUM];
+
+/*
+ * PMU base time value used by the PMU to calculate
+ * running cycles.
+ */
+uint64_t pmu_base_time;
 };
 
 #define SET_FIT_PERIOD(a_, b_, c_, d_)  \
diff --git a/target/ppc/cpu_init.c b/target/ppc/cpu_init.c
index 65545ba9ca..6c384c3ac2 100644
--- a/target/ppc/cpu_init.c
+++ b/target/ppc/cpu_init.c
@@ -6820,8 +6820,8 @@ static void register_book3s_pmu_sup_sprs(CPUPPCState *env)
 {
 spr_register_kvm(env, SPR_POWER_MMCR0, "MMCR0",
  SPR_NOACCESS, SPR_NOACCESS,
- &spr_read_generic, &spr_write_generic,
- KVM_REG_PPC_MMCR0, 0x00000000);
+ &spr_read_generic, &spr_write_MMCR0,
+ KVM_REG_PPC_MMCR0, 0x80000000);
 spr_register_kvm(env, SPR_POWER_MMCR1, "MMCR1",
  SPR_NOACCESS, SPR_NOACCESS,
  &spr_read_generic, &spr_write_generic,
@@ -6869,7 +6869,7 @@ static void register_book3s_pmu_user_sprs(CPUPPCState 
*env)
 spr_register(env, SPR_POWER_UMMCR0, "UMMCR0",
  &spr_read_MMCR0_ureg, &spr_write_MMCR0_ureg,
  &spr_read_ureg, &spr_write_ureg,
-

[PATCH v5 00/10] PMU-EBB support for PPC64 TCG

2021-11-01 Thread Daniel Henrique Barboza

Hi,

In this new version the concept of PMUEvent was removed. We're now
using only the PMUEventType enum and retrieving it on demand via a
new helper called getPMUEventType. This also means that we're not
trapping MMCR1 writes.

Changes from v4:
- patches 1-4 from v4: already upstream
- former patch 6 (initialize PMUEvents on MMCR1 write): removed
- patch 1 (former 6):
  * removed PMUEvent type
  * overflow timers are back to CPUPPCState
- patch 2 (former 7):
  * added a new getPMUEventType() function
- other patches were changed to accommodate the changes in patch 1 and 2
- v4 link: https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg03710.html

Daniel Henrique Barboza (9):
  target/ppc: introduce PMUEventType and PMU overflow timers
  target/ppc: PMU basic cycle count for pseries TCG
  target/ppc: enable PMU counter overflow with cycle events
  target/ppc: enable PMU instruction count
  target/ppc/power8-pmu.c: add PM_RUN_INST_CMPL (0xFA) event
  target/ppc: PMU: handle setting of PMCs while running
  target/ppc/power8-pmu.c: handle overflow bits when PMU is running
  PPC64/TCG: Implement 'rfebb' instruction
  target/ppc/excp_helper.c: EBB handling adjustments

Gustavo Romero (1):
  target/ppc: PMU Event-Based exception support

 hw/ppc/spapr_cpu_core.c|   6 +
 target/ppc/cpu.h   |  60 +++-
 target/ppc/cpu_init.c  |  20 +-
 target/ppc/excp_helper.c   |  92 ++
 target/ppc/helper.h|   4 +
 target/ppc/helper_regs.c   |   4 +
 target/ppc/insn32.decode   |   5 +
 target/ppc/meson.build |   1 +
 target/ppc/power8-pmu-regs.c.inc   |  45 ++-
 target/ppc/power8-pmu.c| 403 +
 target/ppc/power8-pmu.h|  25 ++
 target/ppc/spr_tcg.h   |   3 +
 target/ppc/translate.c |  60 
 target/ppc/translate/branch-impl.c.inc |  33 ++
 14 files changed, 748 insertions(+), 13 deletions(-)
 create mode 100644 target/ppc/power8-pmu.c
 create mode 100644 target/ppc/power8-pmu.h
 create mode 100644 target/ppc/translate/branch-impl.c.inc

-- 
2.31.1

[PATCH v5 01/10] target/ppc: introduce PMUEventType and PMU overflow timers

2021-11-01 Thread Daniel Henrique Barboza

This patch starts an IBM Power8+ compatible PMU implementation by adding
the representation of PMU events that we are going to sample,
PMUEventType. This enum represents a Perf event that is being sampled by
a specific counter 'sprn'. Events that aren't available (i.e. no event
was set in MMCR1) will be of type 'PMU_EVENT_INVALID'. Other types added
in this patch are PMU_EVENT_CYCLES and PMU_EVENT_INSTRUCTIONS. More
types will be added later on.

Let's also add the required PMU cycle overflow timers. They will be used
to trigger cycle overflows when cycle events are being sampled. This
timer will call cpu_ppc_pmu_timer_cb(), which in turn calls
fire_PMC_interrupt().  Both functions are stubs that will be implemented
later on when EBB support is added.

Two new helper files are created to host this new logic.
cpu_ppc_pmu_init() will init all overflow timers during CPU init time.

Signed-off-by: Daniel Henrique Barboza 
---
 hw/ppc/spapr_cpu_core.c |  6 +
 target/ppc/cpu.h| 15 +++
 target/ppc/meson.build  |  1 +
 target/ppc/power8-pmu.c | 57 +
 target/ppc/power8-pmu.h | 25 ++
 5 files changed, 104 insertions(+)
 create mode 100644 target/ppc/power8-pmu.c
 create mode 100644 target/ppc/power8-pmu.h

diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
index 58e7341cb7..45abffd891 100644
--- a/hw/ppc/spapr_cpu_core.c
+++ b/hw/ppc/spapr_cpu_core.c
@@ -20,6 +20,7 @@
 #include "target/ppc/kvm_ppc.h"
 #include "hw/ppc/ppc.h"
 #include "target/ppc/mmu-hash64.h"
+#include "target/ppc/power8-pmu.h"
 #include "sysemu/numa.h"
 #include "sysemu/reset.h"
 #include "sysemu/hw_accel.h"
@@ -266,6 +267,11 @@ static bool spapr_realize_vcpu(PowerPCCPU *cpu, 
SpaprMachineState *spapr,
 return false;
 }
 
+/* Init PMU interrupt timer (TCG only) */
+if (!kvm_enabled()) {
+cpu_ppc_pmu_init(env);
+}
+
 if (!sc->pre_3_0_migration) {
  vmstate_register(NULL, cs->cpu_index, &vmstate_spapr_cpu_state,
  cpu->machine_data);
diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index 0472ec9154..5aeaee8a9c 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -296,6 +296,15 @@ typedef struct ppc_v3_pate_t {
 uint64_t dw1;
 } ppc_v3_pate_t;
 
+/* PMU related structs and defines */
+#define PMU_COUNTERS_NUM 6
+#define PMU_TIMERS_NUM   (PMU_COUNTERS_NUM - 1) /* PMC5 doesn't count cycles */
+typedef enum {
+PMU_EVENT_INVALID = 0,
+PMU_EVENT_CYCLES,
+PMU_EVENT_INSTRUCTIONS,
+} PMUEventType;
+
 /*/
 /* Machine state register bits definition*/
 #define MSR_SF   63 /* Sixty-four-bit modehflags */
@@ -1191,6 +1200,12 @@ struct CPUPPCState {
 uint32_t tm_vscr;
 uint64_t tm_dscr;
 uint64_t tm_tar;
+
+/*
+ * Timers used to fire performance monitor alerts
+ * when counting cycles.
+ */
+QEMUTimer *pmu_cyc_overflow_timers[PMU_TIMERS_NUM];
 };
 
 #define SET_FIT_PERIOD(a_, b_, c_, d_)  \
diff --git a/target/ppc/meson.build b/target/ppc/meson.build
index b85f295703..a49a8911e0 100644
--- a/target/ppc/meson.build
+++ b/target/ppc/meson.build
@@ -51,6 +51,7 @@ ppc_softmmu_ss.add(when: 'TARGET_PPC64', if_true: files(
   'mmu-book3s-v3.c',
   'mmu-hash64.c',
   'mmu-radix64.c',
+  'power8-pmu.c',
 ))
 
 target_arch += {'ppc': ppc_ss}
diff --git a/target/ppc/power8-pmu.c b/target/ppc/power8-pmu.c
new file mode 100644
index 0000000000..3c2f73896f
--- /dev/null
+++ b/target/ppc/power8-pmu.c
@@ -0,0 +1,57 @@
+/*
+ * PMU emulation helpers for TCG IBM POWER chips
+ *
+ *  Copyright IBM Corp. 2021
+ *
+ * Authors:
+ *  Daniel Henrique Barboza  
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+
+#include "power8-pmu.h"
+#include "cpu.h"
+#include "helper_regs.h"
+#include "exec/exec-all.h"
+#include "exec/helper-proto.h"
+#include "qemu/error-report.h"
+#include "qemu/main-loop.h"
+#include "hw/ppc/ppc.h"
+
+#if defined(TARGET_PPC64) && !defined(CONFIG_USER_ONLY)
+
+static void fire_PMC_interrupt(PowerPCCPU *cpu)
+{
+CPUPPCState *env = &cpu->env;
+
+if (!(env->spr[SPR_POWER_MMCR0] & MMCR0_EBE)) {
+return;
+}
+
+/* PMC interrupt not implemented yet */
+return;
+}
+
+static void cpu_ppc_pmu_timer_cb(void *opaque)
+{
+PowerPCCPU *cpu = opaque;
+
+fire_PMC_interrupt(cpu);
+}
+
+void cpu_ppc_pmu_init(CPUPPCState *env)
+{
+PowerPCCPU *cpu = env_archcpu(env);
+int i;
+
+for (i = 0; i < PMU_TIMERS_NUM; i++) {
+env->pmu_cyc_overflow_timers[i] = timer_new_ns(QEMU_CLOCK_VIRTUAL,
+   &cpu_ppc_pmu_timer_cb,
+   cpu);
+}
+}
+
+#endif /* defined(TARGET_PPC64) &&

[PATCH v5 05/10] target/ppc/power8-pmu.c: add PM_RUN_INST_CMPL (0xFA) event

2021-11-01 Thread Daniel Henrique Barboza

PM_RUN_INST_CMPL, instructions completed with the run latch set, is
the architected PowerISA v3.1 event defined with PMC4SEL = 0xFA.

Implement it by checking for the CTRL RUN bit before incrementing the
counter. To make this work properly we also need to force a new
translation block each time SPR_CTRL is written. A small tweak in
pmu_increment_insns() is then needed to only increment this event
if the thread has the run latch.
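
The counting rule can be sketched as a pure function. This is only an
illustration under stated assumptions: pmc_count_insns() is a
hypothetical name, not the pmu_increment_insns() from the patch, and
the overflow handling that follows the increment is left out. CTRL_RUN
is PPC_BIT(63), as defined in cpu.h below.

```c
#include <stdint.h>

#define PPC_BIT(n) (0x8000000000000000ULL >> (n))
#define CTRL_RUN   PPC_BIT(63) /* run latch bit of SPR_CTRL */

typedef enum {
    PMU_EVENT_INVALID = 0,
    PMU_EVENT_CYCLES,
    PMU_EVENT_INSTRUCTIONS,
    PMU_EVENT_INSN_RUN_LATCH,
} PMUEventType;

/*
 * Hypothetical helper: return the new counter value after 'num_insns'
 * instructions completed, given the event type and the CTRL value.
 */
static uint64_t pmc_count_insns(uint64_t pmc, PMUEventType evt,
                                uint64_t ctrl, uint32_t num_insns)
{
    switch (evt) {
    case PMU_EVENT_INSTRUCTIONS:
        /* Plain instruction events always count */
        return pmc + num_insns;
    case PMU_EVENT_INSN_RUN_LATCH:
        /* PM_RUN_INST_CMPL counts only while the run latch is set */
        return (ctrl & CTRL_RUN) ? pmc + num_insns : pmc;
    default:
        /* Cycle and invalid events do not count instructions */
        return pmc;
    }
}
```

Since the increment is applied per translation block, ending the block
on every SPR_CTRL write keeps instructions executed before and after a
run latch change from being attributed to the wrong state.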

Signed-off-by: Daniel Henrique Barboza 
---
 target/ppc/cpu.h|  4 
 target/ppc/cpu_init.c   |  2 +-
 target/ppc/power8-pmu.c | 25 ++---
 target/ppc/spr_tcg.h|  1 +
 target/ppc/translate.c  | 12 
 5 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index f965436d19..6c281a4ef4 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -303,6 +303,7 @@ typedef enum {
 PMU_EVENT_INVALID = 0,
 PMU_EVENT_CYCLES,
 PMU_EVENT_INSTRUCTIONS,
+PMU_EVENT_INSN_RUN_LATCH,
 } PMUEventType;
 
 /*/
@@ -388,6 +389,9 @@ typedef enum {
 #define MMCR1_PMC4SEL_START 56
 #define MMCR1_PMC4EVT_EXTR (64 - MMCR1_PMC4SEL_START - MMCR1_EVT_SIZE)
 
+/* PMU uses CTRL_RUN to sample PM_RUN_INST_CMPL */
+#define CTRL_RUN PPC_BIT(63)
+
 /* LPCR bits */
 #define LPCR_VPM0 PPC_BIT(0)
 #define LPCR_VPM1 PPC_BIT(1)
diff --git a/target/ppc/cpu_init.c b/target/ppc/cpu_init.c
index 6c384c3ac2..e6f3ff9b96 100644
--- a/target/ppc/cpu_init.c
+++ b/target/ppc/cpu_init.c
@@ -6748,7 +6748,7 @@ static void register_book3s_ctrl_sprs(CPUPPCState *env)
 {
 spr_register(env, SPR_CTRL, "SPR_CTRL",
  SPR_NOACCESS, SPR_NOACCESS,
- SPR_NOACCESS, &spr_write_generic,
+ SPR_NOACCESS, &spr_write_CTRL,
  0x00000000);
 spr_register(env, SPR_UCTRL, "SPR_UCTRL",
  &spr_read_ureg, SPR_NOACCESS,
diff --git a/target/ppc/power8-pmu.c b/target/ppc/power8-pmu.c
index 5f90828aed..3751b6de55 100644
--- a/target/ppc/power8-pmu.c
+++ b/target/ppc/power8-pmu.c
@@ -70,6 +70,15 @@ static PMUEventType getPMUEventType(CPUPPCState *env, int 
sprn)
 evt_type = PMU_EVENT_CYCLES;
 }
 break;
+case 0xFA:
+/*
+ * PMC4SEL = 0xFA is the "instructions completed
+ * with run latch set" event.
+ */
+if (sprn == SPR_POWER_PMC4) {
+evt_type = PMU_EVENT_INSN_RUN_LATCH;
+}
+break;
 case 0xFE:
 /*
  * PMC1SEL = 0xFE is the architected PowerISA v3.1
@@ -111,12 +120,22 @@ static bool pmu_increment_insns(CPUPPCState *env, 
uint32_t num_insns)
 
 /* PMC6 never counts instructions */
 for (sprn = SPR_POWER_PMC1; sprn <= SPR_POWER_PMC5; sprn++) {
-if (!pmc_is_active(env, sprn) ||
-getPMUEventType(env, sprn) != PMU_EVENT_INSTRUCTIONS) {
+PMUEventType evt_type = getPMUEventType(env, sprn);
+bool insn_event = evt_type == PMU_EVENT_INSTRUCTIONS ||
+  evt_type == PMU_EVENT_INSN_RUN_LATCH;
+
+if (!pmc_is_active(env, sprn) || !insn_event) {
 continue;
 }
 
-env->spr[sprn] += num_insns;
+if (evt_type == PMU_EVENT_INSTRUCTIONS) {
+env->spr[sprn] += num_insns;
+}
+
+if (evt_type == PMU_EVENT_INSN_RUN_LATCH &&
+env->spr[SPR_CTRL] & CTRL_RUN) {
+env->spr[sprn] += num_insns;
+}
 
 if (env->spr[sprn] >= COUNTER_NEGATIVE_VAL &&
 pmc_has_overflow_enabled(env, sprn)) {
diff --git a/target/ppc/spr_tcg.h b/target/ppc/spr_tcg.h
index eb1d0c2bf0..fdc6adfc31 100644
--- a/target/ppc/spr_tcg.h
+++ b/target/ppc/spr_tcg.h
@@ -26,6 +26,7 @@ void spr_noaccess(DisasContext *ctx, int gprn, int sprn);
 void spr_read_generic(DisasContext *ctx, int gprn, int sprn);
 void spr_write_generic(DisasContext *ctx, int sprn, int gprn);
 void spr_write_MMCR0(DisasContext *ctx, int sprn, int gprn);
+void spr_write_CTRL(DisasContext *ctx, int sprn, int gprn);
 void spr_read_xer(DisasContext *ctx, int gprn, int sprn);
 void spr_write_xer(DisasContext *ctx, int sprn, int gprn);
 void spr_read_lr(DisasContext *ctx, int gprn, int sprn);
diff --git a/target/ppc/translate.c b/target/ppc/translate.c
index 01bacb573d..ac91bd4fba 100644
--- a/target/ppc/translate.c
+++ b/target/ppc/translate.c
@@ -403,6 +403,18 @@ void spr_write_generic(DisasContext *ctx, int sprn, int 
gprn)
 spr_store_dump_spr(sprn);
 }
 
+void spr_write_CTRL(DisasContext *ctx, int sprn, int gprn)
+{
+spr_write_generic(ctx, sprn, gprn);
+
+/*
+ * SPR_CTRL writes must force a new translation block,
+ * allowing the PMU to calculate the run latch events with
+ * more accuracy.
+ */
+ctx->base.is_jmp = DISAS_EXIT_UPDATE;
+}
+
 #if !defined(CONFIG_USER_ONLY)
 void spr_write_generic32(DisasContext *ctx, int sprn, int gprn)
 {
-- 
2.31.1

[PULL 9/9] hw/i386: fix vmmouse registration

2021-11-01 Thread Michael S. Tsirkin

From: Pavel Dovgalyuk 

According to the logic of vmmouse_update_handler function,
vmmouse should be registered as an event handler when
its status is zero.
vmmouse_read_id resets the status but does not register
the handler.
This patch adds vmmouse registration and activation when
status is reset.

Signed-off-by: Pavel Dovgalyuk 
Message-Id: 
<163524204515.1914131.16465061981774791228.stgit@pasha-ThinkPad-X280>
Signed-off-by: Michael S. Tsirkin 
---
 hw/i386/vmmouse.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/i386/vmmouse.c b/hw/i386/vmmouse.c
index df4798f502..3d66368286 100644
--- a/hw/i386/vmmouse.c
+++ b/hw/i386/vmmouse.c
@@ -158,6 +158,7 @@ static void vmmouse_read_id(VMMouseState *s)
 
 s->queue[s->nb_queue++] = VMMOUSE_VERSION;
 s->status = 0;
+vmmouse_update_handler(s, s->absolute);
 }
 
 static void vmmouse_request_relative(VMMouseState *s)
-- 
MST

[PATCH v5 10/10] target/ppc/excp_helper.c: EBB handling adjustments

2021-11-01 Thread Daniel Henrique Barboza

The current logic is only considering event-based exceptions triggered
by the performance monitor. This is true now, but we might want to add
support for external event-based exceptions in the future.

Let's make it a bit easier to do so by adding the bit logic that would
happen in case we were dealing with an external event-based exception.

While we're at it, add a few comments explaining why we're setting and
clearing BESCR bits.
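
The intended bit handling can be sketched outside of QEMU as follows.
This is a sketch under stated assumptions, not the code from the diff:
EbbState and ebb_deliver() are hypothetical, the MMCR0_PMAO state is
passed in as a boolean, and the BESCR bit positions (GE = bit 0,
EE = bit 30, PME = bit 31, EEO = bit 62, PMEO = bit 63, IBM numbering)
are assumed rather than taken from this patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define PPC_BIT(n)  (0x8000000000000000ULL >> (n))
/* Assumed BESCR bit positions, see the note above */
#define BESCR_GE    PPC_BIT(0)
#define BESCR_EE    PPC_BIT(30)
#define BESCR_PME   PPC_BIT(31)
#define BESCR_EEO   PPC_BIT(62)
#define BESCR_PMEO  PPC_BIT(63)

/* Hypothetical container for the registers involved */
typedef struct {
    uint64_t bescr;
    uint64_t ebbrr;
    uint64_t nip;
} EbbState;

/* Returns true if the event-based branch was taken. */
static bool ebb_deliver(EbbState *s, bool pmao, uint64_t ebbhr)
{
    if (!(s->bescr & BESCR_GE)) {
        return false;
    }

    /* Performance monitor event: PME -> PMEO */
    if ((s->bescr & BESCR_PME) && pmao) {
        s->bescr &= ~BESCR_PME;
        s->bescr |= BESCR_PMEO;
    }

    /* External event, only if no PM event was just recorded */
    if ((s->bescr & BESCR_EE) && !(s->bescr & BESCR_PMEO)) {
        s->bescr &= ~BESCR_EE;
        s->bescr |= BESCR_EEO;
    }

    /* Clear GE, save NIP for rfebb, jump to the handler */
    s->bescr &= ~BESCR_GE;
    s->ebbrr = s->nip;
    s->nip = ebbhr;
    return true;
}
```

With GE and PME set and a pending alert, the call leaves only PMEO set
in BESCR; software is then expected to clear PMEO and re-enable GE,
typically via rfebb.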

Signed-off-by: Daniel Henrique Barboza 
---
 target/ppc/excp_helper.c | 45 ++--
 1 file changed, 39 insertions(+), 6 deletions(-)

diff --git a/target/ppc/excp_helper.c b/target/ppc/excp_helper.c
index 88aa0a84f8..d30020d991 100644
--- a/target/ppc/excp_helper.c
+++ b/target/ppc/excp_helper.c
@@ -798,14 +798,47 @@ static inline void powerpc_excp(PowerPCCPU *cpu, int 
excp_model, int excp)
   "is not implemented yet !\n");
 break;
 case POWERPC_EXCP_EBB:   /* Event-based branch exception */
-if ((env->spr[SPR_BESCR] & BESCR_GE) &&
-(env->spr[SPR_BESCR] & BESCR_PME)) {
+if (env->spr[SPR_BESCR] & BESCR_GE) {
 target_ulong nip;
 
-env->spr[SPR_BESCR] &= ~BESCR_GE;   /* Clear GE */
-env->spr[SPR_BESCR] |= BESCR_PMEO;  /* Set PMEO */
-env->spr[SPR_EBBRR] = env->nip; /* Save NIP for rfebb insn */
-nip = env->spr[SPR_EBBHR];  /* EBB handler */
+/*
+ * If we have Performance Monitor Event-Based exception
+ * enabled (BESCR_PME) and a Performance Monitor alert
+ * occurred (MMCR0_PMAO), clear BESCR_PME and set BESCR_PMEO
+ * (Performance Monitor Event-Based Exception Occurred).
+ *
+ * Software is responsible for clearing both BESCR_PMEO and
+ * MMCR0_PMAO after the event has been handled.
+ */
+if ((env->spr[SPR_BESCR] & BESCR_PME) &&
+(env->spr[SPR_POWER_MMCR0] & MMCR0_PMAO)) {
+env->spr[SPR_BESCR] &= ~BESCR_PME;
+env->spr[SPR_BESCR] |= BESCR_PMEO;
+}
+
+/*
+ * In the case of External Event-Based exceptions, do a
+ * similar logic with BESCR_EE and BESCR_EEO. BESCR_EEO must
+ * also be cleared by software.
+ *
+ * PowerISA 3.1 considers that we'll not have BESCR_PMEO and
+ * BESCR_EEO set at the same time. We can check for BESCR_PMEO
+ * being not set in step above to see if this exception was
+ * triggered by an external event.
+ */
+if (env->spr[SPR_BESCR] & BESCR_EE &&
+!(env->spr[SPR_BESCR] & BESCR_PMEO)) {
+env->spr[SPR_BESCR] &= ~BESCR_EE;
+env->spr[SPR_BESCR] |= BESCR_EEO;
+}
+
+/*
+ * Clear BESCR_GE, save NIP for 'rfebb' and point the
+ * execution to the event handler (SPR_EBBHR) address.
+ */
+env->spr[SPR_BESCR] &= ~BESCR_GE;
+env->spr[SPR_EBBRR] = env->nip;
+nip = env->spr[SPR_EBBHR];
 powerpc_set_excp_state(cpu, nip, env->msr);
 }
 /*
-- 
2.31.1

[PATCH v5 07/10] target/ppc/power8-pmu.c: handle overflow bits when PMU is running

2021-11-01 Thread Daniel Henrique Barboza

Up until this moment we were assuming that the counter negative
enabled bits, PMC1CE and PMCjCE, would never be changed when the
PMU is already started.

Turns out that there is no such restriction in the PowerISA v3.1,
and software can enable/disable overflow conditions of the counters
at any time.

To support this scenario, track the state of the overflow bits when a
write to MMCR0 is made in which the run state of the PMU (MMCR0_FC
bit) didn't change and, if any overflow bit was changed in the
middle of a cycle count session, restart the session.
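
The decision described above can be expressed as a small predicate.
The sketch below is illustrative only: mmcr0_write_needs_restart() is a
hypothetical name, MMCR0_FC is assumed to be PPC_BIT(32), and the
PMC1CE/PMCjCE positions are the ones defined earlier in this series.

```c
#include <stdbool.h>
#include <stdint.h>

#define PPC_BIT(n)    (0x8000000000000000ULL >> (n))
#define MMCR0_FC      PPC_BIT(32) /* assumption: freeze counters */
#define MMCR0_PMC1CE  PPC_BIT(48) /* PMC1 condition enabled */
#define MMCR0_PMCjCE  PPC_BIT(49) /* PMCj condition enabled */

/*
 * Hypothetical predicate: does a write changing MMCR0 from 'old_val'
 * to 'new_val' require restarting the cycle count session?
 */
static bool mmcr0_write_needs_restart(uint64_t old_val, uint64_t new_val)
{
    bool fc_old = old_val & MMCR0_FC;
    bool fc_new = new_val & MMCR0_FC;

    /* A change of FC is handled by the regular start/stop path */
    if (fc_old != fc_new) {
        return false;
    }

    /* PMU frozen: no timers are running, nothing to restart */
    if (fc_old) {
        return false;
    }

    /* Running: restart only if an overflow enable bit flipped */
    return ((old_val ^ new_val) & (MMCR0_PMC1CE | MMCR0_PMCjCE)) != 0;
}
```

When the predicate is true, the counters are first updated with the
cycles counted so far and a new session is started so the overflow
timers match the new enable bits.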

Signed-off-by: Daniel Henrique Barboza 
---
 target/ppc/power8-pmu.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/target/ppc/power8-pmu.c b/target/ppc/power8-pmu.c
index d66266829f..aa10233b29 100644
--- a/target/ppc/power8-pmu.c
+++ b/target/ppc/power8-pmu.c
@@ -288,6 +288,30 @@ void helper_store_mmcr0(CPUPPCState *env, target_ulong 
value)
 } else {
 start_cycle_count_session(env);
 }
+} else {
+/*
+ * No change in MMCR0_FC state, but if the PMU is running and
+ * a change in the counter negative overflow bits is made,
+ * we need to restart a new cycle count session to restart
+ * the appropriate overflow timers.
+ */
+if (curr_FC) {
+return;
+}
+
+bool pmc1ce_curr = curr_value & MMCR0_PMC1CE;
+bool pmc1ce_new  = value & MMCR0_PMC1CE;
+bool pmcjce_curr = curr_value & MMCR0_PMCjCE;
+bool pmcjce_new  = value & MMCR0_PMCjCE;
+
+if (pmc1ce_curr == pmc1ce_new && pmcjce_curr == pmcjce_new) {
+return;
+}
+
+/* Update the counter with the events counted so far */
+pmu_update_cycles(env);
+
+start_cycle_count_session(env);
 }
 }
 
-- 
2.31.1

Re: [PATCH v2] hw/arm/virt: Expose empty NUMA nodes through ACPI

2021-11-01 Thread Gavin Shan


On 11/1/21 7:44 PM, Igor Mammedov wrote:

On Thu, 28 Oct 2021 22:32:09 +1100
Gavin Shan  wrote: 

On 10/28/21 2:40 AM, Igor Mammedov wrote:

On Wed, 27 Oct 2021 13:29:58 +0800
Gavin Shan  wrote:
   

The empty NUMA nodes, where no memory resides, aren't exposed
through ACPI SRAT table. It's not user preferred behaviour because
the corresponding memory node devices are missed from the guest
kernel as the following example shows. It means the guest kernel
doesn't have the node information as user specifies. However,
memory can be still hot added to these empty NUMA nodes when
they're not exposed.

/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
-accel kvm -machine virt,gic-version=host   \
-cpu host -smp 4,sockets=2,cores=2,threads=1\
-m 1024M,slots=16,maxmem=64G\
-object memory-backend-ram,id=mem0,size=512M\
-object memory-backend-ram,id=mem1,size=512M\
-numa node,nodeid=0,cpus=0-1,memdev=mem0\
-numa node,nodeid=1,cpus=2-3,memdev=mem1\
-numa node,nodeid=2 \
-numa node,nodeid=3 \
   :
guest# ls /sys/devices/system/node | grep node
node0
node1
(qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
(qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
guest# ls /sys/devices/system/node | grep node
node0
node1
node2
guest# cat /sys/devices/system/node/node2/meminfo | grep MemTotal
Node 2 MemTotal:1048576 kB

This exposes these empty NUMA nodes through ACPI SRAT table. With
this applied, the corresponding memory node devices can be found
from the guest. Note that the hotpluggable capability is explicitly
given to these empty NUMA nodes for sake of completeness.

guest# ls /sys/devices/system/node | grep node
node0
node1
node2
node3
guest# cat /sys/devices/system/node/node3/meminfo | grep MemTotal
Node 3 MemTotal:0 kB
(qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
(qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
guest# cat /sys/devices/system/node/node3/meminfo | grep MemTotal
Node 3 MemTotal:1048576 kB


I'm still not sure why this is necessary and if it's a good idea,
is there real hardware that has such nodes?

SRAT is used to assign resources to nodes, I haven't seen it being
used as a means to describe an empty node anywhere in the spec.
(perhaps we should not allow empty nodes on QEMU CLI at all).

Then if we really need this, why is it done for ARM only
and not for x86?
   


I think this case exists in real hardware where the memory DIMM
isn't plugged, but the node is still probed.

Then please, provide SRAT table from such hw
(a lot of them (to justify it as defacto 'standard')?
since such hw firmware could be buggy as well).

BTW, fake memory node doesn't have to be present to make guest
notice an existing numa node. it can be represented by affinity
entries as well (see chapter:System Resource Affinity Table (SRAT)
in the spec).

At the moment, I'm totally unconvinced that empty numa nodes
are valid to provide.



Igor, thanks for your continuous review. I don't have a strong sense
that the fake nodes should be presented. So please ignore this patch
until it's needed by virtio-mem. At that time, I can revisit this.
More context is provided below to make the discussion complete.




Besides, this patch
addresses two issues:

(1) To make the information contained in guest kernel consistent
  to the command line as the user expects. It means the sysfs
  entries for these empty NUMA nodes in guest kernel reflects
  what user provided.

-numa/SRAT describe boot time configuration.
So if you do not specify empty nodes on CLI, then number of nodes
would be consistent.



Correct.


(2) Without this patch, the node number can be twisted from user's
  perspective. As the example included in the commit log, node3
  should be created, but node2 is actually created. The patch
  reserves the NUMA node IDs in advance to avoid the issue.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
 :
  -numa node,nodeid=0,cpus=0-1,memdev=mem0\
  -numa node,nodeid=1,cpus=2-3,memdev=mem1\
  -numa node,nodeid=2 \
  -numa node,nodeid=3 \
  guest# ls /sys/devices/system/node | grep node
  node0  node1
  (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
  (qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
  guest# ls /sys/devices/system/node | grep node
  node0  node1  node2


The same node numbering on guest side and QEMU CLI works only
by accident not by design. In short numbers may not match
(in linux kernel case it depends on the order the nodes
are enumerated), if you

[PULL 8/9] pci: Export pci_for_each_device_under_bus*()

2021-11-01 Thread Michael S. Tsirkin

From: Peter Xu 

They're actually more commonly used than the helper without _under_bus, because
most callers do have the pci bus on hand.  After exporting we can switch a lot
of the call sites to use these two helpers.

Reviewed-by: David Hildenbrand 
Reviewed-by: Eric Auger 
Signed-off-by: Peter Xu 
Message-Id: <20211028043129.38871-3-pet...@redhat.com>
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
Acked-by: David Gibson 
---
 include/hw/pci/pci.h   |  5 +
 hw/i386/acpi-build.c   |  5 ++---
 hw/pci/pci.c   | 10 +-
 hw/pci/pcie.c  |  4 +---
 hw/ppc/spapr_pci.c | 12 +---
 hw/ppc/spapr_pci_nvlink2.c |  7 +++
 hw/ppc/spapr_pci_vfio.c|  4 ++--
 hw/s390x/s390-pci-bus.c|  5 ++---
 hw/xen/xen_pt.c|  4 ++--
 9 files changed, 27 insertions(+), 29 deletions(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 4a8740b76b..5c4016b995 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -467,6 +467,11 @@ void pci_for_each_device(PCIBus *bus, int bus_num,
 void pci_for_each_device_reverse(PCIBus *bus, int bus_num,
  pci_bus_dev_fn fn,
  void *opaque);
+void pci_for_each_device_under_bus(PCIBus *bus,
+   pci_bus_dev_fn fn, void *opaque);
+void pci_for_each_device_under_bus_reverse(PCIBus *bus,
+   pci_bus_dev_fn fn,
+   void *opaque);
 void pci_for_each_bus_depth_first(PCIBus *bus, pci_bus_ret_fn begin,
   pci_bus_fn end, void *parent_state);
 PCIDevice *pci_get_function_0(PCIDevice *pci_dev);
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 3ca6cc8118..a3ad6abd33 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2134,8 +2134,7 @@ dmar_host_bridges(Object *obj, void *opaque)
 PCIBus *bus = PCI_HOST_BRIDGE(obj)->bus;
 
 if (bus && !pci_bus_bypass_iommu(bus)) {
-pci_for_each_device(bus, pci_bus_num(bus), insert_scope,
-scope_blob);
+pci_for_each_device_under_bus(bus, insert_scope, scope_blob);
 }
 }
 
@@ -2341,7 +2340,7 @@ ivrs_host_bridges(Object *obj, void *opaque)
 PCIBus *bus = PCI_HOST_BRIDGE(obj)->bus;
 
 if (bus && !pci_bus_bypass_iommu(bus)) {
-pci_for_each_device(bus, pci_bus_num(bus), insert_ivhd, ivhd_blob);
+pci_for_each_device_under_bus(bus, insert_ivhd, ivhd_blob);
 }
 }
 
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 17e59cb3a3..4a84e478ce 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -1654,9 +1654,9 @@ static const pci_class_desc pci_class_descriptions[] =
 { 0, NULL}
 };
 
-static void pci_for_each_device_under_bus_reverse(PCIBus *bus,
-  pci_bus_dev_fn fn,
-  void *opaque)
+void pci_for_each_device_under_bus_reverse(PCIBus *bus,
+   pci_bus_dev_fn fn,
+   void *opaque)
 {
 PCIDevice *d;
 int devfn;
@@ -1679,8 +1679,8 @@ void pci_for_each_device_reverse(PCIBus *bus, int bus_num,
 }
 }
 
-static void pci_for_each_device_under_bus(PCIBus *bus,
-  pci_bus_dev_fn fn, void *opaque)
+void pci_for_each_device_under_bus(PCIBus *bus,
+   pci_bus_dev_fn fn, void *opaque)
 {
 PCIDevice *d;
 int devfn;
diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index 6e95d82903..914a9bf3d1 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -694,9 +694,7 @@ void pcie_cap_slot_write_config(PCIDevice *dev,
 (!(old_slt_ctl & PCI_EXP_SLTCTL_PCC) ||
 (old_slt_ctl & PCI_EXP_SLTCTL_PIC_OFF) != PCI_EXP_SLTCTL_PIC_OFF)) {
 PCIBus *sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(dev));
-pci_for_each_device(sec_bus, pci_bus_num(sec_bus),
-pcie_unplug_device, NULL);
-
+pci_for_each_device_under_bus(sec_bus, pcie_unplug_device, NULL);
 pci_word_test_and_clear_mask(exp_cap + PCI_EXP_SLTSTA,
  PCI_EXP_SLTSTA_PDS);
 if (dev->cap_present & QEMU_PCIE_LNKSTA_DLLLA ||
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 7430bd6314..5bfd4aa9e5 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1317,8 +1317,7 @@ static int spapr_dt_pci_bus(SpaprPhbState *sphb, PCIBus 
*bus,
   RESOURCE_CELLS_SIZE));
 
 assert(bus);
-pci_for_each_device_reverse(bus, pci_bus_num(bus),
-spapr_dt_pci_device_cb, &cbinfo);
+pci_for_each_device_under_bus_reverse(bus, spapr_dt_pci_device_cb, &cbinfo);
 if (cbinfo.err) {
 return cbinfo.err;
 }
@@ -2306,8 +2305,8 @@ static void spapr_phb_pci_enumerate_bridge(PCIBus

[PATCH v5 03/10] target/ppc: enable PMU counter overflow with cycle events

2021-11-01 Thread Daniel Henrique Barboza

The PowerISA v3.1 defines that if the proper bits are set (MMCR0_PMC1CE
for PMC1 and MMCR0_PMCjCE for the remaining PMCs), counter negative
conditions are enabled. This means that if the counter value overflows
(i.e. reaches 0x80000000) a performance monitor alert will occur. This alert
can trigger an event-based exception (to be implemented in the next patches)
if the MMCR0_EBE bit is set.

For now, overflowing the counter when the PMC is counting cycles will
just trigger a performance monitor alert. This is done by starting the
overflow timer to expire at the moment the overflow would occur. The
timer will call fire_PMC_interrupt() (via cpu_ppc_pmu_timer_cb) which will
trigger the PMU alert and, if the conditions are met, an EBB exception.
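
A minimal sketch of how the timer deadline could be derived, assuming
the counter negative threshold is 0x80000000 (bit 32 of the counter
set) and that 1 cycle equals 1 ns of virtual time. The helper names
cycles_until_overflow() and overflow_deadline_ns() are hypothetical;
in QEMU the resulting deadline would be handed to the timer API.

```c
#include <stdint.h>

/* Assumption: counter negative condition starts at this value */
#define COUNTER_NEGATIVE_VAL 0x80000000ULL

/* Hypothetical helper: cycles left before the counter goes negative. */
static uint64_t cycles_until_overflow(uint64_t pmc)
{
    if (pmc >= COUNTER_NEGATIVE_VAL) {
        /* Already negative: the alert is due immediately */
        return 0;
    }
    return COUNTER_NEGATIVE_VAL - pmc;
}

/*
 * Hypothetical helper: absolute virtual-clock deadline, in ns, at
 * which the overflow timer should fire (1 cycle == 1 ns).
 */
static uint64_t overflow_deadline_ns(uint64_t now_ns, uint64_t pmc)
{
    return now_ns + cycles_until_overflow(pmc);
}
```

A counter preloaded with 0x7FFFFFF0, for instance, overflows after 16
cycles, so its timer would expire 16 ns after the session starts.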

Signed-off-by: Daniel Henrique Barboza 
---
 target/ppc/cpu.h|  2 +
 target/ppc/power8-pmu.c | 86 -
 2 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index 6c4643044b..bf718334a5 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -363,6 +363,8 @@ typedef enum {
 #define MMCR0_PMCC   PPC_BITMASK(44, 45) /* PMC Control */
 #define MMCR0_FC14   PPC_BIT(58) /* PMC Freeze Counters 1-4 bit */
 #define MMCR0_FC56   PPC_BIT(59) /* PMC Freeze Counters 5-6 bit */
+#define MMCR0_PMC1CE PPC_BIT(48) /* MMCR0 PMC1 Condition Enabled */
+#define MMCR0_PMCjCE PPC_BIT(49) /* MMCR0 PMCj Condition Enabled */
 /* MMCR0 userspace r/w mask */
 #define MMCR0_UREG_MASK (MMCR0_FC | MMCR0_PMAO | MMCR0_PMAE)
 /* MMCR2 userspace r/w mask */
diff --git a/target/ppc/power8-pmu.c b/target/ppc/power8-pmu.c
index a0a42b666c..fdc94d40b2 100644
--- a/target/ppc/power8-pmu.c
+++ b/target/ppc/power8-pmu.c
@@ -23,6 +23,8 @@
 
 #if defined(TARGET_PPC64) && !defined(CONFIG_USER_ONLY)
 
+#define COUNTER_NEGATIVE_VAL 0x80000000
+
 /*
  * For PMCs 1-4, IBM POWER chips has support for an implementation
  * dependent event, 0x1E, that enables cycle counting. The Linux kernel
@@ -93,6 +95,15 @@ static bool pmc_is_active(CPUPPCState *env, int sprn)
 return !(env->spr[SPR_POWER_MMCR0] & MMCR0_FC56);
 }
 
+static bool pmc_has_overflow_enabled(CPUPPCState *env, int sprn)
+{
+if (sprn == SPR_POWER_PMC1) {
+return env->spr[SPR_POWER_MMCR0] & MMCR0_PMC1CE;
+}
+
+return env->spr[SPR_POWER_MMCR0] & MMCR0_PMCjCE;
+}
+
 static void pmu_update_cycles(CPUPPCState *env)
 {
 uint64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
@@ -121,6 +132,63 @@ static void pmu_update_cycles(CPUPPCState *env)
 }
 }
 
+static void pmu_delete_timers(CPUPPCState *env)
+{
+int i;
+
+for (i = 0; i < PMU_TIMERS_NUM; i++) {
+timer_del(env->pmu_cyc_overflow_timers[i]);
+}
+}
+
+/*
+ * Helper function to retrieve the cycle overflow timer of the
+ * 'sprn' counter. Given that PMC5 doesn't have a timer, the
+ * amount of timers is less than the total counters and the PMC6
+ * timer is the last of the array.
+ */
+static QEMUTimer *get_cyc_overflow_timer(CPUPPCState *env, int sprn)
+{
+if (sprn == SPR_POWER_PMC5) {
+return NULL;
+}
+
+if (sprn == SPR_POWER_PMC6) {
+return env->pmu_cyc_overflow_timers[PMU_TIMERS_NUM - 1];
+}
+
+return env->pmu_cyc_overflow_timers[sprn - SPR_POWER_PMC1];
+}
+
+static void pmu_start_overflow_timers(CPUPPCState *env)
+{
+uint64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
+int64_t timeout;
+int sprn;
+
+env->pmu_base_time = now;
+
+/*
+ * Scroll through all PMCs and start counter overflow timers for
+ * PM_CYC events, if needed.
+ */
+for (sprn = SPR_POWER_PMC1; sprn <= SPR_POWER_PMC6; sprn++) {
+if (!pmc_is_active(env, sprn) ||
+!(getPMUEventType(env, sprn) == PMU_EVENT_CYCLES) ||
+!pmc_has_overflow_enabled(env, sprn)) {
+continue;
+}
+
+if (env->spr[sprn] >= COUNTER_NEGATIVE_VAL) {
+timeout =  0;
+} else {
+timeout  = COUNTER_NEGATIVE_VAL - env->spr[sprn];
+}
+
+timer_mod(get_cyc_overflow_timer(env, sprn), now + timeout);
+}
+}
+
 /*
  * A cycle count session consists of the basic operations we
  * need to do to support PM_CYC events: redefine a new base_time
@@ -128,8 +196,22 @@ static void pmu_update_cycles(CPUPPCState *env)
  */
 static void start_cycle_count_session(CPUPPCState *env)
 {
-/* Just define pmu_base_time for now */
-env->pmu_base_time = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
+bool overflow_enabled = env->spr[SPR_POWER_MMCR0] &
+(MMCR0_PMC1CE | MMCR0_PMCjCE);
+
+/*
+ * Always delete existing overflow timers when starting a
+ * new cycle counting session.
+ */
+pmu_delete_timers(env);
+
+if (!overflow_enabled) {
+/* Define pmu_base_time and leave */
+env->pmu_base_time = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
+return;
+}
+
+

[PULL 7/9] pci: Define pci_bus_dev_fn/pci_bus_fn/pci_bus_ret_fn

2021-11-01 Thread Michael S. Tsirkin

From: Peter Xu 

They're used in quite a few places in pci.[ch] and in the rest of the code
base.  Define them once so that they don't need to be spelled out all over
the place.

The pci_bus_fn is similar to pci_bus_dev_fn, except that it only takes a
PCIBus* and an opaque.  The pci_bus_ret_fn is similar to pci_bus_fn, but it
can return a void* pointer.
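
As an illustration of how the two bus callback types combine, here is a
minimal standalone sketch of a depth-first bus walk. The PCIBus struct below
is a stand-in with only the fields the sketch needs, and
for_each_bus_depth_first() and count_buses() are hypothetical names; this is
not the QEMU implementation.

```c
#include <stddef.h>

/* Stand-in for QEMU's PCIBus; only what the sketch needs. */
typedef struct PCIBus {
    int num;
    struct PCIBus *children[4];
    size_t nchildren;
} PCIBus;

/* Same shapes as the typedefs introduced by the patch. */
typedef void (*pci_bus_fn)(PCIBus *b, void *opaque);
typedef void *(*pci_bus_ret_fn)(PCIBus *b, void *opaque);

/*
 * 'begin' runs before descending and returns the state handed to the
 * children; 'end' runs after all children were visited. Either may be NULL.
 */
static void for_each_bus_depth_first(PCIBus *bus, pci_bus_ret_fn begin,
                                     pci_bus_fn end, void *parent_state)
{
    void *state = begin ? begin(bus, parent_state) : parent_state;
    size_t i;

    for (i = 0; i < bus->nchildren; i++) {
        for_each_bus_depth_first(bus->children[i], begin, end, state);
    }

    if (end) {
        end(bus, state);
    }
}

/* A pci_bus_fn callback: bump the counter passed as opaque. */
static void count_bus(PCIBus *b, void *opaque)
{
    (void)b;
    ++*(int *)opaque;
}

/* Count all buses in the tree rooted at 'root'. */
int count_buses(PCIBus *root)
{
    int n = 0;

    for_each_bus_depth_first(root, NULL, count_bus, &n);
    return n;
}
```

With the typedefs, the walker's signature fits on two lines instead of
spelling out both function-pointer types inline.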

Reviewed-by: David Hildenbrand 
Reviewed-by: Eric Auger 
Signed-off-by: Peter Xu 
Message-Id: <20211028043129.38871-2-pet...@redhat.com>
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 include/hw/pci/pci.h | 19 +--
 hw/pci/pci.c | 20 ++--
 2 files changed, 15 insertions(+), 24 deletions(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 7fc90132cf..4a8740b76b 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -401,6 +401,10 @@ typedef PCIINTxRoute (*pci_route_irq_fn)(void *opaque, int 
pin);
 OBJECT_DECLARE_TYPE(PCIBus, PCIBusClass, PCI_BUS)
 #define TYPE_PCIE_BUS "PCIE"
 
+typedef void (*pci_bus_dev_fn)(PCIBus *b, PCIDevice *d, void *opaque);
+typedef void (*pci_bus_fn)(PCIBus *b, void *opaque);
+typedef void *(*pci_bus_ret_fn)(PCIBus *b, void *opaque);
+
 bool pci_bus_is_express(PCIBus *bus);
 
 void pci_root_bus_init(PCIBus *bus, size_t bus_size, DeviceState *parent,
@@ -458,23 +462,18 @@ static inline int pci_dev_bus_num(const PCIDevice *dev)
 
 int pci_bus_numa_node(PCIBus *bus);
 void pci_for_each_device(PCIBus *bus, int bus_num,
- void (*fn)(PCIBus *bus, PCIDevice *d, void *opaque),
+ pci_bus_dev_fn fn,
  void *opaque);
 void pci_for_each_device_reverse(PCIBus *bus, int bus_num,
- void (*fn)(PCIBus *bus, PCIDevice *d,
-void *opaque),
+ pci_bus_dev_fn fn,
  void *opaque);
-void pci_for_each_bus_depth_first(PCIBus *bus,
-  void *(*begin)(PCIBus *bus, void 
*parent_state),
-  void (*end)(PCIBus *bus, void *state),
-  void *parent_state);
+void pci_for_each_bus_depth_first(PCIBus *bus, pci_bus_ret_fn begin,
+  pci_bus_fn end, void *parent_state);
 PCIDevice *pci_get_function_0(PCIDevice *pci_dev);
 
 /* Use this wrapper when specific scan order is not required. */
 static inline
-void pci_for_each_bus(PCIBus *bus,
-  void (*fn)(PCIBus *bus, void *opaque),
-  void *opaque)
+void pci_for_each_bus(PCIBus *bus, pci_bus_fn fn, void *opaque)
 {
 pci_for_each_bus_depth_first(bus, NULL, fn, opaque);
 }
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 186758ee11..17e59cb3a3 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -1655,9 +1655,7 @@ static const pci_class_desc pci_class_descriptions[] =
 };
 
 static void pci_for_each_device_under_bus_reverse(PCIBus *bus,
-  void (*fn)(PCIBus *b,
- PCIDevice *d,
- void *opaque),
+  pci_bus_dev_fn fn,
   void *opaque)
 {
 PCIDevice *d;
@@ -1672,8 +1670,7 @@ static void pci_for_each_device_under_bus_reverse(PCIBus 
*bus,
 }
 
 void pci_for_each_device_reverse(PCIBus *bus, int bus_num,
- void (*fn)(PCIBus *b, PCIDevice *d, void *opaque),
- void *opaque)
+ pci_bus_dev_fn fn, void *opaque)
 {
 bus = pci_find_bus_nr(bus, bus_num);
 
@@ -1683,9 +1680,7 @@ void pci_for_each_device_reverse(PCIBus *bus, int bus_num,
 }
 
 static void pci_for_each_device_under_bus(PCIBus *bus,
-  void (*fn)(PCIBus *b, PCIDevice *d,
- void *opaque),
-  void *opaque)
+  pci_bus_dev_fn fn, void *opaque)
 {
 PCIDevice *d;
 int devfn;
@@ -1699,8 +1694,7 @@ static void pci_for_each_device_under_bus(PCIBus *bus,
 }
 
 void pci_for_each_device(PCIBus *bus, int bus_num,
- void (*fn)(PCIBus *b, PCIDevice *d, void *opaque),
- void *opaque)
+ pci_bus_dev_fn fn, void *opaque)
 {
 bus = pci_find_bus_nr(bus, bus_num);
 
@@ -2078,10 +2072,8 @@ static PCIBus *pci_find_bus_nr(PCIBus *bus, int bus_num)
 return NULL;
 }
 
-void pci_for_each_bus_depth_first(PCIBus *bus,
-  void *(*begin)(PCIBus *bus, void 
*parent_state),
-  void (*end)(PCIBus *bus, void *state),
-  void *parent_state)
+void

[PATCH v5 04/10] target/ppc: enable PMU instruction count

2021-11-01 Thread Daniel Henrique Barboza

The PMU is already counting cycles by calculating time elapsed in
nanoseconds. Counting instructions is a different matter and requires
another approach.

This patch adds the capability of counting completed instructions
(Perf event PM_INST_CMPL) by counting the number of instructions
translated in each translation block right before exiting it.

A new pmu_count_insns() helper in translate.c was added to do that.
After verifying that the PMU is running (MMCR0_FC bit not set), it calls
helper_insns_inc(). This new helper from power8-pmu.c adds the
instructions to the relevant counters. It is also responsible for
triggering counter negative overflows, as is already done for
cycles.
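
The per-counter increment and overflow check can be sketched on its own.
This is a simplified model of a single counter, assuming the
counter-negative threshold is 0x80000000; pmc_add_insns() is a hypothetical
name and the per-PMC event and freeze checks are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed counter-negative threshold: bit 31 of the 32-bit PMC is set. */
#define COUNTER_NEGATIVE_VAL 0x80000000ULL

/*
 * Hypothetical helper: add 'num_insns' to one counter. If the counter
 * becomes negative and the overflow condition is enabled for it, clamp
 * it to the threshold and report that an alert should be raised.
 */
bool pmc_add_insns(uint64_t *pmc, uint32_t num_insns, bool overflow_enabled)
{
    *pmc += num_insns;

    if (*pmc >= COUNTER_NEGATIVE_VAL && overflow_enabled) {
        *pmc = COUNTER_NEGATIVE_VAL;
        return true;
    }

    return false;
}
```

Note that with the overflow condition disabled the counter simply keeps
counting past the threshold and no alert is reported.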

Signed-off-by: Daniel Henrique Barboza 
---
 target/ppc/cpu.h |  1 +
 target/ppc/helper.h  |  1 +
 target/ppc/helper_regs.c |  4 +++
 target/ppc/power8-pmu-regs.c.inc |  6 +
 target/ppc/power8-pmu.c  | 39 +++
 target/ppc/translate.c   | 46 
 6 files changed, 97 insertions(+)

diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index bf718334a5..f965436d19 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -655,6 +655,7 @@ enum {
 HFLAGS_PR = 14,  /* MSR_PR */
 HFLAGS_PMCC0 = 15,  /* MMCR0 PMCC bit 0 */
 HFLAGS_PMCC1 = 16,  /* MMCR0 PMCC bit 1 */
+HFLAGS_MMCR0FC = 17, /* MMCR0 FC bit */
 HFLAGS_VSX = 23, /* MSR_VSX if cpu has VSX */
 HFLAGS_VR = 25,  /* MSR_VR if cpu has VRE */
 
diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index 8e3657afe1..b8a89f02f4 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -21,6 +21,7 @@ DEF_HELPER_1(hrfid, void, env)
 DEF_HELPER_2(store_lpcr, void, env, tl)
 DEF_HELPER_2(store_pcr, void, env, tl)
 DEF_HELPER_2(store_mmcr0, void, env, tl)
+DEF_HELPER_2(insns_inc, void, env, i32)
 #endif
 DEF_HELPER_1(check_tlb_flush_local, void, env)
 DEF_HELPER_1(check_tlb_flush_global, void, env)
diff --git a/target/ppc/helper_regs.c b/target/ppc/helper_regs.c
index 99562edd57..875c2fdfc6 100644
--- a/target/ppc/helper_regs.c
+++ b/target/ppc/helper_regs.c
@@ -115,6 +115,10 @@ static uint32_t hreg_compute_hflags_value(CPUPPCState *env)
 if (env->spr[SPR_POWER_MMCR0] & MMCR0_PMCC1) {
 hflags |= 1 << HFLAGS_PMCC1;
 }
+if (env->spr[SPR_POWER_MMCR0] & MMCR0_FC) {
+hflags |= 1 << HFLAGS_MMCR0FC;
+}
+
 
 #ifndef CONFIG_USER_ONLY
 if (!env->has_hv_mode || (msr & (1ull << MSR_HV))) {
diff --git a/target/ppc/power8-pmu-regs.c.inc b/target/ppc/power8-pmu-regs.c.inc
index fbb8977641..a92437b0c4 100644
--- a/target/ppc/power8-pmu-regs.c.inc
+++ b/target/ppc/power8-pmu-regs.c.inc
@@ -113,6 +113,12 @@ static void write_MMCR0_common(DisasContext *ctx, TCGv val)
  */
 gen_icount_io_start(ctx);
 gen_helper_store_mmcr0(cpu_env, val);
+
+/*
+ * End the translation block because MMCR0 writes can change
+ * ctx->pmu_frozen.
+ */
+ctx->base.is_jmp = DISAS_EXIT_UPDATE;
 }
 
 void spr_write_MMCR0_ureg(DisasContext *ctx, int sprn, int gprn)
diff --git a/target/ppc/power8-pmu.c b/target/ppc/power8-pmu.c
index fdc94d40b2..5f90828aed 100644
--- a/target/ppc/power8-pmu.c
+++ b/target/ppc/power8-pmu.c
@@ -104,6 +104,31 @@ static bool pmc_has_overflow_enabled(CPUPPCState *env, int 
sprn)
 return env->spr[SPR_POWER_MMCR0] & MMCR0_PMCjCE;
 }
 
+static bool pmu_increment_insns(CPUPPCState *env, uint32_t num_insns)
+{
+bool overflow_triggered = false;
+int sprn;
+
+/* PMC6 never counts instructions */
+for (sprn = SPR_POWER_PMC1; sprn <= SPR_POWER_PMC5; sprn++) {
+if (!pmc_is_active(env, sprn) ||
+getPMUEventType(env, sprn) != PMU_EVENT_INSTRUCTIONS) {
+continue;
+}
+
+env->spr[sprn] += num_insns;
+
+if (env->spr[sprn] >= COUNTER_NEGATIVE_VAL &&
+pmc_has_overflow_enabled(env, sprn)) {
+
+overflow_triggered = true;
+env->spr[sprn] = COUNTER_NEGATIVE_VAL;
+}
+}
+
+return overflow_triggered;
+}
+
 static void pmu_update_cycles(CPUPPCState *env)
 {
 uint64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
@@ -259,6 +284,20 @@ static void fire_PMC_interrupt(PowerPCCPU *cpu)
 return;
 }
 
+/* This helper assumes that the PMC is running. */
+void helper_insns_inc(CPUPPCState *env, uint32_t num_insns)
+{
+bool overflow_triggered;
+PowerPCCPU *cpu;
+
+overflow_triggered = pmu_increment_insns(env, num_insns);
+
+if (overflow_triggered) {
+cpu = env_archcpu(env);
+fire_PMC_interrupt(cpu);
+}
+}
+
 static void cpu_ppc_pmu_timer_cb(void *opaque)
 {
 PowerPCCPU *cpu = opaque;
diff --git a/target/ppc/translate.c b/target/ppc/translate.c
index 659859ff5f..01bacb573d 100644
--- a/target/ppc/translate.c
+++ b/target/ppc/translate.c
@@ -177,6 +177,7 @@ struct DisasContext {
 bool hr;
 bool mmcr0_pmcc0;
 bool mmcr0_pmcc1;
+bool pmu_frozen;

[PULL 4/9] hw/i386/pc: Remove x86_iommu_get_type()

2021-11-01 Thread Michael S. Tsirkin

From: Jean-Philippe Brucker 

To generate the IOMMU ACPI table, acpi-build.c can use base QEMU types
instead of a special IommuType value.

Reviewed-by: Eric Auger 
Reviewed-by: Igor Mammedov 
Signed-off-by: Jean-Philippe Brucker 
Message-Id: <20211026182024.2642038-3-jean-phili...@linaro.org>
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 include/hw/i386/x86-iommu.h | 12 
 hw/i386/acpi-build.c| 20 +---
 hw/i386/amd_iommu.c |  2 --
 hw/i386/intel_iommu.c   |  3 ---
 hw/i386/x86-iommu-stub.c|  5 -
 hw/i386/x86-iommu.c |  5 -
 6 files changed, 9 insertions(+), 38 deletions(-)

diff --git a/include/hw/i386/x86-iommu.h b/include/hw/i386/x86-iommu.h
index 9de92d33a1..5ba0c056d6 100644
--- a/include/hw/i386/x86-iommu.h
+++ b/include/hw/i386/x86-iommu.h
@@ -33,12 +33,6 @@ OBJECT_DECLARE_TYPE(X86IOMMUState, X86IOMMUClass, 
X86_IOMMU_DEVICE)
 typedef struct X86IOMMUIrq X86IOMMUIrq;
 typedef struct X86IOMMU_MSIMessage X86IOMMU_MSIMessage;
 
-typedef enum IommuType {
-TYPE_INTEL,
-TYPE_AMD,
-TYPE_NONE
-} IommuType;
-
 struct X86IOMMUClass {
 SysBusDeviceClass parent;
 /* Intel/AMD specific realize() hook */
@@ -71,7 +65,6 @@ struct X86IOMMUState {
 OnOffAuto intr_supported;   /* Whether vIOMMU supports IR */
 bool dt_supported;  /* Whether vIOMMU supports DT */
 bool pt_supported;  /* Whether vIOMMU supports pass-through */
-IommuType type; /* IOMMU type - AMD/Intel */
 QLIST_HEAD(, IEC_Notifier) iec_notifiers; /* IEC notify list */
 };
 
@@ -140,11 +133,6 @@ struct X86IOMMU_MSIMessage {
  */
 X86IOMMUState *x86_iommu_get_default(void);
 
-/*
- * x86_iommu_get_type - get IOMMU type
- */
-IommuType x86_iommu_get_type(void);
-
 /**
  * x86_iommu_iec_register_notifier - register IEC (Interrupt Entry
  *   Cache) notifiers
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 81418b7911..ab49e799ff 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2488,6 +2488,7 @@ void acpi_build(AcpiBuildTables *tables, MachineState 
*machine)
 PCMachineState *pcms = PC_MACHINE(machine);
 PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
 X86MachineState *x86ms = X86_MACHINE(machine);
+X86IOMMUState *iommu = x86_iommu_get_default();
 GArray *table_offsets;
 unsigned facs, dsdt, rsdt, fadt;
 AcpiPmInfo pm;
@@ -2604,17 +2605,14 @@ void acpi_build(AcpiBuildTables *tables, MachineState 
*machine)
 build_mcfg(tables_blob, tables->linker, &mcfg, x86ms->oem_id,
x86ms->oem_table_id);
 }
-if (x86_iommu_get_default()) {
-IommuType IOMMUType = x86_iommu_get_type();
-if (IOMMUType == TYPE_AMD) {
-acpi_add_table(table_offsets, tables_blob);
-build_amd_iommu(tables_blob, tables->linker, x86ms->oem_id,
-x86ms->oem_table_id);
-} else if (IOMMUType == TYPE_INTEL) {
-acpi_add_table(table_offsets, tables_blob);
-build_dmar_q35(tables_blob, tables->linker, x86ms->oem_id,
-   x86ms->oem_table_id);
-}
+if (object_dynamic_cast(OBJECT(iommu), TYPE_AMD_IOMMU_DEVICE)) {
+acpi_add_table(table_offsets, tables_blob);
+build_amd_iommu(tables_blob, tables->linker, x86ms->oem_id,
+x86ms->oem_table_id);
+} else if (object_dynamic_cast(OBJECT(iommu), TYPE_INTEL_IOMMU_DEVICE)) {
+acpi_add_table(table_offsets, tables_blob);
+build_dmar_q35(tables_blob, tables->linker, x86ms->oem_id,
+   x86ms->oem_table_id);
 }
 if (machine->nvdimms_state->is_enabled) {
 nvdimm_build_acpi(table_offsets, tables_blob, tables->linker,
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 9242a0d3ed..91fe34ae58 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1538,7 +1538,6 @@ static void amdvi_sysbus_realize(DeviceState *dev, Error 
**errp)
 {
 int ret = 0;
 AMDVIState *s = AMD_IOMMU_DEVICE(dev);
-X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(dev);
 MachineState *ms = MACHINE(qdev_get_machine());
 PCMachineState *pcms = PC_MACHINE(ms);
 X86MachineState *x86ms = X86_MACHINE(ms);
@@ -1548,7 +1547,6 @@ static void amdvi_sysbus_realize(DeviceState *dev, Error 
**errp)
  amdvi_uint64_equal, g_free, g_free);
 
 /* This device should take care of IOMMU PCI properties */
-x86_iommu->type = TYPE_AMD;
 if (!qdev_realize(DEVICE(&s->pci), &bus->qbus, errp)) {
 return;
 }
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 75f075547f..c27b20090e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3806,9 +3806,6 @@ static void vtd_realize(DeviceState *dev, Error **errp)
 X86MachineState *x86ms = X86_MACHINE(ms);
 PCIBus *bus = pcms->bus;
 IntelIOMMUState *s =

[PULL 0/9] pc,pci,virtio: features, fixes

2021-11-01 Thread Michael S. Tsirkin

The following changes since commit af531756d25541a1b3b3d9a14e72e7fedd941a2e:

  Merge remote-tracking branch 'remotes/philmd/tags/renesas-20211030' into 
staging (2021-10-30 11:31:41 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git tags/for_upstream

for you to fetch changes up to d99e8b5fcb138b19f751c027ed5599224f9b5036:

  hw/i386: fix vmmouse registration (2021-11-01 19:36:11 -0400)


pc,pci,virtio: features, fixes

virtio-iommu support for x86/ACPI.
Fixes, cleanups all over the place.

Signed-off-by: Michael S. Tsirkin 


David Hildenbrand (1):
  vhost-vdpa: Set discarding of RAM broken when initializing the backend

Igor Mammedov (1):
  qtest: fix 'expression is always false' build failure in qtest_has_accel()

Jean-Philippe Brucker (4):
  hw/acpi: Add VIOT table
  hw/i386/pc: Remove x86_iommu_get_type()
  hw/i386/pc: Move IOMMU singleton into PCMachineState
  hw/i386/pc: Allow instantiating a virtio-iommu device

Pavel Dovgalyuk (1):
  hw/i386: fix vmmouse registration

Peter Xu (2):
  pci: Define pci_bus_dev_fn/pci_bus_fn/pci_bus_ret_fn
  pci: Export pci_for_each_device_under_bus*()

 hw/acpi/viot.h  |  13 +
 include/hw/i386/pc.h|   1 +
 include/hw/i386/x86-iommu.h |  12 -
 include/hw/pci/pci.h|  24 ++
 hw/acpi/viot.c  | 114 
 hw/i386/acpi-build.c|  33 +++--
 hw/i386/amd_iommu.c |   2 -
 hw/i386/intel_iommu.c   |   3 --
 hw/i386/pc.c|  26 +-
 hw/i386/vmmouse.c   |   1 +
 hw/i386/x86-iommu-stub.c|   5 --
 hw/i386/x86-iommu.c |  31 
 hw/pci/pci.c|  26 --
 hw/pci/pcie.c   |   4 +-
 hw/ppc/spapr_pci.c  |  12 ++---
 hw/ppc/spapr_pci_nvlink2.c  |   7 ++-
 hw/ppc/spapr_pci_vfio.c |   4 +-
 hw/s390x/s390-pci-bus.c |   5 +-
 hw/virtio/vhost-vdpa.c  |  13 +
 hw/xen/xen_pt.c |   4 +-
 hw/acpi/Kconfig |   4 ++
 hw/acpi/meson.build |   1 +
 hw/i386/Kconfig |   1 +
 meson.build |   2 +-
 24 files changed, 239 insertions(+), 109 deletions(-)
 create mode 100644 hw/acpi/viot.h
 create mode 100644 hw/acpi/viot.c

[PULL 3/9] hw/acpi: Add VIOT table

2021-11-01 Thread Michael S. Tsirkin

From: Jean-Philippe Brucker 

Add a function that generates a Virtual I/O Translation table (VIOT),
describing the topology of paravirtual IOMMUs. The table is created if a
virtio-iommu device is present. It contains a virtio-iommu node and PCI
Range nodes for endpoints managed by the IOMMU. By default, a single
node describes all PCI devices. When passing the
"default_bus_bypass_iommu" machine option and "bypass_iommu" PXB option,
only buses that do not bypass the IOMMU are described by PCI Range
nodes.
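
For reference, the byte layout of one PCI Range node as emitted by this
patch can be sketched standalone. append_le() below is a hypothetical
stand-in for build_append_int_noprefix() writing into a fixed buffer, and
build_bdf() mirrors what PCI_BUILD_BDF is assumed to compute (bus in the
high byte, devfn in the low byte).

```c
#include <stddef.h>
#include <stdint.h>

#define VIOT_PCI_RANGE_LEN 24

/* Write 'size' bytes of 'value' in little-endian order; return new offset. */
static size_t append_le(uint8_t *buf, size_t off, uint64_t value, size_t size)
{
    size_t i;

    for (i = 0; i < size; i++) {
        buf[off + i] = (uint8_t)(value & 0xff);
        value >>= 8;
    }
    return off + size;
}

/* Assumed BDF encoding: bus number in bits 15:8, devfn in bits 7:0. */
static uint16_t build_bdf(uint8_t bus, uint8_t devfn)
{
    return (uint16_t)((bus << 8) | devfn);
}

/* Fill 'buf' (at least 24 bytes) with one PCI Range node; return its size. */
size_t viot_pci_range_node(uint8_t *buf, uint8_t min_bus, uint8_t max_bus,
                           uint16_t output_node)
{
    size_t off = 0;

    off = append_le(buf, off, 1, 1);                        /* Type */
    off = append_le(buf, off, 0, 1);                        /* Reserved */
    off = append_le(buf, off, VIOT_PCI_RANGE_LEN, 2);       /* Length */
    off = append_le(buf, off, build_bdf(min_bus, 0), 4);    /* Endpoint start */
    off = append_le(buf, off, 0, 2);                        /* Segment start */
    off = append_le(buf, off, 0, 2);                        /* Segment end */
    off = append_le(buf, off, build_bdf(min_bus, 0), 2);    /* BDF start */
    off = append_le(buf, off, build_bdf(max_bus, 0xff), 2); /* BDF end */
    off = append_le(buf, off, output_node, 2);              /* Output node */
    off = append_le(buf, off, 0, 6);                        /* Reserved */

    return off;
}
```

The field sizes add up to the 24 bytes stored in the Length field, and the
output node is the offset of the virtio-iommu node within the table.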

Reviewed-by: Eric Auger 
Tested-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
Message-Id: <20211026182024.2642038-2-jean-phili...@linaro.org>
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 hw/acpi/viot.h  |  13 +
 hw/acpi/viot.c  | 114 
 hw/acpi/Kconfig |   4 ++
 hw/acpi/meson.build |   1 +
 4 files changed, 132 insertions(+)
 create mode 100644 hw/acpi/viot.h
 create mode 100644 hw/acpi/viot.c

diff --git a/hw/acpi/viot.h b/hw/acpi/viot.h
new file mode 100644
index 00..9fe565bb87
--- /dev/null
+++ b/hw/acpi/viot.h
@@ -0,0 +1,13 @@
+/*
+ * ACPI Virtual I/O Translation Table implementation
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+#ifndef VIOT_H
+#define VIOT_H
+
+void build_viot(MachineState *ms, GArray *table_data, BIOSLinker *linker,
+uint16_t virtio_iommu_bdf, const char *oem_id,
+const char *oem_table_id);
+
+#endif /* VIOT_H */
diff --git a/hw/acpi/viot.c b/hw/acpi/viot.c
new file mode 100644
index 00..c1af75206e
--- /dev/null
+++ b/hw/acpi/viot.c
@@ -0,0 +1,114 @@
+/*
+ * ACPI Virtual I/O Translation table implementation
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+#include "qemu/osdep.h"
+#include "hw/acpi/acpi.h"
+#include "hw/acpi/aml-build.h"
+#include "hw/acpi/viot.h"
+#include "hw/pci/pci.h"
+#include "hw/pci/pci_host.h"
+
+struct viot_pci_ranges {
+GArray *blob;
+size_t count;
+uint16_t output_node;
+};
+
+/* Build PCI range for a given PCI host bridge */
+static int build_pci_range_node(Object *obj, void *opaque)
+{
+struct viot_pci_ranges *pci_ranges = opaque;
+GArray *blob = pci_ranges->blob;
+
+if (object_dynamic_cast(obj, TYPE_PCI_HOST_BRIDGE)) {
+PCIBus *bus = PCI_HOST_BRIDGE(obj)->bus;
+
+if (bus && !pci_bus_bypass_iommu(bus)) {
+int min_bus, max_bus;
+
+pci_bus_range(bus, &min_bus, &max_bus);
+
+/* Type */
+build_append_int_noprefix(blob, 1 /* PCI range */, 1);
+/* Reserved */
+build_append_int_noprefix(blob, 0, 1);
+/* Length */
+build_append_int_noprefix(blob, 24, 2);
+/* Endpoint start */
+build_append_int_noprefix(blob, PCI_BUILD_BDF(min_bus, 0), 4);
+/* PCI Segment start */
+build_append_int_noprefix(blob, 0, 2);
+/* PCI Segment end */
+build_append_int_noprefix(blob, 0, 2);
+/* PCI BDF start */
+build_append_int_noprefix(blob, PCI_BUILD_BDF(min_bus, 0), 2);
+/* PCI BDF end */
+build_append_int_noprefix(blob, PCI_BUILD_BDF(max_bus, 0xff), 2);
+/* Output node */
+build_append_int_noprefix(blob, pci_ranges->output_node, 2);
+/* Reserved */
+build_append_int_noprefix(blob, 0, 6);
+
+pci_ranges->count++;
+}
+}
+
+return 0;
+}
+
+/*
+ * Generate a VIOT table with one PCI-based virtio-iommu that manages PCI
+ * endpoints.
+ *
+ * Defined in the ACPI Specification (Version TBD)
+ */
+void build_viot(MachineState *ms, GArray *table_data, BIOSLinker *linker,
+uint16_t virtio_iommu_bdf, const char *oem_id,
+const char *oem_table_id)
+{
+/* The virtio-iommu node follows the 48-bytes header */
+int viommu_off = 48;
+AcpiTable table = { .sig = "VIOT", .rev = 0,
+.oem_id = oem_id, .oem_table_id = oem_table_id };
+struct viot_pci_ranges pci_ranges = {
+.output_node = viommu_off,
+.blob = g_array_new(false, true /* clear */, 1),
+};
+
+/* Build the list of PCI ranges that this viommu manages */
+object_child_foreach_recursive(OBJECT(ms), build_pci_range_node,
+   &pci_ranges);
+
+/* ACPI table header */
+acpi_table_begin(&table, table_data);
+/* Node count */
+build_append_int_noprefix(table_data, pci_ranges.count + 1, 2);
+/* Node offset */
+build_append_int_noprefix(table_data, viommu_off, 2);
+/* Reserved */
+build_append_int_noprefix(table_data, 0, 8);
+
+/* Virtio-iommu node */
+/* Type */
+build_append_int_noprefix(table_data, 3 /* virtio-pci IOMMU */, 1);
+/* Reserved */
+build_append_int_noprefix(table_data, 0, 1);
+/* Length */
+build_append_int_noprefix(table_data, 16, 2);
+/* PCI Segment */
+

[PULL 5/9] hw/i386/pc: Move IOMMU singleton into PCMachineState

2021-11-01 Thread Michael S. Tsirkin

From: Jean-Philippe Brucker 

We're about to support a third vIOMMU for x86, virtio-iommu, which
does not inherit from X86IOMMUState. Move the IOMMU singleton into
PCMachineState, so it can be shared between all three vIOMMUs.

The x86_iommu_get_default() helper is still needed by KVM and IOAPIC to
fetch the default IRQ-remapping IOMMU. Since virtio-iommu doesn't
support IRQ remapping, this interface doesn't need to change for the
moment. We could later replace X86IOMMUState with an "IRQ remapping
IOMMU" interface if necessary.
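
The singleton rule enforced at pre-plug time can be modelled with plain
structs. This is a sketch only: Device, Machine and machine_claim_iommu()
are hypothetical stand-ins for DeviceState, PCMachineState and the pre-plug
callback, and the error is returned as a string instead of an Error object.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for DeviceState and PCMachineState. */
typedef struct Device {
    const char *name;
} Device;

typedef struct Machine {
    Device *iommu;   /* the single vIOMMU, or NULL if none was plugged */
} Machine;

/*
 * Record 'dev' as the machine's vIOMMU. Fails, leaving the machine
 * untouched, if another vIOMMU was already recorded.
 */
bool machine_claim_iommu(Machine *m, Device *dev, const char **err)
{
    if (m->iommu) {
        *err = "multiple vIOMMUs are not supported";
        return false;
    }

    m->iommu = dev;
    return true;
}
```

Keeping the pointer in the machine state rather than in a file-local static
lets a device type outside the X86IOMMUState hierarchy take the same slot.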

Reviewed-by: Eric Auger 
Reviewed-by: Igor Mammedov 
Signed-off-by: Jean-Philippe Brucker 
Message-Id: <20211026182024.2642038-4-jean-phili...@linaro.org>
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 include/hw/i386/pc.h |  1 +
 hw/i386/pc.c | 12 +++-
 hw/i386/x86-iommu.c  | 28 +---
 3 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 11426e26dc..b72e5bf9d1 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -35,6 +35,7 @@ typedef struct PCMachineState {
 I2CBus *smbus;
 PFlashCFI01 *flash[2];
 ISADevice *pcspk;
+DeviceState *iommu;
 
 /* Configuration options: */
 uint64_t max_ram_below_4g;
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 86223acfd3..7b1c4f41cd 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1330,6 +1330,15 @@ static void pc_machine_device_pre_plug_cb(HotplugHandler 
*hotplug_dev,
 } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
 pc_virtio_md_pci_pre_plug(hotplug_dev, dev, errp);
+} else if (object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE)) {
+PCMachineState *pcms = PC_MACHINE(hotplug_dev);
+
+if (pcms->iommu) {
+error_setg(errp, "QEMU does not support multiple vIOMMUs "
+   "for x86 yet.");
+return;
+}
+pcms->iommu = dev;
 }
 }
 
@@ -1384,7 +1393,8 @@ static HotplugHandler 
*pc_get_hotplug_handler(MachineState *machine,
 if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM) ||
 object_dynamic_cast(OBJECT(dev), TYPE_CPU) ||
 object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
-object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
+object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI) ||
+object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE)) {
 return HOTPLUG_HANDLER(machine);
 }
 
diff --git a/hw/i386/x86-iommu.c b/hw/i386/x86-iommu.c
index dc968c7a53..01d11325a6 100644
--- a/hw/i386/x86-iommu.c
+++ b/hw/i386/x86-iommu.c
@@ -77,25 +77,17 @@ void x86_iommu_irq_to_msi_message(X86IOMMUIrq *irq, 
MSIMessage *msg_out)
 msg_out->data = msg.msi_data;
 }
 
-/* Default X86 IOMMU device */
-static X86IOMMUState *x86_iommu_default = NULL;
-
-static void x86_iommu_set_default(X86IOMMUState *x86_iommu)
-{
-assert(x86_iommu);
-
-if (x86_iommu_default) {
-error_report("QEMU does not support multiple vIOMMUs "
- "for x86 yet.");
-exit(1);
-}
-
-x86_iommu_default = x86_iommu;
-}
-
 X86IOMMUState *x86_iommu_get_default(void)
 {
-return x86_iommu_default;
+MachineState *ms = MACHINE(qdev_get_machine());
+PCMachineState *pcms =
+PC_MACHINE(object_dynamic_cast(OBJECT(ms), TYPE_PC_MACHINE));
+
+if (pcms &&
+object_dynamic_cast(OBJECT(pcms->iommu), TYPE_X86_IOMMU_DEVICE)) {
+return X86_IOMMU_DEVICE(pcms->iommu);
+}
+return NULL;
 }
 
 static void x86_iommu_realize(DeviceState *dev, Error **errp)
@@ -131,8 +123,6 @@ static void x86_iommu_realize(DeviceState *dev, Error 
**errp)
 if (x86_class->realize) {
 x86_class->realize(dev, errp);
 }
-
-x86_iommu_set_default(X86_IOMMU_DEVICE(dev));
 }
 
 static Property x86_iommu_properties[] = {
-- 
MST

[PULL 1/9] qtest: fix 'expression is always false' build failure in qtest_has_accel()

2021-11-01 Thread Michael S. Tsirkin

From: Igor Mammedov 

If KVM is disabled or not present, qtest library build
may fail with:
   libqtest.c: In function 'qtest_has_accel':
  comparison of unsigned expression < 0 is always false
  [-Werror=type-limits]
 for (i = 0; i < ARRAY_SIZE(targets); i++) {

due to empty 'targets' array.
Fix it by making sure that CONFIG_KVM_TARGETS isn't empty.
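
To illustrate the effect of the fix, here is a small standalone sketch. It
assumes the placeholder expands to a single empty string when no KVM target
is configured; KVM_TARGETS_PLACEHOLDER and has_kvm_target() are hypothetical
names, and the lookup is simplified compared to libqtest.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

/* Assumed expansion when KVM is disabled: one empty string, not nothing. */
#define KVM_TARGETS_PLACEHOLDER ""

static const char *const targets[] = { KVM_TARGETS_PLACEHOLDER };

/*
 * With at least one element, ARRAY_SIZE(targets) is non-zero, so the
 * loop condition is no longer a comparison of an unsigned value with 0.
 */
bool has_kvm_target(const char *arch)
{
    size_t i;

    for (i = 0; i < ARRAY_SIZE(targets); i++) {
        if (!strcmp(targets[i], arch)) {
            return true;
        }
    }

    return false;
}
```

The empty string never matches a real architecture name, so the lookup
still reports that no KVM target is available.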

Fixes: e741aff0f43343 ("tests: qtest: add qtest_has_accel() to check if tested 
binary supports accelerator")
Reported-by: Jason Andryuk 
Suggested-by: "Michael S. Tsirkin" 
Signed-off-by: Igor Mammedov 
Message-Id: <20211027151012.2639284-1-imamm...@redhat.com>
Tested-by: Jason Andryuk 
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 meson.build | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/meson.build b/meson.build
index b092728397..ab4a5723f0 100644
--- a/meson.build
+++ b/meson.build
@@ -75,7 +75,7 @@ else
   kvm_targets = []
 endif
 
-kvm_targets_c = ''
+kvm_targets_c = '""'
 if not get_option('kvm').disabled() and targetos == 'linux'
   kvm_targets_c = '"' + '" ,"'.join(kvm_targets) + '"'
 endif
-- 
MST

[PULL 2/9] vhost-vdpa: Set discarding of RAM broken when initializing the backend

2021-11-01 Thread Michael S. Tsirkin

From: David Hildenbrand 

Similar to VFIO, vDPA will go ahead and map+pin all guest memory. Memory
that used to be discarded will get re-populated and if we
discard+re-access memory after mapping+pinning, the pages mapped into the
vDPA IOMMU will go out of sync with the actual pages mapped into the user
space page tables.

Set discarding of RAM broken such that:
- virtio-mem and vhost-vdpa run mutually exclusive
- virtio-balloon is inhibited and no memory discards will get issued

In the future, we might be able to support coordinated discarding of RAM
as used by virtio-mem and already supported by vfio via the
RamDiscardManager.
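
The mutual exclusion described above can be pictured as two counters that
block each other. The following is a simplified model under that
assumption, not the QEMU implementation; discard_disable() and
discard_require() are hypothetical names standing for "a user that breaks
discards" (such as vhost-vdpa) and "a user that relies on discards" (such as
virtio-mem).

```c
#include <errno.h>
#include <stdbool.h>

/* Number of active users for which discarding RAM is broken. */
static unsigned int discard_disabled_cnt;
/* Number of active users that rely on discarding RAM. */
static unsigned int discard_required_cnt;

/* Called with true by a backend that pins memory, false on cleanup. */
int discard_disable(bool state)
{
    if (!state) {
        discard_disabled_cnt--;
        return 0;
    }
    if (discard_required_cnt) {
        return -EBUSY;   /* someone already relies on discards */
    }
    discard_disabled_cnt++;
    return 0;
}

/* Called with true by a device that needs discards to work. */
int discard_require(bool state)
{
    if (!state) {
        discard_required_cnt--;
        return 0;
    }
    if (discard_disabled_cnt) {
        return -EBUSY;   /* discards are currently broken */
    }
    discard_required_cnt++;
    return 0;
}
```

In this model whichever side registers first wins, and the other side gets
an error until the first one releases its claim.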

Acked-by: Jason Wang 
Cc: Jason Wang 
Cc: Michael S. Tsirkin 
Cc: Cindy Lu 
Signed-off-by: David Hildenbrand 
Message-Id: <20211027130324.59791-1-da...@redhat.com>
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
Reviewed-by: Stefano Garzarella 
---
 hw/virtio/vhost-vdpa.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 12661fd5b1..0d8051426c 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -331,6 +331,17 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void 
*opaque, Error **errp)
 struct vhost_vdpa *v;
 assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA);
 trace_vhost_vdpa_init(dev, opaque);
+int ret;
+
+/*
+ * Similar to VFIO, we end up pinning all guest memory and have to
+ * disable discarding of RAM.
+ */
+ret = ram_block_discard_disable(true);
+if (ret) {
+error_report("Cannot set discarding of RAM broken");
+return ret;
+}
 
 v = opaque;
 v->dev = dev;
@@ -442,6 +453,8 @@ static int vhost_vdpa_cleanup(struct vhost_dev *dev)
 memory_listener_unregister(&v->listener);
 
 dev->opaque = NULL;
+ram_block_discard_disable(false);
+
 return 0;
 }
 
-- 
MST

[PULL 6/9] hw/i386/pc: Allow instantiating a virtio-iommu device

2021-11-01 Thread Michael S. Tsirkin

From: Jean-Philippe Brucker 

Allow instantiating a virtio-iommu device by adding an ACPI Virtual I/O
Translation table (VIOT), which describes the relation between the
virtio-iommu and the endpoints it manages.

Add a hotplug handler for virtio-iommu on x86 and set the necessary
reserved region property. On x86, the [0xfee00000, 0xfeefffff] DMA
region is reserved for MSIs. DMA transactions to this range either
trigger IRQ remapping in the IOMMU or bypass IOMMU translation.

Although virtio-iommu does not support IRQ remapping it must be informed
of the reserved region so that it can forward DMA transactions targeting
this region.
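
The reserved-region property value is a "start:end:type" string. A minimal
sketch of formatting it follows, assuming the x86 MSI window is
0xfee00000-0xfeefffff and that the MSI region type constant
(VIRTIO_IOMMU_RESV_MEM_T_MSI) has the value 1; format_resv_region() is a
hypothetical helper.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed value of VIRTIO_IOMMU_RESV_MEM_T_MSI. */
#define RESV_MEM_T_MSI 1

/*
 * Format "start:end:type" into 'buf'. Returns the number of characters
 * that the full string needs, as snprintf() does.
 */
int format_resv_region(char *buf, size_t len, uint64_t start, uint64_t end,
                       int type)
{
    return snprintf(buf, len, "0x%llx:0x%llx:%d",
                    (unsigned long long)start, (unsigned long long)end, type);
}
```

For the MSI window this yields "0xfee00000:0xfeefffff:1" under the
assumptions above.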

Reviewed-by: Eric Auger 
Reviewed-by: Igor Mammedov 
Tested-by: Eric Auger 
Signed-off-by: Jean-Philippe Brucker 
Message-Id: <20211026182024.2642038-5-jean-phili...@linaro.org>
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 hw/i386/acpi-build.c | 10 +-
 hw/i386/pc.c | 16 +++-
 hw/i386/Kconfig  |  1 +
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index ab49e799ff..3ca6cc8118 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -68,9 +68,11 @@
 #include "qom/qom-qobject.h"
 #include "hw/i386/amd_iommu.h"
 #include "hw/i386/intel_iommu.h"
+#include "hw/virtio/virtio-iommu.h"
 
 #include "hw/acpi/ipmi.h"
 #include "hw/acpi/hmat.h"
+#include "hw/acpi/viot.h"
 
 /* These are used to size the ACPI tables for -M pc-i440fx-1.7 and
  * -M pc-i440fx-2.0.  Even if the actual amount of AML generated grows
@@ -2488,7 +2490,7 @@ void acpi_build(AcpiBuildTables *tables, MachineState 
*machine)
 PCMachineState *pcms = PC_MACHINE(machine);
 PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
 X86MachineState *x86ms = X86_MACHINE(machine);
-X86IOMMUState *iommu = x86_iommu_get_default();
+DeviceState *iommu = pcms->iommu;
 GArray *table_offsets;
 unsigned facs, dsdt, rsdt, fadt;
 AcpiPmInfo pm;
@@ -2613,6 +2615,12 @@ void acpi_build(AcpiBuildTables *tables, MachineState 
*machine)
 acpi_add_table(table_offsets, tables_blob);
 build_dmar_q35(tables_blob, tables->linker, x86ms->oem_id,
x86ms->oem_table_id);
+} else if (object_dynamic_cast(OBJECT(iommu), TYPE_VIRTIO_IOMMU_PCI)) {
+PCIDevice *pdev = PCI_DEVICE(iommu);
+
+acpi_add_table(table_offsets, tables_blob);
+build_viot(machine, tables_blob, tables->linker, pci_get_bdf(pdev),
+   x86ms->oem_id, x86ms->oem_table_id);
 }
 if (machine->nvdimms_state->is_enabled) {
 nvdimm_build_acpi(table_offsets, tables_blob, tables->linker,
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 7b1c4f41cd..e99017e662 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -83,6 +83,7 @@
 #include "hw/i386/intel_iommu.h"
 #include "hw/net/ne2000-isa.h"
 #include "standard-headers/asm-x86/bootparam.h"
+#include "hw/virtio/virtio-iommu.h"
 #include "hw/virtio/virtio-pmem-pci.h"
 #include "hw/virtio/virtio-mem-pci.h"
 #include "hw/mem/memory-device.h"
@@ -1330,7 +1331,19 @@ static void pc_machine_device_pre_plug_cb(HotplugHandler 
*hotplug_dev,
 } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
 pc_virtio_md_pci_pre_plug(hotplug_dev, dev, errp);
-} else if (object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE)) {
+} else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_IOMMU_PCI)) {
+/* Declare the APIC range as the reserved MSI region */
+char *resv_prop_str = g_strdup_printf("0xfee00000:0xfeefffff:%d",
+  VIRTIO_IOMMU_RESV_MEM_T_MSI);
+
+object_property_set_uint(OBJECT(dev), "len-reserved-regions", 1, errp);
+object_property_set_str(OBJECT(dev), "reserved-regions[0]",
+resv_prop_str, errp);
+g_free(resv_prop_str);
+}
+
+if (object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE) ||
+object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_IOMMU_PCI)) {
 PCMachineState *pcms = PC_MACHINE(hotplug_dev);
 
 if (pcms->iommu) {
@@ -1394,6 +1407,7 @@ static HotplugHandler 
*pc_get_hotplug_handler(MachineState *machine,
 object_dynamic_cast(OBJECT(dev), TYPE_CPU) ||
 object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
 object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI) ||
+object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_IOMMU_PCI) ||
 object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE)) {
 return HOTPLUG_HANDLER(machine);
 }
diff --git a/hw/i386/Kconfig b/hw/i386/Kconfig
index 962d2c981b..d22ac4a4b9 100644
--- a/hw/i386/Kconfig
+++ b/hw/i386/Kconfig
@@ -59,6 +59,7 @@ config PC_ACPI
 select ACPI_X86
 select ACPI_CPU_HOTPLUG
 select ACPI_MEMORY_HOTPLUG
+select ACPI_VIOT
 select SMBUS_EEPROM
 select
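
For illustration only: the pre-plug hook above hands virtio-iommu its reserved
MSI region as a "start:end:type" property string. A minimal sketch of that
formatting step follows, assuming the usual x86 MSI window
0xfee00000-0xfeefffff and assuming the MSI type code is 1. The demo_* names
are hypothetical and are not QEMU API.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed type code for an MSI reserved region (stand-in for the real macro). */
#define DEMO_RESV_MEM_T_MSI 1

/* Formats "<start>:<end>:<type>" into out; returns 0 on success, -1 if it does not fit. */
static inline int demo_format_resv_region(char *out, size_t out_len,
                                          uint64_t start, uint64_t end,
                                          int type)
{
    int n = snprintf(out, out_len, "0x%" PRIx64 ":0x%" PRIx64 ":%d",
                     start, end, type);

    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}

/* Small self-check: builds the string for the assumed x86 MSI window. */
static inline void demo_resv_selfcheck(void)
{
    char buf[64];

    assert(demo_format_resv_region(buf, sizeof(buf), 0xfee00000u,
                                   0xfeefffffu, DEMO_RESV_MEM_T_MSI) == 0);
    assert(strcmp(buf, "0xfee00000:0xfeefffff:1") == 0);
}
```

The helper reports truncation instead of silently emitting a short string,
since a clipped address range would describe the wrong region.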

Re: [PATCH v6 6/7] tests/acpi: add test case for VIOT on q35 machine

2021-11-01 Thread Michael S. Tsirkin

On Tue, Oct 26, 2021 at 07:20:25PM +0100, Jean-Philippe Brucker wrote:
> Add a test case for VIOT on the q35 machine. To test complex topologies
> it has two PCIe buses that bypass the IOMMU (and are therefore not
> described by VIOT), and two buses that are translated by virtio-iommu.
> 
> Reviewed-by: Eric Auger 
> Reviewed-by: Igor Mammedov 
> Signed-off-by: Jean-Philippe Brucker 

seems to need the bypass property patch

qemu-system-x86_64: Property 'pc-q35-6.2-machine.default-bus-bypass-iommu' not 
found

given Paolo decided to pick that one up, pls ping me
once that one is merged.



> ---
>  tests/qtest/bios-tables-test.c | 21 +
>  1 file changed, 21 insertions(+)
> 
> diff --git a/tests/qtest/bios-tables-test.c b/tests/qtest/bios-tables-test.c
> index 258874167e..a5e0fab9d5 100644
> --- a/tests/qtest/bios-tables-test.c
> +++ b/tests/qtest/bios-tables-test.c
> @@ -1465,6 +1465,26 @@ static void test_acpi_virt_tcg(void)
>  free_test_data();
>  }
>  
> +static void test_acpi_q35_viot(void)
> +{
> +test_data data = {
> +.machine = MACHINE_Q35,
> +.variant = ".viot",
> +};
> +
> +/*
> + * To keep things interesting, two buses bypass the IOMMU.
> + * VIOT should only describes the other two buses.
> + */
> +test_acpi_one("-machine default_bus_bypass_iommu=on "
> +  "-device virtio-iommu-pci "
> +  "-device pxb-pcie,bus_nr=0x10,id=pcie.100,bus=pcie.0 "
> +  "-device 
> pxb-pcie,bus_nr=0x20,id=pcie.200,bus=pcie.0,bypass_iommu=on "
> +  "-device pxb-pcie,bus_nr=0x30,id=pcie.300,bus=pcie.0",
> +  &data);
> +free_test_data();
> +}
> +
>  static void test_oem_fields(test_data *data)
>  {
>  int i;
> @@ -1639,6 +1659,7 @@ int main(int argc, char *argv[])
>  qtest_add_func("acpi/q35/kvm/xapic", test_acpi_q35_kvm_xapic);
>  qtest_add_func("acpi/q35/kvm/dmar", test_acpi_q35_kvm_dmar);
>  }
> +qtest_add_func("acpi/q35/viot", test_acpi_q35_viot);
>  } else if (strcmp(arch, "aarch64") == 0) {
>  if (has_tcg) {
>  qtest_add_func("acpi/virt", test_acpi_virt_tcg);
> -- 
> 2.33.0

[PATCH v3 5/6] hw/nvram: Update at24c EEPROM init function in NPCM7xx boards

2021-11-01 Thread Hao Wu

We made 3 changes to the at24c_eeprom_init function in
npcm7xx_boards.c:

1. We allow the function to take an I2CBus* as a parameter. This allows
   us to attach an EEPROM device behind an I2C mux, which is not
   possible with the old method.

2. We make at24c EEPROMs backed by drives so that we can
   specify the content of the EEPROMs.

3. Instead of using the I2C address as the unit number, this patch assigns
   a unique unit number to each EEPROM on each board. This avoids
   conflicts when providing contents for multiple EEPROMs with the same
   address. In the old method, if we specify two drives with the same unit
   number, the following error occurs: `Device with id 'none85' exists`.

Signed-off-by: Hao Wu 
---
 hw/arm/npcm7xx_boards.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/hw/arm/npcm7xx_boards.c b/hw/arm/npcm7xx_boards.c
index a656169f61..8c72024007 100644
--- a/hw/arm/npcm7xx_boards.c
+++ b/hw/arm/npcm7xx_boards.c
@@ -107,13 +107,17 @@ static I2CBus *npcm7xx_i2c_get_bus(NPCM7xxState *soc, 
uint32_t num)
 return I2C_BUS(qdev_get_child_bus(DEVICE(&soc->smbus[num]), "i2c-bus"));
 }
 
-static void at24c_eeprom_init(NPCM7xxState *soc, int bus, uint8_t addr,
-  uint32_t rsize)
+static void at24c_eeprom_init(I2CBus *i2c_bus, int bus, uint8_t addr,
+  uint32_t rsize, int unit_number)
 {
-I2CBus *i2c_bus = npcm7xx_i2c_get_bus(soc, bus);
 I2CSlave *i2c_dev = i2c_slave_new("at24c-eeprom", addr);
 DeviceState *dev = DEVICE(i2c_dev);
+DriveInfo *dinfo;
 
+dinfo = drive_get(IF_NONE, bus, unit_number);
+if (dinfo) {
+qdev_prop_set_drive(dev, "drive", blk_by_legacy_dinfo(dinfo));
+}
 qdev_prop_set_uint32(dev, "rom-size", rsize);
 i2c_slave_realize_and_unref(i2c_dev, i2c_bus, &error_abort);
 }
@@ -220,8 +224,8 @@ static void quanta_gsj_i2c_init(NPCM7xxState *soc)
 i2c_slave_create_simple(npcm7xx_i2c_get_bus(soc, 3), "tmp105", 0x5c);
 i2c_slave_create_simple(npcm7xx_i2c_get_bus(soc, 4), "tmp105", 0x5c);
 
-at24c_eeprom_init(soc, 9, 0x55, 8192);
-at24c_eeprom_init(soc, 10, 0x55, 8192);
+at24c_eeprom_init(npcm7xx_i2c_get_bus(soc, 9), 9, 0x55, 8192, 0);
+at24c_eeprom_init(npcm7xx_i2c_get_bus(soc, 10), 10, 0x55, 8192, 1);
 
 /*
  * i2c-11:
-- 
2.33.1.1089.g2158813163f-goog
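
For illustration only: point 3 of the commit message in patch 5/6 can be
pictured with a toy drive table keyed by unit number. This is a sketch under
stated assumptions, not QEMU's block layer; every demo_* name is hypothetical.
With the old scheme, two EEPROMs at I2C address 0x55 (decimal 85) on different
buses both ask for unit 85, which matches the 'none85' clash quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_DRIVES 8

/* Toy drive table keyed by unit number only (hypothetical). */
struct demo_drive_table {
    int units[DEMO_MAX_DRIVES];
    size_t count;
};

/* Registers a unit; returns false if it is already taken or the table is full. */
static inline bool demo_register_unit(struct demo_drive_table *t, int unit)
{
    for (size_t i = 0; i < t->count; i++) {
        if (t->units[i] == unit) {
            return false;
        }
    }
    if (t->count == DEMO_MAX_DRIVES) {
        return false;
    }
    t->units[t->count++] = unit;
    return true;
}

/* Self-check: address-as-unit collides, per-board unit numbers do not. */
static inline void demo_unit_selfcheck(void)
{
    struct demo_drive_table old_scheme = { .count = 0 };
    struct demo_drive_table new_scheme = { .count = 0 };

    assert(demo_register_unit(&old_scheme, 0x55));  /* EEPROM on bus 9 */
    assert(!demo_register_unit(&old_scheme, 0x55)); /* EEPROM on bus 10 */
    assert(demo_register_unit(&new_scheme, 0));
    assert(demo_register_unit(&new_scheme, 1));
}
```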

[PATCH v3 4/6] hw/adc: Make adci[*] R/W in NPCM7XX ADC

2021-11-01 Thread Hao Wu

Our sensor test requires both reading and writing from a sensor's
QOM property. So we need to make the input of ADC module R/W instead
of write only for that to work.

Signed-off-by: Hao Wu 
Reviewed-by: Titus Rwantare 
Reviewed-by: Peter Maydell 
---
 hw/adc/npcm7xx_adc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/adc/npcm7xx_adc.c b/hw/adc/npcm7xx_adc.c
index 47fb9e5f74..bc6f3f55e6 100644
--- a/hw/adc/npcm7xx_adc.c
+++ b/hw/adc/npcm7xx_adc.c
@@ -242,7 +242,7 @@ static void npcm7xx_adc_init(Object *obj)
 
 for (i = 0; i < NPCM7XX_ADC_NUM_INPUTS; ++i) {
 object_property_add_uint32_ptr(obj, "adci[*]",
-&s->adci[i], OBJ_PROP_FLAG_WRITE);
+&s->adci[i], OBJ_PROP_FLAG_READWRITE);
 }
 object_property_add_uint32_ptr(obj, "vref",
 &s->vref, OBJ_PROP_FLAG_WRITE);
-- 
2.33.1.1089.g2158813163f-goog

[PATCH v3 6/6] hw/arm: quanta-gbs-bmc add i2c devices

2021-11-01 Thread Hao Wu

From: Patrick Venture 

Add supported i2c devices to the quanta-gbs-bmc board.

Signed-off-by: Patrick Venture 
Reviewed-by: Hao Wu 
---
 hw/arm/npcm7xx_boards.c | 82 -
 1 file changed, 49 insertions(+), 33 deletions(-)

diff --git a/hw/arm/npcm7xx_boards.c b/hw/arm/npcm7xx_boards.c
index 8c72024007..e3f0d337ab 100644
--- a/hw/arm/npcm7xx_boards.c
+++ b/hw/arm/npcm7xx_boards.c
@@ -257,10 +257,12 @@ static void quanta_gsj_fan_init(NPCM7xxMachine *machine, 
NPCM7xxState *soc)
 
 static void quanta_gbs_i2c_init(NPCM7xxState *soc)
 {
+I2CSlave *i2c_mux;
+
+/* i2c-0: */
+i2c_slave_create_simple(npcm7xx_i2c_get_bus(soc, 0), TYPE_PCA9546, 0x71);
+
 /*
- * i2c-0:
- * pca9546@71
- *
  * i2c-1:
  * pca9535@24
  * pca9535@20
@@ -269,46 +271,60 @@ static void quanta_gbs_i2c_init(NPCM7xxState *soc)
  * pca9535@23
  * pca9535@25
  * pca9535@26
- *
- * i2c-2:
- * sbtsi@4c
- *
- * i2c-5:
- * atmel,24c64@50 mb_fru
- * pca9546@71
- * - channel 0: max31725@54
- * - channel 1: max31725@55
- * - channel 2: max31725@5d
- *  atmel,24c64@51 fan_fru
- * - channel 3: atmel,24c64@52 hsbp_fru
- *
+ */
+
+/* i2c-2: sbtsi@4c */
+
+/* i2c-5: */
+/* mb_fru */
+at24c_eeprom_init(npcm7xx_i2c_get_bus(soc, 5), 5, 0x50, 8192, 0);
+i2c_mux = i2c_slave_create_simple(npcm7xx_i2c_get_bus(soc, 5),
+  TYPE_PCA9546, 0x71);
+/* max31725 is tmp105 compatible. */
+i2c_slave_create_simple(pca954x_i2c_get_bus(i2c_mux, 0), "tmp105", 0x54);
+i2c_slave_create_simple(pca954x_i2c_get_bus(i2c_mux, 1), "tmp105", 0x55);
+i2c_slave_create_simple(pca954x_i2c_get_bus(i2c_mux, 2), "tmp105", 0x5d);
+/* fan_fru */
+at24c_eeprom_init(pca954x_i2c_get_bus(i2c_mux, 2), 5, 0x51, 8192, 1);
+/* hsbp_fru */
+at24c_eeprom_init(pca954x_i2c_get_bus(i2c_mux, 3), 5, 0x52, 8192, 2);
+
+/*
  * i2c-6:
  * pca9545@73
  *
  * i2c-7:
  * pca9545@72
- *
- * i2c-8:
- * adi,adm1272@10
- *
- * i2c-9:
- * pca9546@71
- * - channel 0: isil,isl68137@60
- * - channel 1: isil,isl68137@61
- * - channel 2: isil,isl68137@63
- * - channel 3: isil,isl68137@45
- *
+ */
+
+/* i2c-8: */
+i2c_slave_create_simple(npcm7xx_i2c_get_bus(soc, 8), "adm1272", 0x10);
+
+/* i2c-9: */
+i2c_slave_create_simple(npcm7xx_i2c_get_bus(soc, 9), TYPE_PCA9546, 0x71);
+/*
+ * - channel 0: isil,isl68137@60
+ * - channel 1: isil,isl68137@61
+ * - channel 2: isil,isl68137@63
+ * - channel 3: isil,isl68137@45
+ */
+
+/*
  * i2c-10:
  * pca9545@71
  *
  * i2c-11:
  * pca9545@76
- *
- * i2c-12:
- * maxim,max34451@4e
- * isil,isl68137@5d
- * isil,isl68137@5e
- *
+ */
+
+/* i2c-12: */
+i2c_slave_create_simple(npcm7xx_i2c_get_bus(soc, 12), "max34451", 0x4e);
+/*
+ * isil,isl68137@5d
+ * isil,isl68137@5e
+ */
+
+/*
  * i2c-14:
  * pca9545@70
  */
-- 
2.33.1.1089.g2158813163f-goog

[PATCH v3 2/6] hw/i2c: Read FIFO during RXF_CTL change in NPCM7XX SMBus

2021-11-01 Thread Hao Wu

Originally we read in from SMBus when RXF_STS is cleared. However,
the driver clears RXF_STS before setting RXF_CTL, causing the SMBus
module to read an incorrect number of bytes in FIFO mode when the number
of bytes to read changes. This patch fixes this issue.

Signed-off-by: Hao Wu 
Reviewed-by: Titus Rwantare 
Acked-by: Corey Minyard 
---
 hw/i2c/npcm7xx_smbus.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/i2c/npcm7xx_smbus.c b/hw/i2c/npcm7xx_smbus.c
index f18e311556..1435daea94 100644
--- a/hw/i2c/npcm7xx_smbus.c
+++ b/hw/i2c/npcm7xx_smbus.c
@@ -637,9 +637,6 @@ static void npcm7xx_smbus_write_rxf_sts(NPCM7xxSMBusState 
*s, uint8_t value)
 {
 if (value & NPCM7XX_SMBRXF_STS_RX_THST) {
 s->rxf_sts &= ~NPCM7XX_SMBRXF_STS_RX_THST;
-if (s->status == NPCM7XX_SMBUS_STATUS_RECEIVING) {
-npcm7xx_smbus_recv_fifo(s);
-}
 }
 }
 
@@ -651,6 +648,9 @@ static void npcm7xx_smbus_write_rxf_ctl(NPCM7xxSMBusState 
*s, uint8_t value)
 new_ctl = KEEP_OLD_BIT(s->rxf_ctl, new_ctl, NPCM7XX_SMBRXF_CTL_LAST);
 }
 s->rxf_ctl = new_ctl;
+if (s->status == NPCM7XX_SMBUS_STATUS_RECEIVING) {
+npcm7xx_smbus_recv_fifo(s);
+}
 }
 
 static uint64_t npcm7xx_smbus_read(void *opaque, hwaddr offset, unsigned size)
-- 
2.33.1.1089.g2158813163f-goog

[PATCH v3 3/6] hw/adc: Fix CONV bit in NPCM7XX ADC CON register

2021-11-01 Thread Hao Wu

The correct bit for the CONV bit in NPCM7XX ADC is bit 13. This patch
fixes that in the module, and also lowers the IRQ when the guest
is done handling an interrupt event from the ADC module.

Signed-off-by: Hao Wu 
Reviewed-by: Patrick Venture
Reviewed-by: Peter Maydell 
---
 hw/adc/npcm7xx_adc.c   | 2 +-
 tests/qtest/npcm7xx_adc-test.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/adc/npcm7xx_adc.c b/hw/adc/npcm7xx_adc.c
index 0f0a9f63e2..47fb9e5f74 100644
--- a/hw/adc/npcm7xx_adc.c
+++ b/hw/adc/npcm7xx_adc.c
@@ -36,7 +36,7 @@ REG32(NPCM7XX_ADC_DATA, 0x4)
 #define NPCM7XX_ADC_CON_INT BIT(18)
 #define NPCM7XX_ADC_CON_EN  BIT(17)
 #define NPCM7XX_ADC_CON_RST BIT(16)
-#define NPCM7XX_ADC_CON_CONV BIT(14)
+#define NPCM7XX_ADC_CON_CONV BIT(13)
 #define NPCM7XX_ADC_CON_DIV(rv) extract32(rv, 1, 8)
 
 #define NPCM7XX_ADC_MAX_RESULT  1023
diff --git a/tests/qtest/npcm7xx_adc-test.c b/tests/qtest/npcm7xx_adc-test.c
index 5ce8ce13b3..aaf127dd42 100644
--- a/tests/qtest/npcm7xx_adc-test.c
+++ b/tests/qtest/npcm7xx_adc-test.c
@@ -50,7 +50,7 @@
 #define CON_INT BIT(18)
 #define CON_EN  BIT(17)
 #define CON_RST BIT(16)
-#define CON_CONV BIT(14)
+#define CON_CONV BIT(13)
 #define CON_DIV(rv) extract32(rv, 1, 8)
 
 #define FST_RDST BIT(1)
-- 
2.33.1.1089.g2158813163f-goog
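
For illustration only: the effect of the off-by-one bit position in patch 3/6
can be sketched as below, assuming (as the commit message states) that the
guest starts a conversion by setting bit 13 of the CON register. The demo_*
names are hypothetical and are not the QEMU macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_BIT(n) (UINT32_C(1) << (n))

/* Returns true when `con` has the bit at position `conv_bit` set. */
static inline bool demo_conv_requested(uint32_t con, unsigned conv_bit)
{
    return (con & DEMO_BIT(conv_bit)) != 0;
}

/* Self-check: a write that sets bit 13 is missed by a model testing bit 14. */
static inline void demo_conv_selfcheck(void)
{
    uint32_t guest_write = DEMO_BIT(13);

    assert(demo_conv_requested(guest_write, 13));
    assert(!demo_conv_requested(guest_write, 14));
}
```

A device model that tests bit 14 would never see the conversion request, so
the qtest has to use the same bit position as the device.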

[PATCH v3 1/6] hw/i2c: Clear ACK bit in NPCM7xx SMBus module

2021-11-01 Thread Hao Wu

The ACK bit in NPCM7XX SMBus module should be cleared each time it
sends out a NACK signal. This patch fixes the bug that it fails to
do so.

Signed-off-by: Hao Wu 
Reviewed-by: Titus Rwantare 
Reviewed-by: Peter Maydell 
---
 hw/i2c/npcm7xx_smbus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i2c/npcm7xx_smbus.c b/hw/i2c/npcm7xx_smbus.c
index e7e0ba66fe..f18e311556 100644
--- a/hw/i2c/npcm7xx_smbus.c
+++ b/hw/i2c/npcm7xx_smbus.c
@@ -270,7 +270,7 @@ static void npcm7xx_smbus_recv_byte(NPCM7xxSMBusState *s)
 if (s->st & NPCM7XX_SMBCTL1_ACK) {
 trace_npcm7xx_smbus_nack(DEVICE(s)->canonical_path);
 i2c_nack(s->bus);
-s->st &= NPCM7XX_SMBCTL1_ACK;
+s->st &= ~NPCM7XX_SMBCTL1_ACK;
 }
 trace_npcm7xx_smbus_recv_byte((DEVICE(s)->canonical_path), s->sda);
 npcm7xx_smbus_update_irq(s);
-- 
2.33.1.1089.g2158813163f-goog
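
For illustration only: the one-character fix in patch 1/6 is the difference
between masking with a bit and masking with its complement. A minimal sketch
follows, assuming a made-up mask value; the real NPCM7XX_SMBCTL1_ACK
definition is not shown in this thread, and the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mask for illustration; not the real register layout. */
#define DEMO_CTL1_ACK 0x10u

/* Buggy form: keeps only the ACK bit and wipes every other bit. */
static inline uint8_t demo_clear_ack_buggy(uint8_t st)
{
    return (uint8_t)(st & DEMO_CTL1_ACK);
}

/* Fixed form: clears the ACK bit and leaves the other bits untouched. */
static inline uint8_t demo_clear_ack_fixed(uint8_t st)
{
    return (uint8_t)(st & ~DEMO_CTL1_ACK);
}

/* Self-check with ACK plus two unrelated bits set (0x13). */
static inline void demo_ack_selfcheck(void)
{
    assert(demo_clear_ack_buggy(0x13) == 0x10); /* ACK still set */
    assert(demo_clear_ack_fixed(0x13) == 0x03); /* ACK cleared */
}
```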

[PATCH v3 0/6] Misc NPCM7XX patches

2021-11-01 Thread Hao Wu

This patch set contains a few bug fixes and I2C devices for some
NPCM7XX boards.

Patch 1~2 fix a problem that causes the SMBus module to behave
incorrectly when it's in FIFO mode and trying to receive more than
16 bytes at a time.

Patch 3 fixes an error in a register for the ADC module.

Patch 4 makes the ADC input R/W instead of write only. It allows
a test system to read these via QMP and has no negative effect.

Patch 5 modifies at24c_eeprom_init in NPCM7xx boards so that it can fit
more use cases.

Patch 6 uses the function defined in patch 5 to add the EEPROM and other
I2C devices for Quanta GBS board.

-- Changes since v2:
1. Dropped patch 7.
2. Drop an extra variable in patch 5.

-- Changes since v1:
1. Rewrote patch 5 to implement the function in NPCM7xx board file instead
   of the EEPROM device file.
2. Slightly modify patch 6 to adapt to the changes and QEMU comment style.
3. Squash patch 7 into patch 5 to make it compile.
4. Add a new patch 7.

Hao Wu (5):
  hw/i2c: Clear ACK bit in NPCM7xx SMBus module
  hw/i2c: Read FIFO during RXF_CTL change in NPCM7XX SMBus
  hw/adc: Fix CONV bit in NPCM7XX ADC CON register
  hw/adc: Make adci[*] R/W in NPCM7XX ADC
  hw/nvram: Update at24c EEPROM init function in NPCM7xx boards

Patrick Venture (1):
  hw/arm: quanta-gbs-bmc add i2c devices

 hw/adc/npcm7xx_adc.c   |  4 +-
 hw/arm/npcm7xx_boards.c| 96 --
 hw/i2c/npcm7xx_smbus.c |  8 +--
 tests/qtest/npcm7xx_adc-test.c |  2 +-
 4 files changed, 65 insertions(+), 45 deletions(-)

-- 
2.33.1.1089.g2158813163f-goog

Re: [PATCH v2] hmp: Add shortcut to stop command to match cont

2021-11-01 Thread BALATON Zoltan


Ping? This is a really simple addition that shouldn't take long to review.

On Sat, 30 Oct 2021, BALATON Zoltan wrote:

Some commands such as quit or cont have one letter alternatives but
stop is missing that. Add stop|s to match cont|c for consistency and
convenience.

Signed-off-by: BALATON Zoltan 
---
v2: Fixed typo in commit title

hmp-commands.hx | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index cf723c69ac..07a738a8e2 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -382,7 +382,7 @@ SRST
ERST

{
-.name   = "stop",
+.name   = "stop|s",
.args_type  = "",
.params = "",
.help   = "stop emulation",
@@ -390,7 +390,7 @@ ERST
},

SRST
-``stop``
+``stop`` or ``s``
  Stop emulation.
ERST
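
For illustration only: a name such as "stop|s" lists alternatives separated
by '|'. The sketch below shows one way such a name could be matched against
user input. It is a hypothetical helper (demo_cmd_matches), not the monitor's
actual lookup code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true if `input` equals one of the '|'-separated names in `names`. */
static inline bool demo_cmd_matches(const char *input, const char *names)
{
    size_t len = strlen(input);
    const char *p = names;

    while (*p != '\0') {
        const char *end = strchr(p, '|');
        size_t alt_len = end ? (size_t)(end - p) : strlen(p);

        if (alt_len == len && memcmp(p, input, len) == 0) {
            return true;
        }
        if (!end) {
            break;
        }
        p = end + 1;
    }
    return false;
}

/* Self-check: both spellings match, a prefix alone does not. */
static inline void demo_cmd_selfcheck(void)
{
    assert(demo_cmd_matches("stop", "stop|s"));
    assert(demo_cmd_matches("s", "stop|s"));
    assert(!demo_cmd_matches("st", "stop|s"));
}
```

Matching whole alternatives rather than prefixes keeps "s" from also
selecting any other command that merely starts with that letter.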

Re: [PATCH v2 7/7] hw/arm: Add ID for NPCM7XX SMBus

2021-11-01 Thread Hao Wu

I was trying to allow attaching a device using "-device xxx,bus=smbus[0]"
Maybe there's a better way to allow that?

I guess I can drop this one from the patch set.

On Mon, Nov 1, 2021 at 10:33 AM Peter Maydell 
wrote:

> On Thu, 21 Oct 2021 at 19:40, Hao Wu  wrote:
> >
> > The ID can be used to indicate SMBus modules when adding
> > dynamic devices to them.
> >
> > Signed-off-by: Hao Wu 
> > ---
> >  hw/arm/npcm7xx.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/hw/arm/npcm7xx.c b/hw/arm/npcm7xx.c
> > index 2ab0080e0b..72953d65ef 100644
> > --- a/hw/arm/npcm7xx.c
> > +++ b/hw/arm/npcm7xx.c
> > @@ -421,6 +421,7 @@ static void npcm7xx_init(Object *obj)
> >  for (i = 0; i < ARRAY_SIZE(s->smbus); i++) {
> >  object_initialize_child(obj, "smbus[*]", &s->smbus[i],
> >  TYPE_NPCM7XX_SMBUS);
> > +DEVICE(&s->smbus[i])->id = g_strdup_printf("smbus[%d]", i);
> >  }
>
> This one looks weird to me -- I'm pretty sure we shouldn't be messing
> about with the DeviceState id string like that. It's supposed to be
> internal to the QOM/qdev code.
>
> -- PMM
>

Re: [PATCH v4 5/6] tests/acceptance: Add bFLT loader linux-user test

2021-11-01 Thread Philippe Mathieu-Daudé

On 11/1/21 18:51, Willian Rampazzo wrote:
> Hi, Phill,
> 
> On Mon, Sep 27, 2021 at 1:31 PM Philippe Mathieu-Daudé  
> wrote:
>>
>> Add a very quick test that runs a busybox binary in bFLT format:
>>
>>   $ AVOCADO_ALLOW_UNTRUSTED_CODE=1 \
>> avocado --show=app run -t linux_user tests/acceptance/load_bflt.py
>>   JOB ID : db94d5960ce564c50904d666a7e259148c27e88f
>>   JOB LOG: ~/avocado/job-results/job-2019-06-25T10.52-db94d59/job.log
>>(1/1) tests/acceptance/load_bflt.py:LoadBFLT.test_stm32: PASS (0.15 s)
>>   RESULTS: PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | 
>> CANCEL 0
>>   JOB TIME   : 0.54 s
>>
>> Reviewed-by: Willian Rampazzo 
>> Signed-off-by: Philippe Mathieu-Daudé 
>> ---
>>  tests/acceptance/load_bflt.py | 54 +++
>>  1 file changed, 54 insertions(+)
>>  create mode 100644 tests/acceptance/load_bflt.py
>>
>> diff --git a/tests/acceptance/load_bflt.py b/tests/acceptance/load_bflt.py
>> new file mode 100644
>> index 000..f071a979d8e
>> --- /dev/null
>> +++ b/tests/acceptance/load_bflt.py
>> @@ -0,0 +1,54 @@
>> +# Test the bFLT loader format
>> +#
>> +# Copyright (C) 2019 Philippe Mathieu-Daudé 
>> +#
>> +# SPDX-License-Identifier: GPL-2.0-or-later
>> +
>> +import os
>> +import bz2
>> +import subprocess
>> +
>> +from avocado import skipUnless
>> +from avocado_qemu import QemuUserTest
>> +from avocado_qemu import has_cmd
>> +
>> +
>> +class LoadBFLT(QemuUserTest):
>> +
>> +def extract_cpio(self, cpio_path):
>> +"""
>> +Extracts a cpio archive into the test workdir
>> +
>> +:param cpio_path: path to the cpio archive
>> +"""
>> +cwd = os.getcwd()
>> +os.chdir(self.workdir)
>> +with bz2.open(cpio_path, 'rb') as archive_cpio:
>> +subprocess.run(['cpio', '-i'], input=archive_cpio.read(),
>> +   stderr=subprocess.DEVNULL)
>> +os.chdir(cwd)
>> +
>> +@skipUnless(*has_cmd('cpio'))
>> +@skipUnless(os.getenv('AVOCADO_ALLOW_UNTRUSTED_CODE'), 'untrusted code')
>> +def test_stm32(self):
>> +"""
>> +:avocado: tags=arch:arm
>> +:avocado: tags=linux_user
>> +:avocado: tags=quick
>> +"""
>> +# See https://elinux.org/STM32#User_Space
>> +rootfs_url = ('https://elinux.org/images/5/51/'
>> +  'Stm32_mini_rootfs.cpio.bz2')
>> +rootfs_hash = '9f065e6ba40cce7411ba757f924f30fcc57951e6'
>> +rootfs_path_bz2 = self.fetch_asset(rootfs_url, 
>> asset_hash=rootfs_hash)
>> +busybox_path = self.workdir + "/bin/busybox"
> 
> If there are other changes to this patch, also, change this to use the
> `os` library:
> 
> busybox_path = os.path.join(self.workdir, "/bin/busybox")

OK, I'll update.

Re: [PATCH v4 6/6] tests/acceptance: Rename avocado_qemu.Test -> QemuSystemTest

2021-11-01 Thread Philippe Mathieu-Daudé

On 11/1/21 20:11, Willian Rampazzo wrote:
> On Mon, Sep 27, 2021 at 1:32 PM Philippe Mathieu-Daudé  
> wrote:
>>
>> To run user-mode emulation tests, we introduced the
>> avocado_qemu.QemuUserTest which inherits from avocado_qemu.QemuBaseTest.
>> System-mode emulation tests are based on the avocado_qemu.Test class,
>> which also inherits avocado_qemu.QemuBaseTest. To avoid confusion,
>> rename it as avocado_qemu.QemuSystemTest.
>>
>> Suggested-by: Wainer dos Santos Moschetta 
>> Signed-off-by: Philippe Mathieu-Daudé 
>> ---
>>  tests/acceptance/avocado_qemu/__init__.py| 21 +---
>>  tests/acceptance/boot_linux_console.py   |  4 ++--
>>  tests/acceptance/cpu_queries.py  |  4 ++--
>>  tests/acceptance/empty_cpu_model.py  |  4 ++--
>>  tests/acceptance/info_usernet.py |  4 ++--
>>  tests/acceptance/linux_initrd.py |  4 ++--
>>  tests/acceptance/linux_ssh_mips_malta.py |  5 +++--
>>  tests/acceptance/machine_arm_canona1100.py   |  4 ++--
>>  tests/acceptance/machine_arm_integratorcp.py |  4 ++--
>>  tests/acceptance/machine_arm_n8x0.py |  4 ++--
>>  tests/acceptance/machine_avr6.py |  4 ++--
>>  tests/acceptance/machine_m68k_nextcube.py|  4 ++--
>>  tests/acceptance/machine_microblaze.py   |  4 ++--
>>  tests/acceptance/machine_mips_fuloong2e.py   |  4 ++--
>>  tests/acceptance/machine_mips_loongson3v.py  |  4 ++--
>>  tests/acceptance/machine_mips_malta.py   |  4 ++--
>>  tests/acceptance/machine_ppc.py  |  4 ++--
>>  tests/acceptance/machine_rx_gdbsim.py|  4 ++--
>>  tests/acceptance/machine_s390_ccw_virtio.py  |  4 ++--
>>  tests/acceptance/machine_sparc_leon3.py  |  4 ++--
>>  tests/acceptance/migration.py|  4 ++--
>>  tests/acceptance/multiprocess.py |  4 ++--
>>  tests/acceptance/pc_cpu_hotplug_props.py |  4 ++--
>>  tests/acceptance/ppc_prep_40p.py |  4 ++--
>>  tests/acceptance/version.py  |  4 ++--
>>  tests/acceptance/virtio-gpu.py   |  4 ++--
>>  tests/acceptance/virtio_check_params.py  |  4 ++--
>>  tests/acceptance/virtio_version.py   |  4 ++--
>>  tests/acceptance/vnc.py  |  4 ++--
>>  tests/acceptance/x86_cpu_model_versions.py   |  4 ++--
>>  30 files changed, 68 insertions(+), 70 deletions(-)

>> -class Test(QemuBaseTest):
>> -"""Facilitates system emulation tests.
>> -
>> -TODO: Rename this class as `QemuSystemTest`.
>> -"""
>> +class QemuSystemTest(QemuBaseTest):
>> +"""Facilitates system emulation tests."""
>>
>>  def setUp(self):
>>  self._vms = {}
>>
>> -super(Test, self).setUp('qemu-system-')
>> +super(QemuSystemTest, self).setUp('qemu-system-')
> 
> If you take my suggestion in one of the previous patches, you don't
> need this change here.

Indeed.

>>
>>  self.machine = self.params.get('machine',
>> 
>> default=self._get_unique_tag_val('machine'))
>> @@ -515,11 +512,11 @@ def default_kernel_params(self):
>>  return self._info.get('kernel_params', None)
>>
>>
>> -class LinuxTest(Test, LinuxSSHMixIn):
>> +class LinuxTest(QemuSystemTest, LinuxSSHMixIn):
>>  """Facilitates having a cloud-image Linux based available.
>>
>>  For tests that indend to interact with guests, this is a better choice
> 
> If you touch this patch again, please, s/indend/intend/

OK.

> 
> So far, looks good to me
> 
> Reviewed-by: Willian Rampazzo 

Thanks for reviewing the series :)

Re: [PATCH v4 1/6] tests/acceptance: Extract QemuBaseTest from Test

2021-11-01 Thread Philippe Mathieu-Daudé

On 11/1/21 19:01, Willian Rampazzo wrote:
> On Mon, Sep 27, 2021 at 1:31 PM Philippe Mathieu-Daudé  
> wrote:
>>
>> The Avocado Test::fetch_asset() is handy to download artifacts
>> before running tests. The current class is named Test but only
>> tests system emulation. As we want to test user emulation,
>> refactor the common code as QemuBaseTest.
>>
>> Signed-off-by: Philippe Mathieu-Daudé 
>> ---
>>  tests/acceptance/avocado_qemu/__init__.py | 72 +--
>>  1 file changed, 41 insertions(+), 31 deletions(-)
>>
>> diff --git a/tests/acceptance/avocado_qemu/__init__.py 
>> b/tests/acceptance/avocado_qemu/__init__.py
>> index 2c4fef3e149..8fcbed74849 100644
>> --- a/tests/acceptance/avocado_qemu/__init__.py
>> +++ b/tests/acceptance/avocado_qemu/__init__.py
>> @@ -175,7 +175,7 @@ def exec_command_and_wait_for_pattern(test, command,
>>  """
>>  _console_interaction(test, success_message, failure_message, command + 
>> '\r')
>>
>> -class Test(avocado.Test):
>> +class QemuBaseTest(avocado.Test):
>>  def _get_unique_tag_val(self, tag_name):
>>  """
>>  Gets a tag value, if unique for a key
>> @@ -185,6 +185,46 @@ def _get_unique_tag_val(self, tag_name):
>>  return vals.pop()
>>  return None
>>
>> +def setUp(self):
>> +self.arch = self.params.get('arch',
>> +
>> default=self._get_unique_tag_val('arch'))
>> +
>> +self.cpu = self.params.get('cpu',
>> +   default=self._get_unique_tag_val('cpu'))
>> +
>> +default_qemu_bin = pick_default_qemu_bin(arch=self.arch)
>> +self.qemu_bin = self.params.get('qemu_bin',
>> +default=default_qemu_bin)
>> +if self.qemu_bin is None:
>> +self.cancel("No QEMU binary defined or found in the build tree")
>> +
>> +def fetch_asset(self, name,
>> +asset_hash=None, algorithm=None,
>> +locations=None, expire=None,
>> +find_only=False, cancel_on_missing=True):
>> +return super(QemuBaseTest, self).fetch_asset(name,
> 
> It is preferable to use the PEP3135
> (https://www.python.org/dev/peps/pep-3135/) when calling `super` as
> linter are complaining about it:
> 
> return super().fetch_asset(name,
> 
> And after reading through the patch I noticed it was a method move,
> so, feel free to take the suggestion or ignore it for now.

This series was sent before commit  14f02d8a9ec ("Merge
'integration-testing-20210927' into staging") :/

I'll modify, thanks.

Re: [PATCH v3 0/3] pc: Support configuration of SMBIOS entry point type

2021-11-01 Thread Michael S. Tsirkin

On Tue, Oct 26, 2021 at 11:10:57AM -0400, Eduardo Habkost wrote:
> This includes code previously submitted[1] by Daniel P. Berrangé
> to add a "smbios-ep" machine property on PC.
> 
> SMBIOS 3.0 is necessary to support more than ~720 VCPUs, as a
> large number of VCPUs can easily hit the table size limit of
> SMBIOS 2.1 entry points.


We need acks from QAPI supporters on this.

> Changes from v2:
> * Renamed option to "smbios-entry-point-type" for clarity
> * Renamed option values to "32" and "64", for two reasons:
>   * The option is not about reporting an exact SMBIOS
> version, but just the entry point format.
> FWIW, the SMBIOS specification uses the phrases "32-bit entry
> point" and "64-bit entry point" more often than "2.1 entry
> point" and "3.0 entry point".
>   * QAPI doesn't allow us to use enum member names with dots
> or underscores
> 
> [1] 
> https://lore.kernel.org/qemu-devel/20200908165438.1008942-5-berra...@redhat.com
> 
> https://lore.kernel.org/qemu-devel/20200908165438.1008942-6-berra...@redhat.com
> 
> Eduardo Habkost (3):
>   smbios: Rename SMBIOS_ENTRY_POINT_* enums
>   hw/smbios: Use qapi for SmbiosEntryPointType
>   hw/i386: expose a "smbios-entry-point-type" PC machine property
> 
>  include/hw/firmware/smbios.h | 10 ++
>  include/hw/i386/pc.h |  4 
>  hw/arm/virt.c|  2 +-
>  hw/i386/pc.c | 26 ++
>  hw/i386/pc_piix.c|  2 +-
>  hw/i386/pc_q35.c |  2 +-
>  hw/smbios/smbios.c   |  8 
>  qapi/machine.json| 12 
>  8 files changed, 51 insertions(+), 15 deletions(-)
> 
> -- 
> 2.32.0

Re: [PATCH v7 1/2] memory: introduce total_dirty_pages to stat dirty pages

2021-11-01 Thread Juan Quintela

huang...@chinatelecom.cn wrote:
> From: Hyman Huang(黄勇) 
>
> introduce global var total_dirty_pages to stat dirty pages
> along with memory_global_dirty_log_sync.
>
> Signed-off-by: Hyman Huang(黄勇) 

Reviewed-by: Juan Quintela

Re: [PATCH] hw/qdev-core: Add compatibility for (non)-transitional devs

2021-11-01 Thread Michael S. Tsirkin

On Tue, Oct 12, 2021 at 10:24:28AM +0200, Jean-Louis Dupond wrote:
> hw_compat modes only take into account their base name.
> But if a device is created with (non)-transitional, then the compat
> values are not used, causing migrating issues.
> 
> This commit adds their (non)-transitional entries with the same settings
> as the base entry.
> 
> Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1999141
> 
> Signed-off-by: Jean-Louis Dupond 


Jean-Louis, any chance you are going to address the comments
and post a new patch?


> ---
>  include/hw/qdev-core.h | 34 ++
>  1 file changed, 34 insertions(+)
> 
> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
> index 4ff19c714b..5726825c2d 100644
> --- a/include/hw/qdev-core.h
> +++ b/include/hw/qdev-core.h
> @@ -293,6 +293,30 @@ typedef struct GlobalProperty {
>  bool optional;
>  } GlobalProperty;
>  
> +
> +/**
> + * Helper to add (non)transitional compat properties
> + */
> +static inline void
> +compat_props_add_transitional(GPtrArray *arr, GlobalProperty *prop)
> +{
> +GlobalProperty *transitional = g_new0(typeof(*transitional), 1);
> +transitional->driver = g_strdup_printf("%s-transitional", prop->driver);
> +transitional->property = g_strdup(prop->property);
> +transitional->value = g_strdup(prop->value);
> +transitional->used = prop->used;
> +transitional->optional = prop->optional;
> +g_ptr_array_add(arr, (void *)transitional);
> +
> +GlobalProperty *non_transitional = g_new0(typeof(*non_transitional), 1);
> +non_transitional->driver = g_strdup_printf("%s-non-transitional", 
> prop->driver);
> +non_transitional->property = g_strdup(prop->property);
> +non_transitional->value = g_strdup(prop->value);
> +non_transitional->used = prop->used;
> +non_transitional->optional = prop->optional;
> +g_ptr_array_add(arr, (void *)non_transitional);
> +}
> +
>  static inline void
>  compat_props_add(GPtrArray *arr,
>   GlobalProperty props[], size_t nelem)
> @@ -300,6 +324,16 @@ compat_props_add(GPtrArray *arr,
>  int i;
>  for (i = 0; i < nelem; i++) {
>  g_ptr_array_add(arr, (void *)&props[i]);
> +if (g_str_equal(props[i].driver, "vhost-user-blk-pci") ||
> +g_str_equal(props[i].driver, "virtio-scsi-pci") ||
> +g_str_equal(props[i].driver, "virtio-blk-pci") ||
> +g_str_equal(props[i].driver, "virtio-balloon-pci") ||
> +g_str_equal(props[i].driver, "virtio-serial-pci") ||
> +g_str_equal(props[i].driver, "virtio-9p-pci") ||
> +g_str_equal(props[i].driver, "virtio-net-pci") ||
> +g_str_equal(props[i].driver, "virtio-rng-pci")) {
> +compat_props_add_transitional(arr, &props[i]);
> +}
>  }
>  }
>  
> -- 
> 2.33.0
> 
> 
>
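
For illustration only: the quoted patch derives two extra driver names from
each base name by appending "-transitional" and "-non-transitional". A minimal
sketch of that naming step in plain C follows, without GLib. The demo_* names
are hypothetical, and the fixed-size buffer is an assumption of this sketch
(the patch itself allocates with g_strdup_printf).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes "<driver><suffix>" into out; returns 0 on success, -1 if it does not fit. */
static inline int demo_variant_name(char *out, size_t out_len,
                                    const char *driver, const char *suffix)
{
    int n = snprintf(out, out_len, "%s%s", driver, suffix);

    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}

/* Self-check: derive both variant names for one base driver. */
static inline void demo_variant_selfcheck(void)
{
    char buf[64];

    assert(demo_variant_name(buf, sizeof(buf), "virtio-net-pci",
                             "-transitional") == 0);
    assert(strcmp(buf, "virtio-net-pci-transitional") == 0);
    assert(demo_variant_name(buf, sizeof(buf), "virtio-net-pci",
                             "-non-transitional") == 0);
    assert(strcmp(buf, "virtio-net-pci-non-transitional") == 0);
}
```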

[PULL 15/20] migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the destination

2021-11-01 Thread Juan Quintela

From: David Hildenbrand 

Currently, when someone (i.e., the VM) accesses discarded parts inside a
RAMBlock with a RamDiscardManager managing the corresponding mapped memory
region, postcopy will request migration of the corresponding page from the
source. The source, however, will never answer, because it refuses to
migrate such pages with undefined content ("logically unplugged"): the
pages are never dirty, and get_queued_page() will consequently skip
processing these postcopy requests.

Especially reading discarded ("logically unplugged") ranges is supposed to
work in some setups (for example with current virtio-mem), although it
barely ever happens: still, not placing a page would currently stall the
VM, as it cannot make forward progress.

Let's check the state via the RamDiscardManager (the state e.g.,
of virtio-mem is migrated during precopy) and avoid sending a request
that will never get answered. Place a fresh zero page instead to keep
the VM working. This is the same behavior that would happen
automatically without userfaultfd being active, when accessing virtual
memory regions without populated pages -- "populate on demand".

For now, there are valid cases (as documented in the virtio-mem spec) where
a VM might read discarded memory; in the future, we will disallow that.
Then, we might want to handle that case differently, e.g., warning the
user that the VM seems to be mis-behaving.
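
The decision described above can be sketched in isolation as follows. This is a minimal illustration, not the code from the patch: the enum, the function name resolve_fault() and the two boolean flags are hypothetical stand-ins for the RamDiscardManager state and the receive bitmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome of handling one postcopy page fault. */
enum fault_action {
    FAULT_NOTHING_TO_DO,   /* discarded page, zero page already placed */
    FAULT_PLACE_ZERO_PAGE, /* discarded page, populate it locally */
    FAULT_REQUEST_SOURCE,  /* regular page, ask the migration source */
};

/*
 * Decide how to resolve a fault at a page-aligned offset.
 * "discarded" stands in for the RamDiscardManager lookup and
 * "received" for the receive bitmap lookup (both are assumptions).
 */
static enum fault_action resolve_fault(uint64_t offset, uint64_t page_size,
                                       bool discarded, bool received)
{
    assert(page_size != 0 && offset % page_size == 0);

    if (discarded) {
        /* The source never sends these pages, so never ask for them. */
        return received ? FAULT_NOTHING_TO_DO : FAULT_PLACE_ZERO_PAGE;
    }
    return FAULT_REQUEST_SOURCE;
}
```

The point of the sketch is that a request is only sent to the source for pages it is actually willing to migrate.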

Reviewed-by: Peter Xu 
Signed-off-by: David Hildenbrand 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.h  |  1 +
 migration/postcopy-ram.c | 31 +++
 migration/ram.c  | 21 +
 3 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/migration/ram.h b/migration/ram.h
index 4833e9fd5b..dda1988f3d 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -72,6 +72,7 @@ void ramblock_recv_bitmap_set_range(RAMBlock *rb, void *host_addr, size_t nr);
 int64_t ramblock_recv_bitmap_send(QEMUFile *file,
   const char *block_name);
 int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb);
+bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start);
 
 /* ram cache */
 int colo_init_ram_cache(void);
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 2e9697bdd2..3609ce7e52 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -671,6 +671,29 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd,
 return ret;
 }
 
+static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb,
+ ram_addr_t start, uint64_t haddr)
+{
+void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb));
+
+/*
+ * Discarded pages (via RamDiscardManager) are never migrated. On unlikely
+ * access, place a zeropage, which will also set the relevant bits in the
+ * recv_bitmap accordingly, so we won't try placing a zeropage twice.
+ *
+ * Checking a single bit is sufficient to handle pagesize > TPS as either
+ * all relevant bits are set or not.
+ */
+assert(QEMU_IS_ALIGNED(start, qemu_ram_pagesize(rb)));
+if (ramblock_page_is_discarded(rb, start)) {
+bool received = ramblock_recv_bitmap_test_byte_offset(rb, start);
+
+return received ? 0 : postcopy_place_page_zero(mis, aligned, rb);
+}
+
+return migrate_send_rp_req_pages(mis, rb, start, haddr);
+}
+
 /*
  * Callback from shared fault handlers to ask for a page,
  * the page must be specified by a RAMBlock and an offset in that rb
@@ -690,7 +713,7 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
 qemu_ram_get_idstr(rb), rb_offset);
 return postcopy_wake_shared(pcfd, client_addr, rb);
 }
-migrate_send_rp_req_pages(mis, rb, aligned_rbo, client_addr);
+postcopy_request_page(mis, rb, aligned_rbo, client_addr);
 return 0;
 }
 
@@ -984,8 +1007,8 @@ retry:
  * Send the request to the source - we want to request one
  * of our host page sizes (which is >= TPS)
  */
-ret = migrate_send_rp_req_pages(mis, rb, rb_offset,
-msg.arg.pagefault.address);
+ret = postcopy_request_page(mis, rb, rb_offset,
+msg.arg.pagefault.address);
 if (ret) {
 /* May be network failure, try to wait for recovery */
 if (ret == -EIO && postcopy_pause_fault_thread(mis)) {
@@ -993,7 +1016,7 @@ retry:
 goto retry;
 } else {
 /* This is a unavoidable fault */
-error_report("%s: migrate_send_rp_req_pages() get %d",
+error_report("%s: postcopy_request_page() get %d",
  __func__, ret);
 break;
 }
diff --git

[PULL 16/20] migration: Simplify alignment and alignment checks

2021-11-01 Thread Juan Quintela

From: David Hildenbrand 

Let's use QEMU_ALIGN_DOWN() and friends to make the code a bit easier to
read.
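
For readers unfamiliar with the idiom being replaced, here is a small sketch of why the open-coded mask expressions and the alignment helpers agree for power-of-two alignments. The helper names below are hypothetical and are not QEMU's macro definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if x is a non-zero power of two. */
static bool is_pow2(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Open-coded variant: only valid for power-of-two alignments. */
static uint64_t align_down_mask(uint64_t n, uint64_t align)
{
    assert(is_pow2(align));
    return n & ~(align - 1);
}

/* Division-based variant: valid for any non-zero alignment. */
static uint64_t align_down_div(uint64_t n, uint64_t align)
{
    assert(align != 0);
    return n / align * align;
}

/* Alignment check expressed without bit tricks. */
static bool is_aligned(uint64_t n, uint64_t align)
{
    assert(align != 0);
    return n % align == 0;
}
```

Both align-down variants return the same value whenever the alignment is a power of two, which is the case for page sizes.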

Reviewed-by: Peter Xu 
Signed-off-by: David Hildenbrand 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/migration.c| 6 +++---
 migration/postcopy-ram.c | 9 -
 migration/ram.c  | 2 +-
 3 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index e1c0082530..53b9a8af96 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -391,7 +391,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
 int migrate_send_rp_req_pages(MigrationIncomingState *mis,
   RAMBlock *rb, ram_addr_t start, uint64_t haddr)
 {
-void *aligned = (void *)(uintptr_t)(haddr & (-qemu_ram_pagesize(rb)));
+void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb));
 bool received = false;
 
 WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
@@ -2637,8 +2637,8 @@ static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
  * Since we currently insist on matching page sizes, just sanity check
  * we're being asked for whole host pages.
  */
-if (start & (our_host_ps - 1) ||
-   (len & (our_host_ps - 1))) {
+if (!QEMU_IS_ALIGNED(start, our_host_ps) ||
+!QEMU_IS_ALIGNED(len, our_host_ps)) {
 error_report("%s: Misaligned page request, start: " RAM_ADDR_FMT
  " len: %zd", __func__, start, len);
 mark_source_rp_bad(ms);
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 3609ce7e52..e721f69d0f 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -402,7 +402,7 @@ bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
  strerror(errno));
 goto out;
 }
-g_assert(((size_t)testarea & (pagesize - 1)) == 0);
+g_assert(QEMU_PTR_IS_ALIGNED(testarea, pagesize));
 
 reg_struct.range.start = (uintptr_t)testarea;
 reg_struct.range.len = pagesize;
@@ -660,7 +660,7 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd,
 struct uffdio_range range;
 int ret;
 trace_postcopy_wake_shared(client_addr, qemu_ram_get_idstr(rb));
-range.start = client_addr & ~(pagesize - 1);
+range.start = ROUND_DOWN(client_addr, pagesize);
 range.len = pagesize;
 ret = ioctl(pcfd->fd, UFFDIO_WAKE, &range);
 if (ret) {
@@ -702,8 +702,7 @@ static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb,
 int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
  uint64_t client_addr, uint64_t rb_offset)
 {
-size_t pagesize = qemu_ram_pagesize(rb);
-uint64_t aligned_rbo = rb_offset & ~(pagesize - 1);
+uint64_t aligned_rbo = ROUND_DOWN(rb_offset, qemu_ram_pagesize(rb));
 MigrationIncomingState *mis = migration_incoming_get_current();
 
 trace_postcopy_request_shared_page(pcfd->idstr, qemu_ram_get_idstr(rb),
@@ -993,7 +992,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
 break;
 }
 
-rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
+rb_offset = ROUND_DOWN(rb_offset, qemu_ram_pagesize(rb));
 trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
 qemu_ram_get_idstr(rb),
 rb_offset,
diff --git a/migration/ram.c b/migration/ram.c
index 4f629de7d0..54df5dc0fc 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -811,7 +811,7 @@ static void migration_clear_memory_region_dirty_bitmap(RAMBlock *rb,
 assert(shift >= 6);
 
 size = 1ULL << (TARGET_PAGE_BITS + shift);
-start = (((ram_addr_t)page) << TARGET_PAGE_BITS) & (-size);
+start = QEMU_ALIGN_DOWN((ram_addr_t)page << TARGET_PAGE_BITS, size);
 trace_migration_bitmap_clear_dirty(rb->idstr, start, size, page);
 memory_region_clear_dirty_bitmap(rb->mr, start, size);
 }
-- 
2.33.1

[PULL 20/20] migration/dirtyrate: implement dirty-bitmap dirtyrate calculation

2021-11-01 Thread Juan Quintela

From: Hyman Huang(黄勇) 

Introduce dirty-bitmap mode as the third method of calc-dirty-rate.
Implement dirty-bitmap dirtyrate calculation, which can be used
to measure the dirtyrate in the absence of dirty-ring.

Introduce the "dirty_bitmap:-b" option in hmp calc_dirty_rate to
indicate that the dirty bitmap method should be used for calculation.
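
As a rough illustration of the arithmetic involved, the sketch below turns two snapshots of a dirty page counter into a rate in MiB/s. The struct, the function name and the fixed 4 KiB page size are assumptions made for the example; the patch itself reuses do_calculate_dirtyrate_vcpu().

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for the example */

/* Hypothetical record of the dirty page counter at both ends. */
struct page_record {
    uint64_t start_pages;
    uint64_t end_pages;
};

/* Pages dirtied during the sample period, expressed in MiB per second. */
static uint64_t dirty_rate_mib_per_s(struct page_record rec, uint64_t seconds)
{
    uint64_t dirtied, mib;

    assert(seconds > 0 && rec.end_pages >= rec.start_pages);

    dirtied = rec.end_pages - rec.start_pages;
    mib = (dirtied * SKETCH_PAGE_SIZE) >> 20;
    return mib / seconds;
}
```

With these assumptions, 51200 newly dirtied pages over 2 seconds correspond to 200 MiB, i.e. 100 MiB/s.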

Signed-off-by: Hyman Huang(黄勇) 
Reviewed-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 qapi/migration.json   |   6 ++-
 migration/dirtyrate.c | 112 ++
 hmp-commands.hx   |   9 ++--
 3 files changed, 112 insertions(+), 15 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index fae4bc608c..87146ceea2 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -1770,13 +1770,15 @@
 #
 # @page-sampling: calculate dirtyrate by sampling pages.
 #
-# @dirty-ring: calculate dirtyrate by via dirty ring.
+# @dirty-ring: calculate dirtyrate by dirty ring.
+#
+# @dirty-bitmap: calculate dirtyrate by dirty bitmap.
 #
 # Since: 6.1
 #
 ##
 { 'enum': 'DirtyRateMeasureMode',
-  'data': ['page-sampling', 'dirty-ring'] }
+  'data': ['page-sampling', 'dirty-ring', 'dirty-bitmap'] }
 
 ##
 # @DirtyRateInfo:
diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c
index 17b3d2cbb5..d65e744af9 100644
--- a/migration/dirtyrate.c
+++ b/migration/dirtyrate.c
@@ -15,6 +15,7 @@
 #include "qapi/error.h"
 #include "cpu.h"
 #include "exec/ramblock.h"
+#include "exec/ram_addr.h"
 #include "qemu/rcu_queue.h"
 #include "qemu/main-loop.h"
 #include "qapi/qapi-commands-migration.h"
@@ -118,6 +119,10 @@ static struct DirtyRateInfo *query_dirty_rate_info(void)
 }
 info->vcpu_dirty_rate = head;
 }
+
+if (dirtyrate_mode == DIRTY_RATE_MEASURE_MODE_DIRTY_BITMAP) {
+info->sample_pages = 0;
+}
 }
 
 trace_query_dirty_rate_info(DirtyRateStatus_str(CalculatingState));
@@ -429,6 +434,79 @@ static int64_t do_calculate_dirtyrate_vcpu(DirtyPageRecord dirty_pages)
 return memory_size_MB / time_s;
 }
 
+static inline void record_dirtypages_bitmap(DirtyPageRecord *dirty_pages,
+bool start)
+{
+if (start) {
+dirty_pages->start_pages = total_dirty_pages;
+} else {
+dirty_pages->end_pages = total_dirty_pages;
+}
+}
+
+static void do_calculate_dirtyrate_bitmap(DirtyPageRecord dirty_pages)
+{
+DirtyStat.dirty_rate = do_calculate_dirtyrate_vcpu(dirty_pages);
+}
+
+static inline void dirtyrate_manual_reset_protect(void)
+{
+RAMBlock *block = NULL;
+
+WITH_RCU_READ_LOCK_GUARD() {
+RAMBLOCK_FOREACH_MIGRATABLE(block) {
+memory_region_clear_dirty_bitmap(block->mr, 0,
+ block->used_length);
+}
+}
+}
+
+static void calculate_dirtyrate_dirty_bitmap(struct DirtyRateConfig config)
+{
+int64_t msec = 0;
+int64_t start_time;
+DirtyPageRecord dirty_pages;
+
+qemu_mutex_lock_iothread();
+memory_global_dirty_log_start(GLOBAL_DIRTY_DIRTY_RATE);
+
+/*
+ * 1'round of log sync may return all 1 bits with
+ * KVM_DIRTY_LOG_INITIALLY_SET enable
+ * skip it unconditionally and start dirty tracking
+ * from 2'round of log sync
+ */
+memory_global_dirty_log_sync();
+
+/*
+ * reset page protect manually and unconditionally.
+ * this make sure kvm dirty log be cleared if
+ * KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE cap is enabled.
+ */
+dirtyrate_manual_reset_protect();
+qemu_mutex_unlock_iothread();
+
+record_dirtypages_bitmap(&dirty_pages, true);
+
+start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+DirtyStat.start_time = start_time / 1000;
+
+msec = config.sample_period_seconds * 1000;
+msec = set_sample_page_period(msec, start_time);
+DirtyStat.calc_time = msec / 1000;
+
+/*
+ * dirtyrate_global_dirty_log_stop do two things.
+ * 1. fetch dirty bitmap from kvm
+ * 2. stop dirty tracking
+ */
+dirtyrate_global_dirty_log_stop();
+
+record_dirtypages_bitmap(&dirty_pages, false);
+
+do_calculate_dirtyrate_bitmap(dirty_pages);
+}
+
 static void calculate_dirtyrate_dirty_ring(struct DirtyRateConfig config)
 {
 CPUState *cpu;
@@ -514,7 +592,9 @@ out:
 
 static void calculate_dirtyrate(struct DirtyRateConfig config)
 {
-if (config.mode == DIRTY_RATE_MEASURE_MODE_DIRTY_RING) {
+if (config.mode == DIRTY_RATE_MEASURE_MODE_DIRTY_BITMAP) {
+calculate_dirtyrate_dirty_bitmap(config);
+} else if (config.mode == DIRTY_RATE_MEASURE_MODE_DIRTY_RING) {
 calculate_dirtyrate_dirty_ring(config);
 } else {
 calculate_dirtyrate_sample_vm(config);
@@ -597,12 +677,15 @@ void qmp_calc_dirty_rate(int64_t calc_time,
 
 /*
  * dirty ring mode only works when kvm dirty ring is enabled.
+ * on the contrary, dirty bitmap mode is not.
  */
-if ((mode == DIRTY_RATE_MEASURE_MODE_DIRTY_RING)

Re: [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots

2021-11-01 Thread Michael S. Tsirkin

On Wed, Oct 27, 2021 at 02:45:19PM +0200, David Hildenbrand wrote:
> This is the follow-up of [1], dropping auto-detection and vhost-user
> changes from the initial RFC.
> 
> Based-on: 20211011175346.15499-1-da...@redhat.com
> 
> A virtio-mem device is represented by a single large RAM memory region
> backed by a single large mmap.
> 
> Right now, we map that complete memory region into guest physical address
> space, resulting in a very large memory mapping, KVM memory slot, ...
> although only a small amount of memory might actually be exposed to the VM.
> 
> For example, when starting a VM with a 1 TiB virtio-mem device that only
> exposes little device memory (e.g., 1 GiB) towards the VM initially,
> in order to hotplug more memory later, we waste a lot of memory on metadata
> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
> optimizations in KVM are being worked on to reduce this metadata overhead
> on x86-64 in some cases, it remains a problem with nested VMs and there are
> other reasons why we would want to reduce the total memory slot to a
> reasonable minimum.
> 
> We want to:
> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
>inside QEMU KVM code where possible.
> b) Not always expose all device-memory to the VM, to reduce the attack
>surface of malicious VMs without using userfaultfd.

I'm confused by the mention of these security considerations,
and I expect users will be just as confused.
So let's say a user does not want to be exposed. What value
should be used for the option? What if a lower value is used?
Is there still some security advantage?

> So instead, expose the RAM memory region not by a single large mapping
> (consuming one memslot) but instead by multiple mappings, each consuming
> one memslot. To do that, we divide the RAM memory region via aliases into
> separate parts and only map the aliases into a device container we actually
> need. We have to make sure that QEMU won't silently merge the memory
> sections corresponding to the aliases (and thereby also memslots),
> otherwise we lose atomic updates with KVM and vhost-user, which we deeply
> care about when adding/removing memory. Further, to get memslot accounting
> right, such merging is better avoided.
> 
> Within the memslots, virtio-mem can (un)plug memory in smaller granularity
> dynamically. So memslots are a pure optimization to tackle a) and b) above.
> 
> The user configures how many memslots a virtio-mem device should use, the
> default is "1" -- essentially corresponding to the old behavior.
> 
> Memslots are right now mapped once they fall into the usable device region
> (which grows/shrinks on demand right now either when requesting to
>  hotplug more memory or during/after reboots). In the future, with
> VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we'll be able to (un)map aliases even
> more dynamically when (un)plugging device blocks.
> 
> 
> Adding a 500GiB virtio-mem device with "memslots=500" and not hotplugging
> any memory results in:
> 00014000-01047fff (prio 0, i/o): device-memory
>   00014000-007e3fff (prio 0, i/o): virtio-mem-memslots
> 
> Requesting the VM to consume 2 GiB results in (note: the usable region size
> is bigger than 2 GiB, so 3 * 1 GiB memslots are required):
> 00014000-01047fff (prio 0, i/o): device-memory
>   00014000-007e3fff (prio 0, i/o): virtio-mem-memslots
> 00014000-00017fff (prio 0, ram): alias 
> virtio-mem-memslot-0 @mem0 -3fff
> 00018000-0001bfff (prio 0, ram): alias 
> virtio-mem-memslot-1 @mem0 4000-7fff
> 0001c000-0001 (prio 0, ram): alias 
> virtio-mem-memslot-2 @mem0 8000-bfff
> 
> Requesting the VM to consume 20 GiB results in:
> 00014000-01047fff (prio 0, i/o): device-memory
>   00014000-007e3fff (prio 0, i/o): virtio-mem-memslots
> 00014000-00017fff (prio 0, ram): alias 
> virtio-mem-memslot-0 @mem0 -3fff
> 00018000-0001bfff (prio 0, ram): alias 
> virtio-mem-memslot-1 @mem0 4000-7fff
> 0001c000-0001 (prio 0, ram): alias 
> virtio-mem-memslot-2 @mem0 8000-bfff
> 0002-00023fff (prio 0, ram): alias 
> virtio-mem-memslot-3 @mem0 c000-
> 00024000-00027fff (prio 0, ram): alias 
> virtio-mem-memslot-4 @mem0 0001-00013fff
> 00028000-0002bfff (prio 0, ram): alias 
> virtio-mem-memslot-5 @mem0 00014000-00017fff
> 0002c000-0002 (prio 0, ram): alias 
> virtio-mem-memslot-6 @mem0 00018000-0001bfff
>
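
The memslot arithmetic in the quoted example can be sketched as below. The helper names are hypothetical, and the sketch assumes the device region divides evenly into equally sized memslots, which is a simplification for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/* Size of one memslot when the region is split evenly (assumption). */
static uint64_t memslot_size(uint64_t region_size, unsigned nb_memslots)
{
    assert(nb_memslots > 0 && region_size % nb_memslots == 0);
    return region_size / nb_memslots;
}

/* Number of memslots needed to cover the usable region, rounded up. */
static unsigned memslots_needed(uint64_t usable_size, uint64_t slot_size)
{
    assert(slot_size > 0);
    return (unsigned)((usable_size + slot_size - 1) / slot_size);
}
```

For a 500 GiB device with memslots=500, each memslot covers 1 GiB, so a usable region slightly larger than 2 GiB needs three memslots, matching the quoted output.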

[PULL 13/20] migration/ram: Handle RAMBlocks with a RamDiscardManager on the migration source

2021-11-01 Thread Juan Quintela

From: David Hildenbrand 

We don't want to migrate memory that corresponds to discarded ranges as
managed by a RamDiscardManager responsible for the mapped memory region of
the RAMBlock. The content of these pages is essentially stale and
without any guarantees for the VM ("logically unplugged").

Depending on the underlying memory type, even reading memory might populate
memory on the source, resulting in undesired memory consumption. Of
course, on the destination, even writing a zeropage consumes memory,
which we also want to avoid (similar to free page hinting).

Currently, virtio-mem tries to achieve that goal (not migrating "unplugged"
memory that was discarded) by going via qemu_guest_free_page_hint(), but
that approach is hackish and incomplete.

For example, background snapshots still end up reading all memory, as
they don't do bitmap syncs. Postcopy recovery code will re-add
previously cleared bits to the dirty bitmap and migrate them.

Let's consult the RamDiscardManager after setting up our dirty bitmap
initially and when postcopy recovery code reinitializes it: clear
corresponding bits in the dirty bitmaps (e.g., of the RAMBlock and inside
KVM). It's important to fix up the dirty bitmap *after* our initial bitmap
sync, such that the corresponding dirty bits in KVM are actually cleared.

As colo is incompatible with discarding of RAM and inhibits it, we don't
have to bother.

Note: if a misbehaving guest uses discarded ranges after migration has
started, we still migrate that memory; by then, however, that memory has
already been populated on the migration source.
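
The bookkeeping described above (clear the bits of a discarded range and subtract the number of bits that were actually set from the dirty page count) can be sketched like this. It is a simplified stand-alone version with a hypothetical helper name, not the QEMU bitmap API.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Clear bits [start, start + npages) in the bitmap and return how many
 * of them were set before, so the caller can fix up its dirty counter.
 */
static uint64_t clear_range_counting(unsigned long *map, size_t start,
                                     size_t npages)
{
    uint64_t cleared = 0;

    assert(map != NULL);

    for (size_t i = start; i < start + npages; i++) {
        unsigned long mask = 1UL << (i % BITS_PER_WORD);
        unsigned long *word = &map[i / BITS_PER_WORD];

        if (*word & mask) {
            *word &= ~mask;
            cleared++;
        }
    }
    return cleared;
}
```

Counting only bits that were set keeps the dirty page counter correct even if the same range is replayed twice.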

Reviewed-by: Peter Xu 
Signed-off-by: David Hildenbrand 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 77 +
 1 file changed, 77 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index ae2601bf3b..e8c06f207c 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -858,6 +858,60 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
 return ret;
 }
 
+static void dirty_bitmap_clear_section(MemoryRegionSection *section,
+   void *opaque)
+{
+const hwaddr offset = section->offset_within_region;
+const hwaddr size = int128_get64(section->size);
+const unsigned long start = offset >> TARGET_PAGE_BITS;
+const unsigned long npages = size >> TARGET_PAGE_BITS;
+RAMBlock *rb = section->mr->ram_block;
+uint64_t *cleared_bits = opaque;
+
+/*
+ * We don't grab ram_state->bitmap_mutex because we expect to run
+ * only when starting migration or during postcopy recovery where
+ * we don't have concurrent access.
+ */
+if (!migration_in_postcopy() && !migrate_background_snapshot()) {
+migration_clear_memory_region_dirty_bitmap_range(rb, start, npages);
+}
+*cleared_bits += bitmap_count_one_with_offset(rb->bmap, start, npages);
+bitmap_clear(rb->bmap, start, npages);
+}
+
+/*
+ * Exclude all dirty pages from migration that fall into a discarded range as
+ * managed by a RamDiscardManager responsible for the mapped memory region of
+ * the RAMBlock. Clear the corresponding bits in the dirty bitmaps.
+ *
+ * Discarded pages ("logically unplugged") have undefined content and must
+ * not get migrated, because even reading these pages for migration might
+ * result in undesired behavior.
+ *
+ * Returns the number of cleared bits in the RAMBlock dirty bitmap.
+ *
+ * Note: The result is only stable while migrating (precopy/postcopy).
+ */
+static uint64_t ramblock_dirty_bitmap_clear_discarded_pages(RAMBlock *rb)
+{
+uint64_t cleared_bits = 0;
+
+if (rb->mr && rb->bmap && memory_region_has_ram_discard_manager(rb->mr)) {
+RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr);
+MemoryRegionSection section = {
+.mr = rb->mr,
+.offset_within_region = 0,
+.size = int128_make64(qemu_ram_get_used_length(rb)),
+};
+
+ram_discard_manager_replay_discarded(rdm, &section,
+ dirty_bitmap_clear_section,
+ &cleared_bits);
+}
+return cleared_bits;
+}
+
 /* Called with RCU critical section */
 static void ramblock_sync_dirty_bitmap(RAMState *rs, RAMBlock *rb)
 {
@@ -2675,6 +2729,19 @@ static void ram_list_init_bitmaps(void)
 }
 }
 
+static void migration_bitmap_clear_discarded_pages(RAMState *rs)
+{
+unsigned long pages;
+RAMBlock *rb;
+
+RCU_READ_LOCK_GUARD();
+
+RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
+pages = ramblock_dirty_bitmap_clear_discarded_pages(rb);
+rs->migration_dirty_pages -= pages;
+}
+}
+
 static void ram_init_bitmaps(RAMState *rs)
 {
 /* For memory_global_dirty_log_start below.  */
@@ -2691,6 +2758,12 @@ static void ram_init_bitmaps(RAMState *rs)
 }
 qemu_mutex_unlock_ramlist();
 qemu_mutex_unlock_iothread();
+
+/*
+ * After

[PULL 18/20] migration/ram: Handle RAMBlocks with a RamDiscardManager on background snapshots

2021-11-01 Thread Juan Quintela

From: David Hildenbrand 

We already don't ever migrate memory that corresponds to discarded ranges
as managed by a RamDiscardManager responsible for the mapped memory region
of the RAMBlock.

virtio-mem uses this mechanism to logically unplug parts of a RAMBlock.
Right now, we still populate zeropages for the whole usable part of the
RAMBlock, which is undesired because:

1. Even populating the shared zeropage will result in memory getting
   consumed for page tables.
2. Memory backends without a shared zeropage (like hugetlbfs and shmem)
   will populate an actual, fresh page, resulting in an unintended
   memory consumption.

Discarded ("logically unplugged") parts have to remain discarded. As
these pages are never part of the migration stream, there is no need to
track modifications via userfaultfd WP reliably for these parts.

Further, any writes to these ranges by the VM are invalid and the
behavior is undefined.

Note that Linux only supports userfaultfd WP on private anonymous memory
for now, which usually results in the shared zeropage getting populated.
The issue will become more relevant once userfaultfd WP supports shmem
and hugetlb.

Acked-by: Peter Xu 
Signed-off-by: David Hildenbrand 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 38 --
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 92c7b788ae..680a5158aa 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1656,6 +1656,17 @@ static inline void populate_read_range(RAMBlock *block, ram_addr_t offset,
 }
 }
 
+static inline int populate_read_section(MemoryRegionSection *section,
+void *opaque)
+{
+const hwaddr size = int128_get64(section->size);
+hwaddr offset = section->offset_within_region;
+RAMBlock *block = section->mr->ram_block;
+
+populate_read_range(block, offset, size);
+return 0;
+}
+
 /*
  * ram_block_populate_read: preallocate page tables and populate pages in the
  *   RAM block by reading a byte of each page.
@@ -1665,9 +1676,32 @@ static inline void populate_read_range(RAMBlock *block, ram_addr_t offset,
  *
  * @block: RAM block to populate
  */
-static void ram_block_populate_read(RAMBlock *block)
+static void ram_block_populate_read(RAMBlock *rb)
 {
-populate_read_range(block, 0, block->used_length);
+/*
+ * Skip populating all pages that fall into a discarded range as managed by
+ * a RamDiscardManager responsible for the mapped memory region of the
+ * RAMBlock. Such discarded ("logically unplugged") parts of a RAMBlock
+ * must not get populated automatically. We don't have to track
+ * modifications via userfaultfd WP reliably, because these pages will
+ * not be part of the migration stream either way -- see
+ * ramblock_dirty_bitmap_clear_discarded_pages().
+ *
+ * Note: The result is only stable while migrating (precopy/postcopy).
+ */
+if (rb->mr && memory_region_has_ram_discard_manager(rb->mr)) {
+RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr);
+MemoryRegionSection section = {
+.mr = rb->mr,
+.offset_within_region = 0,
+.size = rb->mr->size,
+};
+
+ram_discard_manager_replay_populated(rdm, &section,
+ populate_read_section, NULL);
+} else {
+populate_read_range(rb, 0, rb->used_length);
+}
 }
 
 /*
-- 
2.33.1

Re: gitlab-ci: clang-user job failed with run-tcg-tests-sh4-linux-user

2021-11-01 Thread Philippe Mathieu-Daudé

On 11/1/21 19:00, Richard Henderson wrote:
> On 11/1/21 6:27 AM, Philippe Mathieu-Daudé wrote:
>> Build failed running the 'clang-user' job:
>>
>>    TEST    linux-test on sh4
>> ../linux-user/syscall.c:10373:34: runtime error: member access within
>> misaligned address 0x0048af34 for type 'struct linux_dirent64',
>> which requires 8 byte alignment
>> 0x0048af34: note: pointer points here
>>    00 00 00 00 00 40 0c 00  00 00 00 00 7b e2 f5 de  fc d8 a1 3a 20 00 0a
>> 66  69 6c 65 33 00 00 00 00
>>    ^
>> make[2]: *** [../Makefile.target:158: run-linux-test] Error 1
>> make[1]: *** [/builds/philmd/qemu/tests/tcg/Makefile.qemu:102:
>> run-guest-tests] Error 2
>> make: *** [/builds/philmd/qemu/tests/Makefile.include:63:
>> run-tcg-tests-sh4-linux-user] Error 2
>>
>> https://gitlab.com/philmd/qemu/-/jobs/1733066358
> 
> Interesting.  It's being skipped on master.  Also, this must have some
> sort of sanitizer enabled to get that warning?

Oh good point, I'm including "tests/tcg: Fix some targets default
cross compiler path" which re-enables alpha/mips/riscv64/sh4:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg846281.html

[PULL 19/20] memory: introduce total_dirty_pages to stat dirty pages

2021-11-01 Thread Juan Quintela

From: Hyman Huang(黄勇) 

Introduce the global variable total_dirty_pages to count dirty pages
along with memory_global_dirty_log_sync.
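
The counting itself boils down to a population count over the words of the synced bitmap. A portable sketch follows; the helper names are hypothetical, and the patch uses QEMU's own ctpopl() instead of the loop shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable population count: clears the lowest set bit per iteration. */
static unsigned popcount_ul(unsigned long w)
{
    unsigned n = 0;

    while (w) {
        w &= w - 1;
        n++;
    }
    return n;
}

/* Add the number of set bits in bitmap[0..len) to a running total. */
static uint64_t count_dirty(const unsigned long *bitmap, size_t len,
                            uint64_t total)
{
    assert(bitmap != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        total += popcount_ul(bitmap[i]);
    }
    return total;
}
```

Sampling such a running total before and after a measurement window gives the number of pages dirtied in between.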

Signed-off-by: Hyman Huang(黄勇) 
Reviewed-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 include/exec/ram_addr.h | 9 +
 migration/dirtyrate.c   | 7 +++
 2 files changed, 16 insertions(+)

diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 45c913264a..64fb936c7c 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -26,6 +26,8 @@
 #include "exec/ramlist.h"
 #include "exec/ramblock.h"
 
+extern uint64_t total_dirty_pages;
+
 /**
  * clear_bmap_size: calculate clear bitmap size
  *
@@ -373,6 +375,10 @@ static inline void cpu_physical_memory_set_dirty_lebitmap(unsigned long *bitmap,
 qatomic_or(
 &blocks[DIRTY_MEMORY_MIGRATION][idx][offset],
 temp);
+if (unlikely(
+global_dirty_tracking & GLOBAL_DIRTY_DIRTY_RATE)) {
+total_dirty_pages += ctpopl(temp);
+}
 }
 
 if (tcg_enabled()) {
@@ -403,6 +409,9 @@ static inline void cpu_physical_memory_set_dirty_lebitmap(unsigned long *bitmap,
 for (i = 0; i < len; i++) {
 if (bitmap[i] != 0) {
 c = leul_to_cpu(bitmap[i]);
+if (unlikely(global_dirty_tracking & GLOBAL_DIRTY_DIRTY_RATE)) {
+total_dirty_pages += ctpopl(c);
+}
 do {
 j = ctzl(c);
 c &= ~(1ul << j);
diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c
index f92c4b498e..17b3d2cbb5 100644
--- a/migration/dirtyrate.c
+++ b/migration/dirtyrate.c
@@ -28,6 +28,13 @@
 #include "sysemu/runstate.h"
 #include "exec/memory.h"
 
+/*
+ * total_dirty_pages is protected by BQL and is used
+ * to stat dirty pages during the period of two
+ * memory_global_dirty_log_sync
+ */
+uint64_t total_dirty_pages;
+
 typedef struct DirtyPageRecord {
 uint64_t start_pages;
 uint64_t end_pages;
-- 
2.33.1

[PULL 11/20] memory: Introduce replay_discarded callback for RamDiscardManager

2021-11-01 Thread Juan Quintela

From: David Hildenbrand 

Introduce replay_discarded callback similar to our existing
replay_populated callback, to be used by migration code to never migrate
discarded memory.

Acked-by: Peter Xu 
Signed-off-by: David Hildenbrand 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 include/exec/memory.h | 21 +
 softmmu/memory.c  | 11 +++
 2 files changed, 32 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 04280450c9..20f1b27377 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -550,6 +550,7 @@ static inline void ram_discard_listener_init(RamDiscardListener *rdl,
 }
 
 typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, void *opaque);
+typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, void *opaque);
 
 /*
  * RamDiscardManagerClass:
@@ -638,6 +639,21 @@ struct RamDiscardManagerClass {
 MemoryRegionSection *section,
 ReplayRamPopulate replay_fn, void *opaque);
 
+/**
+ * @replay_discarded:
+ *
+ * Call the #ReplayRamDiscard callback for all discarded parts within the
+ * #MemoryRegionSection via the #RamDiscardManager.
+ *
+ * @rdm: the #RamDiscardManager
+ * @section: the #MemoryRegionSection
+ * @replay_fn: the #ReplayRamDiscard callback
+ * @opaque: pointer to forward to the callback
+ */
+void (*replay_discarded)(const RamDiscardManager *rdm,
+ MemoryRegionSection *section,
+ ReplayRamDiscard replay_fn, void *opaque);
+
 /**
  * @register_listener:
  *
@@ -682,6 +698,11 @@ int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
  ReplayRamPopulate replay_fn,
  void *opaque);
 
+void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
+  MemoryRegionSection *section,
+  ReplayRamDiscard replay_fn,
+  void *opaque);
+
 void ram_discard_manager_register_listener(RamDiscardManager *rdm,
RamDiscardListener *rdl,
MemoryRegionSection *section);
diff --git a/softmmu/memory.c b/softmmu/memory.c
index f2ac0d2e89..7340e19ff5 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -2081,6 +2081,17 @@ int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
 return rdmc->replay_populated(rdm, section, replay_fn, opaque);
 }
 
+void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
+  MemoryRegionSection *section,
+  ReplayRamDiscard replay_fn,
+  void *opaque)
+{
+RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm);
+
+g_assert(rdmc->replay_discarded);
+rdmc->replay_discarded(rdm, section, replay_fn, opaque);
+}
+
 void ram_discard_manager_register_listener(RamDiscardManager *rdm,
RamDiscardListener *rdl,
MemoryRegionSection *section)
-- 
2.33.1

[PULL 17/20] migration/ram: Factor out populating pages readable in ram_block_populate_pages()

2021-11-01 Thread Juan Quintela

From: David Hildenbrand 

Let's factor out prefaulting/populating to make further changes easier to
review and add a comment on what we actually expect to happen. While at
it, use the actual page size of the ramblock, which defaults to
qemu_real_host_page_size for anonymous memory. Further, rename
ram_block_populate_pages() to ram_block_populate_read() as well, to make
it clearer what we are doing.

In the future, we might want to use MADV_POPULATE_READ to speed up
population.
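
The populate-by-reading technique can be sketched stand-alone as follows. The function name and the returned page count are additions for illustration; unlike the patch, the sketch uses a volatile read rather than an empty asm statement to keep the compiler from dropping the access.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Read one byte from each page in [base, base + length) so that page
 * tables get allocated and each page is populated for reading.
 * Returns the number of pages touched.
 */
static size_t touch_pages(const char *base, size_t length, size_t page_size)
{
    size_t touched = 0;

    assert(base != NULL && page_size > 0);

    for (size_t off = 0; off < length; off += page_size) {
        /* The volatile qualifier forces the load to really happen. */
        volatile const char *p = base + off;
        char tmp = *p;

        (void)tmp;
        touched++;
    }
    return touched;
}
```

Note that the loop bound is the end of the range (start plus length), which matters once the helper is called with a non-zero start.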

Reviewed-by: Peter Xu 
Signed-off-by: David Hildenbrand 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 35 ++-
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 54df5dc0fc..92c7b788ae 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1639,26 +1639,35 @@ out:
 return ret;
 }
 
+static inline void populate_read_range(RAMBlock *block, ram_addr_t offset,
+   ram_addr_t size)
+{
+/*
+ * We read one byte of each page; this will preallocate page tables if
+ * required and populate the shared zeropage on MAP_PRIVATE anonymous memory
+ * where no page was populated yet. This might require adaption when
+ * supporting other mappings, like shmem.
+ */
+for (; offset < size; offset += block->page_size) {
+char tmp = *((char *)block->host + offset);
+
+/* Don't optimize the read out */
+asm volatile("" : "+r" (tmp));
+}
+}
+
 /*
- * ram_block_populate_pages: populate memory in the RAM block by reading
- *   an integer from the beginning of each page.
+ * ram_block_populate_read: preallocate page tables and populate pages in the
+ *   RAM block by reading a byte of each page.
  *
  * Since it's solely used for userfault_fd WP feature, here we just
  *   hardcode page size to qemu_real_host_page_size.
  *
  * @block: RAM block to populate
  */
-static void ram_block_populate_pages(RAMBlock *block)
+static void ram_block_populate_read(RAMBlock *block)
 {
-char *ptr = (char *) block->host;
-
-for (ram_addr_t offset = 0; offset < block->used_length;
-offset += qemu_real_host_page_size) {
-char tmp = *(ptr + offset);
-
-/* Don't optimize the read out */
-asm volatile("" : "+r" (tmp));
-}
+populate_read_range(block, 0, block->used_length);
 }
 
 /*
@@ -1684,7 +1693,7 @@ void ram_write_tracking_prepare(void)
  * UFFDIO_WRITEPROTECT_MODE_WP mode setting would silently skip
  * pages with pte_none() entries in page table.
  */
-ram_block_populate_pages(block);
+ram_block_populate_read(block);
 }
 }
 
-- 
2.33.1
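
The read-populate loop from the patch above can be tried outside QEMU. What follows is only a minimal sketch under stated assumptions: populate_read_sketch() and populate_demo() are hypothetical names, the buffer comes from calloc() rather than a RAMBlock, and a volatile read stands in for the asm volatile barrier the patch uses. The sketch also computes an explicit end of the range; the loop in the patch compares offset against size, which is equivalent only when offset is 0, as it is in ram_block_populate_read().

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Touch one byte in every page of [offset, offset + size) so that each
 * page has to be backed by something. Returns the number of pages read.
 * All names and the return value are illustrative, not QEMU API.
 */
static size_t populate_read_sketch(const char *base, size_t offset,
                                   size_t size, size_t page_size)
{
    const size_t end = offset + size;   /* explicit end of the range */
    size_t touched = 0;

    for (; offset < end; offset += page_size) {
        /* volatile read: the compiler may not drop this access */
        (void)*(const volatile char *)(base + offset);
        touched++;
    }
    return touched;
}

/* Hypothetical helper: populate a zeroed buffer of `pages` pages. */
static size_t populate_demo(size_t pages, size_t page_size)
{
    char *buf = calloc(pages, page_size);
    size_t n;

    if (!buf) {
        return 0;
    }
    n = populate_read_sketch(buf, 0, pages * page_size, page_size);
    free(buf);
    return n;
}
```

Calling populate_demo(4, 4096) reads one byte from each of four pages; a real caller would pass the page size of the mapping instead of a constant.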

[PULL 12/20] virtio-mem: Implement replay_discarded RamDiscardManager callback

2021-11-01 Thread Juan Quintela

From: David Hildenbrand 

Implement it similarly to the replay_populated callback.

Acked-by: Peter Xu 
Signed-off-by: David Hildenbrand 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 hw/virtio/virtio-mem.c | 58 ++
 1 file changed, 58 insertions(+)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index df91e454b2..284096ec5f 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -228,6 +228,38 @@ static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
 return ret;
 }
 
+static int virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem,
+ MemoryRegionSection *s,
+ void *arg,
+ virtio_mem_section_cb cb)
+{
+unsigned long first_bit, last_bit;
+uint64_t offset, size;
+int ret = 0;
+
+first_bit = s->offset_within_region / vmem->bitmap_size;
+first_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size, first_bit);
+while (first_bit < vmem->bitmap_size) {
+MemoryRegionSection tmp = *s;
+
+offset = first_bit * vmem->block_size;
+last_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size,
+ first_bit + 1) - 1;
+size = (last_bit - first_bit + 1) * vmem->block_size;
+
+if (!virito_mem_intersect_memory_section(&tmp, offset, size)) {
+break;
+}
+ret = cb(&tmp, arg);
+if (ret) {
+break;
+}
+first_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size,
+   last_bit + 2);
+}
+return ret;
+}
+
 static int virtio_mem_notify_populate_cb(MemoryRegionSection *s, void *arg)
 {
 RamDiscardListener *rdl = arg;
@@ -1170,6 +1202,31 @@ static int virtio_mem_rdm_replay_populated(const RamDiscardManager *rdm,
 virtio_mem_rdm_replay_populated_cb);
 }
 
+static int virtio_mem_rdm_replay_discarded_cb(MemoryRegionSection *s,
+  void *arg)
+{
+struct VirtIOMEMReplayData *data = arg;
+
+((ReplayRamDiscard)data->fn)(s, data->opaque);
+return 0;
+}
+
+static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
+MemoryRegionSection *s,
+ReplayRamDiscard replay_fn,
+void *opaque)
+{
+const VirtIOMEM *vmem = VIRTIO_MEM(rdm);
+struct VirtIOMEMReplayData data = {
+.fn = replay_fn,
+.opaque = opaque,
+};
+
+g_assert(s->mr == &vmem->memdev->mr);
+virtio_mem_for_each_unplugged_section(vmem, s, &data,
+  virtio_mem_rdm_replay_discarded_cb);
+}
+
 static void virtio_mem_rdm_register_listener(RamDiscardManager *rdm,
  RamDiscardListener *rdl,
  MemoryRegionSection *s)
@@ -1234,6 +1291,7 @@ static void virtio_mem_class_init(ObjectClass *klass, void *data)
 rdmc->get_min_granularity = virtio_mem_rdm_get_min_granularity;
 rdmc->is_populated = virtio_mem_rdm_is_populated;
 rdmc->replay_populated = virtio_mem_rdm_replay_populated;
+rdmc->replay_discarded = virtio_mem_rdm_replay_discarded;
 rdmc->register_listener = virtio_mem_rdm_register_listener;
 rdmc->unregister_listener = virtio_mem_rdm_unregister_listener;
 }
-- 
2.33.1
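
The replay_discarded callback above boils down to walking runs of clear bits in the plug bitmap and reporting each run as one range. A minimal sketch of that walk, assuming a plain bool array instead of QEMU's bitmap helpers; for_each_unplugged_range(), range_cb, struct range_log and log_range() are hypothetical names that do not exist in virtio-mem.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Callback invoked once per contiguous unplugged range. */
typedef int (*range_cb)(size_t offset, size_t size, void *opaque);

static int for_each_unplugged_range(const bool *plugged, size_t nbits,
                                    size_t block_size, range_cb cb,
                                    void *opaque)
{
    size_t first = 0;
    int ret = 0;

    while (first < nbits) {
        size_t last;

        if (plugged[first]) {           /* skip plugged blocks */
            first++;
            continue;
        }
        last = first;
        while (last + 1 < nbits && !plugged[last + 1]) {
            last++;                     /* extend the unplugged run */
        }
        ret = cb(first * block_size, (last - first + 1) * block_size, opaque);
        if (ret) {
            break;                      /* callback asked to stop */
        }
        first = last + 2;   /* bit last + 1 is plugged or past the end */
    }
    return ret;
}

/* Hypothetical consumer: count the ranges and sum their sizes. */
struct range_log {
    size_t count;
    size_t total;
};

static int log_range(size_t offset, size_t size, void *opaque)
{
    struct range_log *log = opaque;

    (void)offset;
    log->count++;
    log->total += size;
    return 0;
}
```

Resuming the search at last + 2 mirrors the patch: the bit right after a run is known to be set (or out of range), so it does not need to be tested again.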

[PULL 06/20] migration/dirtyrate: move init step of calculation to main thread

2021-11-01 Thread Juan Quintela

From: Hyman Huang(黄勇) 

Since the main thread may "query dirty rate" at any time, it's better
to move the init step into the main thread so that the synchronization
overhead between "main" and "get_dirtyrate" can be reduced.

Signed-off-by: Hyman Huang(黄勇) 
Message-Id: 
<109f8077518ed2f13068e3bfb10e625e964780f1.1624040308.git.huang...@chinatelecom.cn>
Reviewed-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/dirtyrate.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c
index a9bdd60034..b8f61cc650 100644
--- a/migration/dirtyrate.c
+++ b/migration/dirtyrate.c
@@ -380,7 +380,6 @@ void *get_dirtyrate_thread(void *arg)
 {
 struct DirtyRateConfig config = *(struct DirtyRateConfig *)arg;
 int ret;
-int64_t start_time;
 rcu_register_thread();
 
 ret = dirtyrate_set_state(&CalculatingState, DIRTY_RATE_STATUS_UNSTARTED,
@@ -390,9 +389,6 @@ void *get_dirtyrate_thread(void *arg)
 return NULL;
 }
 
-start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) / 1000;
-init_dirtyrate_stat(start_time, config);
-
 calculate_dirtyrate(config);
 
 ret = dirtyrate_set_state(&CalculatingState, DIRTY_RATE_STATUS_MEASURING,
@@ -411,6 +407,7 @@ void qmp_calc_dirty_rate(int64_t calc_time, bool has_sample_pages,
 static struct DirtyRateConfig config;
 QemuThread thread;
 int ret;
+int64_t start_time;
 
 /*
  * If the dirty rate is already being measured, don't attempt to start.
@@ -451,6 +448,10 @@ void qmp_calc_dirty_rate(int64_t calc_time, bool has_sample_pages,
 config.sample_period_seconds = calc_time;
 config.sample_pages_per_gigabytes = sample_pages;
 config.mode = DIRTY_RATE_MEASURE_MODE_PAGE_SAMPLING;
+
+start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) / 1000;
+init_dirtyrate_stat(start_time, config);
+
 qemu_thread_create(&thread, "get_dirtyrate", get_dirtyrate_thread,
 (void *)&config, QEMU_THREAD_DETACHED);
 }
-- 
2.33.1

[PULL 14/20] virtio-mem: Drop precopy notifier

2021-11-01 Thread Juan Quintela

From: David Hildenbrand 

Migration code now properly handles RAMBlocks which are indirectly managed
by a RamDiscardManager. No need for manual handling via the free page
optimization interface, let's get rid of it.

Acked-by: Michael S. Tsirkin 
Acked-by: Peter Xu 
Signed-off-by: David Hildenbrand 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 include/hw/virtio/virtio-mem.h |  3 ---
 hw/virtio/virtio-mem.c | 34 --
 2 files changed, 37 deletions(-)

diff --git a/include/hw/virtio/virtio-mem.h b/include/hw/virtio/virtio-mem.h
index 9a6e348fa2..a5dd6a493b 100644
--- a/include/hw/virtio/virtio-mem.h
+++ b/include/hw/virtio/virtio-mem.h
@@ -65,9 +65,6 @@ struct VirtIOMEM {
 /* notifiers to notify when "size" changes */
 NotifierList size_change_notifiers;
 
-/* don't migrate unplugged memory */
-NotifierWithReturn precopy_notifier;
-
 /* listeners to notify on plug/unplug activity. */
 QLIST_HEAD(, RamDiscardListener) rdl_list;
 };
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 284096ec5f..d5a578142b 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -776,7 +776,6 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
 host_memory_backend_set_mapped(vmem->memdev, true);
 vmstate_register_ram(&vmem->memdev->mr, DEVICE(vmem));
 qemu_register_reset(virtio_mem_system_reset, vmem);
-precopy_add_notifier(&vmem->precopy_notifier);
 
 /*
  * Set ourselves as RamDiscardManager before the plug handler maps the
@@ -796,7 +795,6 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
  * found via an address space anymore. Unset ourselves.
  */
 memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
-precopy_remove_notifier(&vmem->precopy_notifier);
 qemu_unregister_reset(virtio_mem_system_reset, vmem);
 vmstate_unregister_ram(&vmem->memdev->mr, DEVICE(vmem));
 host_memory_backend_set_mapped(vmem->memdev, false);
@@ -1089,43 +1087,11 @@ static void virtio_mem_set_block_size(Object *obj, Visitor *v, const char *name,
 vmem->block_size = value;
 }
 
-static int virtio_mem_precopy_exclude_range_cb(const VirtIOMEM *vmem, void *arg,
-   uint64_t offset, uint64_t size)
-{
-void * const host = qemu_ram_get_host_addr(vmem->memdev->mr.ram_block);
-
-qemu_guest_free_page_hint(host + offset, size);
-return 0;
-}
-
-static void virtio_mem_precopy_exclude_unplugged(VirtIOMEM *vmem)
-{
-virtio_mem_for_each_unplugged_range(vmem, NULL,
-virtio_mem_precopy_exclude_range_cb);
-}
-
-static int virtio_mem_precopy_notify(NotifierWithReturn *n, void *data)
-{
-VirtIOMEM *vmem = container_of(n, VirtIOMEM, precopy_notifier);
-PrecopyNotifyData *pnd = data;
-
-switch (pnd->reason) {
-case PRECOPY_NOTIFY_AFTER_BITMAP_SYNC:
-virtio_mem_precopy_exclude_unplugged(vmem);
-break;
-default:
-break;
-}
-
-return 0;
-}
-
 static void virtio_mem_instance_init(Object *obj)
 {
 VirtIOMEM *vmem = VIRTIO_MEM(obj);
 
 notifier_list_init(&vmem->size_change_notifiers);
-vmem->precopy_notifier.notify = virtio_mem_precopy_notify;
 QLIST_INIT(&vmem->rdl_list);
 
 object_property_add(obj, VIRTIO_MEM_SIZE_PROP, "size", virtio_mem_get_size,
-- 
2.33.1

[PULL 07/20] migration/dirtyrate: implement dirty-ring dirtyrate calculation

2021-11-01 Thread Juan Quintela

From: Hyman Huang(黄勇) 

Use the dirty ring feature to implement dirtyrate calculation.

Introduce a mode option in qmp calc_dirty_rate to specify which
method should be used when calculating the dirtyrate; either
page-sampling or dirty-ring should be passed.

Introduce the "dirty_ring:-r" option in hmp calc_dirty_rate to
indicate that the dirty ring method should be used for calculation.

Signed-off-by: Hyman Huang(黄勇) 
Message-Id: 
<7db445109bd18125ce8ec86816d14f6ab5de6a7d.1624040308.git.huang...@chinatelecom.cn>
Reviewed-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 qapi/migration.json|  16 +++-
 migration/dirtyrate.c  | 208 +++--
 hmp-commands.hx|   7 +-
 migration/trace-events |   2 +
 4 files changed, 218 insertions(+), 15 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index 94eece16e1..fae4bc608c 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -1796,6 +1796,12 @@
 # @sample-pages: page count per GB for sample dirty pages
 #the default value is 512 (since 6.1)
 #
+# @mode: mode containing method of calculate dirtyrate includes
+#'page-sampling' and 'dirty-ring' (Since 6.1)
+#
+# @vcpu-dirty-rate: dirtyrate for each vcpu if dirty-ring
+#   mode specified (Since 6.1)
+#
 # Since: 5.2
 #
 ##
@@ -1804,7 +1810,9 @@
'status': 'DirtyRateStatus',
'start-time': 'int64',
'calc-time': 'int64',
-   'sample-pages': 'uint64'} }
+   'sample-pages': 'uint64',
+   'mode': 'DirtyRateMeasureMode',
+   '*vcpu-dirty-rate': [ 'DirtyRateVcpu' ] } }
 
 ##
 # @calc-dirty-rate:
@@ -1816,6 +1824,9 @@
 # @sample-pages: page count per GB for sample dirty pages
 #the default value is 512 (since 6.1)
 #
+# @mode: mechanism of calculating dirtyrate includes
+#'page-sampling' and 'dirty-ring' (Since 6.1)
+#
 # Since: 5.2
 #
 # Example:
@@ -1824,7 +1835,8 @@
 #
 ##
 { 'command': 'calc-dirty-rate', 'data': {'calc-time': 'int64',
- '*sample-pages': 'int'} }
+ '*sample-pages': 'int',
+ '*mode': 'DirtyRateMeasureMode'} }
 
 ##
 # @query-dirty-rate:
diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c
index b8f61cc650..f92c4b498e 100644
--- a/migration/dirtyrate.c
+++ b/migration/dirtyrate.c
@@ -16,6 +16,7 @@
 #include "cpu.h"
 #include "exec/ramblock.h"
 #include "qemu/rcu_queue.h"
+#include "qemu/main-loop.h"
 #include "qapi/qapi-commands-migration.h"
 #include "ram.h"
 #include "trace.h"
@@ -23,9 +24,19 @@
 #include "monitor/hmp.h"
 #include "monitor/monitor.h"
 #include "qapi/qmp/qdict.h"
+#include "sysemu/kvm.h"
+#include "sysemu/runstate.h"
+#include "exec/memory.h"
+
+typedef struct DirtyPageRecord {
+uint64_t start_pages;
+uint64_t end_pages;
+} DirtyPageRecord;
 
 static int CalculatingState = DIRTY_RATE_STATUS_UNSTARTED;
 static struct DirtyRateStat DirtyStat;
+static DirtyRateMeasureMode dirtyrate_mode =
+DIRTY_RATE_MEASURE_MODE_PAGE_SAMPLING;
 
 static int64_t set_sample_page_period(int64_t msec, int64_t initial_time)
 {
@@ -70,18 +81,37 @@ static int dirtyrate_set_state(int *state, int old_state, int new_state)
 
 static struct DirtyRateInfo *query_dirty_rate_info(void)
 {
+int i;
 int64_t dirty_rate = DirtyStat.dirty_rate;
 struct DirtyRateInfo *info = g_malloc0(sizeof(DirtyRateInfo));
-
-if (qatomic_read(&CalculatingState) == DIRTY_RATE_STATUS_MEASURED) {
-info->has_dirty_rate = true;
-info->dirty_rate = dirty_rate;
-}
+DirtyRateVcpuList *head = NULL, **tail = &head;
 
 info->status = CalculatingState;
 info->start_time = DirtyStat.start_time;
 info->calc_time = DirtyStat.calc_time;
 info->sample_pages = DirtyStat.sample_pages;
+info->mode = dirtyrate_mode;
+
+if (qatomic_read(&CalculatingState) == DIRTY_RATE_STATUS_MEASURED) {
+info->has_dirty_rate = true;
+info->dirty_rate = dirty_rate;
+
+if (dirtyrate_mode == DIRTY_RATE_MEASURE_MODE_DIRTY_RING) {
+/*
+ * set sample_pages with 0 to indicate page sampling
+ * isn't enabled
+ **/
+info->sample_pages = 0;
+info->has_vcpu_dirty_rate = true;
+for (i = 0; i < DirtyStat.dirty_ring.nvcpu; i++) {
+DirtyRateVcpu *rate = g_malloc0(sizeof(DirtyRateVcpu));
+rate->id = DirtyStat.dirty_ring.rates[i].id;
+rate->dirty_rate = DirtyStat.dirty_ring.rates[i].dirty_rate;
+QAPI_LIST_APPEND(tail, rate);
+}
+info->vcpu_dirty_rate = head;
+}
+}
 
 trace_query_dirty_rate_info(DirtyRateStatus_str(CalculatingState));
 
@@ -111,6 +141,15 @@ static void init_dirtyrate_stat(int64_t start_time,
 }
 }
 
+static void cleanup_dirtyrate_stat(struct DirtyRateConfig config)
+{
+/* last

[PULL 09/20] migration: Add migrate_add_blocker_internal()

2021-11-01 Thread Juan Quintela

From: Peter Xu 

An internal version that removes -only-migratable implications.  It can be used
for temporary migration blockers like dump-guest-memory.

Reviewed-by: Marc-André Lureau 
Reviewed-by: Juan Quintela 
Signed-off-by: Peter Xu 
Signed-off-by: Juan Quintela 
---
 include/migration/blocker.h | 16 
 migration/migration.c   | 21 +
 2 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/include/migration/blocker.h b/include/migration/blocker.h
index acd27018e9..9cebe2ba06 100644
--- a/include/migration/blocker.h
+++ b/include/migration/blocker.h
@@ -25,6 +25,22 @@
  */
 int migrate_add_blocker(Error *reason, Error **errp);
 
+/**
+ * @migrate_add_blocker_internal - prevent migration from proceeding without
+ * only-migrate implications
+ *
+ * @reason - an error to be returned whenever migration is attempted
+ *
+ * @errp - [out] The reason (if any) we cannot block migration right now.
+ *
+ * @returns - 0 on success, -EBUSY on failure, with errp set.
+ *
+ * Some of the migration blockers can be temporary (e.g., for a few seconds),
+ * so it shouldn't need to conflict with "-only-migratable".  For those cases,
+ * we can call this function rather than @migrate_add_blocker().
+ */
+int migrate_add_blocker_internal(Error *reason, Error **errp);
+
 /**
  * @migrate_del_blocker - remove a blocking error from migration
  *
diff --git a/migration/migration.c b/migration/migration.c
index e81e473f5a..e1c0082530 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2049,15 +2049,8 @@ void migrate_init(MigrationState *s)
 s->threshold_size = 0;
 }
 
-int migrate_add_blocker(Error *reason, Error **errp)
+int migrate_add_blocker_internal(Error *reason, Error **errp)
 {
-if (only_migratable) {
-error_propagate_prepend(errp, error_copy(reason),
-"disallowing migration blocker "
-"(--only-migratable) for: ");
-return -EACCES;
-}
-
 /* Snapshots are similar to migrations, so check RUN_STATE_SAVE_VM too. */
 if (runstate_check(RUN_STATE_SAVE_VM) || !migration_is_idle()) {
 error_propagate_prepend(errp, error_copy(reason),
@@ -2070,6 +2063,18 @@ int migrate_add_blocker(Error *reason, Error **errp)
 return 0;
 }
 
+int migrate_add_blocker(Error *reason, Error **errp)
+{
+if (only_migratable) {
+error_propagate_prepend(errp, error_copy(reason),
+"disallowing migration blocker "
+"(--only-migratable) for: ");
+return -EACCES;
+}
+
+return migrate_add_blocker_internal(reason, errp);
+}
+
 void migrate_del_blocker(Error *reason)
 {
 migration_blockers = g_slist_remove(migration_blockers, reason);
-- 
2.33.1
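
The split above leaves a policy check in the public entry point and the state check in the internal one. A minimal sketch of that layering, with stand-in state instead of QEMU globals; struct blocker_state, add_blocker() and add_blocker_internal() are hypothetical names chosen for this example only.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-ins for QEMU state; every name here is hypothetical. */
struct blocker_state {
    bool only_migratable;   /* models the -only-migratable option */
    bool busy;              /* models "migration or snapshot in progress" */
    int nr_blockers;        /* models the length of the blocker list */
};

/* Internal variant: only refuses while a migration/snapshot is running. */
static int add_blocker_internal(struct blocker_state *st)
{
    if (st->busy) {
        return -EBUSY;
    }
    st->nr_blockers++;
    return 0;
}

/* Public variant: applies the -only-migratable policy, then delegates. */
static int add_blocker(struct blocker_state *st)
{
    if (st->only_migratable) {
        return -EACCES;
    }
    return add_blocker_internal(st);
}
```

A short-lived blocker would go through the internal variant so that it still works when the policy flag is set, while long-lived blockers keep using the public one.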

[PULL 10/20] dump-guest-memory: Block live migration

2021-11-01 Thread Juan Quintela

From: Peter Xu 

Both dump-guest-memory and live migration cache vm state at the beginning.
Either of them entering the other one will cause a race on the vm state, and
possibly worse (please refer to the crash report in the bug link).

Let's block live migration in dump-guest-memory, and that'll also block
dump-guest-memory if it detects that a live migration is in progress.

Side note: migrate_del_blocker() can be called even if the blocker is not
inserted yet, so it's safe to unconditionally delete that blocker in
dump_cleanup (g_slist_remove allows no-entry-found case).

Suggested-by: Dr. David Alan Gilbert 
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1996609
Signed-off-by: Peter Xu 
Reviewed-by: Marc-André Lureau 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 dump/dump.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/dump/dump.c b/dump/dump.c
index ab625909f3..662d0a62cd 100644
--- a/dump/dump.c
+++ b/dump/dump.c
@@ -29,6 +29,7 @@
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "hw/misc/vmcoreinfo.h"
+#include "migration/blocker.h"
 
 #ifdef TARGET_X86_64
 #include "win_dump.h"
@@ -47,6 +48,8 @@
 
 #define MAX_GUEST_NOTE_SIZE (1 << 20) /* 1MB should be enough */
 
+static Error *dump_migration_blocker;
+
 #define ELF_NOTE_SIZE(hdr_size, name_size, desc_size)   \
 ((DIV_ROUND_UP((hdr_size), 4) + \
   DIV_ROUND_UP((name_size), 4) +\
@@ -101,6 +104,7 @@ static int dump_cleanup(DumpState *s)
 qemu_mutex_unlock_iothread();
 }
 }
+migrate_del_blocker(dump_migration_blocker);
 
 return 0;
 }
@@ -2005,6 +2009,21 @@ void qmp_dump_guest_memory(bool paging, const char *file,
 return;
 }
 
+if (!dump_migration_blocker) {
+error_setg(&dump_migration_blocker,
+   "Live migration disabled: dump-guest-memory in progress");
+}
+
+/*
+ * Allows even for -only-migratable, but forbid migration during the
+ * process of dump guest memory.
+ */
+if (migrate_add_blocker_internal(dump_migration_blocker, errp)) {
+/* Remember to release the fd before passing it over to dump state */
+close(fd);
+return;
+}
+
 s = &dump_state_global;
 dump_state_prepare(s);
 
-- 
2.33.1

[PULL 04/20] migration/dirtyrate: introduce struct and adjust DirtyRateStat

2021-11-01 Thread Juan Quintela

From: Hyman Huang(黄勇) 

Introduce "DirtyRateMeasureMode" to specify which method should be
used to calculate the dirty rate, and introduce "DirtyRateVcpu" to store
the dirty rate of each vcpu.

Use a union to store the stat data of the specific mode.
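
The per-mode union can be illustrated with a stripped-down stand-in. This is only a sketch: the type and field names below are hypothetical, and storing the mode inside the struct is a choice of the example, not something this patch does.

```c
#include <stddef.h>
#include <stdint.h>

enum measure_mode { MODE_PAGE_SAMPLING, MODE_DIRTY_RING };

struct sample_stat {
    uint64_t dirty_samples;   /* dirty sampled pages */
    uint64_t sample_count;    /* sampled pages in total */
};

struct vcpu_stat {
    int nvcpu;                /* number of vcpus */
    int64_t *rates;           /* one dirty rate per vcpu */
};

struct rate_stat {
    int64_t dirty_rate;
    enum measure_mode mode;   /* selects the valid union member */
    union {                   /* anonymous union, C11 */
        struct sample_stat page_sampling;
        struct vcpu_stat dirty_ring;
    };
};

/* Reset the common fields, then only the member that `mode` selects. */
static void rate_stat_init(struct rate_stat *st, enum measure_mode mode)
{
    st->dirty_rate = -1;
    st->mode = mode;
    switch (mode) {
    case MODE_PAGE_SAMPLING:
        st->page_sampling.dirty_samples = 0;
        st->page_sampling.sample_count = 0;
        break;
    case MODE_DIRTY_RING:
        st->dirty_ring.nvcpu = -1;
        st->dirty_ring.rates = NULL;
        break;
    }
}
```

Only one measurement runs at a time, so the two members never need to be live together; the union makes that explicit and keeps the struct small.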

Signed-off-by: Hyman Huang(黄勇) 
Message-Id: 
<661c98c40f40e163aa58334337af8f3ddf41316a.1624040308.git.huang...@chinatelecom.cn>
Reviewed-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 qapi/migration.json   | 30 +++
 migration/dirtyrate.h | 21 +++
 migration/dirtyrate.c | 48 +--
 3 files changed, 75 insertions(+), 24 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index 9aa8bc5759..94eece16e1 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -1731,6 +1731,21 @@
 { 'event': 'UNPLUG_PRIMARY',
   'data': { 'device-id': 'str' } }
 
+##
+# @DirtyRateVcpu:
+#
+# Dirty rate of vcpu.
+#
+# @id: vcpu index.
+#
+# @dirty-rate: dirty rate.
+#
+# Since: 6.1
+#
+##
+{ 'struct': 'DirtyRateVcpu',
+  'data': { 'id': 'int', 'dirty-rate': 'int64' } }
+
 ##
 # @DirtyRateStatus:
 #
@@ -1748,6 +1763,21 @@
 { 'enum': 'DirtyRateStatus',
   'data': [ 'unstarted', 'measuring', 'measured'] }
 
+##
+# @DirtyRateMeasureMode:
+#
+# An enumeration of mode of measuring dirtyrate.
+#
+# @page-sampling: calculate dirtyrate by sampling pages.
+#
+# @dirty-ring: calculate dirtyrate by via dirty ring.
+#
+# Since: 6.1
+#
+##
+{ 'enum': 'DirtyRateMeasureMode',
+  'data': ['page-sampling', 'dirty-ring'] }
+
 ##
 # @DirtyRateInfo:
 #
diff --git a/migration/dirtyrate.h b/migration/dirtyrate.h
index e1fd29089e..69d4c5b865 100644
--- a/migration/dirtyrate.h
+++ b/migration/dirtyrate.h
@@ -43,6 +43,7 @@
 struct DirtyRateConfig {
 uint64_t sample_pages_per_gigabytes; /* sample pages per GB */
 int64_t sample_period_seconds; /* time duration between two sampling */
+DirtyRateMeasureMode mode; /* mode of dirtyrate measurement */
 };
 
 /*
@@ -58,17 +59,29 @@ struct RamblockDirtyInfo {
 uint32_t *hash_result; /* array of hash result for sampled pages */
 };
 
-/*
- * Store calculation statistics for each measure.
- */
-struct DirtyRateStat {
+typedef struct SampleVMStat {
 uint64_t total_dirty_samples; /* total dirty sampled page */
 uint64_t total_sample_count; /* total sampled pages */
 uint64_t total_block_mem_MB; /* size of total sampled pages in MB */
+} SampleVMStat;
+
+typedef struct VcpuStat {
+int nvcpu; /* number of vcpu */
+DirtyRateVcpu *rates; /* array of dirty rate for each vcpu */
+} VcpuStat;
+
+/*
+ * Store calculation statistics for each measure.
+ */
+struct DirtyRateStat {
 int64_t dirty_rate; /* dirty rate in MB/s */
 int64_t start_time; /* calculation start time in units of second */
 int64_t calc_time; /* time duration of two sampling in units of second */
 uint64_t sample_pages; /* sample pages per GB */
+union {
+SampleVMStat page_sampling;
+VcpuStat dirty_ring;
+};
 };
 
 void *get_dirtyrate_thread(void *arg);
diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c
index 320c56ba2c..e0a27a992c 100644
--- a/migration/dirtyrate.c
+++ b/migration/dirtyrate.c
@@ -88,33 +88,44 @@ static struct DirtyRateInfo *query_dirty_rate_info(void)
 return info;
 }
 
-static void init_dirtyrate_stat(int64_t start_time, int64_t calc_time,
-uint64_t sample_pages)
+static void init_dirtyrate_stat(int64_t start_time,
+struct DirtyRateConfig config)
 {
-DirtyStat.total_dirty_samples = 0;
-DirtyStat.total_sample_count = 0;
-DirtyStat.total_block_mem_MB = 0;
 DirtyStat.dirty_rate = -1;
 DirtyStat.start_time = start_time;
-DirtyStat.calc_time = calc_time;
-DirtyStat.sample_pages = sample_pages;
+DirtyStat.calc_time = config.sample_period_seconds;
+DirtyStat.sample_pages = config.sample_pages_per_gigabytes;
+
+switch (config.mode) {
+case DIRTY_RATE_MEASURE_MODE_PAGE_SAMPLING:
+DirtyStat.page_sampling.total_dirty_samples = 0;
+DirtyStat.page_sampling.total_sample_count = 0;
+DirtyStat.page_sampling.total_block_mem_MB = 0;
+break;
+case DIRTY_RATE_MEASURE_MODE_DIRTY_RING:
+DirtyStat.dirty_ring.nvcpu = -1;
+DirtyStat.dirty_ring.rates = NULL;
+break;
+default:
+break;
+}
 }
 
 static void update_dirtyrate_stat(struct RamblockDirtyInfo *info)
 {
-DirtyStat.total_dirty_samples += info->sample_dirty_count;
-DirtyStat.total_sample_count += info->sample_pages_count;
+DirtyStat.page_sampling.total_dirty_samples += info->sample_dirty_count;
+DirtyStat.page_sampling.total_sample_count += info->sample_pages_count;
 /* size of total pages in MB */
-DirtyStat.total_block_mem_MB += (info->ramblock_pages *
- TARGET_PAGE_SIZE) >> 20;
+

[PULL 03/20] memory: make global_dirty_tracking a bitmask

2021-11-01 Thread Juan Quintela

From: Hyman Huang(黄勇) 

Since the dirty ring has been introduced, there are two methods
to track dirty pages of a vm. "logging" hints at one particular
method, so renaming global_dirty_log to global_dirty_tracking
makes the description more accurate.

Dirty rate measurement may start or stop dirty tracking during
calculation. This conflicts with migration: stopping dirty
tracking makes migration miss dirty pages, which is a problem.

Making global_dirty_tracking a bitmask lets both migration and
dirty rate measurement work fine. Introduce GLOBAL_DIRTY_MIGRATION
and GLOBAL_DIRTY_DIRTY_RATE to distinguish what the current dirty
tracking is for: migration or dirty rate.
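
How a bitmask lets two users share one tracking switch can be sketched as follows. This is an illustration under assumptions, not the code of this patch: the DIRTY_* macros, dirty_tracking, dirty_log_start() and dirty_log_stop() are hypothetical names, and the boolean return values exist only to make the on/off transitions visible.

```c
#include <stdbool.h>

#define DIRTY_MIGRATION   (1U << 0)   /* tracking wanted by migration */
#define DIRTY_DIRTY_RATE  (1U << 1)   /* tracking wanted by dirty rate */
#define DIRTY_MASK        (DIRTY_MIGRATION | DIRTY_DIRTY_RATE)

static unsigned int dirty_tracking;   /* bitmask of active users */

/* Returns true if this call switched tracking on (no user before). */
static bool dirty_log_start(unsigned int flags)
{
    unsigned int old = dirty_tracking;

    dirty_tracking |= flags & DIRTY_MASK;
    return old == 0 && dirty_tracking != 0;
}

/* Returns true if this call switched tracking off (last user gone). */
static bool dirty_log_stop(unsigned int flags)
{
    unsigned int old = dirty_tracking;

    dirty_tracking &= ~(flags & DIRTY_MASK);
    return old != 0 && dirty_tracking == 0;
}
```

With a plain bool, the dirty rate code stopping its measurement would also switch tracking off for a running migration; with one bit per user, tracking stays on until the last bit is cleared.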

Signed-off-by: Hyman Huang(黄勇) 
Message-Id: 
<9c9388657cfa0301bd2c1cfa36e7cf6da4aeca19.1624040308.git.huang...@chinatelecom.cn>
Reviewed-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 include/exec/memory.h   | 20 +---
 include/exec/ram_addr.h |  4 ++--
 hw/i386/xen/xen-hvm.c   |  4 ++--
 migration/ram.c | 15 +++
 softmmu/memory.c| 32 +---
 softmmu/trace-events|  1 +
 6 files changed, 54 insertions(+), 22 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index a185b6dcb8..04280450c9 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -61,7 +61,17 @@ static inline void fuzz_dma_read_cb(size_t addr,
 }
 #endif
 
-extern bool global_dirty_log;
+/* Possible bits for global_dirty_log_{start|stop} */
+
+/* Dirty tracking enabled because migration is running */
+#define GLOBAL_DIRTY_MIGRATION  (1U << 0)
+
+/* Dirty tracking enabled because measuring dirty rate */
+#define GLOBAL_DIRTY_DIRTY_RATE (1U << 1)
+
+#define GLOBAL_DIRTY_MASK  (0x3)
+
+extern unsigned int global_dirty_tracking;
 
 typedef struct MemoryRegionOps MemoryRegionOps;
 
@@ -2388,13 +2398,17 @@ void memory_listener_unregister(MemoryListener *listener);
 
 /**
  * memory_global_dirty_log_start: begin dirty logging for all regions
+ *
+ * @flags: purpose of starting dirty log, migration or dirty rate
  */
-void memory_global_dirty_log_start(void);
+void memory_global_dirty_log_start(unsigned int flags);
 
 /**
  * memory_global_dirty_log_stop: end dirty logging for all regions
+ *
+ * @flags: purpose of stopping dirty log, migration or dirty rate
  */
-void memory_global_dirty_log_stop(void);
+void memory_global_dirty_log_stop(unsigned int flags);
 
 void mtree_info(bool flatview, bool dispatch_tree, bool owner, bool disabled);
 
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 551876bed0..45c913264a 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -369,7 +369,7 @@ static inline void cpu_physical_memory_set_dirty_lebitmap(unsigned long *bitmap,
 
 qatomic_or(&blocks[DIRTY_MEMORY_VGA][idx][offset], temp);
 
-if (global_dirty_log) {
+if (global_dirty_tracking) {
 qatomic_or(
 &blocks[DIRTY_MEMORY_MIGRATION][idx][offset],
 temp);
@@ -392,7 +392,7 @@ static inline void cpu_physical_memory_set_dirty_lebitmap(unsigned long *bitmap,
 } else {
 uint8_t clients = tcg_enabled() ? DIRTY_CLIENTS_ALL : DIRTY_CLIENTS_NOCODE;
 
-if (!global_dirty_log) {
+if (!global_dirty_tracking) {
 clients &= ~(1 << DIRTY_MEMORY_MIGRATION);
 }
 
diff --git a/hw/i386/xen/xen-hvm.c b/hw/i386/xen/xen-hvm.c
index e3d3d5cf89..482be95415 100644
--- a/hw/i386/xen/xen-hvm.c
+++ b/hw/i386/xen/xen-hvm.c
@@ -1613,8 +1613,8 @@ void xen_hvm_modified_memory(ram_addr_t start, ram_addr_t length)
 void qmp_xen_set_global_dirty_log(bool enable, Error **errp)
 {
 if (enable) {
-memory_global_dirty_log_start();
+memory_global_dirty_log_start(GLOBAL_DIRTY_MIGRATION);
 } else {
-memory_global_dirty_log_stop();
+memory_global_dirty_log_stop(GLOBAL_DIRTY_MIGRATION);
 }
 }
diff --git a/migration/ram.c b/migration/ram.c
index bb908822d5..ae2601bf3b 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2216,7 +2216,14 @@ static void ram_save_cleanup(void *opaque)
 /* caller have hold iothread lock or is in a bh, so there is
  * no writing race against the migration bitmap
  */
-memory_global_dirty_log_stop();
+if (global_dirty_tracking & GLOBAL_DIRTY_MIGRATION) {
+/*
+ * do not stop dirty log without starting it, since
+ * memory_global_dirty_log_stop will assert that
+ * memory_global_dirty_log_start/stop used in pairs
+ */
+memory_global_dirty_log_stop(GLOBAL_DIRTY_MIGRATION);
+}
 }
 
 RAMBLOCK_FOREACH_NOT_IGNORED(block) {
@@ -2678,7 +2685,7 @@ static void ram_init_bitmaps(RAMState *rs)
 ram_list_init_bitmaps();
 /* We don't use dirty log with background snapshots

[PULL 01/20] migration/rdma: Fix out of order wrid

2021-11-01 Thread Juan Quintela

From: Li Zhijian 

destination:
../qemu/build/qemu-system-x86_64 -enable-kvm -netdev 
tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device 
e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive 
if=none,file=./Fedora-rdma-server-migration.qcow2,id=drive-virtio-disk0 -device 
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 
2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl 
-spice streaming-video=filter,port=5902,disable-ticketing -incoming 
rdma:192.168.22.23:
qemu-system-x86_64: -spice streaming-video=filter,port=5902,disable-ticketing: 
warning: short-form boolean option 'disable-ticketing' deprecated
Please use disable-ticketing=on instead
QEMU 6.0.50 monitor - type 'help' for more information
(qemu) trace-event qemu_rdma_block_for_wrid_miss on
(qemu) dest_init RDMA Device opened: kernel name rxe_eth0 uverbs device name 
uverbs2, infiniband_verbs class device path 
/sys/class/infiniband_verbs/uverbs2, infiniband class device path 
/sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
qemu_rdma_block_for_wrid_miss A Wanted wrid CONTROL SEND (2000) but got CONTROL 
RECV (4000)

source:
../qemu/build/qemu-system-x86_64 -enable-kvm -netdev 
tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device 
e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive 
if=none,file=./Fedora-rdma-server.qcow2,id=drive-virtio-disk0 -device 
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 
2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl 
-spice streaming-video=filter,port=5901,disable-ticketing -S
qemu-system-x86_64: -spice streaming-video=filter,port=5901,disable-ticketing: 
warning: short-form boolean option 'disable-ticketing' deprecated
Please use disable-ticketing=on instead
QEMU 6.0.50 monitor - type 'help' for more information
(qemu)
(qemu) trace-event qemu_rdma_block_for_wrid_miss on
(qemu) migrate -d rdma:192.168.22.23:
source_resolve_host RDMA Device opened: kernel name rxe_eth0 uverbs device name 
uverbs2, infiniband_verbs class device path 
/sys/class/infiniband_verbs/uverbs2, infiniband class device path 
/sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
(qemu) qemu_rdma_block_for_wrid_miss A Wanted wrid WRITE RDMA (1) but got 
CONTROL RECV (4000)

NOTE: we use soft RoCE as the rdma device.
[root@iaas-rpma images]# rdma link show rxe_eth0/1
link rxe_eth0/1 state ACTIVE physical_state LINK_UP netdev eth0

This migration cannot complete when an out-of-order (OOO) CQ event occurs.
The send queue and receive queue share the same completion queue, and
qemu_rdma_block_for_wrid() drops the CQs it is not interested in. But
the CQs dropped by qemu_rdma_block_for_wrid() could be ones it wants later.
In that case, qemu_rdma_block_for_wrid() blocks forever.

OOO cases can occur on both the source side and the destination side. The
endless blocking happens only when SEND and RECV are out of order; OOO between
'WRITE RDMA' and 'RECV' doesn't matter.

Below is the OOO sequence:
      source                            destination
      rdma_write_one()                  qemu_rdma_registration_handle()
1.    S1: post_recv X                   D1: post_recv Y
2.    wait for recv CQ event X
3.                                      D2: post_send X -------------------+
4.                                      wait for send CQ send event X (D2) |
5.    recv CQ event X reaches (D2)                                         |
6.  +-S2: post_send Y                                                      |
7.  | wait for send CQ event Y                                             |
8.  |                                   recv CQ event Y (S2) (drop it)     |
9.  +-send CQ event Y reaches (S2)                                         |
10.                                     send CQ event X reaches (D2) ------+
11.                                     wait recv CQ event Y (dropped by (8))

Although hardware IB worked fine in a hundred of my runs, the IB specification
doesn't guarantee the CQ order in such a case.

Here we introduce an independent send completion queue to distinguish the
ibv_post_send completion queue from the original mixed completion queue.
It helps us poll the specific CQE we are really interested in.
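
The destination side of steps 8 to 11 can be replayed with a toy queue to see why the shared completion queue hangs and the split one does not. This is only a model under assumptions: struct cq, cq_push(), cq_wait() and the WRID_* values are hypothetical, no verbs call is involved, and an empty queue stands in for blocking forever.

```c
#include <stdbool.h>
#include <stddef.h>

enum { WRID_SEND_X = 1, WRID_RECV_Y = 2 };   /* hypothetical work IDs */

struct cq {
    int events[8];
    size_t head, tail;
};

static void cq_push(struct cq *q, int wrid)
{
    q->events[q->tail++] = wrid;
}

/*
 * Poll until `wanted` shows up and discard everything else, as the text
 * above describes for qemu_rdma_block_for_wrid(). Returns false when the
 * queue runs dry, which models blocking forever.
 */
static bool cq_wait(struct cq *q, int wanted)
{
    while (q->head < q->tail) {
        if (q->events[q->head++] == wanted) {
            return true;
        }
    }
    return false;
}

/* One shared CQ: recv Y arrives first and is dropped while waiting for X. */
static bool shared_cq_completes(void)
{
    struct cq shared = { {0}, 0, 0 };

    cq_push(&shared, WRID_RECV_Y);
    cq_push(&shared, WRID_SEND_X);
    return cq_wait(&shared, WRID_SEND_X) && cq_wait(&shared, WRID_RECV_Y);
}

/* Separate send and recv CQs: each wait sees only its own completions. */
static bool split_cq_completes(void)
{
    struct cq send_cq = { {0}, 0, 0 };
    struct cq recv_cq = { {0}, 0, 0 };

    cq_push(&recv_cq, WRID_RECV_Y);
    cq_push(&send_cq, WRID_SEND_X);
    return cq_wait(&send_cq, WRID_SEND_X) && cq_wait(&recv_cq, WRID_RECV_Y);
}
```

In the shared case the wait for send X consumes recv Y on the way, so the later wait for recv Y finds nothing. With two queues the arrival order across queues no longer matters.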

Signed-off-by: Li Zhijian 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/rdma.c | 138 ++-
 1 file changed, 101 insertions(+), 37 deletions(-)

diff --git a/migration/rdma.c b/migration/rdma.c
index 2a3c7889b9..f5d3bbe7e9 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -358,9 +358,11 @@ typedef struct RDMAContext {
 struct ibv_context  *verbs;
 struct rdma_event_channel   *channel;
 struct ibv_qp *qp;  /* queue pair */
-struct ibv_comp_channel *comp_channel;  /* completion channel */
+struct ibv_comp_channel *recv_comp_channel;  /* recv

[PULL 08/20] migration: Make migration blocker work for snapshots too

2021-11-01 Thread Juan Quintela

From: Peter Xu 

save_snapshot() checks migration blockers, which looks sane.  In the meantime we
should also teach the blocker add helper to fail if a snapshot is in progress,
just like for migrations.

Reviewed-by: Marc-André Lureau 
Signed-off-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/migration.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 9172686b89..e81e473f5a 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2058,15 +2058,16 @@ int migrate_add_blocker(Error *reason, Error **errp)
 return -EACCES;
 }
 
-if (migration_is_idle()) {
-migration_blockers = g_slist_prepend(migration_blockers, reason);
-return 0;
+/* Snapshots are similar to migrations, so check RUN_STATE_SAVE_VM too. */
+if (runstate_check(RUN_STATE_SAVE_VM) || !migration_is_idle()) {
+error_propagate_prepend(errp, error_copy(reason),
+"disallowing migration blocker "
+"(migration/snapshot in progress) for: ");
+return -EBUSY;
 }
 
-error_propagate_prepend(errp, error_copy(reason),
-"disallowing migration blocker "
-"(migration in progress) for: ");
-return -EBUSY;
+migration_blockers = g_slist_prepend(migration_blockers, reason);
+return 0;
 }
 
 void migrate_del_blocker(Error *reason)
-- 
2.33.1

[PULL 00/20] Migration 20211031 patches

2021-11-01 Thread Juan Quintela

The following changes since commit af531756d25541a1b3b3d9a14e72e7fedd941a2e:

  Merge remote-tracking branch 'remotes/philmd/tags/renesas-20211030' into 
staging (2021-10-30 11:31:41 -0700)

are available in the Git repository at:

  https://github.com/juanquintela/qemu.git tags/migration-20211031-pull-request

for you to fetch changes up to 826b8bc80cb191557a4ce7cf0e155b436d2d1afa:

  migration/dirtyrate: implement dirty-bitmap dirtyrate calculation (2021-11-01 
22:56:44 +0100)


Migration Pull request

Hi

this includes pending bits of migration patches.

- virtio-mem support by David Hildenbrand
- dirtyrate improvements by Hyman Huang
- fix rdma wrid by Li Zhijian
- dump-guest-memory fixes by Peter Xu

Please apply.

Thanks, Juan.



David Hildenbrand (8):
  memory: Introduce replay_discarded callback for RamDiscardManager
  virtio-mem: Implement replay_discarded RamDiscardManager callback
  migration/ram: Handle RAMBlocks with a RamDiscardManager on the
migration source
  virtio-mem: Drop precopy notifier
  migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the
destination
  migration: Simplify alignment and alignment checks
  migration/ram: Factor out populating pages readable in
ram_block_populate_pages()
  migration/ram: Handle RAMBlocks with a RamDiscardManager on background
snapshots

Hyman Huang(黄勇) (6):
  KVM: introduce dirty_pages and kvm_dirty_ring_enabled
  memory: make global_dirty_tracking a bitmask
  migration/dirtyrate: introduce struct and adjust DirtyRateStat
  migration/dirtyrate: adjust order of registering thread
  migration/dirtyrate: move init step of calculation to main thread
  migration/dirtyrate: implement dirty-ring dirtyrate calculation

Hyman Huang(黄勇) (2):
  memory: introduce total_dirty_pages to stat dirty pages
  migration/dirtyrate: implement dirty-bitmap dirtyrate calculation

Li Zhijian (1):
  migration/rdma: Fix out of order wrid

Peter Xu (3):
  migration: Make migration blocker work for snapshots too
  migration: Add migrate_add_blocker_internal()
  dump-guest-memory: Block live migration

 qapi/migration.json|  48 -
 include/exec/memory.h  |  41 +++-
 include/exec/ram_addr.h|  13 +-
 include/hw/core/cpu.h  |   1 +
 include/hw/virtio/virtio-mem.h |   3 -
 include/migration/blocker.h|  16 ++
 include/sysemu/kvm.h   |   1 +
 migration/dirtyrate.h  |  21 +-
 migration/ram.h|   1 +
 accel/kvm/kvm-all.c|   7 +
 accel/stubs/kvm-stub.c |   5 +
 dump/dump.c|  19 ++
 hw/i386/xen/xen-hvm.c  |   4 +-
 hw/virtio/virtio-mem.c |  92 ++---
 migration/dirtyrate.c  | 367 ++---
 migration/migration.c  |  30 +--
 migration/postcopy-ram.c   |  40 +++-
 migration/ram.c| 180 ++--
 migration/rdma.c   | 138 +
 softmmu/memory.c   |  43 +++-
 hmp-commands.hx|   8 +-
 migration/trace-events |   2 +
 softmmu/trace-events   |   1 +
 23 files changed, 909 insertions(+), 172 deletions(-)

-- 
2.33.1

[PULL 05/20] migration/dirtyrate: adjust order of registering thread

2021-11-01 Thread Juan Quintela

From: Hyman Huang(黄勇) 

Register the get_dirtyrate thread with RCU in advance so that both
page-sampling and dirty-ring mode are covered.

Signed-off-by: Hyman Huang(黄勇) 
Message-Id: 

Reviewed-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/dirtyrate.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c
index e0a27a992c..a9bdd60034 100644
--- a/migration/dirtyrate.c
+++ b/migration/dirtyrate.c
@@ -352,7 +352,6 @@ static void calculate_dirtyrate(struct DirtyRateConfig 
config)
 int64_t msec = 0;
 int64_t initial_time;
 
-rcu_register_thread();
 rcu_read_lock();
 initial_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
 if (!record_ramblock_hash_info(&block_dinfo, config, &block_count)) {
@@ -375,7 +374,6 @@ static void calculate_dirtyrate(struct DirtyRateConfig 
config)
 out:
 rcu_read_unlock();
 free_ramblock_dirty_info(block_dinfo, block_count);
-rcu_unregister_thread();
 }
 
 void *get_dirtyrate_thread(void *arg)
@@ -383,6 +381,7 @@ void *get_dirtyrate_thread(void *arg)
 struct DirtyRateConfig config = *(struct DirtyRateConfig *)arg;
 int ret;
 int64_t start_time;
+rcu_register_thread();
 
 ret = dirtyrate_set_state(, DIRTY_RATE_STATUS_UNSTARTED,
   DIRTY_RATE_STATUS_MEASURING);
@@ -401,6 +400,8 @@ void *get_dirtyrate_thread(void *arg)
 if (ret == -1) {
 error_report("change dirtyrate state failed.");
 }
+
+rcu_unregister_thread();
 return NULL;
 }
 
-- 
2.33.1

[PULL 02/20] KVM: introduce dirty_pages and kvm_dirty_ring_enabled

2021-11-01 Thread Juan Quintela

From: Hyman Huang(黄勇) 

dirty_pages is used to calculate the dirty rate via the dirty ring. When
the dirty ring is enabled, kvm-reaper increases dirty_pages after gfns
have been dirtied.

kvm_dirty_ring_enabled shows whether kvm-reaper is working. The dirtyrate
thread can use it to check whether the measurement can be based on the
dirty ring feature.
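
As a rough illustration of how such a counter can feed a rate, here is a
minimal sketch. It is not the calculation added later in this series:
dirty_rate_mib_per_sec() and its parameters are hypothetical, and it assumes
the counter is sampled twice, at the start and at the end of a measurement
window.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Dirty rate in MiB/s from two samples of a dirty page counter.
 * pages_start/pages_end: counter value before and after the window.
 * page_size: bytes per page; elapsed_ms: window length in milliseconds.
 */
uint64_t dirty_rate_mib_per_sec(uint64_t pages_start, uint64_t pages_end,
                                uint64_t page_size, uint64_t elapsed_ms)
{
    assert(elapsed_ms > 0);
    assert(pages_end >= pages_start);

    uint64_t dirty_bytes = (pages_end - pages_start) * page_size;

    /* bytes per ms -> bytes per second -> MiB per second */
    return dirty_bytes * 1000 / elapsed_ms / (1024 * 1024);
}
```

With 4 KiB pages, 25600 newly dirtied pages in one second is 100 MiB/s.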

Signed-off-by: Hyman Huang(黄勇) 
Message-Id: 

Reviewed-by: Peter Xu 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 include/hw/core/cpu.h  | 1 +
 include/sysemu/kvm.h   | 1 +
 accel/kvm/kvm-all.c| 7 +++
 accel/stubs/kvm-stub.c | 5 +
 4 files changed, 14 insertions(+)

diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index 1a10497af3..e948e81f1a 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -381,6 +381,7 @@ struct CPUState {
 struct kvm_run *kvm_run;
 struct kvm_dirty_gfn *kvm_dirty_gfns;
 uint32_t kvm_fetch_index;
+uint64_t dirty_pages;
 
 /* Used for events with 'vcpu' and *without* the 'disabled' properties */
 DECLARE_BITMAP(trace_dstate_delayed, CPU_TRACE_DSTATE_MAX_EVENTS);
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index a1ab1ee12d..7b22aeb6ae 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -547,4 +547,5 @@ bool kvm_cpu_check_are_resettable(void);
 
 bool kvm_arch_cpu_check_are_resettable(void);
 
+bool kvm_dirty_ring_enabled(void);
 #endif
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index db8d83b137..eecd8031cf 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -469,6 +469,7 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
 cpu->kvm_fd = ret;
 cpu->kvm_state = s;
 cpu->vcpu_dirty = true;
+cpu->dirty_pages = 0;
 
 mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);
 if (mmap_size < 0) {
@@ -743,6 +744,7 @@ static uint32_t kvm_dirty_ring_reap_one(KVMState *s, 
CPUState *cpu)
 count++;
 }
 cpu->kvm_fetch_index = fetch;
+cpu->dirty_pages += count;
 
 return count;
 }
@@ -2296,6 +2298,11 @@ bool kvm_vcpu_id_is_valid(int vcpu_id)
 return vcpu_id >= 0 && vcpu_id < kvm_max_vcpu_id(s);
 }
 
+bool kvm_dirty_ring_enabled(void)
+{
+return kvm_state->kvm_dirty_ring_size ? true : false;
+}
+
 static int kvm_init(MachineState *ms)
 {
 MachineClass *mc = MACHINE_GET_CLASS(ms);
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index 5b1d00a222..5319573e00 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -147,4 +147,9 @@ bool kvm_arm_supports_user_irq(void)
 {
 return false;
 }
+
+bool kvm_dirty_ring_enabled(void)
+{
+return false;
+}
 #endif
-- 
2.33.1

Re: [PATCH] Revert "elf: Relax MIPS' elf_check_arch() to accept EM_NANOMIPS too"

2021-11-01 Thread Philippe Mathieu-Daudé

On 11/1/21 22:40, Vince Del Vecchio wrote:
> Philippe said:
> 
>> So far QEMU only supports the MIPS o32 / n32 / n64 ABIs. The p32 ABI is not 
>> implemented, therefore we cannot run any nanoMIPS binary.
> 
> We use it internally to run nanoMIPS binaries every day.  I had thought 
> everything relevant was completed and upstreamed, but perhaps there is a gap 
> somewhere.  Let us investigate a little and get back to you.

I could wait a few days until the QEMU hard freeze and queue this patch as a
bug fix, but I doubt there is much you can do in that time frame, since
tomorrow is the soft freeze deadline.

Here I am simply changing the code to reject p32 binaries so that users
do not waste their time trying to run a nanoMIPS binary. I am not
removing any of the nanoMIPS emulation code.

> Were you trying to run a bare metal executable or was it linux?

While I tested both toolchains (bare metal and musl/linux), here I am
only referring to the musl/linux one, since it is related to user-mode
emulation (files under linux-user/ directory).

The system emulation part is left unchanged (you can still boot a
nanoMIPS kernel if you select the proper CPU type).

> -Original Message-
> From: Qemu-devel 
>  On Behalf Of 
> Philippe Mathieu-Daudé
> Sent: Monday, November 1, 2021 7:48 AM
> To: qemu-devel@nongnu.org
> Cc: Aleksandar Rikalo ; Richard Henderson 
> ; Laurent Vivier ; Philippe 
> Mathieu-Daudé ; Petar Jovanovic 
> ; Aurelien Jarno 
> Subject: [PATCH] Revert "elf: Relax MIPS' elf_check_arch() to accept 
> EM_NANOMIPS too"
> 
> Per the "P32 Porting Guide" (rev 1.2) [1], chapter 2:
> 
>   p32 ABI Overview
>   
> 
>   The Application Binary Interface, or ABI, is the set of rules
>   that all binaries must follow in order to run on a nanoMIPS
>   system. This includes, for example, object file format,
>   instruction set, data layout, subroutine calling convention,
>   and system call numbers. The ABI is one part of the mechanism
>   that maintains binary compatibility across all nanoMIPS platforms.
> 
>   p32 improves on o32 to provide an ABI that is efficient in both
>   code density and performance. p32 is required for the nanoMIPS
>   architecture.
> 
> So far QEMU only supports the MIPS o32 / n32 / n64 ABIs. The p32 ABI is not 
> implemented, therefore we cannot run any nanoMIPS binary.
> 
> Revert commit f72541f3a59 ("elf: Relax MIPS' elf_check_arch() to accept 
> EM_NANOMIPS too").
> 
> See also the "ELF ABI Supplement" [2].
> 
> [1] 
> http://codescape.mips.com/components/toolchain/nanomips/2019.03-01/docs/MIPS_nanoMIPS_p32_ABI_Porting_Guide_01_02_DN00184.pdf
> [2] 
> http://codescape.mips.com/components/toolchain/nanomips/2019.03-01/docs/MIPS_nanoMIPS_ABI_supplement_01_03_DN00179.pdf
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  linux-user/elfload.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/linux-user/elfload.c b/linux-user/elfload.c
> index f9b82616920..5da8c02d082 100644
> --- a/linux-user/elfload.c
> +++ b/linux-user/elfload.c
> @@ -925,8 +925,6 @@ static void elf_core_copy_regs(target_elf_gregset_t *regs, const CPUPPCState *en
>  #endif
>  #define ELF_ARCH EM_MIPS
>  
> -#define elf_check_arch(x) ((x) == EM_MIPS || (x) == EM_NANOMIPS)
> -
>  #ifdef TARGET_ABI_MIPSN32
>  #define elf_check_abi(x) ((x) & EF_MIPS_ABI2)
>  #else
> --
> 2.31.1

[PULL 1/1] roms/openbios: update OpenBIOS images to b9062dea built from submodule

2021-11-01 Thread Mark Cave-Ayland

Signed-off-by: Mark Cave-Ayland 
---
 pc-bios/openbios-ppc | Bin 696912 -> 696912 bytes
 pc-bios/openbios-sparc32 | Bin 382048 -> 382048 bytes
 pc-bios/openbios-sparc64 | Bin 1593408 -> 1593408 bytes
 roms/openbios|   2 +-
 4 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/pc-bios/openbios-ppc b/pc-bios/openbios-ppc
index 
91a73db9a3b3a79d2bad6433bd73eecaac3c948f..67f32a8602e368a2485ec463cf6c8a96da77b46d
 100644
GIT binary patch
delta 105
zcmca`MC-y4t%fa(`?oSOPCu}fF`Qc+-|;%Mq)Z_p@NZt5s0udHft}a
jW8Pj+$C40)DfQq2OZ$ThtU$~L#Oy%KvHigX{o5>_a;

delta 105
zcmca`MC-y4t%fa(`?oSOO+T=eF`Qc+-|;%Mq2aB6{qg@TcRk)eWtrInF!
mdqExZ_JTT=gdj|*2Nzh{A6#GsVm2UV2V#!x4=!*PTm%3TSSYsu

diff --git a/pc-bios/openbios-sparc32 b/pc-bios/openbios-sparc32
index 
a5b738919119f7dde4e83ceb483fe2aa8a32f5ea..376b01c10b5ecc6c52f6f7f97866b54b3d2a3f3a
 100644
GIT binary patch
delta 48
ycmaE`Lj1uB@rD-0Eldv90)F{r3Wf?s21X#l%Gk6$z?un#Pdw?~oz)S#w`VK(=

delta 48
zcmaE`Lj1uB@rD-0Eldv90>P;T3Kj}R21bSo29{PvM(qLCOhC*G#4OtbtXTzS0sw-y
B4nP0^

diff --git a/pc-bios/openbios-sparc64 b/pc-bios/openbios-sparc64
index 
f7a501efc6a3917c29ed880d4236db8e983c7c0c..bbd746fde9494bd776ece55951974f393383e1b8
 100644
GIT binary patch
delta 243
zcmX@GAo0M0#D*5e7N!>F7M2#)7Pc1l7LFFq7OocV7M>Q~7QPn#EdqyYn4djLntrTC
zKo!U+n*I*LNZKw|D=?J{B)5HCyMQetnDwqh;A=5Uotwzy?a%)RFtQ8y<(DZKDi|3U
zfe0+X~a1-V5*YTMZ)1-G+F3Y`sQWtXpOR+xSvUr22GyF4LRCM1FBeuYAJnSlzn
x>lF!oWe3SLLX=Los}hO@%gn11DrH2HY1gX}0%Bnx76D>WAQs!MS0gTT1pps)Rx1Di

delta 243
zcmX@GAo0M0#D*5e7N!>F7M2#)7Pc1l7LFFq7OocV7M>Q~7QPn#EdqyYn7K-frXQ;j
zPz5raroV$QjJAu_3QXk!$!%ZPE?~!COTMj

diff --git a/roms/openbios b/roms/openbios
index d657b65318..b9062deaae 16
--- a/roms/openbios
+++ b/roms/openbios
@@ -1 +1 @@
-Subproject commit d657b653186c0fd6e062cab133497415c2a5a5b8
+Subproject commit b9062deaaea7269369eaa46260d75edcaf276afa
-- 
2.20.1

[PULL 0/1] qemu-openbios queue 20211101

2021-11-01 Thread Mark Cave-Ayland

The following changes since commit af531756d25541a1b3b3d9a14e72e7fedd941a2e:

  Merge remote-tracking branch 'remotes/philmd/tags/renesas-20211030' into 
staging (2021-10-30 11:31:41 -0700)

are available in the Git repository at:

  git://github.com/mcayland/qemu.git tags/qemu-openbios-20211101

for you to fetch changes up to 97a5b35c17f4bcddc7b550ac1f4d7e39f100aec1:

  roms/openbios: update OpenBIOS images to b9062dea built from submodule 
(2021-11-01 21:50:52 +)


qemu-openbios queue


Mark Cave-Ayland (1):
  roms/openbios: update OpenBIOS images to b9062dea built from submodule

 pc-bios/openbios-ppc | Bin 696912 -> 696912 bytes
 pc-bios/openbios-sparc32 | Bin 382048 -> 382048 bytes
 pc-bios/openbios-sparc64 | Bin 1593408 -> 1593408 bytes
 roms/openbios|   2 +-
 4 files changed, 1 insertion(+), 1 deletion(-)

Re: [PATCH 0/6] RfC: try improve native hotplug for pcie root ports

2021-11-01 Thread Michael S. Tsirkin

On Tue, Oct 19, 2021 at 08:29:19AM +0200, Gerd Hoffmann wrote:
>   Hi,
> 
> > > Yes.  Maybe ask rh qe to run the patch set through their hotplug test
> > > suite (to avoid an acpi-hotplug style disaster where qe found various
> > > issues after release)?
> > 
> > I'll poke around to see if they can help us... we'll need
> > a backport for that though.
> 
> Easy, it's a clean cherry-pick for 6.1, scratch build is on the way.
> 
> > > > I would also like to see a shorter timeout, maybe 100ms, so
> > > > that we are more responsive to guest changes in resending request.
> > > 
> > > I don't think it is a good idea to go for a shorter timeout given that
> > > the 5 seconds are in the specs and we want to avoid a resent request being
> > > interpreted as cancel.
> > > It also wouldn't change anything at least for linux guests because linux
> > > is waiting those 5 seconds (with power indicator in blinking state).
> > > Only the reason for refusing 'device_del' changes from "5 secs not over
> > > yet" to "guest is busy processing the hotplug request".
> > 
> > First 5 seconds yes. But the retries afterwards?
> 
> Hmm, maybe, but I'd tend to keep it simple and go for 5 secs no matter
> what.  If the guest isn't responding (maybe because it is in the middle
> of a reboot) it's unlikely that fast re-requests are fundamentally
> changing things.
> 
> > > We could consider to tackle the 5sec timeout on the guest side, i.e.
> > > have linux skip the 5sec wait in case the root port is virtual (should
> > > be easy to figure by checking the pci id).
> > > 
> > > take care,
> > >   Gerd
> > 
> > Yes ... do we want to control how long it blinks from hypervisor side?
> 
> Is there a good reason for that?
> If not, then no.  Keep it simple.
> 
> When the guest powers off the slot pcie_cap_slot_write_config() will
> happily unplug the device without additional checks (no check whether
> the 5 seconds are over, also no check whether there is a pending unplug
> request in the first place).
> 
> So in theory the guest turning off slot power quickly should work just
> fine and speed up the unplug process in the common case (guest is
> up'n'running and responsive).  Down to 1-2 secs instead of 5-7.
> Didn't actually test that though.
> 
> take care,
>   Gerd

Even if this speeds up unplug, hotplug remains slow, right?
-- 
MST

RE: [PATCH] Revert "elf: Relax MIPS' elf_check_arch() to accept EM_NANOMIPS too"

2021-11-01 Thread Vince Del Vecchio

[PULL 2/2] vfio/common: Add a trace point when a MMIO RAM section cannot be mapped

2021-11-01 Thread Alex Williamson

From: Kunkun Jiang 

The MSI-X structures of some devices and other non-MSI-X structures
may be in the same BAR. They may share one host page, especially in
the case of large page granularity, such as 64K.

For example, MSIX-Table size of 82599 NIC is 0x30 and the offset in
Bar 3(size 64KB) is 0x0. vfio_listener_region_add() will be called
to map the remaining range (0x30-0x). If host page size is 64KB,
it will return early at 'int128_ge((int128_make64(iova), llend))'
without any message. Let's add a trace point to inform users like commit
5c08600547c0 ("vfio: Use a trace point when a RAM section cannot be DMA mapped")
did.
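
The early return can be checked with a small worked example. The sketch
below only mirrors the two alignment steps visible in
vfio_listener_region_add() (round the start up, round the end down);
page_align_up() and section_is_mappable() are hypothetical helpers, and the
end of the 64KB BAR is taken as 0x10000.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round addr up to the next multiple of page_size (a power of two). */
static uint64_t page_align_up(uint64_t addr, uint64_t page_size)
{
    return (addr + page_size - 1) & ~(page_size - 1);
}

/*
 * True when [start, start + size) still contains at least one whole host
 * page after the start is rounded up and the end is rounded down.
 */
bool section_is_mappable(uint64_t start, uint64_t size, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    uint64_t iova = page_align_up(start, page_size);
    uint64_t llend = (start + size) & ~(page_size - 1);

    return iova < llend;          /* iova >= llend is the silent return */
}
```

For the range that starts at 0x30 in a 64KB BAR, a 64KB host page gives
iova == llend == 0x10000, so nothing is mapped. With a 4KB host page the
same range yields iova 0x1000 and llend 0x10000, and it is mapped.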

Signed-off-by: Kunkun Jiang 
Link: https://lore.kernel.org/r/20211027090406.761-3-jiangkun...@huawei.com
Signed-off-by: Alex Williamson 
---
 hw/vfio/common.c |7 +++
 1 file changed, 7 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a784b219e6d4..dd387b0d3959 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -893,6 +893,13 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
 
 if (int128_ge(int128_make64(iova), llend)) {
+if (memory_region_is_ram_device(section->mr)) {
+trace_vfio_listener_region_add_no_dma_map(
+memory_region_name(section->mr),
+section->offset_within_address_space,
+int128_getlo(section->size),
+qemu_real_host_page_size);
+}
 return;
 }
 end = int128_get64(int128_sub(llend, int128_one()));

Re: [PATCH] target/i386: ensure EXCP0D_GPF is propagated back to the guest

2021-11-01 Thread Mark Cave-Ayland


On 30/10/2021 14:29, Mark Cave-Ayland wrote:


In the case where mmu_translate() returns EXCP0D_GPF ensure that handle_mmu_fault()
returns immediately to propagate the fault back to the guest instead of returning
EXCP0E_PAGE.

Signed-off-by: Mark Cave-Ayland 
Fixes: 661ff4879e ("target/i386: extract mmu_translate")
Closes: https://gitlab.com/qemu-project/qemu/-/issues/394
---

[Paolo: this appears to fix the regression booting Windows 7 on TCG that appeared in 6.1
  as per the above Gitlab issue. Unfortunately as I'm not really familiar with x86 it will
  probably need a better implementation/description but it should at least indicate what
  the problem is.]

  target/i386/tcg/sysemu/excp_helper.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/target/i386/tcg/sysemu/excp_helper.c 
b/target/i386/tcg/sysemu/excp_helper.c
index 7af887be4d..0170f7f791 100644
--- a/target/i386/tcg/sysemu/excp_helper.c
+++ b/target/i386/tcg/sysemu/excp_helper.c
@@ -439,6 +439,10 @@ static int handle_mmu_fault(CPUState *cs, vaddr addr, int 
size,
  prot, mmu_idx, page_size);
  return 0;
  } else {
+if (cs->exception_index == EXCP0D_GPF) {
+return 1;
+}
+
  if (env->intercept_exceptions & (1 << EXCP0E_PAGE)) {
  /* cr2 is not modified in case of exceptions */
  x86_stq_phys(cs,


Revisiting this I think the real issue is that mmu_translate() sets env->error_code 
to 0 and cs->exception_index to EXCP0D_GPF in target/i386/tcg/sysemu/excp_helper.c at 
line 102, so if this path is taken then mmu_translate() can change the vCPU state 
which doesn't seem right.


Presumably the real solution is to update the code in handle_mmu_fault() to detect 
when mmu_translate() returns 1 and check certain flags in env to determine whether 
an EXCP0D_GPF or an EXCP0E_PAGE exception should be generated?
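
One possible shape for that split, as a minimal sketch only: the translation
step reports what went wrong and leaves the vCPU alone, and the fault handler
alone decides which exception to raise. All names below (struct vcpu,
translate(), handle_fault()) are hypothetical and this is not the
target/i386 code. The two boolean inputs stand in for whatever conditions the
real walker checks; the vectors 13 and 14 are the architectural #GP and #PF
numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum xlate_result { XLATE_OK, XLATE_PAGE_FAULT, XLATE_GP_FAULT };

enum { EXCP_NONE = -1, EXCP_GPF = 13, EXCP_PAGE = 14 };

struct vcpu {
    int exception_index;
    int error_code;
    unsigned long cr2;
};

/* Pure lookup: reports the failure kind but never touches the vCPU. */
static enum xlate_result translate(bool addr_valid, bool present)
{
    if (!addr_valid) {
        return XLATE_GP_FAULT;
    }
    if (!present) {
        return XLATE_PAGE_FAULT;
    }
    return XLATE_OK;
}

/* Returns 0 on success, 1 when an exception was raised for the guest. */
int handle_fault(struct vcpu *cpu, unsigned long addr,
                 bool addr_valid, bool present)
{
    assert(cpu != NULL);

    switch (translate(addr_valid, present)) {
    case XLATE_OK:
        return 0;
    case XLATE_GP_FAULT:
        /* #GP is propagated as is: no cr2 update, error code 0 */
        cpu->exception_index = EXCP_GPF;
        cpu->error_code = 0;
        return 1;
    case XLATE_PAGE_FAULT:
    default:
        /* a real handler would also build the #PF error code here */
        cpu->exception_index = EXCP_PAGE;
        cpu->cr2 = addr;
        return 1;
    }
}
```

With this layout the only place that modifies the vCPU state is
handle_fault(), so a #GP cannot be turned into a #PF by accident.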



ATB,

Mark.

Re: [PATCH 2/2] qtest/am53c974-test: add test for cancelling in-flight requests

2021-11-01 Thread Philippe Mathieu-Daudé

On 11/1/21 19:35, Mark Cave-Ayland wrote:
> Based upon the qtest reproducer posted to Gitlab issue #663 at
> https://gitlab.com/qemu-project/qemu/-/issues/663.
> 
> Signed-off-by: Mark Cave-Ayland 
> ---
>  tests/qtest/am53c974-test.c | 36 
>  1 file changed, 36 insertions(+)

Reviewed-by: Philippe Mathieu-Daudé

Re: [PATCH v6 1/2] vhost-user: remove VirtQ notifier restore

2021-11-01 Thread Michael S. Tsirkin

On Mon, Nov 01, 2021 at 04:38:12PM +0800, Xueming Li wrote:
> When a vhost-user vdpa client suspends, the backend may close all resources
> and the VQ notifier mmap address becomes invalid, so restoring the MR that
> contains the invalid address is wrong. The vdpa client will set the VQ
> notifier again after reconnect.
> 
> This patch removes the VQ notifier restore and related flags to avoid reusing
> an invalid address.
> 
> Fixes: 44866521bd6e ("vhost-user: support registering external host 
> notifiers")
> Cc: qemu-sta...@nongnu.org
> Cc: Yuwei Zhang 
> Signed-off-by: Xueming Li 
> ---
>  hw/virtio/vhost-user.c | 19 +--
>  include/hw/virtio/vhost-user.h |  1 -
>  2 files changed, 1 insertion(+), 19 deletions(-)
> 
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index bf6e50223c..c671719e9b 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -1143,19 +1143,6 @@ static int vhost_user_set_vring_num(struct vhost_dev 
> *dev,
>  return vhost_set_vring(dev, VHOST_USER_SET_VRING_NUM, ring);
>  }
>  
> -static void vhost_user_host_notifier_restore(struct vhost_dev *dev,
> - int queue_idx)
> -{
> -struct vhost_user *u = dev->opaque;
> -VhostUserHostNotifier *n = &u->user->notifier[queue_idx];
> -VirtIODevice *vdev = dev->vdev;
> -
> -if (n->addr && !n->set) {
> -virtio_queue_set_host_notifier_mr(vdev, queue_idx, &n->mr, true);
> -n->set = true;
> -}
> -}
> -
>  static void vhost_user_host_notifier_remove(struct vhost_dev *dev,
>  int queue_idx)
>  {
> @@ -1163,17 +1150,14 @@ static void vhost_user_host_notifier_remove(struct 
> vhost_dev *dev,
>  VhostUserHostNotifier *n = &u->user->notifier[queue_idx];
>  VirtIODevice *vdev = dev->vdev;
>  
> -if (n->addr && n->set) {
> +if (n->addr) {
>  virtio_queue_set_host_notifier_mr(vdev, queue_idx, &n->mr, false);
> -n->set = false;
>  }
>  }
>

So on vq stop we still remove the notifier...
  
>  static int vhost_user_set_vring_base(struct vhost_dev *dev,
>   struct vhost_vring_state *ring)
>  {
> -vhost_user_host_notifier_restore(dev, ring->index);
> -
>  return vhost_set_vring(dev, VHOST_USER_SET_VRING_BASE, ring);
>  }
>  

but on vq start we do not reinstate it? Does this not mean that
notifiers won't work after stop then start?


> @@ -1538,7 +1522,6 @@ static int 
> vhost_user_slave_handle_vring_host_notifier(struct vhost_dev *dev,
>  }
>  
>  n->addr = addr;
> -n->set = true;
>  
>  return 0;
>  }
> diff --git a/include/hw/virtio/vhost-user.h b/include/hw/virtio/vhost-user.h
> index a9abca3288..f6012b2078 100644
> --- a/include/hw/virtio/vhost-user.h
> +++ b/include/hw/virtio/vhost-user.h
> @@ -14,7 +14,6 @@
>  typedef struct VhostUserHostNotifier {
>  MemoryRegion mr;
>  void *addr;
> -bool set;
>  } VhostUserHostNotifier;
>  
>  typedef struct VhostUserState {
> -- 
> 2.33.0

Re: [PATCH v6 2/2] vhost-user: fix VirtQ notifier cleanup

2021-11-01 Thread Michael S. Tsirkin

On Mon, Nov 01, 2021 at 04:38:13PM +0800, Xueming Li wrote:
> When vhost-user device cleanup runs and un-mmaps the notifier address,
> a VM CPU thread that writes to the notifier fails with an invalid address
> access error.
> 
> To avoid this race, call RCU and wait for a memory flatview update, then
> un-mmap the notifiers in the callback.
> 
> Fixes: 44866521bd6e ("vhost-user: support registering external host 
> notifiers")
> Cc: qemu-sta...@nongnu.org
> Cc: Yuwei Zhang 
> Signed-off-by: Xueming Li 
> ---
>  hw/virtio/vhost-user.c | 50 +-
>  include/hw/virtio/vhost-user.h |  2 ++
>  2 files changed, 33 insertions(+), 19 deletions(-)
> 
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index c671719e9b..5adad4d029 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -25,6 +25,7 @@
>  #include "migration/migration.h"
>  #include "migration/postcopy-ram.h"
>  #include "trace.h"
> +#include "exec/ramblock.h"
>  
>  #include 
>  #include 
> @@ -1143,15 +1144,27 @@ static int vhost_user_set_vring_num(struct vhost_dev 
> *dev,
>  return vhost_set_vring(dev, VHOST_USER_SET_VRING_NUM, ring);
>  }
>  
> -static void vhost_user_host_notifier_remove(struct vhost_dev *dev,
> -int queue_idx)
> +static void vhost_user_host_notifier_free(VhostUserHostNotifier *n)
>  {
> -struct vhost_user *u = dev->opaque;
> -VhostUserHostNotifier *n = &u->user->notifier[queue_idx];
> -VirtIODevice *vdev = dev->vdev;
> +assert(n && n->old_addr);
> +munmap(n->old_addr, qemu_real_host_page_size);
> +n->old_addr = NULL;
> +}
> +
> +static void vhost_user_host_notifier_remove(VhostUserState *user,
> +VirtIODevice *vdev, int 
> queue_idx)
> +{
> +VhostUserHostNotifier *n = &user->notifier[queue_idx];
>  
>  if (n->addr) {
> -virtio_queue_set_host_notifier_mr(vdev, queue_idx, &n->mr, false);
> +if (vdev) {
> +virtio_queue_set_host_notifier_mr(vdev, queue_idx, &n->mr, 
> false);
> +}
> +assert(n->addr);
> +assert(!n->old_addr);
> +n->old_addr = n->addr;
> +n->addr = NULL;
> +call_rcu(n, vhost_user_host_notifier_free, rcu);
>  }
>  }
>  
> @@ -1190,8 +1203,9 @@ static int vhost_user_get_vring_base(struct vhost_dev 
> *dev,
>  .payload.state = *ring,
>  .hdr.size = sizeof(msg.payload.state),
>  };
> +struct vhost_user *u = dev->opaque;
>  
> -vhost_user_host_notifier_remove(dev, ring->index);
> +vhost_user_host_notifier_remove(u->user, dev->vdev, ring->index);
>  
>  if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
>  return -1;
> @@ -1486,12 +1500,7 @@ static int 
> vhost_user_slave_handle_vring_host_notifier(struct vhost_dev *dev,
>  
>  n = &user->notifier[queue_idx];
>  
> -if (n->addr) {
> -virtio_queue_set_host_notifier_mr(vdev, queue_idx, &n->mr, false);
> -object_unparent(OBJECT(&n->mr));
> -munmap(n->addr, page_size);
> -n->addr = NULL;
> -}
> +vhost_user_host_notifier_remove(user, vdev, queue_idx);
>  
>  if (area->u64 & VHOST_USER_VRING_NOFD_MASK) {
>  return 0;
> @@ -1510,9 +1519,12 @@ static int 
> vhost_user_slave_handle_vring_host_notifier(struct vhost_dev *dev,
>  
>  name = g_strdup_printf("vhost-user/host-notifier@%p mmaps[%d]",
> user, queue_idx);
> -if (!n->mr.ram) /* Don't init again after suspend. */
> +if (!n->mr.ram) { /* Don't init again after suspend. */
>  memory_region_init_ram_device_ptr(&n->mr, OBJECT(vdev), name,
>page_size, addr);
> +} else {
> +n->mr.ram_block->host = addr;
> +}
>  g_free(name);
>  
>  if (virtio_queue_set_host_notifier_mr(vdev, queue_idx, &n->mr, true)) {
> @@ -2460,17 +2472,17 @@ bool vhost_user_init(VhostUserState *user, 
> CharBackend *chr, Error **errp)
>  void vhost_user_cleanup(VhostUserState *user)
>  {
>  int i;
> +VhostUserHostNotifier *n;
>  
>  if (!user->chr) {
>  return;
>  }
>  memory_region_transaction_begin();
>  for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
> -if (user->notifier[i].addr) {
> -object_unparent(OBJECT(&user->notifier[i].mr));
> -munmap(user->notifier[i].addr, qemu_real_host_page_size);
> -user->notifier[i].addr = NULL;
> -}
> +n = &user->notifier[i];
> +assert(!n->addr);

I'm pretty confused as to why this assert holds.
Add a comment?

> +vhost_user_host_notifier_remove(user, NULL, i);
> +object_unparent(OBJECT(&n->mr));
>  }
>  memory_region_transaction_commit();
>  user->chr = NULL;

I'm also confused on why we can do unparent for notifiers which have
never been set up. Won't n->mr be invalid then?


> diff --git a/include/hw/virtio/vhost-user.h b/include/hw/virtio/vhost-user.h
> index

Re: [PATCH] vhost: Fix last queue index of devices with no cvq

2021-11-01 Thread Michael S. Tsirkin

On Mon, Nov 01, 2021 at 04:42:01PM +0100, Eugenio Perez Martin wrote:
> On Mon, Nov 1, 2021 at 9:58 AM Eugenio Perez Martin  
> wrote:
> >
> > On Mon, Nov 1, 2021 at 4:34 AM Jason Wang  wrote:
> > >
> > > On Fri, Oct 29, 2021 at 10:16 PM Eugenio Pérez  
> > > wrote:
> > > >
> > > > The -1 assumes that all devices with no cvq have a spare vq allocated
> > > > for them, but with no offer of VIRTIO_NET_F_CTRL_VQ. This may not be the
> > > > case, and the device may have an even number of queues.
> > > >
> > > > To fix this, just round down to the nearest even number of queues.
> > > >
> > > > Fixes: 049eb15b5fc9 ("vhost: record the last virtqueue index for the 
> > > > virtio device")
> > > > Signed-off-by: Eugenio Pérez 
> > > > ---
> > > >  hw/net/vhost_net.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> > > > index 0d888f29a6..edf56a597f 100644
> > > > --- a/hw/net/vhost_net.c
> > > > +++ b/hw/net/vhost_net.c
> > > > @@ -330,7 +330,7 @@ int vhost_net_start(VirtIODevice *dev, 
> > > > NetClientState *ncs,
> > > >  NetClientState *peer;
> > > >
> > > >  if (!cvq) {
> > > > -last_index -= 1;
> > > > +last_index &= ~1ULL;
> > > >  }
> > >
> > > The math here looks correct but we need to fix vhost_vdpa_dev_start() 
> > > instead?
> > >
> > > if (dev->vq_index + dev->nvqs - 1 != dev->last_index) {
> > > ...
> > > }
> > >
> >
> > If we just do that, devices that offer an odd number of queues but do
> > not offer ctrl vq would never enable the last vq pair, isn't it?
> >
> 
> To expand the issue,
> 
> With that condition it is not possible to make vp_vdpa work on devices
> with no cvq. If I set the L0 guest's device with no cvq (with -device
> virtio-net-pci,...,ctrl_vq=off,mq=off). The nested VM will enter that
> conditional in vhost_net_start, and will mark last_index=1, making it
> impossible to start a vhost_vdpa device.
> 
> However, re-reading the standard:
> 
> controlq only exists if VIRTIO_NET_F_CTRL_VQ set.
> 
> So the code is actually handling an invalid device: The device set
> VIRTIO_NET_F_CTRL_VQ but offered an odd number of VQs.
> 
> Do we have an example of such a device? It's not the case on qemu
> virtio-net, with or without vhost-net in L0 device. The operation &=
> ~1ULL is an intended noop in case the queues are already even. I'm
> fine to keep making last_index even, so we have that safety net, with
> further clarifications as MST said, just in case the device is not
> behaving well. But maybe it's even better just to delete that
> conditional entirely?
> 
> Thanks!
> 

For sure, no need to handle an invalid configuration.
Do you have a patch in mind? It'd be easier to discuss
things with a specific patch rather than theoretically.
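
For reference, the two adjustments of last_index discussed in the quoted
patch differ only for even inputs. A minimal sketch; the helper names are
made up for illustration and do not exist in hw/net/vhost_net.c:

```c
#include <assert.h>
#include <stdint.h>

/* Old behaviour: always step back by one queue. */
uint64_t last_index_minus_one(uint64_t last_index)
{
    assert(last_index > 0);
    return last_index - 1;
}

/* Proposed behaviour: clear bit 0, a no-op when the count is already even. */
uint64_t last_index_round_even(uint64_t last_index)
{
    return last_index & ~1ULL;
}
```

For a device with two queues and no cvq, the first helper yields 1 (the
last_index=1 case described in the quoted mail) while the second keeps 2.
For three queues both yield 2.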

> 
> 
> > Also, I would say that the right place for the solution of this
> > problem should not be virtio/vhost-vdpa: This is highly dependent on
> > having cvq, and this implies a knowledge about the use of each
> > virtqueue. Another kind of device could have an odd number of
> > virtqueues naturally, and that (-1) would not work for them, isn't it?
> >
> > Thanks!
> >
> > > Thanks
> > >
> > > >
> > > >  if (!k->set_guest_notifiers) {
> > > > --
> > > > 2.27.0
> > > >
> > >

Re: [PATCH 00/31] passage: Define a standard for firmware data flow

2021-11-01 Thread François Ozog

Hi Mark,

On Mon, 1 Nov 2021 at 19:19, Mark Kettenis wrote:

> > From: François Ozog 
> > Date: Mon, 1 Nov 2021 09:53:40 +0100
>
> [...]
>
> > We could further leverage Passage to pass Operating Systems parameters that
> > could be removed from device tree (migration of /chosen to Passage). Memory
> > inventory would still be in DT but allocations for CMA or GPUs would be in
> > Passage. This idea is to reach a point where device tree is a "pristine"
> > hardware description.
>
> I wanted to react on something you said in an earlier thread, but this
> discussion seems to be appropriate as well:
>
> The notion that device trees only describe the hardware isn't really
> correct.  Device trees have always been used to configure firmware
> options (through the /options node) and between firmware and the OS
> (through the /chosen node) and to describe firmware interfaces
> (e.g. OpenFirmware calls, PSCI (on ARM), RTAS (on POWER)).  This was
> the case on the original Open Firmware systems, and is still done on
> PowerNV systems that use flattened device trees.


I understand and agree with the above.
Yet, PSCI is different from /options and /chosen: those are platform
services made available to the OS when the boot firmware code has been
unloaded/neutralized.

What I (not just myself, but let's simplify) am trying to do is decouple the
supply chain: a loosely coupled platform provider (ODM), firmware provider,
OS provider and application provider. So the goal is not to prevent the
presence of those existing nodes; it is to be able to introduce some
rationalization in their use:

Platform interfaces such as PSCI: the question is “who” injects them into the
DT (at build time or at runtime). There is no single good answer, and you may
want the authoritative entity that implements the service to actually inject
itself into the DT passed to the OS. I know some platforms use SMC calls from
U-Boot to know what to inject into the DT. I see those as being of the same
nature as DIMM sensing and injection into the DT.

/chosen: a must-have when you do not have UEFI, but not necessary with UEFI.

/options: it should be possible for the end customer to make the decision
of integration: at build time or at runtime based on a separate flattened
device tree file.

This decoupling should result, for instance, in the long run, in adjustable
memory layouts without headaches. Changing the secure DRAM size is simple
from a hardware perspective but a massive issue from a firmware perspective:
the sources of multiple firmware projects need to be adjusted, with manual
calculations on explicit constants or “hidden” ones. It should even be
possible to adjust it at runtime in the field (user-selected firmware
parameter).


> I don't see what the benefits are from using Passage instead.  It
> would only fragment things even more.
>
-- 
François-Frédéric Ozog | *Director Business Development*
T: +33.67221.6485
francois.o...@linaro.org | Skype: ffozog

[PULL 07/10] tests/unit: Add an unit test for smp parsing

2021-11-01 Thread Philippe Mathieu-Daudé

From: Yanan Wang 

Now that we have a generic parser smp_parse(), let's add a unit
test for the code. All possible valid/invalid SMP configurations
that the user can specify are covered.

Signed-off-by: Yanan Wang 
Reviewed-by: Andrew Jones 
Tested-by: Philippe Mathieu-Daudé 
Message-Id: <20211026034659.22040-3-wangyana...@huawei.com>
Acked-by: Eduardo Habkost 
Message-Id: 
[PMD: Squashed format string fixup from Yanan Wang]
Signed-off-by: Philippe Mathieu-Daudé 
---
 tests/unit/test-smp-parse.c | 594 
 MAINTAINERS |   1 +
 tests/unit/meson.build  |   1 +
 3 files changed, 596 insertions(+)
 create mode 100644 tests/unit/test-smp-parse.c

diff --git a/tests/unit/test-smp-parse.c b/tests/unit/test-smp-parse.c
new file mode 100644
index 000..cbe0c990494
--- /dev/null
+++ b/tests/unit/test-smp-parse.c
@@ -0,0 +1,594 @@
+/*
+ * SMP parsing unit-tests
+ *
+ * Copyright (c) 2021 Huawei Technologies Co., Ltd
+ *
+ * Authors:
+ *  Yanan Wang 
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qom/object.h"
+#include "qemu/module.h"
+#include "qapi/error.h"
+
+#include "hw/boards.h"
+
+#define T true
+#define F false
+
+#define MIN_CPUS 1   /* set the min CPUs supported by the machine as 1 */
+#define MAX_CPUS 512 /* set the max CPUs supported by the machine as 512 */
+
+/*
+ * Used to define the generic 3-level CPU topology hierarchy
+ *  -sockets/cores/threads
+ */
+#define SMP_CONFIG_GENERIC(ha, a, hb, b, hc, c, hd, d, he, e) \
+{ \
+.has_cpus= ha, .cpus= a,  \
+.has_sockets = hb, .sockets = b,  \
+.has_cores   = hc, .cores   = c,  \
+.has_threads = hd, .threads = d,  \
+.has_maxcpus = he, .maxcpus = e,  \
+}
+
+#define CPU_TOPOLOGY_GENERIC(a, b, c, d, e)   \
+{ \
+.cpus = a,\
+.sockets  = b,\
+.cores= c,\
+.threads  = d,\
+.max_cpus = e,\
+}
+
+/*
+ * Currently a 4-level topology hierarchy is supported on PC machines
+ *  -sockets/dies/cores/threads
+ */
+#define SMP_CONFIG_WITH_DIES(ha, a, hb, b, hc, c, hd, d, he, e, hf, f) \
+{ \
+.has_cpus= ha, .cpus= a,  \
+.has_sockets = hb, .sockets = b,  \
+.has_dies= hc, .dies= c,  \
+.has_cores   = hd, .cores   = d,  \
+.has_threads = he, .threads = e,  \
+.has_maxcpus = hf, .maxcpus = f,  \
+}
+
+/**
+ * @config - the given SMP configuration
+ * @expect_prefer_sockets - the expected parsing result for the
+ * valid configuration, when sockets are preferred over cores
+ * @expect_prefer_cores - the expected parsing result for the
+ * valid configuration, when cores are preferred over sockets
+ * @expect_error - the expected error report when the given
+ * configuration is invalid
+ */
+typedef struct SMPTestData {
+SMPConfiguration config;
+CpuTopology expect_prefer_sockets;
+CpuTopology expect_prefer_cores;
+const char *expect_error;
+} SMPTestData;
+
+/* Type info of the tested machine */
+static const TypeInfo smp_machine_info = {
+.name = TYPE_MACHINE,
+.parent = TYPE_OBJECT,
+.class_size = sizeof(MachineClass),
+.instance_size = sizeof(MachineState),
+};
+
+/*
+ * List all the possible valid sub-collections of the generic 5
+ * topology parameters (i.e. cpus/maxcpus/sockets/cores/threads),
+ * then test the automatic calculation algorithm of the missing
+ * values in the parser.
+ */
+static struct SMPTestData data_generic_valid[] = {
+{
+/* config: no configuration provided
+ * expect: cpus=1,sockets=1,cores=1,threads=1,maxcpus=1 */
+.config = SMP_CONFIG_GENERIC(F, 0, F, 0, F, 0, F, 0, F, 0),
+.expect_prefer_sockets = CPU_TOPOLOGY_GENERIC(1, 1, 1, 1, 1),
+.expect_prefer_cores   = CPU_TOPOLOGY_GENERIC(1, 1, 1, 1, 1),
+}, {
+/* config: -smp 8
+ * prefer_sockets: cpus=8,sockets=8,cores=1,threads=1,maxcpus=8
+ * prefer_cores: cpus=8,sockets=1,cores=8,threads=1,maxcpus=8 */
+.config = SMP_CONFIG_GENERIC(T, 8, F, 0, F, 0, F, 0, F, 0),
+.expect_prefer_sockets = CPU_TOPOLOGY_GENERIC(8, 8, 1, 1, 8),
+.expect_prefer_cores   = CPU_TOPOLOGY_GENERIC(8, 1, 8, 1, 8),
+
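
The "-smp 8" expectations in the (truncated) test table above can be sketched
in isolation. This is a deliberately simplified, hypothetical model that only
covers the case where the user supplies nothing but a CPU count; it is not the
real smp_parse() logic, and the struct and function names are made up for
illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the topology fields checked by the test. */
struct topo {
    unsigned cpus, sockets, cores, threads, maxcpus;
};

/* Simplified model: only "cpus" was given, every other field was omitted.
 * The missing levels default to 1, and the whole CPU count is assigned
 * either to sockets or to cores depending on the preference. */
static inline struct topo topo_from_cpus_only(unsigned cpus, bool prefer_sockets)
{
    struct topo t = {
        .cpus = cpus, .sockets = 1, .cores = 1, .threads = 1, .maxcpus = cpus,
    };

    if (prefer_sockets) {
        t.sockets = cpus;
    } else {
        t.cores = cpus;
    }
    return t;
}

/* Mirrors the "-smp 8" entry of the table above. */
static inline void topo_selfcheck(void)
{
    struct topo s = topo_from_cpus_only(8, true);
    struct topo c = topo_from_cpus_only(8, false);

    assert(s.sockets == 8 && s.cores == 1);
    assert(c.sockets == 1 && c.cores == 8);
}
```

The real parser also has to handle partially specified configurations and
error reporting, which this sketch leaves out on purpose.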

[PATCH] escc: update transmit status bits when switching to async mode

2021-11-01 Thread Mark Cave-Ayland

The recent ESCC reset changes cause a regression when attempting to use a real
SS-5 Sun PROM instead of OpenBIOS. The Sun PROM doesn't send an explicit reset
command to the ESCC but gets stuck in a loop probing the keyboard waiting for
STATUS_TXEMPTY to be set in R_STATUS followed by SPEC_ALLSENT in R_SPEC.

Reading through the ESCC datasheet again indicates that SPEC_ALLSENT is updated
when a write to W_TXCTRL1 selects async mode, or remains high if sync mode is
selected. Whilst there is no explicit mention of STATUS_TXEMPTY, the ESCC device
emulation always updates these two register bits together when transmitting 
data.

Add extra logic to update both transmit status bits accordingly when writing to
W_TXCTRL1 which enables the Sun PROM to initialise and boot again under QEMU.

Signed-off-by: Mark Cave-Ayland 
---
 hw/char/escc.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/hw/char/escc.c b/hw/char/escc.c
index 0fce4f6324..b33cdc229f 100644
--- a/hw/char/escc.c
+++ b/hw/char/escc.c
@@ -575,6 +575,18 @@ static void escc_mem_write(void *opaque, hwaddr addr,
 s->wregs[s->reg] = val;
 break;
 case W_TXCTRL1:
+s->wregs[s->reg] = val;
+if (val & TXCTRL1_STPMSK) {
+ESCCSERIOQueue *q = &s->queue;
+if (s->type == escc_serial || q->count == 0) {
+s->rregs[R_STATUS] |= STATUS_TXEMPTY;
+s->rregs[R_SPEC] |= SPEC_ALLSENT;
+}
+} else {
+s->rregs[R_STATUS] &= ~STATUS_TXEMPTY;
+s->rregs[R_SPEC] |= SPEC_ALLSENT;
+}
+/* fallthrough */
 case W_TXCTRL2:
 s->wregs[s->reg] = val;
 escc_update_parameters(s);
-- 
2.20.1
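
For readers who want to try the status-bit logic of the hunk above outside
QEMU, here is a standalone sketch. The struct, the function name and the bit
values are illustrative assumptions; the authoritative definitions are in
hw/char/escc.c. Only the branch structure is modelled on the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values (assumptions); see hw/char/escc.c for the real ones. */
#define TXCTRL1_STPMSK 0x0c /* stop-bit field; non-zero selects async mode */
#define STATUS_TXEMPTY 0x04
#define SPEC_ALLSENT   0x01

/* Hypothetical stand-in for the two read registers touched by the patch. */
struct tx_status {
    uint8_t status; /* models rregs[R_STATUS] */
    uint8_t spec;   /* models rregs[R_SPEC] */
};

static inline void update_tx_status(struct tx_status *st, uint8_t txctrl1,
                                    bool is_serial, unsigned queued)
{
    if (txctrl1 & TXCTRL1_STPMSK) {
        /* Async mode: flag "empty" and "all sent" if nothing is pending. */
        if (is_serial || queued == 0) {
            st->status |= STATUS_TXEMPTY;
            st->spec |= SPEC_ALLSENT;
        }
    } else {
        /* Sync mode: TXEMPTY is cleared while ALLSENT stays high. */
        st->status &= (uint8_t)~STATUS_TXEMPTY;
        st->spec |= SPEC_ALLSENT;
    }
}

static inline void update_tx_status_selfcheck(void)
{
    struct tx_status st = { 0, 0 };

    update_tx_status(&st, TXCTRL1_STPMSK, true, 0);
    assert(st.status & STATUS_TXEMPTY);
    assert(st.spec & SPEC_ALLSENT);
}
```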

[PULL 03/10] hw/core: Declare meson source set

2021-11-01 Thread Philippe Mathieu-Daudé

As we want to be able to conditionally add files to the hw/core
file list, use a source set.

Signed-off-by: Philippe Mathieu-Daudé 
Reviewed-by: Yanan Wang 
Tested-by: Yanan Wang 
Acked-by: Eduardo Habkost 
Message-Id: <20211028150521.1973821-3-phi...@redhat.com>
---
 meson.build | 4 +++-
 hw/core/meson.build | 4 ++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/meson.build b/meson.build
index b0927283976..85f1e43dfe6 100644
--- a/meson.build
+++ b/meson.build
@@ -2365,6 +2365,7 @@
 chardev_ss = ss.source_set()
 common_ss = ss.source_set()
 crypto_ss = ss.source_set()
+hwcore_ss = ss.source_set()
 io_ss = ss.source_set()
 linux_user_ss = ss.source_set()
 qmp_ss = ss.source_set()
@@ -2806,7 +2807,8 @@
 
 chardev = declare_dependency(link_whole: libchardev)
 
-libhwcore = static_library('hwcore', sources: hwcore_files + genh,
+hwcore_ss = hwcore_ss.apply(config_host, strict: false)
+libhwcore = static_library('hwcore', sources: hwcore_ss.sources() + genh,
name_suffix: 'fa',
build_by_default: false)
 hwcore = declare_dependency(link_whole: libhwcore)
diff --git a/hw/core/meson.build b/hw/core/meson.build
index 6af4c5c5cbc..cc1ebb8e0f4 100644
--- a/hw/core/meson.build
+++ b/hw/core/meson.build
@@ -1,5 +1,5 @@
 # core qdev-related obj files, also used by *-user and unit tests
-hwcore_files = files(
+hwcore_ss.add(files(
   'bus.c',
   'hotplug.c',
   'qdev-properties.c',
@@ -11,7 +11,7 @@
   'irq.c',
   'clock.c',
   'qdev-clock.c',
-)
+))
 
 common_ss.add(files('cpu-common.c'))
 softmmu_ss.add(when: 'CONFIG_FITLOADER', if_true: files('loader-fit.c'))
-- 
2.31.1

Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-11-01 Thread Christian Schoenebeck

On Thursday, 28 October 2021 11:00:48 CET Stefan Hajnoczi wrote:
> On Mon, Oct 25, 2021 at 05:03:25PM +0200, Christian Schoenebeck wrote:
> > On Monday, 25 October 2021 12:30:41 CEST Stefan Hajnoczi wrote:
> > > On Thu, Oct 21, 2021 at 05:39:28PM +0200, Christian Schoenebeck wrote:
> > > > On Friday, 8 October 2021 18:08:48 CEST Christian Schoenebeck wrote:
> > > > > On Friday, 8 October 2021 16:24:42 CEST Christian Schoenebeck wrote:
> > > > > > On Friday, 8 October 2021 09:25:33 CEST Greg Kurz wrote:
> > > > > > > On Thu, 7 Oct 2021 16:42:49 +0100
> > > > > > > 
> > > > > > > Stefan Hajnoczi  wrote:
> > > > > > > > On Thu, Oct 07, 2021 at 02:51:55PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > On Thursday, 7 October 2021 07:23:59 CEST Stefan Hajnoczi wrote:
> > > > > > > > > > On Mon, Oct 04, 2021 at 09:38:00PM +0200, Christian Schoenebeck wrote:
> > > > > > > > > > > At the moment the maximum transfer size with virtio is limited to 4M
> > > > > > > > > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > > > > > > > > theoretical possible transfer size of 128M (32k pages) according to the
> > > > > > > > > > > virtio specs:
> > > > > > > > > > > 
> > > > > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > > > > > > > > > 
> > > > > > > > > > Hi Christian,
> > > > > > > 
> > > > > > > > > > I took a quick look at the code:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > Thanks Stefan for sharing virtio expertise and helping Christian!
> > > > > > > 
> > > > > > > > > > - The Linux 9p driver restricts descriptor chains to 128 elements
> > > > > > > > > >   (net/9p/trans_virtio.c:VIRTQUEUE_NUM)
> > > > > > > > > 
> > > > > > > > > Yes, that's the limitation that I am about to remove (WIP); current
> > > > > > > > > kernel patches:
> > > > > > > > > https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/
> > > > > > > > 
> > > > > > > > I haven't read the patches yet but I'm concerned that today the driver
> > > > > > > > is pretty well-behaved and this new patch series introduces a spec
> > > > > > > > violation. Not fixing existing spec violations is okay, but adding new
> > > > > > > > ones is a red flag. I think we need to figure out a clean solution.
> > > > > > 
> > > > > > Nobody has reviewed the kernel patches yet. My main concern therefore
> > > > > > actually is that the kernel patches are already too complex, because the
> > > > > > current situation is that only Dominique is handling 9p patches on kernel
> > > > > > side, and he barely has time for 9p anymore.
> > > > > > 
> > > > > > Another reason for me to catch up on reading current kernel code and
> > > > > > stepping in as reviewer of 9p on kernel side ASAP, independent of this
> > > > > > issue.
> > > > > > 
> > > > > > As for current kernel patches' complexity: I can certainly drop patch 7
> > > > > > entirely as it is probably just overkill. Patch 4 is then the biggest
> > > > > > chunk, I have to see if I can simplify it, and whether it would make
> > > > > > sense to squash with patch 3.
> > > > > > 
> > > > > > > > > > - The QEMU 9pfs code passes iovecs directly to preadv(2) and will fail
> > > > > > > > > >   with EINVAL when called with more than IOV_MAX iovecs
> > > > > > > > > >   (hw/9pfs/9p.c:v9fs_read())
> > > > > > > > > 
> > > > > > > > > Hmm, which makes me wonder why I never encountered this error during
> > > > > > > > > testing.
> > > > > > > > > 
> > > > > > > > > Most people will use the 9p qemu 'local' fs driver backend in practice,
> > > > > > > > > so that v9fs_read() call would translate for most people to this
> > > > > > > > > implementation on QEMU side (hw/9p/9p-local.c):
> > > > > > > > > 
> > > > > > > > > static ssize_t local_preadv(FsContext *ctx, V9fsFidOpenState *fs,
> > > > > > > > >                             const struct iovec *iov,
> > > > > > > > >                             int iovcnt, off_t offset)
> >

[PULL 0/2] VFIO update 2021-11-01 (for v6.2)

2021-11-01 Thread Alex Williamson

The following changes since commit af531756d25541a1b3b3d9a14e72e7fedd941a2e:

  Merge remote-tracking branch 'remotes/philmd/tags/renesas-20211030' into 
staging (2021-10-30 11:31:41 -0700)

are available in the Git repository at:

  git://github.com/awilliam/qemu-vfio.git tags/vfio-update-20211101.0

for you to fetch changes up to e4b34708388b20f1ceb55f1d563d8da925a32424:

  vfio/common: Add a trace point when a MMIO RAM section cannot be mapped 
(2021-11-01 12:17:51 -0600)


VFIO update 2021-11-01

 * Re-enable expanded sub-page BAR mappings after migration (Kunkun Jiang)

 * Trace dropped listener sections due to page alignment (Kunkun Jiang)


Kunkun Jiang (2):
  vfio/pci: Add support for mmapping sub-page MMIO BARs after live migration
  vfio/common: Add a trace point when a MMIO RAM section cannot be mapped

 hw/vfio/common.c |  7 +++
 hw/vfio/pci.c| 19 ++-
 2 files changed, 25 insertions(+), 1 deletion(-)

[PATCH] MAINTAINERS: Change status to Odd Fixes

2021-11-01 Thread Cédric Le Goater

I haven't done any Aspeed development for a couple of years now and
maintaining the Aspeed QEMU machines has been a side project since.
I don't have time anymore.

Signed-off-by: Cédric Le Goater 
---
 MAINTAINERS | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 05a3451f115b..193e5c908a29 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1036,12 +1036,11 @@ F: hw/arm/msf2-som.c
 F: docs/system/arm/emcraft-sf2.rst
 
 ASPEED BMCs
-M: Cédric Le Goater 
 M: Peter Maydell 
 R: Andrew Jeffery 
 R: Joel Stanley 
 L: qemu-...@nongnu.org
-S: Maintained
+S: Odd Fixes
 F: hw/*/*aspeed*
 F: hw/misc/pca9552.c
 F: include/hw/*/*aspeed*
-- 
2.31.1

Re: [PATCH v2] hvf: arm: Ignore cache operations on MMIO

2021-11-01 Thread Peter Maydell

On Tue, 26 Oct 2021 at 09:09, Alexander Graf  wrote:
>
> Apple's Hypervisor.Framework forwards cache operations as MMIO traps
> into user space. For MMIO however, these have no meaning: There is no
> cache attached to them.
>
> So let's just treat cache data exits as nops.
>
> This fixes OpenBSD booting as guest.

I agree that "ignore cache maintenance ops" is the right thing
(among other things it's what KVM does in kvm_handle_guest_abort()).

But CM=1 isn't only cache maintenance, it is also set for faults
for address translation instructions. I think (but have not tested
or completely thought through) that before this you also want
   if (S1PTW is set) {
   /*
* Guest has put its page tables not into RAM. We
* can't do anything to retrieve this, so re-inject
* the abort back into the guest.
*/
   inject a data abort with suitable fault info;
   }

Compare the sequence in the KVM code:
https://elixir.bootlin.com/linux/latest/source/arch/arm64/kvm/mmu.c#L1233
where we check S1PTW, then CM, then go for "let userspace do
MMIO emulation".
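
The ordering described above (S1PTW first, then CM, then MMIO emulation) can
be sketched as a small classifier. The bit positions follow the ESR_ELx ISS
layout for data aborts as I understand it (S1PTW is bit 7, CM is bit 8); the
enum and function names are hypothetical and are not taken from QEMU, KVM or
Hypervisor.framework.

```c
#include <assert.h>
#include <stdint.h>

/* ISS bits of a data abort syndrome (assumed from the Arm ARM layout). */
#define ISS_CM    (1u << 8) /* cache maintenance or address translation insn */
#define ISS_S1PTW (1u << 7) /* fault on a stage 1 page table walk */

/* Hypothetical outcomes for the exit handler. */
enum abort_action {
    ABORT_REINJECT,     /* page tables are not in RAM: give the fault back */
    ABORT_SKIP_INSN,    /* cache op on MMIO: treat as a nop, advance the PC */
    ABORT_EMULATE_MMIO, /* ordinary access: let userspace emulate it */
};

/* Check S1PTW before CM, as in the sequence described above. */
static inline enum abort_action classify_data_abort(uint32_t iss)
{
    if (iss & ISS_S1PTW) {
        return ABORT_REINJECT;
    }
    if (iss & ISS_CM) {
        return ABORT_SKIP_INSN;
    }
    return ABORT_EMULATE_MMIO;
}

static inline void classify_data_abort_selfcheck(void)
{
    /* S1PTW wins even when CM is set in the same syndrome. */
    assert(classify_data_abort(ISS_S1PTW | ISS_CM) == ABORT_REINJECT);
}
```

Whether Hypervisor.framework ever delivers an exit with S1PTW set is exactly
the open question in the message, so this only illustrates the check order.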

It's possible that Hypervisor.Framework handles the S1PTW
case for you; you could test with a stunt guest that sets up
the page tables so that the 2nd level page table for a
particular VA range is mapped to an IPA that's not in RAM,
and then try just using that VA and/or passing that VA to
one of the AT instructions, to see whether you get handed
the fault or not. (My bet would be that hvf does not handle
this for you, because in general it seems to prefer to punt
everything.)

-- PMM

Re: [PATCH v2] hvf: arm: Ignore cache operations on MMIO

2021-11-01 Thread Peter Maydell

On Mon, 1 Nov 2021 at 19:28, Richard Henderson
 wrote:
>
> On 10/26/21 3:12 AM, Alexander Graf wrote:
> > Apple's Hypervisor.Framework forwards cache operations as MMIO traps
> > into user space. For MMIO however, these have no meaning: There is no
> > cache attached to them.
> >
> > So let's just treat cache data exits as nops.
> >
> > This fixes OpenBSD booting as guest.
> >
> > Signed-off-by: Alexander Graf 
> > Reported-by: AJ Barris 
> > Reference: https://github.com/utmapp/UTM/issues/3197
> > ---
> >   target/arm/hvf/hvf.c | 7 +++
> >   1 file changed, 7 insertions(+)
>
> Thanks, applied to target-arm.next

...did you see my email saying I think we also need
to test S1PTW ?

-- PMM

[PULL 1/2] vfio/pci: Add support for mmapping sub-page MMIO BARs after live migration

2021-11-01 Thread Alex Williamson

From: Kunkun Jiang 

We can expand MemoryRegions of sub-page MMIO BARs in
vfio_pci_write_config() to improve IO performance for some
devices. However, the MemoryRegions of the destination VM are
no longer expanded after live migration, because their
addresses have already been updated in vmstate_load_state()
(vfio_pci_load_config) and so vfio_sub_page_bar_update_mapping()
will not be called.

This may result in poor performance after live migration.
So iterate BARs in vfio_pci_load_config() and try to update
sub-page BARs.

Reported-by: Nianyao Tang 
Reported-by: Qixin Gan 
Signed-off-by: Kunkun Jiang 
Link: https://lore.kernel.org/r/20211027090406.761-2-jiangkun...@huawei.com
Signed-off-by: Alex Williamson 
---
 hw/vfio/pci.c |   19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5cdf1d4298a7..7b45353ce27f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2453,7 +2453,12 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
 {
 VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
 PCIDevice *pdev = &vdev->pdev;
-int ret;
+pcibus_t old_addr[PCI_NUM_REGIONS - 1];
+int bar, ret;
+
+for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+old_addr[bar] = pdev->io_regions[bar].addr;
+}
 
 ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1);
 if (ret) {
@@ -2463,6 +2468,18 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
 vfio_pci_write_config(pdev, PCI_COMMAND,
   pci_get_word(pdev->config + PCI_COMMAND), 2);
 
+for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+/*
+ * The address may not be changed in some scenarios
+ * (e.g. the VF driver isn't loaded in VM).
+ */
+if (old_addr[bar] != pdev->io_regions[bar].addr &&
+vdev->bars[bar].region.size > 0 &&
+vdev->bars[bar].region.size < qemu_real_host_page_size) {
+vfio_sub_page_bar_update_mapping(pdev, bar);
+}
+}
+
 if (msi_enabled(pdev)) {
 vfio_msi_enable(vdev);
 } else if (msix_enabled(pdev)) {
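
The condition added by the second hunk above can be read as a pure predicate.
The sketch below isolates it so it can be checked without QEMU; the function
and parameter names are hypothetical, and the host page size is passed in
explicitly instead of being read from qemu_real_host_page_size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate mirroring the check in the hunk above: remap only
 * if the BAR address changed across the load and the BAR exists but is
 * smaller than one host page. */
static inline bool bar_needs_subpage_update(uint64_t old_addr, uint64_t new_addr,
                                            uint64_t bar_size,
                                            uint64_t host_page_size)
{
    return old_addr != new_addr &&
           bar_size > 0 &&
           bar_size < host_page_size;
}

static inline void bar_needs_subpage_update_selfcheck(void)
{
    /* A 1 KiB BAR that moved, on a host with 4 KiB pages. */
    assert(bar_needs_subpage_update(0x0, 0xfe000000, 1024, 4096));
    /* Same BAR, address unchanged (e.g. no driver loaded in the guest). */
    assert(!bar_needs_subpage_update(0xfe000000, 0xfe000000, 1024, 4096));
}
```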

Re: [PATCH 00/31] passage: Define a standard for firmware data flow

2021-11-01 Thread Mark Kettenis

> From: François Ozog 
> Date: Mon, 1 Nov 2021 09:53:40 +0100

[...]

> We could further leverage Passage to pass Operating Systems parameters that
> could be removed from device tree (migration of /chosen to Passage). Memory
> inventory would still be in DT but allocations for CMA or GPUs would be in
> Passage. This idea is to reach a point where  device tree is a "pristine"
> hardware description.

I wanted to react on something you said in an earlier thread, but this
discussion seems to be appropriate as well:

The notion that device trees only describe the hardware isn't really
correct.  Device trees have always been used to configure firmware
options (through the /options node) and between firmware and the OS
(through the /chosen node) and to describe firmware interfaces
(e.g. OpenFirmware calls, PSCI (on ARM), RTAS (on POWER)).  This was
the case on the original Open Firmware systems, and is still done on
PowerNV systems that use flattened device trees.

I don't see what the benefits are from using Passage instead.  It
would only fragment things even more.

Re: [PATCH v4 2/6] tests/acceptance: Make pick_default_qemu_bin() more generic

2021-11-01 Thread Willian Rampazzo

On Mon, Sep 27, 2021 at 1:31 PM Philippe Mathieu-Daudé  wrote:
>
> Make pick_default_qemu_bin() generic to find qemu-system or
> qemu-user binaries.
>
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  tests/acceptance/avocado_qemu/__init__.py | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/tests/acceptance/avocado_qemu/__init__.py 
> b/tests/acceptance/avocado_qemu/__init__.py
> index 8fcbed74849..2b9b5dd27fe 100644
> --- a/tests/acceptance/avocado_qemu/__init__.py
> +++ b/tests/acceptance/avocado_qemu/__init__.py
> @@ -52,7 +52,7 @@ def is_readable_executable_file(path):
>  return os.path.isfile(path) and os.access(path, os.R_OK | os.X_OK)
>
>
> -def pick_default_qemu_bin(arch=None):
> +def pick_default_qemu_bin(bin_prefix='qemu-system-', arch=None):
>  """
>  Picks the path of a QEMU binary, starting either in the current working
>  directory or in the source tree root directory.
> @@ -71,7 +71,7 @@ def pick_default_qemu_bin(arch=None):
>  # qemu binary path does not match arch for powerpc, handle it
>  if 'ppc64le' in arch:
>  arch = 'ppc64'
> -qemu_bin_relative_path = "./qemu-system-%s" % arch
> +qemu_bin_relative_path = os.path.join(".", bin_prefix + arch)
>  if is_readable_executable_file(qemu_bin_relative_path):
>  return qemu_bin_relative_path
>
> @@ -185,14 +185,14 @@ def _get_unique_tag_val(self, tag_name):
>  return vals.pop()
>  return None
>
> -def setUp(self):
> +def setUp(self, bin_prefix):
>  self.arch = self.params.get('arch',
>  default=self._get_unique_tag_val('arch'))
>
>  self.cpu = self.params.get('cpu',
> default=self._get_unique_tag_val('cpu'))
>
> -default_qemu_bin = pick_default_qemu_bin(arch=self.arch)
> +default_qemu_bin = pick_default_qemu_bin(bin_prefix, arch=self.arch)
>  self.qemu_bin = self.params.get('qemu_bin',
>  default=default_qemu_bin)
>  if self.qemu_bin is None:
> @@ -220,7 +220,7 @@ class Test(QemuBaseTest):
>  def setUp(self):
>  self._vms = {}
>
> -super(Test, self).setUp()
> +super(Test, self).setUp('qemu-system-')

If you need to change something else in this patch, consider using PEP3135:

super().setUp('qemu-system-')

Anyway,

Reviewed-by: Willian Rampazzo 

>
>  self.machine = self.params.get('machine',
> 
> default=self._get_unique_tag_val('machine'))
> --
> 2.31.1
>

Re: [PATCH v5 06/26] arm: qemu: Add a devicetree file for qemu_arm64

2021-11-01 Thread Tom Rini

On Mon, Nov 01, 2021 at 06:33:35PM +0100, François Ozog wrote:
> Hi Simon
> 
> On Mon, 1 Nov 2021 at 17:58, Simon Glass wrote:
> 
> > Hi Peter,
> >
> > On Mon, 1 Nov 2021 at 04:48, Peter Maydell 
> > wrote:
> > >
> > > On Tue, 26 Oct 2021 at 01:33, Simon Glass  wrote:
> > > >
> > > > Add this file, generated from qemu, so there is a reference devicetree
> > > > in the U-Boot tree.
> > > >
> > > > Signed-off-by: Simon Glass 
> > >
> > > Note that the dtb you get from QEMU is only guaranteed to work if:
> > >  1) you run it on the exact same QEMU version you generated it with
> > >  2) you pass QEMU the exact same command line arguments you used
> > > when you generated it
> >
> > Yes, I certainly understand that. In general this is not safe, but in
> > practice it works well enough for development and CI.
> 
> You recognize that you are hijacking a product directory with a development
> hack facility. There is a test directory to keep things clear. There can be a
> dev-dts or something similar for dev-time tools.
> I have only seen pushback on those fake dts files in the dts directory: I
> guess that unless someone strongly favors a continuation of the discussion,
> you may consider reshaping the proposal to address concerns.

Yes.  We need to document how to make development easier.  But I'm still
not on board with the notion of including DTS files for platforms where
the source of truth for the DTB is elsewhere.  That's going to bring us
a lot more pain.

It is important to make sure our "develop our project" workflow is sane
and relatively pain free.  But that needs to not come by making
sacrifices to the "use our project" outcome.  I would hope for example
that the new Pi zero platform is just dtb changes, as far as the linux
kernel is concerned which means that for rpi_arm64 (which uses run time
dtb) it also just works.  And that's what we want to see.

-- 
Tom


[PATCH v5] tests: qtest: Add virtio-iommu test

2021-11-01 Thread Eric Auger

Add the framework to test the virtio-iommu-pci device
and tests exercising the attach/detach, map/unmap API.

Signed-off-by: Eric Auger 
Tested-by: Jean-Philippe Brucker 
Reviewed-by: Jean-Philippe Brucker 

---

v4 -> v5:
- remove printf and move a comment
- Added Jean-Philippe's T-b and R-b
---
 tests/qtest/libqos/meson.build|   1 +
 tests/qtest/libqos/virtio-iommu.c | 126 
 tests/qtest/libqos/virtio-iommu.h |  40 
 tests/qtest/meson.build   |   1 +
 tests/qtest/virtio-iommu-test.c   | 326 ++
 5 files changed, 494 insertions(+)
 create mode 100644 tests/qtest/libqos/virtio-iommu.c
 create mode 100644 tests/qtest/libqos/virtio-iommu.h
 create mode 100644 tests/qtest/virtio-iommu-test.c

diff --git a/tests/qtest/libqos/meson.build b/tests/qtest/libqos/meson.build
index 1f5c8f1053..ba90bbe2b8 100644
--- a/tests/qtest/libqos/meson.build
+++ b/tests/qtest/libqos/meson.build
@@ -40,6 +40,7 @@ libqos_srcs = files('../libqtest.c',
 'virtio-rng.c',
 'virtio-scsi.c',
 'virtio-serial.c',
+'virtio-iommu.c',
 
 # qgraph machines:
 'aarch64-xlnx-zcu102-machine.c',
diff --git a/tests/qtest/libqos/virtio-iommu.c 
b/tests/qtest/libqos/virtio-iommu.c
new file mode 100644
index 00..18cba4ca36
--- /dev/null
+++ b/tests/qtest/libqos/virtio-iommu.c
@@ -0,0 +1,126 @@
+/*
+ * libqos driver virtio-iommu-pci framework
+ *
+ * Copyright (c) 2021 Red Hat, Inc.
+ *
+ * Authors:
+ *  Eric Auger 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at your
+ * option) any later version.  See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "libqtest.h"
+#include "qemu/module.h"
+#include "qgraph.h"
+#include "virtio-iommu.h"
+#include "hw/virtio/virtio-iommu.h"
+
+static QGuestAllocator *alloc;
+
+/* virtio-iommu-device */
+static void *qvirtio_iommu_get_driver(QVirtioIOMMU *v_iommu,
+  const char *interface)
+{
+if (!g_strcmp0(interface, "virtio-iommu")) {
+return v_iommu;
+}
+if (!g_strcmp0(interface, "virtio")) {
+return v_iommu->vdev;
+}
+
+fprintf(stderr, "%s not present in virtio-iommu-device\n", interface);
+g_assert_not_reached();
+}
+
+static void virtio_iommu_cleanup(QVirtioIOMMU *interface)
+{
+qvirtqueue_cleanup(interface->vdev->bus, interface->vq, alloc);
+}
+
+static void virtio_iommu_setup(QVirtioIOMMU *interface)
+{
+QVirtioDevice *vdev = interface->vdev;
+uint64_t features;
+
+features = qvirtio_get_features(vdev);
+features &= ~(QVIRTIO_F_BAD_FEATURE |
+  (1ull << VIRTIO_RING_F_INDIRECT_DESC) |
+  (1ull << VIRTIO_RING_F_EVENT_IDX) |
+  (1ull << VIRTIO_IOMMU_F_BYPASS));
+qvirtio_set_features(vdev, features);
+interface->vq = qvirtqueue_setup(interface->vdev, alloc, 0);
+qvirtio_set_driver_ok(interface->vdev);
+}
+
+/* virtio-iommu-pci */
+static void *qvirtio_iommu_pci_get_driver(void *object, const char *interface)
+{
+QVirtioIOMMUPCI *v_iommu = object;
+if (!g_strcmp0(interface, "pci-device")) {
+return v_iommu->pci_vdev.pdev;
+}
+return qvirtio_iommu_get_driver(&v_iommu->iommu, interface);
+}
+
+static void qvirtio_iommu_pci_destructor(QOSGraphObject *obj)
+{
+QVirtioIOMMUPCI *iommu_pci = (QVirtioIOMMUPCI *) obj;
+QVirtioIOMMU *interface = &iommu_pci->iommu;
+QOSGraphObject *pci_vobj =  &iommu_pci->pci_vdev.obj;
+
+virtio_iommu_cleanup(interface);
+qvirtio_pci_destructor(pci_vobj);
+}
+
+static void qvirtio_iommu_pci_start_hw(QOSGraphObject *obj)
+{
+QVirtioIOMMUPCI *iommu_pci = (QVirtioIOMMUPCI *) obj;
+QVirtioIOMMU *interface = &iommu_pci->iommu;
+QOSGraphObject *pci_vobj =  &iommu_pci->pci_vdev.obj;
+
+qvirtio_pci_start_hw(pci_vobj);
+virtio_iommu_setup(interface);
+}
+
+
+static void *virtio_iommu_pci_create(void *pci_bus, QGuestAllocator *t_alloc,
+   void *addr)
+{
+QVirtioIOMMUPCI *virtio_rpci = g_new0(QVirtioIOMMUPCI, 1);
+QVirtioIOMMU *interface = &virtio_rpci->iommu;
+QOSGraphObject *obj = &virtio_rpci->pci_vdev.obj;
+
+virtio_pci_init(&virtio_rpci->pci_vdev, pci_bus, addr);
+interface->vdev = &virtio_rpci->pci_vdev.vdev;
+alloc = t_alloc;
+
+obj->get_driver = qvirtio_iommu_pci_get_driver;
+obj->start_hw = qvirtio_iommu_pci_start_hw;
+obj->destructor = qvirtio_iommu_pci_destructor;
+
+return obj;
+}
+
+static void virtio_iommu_register_nodes(void)
+{
+QPCIAddress addr = {
+.devfn = QPCI_DEVFN(4, 0),
+};
+
+QOSGraphEdgeOptions opts = {
+.extra_device_opts = "addr=04.0",
+};
+
+/* virtio-iommu-pci */
+add_qpci_address(&opts, &addr);
+qos_node_create_driver("virtio-iommu-pci", virtio_iommu_pci_create);
+qos_node_consumes("virtio-iommu-pci", "pci-bus", );
+qos_node_produces("virtio-iommu-pci", "pci-device");
+qos_node_produces("virtio-iommu-pci",

Re: [PATCH v2] hvf: arm: Ignore cache operations on MMIO

2021-11-01 Thread Richard Henderson


On 11/1/21 1:55 PM, Peter Maydell wrote:
> On Tue, 26 Oct 2021 at 17:22, Richard Henderson wrote:
> >
> > On 10/26/21 12:12 AM, Alexander Graf wrote:
> > > +if (cm) {
> > > +/* We don't cache MMIO regions */
> > > +advance_pc = true;
> > > +break;
> > > +}
> > > +
> > >    assert(isv);
> >
> > The assert should come first.  If the "iss valid" bit is not set, then
> > nothing else in the word is defined.
>
> No, ISV only indicates that ISS[23:14] is valid; ISS[13:0] (including CM)
> are valid regardless. (The distinction is that the bits which might or
> might not be valid are the ones which encode information about the insn
> necessary to possibly emulate it, like the data access size and the
> source/destination register; the always-present ones are the ones that
> have always been reported for data aborts -- AArch32 DFSR has a CM bit,
> for instance.)


Thanks, that clears up some of my confusion here and a bit further downthread.


r~
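
For readers following the thread, the ordering discussed above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the hvf code from the patch: the helper name classify_data_abort() and the AbortAction enum are hypothetical, and the bit positions (ISV at bit 24, SAS at bits 23:22, SRT at bits 20:16, CM at bit 8, WnR at bit 6) are taken from the architectural data-abort ISS layout.

```c
#include <stdint.h>

/* Bit positions in the data-abort ISS field of ESR_ELx. */
#define ISS_ISV   (1u << 24)   /* set: ISS[23:14] carry valid syndrome info */
#define ISS_CM    (1u << 8)    /* fault came from a cache maintenance op */
#define ISS_WNR   (1u << 6)    /* write, not read */

/* Fields that are only meaningful when ISV is set. */
#define ISS_SAS(iss)  (((iss) >> 22) & 0x3u)    /* access size, log2(bytes) */
#define ISS_SRT(iss)  (((iss) >> 16) & 0x1fu)   /* transfer register number */

/* Hypothetical result type for this sketch. */
typedef enum {
    ABORT_SKIP_INSN,   /* cache op on MMIO: just advance the PC */
    ABORT_EMULATE,     /* ISV set: enough information to emulate the access */
    ABORT_UNHANDLED,   /* ISV clear: the insn would have to be decoded */
} AbortAction;

/* Hypothetical helper: classify a data abort from its ISS field alone. */
static AbortAction classify_data_abort(uint32_t iss)
{
    /* CM sits in ISS[13:0], so it can be tested before looking at ISV. */
    if (iss & ISS_CM) {
        return ABORT_SKIP_INSN;
    }
    /* SAS and SRT must not be consulted unless ISV is set. */
    if (!(iss & ISS_ISV)) {
        return ABORT_UNHANDLED;
    }
    return ABORT_EMULATE;
}
```

Testing CM first mirrors the order in the quoted hunk; an ISV check placed ahead of it would reject cache maintenance faults that report ISV as 0.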

Re: [PATCH v4 4/6] tests/acceptance: Share useful helpers from virtiofs_submounts test

2021-11-01 Thread Willian Rampazzo

On Mon, Sep 27, 2021 at 1:31 PM Philippe Mathieu-Daudé  wrote:
>
> Move the useful has_cmd()/has_cmds() helpers from the virtiofs
> test to the avocado_qemu public class.
>
> Reviewed-by: Wainer dos Santos Moschetta 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  tests/acceptance/avocado_qemu/__init__.py | 57 ++
>  tests/acceptance/virtiofs_submounts.py| 59 +--
>  2 files changed, 59 insertions(+), 57 deletions(-)
>

Reviewed-by: Willian Rampazzo

[PULL 10/10] machine: remove the done notifier for dynamic sysbus device type check

2021-11-01 Thread Philippe Mathieu-Daudé

From: Damien Hedde 

Now that we check sysbus device types during device creation, we
can remove the check in the machine init done notifier.
This was the only thing done by this notifier, so we remove the
whole sysbus_notifier structure of the MachineState.

Note: This notifier was checking all /peripheral and /peripheral-anon
sysbus devices. Now we only check those added by -device cli option or
device_add qmp command when handling the command/option. So if there
are some devices added in one of these containers manually (eg in
machine C code), these will not be checked anymore.
This use case does not seem to appear apart from
hw/xen/xen-legacy-backend.c (it uses qdev_set_id() and in this case,
not for a sysbus device, so it's ok).

Signed-off-by: Damien Hedde 
Acked-by: Alistair Francis 
Reviewed-by: Philippe Mathieu-Daudé 
Acked-by: Eduardo Habkost 
Message-Id: <20211029142258.484907-4-damien.he...@greensocs.com>
Signed-off-by: Philippe Mathieu-Daudé 
---
 include/hw/boards.h |  1 -
 hw/core/machine.c   | 27 ---
 2 files changed, 28 deletions(-)

diff --git a/include/hw/boards.h b/include/hw/boards.h
index 602993bd5ab..9c1c1901046 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -317,7 +317,6 @@ typedef struct CpuTopology {
 struct MachineState {
 /*< private >*/
 Object parent_obj;
-Notifier sysbus_notifier;
 
 /*< public >*/
 
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 7c4004ac5a0..e24e3e27dbe 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -572,18 +572,6 @@ bool device_type_is_dynamic_sysbus(MachineClass *mc, const char *type)
 return allowed;
 }
 
-static void validate_sysbus_device(SysBusDevice *sbdev, void *opaque)
-{
-MachineState *machine = opaque;
-MachineClass *mc = MACHINE_GET_CLASS(machine);
-
-if (!device_is_dynamic_sysbus(mc, DEVICE(sbdev))) {
-error_report("Option '-device %s' cannot be handled by this machine",
- object_class_get_name(object_get_class(OBJECT(sbdev))));
-exit(1);
-}
-}
-
 static char *machine_get_memdev(Object *obj, Error **errp)
 {
 MachineState *ms = MACHINE(obj);
@@ -599,17 +587,6 @@ static void machine_set_memdev(Object *obj, const char *value, Error **errp)
 ms->ram_memdev_id = g_strdup(value);
 }
 
-static void machine_init_notify(Notifier *notifier, void *data)
-{
-MachineState *machine = MACHINE(qdev_get_machine());
-
-/*
- * Loop through all dynamically created sysbus devices and check if they are
- * all allowed.  If a device is not allowed, error out.
- */
-foreach_dynamic_sysbus_device(validate_sysbus_device, machine);
-}
-
 HotpluggableCPUList *machine_query_hotpluggable_cpus(MachineState *machine)
 {
 int i;
@@ -949,10 +926,6 @@ static void machine_initfn(Object *obj)
 "Table (HMAT)");
 }
 
-/* Register notifier when init is done for sysbus sanity checks */
-ms->sysbus_notifier.notify = machine_init_notify;
-qemu_add_machine_init_done_notifier(&ms->sysbus_notifier);
-
 /* default to mc->default_cpus */
 ms->smp.cpus = mc->default_cpus;
 ms->smp.max_cpus = mc->default_cpus;
-- 
2.31.1

[PULL 06/10] hw/core/machine: Split out the smp parsing code

2021-11-01 Thread Philippe Mathieu-Daudé

From: Yanan Wang 

We are going to introduce a unit test for the parser smp_parse()
in hw/core/machine.c, but machine.c is currently built only for softmmu.

In order to solve the build dependency on the smp parsing code and
avoid building unrelated stuff for the unit tests, move the tested
code from machine.c into a separate file, i.e., machine-smp.c and
build it in common field.

Signed-off-by: Yanan Wang 
Reviewed-by: Andrew Jones 
Reviewed-by: Philippe Mathieu-Daudé 
Tested-by: Philippe Mathieu-Daudé 
Message-Id: <20211026034659.22040-2-wangyana...@huawei.com>
Acked-by: Eduardo Habkost 
Signed-off-by: Philippe Mathieu-Daudé 
---
 include/hw/boards.h   |   1 +
 hw/core/machine-smp.c | 181 ++
 hw/core/machine.c | 159 -
 MAINTAINERS   |   1 +
 hw/core/meson.build   |   1 +
 5 files changed, 184 insertions(+), 159 deletions(-)
 create mode 100644 hw/core/machine-smp.c

diff --git a/include/hw/boards.h b/include/hw/boards.h
index 5adbcbb99b1..e36fc7d8615 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -34,6 +34,7 @@ HotpluggableCPUList *machine_query_hotpluggable_cpus(MachineState *machine);
 void machine_set_cpu_numa_node(MachineState *machine,
const CpuInstanceProperties *props,
Error **errp);
+void smp_parse(MachineState *ms, SMPConfiguration *config, Error **errp);
 
 /**
  * machine_class_allow_dynamic_sysbus_dev: Add type to list of valid devices
diff --git a/hw/core/machine-smp.c b/hw/core/machine-smp.c
new file mode 100644
index 000..116a0cbbfab
--- /dev/null
+++ b/hw/core/machine-smp.c
@@ -0,0 +1,181 @@
+/*
+ * QEMU Machine core (related to -smp parsing)
+ *
+ * Copyright (c) 2021 Huawei Technologies Co., Ltd
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License,
+ * or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "hw/boards.h"
+#include "qapi/error.h"
+
+/*
+ * Report information of a machine's supported CPU topology hierarchy.
+ * Topology members will be ordered from the largest to the smallest
+ * in the string.
+ */
+static char *cpu_hierarchy_to_string(MachineState *ms)
+{
+MachineClass *mc = MACHINE_GET_CLASS(ms);
+GString *s = g_string_new(NULL);
+
+g_string_append_printf(s, "sockets (%u)", ms->smp.sockets);
+
+if (mc->smp_props.dies_supported) {
+g_string_append_printf(s, " * dies (%u)", ms->smp.dies);
+}
+
+g_string_append_printf(s, " * cores (%u)", ms->smp.cores);
+g_string_append_printf(s, " * threads (%u)", ms->smp.threads);
+
+return g_string_free(s, false);
+}
+
+/*
+ * smp_parse - Generic function used to parse the given SMP configuration
+ *
+ * Any missing parameter in "cpus/maxcpus/sockets/cores/threads" will be
+ * automatically computed based on the provided ones.
+ *
+ * In the calculation of omitted sockets/cores/threads: we prefer sockets
+ * over cores over threads before 6.2, while preferring cores over sockets
+ * over threads since 6.2.
+ *
+ * In the calculation of cpus/maxcpus: When both maxcpus and cpus are omitted,
+ * maxcpus will be computed from the given parameters and cpus will be set
+ * equal to maxcpus. When only one of maxcpus and cpus is given then the
+ * omitted one will be set to its given counterpart's value. Both maxcpus and
+ * cpus may be specified, but maxcpus must be equal to or greater than cpus.
+ *
+ * For compatibility, apart from the parameters that will be computed, newly
+ * introduced topology members which are likely to be target specific should
+ * be directly set as 1 if they are omitted (e.g. dies for PC since 4.1).
+ */
+void smp_parse(MachineState *ms, SMPConfiguration *config, Error **errp)
+{
+MachineClass *mc = MACHINE_GET_CLASS(ms);
+unsigned cpus= config->has_cpus ? config->cpus : 0;
+unsigned sockets = config->has_sockets ? config->sockets : 0;
+unsigned dies= config->has_dies ? config->dies : 0;
+unsigned cores   = config->has_cores ? config->cores : 0;
+unsigned threads = config->has_threads ? config->threads : 0;
+unsigned maxcpus = config->has_maxcpus ? config->maxcpus : 0;
+
+/*
+ * Specified CPU topology parameters must be greater than zero,
+ * explicit configuration like "cpus=0" is not allowed.
+ */
+if ((config->has_cpus && config->cpus == 0) ||
+
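
The quoted patch is cut off at this point in the archive. As a reading aid, the rules stated in the smp_parse() comment above (omitted members are computed from the given ones, cores are preferred over sockets over threads since 6.2, dies defaults to 1) could be sketched as follows. This is a simplified stand-alone illustration, not the code from machine-smp.c: Topology, or_one() and topology_complete() are hypothetical names, and a value of 0 stands for "omitted".

```c
#include <stdbool.h>

/* Hypothetical stand-in for the parsed -smp values; 0 means "omitted". */
typedef struct {
    unsigned cpus, sockets, dies, cores, threads, maxcpus;
} Topology;

/* Replace an omitted (zero) member with 1. */
static unsigned or_one(unsigned v)
{
    return v ? v : 1;
}

/*
 * Fill in omitted members. prefer_sockets selects the pre-6.2 ordering.
 * Returns false if the completed topology is inconsistent.
 */
static bool topology_complete(Topology *t, bool prefer_sockets)
{
    unsigned total, product;

    /* Target-specific members such as dies default to 1 when omitted. */
    t->dies = or_one(t->dies);

    if (t->cpus == 0 && t->maxcpus == 0) {
        /* No total to divide up: every omitted member becomes 1. */
        t->sockets = or_one(t->sockets);
        t->cores = or_one(t->cores);
        t->threads = or_one(t->threads);
    } else {
        total = t->maxcpus ? t->maxcpus : t->cpus;

        if (prefer_sockets) {
            if (t->sockets == 0) {
                t->cores = or_one(t->cores);
                t->threads = or_one(t->threads);
                t->sockets = total / (t->dies * t->cores * t->threads);
            } else if (t->cores == 0) {
                t->threads = or_one(t->threads);
                t->cores = total / (t->sockets * t->dies * t->threads);
            }
        } else {
            if (t->cores == 0) {
                t->sockets = or_one(t->sockets);
                t->threads = or_one(t->threads);
                t->cores = total / (t->sockets * t->dies * t->threads);
            } else if (t->sockets == 0) {
                t->threads = or_one(t->threads);
                t->sockets = total / (t->dies * t->cores * t->threads);
            }
        }

        /* Threads are computed last, only if still omitted. */
        if (t->threads == 0) {
            t->threads = total / (t->sockets * t->dies * t->cores);
        }
    }

    product = t->sockets * t->dies * t->cores * t->threads;
    if (t->maxcpus == 0) {
        t->maxcpus = product;
    }
    if (t->cpus == 0) {
        t->cpus = t->maxcpus;
    }

    /* The hierarchy must multiply out to maxcpus, and cpus must fit. */
    return product == t->maxcpus && t->cpus <= t->maxcpus;
}
```

With only cpus=8 given, this sketch yields 1 socket with 8 cores under the newer ordering and 8 sockets with 1 core each under the older one.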

[PULL 04/10] hw/core: Extract hotplug-related functions to qdev-hotplug.c

2021-11-01 Thread Philippe Mathieu-Daudé

Signed-off-by: Philippe Mathieu-Daudé 
Reviewed-by: Yanan Wang 
Acked-by: Eduardo Habkost 
Message-Id: <20211028150521.1973821-4-phi...@redhat.com>
---
 hw/core/qdev-hotplug.c | 73 ++
 hw/core/qdev.c | 60 --
 hw/core/meson.build|  1 +
 3 files changed, 74 insertions(+), 60 deletions(-)
 create mode 100644 hw/core/qdev-hotplug.c

diff --git a/hw/core/qdev-hotplug.c b/hw/core/qdev-hotplug.c
new file mode 100644
index 000..d495d0e9c70
--- /dev/null
+++ b/hw/core/qdev-hotplug.c
@@ -0,0 +1,73 @@
+/*
+ * QDev Hotplug handlers
+ *
+ * Copyright (c) Red Hat
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "hw/qdev-core.h"
+#include "hw/boards.h"
+
+HotplugHandler *qdev_get_machine_hotplug_handler(DeviceState *dev)
+{
+MachineState *machine;
+MachineClass *mc;
+Object *m_obj = qdev_get_machine();
+
+if (object_dynamic_cast(m_obj, TYPE_MACHINE)) {
+machine = MACHINE(m_obj);
+mc = MACHINE_GET_CLASS(machine);
+if (mc->get_hotplug_handler) {
+return mc->get_hotplug_handler(machine, dev);
+}
+}
+
+return NULL;
+}
+
+bool qdev_hotplug_allowed(DeviceState *dev, Error **errp)
+{
+MachineState *machine;
+MachineClass *mc;
+Object *m_obj = qdev_get_machine();
+
+if (object_dynamic_cast(m_obj, TYPE_MACHINE)) {
+machine = MACHINE(m_obj);
+mc = MACHINE_GET_CLASS(machine);
+if (mc->hotplug_allowed) {
+return mc->hotplug_allowed(machine, dev, errp);
+}
+}
+
+return true;
+}
+
+HotplugHandler *qdev_get_bus_hotplug_handler(DeviceState *dev)
+{
+if (dev->parent_bus) {
+return dev->parent_bus->hotplug_handler;
+}
+return NULL;
+}
+
+HotplugHandler *qdev_get_hotplug_handler(DeviceState *dev)
+{
+HotplugHandler *hotplug_ctrl = qdev_get_machine_hotplug_handler(dev);
+
+if (hotplug_ctrl == NULL && dev->parent_bus) {
+hotplug_ctrl = qdev_get_bus_hotplug_handler(dev);
+}
+return hotplug_ctrl;
+}
+
+/* can be used as ->unplug() callback for the simple cases */
+void qdev_simple_device_unplug_cb(HotplugHandler *hotplug_dev,
+  DeviceState *dev, Error **errp)
+{
+qdev_unrealize(dev);
+}
diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index 1e38f997245..84f3019440f 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -33,7 +33,6 @@
 #include "qapi/visitor.h"
 #include "qemu/error-report.h"
 #include "qemu/option.h"
-#include "hw/hotplug.h"
 #include "hw/irq.h"
 #include "hw/qdev-properties.h"
 #include "hw/boards.h"
@@ -238,58 +237,6 @@ void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id,
 dev->alias_required_for_version = required_for_version;
 }
 
-HotplugHandler *qdev_get_machine_hotplug_handler(DeviceState *dev)
-{
-MachineState *machine;
-MachineClass *mc;
-Object *m_obj = qdev_get_machine();
-
-if (object_dynamic_cast(m_obj, TYPE_MACHINE)) {
-machine = MACHINE(m_obj);
-mc = MACHINE_GET_CLASS(machine);
-if (mc->get_hotplug_handler) {
-return mc->get_hotplug_handler(machine, dev);
-}
-}
-
-return NULL;
-}
-
-bool qdev_hotplug_allowed(DeviceState *dev, Error **errp)
-{
-MachineState *machine;
-MachineClass *mc;
-Object *m_obj = qdev_get_machine();
-
-if (object_dynamic_cast(m_obj, TYPE_MACHINE)) {
-machine = MACHINE(m_obj);
-mc = MACHINE_GET_CLASS(machine);
-if (mc->hotplug_allowed) {
-return mc->hotplug_allowed(machine, dev, errp);
-}
-}
-
-return true;
-}
-
-HotplugHandler *qdev_get_bus_hotplug_handler(DeviceState *dev)
-{
-if (dev->parent_bus) {
-return dev->parent_bus->hotplug_handler;
-}
-return NULL;
-}
-
-HotplugHandler *qdev_get_hotplug_handler(DeviceState *dev)
-{
-HotplugHandler *hotplug_ctrl = qdev_get_machine_hotplug_handler(dev);
-
-if (hotplug_ctrl == NULL && dev->parent_bus) {
-hotplug_ctrl = qdev_get_bus_hotplug_handler(dev);
-}
-return hotplug_ctrl;
-}
-
 static int qdev_prereset(DeviceState *dev, void *opaque)
 {
 trace_qdev_reset_tree(dev, object_get_typename(OBJECT(dev)));
@@ -371,13 +318,6 @@ static void device_reset_child_foreach(Object *obj, ResettableChildCallback cb,
 }
 }
 
-/* can be used as ->unplug() callback for the simple cases */
-void qdev_simple_device_unplug_cb(HotplugHandler *hotplug_dev,
-  DeviceState *dev, Error **errp)
-{
-qdev_unrealize(dev);
-}
-
 bool qdev_realize(DeviceState *dev, BusState *bus, Error **errp)
 {
 assert(!dev->realized && !dev->parent_bus);
diff --git a/hw/core/meson.build b/hw/core/meson.build
index cc1ebb8e0f4..c9fe6441d92 100644
--- a/hw/core/meson.build
+++

[PULL 08/10] machine: add device_type_is_dynamic_sysbus function

2021-11-01 Thread Philippe Mathieu-Daudé

From: Damien Hedde 

Right now the allowance check for adding a sysbus device using
-device cli option (or device_add qmp command) is done well after
the device has been created. It is done during the machine init done
notifier: machine_init_notify() in hw/core/machine.c

This new function will allow us to do the check at the right time and
issue an error if it fails.

Also make device_is_dynamic_sysbus() use the new function.

Signed-off-by: Damien Hedde 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Alistair Francis 
Acked-by: Eduardo Habkost 
Message-Id: <20211029142258.484907-2-damien.he...@greensocs.com>
Signed-off-by: Philippe Mathieu-Daudé 
---
 include/hw/boards.h | 15 +++
 hw/core/machine.c   | 13 ++---
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/include/hw/boards.h b/include/hw/boards.h
index e36fc7d8615..602993bd5ab 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -52,6 +52,21 @@ void smp_parse(MachineState *ms, SMPConfiguration *config, Error **errp);
  */
 void machine_class_allow_dynamic_sysbus_dev(MachineClass *mc, const char *type);
 
+/**
+ * device_type_is_dynamic_sysbus: Check if type is an allowed sysbus device
+ * type for the machine class.
+ * @mc: Machine class
+ * @type: type to check (should be a subtype of TYPE_SYS_BUS_DEVICE)
+ *
+ * Returns: true if @type is a type in the machine's list of
+ * dynamically pluggable sysbus devices; otherwise false.
+ *
+ * Check if the QOM type @type is in the list of allowed sysbus device
+ * types (see machine_class_allowed_dynamic_sysbus_dev()).
+ * Note that if @type has a parent type in the list, it is allowed too.
+ */
+bool device_type_is_dynamic_sysbus(MachineClass *mc, const char *type);
+
 /**
  * device_is_dynamic_sysbus: test whether device is a dynamic sysbus device
  * @mc: Machine class
diff --git a/hw/core/machine.c b/hw/core/machine.c
index dc15f5f9e5c..7c4004ac5a0 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -548,18 +548,25 @@ void machine_class_allow_dynamic_sysbus_dev(MachineClass *mc, const char *type)
 
 bool device_is_dynamic_sysbus(MachineClass *mc, DeviceState *dev)
 {
-bool allowed = false;
-strList *wl;
 Object *obj = OBJECT(dev);
 
 if (!object_dynamic_cast(obj, TYPE_SYS_BUS_DEVICE)) {
 return false;
 }
 
+return device_type_is_dynamic_sysbus(mc, object_get_typename(obj));
+}
+
+bool device_type_is_dynamic_sysbus(MachineClass *mc, const char *type)
+{
+bool allowed = false;
+strList *wl;
+ObjectClass *klass = object_class_by_name(type);
+
 for (wl = mc->allowed_dynamic_sysbus_devices;
  !allowed && wl;
  wl = wl->next) {
-allowed |= !!object_dynamic_cast(obj, wl->value);
+allowed |= !!object_class_dynamic_cast(klass, wl->value);
 }
 
 return allowed;
-- 
2.31.1
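
The type-based check introduced above accepts a type if it, or any of its parent types, is on the machine's allow list. A minimal stand-alone sketch of that idea, without QOM, might look like the following. Everything here is hypothetical and for illustration only: TypeInfoLite, the "demo-*" type names, type_is_a() and type_is_allowed() are not QEMU APIs; in the patch the parent walk is done by object_class_dynamic_cast().

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature type registry: each type names its parent. */
typedef struct {
    const char *name;
    const char *parent;   /* NULL for a root type */
} TypeInfoLite;

static const TypeInfoLite type_table[] = {
    { "device",         NULL },
    { "sys-bus-device", "device" },
    { "demo-platform",  "sys-bus-device" },
    { "demo-nic",       "demo-platform" },
    { "demo-fb",        "sys-bus-device" },
};

/* Find a type by name; returns NULL if it is not registered. */
static const TypeInfoLite *type_lookup(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(type_table) / sizeof(type_table[0]); i++) {
        if (strcmp(type_table[i].name, name) == 0) {
            return &type_table[i];
        }
    }
    return NULL;
}

/* True if type is ancestor itself or inherits from it. */
static bool type_is_a(const char *type, const char *ancestor)
{
    const TypeInfoLite *t = type_lookup(type);

    while (t) {
        if (strcmp(t->name, ancestor) == 0) {
            return true;
        }
        t = t->parent ? type_lookup(t->parent) : NULL;
    }
    return false;
}

/* True if type matches any entry of the allow list, directly or via a parent. */
static bool type_is_allowed(const char *const *allowed, size_t n,
                            const char *type)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (type_is_a(type, allowed[i])) {
            return true;
        }
    }
    return false;
}
```

Because the check needs only a type name, it can run before any device object exists, which is the point of moving it out of the init-done notifier.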
