Re: [PATCH-for-9.0 04/10] hw/xen: Factor xen_arch_align_ioreq_data() out of handle_ioreq()

2023-11-13 Thread Philippe Mathieu-Daudé

On 13/11/23 18:11, David Woodhouse wrote:

On Mon, 2023-11-13 at 17:09 +0100, Philippe Mathieu-Daudé wrote:

On 13/11/23 16:58, Woodhouse, David wrote:

On 13 Nov 2023 10:22, Philippe Mathieu-Daudé wrote:

     Per commit f17068c1c7 ("xen-hvm: reorganize xen-hvm and move common
     function to xen-hvm-common"), handle_ioreq() is expected to be
     target-agnostic. However it uses 'target_ulong', which is a
     target-specific definition.

     In order to compile this file once for all targets, factor the
     target-specific code out of handle_ioreq() as a per-target handler
     called xen_arch_align_ioreq_data().

     Signed-off-by: Philippe Mathieu-Daudé 
     ---
     Should we have an 'unsigned qemu_target_long_bits();' helper,
     such as the qemu_target_page_foo() API and target_words_bigendian()?


It can be more fun than that though. What about
qemu_target_alignof_uint64() for example, which differs between i386 and
x86_64 and causes even structs with *explicitly* sized fields to differ
because of padding.

I'd *love* to see this series as a step towards my fantasy of being able
to support Xen under TCG. After all, without that what's the point in
being target-agnostic?


Another win is that we build all these files once instead of once for
each of the i386/x86_64/aarch64 targets, so we save CI time and Amazon
trees.


However, I am mildly concerned that some of these files are
accidentally
using the host ELF ABI, perhaps with explicit management of 32-bit
compatibility, and the target-agnosticity is purely an illusion?

See the "protocol" handling and the three ABIs for the ring in
xen-block, for example.


If so I'd expect build failures or violent runtime assertions.


Heh, mostly the guest just crashes in the cases I've seen so far.

See commit a1c1082908d ("hw/xen: use correct default protocol for xen-
block on x86").


Reviewing quickly hw/block/dataplane/xen-block.c, this code doesn't
seem target-specific at all IMHO. Otherwise I'd really expect it to
fail compiling. But I don't know much about Xen, so I'll let the block
& Xen experts have a look.


Where it checks dataplane->protocol and does different things for
BLKIF_PROTOCOL_NATIVE/BLKIF_PROTOCOL_X86_32/BLKIF_PROTOCOL_X86_64, the
*structures* it uses are intended to be using the correct ABI. I think
the structs for BLKIF_PROTOCOL_NATIVE may actually be *different*
according to the target, in theory?


OK I see what you mean, blkif_back_rings_t union in hw/block/xen_blkif.h

These structures shouldn't differ between targets, this is the point of
an ABI :) And if they were, they wouldn't compile as target agnostic.


I don't know that they are *correct* right now, if the host is
different from the target. But that's just a bug (that only matters if
we ever want to support Xen-compatible guests using TCG).


Can we be explicit about what's expected to work here and what's
not in scope?


What do you mean? Everything is expected to work like without this
series applied :)


I think that if we ever do support Xen-compatible guests using TCG,
we'll have to fix that bug and use the right target-specific
structures... and then perhaps we'll want the affected files to
actually become target-specific again?

I think this series makes it look like target-agnostic support *should*
work... but it doesn't really?


For testing we have:

aarch64: tests/avocado/boot_xen.py
x86_64: tests/avocado/kvm_xen_guest.py

No combination with i386 is tested,
Xen within aarch64 KVM is not tested (not sure it works).



Re: [PATCH-for-9.0 08/10] system/physmem: Only include 'hw/xen/xen.h' when Xen is available

2023-11-13 Thread Philippe Mathieu-Daudé

On 13/11/23 21:03, David Woodhouse wrote:

On Mon, 2023-11-13 at 16:21 +0100, Philippe Mathieu-Daudé wrote:

"hw/xen/xen.h" contains declarations for Xen hardware. There is
no point including it when Xen is not available.


... or even when Xen *is* available, AFAICT. Can you just remove the
inclusion of hw/xen/xen.h entirely? I think that still builds, at least
for x86.


Yep, also on aarch64, thanks!


  When Xen is not
available, the declarations in "sysemu/xen.h" are enough.


... and system/xen-mapcache.h


Signed-off-by: Philippe Mathieu-Daudé 







Re: [PATCH-for-9.0 04/10] hw/xen: Factor xen_arch_align_ioreq_data() out of handle_ioreq()

2023-11-13 Thread Philippe Mathieu-Daudé

On 13/11/23 19:16, Richard Henderson wrote:

On 11/13/23 07:21, Philippe Mathieu-Daudé wrote:

diff --git a/hw/xen/xen-hvm-common.c b/hw/xen/xen-hvm-common.c
index c028c1b541..03f9417e7e 100644
--- a/hw/xen/xen-hvm-common.c
+++ b/hw/xen/xen-hvm-common.c
@@ -426,10 +426,7 @@ static void handle_ioreq(XenIOState *state, ioreq_t *req)
  trace_handle_ioreq(req, req->type, req->dir, req->df, req->data_is_ptr,
 req->addr, req->data, req->count, req->size);
-    if (!req->data_is_ptr && (req->dir == IOREQ_WRITE) &&
-    (req->size < sizeof (target_ulong))) {
-    req->data &= ((target_ulong) 1 << (8 * req->size)) - 1;
-    }



I suspect this should never have been using target_ulong at all: 
req->data is uint64_t.


This could replace it:

-- >8 --
-if (!req->data_is_ptr && (req->dir == IOREQ_WRITE) &&
-(req->size < sizeof (target_ulong))) {
-req->data &= ((target_ulong) 1 << (8 * req->size)) - 1;
+if (!req->data_is_ptr && (req->dir == IOREQ_WRITE)) {
+req->data = extract64(req->data, 0, BITS_PER_BYTE * req->size);
 }
---

Some notes while looking at this.

Per xen/include/public/hvm/ioreq.h header:

#define IOREQ_TYPE_PIO  0 /* pio */
#define IOREQ_TYPE_COPY 1 /* mmio ops */
#define IOREQ_TYPE_PCI_CONFIG   2
#define IOREQ_TYPE_VMWARE_PORT  3
#define IOREQ_TYPE_TIMEOFFSET   7
#define IOREQ_TYPE_INVALIDATE   8 /* mapcache */

  struct ioreq {
uint64_t addr;  /* physical address */
uint64_t data;  /* data (or paddr of data) */
uint32_t count; /* for rep prefixes */
uint32_t size;  /* size in bytes */
uint32_t vp_eport;  /* evtchn for notifications to/from device model */

uint16_t _pad0;
uint8_t state:4;
uint8_t data_is_ptr:1;  /* if 1, data above is the guest paddr
 * of the real data to use. */
uint8_t dir:1;  /* 1=read, 0=write */
uint8_t df:1;
uint8_t _pad1:1;
uint8_t type;   /* I/O type */
  };
  typedef struct ioreq ioreq_t;

If 'data' is not a pointer, it is a u64.

- In PIO / VMWARE_PORT modes, only 32 bits are used.

- In MMIO COPY mode, memory is accessed in chunks of 64 bits.

- In PCI_CONFIG mode, accesses are u8, u16 or u32.

- None of TIMEOFFSET / INVALIDATE use 'req'.

- Fallback is only used in x86 for VMWARE_PORT.

--

Regards,

Phil.



Re: [PATCH 0/2] Replace anti-social QOM type names (again)

2023-11-13 Thread Markus Armbruster
Cc: the other QOM maintainers

Daniel P. Berrangé  writes:

> On Mon, Nov 13, 2023 at 02:43:42PM +0100, Markus Armbruster wrote:
>> We got rid of QOM type names containing ',' in 6.0, but some have
>> crept back in.  Replace them just like we did in 6.0.
>
> It is practical to add
>
>assert(strchr(name, ',') == NULL)
>
> to some place in QOM to stop them coming back yet again ?

This adds a naming rule to QOM.  Right now, QOM has none whatsoever,
which I've long called out as a mistake.

I'm all for correcting that mistake, but I'd go further than just
outlawing ','.

Discussed in more depth here:

>> Cover letter of 6.0's replacement:
>> https://lore.kernel.org/qemu-devel/20210304140229.575481-1-arm...@redhat.com/

Let me copy the text for convenience.

QAPI has naming rules.  docs/devel/qapi-code-gen.txt:

=== Naming rules and reserved names ===

All names must begin with a letter, and contain only ASCII letters,
digits, hyphen, and underscore.  There are two exceptions: enum values
may start with a digit, and names that are downstream extensions (see
section Downstream extensions) start with underscore.

[More on reserved names, upper vs. lower case, '-' vs. '_'...]

The generator enforces the rules.

Naming rules help in at least three ways:

1. They help with keeping names in interfaces consistent and
   predictable.

2. They make avoiding collisions with the users' names in the
   generator simpler.

3. They enable quote-less, evolvable syntax.

   For instance, keyval_parse() syntax consists of names, values, and
   special characters ',', '=', '.'

   Since names cannot contain special characters, there is no need for
   quoting[*].  Simple.

   Values are unrestricted, but only ',' is special there.  We quote
   it by doubling.

   Together, we get exactly the same quoting as in QemuOpts.  This is
   a feature.

   If we ever decide to extend key syntax, we have plenty of special
   characters to choose from.  This is also a feature.

   Both features rely on naming rules.

QOM has no naming rules whatsoever.  Actual names aren't nearly as bad
as they could be.  Still, there are plenty of "funny" names.  This may
become a problem when we

* Switch from QemuOpts to keyval_parse()

  Compared to QemuOpts, keyval_parse() restricts *keys*, but not
  *values*.

  "Funny" type names occurring as values are no worse than before:
  quoting issues, described below.

  Type names occurring in keys must be valid QAPI names.  Should be
  avoidable.

* QAPIfy (the compile-time static parts of) QOM

  QOM type names become QAPI enum values.  They must conform to QAPI
  enum naming rules.

[...]

One more thing on relaxing QAPI naming rules.  QAPI names get mapped
to (parts of) C identifiers.  These mappings are not injective.  The
basic mapping is simple: replace characters other than letters and
digits by '_'.

This means distinct QAPI names can clash in C.  Fairly harmless
when the only "other" characters are '-' and '_'.  The more "others" we
permit, the more likely confusing clashes become.  Not a show stopper,
"merely" an issue of ergonomics.




Re: Configuring migration

2023-11-13 Thread Markus Armbruster
Cc: Paolo for QOM expertise.

Peter Xu  writes:

> On Thu, Nov 02, 2023 at 03:25:25PM +0100, Markus Armbruster wrote:

[...]

>> Migration has its own idiosyncratic configuration interface, even though
>> its configuration needs are not special at all.  This is due to a long
>> history of decisions that made sense at the time.
>> 
>> What kind of interface would we choose if we could start over now?
>> 
>> Let's have a look at what I consider the two most complex piece of
>> configuration to date, namely block backends and QOM objects.
>> 
>> In both cases, configuration is a QAPI object type: BlockdevOptions and
>> ObjectOptions.
>> 
>> The common members are the configuration common to all block backends /
>> objects.  One of them is the type of block backend ("driver" in block
>> parlance) or QOM object ("qom-type").
>> 
>> A type's variant members are the configuration specific to that type.
>> 
>> This is suitably expressive.
>> 
>> We create a state object for a given configuration object with
>> blockdev-add / object-add.
>> 
>> For block devices, we even have a way to modify a state object's
>> configuration: blockdev-reopen.  For QOM objects, there's qom-set, but I
>> don't expect that to work in the general case.  Where "not work" can
>> range from "does nothing" to "explodes".
>> 
>> Now let's try to apply this to migration.
>> 
>> As long as we can have just one migration, we need just one QAPI object
>> to configure it.
>> 
>> We could create the object with -object / object_add.  For convenience,
>> we'd probably want to create one with default configuration
>> automatically on demand.
>> 
>> We could use qom-set to change configuration.  If we're not comfortable
>> with using qom-set for production, we could do something like
>> blockdev-reopen instead.
>> 
>> Could we move towards such a design?  Turn the existing ad hoc interface
>> into compatibility sugar for it?
>
> Sounds doable to me.
>
> I'm not familiar with BlockdevOptions; it looks like something set up once
> and for all, where all relevant parameters need to be set in the same request?

Yes, but you can "reopen", which replaces the entire configuration.

blockdev-add creates a new block backend device, and blockdev-reopen
reopens a set of existing ones.  Both take the same arguments for each
device.

> Migration will require each cap/parameter to be set separately anytime,
> e.g., the user can adjust downtime / bandwidth even during migration in
> progress.

"Replace entire configuration" isn't a good fit then, because users
would have to repeat the entire configuration just to tweak one thing.

> Making all caps/parameters QOM objects, or one object containing both
> attributes, sounds like a good fit.  The object_property_* APIs allow setters;
> I think that's good enough for migration to trigger whatever is needed (e.g.
> migration_rate_set() updates after bandwidth modifications).
>
> We can convert e.g. QMP set-parameters into a loop setting each
> property.  It'll be slightly slower because we'll need to do a sanity check
> for each property after each change, but that shouldn't be a hot path
> anyway so seems fine.

I figure doing initial configuration in one command is convenient.  The
obvious existing command for that is object-add.

The obvious interface for modifying configuration is a command to change
just one parameter.  The obvious existing command for that is qom-set.

Problem: qom-set is a death trap in general.  It can modify any QOM
property with a setter, and we test basically none of them.  Using it
for modifying migration configuration would signal it's okay to use
elsewhere, too.  I'm not sure we want to send that message.  Maybe we
want to do the opposite, and make it an unstable interface.

Aside: I believe the root problem is our failure to tie "can write" to
the object's state.  Just because a property can be set with object-add
doesn't mean it can be validly changed at any time during the object's
life.

Problem: when several parameters together have to satisfy constraints,
going from one valid configuration to another valid configuration may
require changing several parameters at once, or else go through invalid
intermediate configurations.

This problem is not at all specific to the migration object.

One solution is careful design to ensure that there's always a sequence
of transitions through valid configurations.  This can become complicated
as configuration evolves.  Possibly even impractical or impossible.

Another solution is a command to modify multiple parameters together,
leaving alone the others (unlike blockdev-reopen, which overwrites all
of them).

> It'll still be a pity that we still cannot reduce the triplication of QAPI
> docs immediately even with that.  But with that, it seems doable if we
> obsolete QMP migrate-set-parameters once we can do QOM-set.

Yes.




Re: [PATCH v6 11/21] virtio-net: Return an error when vhost cannot enable RSS

2023-11-13 Thread Akihiko Odaki

On 2023/11/14 2:26, Yuri Benditovich wrote:



On Mon, Nov 13, 2023 at 2:44 PM Akihiko Odaki wrote:

On 2023/11/13 20:44, Yuri Benditovich wrote:

On Sat, Nov 11, 2023 at 5:28 PM Akihiko Odaki wrote:

On 2023/11/03 22:14, Yuri Benditovich wrote:

On Fri, Nov 3, 2023 at 11:55 AM Akihiko Odaki wrote:

On 2023/11/03 18:35, Yuri Benditovich wrote:

On Thu, Nov 2, 2023 at 4:56 PM Akihiko Odaki wrote:

On 2023/11/02 19:20, Yuri Benditovich wrote:

On Thu, Nov 2, 2023 at 11:33 AM Michael S. Tsirkin wrote:

On Thu, Nov 02, 2023 at 11:09:27AM +0200, Yuri Benditovich wrote:

Probably we mix two different patches in this discussion.
Focusing on the patch in the e-mail header:

IMO it is not acceptable to fail QEMU run for one feature that we
can't make active when we silently drop all other features in such a
case.

If the feature is off by default then it seems more reasonable
and silent masking can be seen as a bug.
Most virtio features are on by default, this is why it's
reasonable to mask them.

If we are talking about RSS: setting it initially off is the
development time decision.
When it will be completely stable there is no reason to keep it
off by default, so this is more a question of time and of a
readiness of libvirt.

It is not ok to make "on" the default; that will enable RSS even
when eBPF steering support is not present and can result in
performance degradation.

Exactly as it is today - with vhost=on the host does not suggest RSS
without eBPF.
I do not understand what you call

Re: [PATCH v2] test/qtest: Add API functions to capture IRQ toggling

2023-11-13 Thread Thomas Huth

On 14/11/2023 00.01, Gustavo Romero wrote:

Currently, the QTest API does not provide a function to capture when an
IRQ line is raised or lowered, although the QTest Protocol already
reports such IRQ transitions. As a consequence, it is also not possible
to capture when an IRQ line is toggled. Functions like qtest_get_irq()
only read the current state of the intercepted IRQ lines, which is
already high (or low) when the function is called if the IRQ line is
toggled. Therefore, these functions miss the IRQ line state transitions.

This commit introduces two new API functions:
qtest_get_irq_raised_counter() and qtest_get_irq_lowered_counter().
These functions allow capturing the number of times an observed IRQ line
transitioned from low to high state or from high to low state,
respectively.

When used together, these new API functions then allow checking if one
or more pulses were generated (indicating if the IRQ line was toggled).

Signed-off-by: Gustavo Romero 
---
  tests/qtest/libqtest.c | 24 
  tests/qtest/libqtest.h | 28 
  2 files changed, 52 insertions(+)


Acked-by: Thomas Huth 





RE: [PATCH v1] target/i386/host-cpu: Use IOMMU addr width for passthrough devices on Intel platforms

2023-11-13 Thread Kasireddy, Vivek
Hi Laszlo,

> 
> On 11/13/23 08:32, Vivek Kasireddy wrote:
> > A recent OVMF update has resulted in MMIO regions being placed at
> > the upper end of the physical address space. As a result, when a
> > Host device is passthrough'd to the Guest via VFIO, the following
> > mapping failures occur when VFIO tries to map the MMIO regions of
> > the device:
> > VFIO_MAP_DMA failed: Invalid argument
> > vfio_dma_map(0x557b2f2736d0, 0x3800, 0x100, 0x7f98ac40) = -22 (Invalid argument)
> >
> > The above failures are mainly seen on some Intel platforms where
> > the physical address width is larger than the Host's IOMMU
> > address width. In these cases, VFIO fails to map the MMIO regions
> > because the IOVAs would be larger than the IOMMU aperture regions.
> >
> > Therefore, one way to solve this problem would be to ensure that
> > cpu->phys_bits = 
> > This can be done by parsing the IOMMU caps value from sysfs and
> > extracting the address width and using it to override the
> > phys_bits value as shown in this patch.
> >
> > Previous attempt at solving this issue in OVMF:
> > https://edk2.groups.io/g/devel/topic/102359124
> >
> > Cc: Gerd Hoffmann 
> > Cc: Philippe Mathieu-Daudé 
> > Cc: Alex Williamson 
> > Cc: Laszlo Ersek 
> > Cc: Dongwon Kim 
> > Signed-off-by: Vivek Kasireddy 
> > ---
> >  target/i386/host-cpu.c | 61 +-
> >  1 file changed, 60 insertions(+), 1 deletion(-)
> >
> > diff --git a/target/i386/host-cpu.c b/target/i386/host-cpu.c
> > index 92ecb7254b..8326ec95bc 100644
> > --- a/target/i386/host-cpu.c
> > +++ b/target/i386/host-cpu.c
> > @@ -12,6 +12,8 @@
> >  #include "host-cpu.h"
> >  #include "qapi/error.h"
> >  #include "qemu/error-report.h"
> > +#include "qemu/config-file.h"
> > +#include "qemu/option.h"
> >  #include "sysemu/sysemu.h"
> >
> >  /* Note: Only safe for use on x86(-64) hosts */
> > @@ -51,11 +53,58 @@ static void host_cpu_enable_cpu_pm(X86CPU *cpu)
> >  env->features[FEAT_1_ECX] |= CPUID_EXT_MONITOR;
> >  }
> >
> > +static int intel_iommu_check(void *opaque, QemuOpts *opts, Error **errp)
> > +{
> > +g_autofree char *dev_path = NULL, *iommu_path = NULL, *caps = NULL;
> > +const char *driver = qemu_opt_get(opts, "driver");
> > +const char *device = qemu_opt_get(opts, "host");
> > +uint32_t *iommu_phys_bits = opaque;
> > +struct stat st;
> > +uint64_t iommu_caps;
> > +
> > +/*
> > + * Check if the user is passthroughing any devices via VFIO. We don't
> > + * have to limit phys_bits if there are no valid passthrough devices.
> > + */
> > +if (g_strcmp0(driver, "vfio-pci") || !device) {
> > +return 0;
> > +}
> > +
> > +dev_path = g_strdup_printf("/sys/bus/pci/devices/%s", device);
> > +if (stat(dev_path, &st) < 0) {
> > +return 0;
> > +}
> > +
> > +iommu_path = g_strdup_printf("%s/iommu/intel-iommu/cap", dev_path);
> > +if (stat(iommu_path, &st) < 0) {
> > +return 0;
> > +}
> > +
> > +if (g_file_get_contents(iommu_path, &caps, NULL, NULL)) {
> > +if (sscanf(caps, "%lx", &iommu_caps) != 1) {
> > +return 0;
> > +}
> > +*iommu_phys_bits = ((iommu_caps >> 16) & 0x3f) + 1;
> > +}
> > +
> > +return 0;
> > +}
> > +
> > +static uint32_t host_iommu_phys_bits(void)
> > +{
> > +uint32_t iommu_phys_bits = 0;
> > +
> > +qemu_opts_foreach(qemu_find_opts("device"),
> > +  intel_iommu_check, &iommu_phys_bits, NULL);
> > +return iommu_phys_bits;
> > +}
> > +
> >  static uint32_t host_cpu_adjust_phys_bits(X86CPU *cpu)
> >  {
> >  uint32_t host_phys_bits = host_cpu_phys_bits();
> > +uint32_t iommu_phys_bits = host_iommu_phys_bits();
> >  uint32_t phys_bits = cpu->phys_bits;
> > -static bool warned;
> > +static bool warned, warned2;
> >
> >  /*
> >   * Print a warning if the user set it to a value that's not the
> > @@ -78,6 +127,16 @@ static uint32_t host_cpu_adjust_phys_bits(X86CPU *cpu)
> >  }
> >  }
> >
> > +if (iommu_phys_bits && phys_bits > iommu_phys_bits) {
> > +phys_bits = iommu_phys_bits;
> > +if (!warned2) {
> > +warn_report("Using physical bits (%u)"
> > +" to prevent VFIO mapping failures",
> > +iommu_phys_bits);
> > +warned2 = true;
> > +}
> > +}
> > +
> >  return phys_bits;
> >  }
> >
> 
> I only have very superficial comments here (sorry about that -- I find
> it too bad that this QEMU source file seems to have no designated
> reviewer or maintainer in QEMU, so I don't want to ignore it).
> 
> - Terminology: I think we like to call these devices "assigned", and not
> "passed through". Also, in noun form, "device assignment" and not
> "device passthrough". Sorry about being pedantic.
No problem; I'll try to start using the right terminology.

> 
> - As I (may have) mentioned in my OVMF comments, I'm 

Re: [QEMU][PATCHv2 0/8] Xen: support grant mappings.

2023-11-13 Thread Juergen Gross

On 13.11.23 21:24, David Woodhouse wrote:

On Fri, 2023-10-27 at 07:27 +0200, Juergen Gross wrote:

On 26.10.23 22:56, Stefano Stabellini wrote:

On Thu, 26 Oct 2023, David Woodhouse wrote:

On Thu, 2023-10-26 at 13:36 -0700, Stefano Stabellini wrote:



This seems like a lot of code to replace that simpler option... is
there a massive performance win from doing it this way? Would we want
to use this trick for the Xen PV backends (qdisk, qnic) *too*? Might it
make sense to introduce the simple version and *then* the optimisation,
with some clear benchmarking to show the win?


This is not done for performance but for safety (as in safety
certifications, ISO 26262, etc.). This is to enable unprivileged virtio
backends running in a DomU. By unprivileged I mean a virtio backend that
is unable to map arbitrary memory (the xenforeignmemory interface is
prohibited).

The goal is to run Xen on safety-critical systems such as cars,
industrial robots and more. In this configuration there is no
traditional Dom0 in the system at all. If you would like to know more:
https://www.youtube.com/watch?v=tisljY6Bqv0&list=PLYyw7IQjL-zHtpYtMpFR3KYdRn0rcp5Xn&index=8


Yeah, I understand why we're using grant mappings instead of just
directly having access via foreignmem mappings. That wasn't what I was
confused about.

What I haven't worked out is why we're implementing this through an
automatically-populated MemoryRegion in QEMU, rather than just using
grant mapping ops like we always have.

It seems like a lot of complexity just to avoid calling
qemu_xen_gnttab_map_refs() from the virtio backend.


I think there are two questions here. One question is "Why do we need
all the new grant mapping code added to xen-mapcache.c in patch #7?
Can't we use qemu_xen_gnttab_map_refs() instead?"


The main motivation was to _avoid_ having to change all the backends.

My implementation enables _all_ qemu based virtio backends to use grant
mappings. And if a new backend is added to qemu, there will be no change
required to make it work with grants.


I'm not really convinced I buy that. This is a lot of complexity, and
don't backends need to call an appropriate mapping function to map via
an IOMMU if it's present anyway? Make them call a helper where you can
do this in one place directly instead of through a fake MemoryRegion,
and you're done, surely?


That was tested with unmodified block and net backends in qemu.

Maybe I missed something, but I think the IOMMU accesses are _not_ covering
accesses to the virtio rings from qemu. And this is something you really
want for driver domains.


Juergen




[PATCH v2 15/20] migration/multifd: Add test hook to set normal page ratio.

2023-11-13 Thread Hao Xiang
Multifd sender thread performs zero page checking. If a page is
a zero page, only the page's metadata is sent to the receiver.
If a page is a normal page, the entire page's content is sent to
the receiver. This change adds a test hook to set the normal page
ratio. A zero page will be forced to be sent as a normal page. This
is useful for live migration performance analysis and optimization.

Signed-off-by: Hao Xiang 
---
 migration/options.c | 31 +++
 migration/options.h |  1 +
 qapi/migration.json | 18 +++---
 3 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/migration/options.c b/migration/options.c
index 6e424b5d63..e7f1e2df24 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -79,6 +79,11 @@
 #define DEFAULT_MIGRATE_ANNOUNCE_ROUNDS5
 #define DEFAULT_MIGRATE_ANNOUNCE_STEP100
 
+/*
+ * Parameter for multifd normal page test hook.
+ */
+#define DEFAULT_MIGRATE_MULTIFD_NORMAL_PAGE_RATIO 101
+
 #define DEFINE_PROP_MIG_CAP(name, x) \
 DEFINE_PROP_BOOL(name, MigrationState, capabilities[x], false)
 
@@ -181,6 +186,9 @@ Property migration_properties[] = {
   MIG_MODE_NORMAL),
 DEFINE_PROP_STRING("multifd-dsa-accel", MigrationState,
parameters.multifd_dsa_accel),
+DEFINE_PROP_UINT8("multifd-normal-page-ratio", MigrationState,
+  parameters.multifd_normal_page_ratio,
+  DEFAULT_MIGRATE_MULTIFD_NORMAL_PAGE_RATIO),
 
 /* Migration capabilities */
 DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
@@ -860,6 +868,12 @@ int migrate_multifd_channels(void)
 return s->parameters.multifd_channels;
 }
 
+uint8_t migrate_multifd_normal_page_ratio(void)
+{
+MigrationState *s = migrate_get_current();
+return s->parameters.multifd_normal_page_ratio;
+}
+
 MultiFDCompression migrate_multifd_compression(void)
 {
 MigrationState *s = migrate_get_current();
@@ -1258,6 +1272,14 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
 return false;
 }
 
+if (params->has_multifd_normal_page_ratio &&
+params->multifd_normal_page_ratio > 100) {
+error_setg(errp, QERR_INVALID_PARAMETER_VALUE,
+   "multifd_normal_page_ratio",
+   "a value between 0 and 100");
+return false;
+}
+
 return true;
 }
 
@@ -1378,6 +1400,11 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
 assert(params->multifd_dsa_accel->type == QTYPE_QSTRING);
 dest->multifd_dsa_accel = params->multifd_dsa_accel->u.s;
 }
+
+if (params->has_multifd_normal_page_ratio) {
+dest->has_multifd_normal_page_ratio = true;
+dest->multifd_normal_page_ratio = params->multifd_normal_page_ratio;
+}
 }
 
 static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
@@ -1528,6 +1555,10 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
 assert(params->multifd_dsa_accel->type == QTYPE_QSTRING);
 s->parameters.multifd_dsa_accel = g_strdup(params->multifd_dsa_accel->u.s);
 }
+
+if (params->has_multifd_normal_page_ratio) {
+s->parameters.multifd_normal_page_ratio = 
params->multifd_normal_page_ratio;
+}
 }
 
 void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
diff --git a/migration/options.h b/migration/options.h
index 56100961a9..21e3e7b0cf 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -95,6 +95,7 @@ const char *migrate_tls_creds(void);
 const char *migrate_tls_hostname(void);
 uint64_t migrate_xbzrle_cache_size(void);
 const char *migrate_multifd_dsa_accel(void);
+uint8_t migrate_multifd_normal_page_ratio(void);
 
 /* parameters setters */
 
diff --git a/qapi/migration.json b/qapi/migration.json
index a8e3b66d6f..bb876c8325 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -882,6 +882,9 @@
 # @multifd-dsa-accel: If enabled, use DSA accelerator offloading for
 # certain memory operations. (since 8.2)
 #
+# @multifd-normal-page-ratio: Test hook setting the normal page ratio.
+# (Since 8.2)
+#
 # Features:
 #
 # @deprecated: Member @block-incremental is deprecated.  Use
@@ -915,7 +918,8 @@
'block-bitmap-mapping',
{ 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] },
'vcpu-dirty-limit',
-   'mode'] }
+   'mode',
+   'multifd-normal-page-ratio'] }
 
 ##
 # @MigrateSetParameters:
@@ -1073,6 +1077,9 @@
 # @multifd-dsa-accel: If enabled, use DSA accelerator offloading for
 # certain memory operations. (since 8.2)
 #
+# @multifd-normal-page-ratio: Test hook setting the normal page ratio.
+# (Since 8.2)
+#
 # Features:
 #
 # @deprecated: Member @block-incremental is deprecated.  Use
@@ -1127,7 +1134,8 @@
 'features': [ 'unstable' ] }

[PATCH v2 16/20] migration/multifd: Enable set normal page ratio test hook in multifd.

2023-11-13 Thread Hao Xiang
The test hook is disabled by default. To set it, a normal page ratio
between 0 and 100 is valid. If the ratio is set to 50, it means
at least 50% of all pages are sent as normal pages.

Set the option:
migrate_set_parameter multifd-normal-page-ratio 60

Signed-off-by: Hao Xiang 
---
 include/qemu/dsa.h |  7 ++-
 migration/migration-hmp-cmds.c |  7 +++
 migration/multifd.c| 33 +
 3 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
index 3f8ee07004..bc7f652e0b 100644
--- a/include/qemu/dsa.h
+++ b/include/qemu/dsa.h
@@ -37,7 +37,10 @@ typedef struct buffer_zero_batch_task {
 enum dsa_task_type task_type;
 enum dsa_task_status status;
 bool *results;
-int batch_size;
+uint32_t batch_size;
+/* Set normal page ratio test hook. */
+uint32_t normal_page_index;
+uint32_t normal_page_counter;
 QSIMPLEQ_ENTRY(buffer_zero_batch_task) entry;
 } buffer_zero_batch_task;
 
@@ -45,6 +48,8 @@ typedef struct buffer_zero_batch_task {
 
 struct buffer_zero_batch_task {
 bool *results;
+uint32_t normal_page_index;
+uint32_t normal_page_counter;
 };
 
 #endif
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index d9451744dd..788ce699ac 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -356,6 +356,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
 monitor_printf(mon, "%s: %s\n",
 MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_DSA_ACCEL),
 params->multifd_dsa_accel);
+monitor_printf(mon, "%s: %u\n",
+MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_NORMAL_PAGE_RATIO),
+params->multifd_normal_page_ratio);
 
 if (params->has_block_bitmap_mapping) {
 const BitmapMigrationNodeAliasList *bmnal;
@@ -675,6 +678,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
 error_setg(&err, "The block-bitmap-mapping parameter can only be set "
"through QMP");
 break;
+case MIGRATION_PARAMETER_MULTIFD_NORMAL_PAGE_RATIO:
+p->has_multifd_normal_page_ratio = true;
+visit_type_uint8(v, param, &p->multifd_normal_page_ratio, &err);
+break;
 case MIGRATION_PARAMETER_X_VCPU_DIRTY_LIMIT_PERIOD:
 p->has_x_vcpu_dirty_limit_period = true;
 visit_type_size(v, param, &p->x_vcpu_dirty_limit_period, &err);
diff --git a/migration/multifd.c b/migration/multifd.c
index 2f635898ed..c9f9eef5b1 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -687,6 +687,37 @@ int multifd_send_sync_main(QEMUFile *f)
 return 0;
 }
 
+static void multifd_normal_page_test_hook(MultiFDSendParams *p)
+{
+/*
+ * The value is between 0 and 100. If the value is 10, it means at
+ * least 10% of the pages are normal pages. A zero page can be made
+ * a normal page but not the other way around.
+ */
+uint8_t multifd_normal_page_ratio =
+migrate_multifd_normal_page_ratio();
+struct buffer_zero_batch_task *batch_task = p->batch_task;
+
+/* A ratio above 100 means the test hook is disabled. */
+if (multifd_normal_page_ratio > 100) {
+return;
+}
+
+for (int i = 0; i < p->pages->num; i++) {
+if (batch_task->normal_page_counter < multifd_normal_page_ratio) {
+/* Turn a zero page into a normal page. */
+batch_task->results[i] = false;
+}
+batch_task->normal_page_index++;
+batch_task->normal_page_counter++;
+
+if (batch_task->normal_page_index >= 100) {
+batch_task->normal_page_index = 0;
+batch_task->normal_page_counter = 0;
+}
+}
+}
+
 static void set_page(MultiFDSendParams *p, bool zero_page, uint64_t offset)
 {
 RAMBlock *rb = p->pages->block;
@@ -752,6 +783,8 @@ static void multifd_zero_page_check(MultiFDSendParams *p)
 set_normal_pages(p);
 }
 
+multifd_normal_page_test_hook(p);
+
 for (int i = 0; i < p->pages->num; i++) {
 uint64_t offset = p->pages->offset[i];
 bool zero_page = p->batch_task->results[i];
-- 
2.30.2




[PATCH v2 17/20] migration/multifd: Add migration option set packet size.

2023-11-13 Thread Hao Xiang
The current multifd packet size is 128 * 4KB. This change adds
an option to set the packet size. Both sender and receiver need
to set the same packet size for migration to work.

Signed-off-by: Hao Xiang 
---
 migration/options.c | 34 ++
 migration/options.h |  1 +
 qapi/migration.json | 21 ++---
 3 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/migration/options.c b/migration/options.c
index e7f1e2df24..81f1bf25d4 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -84,6 +84,12 @@
  */
 #define DEFAULT_MIGRATE_MULTIFD_NORMAL_PAGE_RATIO 101
 
+/*
+ * Parameter for multifd packet size.
+ */
+#define DEFAULT_MIGRATE_MULTIFD_PACKET_SIZE 128
+#define MAX_MIGRATE_MULTIFD_PACKET_SIZE 1024
+
 #define DEFINE_PROP_MIG_CAP(name, x) \
 DEFINE_PROP_BOOL(name, MigrationState, capabilities[x], false)
 
@@ -189,6 +195,9 @@ Property migration_properties[] = {
 DEFINE_PROP_UINT8("multifd-normal-page-ratio", MigrationState,
   parameters.multifd_normal_page_ratio,
   DEFAULT_MIGRATE_MULTIFD_NORMAL_PAGE_RATIO),
+DEFINE_PROP_SIZE("multifd-packet-size", MigrationState,
+ parameters.multifd_packet_size,
+ DEFAULT_MIGRATE_MULTIFD_PACKET_SIZE),
 
 /* Migration capabilities */
 DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
@@ -874,6 +883,13 @@ uint8_t migrate_multifd_normal_page_ratio(void)
 return s->parameters.multifd_normal_page_ratio;
 }
 
+uint64_t migrate_multifd_packet_size(void)
+{
+MigrationState *s = migrate_get_current();
+
+return s->parameters.multifd_packet_size;
+}
+
 MultiFDCompression migrate_multifd_compression(void)
 {
 MigrationState *s = migrate_get_current();
@@ -1012,6 +1028,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
 params->x_checkpoint_delay = s->parameters.x_checkpoint_delay;
 params->has_block_incremental = true;
 params->block_incremental = s->parameters.block_incremental;
+params->has_multifd_packet_size = true;
+params->multifd_packet_size = s->parameters.multifd_packet_size;
 params->has_multifd_channels = true;
 params->multifd_channels = s->parameters.multifd_channels;
 params->has_multifd_compression = true;
@@ -1072,6 +1090,7 @@ void migrate_params_init(MigrationParameters *params)
 params->has_downtime_limit = true;
 params->has_x_checkpoint_delay = true;
 params->has_block_incremental = true;
+params->has_multifd_packet_size = true;
 params->has_multifd_channels = true;
 params->has_multifd_compression = true;
 params->has_multifd_zlib_level = true;
@@ -1170,6 +1189,15 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
 
 /* x_checkpoint_delay is now always positive */
 
+if (params->has_multifd_packet_size &&
+((params->multifd_packet_size < DEFAULT_MIGRATE_MULTIFD_PACKET_SIZE) ||
+(params->multifd_packet_size > MAX_MIGRATE_MULTIFD_PACKET_SIZE))) {
+error_setg(errp, QERR_INVALID_PARAMETER_VALUE,
+"multifd_packet_size",
+"a value between 128 and 1024");
+return false;
+}
+
 if (params->has_multifd_channels && (params->multifd_channels < 1)) {
 error_setg(errp, QERR_INVALID_PARAMETER_VALUE,
"multifd_channels",
@@ -1351,6 +1379,9 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
 if (params->has_block_incremental) {
 dest->block_incremental = params->block_incremental;
 }
+if (params->has_multifd_packet_size) {
+dest->multifd_packet_size = params->multifd_packet_size;
+}
 if (params->has_multifd_channels) {
 dest->multifd_channels = params->multifd_channels;
 }
@@ -1496,6 +1527,9 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
 " use blockdev-mirror with NBD instead");
 s->parameters.block_incremental = params->block_incremental;
 }
+if (params->has_multifd_packet_size) {
+s->parameters.multifd_packet_size = params->multifd_packet_size;
+}
 if (params->has_multifd_channels) {
 s->parameters.multifd_channels = params->multifd_channels;
 }
diff --git a/migration/options.h b/migration/options.h
index 21e3e7b0cf..5816f6dac2 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -96,6 +96,7 @@ const char *migrate_tls_hostname(void);
 uint64_t migrate_xbzrle_cache_size(void);
 const char *migrate_multifd_dsa_accel(void);
 uint8_t migrate_multifd_normal_page_ratio(void);
+uint64_t migrate_multifd_packet_size(void);
 
 /* parameters setters */
 
diff --git a/qapi/migration.json b/qapi/migration.json
index bb876c8325..f87daddf33 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -885,6 +885,10 @@
 # @multifd-normal-page-ratio: Test hook setting the normal page ratio.
 # (Since 8.2)

[PATCH v2 02/20] multifd: Support for zero pages transmission

2023-11-13 Thread Hao Xiang
From: Juan Quintela 

This patch adds counters and similar support. The logic will be added
in the following patch.

Signed-off-by: Juan Quintela 
---
 migration/multifd.c| 37 ++---
 migration/multifd.h| 17 -
 migration/trace-events |  8 
 3 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index ec58c58082..d28ef0028b 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -267,6 +267,7 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
 packet->normal_pages = cpu_to_be32(p->normal_num);
 packet->next_packet_size = cpu_to_be32(p->next_packet_size);
 packet->packet_num = cpu_to_be64(p->packet_num);
+packet->zero_pages = cpu_to_be32(p->zero_num);
 
 if (p->pages->block) {
 strncpy(packet->ramblock, p->pages->block->idstr, 256);
@@ -326,7 +327,15 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
 p->next_packet_size = be32_to_cpu(packet->next_packet_size);
 p->packet_num = be64_to_cpu(packet->packet_num);
 
-if (p->normal_num == 0) {
+p->zero_num = be32_to_cpu(packet->zero_pages);
+if (p->zero_num > packet->pages_alloc - p->normal_num) {
+error_setg(errp, "multifd: received packet "
+   "with %u zero pages and expected maximum pages are %u",
+   p->zero_num, packet->pages_alloc - p->normal_num);
+return -1;
+}
+
+if (p->normal_num == 0 && p->zero_num == 0) {
 return 0;
 }
 
@@ -431,6 +440,7 @@ static int multifd_send_pages(QEMUFile *f)
 p->packet_num = multifd_send_state->packet_num++;
 multifd_send_state->pages = p->pages;
 p->pages = pages;
+
 qemu_mutex_unlock(&p->mutex);
 qemu_sem_post(&p->sem);
 
@@ -552,6 +562,8 @@ void multifd_save_cleanup(void)
 p->iov = NULL;
 g_free(p->normal);
 p->normal = NULL;
+g_free(p->zero);
+p->zero = NULL;
 multifd_send_state->ops->send_cleanup(p, &local_err);
 if (local_err) {
 migrate_set_error(migrate_get_current(), local_err);
@@ -680,6 +692,7 @@ static void *multifd_send_thread(void *opaque)
 uint64_t packet_num = p->packet_num;
 uint32_t flags;
 p->normal_num = 0;
+p->zero_num = 0;
 
 if (use_zero_copy_send) {
 p->iovs_num = 0;
@@ -704,12 +717,13 @@ static void *multifd_send_thread(void *opaque)
 p->flags = 0;
 p->num_packets++;
 p->total_normal_pages += p->normal_num;
+p->total_zero_pages += p->zero_num;
 p->pages->num = 0;
 p->pages->block = NULL;
 qemu_mutex_unlock(&p->mutex);
 
-trace_multifd_send(p->id, packet_num, p->normal_num, flags,
-   p->next_packet_size);
+trace_multifd_send(p->id, packet_num, p->normal_num, p->zero_num,
+   flags, p->next_packet_size);
 
 if (use_zero_copy_send) {
 /* Send header first, without zerocopy */
@@ -732,6 +746,8 @@ static void *multifd_send_thread(void *opaque)
 
 stat64_add(&mig_stats.multifd_bytes,
p->next_packet_size + p->packet_len);
+stat64_add(&mig_stats.normal_pages, p->normal_num);
+stat64_add(&mig_stats.zero_pages, p->zero_num);
 p->next_packet_size = 0;
 qemu_mutex_lock(&p->mutex);
 p->pending_job--;
@@ -762,7 +778,8 @@ out:
 
 rcu_unregister_thread();
 migration_threads_remove(thread);
-trace_multifd_send_thread_end(p->id, p->num_packets, p->total_normal_pages);
+trace_multifd_send_thread_end(p->id, p->num_packets, p->total_normal_pages,
+  p->total_zero_pages);
 
 return NULL;
 }
@@ -939,6 +956,7 @@ int multifd_save_setup(Error **errp)
 p->normal = g_new0(ram_addr_t, page_count);
 p->page_size = qemu_target_page_size();
 p->page_count = page_count;
+p->zero = g_new0(ram_addr_t, page_count);
 
 if (migrate_zero_copy_send()) {
 p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
@@ -1054,6 +1072,8 @@ void multifd_load_cleanup(void)
 p->iov = NULL;
 g_free(p->normal);
 p->normal = NULL;
+g_free(p->zero);
+p->zero = NULL;
 multifd_recv_state->ops->recv_cleanup(p);
 }
 qemu_sem_destroy(&multifd_recv_state->sem_sync);
@@ -1122,10 +1142,11 @@ static void *multifd_recv_thread(void *opaque)
 flags = p->flags;
 /* recv methods don't know how to handle the SYNC flag */
 p->flags &= ~MULTIFD_FLAG_SYNC;
-trace_multifd_recv(p->id, p->packet_num, p->normal_num, flags,
-   p->next_packet_size);
+trace_multifd_recv(p->id, p->packet_num, p->normal_num, p->zero_num,
+   flags, p->next

[PATCH v2 20/20] migration/multifd: Add integration tests for multifd with Intel DSA offloading.

2023-11-13 Thread Hao Xiang
* Add test case to start and complete multifd live migration with DSA
offloading enabled.
* Add test case to start and cancel multifd live migration with DSA
offloading enabled.

Signed-off-by: Bryan Zhang 
Signed-off-by: Hao Xiang 
---
 tests/qtest/migration-test.c | 77 +++-
 1 file changed, 76 insertions(+), 1 deletion(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 5752412b64..3ffbdd5a65 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -639,6 +639,12 @@ typedef struct {
 const char *opts_target;
 } MigrateStart;
 
+/*
 * It requires separate steps to configure and enable a DSA device.
+ * This test assumes that the configuration is done already.
+ */
static const char *dsa_dev_path = "/dev/dsa/wq4.0";
+
 /*
  * A hook that runs after the src and dst QEMUs have been
  * created, but before the migration is started. This can
@@ -2775,7 +2781,7 @@ static void test_multifd_tcp_tls_x509_reject_anon_client(void)
  *
  *  And see that it works
  */
-static void test_multifd_tcp_cancel(void)
+static void test_multifd_tcp_cancel_common(bool use_dsa)
 {
 MigrateStart args = {
 .hide_stderr = true,
@@ -2796,6 +2802,10 @@ static void test_multifd_tcp_cancel(void)
 migrate_set_capability(from, "multifd", true);
 migrate_set_capability(to, "multifd", true);
 
+if (use_dsa) {
+migrate_set_parameter_str(from, "multifd-dsa-accel", dsa_dev_path);
+}
+
 /* Start incoming migration from the 1st socket */
 migrate_incoming_qmp(to, "tcp:127.0.0.1:0", "{}");
 
@@ -2852,6 +2862,48 @@ static void test_multifd_tcp_cancel(void)
 test_migrate_end(from, to2, true);
 }
 
+/*
+ * This test does:
+ *  source   target
+ *   migrate_incoming
+ * migrate
+ * migrate_cancel
+ *   launch another target
+ * migrate
+ *
+ *  And see that it works
+ */
+static void test_multifd_tcp_cancel(void)
+{
+test_multifd_tcp_cancel_common(false);
+}
+
+#ifdef CONFIG_DSA_OPT
+
+static void *test_migrate_precopy_tcp_multifd_start_dsa(QTestState *from,
+QTestState *to)
+{
+migrate_set_parameter_str(from, "multifd-dsa-accel", dsa_dev_path);
+return test_migrate_precopy_tcp_multifd_start_common(from, to, "none");
+}
+
+static void test_multifd_tcp_none_dsa(void)
+{
+MigrateCommon args = {
+.listen_uri = "defer",
+.start_hook = test_migrate_precopy_tcp_multifd_start_dsa,
+};
+
+test_precopy_common(&args);
+}
+
+static void test_multifd_tcp_cancel_dsa(void)
+{
+test_multifd_tcp_cancel_common(true);
+}
+
+#endif
+
 static void calc_dirty_rate(QTestState *who, uint64_t calc_time)
 {
 qtest_qmp_assert_success(who,
@@ -3274,6 +3326,19 @@ static bool kvm_dirty_ring_supported(void)
 #endif
 }
 
+#ifdef CONFIG_DSA_OPT
+static int test_dsa_setup(void)
+{
+int fd;
+fd = open(dsa_dev_path, O_RDWR);
+if (fd < 0) {
+return -1;
+}
+close(fd);
+return 0;
+}
+#endif
+
 int main(int argc, char **argv)
 {
 bool has_kvm, has_tcg;
@@ -3468,6 +3533,16 @@ int main(int argc, char **argv)
 }
 qtest_add_func("/migration/multifd/tcp/plain/none",
test_multifd_tcp_none);
+
+#ifdef CONFIG_DSA_OPT
+if (g_str_equal(arch, "x86_64") && test_dsa_setup() == 0) {
+qtest_add_func("/migration/multifd/tcp/plain/none/dsa",
+   test_multifd_tcp_none_dsa);
+qtest_add_func("/migration/multifd/tcp/plain/cancel/dsa",
+   test_multifd_tcp_cancel_dsa);
+}
+#endif
+
 /*
  * This test is flaky and sometimes fails in CI and otherwise:
  * don't run unless user opts in via environment variable.
-- 
2.30.2




[PATCH v2 05/20] meson: Introduce new instruction set enqcmd to the build system.

2023-11-13 Thread Hao Xiang
Enable the ENQCMD instruction set in the build.

Signed-off-by: Hao Xiang 
---
 meson.build   | 2 ++
 meson_options.txt | 2 ++
 scripts/meson-buildoptions.sh | 3 +++
 3 files changed, 7 insertions(+)

diff --git a/meson.build b/meson.build
index ec01f8b138..1292ab78a3 100644
--- a/meson.build
+++ b/meson.build
@@ -2708,6 +2708,8 @@ config_host_data.set('CONFIG_AVX512BW_OPT', get_option('avx512bw') \
 int main(int argc, char *argv[]) { return bar(argv[0]); }
   '''), error_message: 'AVX512BW not available').allowed())
 
+config_host_data.set('CONFIG_DSA_OPT', get_option('enqcmd'))
+
 # For both AArch64 and AArch32, detect if builtins are available.
 config_host_data.set('CONFIG_ARM_AES_BUILTIN', cc.compiles('''
 #include 
diff --git a/meson_options.txt b/meson_options.txt
index c9baeda639..6fe8aca181 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -121,6 +121,8 @@ option('avx512f', type: 'feature', value: 'disabled',
description: 'AVX512F optimizations')
 option('avx512bw', type: 'feature', value: 'auto',
description: 'AVX512BW optimizations')
+option('enqcmd', type: 'boolean', value: false,
   description: 'ENQCMD optimizations')
 option('keyring', type: 'feature', value: 'auto',
description: 'Linux keyring support')
 option('libkeyutils', type: 'feature', value: 'auto',
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index 680fa3f581..bf139e3fb4 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -93,6 +93,7 @@ meson_options_help() {
   printf "%s\n" '  avx2AVX2 optimizations'
   printf "%s\n" '  avx512bwAVX512BW optimizations'
   printf "%s\n" '  avx512f AVX512F optimizations'
+  printf "%s\n" '  enqcmd  ENQCMD optimizations'
   printf "%s\n" '  blkio   libblkio block device driver'
   printf "%s\n" '  bochs   bochs image format support'
   printf "%s\n" '  bpf eBPF support'
@@ -240,6 +241,8 @@ _meson_option_parse() {
 --disable-avx512bw) printf "%s" -Davx512bw=disabled ;;
 --enable-avx512f) printf "%s" -Davx512f=enabled ;;
 --disable-avx512f) printf "%s" -Davx512f=disabled ;;
+--enable-enqcmd) printf "%s" -Denqcmd=true ;;
+--disable-enqcmd) printf "%s" -Denqcmd=false ;;
 --enable-gcov) printf "%s" -Db_coverage=true ;;
 --disable-gcov) printf "%s" -Db_coverage=false ;;
 --enable-lto) printf "%s" -Db_lto=true ;;
-- 
2.30.2




[PATCH v2 10/20] util/dsa: Implement zero page checking in DSA task.

2023-11-13 Thread Hao Xiang
Create a DSA task with operation code DSA_OPCODE_COMPVAL.
Here we create two types of DSA tasks: a single DSA task and
a batch DSA task. A batch DSA task reduces task submission overhead
and hence should be the default option. However, due to the way the DSA
hardware works, a DSA batch task must contain at least two individual
tasks. There are times we need to submit a single task, so
single DSA task submission is also required.

Signed-off-by: Hao Xiang 
Signed-off-by: Bryan Zhang 
---
 include/qemu/dsa.h |  16 +++
 util/dsa.c | 252 +
 2 files changed, 247 insertions(+), 21 deletions(-)

diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
index 23f55185be..b10e7b8fb7 100644
--- a/include/qemu/dsa.h
+++ b/include/qemu/dsa.h
@@ -49,6 +49,22 @@ struct buffer_zero_batch_task {
 
 #endif
 
+/**
+ * @brief Initializes a buffer zero batch task.
+ *
+ * @param task A pointer to the batch task to initialize.
+ * @param batch_size The number of DSA tasks in the batch.
+ */
+void buffer_zero_batch_task_init(struct buffer_zero_batch_task *task,
+ int batch_size);
+
+/**
+ * @brief Performs the proper cleanup on a DSA batch task.
+ *
+ * @param task A pointer to the batch task to cleanup.
+ */
+void buffer_zero_batch_task_destroy(struct buffer_zero_batch_task *task);
+
 /**
  * @brief Initializes DSA devices.
  *
diff --git a/util/dsa.c b/util/dsa.c
index 0e68013ffb..3cc017b8a0 100644
--- a/util/dsa.c
+++ b/util/dsa.c
@@ -75,6 +75,7 @@ uint64_t max_retry_count;
 static struct dsa_device_group dsa_group;
 static struct dsa_completion_thread completion_thread;
 
+static void buffer_zero_dsa_completion(void *context);
 
 /**
  * @brief This function opens a DSA device's work queue and
@@ -208,7 +209,6 @@ dsa_device_group_start(struct dsa_device_group *group)
  *
  * @param group A pointer to the DSA device group.
  */
-__attribute__((unused))
 static void
 dsa_device_group_stop(struct dsa_device_group *group)
 {
@@ -244,7 +244,6 @@ dsa_device_group_cleanup(struct dsa_device_group *group)
  * @return struct dsa_device* A pointer to the next available DSA device
  * in the group.
  */
-__attribute__((unused))
 static struct dsa_device *
 dsa_device_group_get_next_device(struct dsa_device_group *group)
 {
@@ -319,7 +318,6 @@ dsa_task_enqueue(struct dsa_device_group *group,
  * @param group A pointer to the DSA device group.
  * @return buffer_zero_batch_task* The DSA task being dequeued.
  */
-__attribute__((unused))
 static struct buffer_zero_batch_task *
 dsa_task_dequeue(struct dsa_device_group *group)
 {
@@ -376,22 +374,6 @@ submit_wi_int(void *wq, struct dsa_hw_desc *descriptor)
 return 0;
 }
 
-/**
- * @brief Synchronously submits a DSA work item to the
- *device work queue.
- *
- * @param wq A pointer to the DSA worjk queue's device memory.
- * @param descriptor A pointer to the DSA work item descriptor.
- *
- * @return int Zero if successful, non-zero otherwise.
- */
-__attribute__((unused))
-static int
-submit_wi(void *wq, struct dsa_hw_desc *descriptor)
-{
-return submit_wi_int(wq, descriptor);
-}
-
 /**
  * @brief Asynchronously submits a DSA work item to the
  *device work queue.
@@ -400,7 +382,6 @@ submit_wi(void *wq, struct dsa_hw_desc *descriptor)
  *
  * @return int Zero if successful, non-zero otherwise.
  */
-__attribute__((unused))
 static int
 submit_wi_async(struct buffer_zero_batch_task *task)
 {
@@ -428,7 +409,6 @@ submit_wi_async(struct buffer_zero_batch_task *task)
  *
  * @return int Zero if successful, non-zero otherwise.
  */
-__attribute__((unused))
 static int
 submit_batch_wi_async(struct buffer_zero_batch_task *batch_task)
 {
@@ -678,6 +658,231 @@ static void dsa_completion_thread_stop(void *opaque)
 qemu_sem_destroy(&thread_context->sem_init_done);
 }
 
+/**
+ * @brief Initializes a buffer zero comparison DSA task.
+ *
+ * @param descriptor A pointer to the DSA task descriptor.
+ * @param completion A pointer to the DSA task completion record.
+ */
+static void
+buffer_zero_task_init_int(struct dsa_hw_desc *descriptor,
+  struct dsa_completion_record *completion)
+{
+descriptor->opcode = DSA_OPCODE_COMPVAL;
+descriptor->flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
+descriptor->comp_pattern = (uint64_t)0;
+descriptor->completion_addr = (uint64_t)completion;
+}
+
+/**
+ * @brief Initializes a buffer zero batch task.
+ *
+ * @param task A pointer to the batch task to initialize.
+ * @param batch_size The number of DSA tasks in the batch.
+ */
+void
+buffer_zero_batch_task_init(struct buffer_zero_batch_task *task,
+int batch_size)
+{
+int descriptors_size = sizeof(*task->descriptors) * batch_size;
+memset(task, 0, sizeof(*task));
+
+task->descriptors =
+(struct dsa_hw_desc *)qemu_memalign(64, descriptors_size);
+memset(task->descriptors, 0, descriptors_size);
+task->completions = (

[PATCH v2 18/20] migration/multifd: Enable set packet size migration option.

2023-11-13 Thread Hao Xiang
During live migration, if the latency between sender and receiver
is high and bandwidth is also high (a long fat pipe), using a bigger
packet size can help reduce total migration time. In addition, Intel
DSA offloading performs better with a large batch task. Providing an
option to set the packet size is useful for performance tuning.

Set the option:
migrate_set_parameter multifd-packet-size 512

Signed-off-by: Hao Xiang 
---
 migration/migration-hmp-cmds.c | 7 +++
 migration/multifd-zlib.c   | 8 ++--
 migration/multifd-zstd.c   | 8 ++--
 migration/multifd.c| 4 ++--
 migration/multifd.h| 3 ---
 5 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 788ce699ac..2d0c71294c 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -338,6 +338,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
 monitor_printf(mon, "%s: %s\n",
 MigrationParameter_str(MIGRATION_PARAMETER_BLOCK_INCREMENTAL),
 params->block_incremental ? "on" : "off");
+monitor_printf(mon, "%s: %" PRIu64 "\n",
+MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_PACKET_SIZE),
+params->multifd_packet_size);
 monitor_printf(mon, "%s: %u\n",
 MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_CHANNELS),
 params->multifd_channels);
@@ -626,6 +629,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
 p->multifd_dsa_accel->type = QTYPE_QSTRING;
 visit_type_str(v, param, &p->multifd_dsa_accel->u.s, &err);
 break;
+case MIGRATION_PARAMETER_MULTIFD_PACKET_SIZE:
+p->has_multifd_packet_size = true;
+visit_type_size(v, param, &p->multifd_packet_size, &err);
+break;
 case MIGRATION_PARAMETER_MULTIFD_CHANNELS:
 p->has_multifd_channels = true;
 visit_type_uint8(v, param, &p->multifd_channels, &err);
diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index 37ce48621e..453c85d725 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -49,6 +49,8 @@ static int zlib_send_setup(MultiFDSendParams *p, Error **errp)
 struct zlib_data *z = g_new0(struct zlib_data, 1);
 z_stream *zs = &z->zs;
 const char *err_msg;
+uint64_t multifd_packet_size =
+migrate_multifd_packet_size() * qemu_target_page_size();
 
 zs->zalloc = Z_NULL;
 zs->zfree = Z_NULL;
@@ -58,7 +60,7 @@ static int zlib_send_setup(MultiFDSendParams *p, Error **errp)
 goto err_free_z;
 }
 /* This is the maximum size of the compressed buffer */
-z->zbuff_len = compressBound(MULTIFD_PACKET_SIZE);
+z->zbuff_len = compressBound(multifd_packet_size);
 z->zbuff = g_try_malloc(z->zbuff_len);
 if (!z->zbuff) {
 err_msg = "out of memory for zbuff";
@@ -186,6 +188,8 @@ static int zlib_send_prepare(MultiFDSendParams *p, Error **errp)
  */
 static int zlib_recv_setup(MultiFDRecvParams *p, Error **errp)
 {
+uint64_t multifd_packet_size =
+migrate_multifd_packet_size() * qemu_target_page_size();
 struct zlib_data *z = g_new0(struct zlib_data, 1);
 z_stream *zs = &z->zs;
 
@@ -200,7 +204,7 @@ static int zlib_recv_setup(MultiFDRecvParams *p, Error **errp)
 return -1;
 }
 /* To be safe, we reserve twice the size of the packet */
-z->zbuff_len = MULTIFD_PACKET_SIZE * 2;
+z->zbuff_len = multifd_packet_size * 2;
 z->zbuff = g_try_malloc(z->zbuff_len);
 if (!z->zbuff) {
 inflateEnd(zs);
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index b471daadcd..60298861d6 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -49,6 +49,8 @@ struct zstd_data {
  */
 static int zstd_send_setup(MultiFDSendParams *p, Error **errp)
 {
+uint64_t multifd_packet_size =
+migrate_multifd_packet_size() * qemu_target_page_size();
 struct zstd_data *z = g_new0(struct zstd_data, 1);
 int res;
 
@@ -69,7 +71,7 @@ static int zstd_send_setup(MultiFDSendParams *p, Error **errp)
 return -1;
 }
 /* This is the maximum size of the compressed buffer */
-z->zbuff_len = ZSTD_compressBound(MULTIFD_PACKET_SIZE);
+z->zbuff_len = ZSTD_compressBound(multifd_packet_size);
 z->zbuff = g_try_malloc(z->zbuff_len);
 if (!z->zbuff) {
 ZSTD_freeCStream(z->zcs);
@@ -175,6 +177,8 @@ static int zstd_send_prepare(MultiFDSendParams *p, Error **errp)
  */
 static int zstd_recv_setup(MultiFDRecvParams *p, Error **errp)
 {
+uint64_t multifd_packet_size =
+migrate_multifd_packet_size() * qemu_target_page_size();
 struct zstd_data *z = g_new0(struct zstd_data, 1);
 int ret;
 
@@ -196,7 +200,7 @@ static int zstd_recv_setup(MultiFDRecvParams *p, Error **errp)
 }
 
 /* To be safe, we reserve twice the size of the packet */
-z->zbuff_len = MULTIFD_PACKET_

[PATCH v2 07/20] util/dsa: Implement DSA device start and stop logic.

2023-11-13 Thread Hao Xiang
* DSA device open and close.
* DSA group contains multiple DSA devices.
* DSA group configure/start/stop/clean.

Signed-off-by: Hao Xiang 
Signed-off-by: Bryan Zhang 
---
 include/qemu/dsa.h |  49 +++
 util/dsa.c | 338 +
 util/meson.build   |   1 +
 3 files changed, 388 insertions(+)
 create mode 100644 include/qemu/dsa.h
 create mode 100644 util/dsa.c

diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
new file mode 100644
index 00..30246b507e
--- /dev/null
+++ b/include/qemu/dsa.h
@@ -0,0 +1,49 @@
+#ifndef QEMU_DSA_H
+#define QEMU_DSA_H
+
+#include "qemu/thread.h"
+#include "qemu/queue.h"
+
+#ifdef CONFIG_DSA_OPT
+
+#pragma GCC push_options
+#pragma GCC target("enqcmd")
+
+#include 
+#include "x86intrin.h"
+
+#endif
+
+/**
+ * @brief Initializes DSA devices.
+ *
+ * @param dsa_parameter A list of DSA device paths from the migration parameter.
+ * @return int Zero if successful, otherwise non-zero.
+ */
+int dsa_init(const char *dsa_parameter);
+
+/**
+ * @brief Start logic to enable using DSA.
+ */
+void dsa_start(void);
+
+/**
+ * @brief Stop logic to clean up DSA by halting the device group and cleaning up
+ * the completion thread.
+ */
+void dsa_stop(void);
+
+/**
+ * @brief Clean up system resources created for DSA offloading.
+ *This function is called during QEMU process teardown.
+ */
+void dsa_cleanup(void);
+
+/**
+ * @brief Check if DSA is running.
+ *
+ * @return True if DSA is running, otherwise false.
+ */
+bool dsa_is_running(void);
+
+#endif
\ No newline at end of file
diff --git a/util/dsa.c b/util/dsa.c
new file mode 100644
index 00..8edaa892ec
--- /dev/null
+++ b/util/dsa.c
@@ -0,0 +1,338 @@
+/*
+ * Use Intel Data Streaming Accelerator to offload certain background
+ * operations.
+ *
+ * Copyright (c) 2023 Hao Xiang 
+ *Bryan Zhang 
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/queue.h"
+#include "qemu/memalign.h"
+#include "qemu/lockable.h"
+#include "qemu/cutils.h"
+#include "qemu/dsa.h"
+#include "qemu/bswap.h"
+#include "qemu/error-report.h"
+#include "qemu/rcu.h"
+
+#ifdef CONFIG_DSA_OPT
+
+#pragma GCC push_options
+#pragma GCC target("enqcmd")
+
+#include 
+#include "x86intrin.h"
+
+#define DSA_WQ_SIZE 4096
+#define MAX_DSA_DEVICES 16
+
+typedef QSIMPLEQ_HEAD(dsa_task_queue, buffer_zero_batch_task) dsa_task_queue;
+
+struct dsa_device {
+void *work_queue;
+};
+
+struct dsa_device_group {
+struct dsa_device *dsa_devices;
+int num_dsa_devices;
+uint32_t index;
+bool running;
+QemuMutex task_queue_lock;
+QemuCond task_queue_cond;
+dsa_task_queue task_queue;
+};
+
+uint64_t max_retry_count;
+static struct dsa_device_group dsa_group;
+
+
+/**
+ * @brief This function opens a DSA device's work queue and
+ *maps the DSA device memory into the current process.
+ *
+ * @param dsa_wq_path A pointer to the DSA device work queue's file path.
+ * @return A pointer to the mapped memory.
+ */
+static void *
+map_dsa_device(const char *dsa_wq_path)
+{
+void *dsa_device;
+int fd;
+
+fd = open(dsa_wq_path, O_RDWR);
+if (fd < 0) {
+fprintf(stderr, "open %s failed with errno = %d.\n",
+dsa_wq_path, errno);
+return MAP_FAILED;
+}
+dsa_device = mmap(NULL, DSA_WQ_SIZE, PROT_WRITE,
+  MAP_SHARED | MAP_POPULATE, fd, 0);
+close(fd);
+if (dsa_device == MAP_FAILED) {
+fprintf(stderr, "mmap failed with errno = %d.\n", errno);
+return MAP_FAILED;
+}
+return dsa_device;
+}
+
+/**
+ * @brief Initializes a DSA device structure.
+ *
+ * @param instance A pointer to the DSA device.
+ * @param work_queue  A pointer to the DSA work queue.
+ */
+static void
+dsa_device_init(struct dsa_device *instance,
+void *dsa_work_queue)
+{
+instance->work_queue = dsa_work_queue;
+}
+
+/**
+ 

[PATCH v2 14/20] migration/multifd: Enable DSA offloading in multifd sender path.

2023-11-13 Thread Hao Xiang
The multifd sender path gets an array of pages queued by the migration
thread and performs zero page checking on every page in the array.
The pages are classified as either zero pages or normal pages. This
change uses Intel DSA to offload the zero page checking from the CPU to
the DSA accelerator. The sender thread submits a batch of pages to the
DSA hardware and waits for the DSA completion thread to signal work
completion.
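
For readers without the rest of the series at hand, the CPU-side classification step this patch offloads can be sketched as a tiny standalone helper. The names below are illustrative only and are not the QEMU API; QEMU's real path uses buffer_is_zero(), for which a plain byte scan stands in here:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for buffer_is_zero(): scan a page byte by byte. */
static bool page_is_zero(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            return false;
        }
    }
    return true;
}

/* Fill results[i] for each of the count pages, as the sender thread does
 * before deciding which offsets go into the zero vs. normal lists. */
static void classify_pages(const uint8_t **pages, size_t count,
                           size_t page_size, bool *results)
{
    for (size_t i = 0; i < count; i++) {
        results[i] = page_is_zero(pages[i], page_size);
    }
}
```

With DSA offloading, the loop above is replaced by one batched submission to the accelerator; the results array is filled in by the completion path instead.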

Signed-off-by: Hao Xiang 
---
 migration/multifd.c | 33 -
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 68ab97f918..2f635898ed 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -560,6 +560,8 @@ void multifd_save_cleanup(void)
 qemu_thread_join(&p->thread);
 }
 }
+dsa_stop();
+dsa_cleanup();
 for (i = 0; i < migrate_multifd_channels(); i++) {
 MultiFDSendParams *p = &multifd_send_state->params[i];
 Error *local_err = NULL;
@@ -702,6 +704,7 @@ static void buffer_is_zero_use_cpu(MultiFDSendParams *p)
 {
 const void **buf = (const void **)p->addr;
 assert(!migrate_use_main_zero_page());
+assert(!dsa_is_running());
 
 for (int i = 0; i < p->pages->num; i++) {
 p->batch_task->results[i] = buffer_is_zero(buf[i], p->page_size);
@@ -710,15 +713,29 @@ static void buffer_is_zero_use_cpu(MultiFDSendParams *p)
 
 static void set_normal_pages(MultiFDSendParams *p)
 {
+assert(migrate_use_main_zero_page());
+
 for (int i = 0; i < p->pages->num; i++) {
 p->batch_task->results[i] = false;
 }
 }
 
+static void buffer_is_zero_use_dsa(MultiFDSendParams *p)
+{
+assert(!migrate_use_main_zero_page());
+assert(dsa_is_running());
+
+buffer_is_zero_dsa_batch_async(p->batch_task,
+   (const void **)p->addr,
+   p->pages->num,
+   p->page_size);
+}
+
 static void multifd_zero_page_check(MultiFDSendParams *p)
 {
 /* older qemu doesn't understand zero page on multifd channel */
 bool use_multifd_zero_page = !migrate_use_main_zero_page();
+bool use_multifd_dsa_accel = dsa_is_running();
 
 RAMBlock *rb = p->pages->block;
 
@@ -726,7 +743,9 @@ static void multifd_zero_page_check(MultiFDSendParams *p)
 p->addr[i] = (ram_addr_t)(rb->host + p->pages->offset[i]);
 }
 
-if (use_multifd_zero_page) {
+if (use_multifd_dsa_accel && use_multifd_zero_page) {
+buffer_is_zero_use_dsa(p);
+} else if (use_multifd_zero_page) {
 buffer_is_zero_use_cpu(p);
 } else {
 // No zero page checking. All pages are normal pages.
@@ -1001,11 +1020,15 @@ int multifd_save_setup(Error **errp)
 int thread_count;
 uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
 uint8_t i;
+const char *dsa_parameter = migrate_multifd_dsa_accel();
 
 if (!migrate_multifd()) {
 return 0;
 }
 
+dsa_init(dsa_parameter);
+dsa_start();
+
 thread_count = migrate_multifd_channels();
 multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
 multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
@@ -1061,6 +1084,7 @@ int multifd_save_setup(Error **errp)
 return ret;
 }
 }
+
 return 0;
 }
 
@@ -1138,6 +1162,8 @@ void multifd_load_cleanup(void)
 
 qemu_thread_join(&p->thread);
 }
+dsa_stop();
+dsa_cleanup();
 for (i = 0; i < migrate_multifd_channels(); i++) {
 MultiFDRecvParams *p = &multifd_recv_state->params[i];
 
@@ -1272,6 +1298,7 @@ int multifd_load_setup(Error **errp)
 int thread_count;
 uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
 uint8_t i;
+const char *dsa_parameter = migrate_multifd_dsa_accel();
 
 /*
  * Return successfully if multiFD recv state is already initialised
@@ -1281,6 +1308,9 @@ int multifd_load_setup(Error **errp)
 return 0;
 }
 
+dsa_init(dsa_parameter);
+dsa_start();
+
 thread_count = migrate_multifd_channels();
 multifd_recv_state = g_malloc0(sizeof(*multifd_recv_state));
 multifd_recv_state->params = g_new0(MultiFDRecvParams, thread_count);
@@ -1317,6 +1347,7 @@ int multifd_load_setup(Error **errp)
 return ret;
 }
 }
+
 return 0;
 }
 
-- 
2.30.2




[PATCH v2 04/20] So we use multifd to transmit zero pages.

2023-11-13 Thread Hao Xiang
From: Juan Quintela 

Signed-off-by: Juan Quintela 
Reviewed-by: Leonardo Bras 
---
 migration/multifd.c |  7 ---
 migration/options.c | 13 +++--
 migration/ram.c | 45 ++---
 qapi/migration.json |  1 -
 4 files changed, 49 insertions(+), 17 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 1b994790d5..1198ffde9c 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -13,6 +13,7 @@
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
 #include "qemu/rcu.h"
+#include "qemu/cutils.h"
 #include "exec/target_page.h"
 #include "sysemu/sysemu.h"
 #include "exec/ramblock.h"
@@ -459,7 +460,6 @@ static int multifd_send_pages(QEMUFile *f)
 p->packet_num = multifd_send_state->packet_num++;
 multifd_send_state->pages = p->pages;
 p->pages = pages;
-
 qemu_mutex_unlock(&p->mutex);
 qemu_sem_post(&p->sem);
 
@@ -684,7 +684,7 @@ static void *multifd_send_thread(void *opaque)
 MigrationThread *thread = NULL;
 Error *local_err = NULL;
 /* qemu older than 8.2 doesn't understand zero page on multifd channel */
-bool use_zero_page = !migrate_use_main_zero_page();
+bool use_multifd_zero_page = !migrate_use_main_zero_page();
 int ret = 0;
 bool use_zero_copy_send = migrate_zero_copy_send();
 
@@ -713,6 +713,7 @@ static void *multifd_send_thread(void *opaque)
 RAMBlock *rb = p->pages->block;
 uint64_t packet_num = p->packet_num;
 uint32_t flags;
+
 p->normal_num = 0;
 p->zero_num = 0;
 
@@ -724,7 +725,7 @@ static void *multifd_send_thread(void *opaque)
 
 for (int i = 0; i < p->pages->num; i++) {
 uint64_t offset = p->pages->offset[i];
-if (use_zero_page &&
+if (use_multifd_zero_page &&
 buffer_is_zero(rb->host + offset, p->page_size)) {
 p->zero[p->zero_num] = offset;
 p->zero_num++;
diff --git a/migration/options.c b/migration/options.c
index 00c0c4a0d6..97d121d4d7 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -195,6 +195,7 @@ Property migration_properties[] = {
 DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK),
 DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH),
 DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
+DEFINE_PROP_MIG_CAP("x-main-zero-page", MIGRATION_CAPABILITY_MAIN_ZERO_PAGE),
 DEFINE_PROP_MIG_CAP("x-background-snapshot",
 MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT),
 #ifdef CONFIG_LINUX
@@ -288,13 +289,9 @@ bool migrate_multifd(void)
 
 bool migrate_use_main_zero_page(void)
 {
-//MigrationState *s;
-
-//s = migrate_get_current();
+MigrationState *s = migrate_get_current();
 
-// We will enable this when we add the right code.
-// return s->enabled_capabilities[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE];
-return true;
+return s->capabilities[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE];
 }
 
 bool migrate_pause_before_switchover(void)
@@ -457,6 +454,7 @@ INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
 MIGRATION_CAPABILITY_LATE_BLOCK_ACTIVATE,
 MIGRATION_CAPABILITY_RETURN_PATH,
 MIGRATION_CAPABILITY_MULTIFD,
+MIGRATION_CAPABILITY_MAIN_ZERO_PAGE,
 MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER,
 MIGRATION_CAPABILITY_AUTO_CONVERGE,
 MIGRATION_CAPABILITY_RELEASE_RAM,
@@ -534,6 +532,9 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
 error_setg(errp, "Postcopy is not yet compatible with multifd");
 return false;
 }
+if (new_caps[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE]) {
+error_setg(errp, "Postcopy is not yet compatible with main zero page");
+}
 }
 
 if (new_caps[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) {
diff --git a/migration/ram.c b/migration/ram.c
index 8c7886ab79..f7a42feff2 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2059,17 +2059,42 @@ static int ram_save_target_page_legacy(RAMState *rs, PageSearchStatus *pss)
 if (save_zero_page(rs, pss, offset)) {
 return 1;
 }
-
 /*
- * Do not use multifd in postcopy as one whole host page should be
- * placed.  Meanwhile postcopy requires atomic update of pages, so even
- * if host page size == guest page size the dest guest during run may
- * still see partially copied pages which is data corruption.
+ * Do not use multifd for:
+ * 1. Compression as the first page in the new block should be posted out
+ *before sending the compressed page
+ * 2. In postcopy as one whole host page should be placed
  */
-if (migrate_multifd() && !migration_in_postcopy()) {
+if (!migrate_compress() && migrate_multifd() && !migration_in_postcopy()) {
+return ram_save_multifd_page(pss->pss_channel, block, offset);
+}
+
+return ram_save_page(rs, pss);
+}
+
+/**

[PATCH v2 06/20] util/dsa: Add dependency idxd.

2023-11-13 Thread Hao Xiang
Idxd is the device driver for DSA (Intel Data Streaming
Accelerator). The driver has been fully functional since Linux
kernel 5.19. This change adds the driver's header file used
for userspace development.

Signed-off-by: Hao Xiang 
---
 linux-headers/linux/idxd.h | 356 +
 1 file changed, 356 insertions(+)
 create mode 100644 linux-headers/linux/idxd.h

diff --git a/linux-headers/linux/idxd.h b/linux-headers/linux/idxd.h
new file mode 100644
index 00..1d553bedbd
--- /dev/null
+++ b/linux-headers/linux/idxd.h
@@ -0,0 +1,356 @@
+/* SPDX-License-Identifier: LGPL-2.1 WITH Linux-syscall-note */
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
+#ifndef _USR_IDXD_H_
+#define _USR_IDXD_H_
+
+#ifdef __KERNEL__
+#include <linux/types.h>
+#else
+#include <stdint.h>
+#endif
+
+/* Driver command error status */
+enum idxd_scmd_stat {
+   IDXD_SCMD_DEV_ENABLED = 0x8010,
+   IDXD_SCMD_DEV_NOT_ENABLED = 0x8020,
+   IDXD_SCMD_WQ_ENABLED = 0x8021,
+   IDXD_SCMD_DEV_DMA_ERR = 0x8002,
+   IDXD_SCMD_WQ_NO_GRP = 0x8003,
+   IDXD_SCMD_WQ_NO_NAME = 0x8004,
+   IDXD_SCMD_WQ_NO_SVM = 0x8005,
+   IDXD_SCMD_WQ_NO_THRESH = 0x8006,
+   IDXD_SCMD_WQ_PORTAL_ERR = 0x8007,
+   IDXD_SCMD_WQ_RES_ALLOC_ERR = 0x8008,
+   IDXD_SCMD_PERCPU_ERR = 0x8009,
+   IDXD_SCMD_DMA_CHAN_ERR = 0x800a,
+   IDXD_SCMD_CDEV_ERR = 0x800b,
+   IDXD_SCMD_WQ_NO_SWQ_SUPPORT = 0x800c,
+   IDXD_SCMD_WQ_NONE_CONFIGURED = 0x800d,
+   IDXD_SCMD_WQ_NO_SIZE = 0x800e,
+   IDXD_SCMD_WQ_NO_PRIV = 0x800f,
+   IDXD_SCMD_WQ_IRQ_ERR = 0x8010,
+   IDXD_SCMD_WQ_USER_NO_IOMMU = 0x8011,
+};
+
+#define IDXD_SCMD_SOFTERR_MASK 0x8000
+#define IDXD_SCMD_SOFTERR_SHIFT 16
+
+/* Descriptor flags */
+#define IDXD_OP_FLAG_FENCE 0x0001
+#define IDXD_OP_FLAG_BOF   0x0002
+#define IDXD_OP_FLAG_CRAV  0x0004
+#define IDXD_OP_FLAG_RCR   0x0008
+#define IDXD_OP_FLAG_RCI   0x0010
+#define IDXD_OP_FLAG_CRSTS 0x0020
+#define IDXD_OP_FLAG_CR0x0080
+#define IDXD_OP_FLAG_CC0x0100
+#define IDXD_OP_FLAG_ADDR1_TCS 0x0200
+#define IDXD_OP_FLAG_ADDR2_TCS 0x0400
+#define IDXD_OP_FLAG_ADDR3_TCS 0x0800
+#define IDXD_OP_FLAG_CR_TCS0x1000
+#define IDXD_OP_FLAG_STORD 0x2000
+#define IDXD_OP_FLAG_DRDBK 0x4000
+#define IDXD_OP_FLAG_DSTS  0x8000
+
+/* IAX */
+#define IDXD_OP_FLAG_RD_SRC2_AECS  0x01
+#define IDXD_OP_FLAG_RD_SRC2_2ND   0x02
+#define IDXD_OP_FLAG_WR_SRC2_AECS_COMP 0x04
+#define IDXD_OP_FLAG_WR_SRC2_AECS_OVFL 0x08
+#define IDXD_OP_FLAG_SRC2_STS  0x10
+#define IDXD_OP_FLAG_CRC_RFC3720   0x20
+
+/* Opcode */
+enum dsa_opcode {
+   DSA_OPCODE_NOOP = 0,
+   DSA_OPCODE_BATCH,
+   DSA_OPCODE_DRAIN,
+   DSA_OPCODE_MEMMOVE,
+   DSA_OPCODE_MEMFILL,
+   DSA_OPCODE_COMPARE,
+   DSA_OPCODE_COMPVAL,
+   DSA_OPCODE_CR_DELTA,
+   DSA_OPCODE_AP_DELTA,
+   DSA_OPCODE_DUALCAST,
+   DSA_OPCODE_CRCGEN = 0x10,
+   DSA_OPCODE_COPY_CRC,
+   DSA_OPCODE_DIF_CHECK,
+   DSA_OPCODE_DIF_INS,
+   DSA_OPCODE_DIF_STRP,
+   DSA_OPCODE_DIF_UPDT,
+   DSA_OPCODE_CFLUSH = 0x20,
+};
+
+enum iax_opcode {
+   IAX_OPCODE_NOOP = 0,
+   IAX_OPCODE_DRAIN = 2,
+   IAX_OPCODE_MEMMOVE,
+   IAX_OPCODE_DECOMPRESS = 0x42,
+   IAX_OPCODE_COMPRESS,
+   IAX_OPCODE_CRC64,
+   IAX_OPCODE_ZERO_DECOMP_32 = 0x48,
+   IAX_OPCODE_ZERO_DECOMP_16,
+   IAX_OPCODE_ZERO_COMP_32 = 0x4c,
+   IAX_OPCODE_ZERO_COMP_16,
+   IAX_OPCODE_SCAN = 0x50,
+   IAX_OPCODE_SET_MEMBER,
+   IAX_OPCODE_EXTRACT,
+   IAX_OPCODE_SELECT,
+   IAX_OPCODE_RLE_BURST,
+   IAX_OPCODE_FIND_UNIQUE,
+   IAX_OPCODE_EXPAND,
+};
+
+/* Completion record status */
+enum dsa_completion_status {
+   DSA_COMP_NONE = 0,
+   DSA_COMP_SUCCESS,
+   DSA_COMP_SUCCESS_PRED,
+   DSA_COMP_PAGE_FAULT_NOBOF,
+   DSA_COMP_PAGE_FAULT_IR,
+   DSA_COMP_BATCH_FAIL,
+   DSA_COMP_BATCH_PAGE_FAULT,
+   DSA_COMP_DR_OFFSET_NOINC,
+   DSA_COMP_DR_OFFSET_ERANGE,
+   DSA_COMP_DIF_ERR,
+   DSA_COMP_BAD_OPCODE = 0x10,
+   DSA_COMP_INVALID_FLAGS,
+   DSA_COMP_NOZERO_RESERVE,
+   DSA_COMP_XFER_ERANGE,
+   DSA_COMP_DESC_CNT_ERANGE,
+   DSA_COMP_DR_ERANGE,
+   DSA_COMP_OVERLAP_BUFFERS,
+   DSA_COMP_DCAST_ERR,
+   DSA_COMP_DESCLIST_ALIGN,
+   DSA_COMP_INT_HANDLE_INVAL,
+   DSA_COMP_CRA_XLAT,
+   DSA_COMP_CRA_ALIGN,
+   DSA_COMP_ADDR_ALIGN,
+   DSA_COMP_PRIV_BAD,
+   DSA_COMP_TRAFFIC_CLASS_CONF,
+   DSA_COMP_PFAULT_RDBA,
+   DSA_COMP_HW_ERR1,
+   DSA_COMP_HW_ERR_DRB,
+   DSA_COMP_TRANSLATION_FAIL,
+};
+
+enum iax_completion_status {
+   IAX_COMP_NONE = 0,
+   IAX_COMP_SUCCESS,
+   IAX_COMP_PAGE_FAULT_IR = 0x04,
+   IAX_COMP_ANALYTICS_ERROR = 0x0a,
+   IAX_COMP_OUTBUF_OVERFLOW

[PATCH v2 09/20] util/dsa: Implement DSA task asynchronous completion thread model.

2023-11-13 Thread Hao Xiang
* Create a dedicated thread for DSA task completion.
* The DSA completion thread runs a loop and polls for completed tasks.
* Start and stop the DSA completion thread during DSA device start and stop.

A user space application can directly submit a task to the Intel DSA
accelerator by writing to DSA's device memory (mapped in user space).
Once a task is submitted, the device starts processing it and writes
the completion status back to the task. A user space application can
poll the task's completion status to check for completion. This change
uses a dedicated thread to perform DSA task completion checking.
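
The polling model described above, reduced to a minimal standalone sketch: a worker thread stands in for the DSA device, which writes a completion status that a poller spins on. All names here are hypothetical, not QEMU's:

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Completion statuses, mirroring the DSA_COMP_NONE/DSA_COMP_SUCCESS idea. */
enum { COMP_NONE = 0, COMP_SUCCESS = 1 };

struct task {
    _Atomic int status;   /* written by the "device", read by the poller */
    int result;
};

/* Simulated device: produce a result, then publish the completion status. */
static void *device_sim(void *opaque)
{
    struct task *t = opaque;
    t->result = 42;
    atomic_store(&t->status, COMP_SUCCESS);
    return NULL;
}

/* Spin until the status changes; the real code uses _mm_pause() and a
 * bounded retry count instead of yielding forever. */
static int poll_completion(struct task *t)
{
    while (atomic_load(&t->status) == COMP_NONE) {
        sched_yield();
    }
    return t->result;
}
```

In the patch, this polling loop runs in the dedicated completion thread rather than in the submitter, so the sender threads only block on a semaphore.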

Signed-off-by: Hao Xiang 
---
 util/dsa.c | 243 -
 1 file changed, 242 insertions(+), 1 deletion(-)

diff --git a/util/dsa.c b/util/dsa.c
index f82282ce99..0e68013ffb 100644
--- a/util/dsa.c
+++ b/util/dsa.c
@@ -44,6 +44,7 @@
 
 #define DSA_WQ_SIZE 4096
 #define MAX_DSA_DEVICES 16
+#define DSA_COMPLETION_THREAD "dsa_completion"
 
 typedef QSIMPLEQ_HEAD(dsa_task_queue, buffer_zero_batch_task) dsa_task_queue;
 
@@ -61,8 +62,18 @@ struct dsa_device_group {
 dsa_task_queue task_queue;
 };
 
+struct dsa_completion_thread {
+bool stopping;
+bool running;
+QemuThread thread;
+int thread_id;
+QemuSemaphore sem_init_done;
+struct dsa_device_group *group;
+};
+
 uint64_t max_retry_count;
 static struct dsa_device_group dsa_group;
+static struct dsa_completion_thread completion_thread;
 
 
 /**
@@ -439,6 +450,234 @@ submit_batch_wi_async(struct buffer_zero_batch_task *batch_task)
 return dsa_task_enqueue(device_group, batch_task);
 }
 
+/**
+ * @brief Poll for the DSA work item completion.
+ *
+ * @param completion A pointer to the DSA work item completion record.
+ * @param opcode The DSA opcode.
+ *
+ * @return Zero if successful, non-zero otherwise.
+ */
+static int
+poll_completion(struct dsa_completion_record *completion,
+enum dsa_opcode opcode)
+{
+uint8_t status;
+uint64_t retry = 0;
+
+while (true) {
+// The DSA operation completes successfully or fails.
+status = completion->status;
+if (status == DSA_COMP_SUCCESS ||
+status == DSA_COMP_PAGE_FAULT_NOBOF ||
+status == DSA_COMP_BATCH_PAGE_FAULT ||
+status == DSA_COMP_BATCH_FAIL) {
+break;
+} else if (status != DSA_COMP_NONE) {
+/* TODO: Error handling here on unexpected failure. */
+fprintf(stderr, "DSA opcode %d failed with status = %d.\n",
+opcode, status);
+exit(1);
+}
+retry++;
+if (retry > max_retry_count) {
+fprintf(stderr, "Wait for completion retry %lu times.\n", retry);
+exit(1);
+}
+_mm_pause();
+}
+
+return 0;
+}
+
+/**
+ * @brief Complete a single DSA task in the batch task.
+ *
+ * @param task A pointer to the batch task structure.
+ */
+static void
+poll_task_completion(struct buffer_zero_batch_task *task)
+{
+assert(task->task_type == DSA_TASK);
+
+struct dsa_completion_record *completion = &task->completions[0];
+uint8_t status;
+
+poll_completion(completion, task->descriptors[0].opcode);
+
+status = completion->status;
+if (status == DSA_COMP_SUCCESS) {
+task->results[0] = (completion->result == 0);
+return;
+}
+
+assert(status == DSA_COMP_PAGE_FAULT_NOBOF);
+}
+
+/**
+ * @brief Poll a batch task status until it completes. If DSA task doesn't
+ *complete properly, use CPU to complete the task.
+ *
+ * @param batch_task A pointer to the DSA batch task.
+ */
+static void
+poll_batch_task_completion(struct buffer_zero_batch_task *batch_task)
+{
+struct dsa_completion_record *batch_completion = &batch_task->batch_completion;
+struct dsa_completion_record *completion;
+uint8_t batch_status;
+uint8_t status;
+bool *results = batch_task->results;
+uint32_t count = batch_task->batch_descriptor.desc_count;
+
+poll_completion(batch_completion,
+batch_task->batch_descriptor.opcode);
+
+batch_status = batch_completion->status;
+
+if (batch_status == DSA_COMP_SUCCESS) {
+if (batch_completion->bytes_completed == count) {
+// Let's skip checking each descriptor's completion status
+// if the batch descriptor says all succeeded.
+for (int i = 0; i < count; i++) {
+assert(batch_task->completions[i].status == DSA_COMP_SUCCESS);
+results[i] = (batch_task->completions[i].result == 0);
+}
+return;
+}
+} else {
+assert(batch_status == DSA_COMP_BATCH_FAIL ||
+batch_status == DSA_COMP_BATCH_PAGE_FAULT);
+}
+
+for (int i = 0; i < count; i++) {
+
+completion = &batch_task->completions[i];
+status = completion->status;
+
+if (status == DSA_COMP_SUCCESS) {
+results[i] = (completion->re

[PATCH v2 08/20] util/dsa: Implement DSA task enqueue and dequeue.

2023-11-13 Thread Hao Xiang
* Use a thread-safe queue for DSA task enqueue/dequeue.
* Implement DSA task submission.
* Implement DSA batch task submission.
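
The enqueue/dequeue scheme described above (signal only on the empty-to-non-empty transition, block the dequeuer on a condition variable) can be sketched with a plain pthread mutex/condvar pair. A fixed-size ring stands in for QEMU's QSIMPLEQ, and all names are illustrative:

```c
#include <pthread.h>
#include <stdbool.h>

#define QCAP 16

struct task_queue {
    void *items[QCAP];
    int head, count;
    bool running;
    pthread_mutex_t lock;
    pthread_cond_t cond;
};

static int tq_enqueue(struct task_queue *q, void *task)
{
    pthread_mutex_lock(&q->lock);
    if (!q->running || q->count == QCAP) {
        pthread_mutex_unlock(&q->lock);
        return -1;                     /* queue stopped (or full, here) */
    }
    bool notify = (q->count == 0);     /* 0->1 transition */
    q->items[(q->head + q->count) % QCAP] = task;
    q->count++;
    if (notify) {
        pthread_cond_signal(&q->cond); /* wake the waiter only then */
    }
    pthread_mutex_unlock(&q->lock);
    return 0;
}

static void *tq_dequeue(struct task_queue *q)
{
    void *task = NULL;
    pthread_mutex_lock(&q->lock);
    while (q->running && q->count == 0) {
        pthread_cond_wait(&q->cond, &q->lock);
    }
    if (q->count > 0) {                /* NULL if the queue was stopped */
        task = q->items[q->head];
        q->head = (q->head + 1) % QCAP;
        q->count--;
    }
    pthread_mutex_unlock(&q->lock);
    return task;
}
```

Signalling only on the 0->1 transition works because the patch has a single dequeuing thread (the completion thread); with multiple waiters a broadcast or per-enqueue signal would be needed.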

Signed-off-by: Hao Xiang 
---
 include/qemu/dsa.h |  35 
 util/dsa.c | 196 +
 2 files changed, 231 insertions(+)

diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
index 30246b507e..23f55185be 100644
--- a/include/qemu/dsa.h
+++ b/include/qemu/dsa.h
@@ -12,6 +12,41 @@
 #include 
 #include "x86intrin.h"
 
+enum dsa_task_type {
+DSA_TASK = 0,
+DSA_BATCH_TASK
+};
+
+enum dsa_task_status {
+DSA_TASK_READY = 0,
+DSA_TASK_PROCESSING,
+DSA_TASK_COMPLETION
+};
+
+typedef void (*buffer_zero_dsa_completion_fn)(void *);
+
+typedef struct buffer_zero_batch_task {
+struct dsa_hw_desc batch_descriptor;
+struct dsa_hw_desc *descriptors;
+struct dsa_completion_record batch_completion __attribute__((aligned(32)));
+struct dsa_completion_record *completions;
+struct dsa_device_group *group;
+struct dsa_device *device;
+buffer_zero_dsa_completion_fn completion_callback;
+QemuSemaphore sem_task_complete;
+enum dsa_task_type task_type;
+enum dsa_task_status status;
+bool *results;
+int batch_size;
+QSIMPLEQ_ENTRY(buffer_zero_batch_task) entry;
+} buffer_zero_batch_task;
+
+#else
+
+struct buffer_zero_batch_task {
+bool *results;
+};
+
 #endif
 
 /**
diff --git a/util/dsa.c b/util/dsa.c
index 8edaa892ec..f82282ce99 100644
--- a/util/dsa.c
+++ b/util/dsa.c
@@ -245,6 +245,200 @@ dsa_device_group_get_next_device(struct dsa_device_group *group)
 return &group->dsa_devices[current];
 }
 
+/**
+ * @brief Empties out the DSA task queue.
+ *
+ * @param group A pointer to the DSA device group.
+ */
+static void
+dsa_empty_task_queue(struct dsa_device_group *group)
+{
+qemu_mutex_lock(&group->task_queue_lock);
+dsa_task_queue *task_queue = &group->task_queue;
+while (!QSIMPLEQ_EMPTY(task_queue)) {
+QSIMPLEQ_REMOVE_HEAD(task_queue, entry);
+}
+qemu_mutex_unlock(&group->task_queue_lock);
+}
+
+/**
+ * @brief Adds a task to the DSA task queue.
+ *
+ * @param group A pointer to the DSA device group.
+ * @param context A pointer to the DSA task to enqueue.
+ *
+ * @return int Zero if successful, otherwise a proper error code.
+ */
+static int
+dsa_task_enqueue(struct dsa_device_group *group,
+ struct buffer_zero_batch_task *task)
+{
+dsa_task_queue *task_queue = &group->task_queue;
+QemuMutex *task_queue_lock = &group->task_queue_lock;
+QemuCond *task_queue_cond = &group->task_queue_cond;
+
+bool notify = false;
+
+qemu_mutex_lock(task_queue_lock);
+
+if (!group->running) {
+fprintf(stderr, "DSA: Tried to queue task to stopped device queue\n");
+qemu_mutex_unlock(task_queue_lock);
+return -1;
+}
+
+// The queue is empty. This enqueue operation is a 0->1 transition.
+if (QSIMPLEQ_EMPTY(task_queue))
+notify = true;
+
+QSIMPLEQ_INSERT_TAIL(task_queue, task, entry);
+
+// We need to notify the waiter for 0->1 transitions.
+if (notify)
+qemu_cond_signal(task_queue_cond);
+
+qemu_mutex_unlock(task_queue_lock);
+
+return 0;
+}
+
+/**
+ * @brief Takes a DSA task out of the task queue.
+ *
+ * @param group A pointer to the DSA device group.
+ * @return buffer_zero_batch_task* The DSA task being dequeued.
+ */
+__attribute__((unused))
+static struct buffer_zero_batch_task *
+dsa_task_dequeue(struct dsa_device_group *group)
+{
+struct buffer_zero_batch_task *task = NULL;
+dsa_task_queue *task_queue = &group->task_queue;
+QemuMutex *task_queue_lock = &group->task_queue_lock;
+QemuCond *task_queue_cond = &group->task_queue_cond;
+
+qemu_mutex_lock(task_queue_lock);
+
+while (true) {
+if (!group->running)
+goto exit;
+task = QSIMPLEQ_FIRST(task_queue);
+if (task != NULL) {
+break;
+}
+qemu_cond_wait(task_queue_cond, task_queue_lock);
+}
+
+QSIMPLEQ_REMOVE_HEAD(task_queue, entry);
+
+exit:
+qemu_mutex_unlock(task_queue_lock);
+return task;
+}
+
+/**
+ * @brief Submits a DSA work item to the device work queue.
+ *
+ * @param wq A pointer to the DSA work queue's device memory.
+ * @param descriptor A pointer to the DSA work item descriptor.
+ *
+ * @return Zero if successful, non-zero otherwise.
+ */
+static int
+submit_wi_int(void *wq, struct dsa_hw_desc *descriptor)
+{
+uint64_t retry = 0;
+
+_mm_sfence();
+
+while (true) {
+if (_enqcmd(wq, descriptor) == 0) {
+break;
+}
+retry++;
+if (retry > max_retry_count) {
+fprintf(stderr, "Submit work retry %lu times.\n", retry);
+exit(1);
+}
+}
+
+return 0;
+}
+
+/**
+ * @brief Synchronously submits a DSA work item to the
+ *device work queue.
+ *
+ * @param wq A pointer to the DSA work queue's device memo

[PATCH v2 19/20] util/dsa: Add unit test coverage for Intel DSA task submission and completion.

2023-11-13 Thread Hao Xiang
* Test DSA start and stop path.
* Test DSA configure and cleanup path.
* Test DSA task submission and completion path.

Signed-off-by: Bryan Zhang 
Signed-off-by: Hao Xiang 
---
 tests/unit/meson.build |   6 +
 tests/unit/test-dsa.c  | 466 +
 2 files changed, 472 insertions(+)
 create mode 100644 tests/unit/test-dsa.c

diff --git a/tests/unit/meson.build b/tests/unit/meson.build
index a05d471090..72e22063dc 100644
--- a/tests/unit/meson.build
+++ b/tests/unit/meson.build
@@ -54,6 +54,12 @@ tests = {
  'test-virtio-dmabuf': [meson.project_source_root() / 'hw/display/virtio-dmabuf.c'],
 }
 
+if config_host_data.get('CONFIG_DSA_OPT')
+  tests += {
+'test-dsa': [],
+  }
+endif
+
 if have_system or have_tools
   tests += {
 'test-qmp-event': [testqapi],
diff --git a/tests/unit/test-dsa.c b/tests/unit/test-dsa.c
new file mode 100644
index 00..d2f23c3dba
--- /dev/null
+++ b/tests/unit/test-dsa.c
@@ -0,0 +1,466 @@
+/*
+ * Test DSA functions.
+ *
+ * Copyright (c) 2023 Hao Xiang 
+ * Copyright (c) 2023 Bryan Zhang 
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <https://www.gnu.org/licenses/>.
+ */
+#include "qemu/osdep.h"
+#include "qemu/host-utils.h"
+
+#include "qemu/cutils.h"
+#include "qemu/memalign.h"
+#include "qemu/dsa.h"
+
+// TODO Make these not-hardcoded.
+static const char *path1 = "/dev/dsa/wq4.0";
+static const char *path2 = "/dev/dsa/wq4.0 /dev/dsa/wq4.1";
+static const int num_devices = 2;
+
+static struct buffer_zero_batch_task batch_task __attribute__((aligned(64)));
+
+// TODO Communicate that DSA must be configured to support this batch size.
+// TODO Alternatively, poke the DSA device to figure out batch size.
+static int batch_size = 128;
+static int page_size = 4096;
+
+// A helper for running a single task and checking for correctness.
+static void do_single_task(void)
+{
+buffer_zero_batch_task_init(&batch_task, batch_size);
+char buf[page_size];
+char* ptr = buf;
+
+buffer_is_zero_dsa_batch_async(&batch_task,
+   (const void**) &ptr,
+   1,
+   page_size);
+g_assert(batch_task.results[0] == buffer_is_zero(buf, page_size));
+}
+
+static void test_single_zero(void)
+{
+g_assert(!dsa_init(path1));
+dsa_start();
+
+buffer_zero_batch_task_init(&batch_task, batch_size);
+
+char buf[page_size];
+char* ptr = buf;
+
+memset(buf, 0x0, page_size);
+buffer_is_zero_dsa_batch_async(&batch_task,
+   (const void**) &ptr,
+   1, page_size);
+g_assert(batch_task.results[0]);
+
+dsa_cleanup();
+}
+
+static void test_single_zero_async(void)
+{
+test_single_zero();
+}
+
+static void test_single_nonzero(void)
+{
+g_assert(!dsa_init(path1));
+dsa_start();
+
+buffer_zero_batch_task_init(&batch_task, batch_size);
+
+char buf[page_size];
+char* ptr = buf;
+
+memset(buf, 0x1, page_size);
+buffer_is_zero_dsa_batch_async(&batch_task,
+   (const void**) &ptr,
+   1, page_size);
+g_assert(!batch_task.results[0]);
+
+dsa_cleanup();
+}
+
+static void test_single_nonzero_async(void)
+{
+test_single_nonzero();
+}
+
+// count == 0 should return quickly without calling into DSA.
+static void test_zero_count_async(void)
+{
+char buf[page_size];
+buffer_is_zero_dsa_batch_async(&batch_task,
+ (const void **) &buf,
+ 0,
+ page_size);
+}
+
+static void test_null_task_async(void)
+{
+if (g_test_subprocess()) {
+g_assert(!dsa_init(path1));
+
+char buf[page_size * batch_size];
+char *addrs[batch_size];
+for (int i = 0; i < batch_size; i++) {
+addrs[i] = buf + (page_size * i);
+}
+
+buffer_is_zero_dsa_batch_async(NULL, (const void**) addrs, batch_size,
+ page_size);
+} else {
+g_test_trap_subprocess(NULL, 0, 0);
+g_test_trap_assert_failed();
+}
+}
+
+static void test_oversized_batch(void)
+{
+g_assert(!dsa_init(path1));
+dsa_start();
+
+buffer_zero_batch_task_init(&batch_task, batch_size);
+
+int oversized_batch_size = batch_size + 1;
+char

[PATCH v2 11/20] util/dsa: Implement DSA task asynchronous submission and wait for completion.

2023-11-13 Thread Hao Xiang
* Add a DSA task completion callback.
* The DSA completion thread calls the task's completion callback
on every task/batch task completion.
* Make the DSA submission path wait for completion.
* Implement CPU fallback if DSA is not able to complete the task.
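
The fallback rule described above, when the device completed only part of a buffer, can be sketched in isolation. Names are hypothetical; QEMU's real code reads these fields from the DSA completion record:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Scan the remaining bytes for a non-zero value. */
static bool tail_is_zero(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i]) {
            return false;
        }
    }
    return true;
}

/* bytes_completed and partial_result mimic the completion record fields:
 * if the device already saw a non-zero byte, the page is known non-zero;
 * otherwise the CPU only has to scan the tail it did not cover. */
static bool zero_check_fallback(const uint8_t *buf, size_t len,
                                size_t bytes_completed,
                                uint32_t partial_result)
{
    if (bytes_completed != 0 && partial_result != 0) {
        return false;   /* device already proved the page non-zero */
    }
    return tail_is_zero(buf + bytes_completed, len - bytes_completed);
}
```

Resuming at bytes_completed instead of rescanning from offset zero is the point of the fallback: a page fault mid-buffer does not discard the work the device already did.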

Signed-off-by: Hao Xiang 
Signed-off-by: Bryan Zhang 
---
 include/qemu/dsa.h |  14 +
 util/dsa.c | 153 -
 2 files changed, 164 insertions(+), 3 deletions(-)

diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
index b10e7b8fb7..3f8ee07004 100644
--- a/include/qemu/dsa.h
+++ b/include/qemu/dsa.h
@@ -65,6 +65,20 @@ void buffer_zero_batch_task_init(struct buffer_zero_batch_task *task,
  */
 void buffer_zero_batch_task_destroy(struct buffer_zero_batch_task *task);
 
+/**
+ * @brief Performs buffer zero comparison on a DSA batch task asynchronously.
+ *
+ * @param batch_task A pointer to the batch task.
+ * @param buf An array of memory buffers.
+ * @param count The number of buffers in the array.
+ * @param len The buffer length.
+ *
+ * @return Zero if successful, otherwise non-zero.
+ */
+int
+buffer_is_zero_dsa_batch_async(struct buffer_zero_batch_task *batch_task,
+   const void **buf, size_t count, size_t len);
+
 /**
  * @brief Initializes DSA devices.
  *
diff --git a/util/dsa.c b/util/dsa.c
index 3cc017b8a0..06c6fbf2ca 100644
--- a/util/dsa.c
+++ b/util/dsa.c
@@ -470,6 +470,41 @@ poll_completion(struct dsa_completion_record *completion,
 return 0;
 }
 
+/**
+ * @brief Use CPU to complete a single zero page checking task.
+ *
+ * @param task A pointer to the task.
+ */
+static void
+task_cpu_fallback(struct buffer_zero_batch_task *task)
+{
+assert(task->task_type == DSA_TASK);
+
+struct dsa_completion_record *completion = &task->completions[0];
+const uint8_t *buf;
+size_t len;
+
+if (completion->status == DSA_COMP_SUCCESS) {
+return;
+}
+
+/*
+ * DSA was able to partially complete the operation. Check the
+ * result. If we already know this is not a zero page, we can
+ * return now.
+ */
+if (completion->bytes_completed != 0 && completion->result != 0) {
+task->results[0] = false;
+return;
+}
+
+/* Let's fallback to use CPU to complete it. */
+buf = (const uint8_t *)task->descriptors[0].src_addr;
+len = task->descriptors[0].xfer_size;
+task->results[0] = buffer_is_zero(buf + completion->bytes_completed,
+  len - completion->bytes_completed);
+}
+
 /**
  * @brief Complete a single DSA task in the batch task.
  *
@@ -548,6 +583,62 @@ poll_batch_task_completion(struct buffer_zero_batch_task *batch_task)
 }
 }
 
+/**
+ * @brief Use CPU to complete the zero page checking batch task.
+ *
+ * @param batch_task A pointer to the batch task.
+ */
+static void
+batch_task_cpu_fallback(struct buffer_zero_batch_task *batch_task)
+{
+assert(batch_task->task_type == DSA_BATCH_TASK);
+
+struct dsa_completion_record *batch_completion =
+&batch_task->batch_completion;
+struct dsa_completion_record *completion;
+uint8_t status;
+const uint8_t *buf;
+size_t len;
+bool *results = batch_task->results;
+uint32_t count = batch_task->batch_descriptor.desc_count;
+
+// DSA is able to complete the entire batch task.
+if (batch_completion->status == DSA_COMP_SUCCESS) {
+assert(count == batch_completion->bytes_completed);
+return;
+}
+
+/*
+ * DSA encounters some error and is not able to complete
+ * the entire batch task. Use CPU fallback.
+ */
+for (int i = 0; i < count; i++) {
+completion = &batch_task->completions[i];
+status = completion->status;
+if (status == DSA_COMP_SUCCESS) {
+continue;
+}
+assert(status == DSA_COMP_PAGE_FAULT_NOBOF);
+
+/*
+ * DSA was able to partially complete the operation. Check the
+ * result. If we already know this is not a zero page, we can
+ * return now.
+ */
+if (completion->bytes_completed != 0 && completion->result != 0) {
+results[i] = false;
+continue;
+}
+
+/* Let's fallback to use CPU to complete it. */
+buf = (uint8_t *)batch_task->descriptors[i].src_addr;
+len = batch_task->descriptors[i].xfer_size;
+results[i] =
+buffer_is_zero(buf + completion->bytes_completed,
+   len - completion->bytes_completed);
+}
+}
+
 /**
  * @brief Handles an asynchronous DSA batch task completion.
  *
@@ -825,7 +916,6 @@ buffer_zero_batch_task_set(struct buffer_zero_batch_task *batch_task,
  *
  * @return int Zero if successful, otherwise an appropriate error code.
  */
-__attribute__((unused))
 static int
 buffer_zero_dsa_async(struct buffer_zero_batch_task *task,
   const void *buf, size_t len)
@@ -844,7 +934,6 @@ buffer_zero_dsa

[PATCH v2 13/20] migration/multifd: Prepare to introduce DSA acceleration on the multifd path.

2023-11-13 Thread Hao Xiang
1. Refactor the multifd_send_thread function.
2. Implement buffer_is_zero_use_cpu to handle CPU-based zero page
checking.
3. Introduce the batch task structure in MultiFDSendParams.

Signed-off-by: Hao Xiang 
---
 migration/multifd.c | 82 -
 migration/multifd.h |  3 ++
 2 files changed, 70 insertions(+), 15 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 1198ffde9c..68ab97f918 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -14,6 +14,8 @@
 #include "qemu/cutils.h"
 #include "qemu/rcu.h"
 #include "qemu/cutils.h"
+#include "qemu/dsa.h"
+#include "qemu/memalign.h"
 #include "exec/target_page.h"
 #include "sysemu/sysemu.h"
 #include "exec/ramblock.h"
@@ -574,6 +576,11 @@ void multifd_save_cleanup(void)
 p->name = NULL;
 multifd_pages_clear(p->pages);
 p->pages = NULL;
+g_free(p->addr);
+p->addr = NULL;
+buffer_zero_batch_task_destroy(p->batch_task);
+qemu_vfree(p->batch_task);
+p->batch_task = NULL;
 p->packet_len = 0;
 g_free(p->packet);
 p->packet = NULL;
@@ -678,13 +685,66 @@ int multifd_send_sync_main(QEMUFile *f)
 return 0;
 }
 
+static void set_page(MultiFDSendParams *p, bool zero_page, uint64_t offset)
+{
+RAMBlock *rb = p->pages->block;
+if (zero_page) {
+p->zero[p->zero_num] = offset;
+p->zero_num++;
+ram_release_page(rb->idstr, offset);
+} else {
+p->normal[p->normal_num] = offset;
+p->normal_num++;
+}
+}
+
+static void buffer_is_zero_use_cpu(MultiFDSendParams *p)
+{
+const void **buf = (const void **)p->addr;
+assert(!migrate_use_main_zero_page());
+
+for (int i = 0; i < p->pages->num; i++) {
+p->batch_task->results[i] = buffer_is_zero(buf[i], p->page_size);
+}
+}
+
+static void set_normal_pages(MultiFDSendParams *p)
+{
+for (int i = 0; i < p->pages->num; i++) {
+p->batch_task->results[i] = false;
+}
+}
+
+static void multifd_zero_page_check(MultiFDSendParams *p)
+{
+/* Older QEMU doesn't understand zero pages on the multifd channel. */
+bool use_multifd_zero_page = !migrate_use_main_zero_page();
+
+RAMBlock *rb = p->pages->block;
+
+for (int i = 0; i < p->pages->num; i++) {
+p->addr[i] = (ram_addr_t)(rb->host + p->pages->offset[i]);
+}
+
+if (use_multifd_zero_page) {
+buffer_is_zero_use_cpu(p);
+} else {
+/* No zero page checking. All pages are normal pages. */
+set_normal_pages(p);
+}
+
+for (int i = 0; i < p->pages->num; i++) {
+uint64_t offset = p->pages->offset[i];
+bool zero_page = p->batch_task->results[i];
+set_page(p, zero_page, offset);
+}
+}
+
 static void *multifd_send_thread(void *opaque)
 {
 MultiFDSendParams *p = opaque;
 MigrationThread *thread = NULL;
 Error *local_err = NULL;
-/* qemu older than 8.2 don't understand zero page on multifd channel */
-bool use_multifd_zero_page = !migrate_use_main_zero_page();
 int ret = 0;
 bool use_zero_copy_send = migrate_zero_copy_send();
 
@@ -710,7 +770,6 @@ static void *multifd_send_thread(void *opaque)
 qemu_mutex_lock(&p->mutex);
 
 if (p->pending_job) {
-RAMBlock *rb = p->pages->block;
 uint64_t packet_num = p->packet_num;
 uint32_t flags;
 
@@ -723,18 +782,7 @@ static void *multifd_send_thread(void *opaque)
 p->iovs_num = 1;
 }
 
-for (int i = 0; i < p->pages->num; i++) {
-uint64_t offset = p->pages->offset[i];
-if (use_multifd_zero_page &&
-buffer_is_zero(rb->host + offset, p->page_size)) {
-p->zero[p->zero_num] = offset;
-p->zero_num++;
-ram_release_page(rb->idstr, offset);
-} else {
-p->normal[p->normal_num] = offset;
-p->normal_num++;
-}
-}
+multifd_zero_page_check(p);
 
 if (p->normal_num) {
 ret = multifd_send_state->ops->send_prepare(p, &local_err);
@@ -976,6 +1024,10 @@ int multifd_save_setup(Error **errp)
 p->pending_job = 0;
 p->id = i;
 p->pages = multifd_pages_init(page_count);
+p->addr = g_new0(ram_addr_t, page_count);
+p->batch_task =
+(struct buffer_zero_batch_task *)qemu_memalign(64, sizeof(*p->batch_task));
+buffer_zero_batch_task_init(p->batch_task, page_count);
 p->packet_len = sizeof(MultiFDPacket_t)
   + sizeof(uint64_t) * page_count;
 p->packet = g_malloc0(p->packet_len);
diff --git a/migration/multifd.h b/migration/multifd.h
index 13762900d4..62f31b03c0 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -119,6 +119,9 @@ typedef struct {
  * pending_job != 0 -> multifd_channel can use it.
  */
 MultiFDPa

[PATCH v2 01/20] multifd: Add capability to enable/disable zero_page

2023-11-13 Thread Hao Xiang
From: Juan Quintela 

We have to enable it by default until we introduce the new code.

Signed-off-by: Juan Quintela 
---
 migration/options.c | 13 +
 migration/options.h |  1 +
 qapi/migration.json |  8 +++-
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/migration/options.c b/migration/options.c
index 8d8ec73ad9..00c0c4a0d6 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -204,6 +204,8 @@ Property migration_properties[] = {
 DEFINE_PROP_MIG_CAP("x-switchover-ack",
 MIGRATION_CAPABILITY_SWITCHOVER_ACK),
 DEFINE_PROP_MIG_CAP("x-dirty-limit", MIGRATION_CAPABILITY_DIRTY_LIMIT),
+DEFINE_PROP_MIG_CAP("main-zero-page",
+MIGRATION_CAPABILITY_MAIN_ZERO_PAGE),
 DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -284,6 +286,17 @@ bool migrate_multifd(void)
 return s->capabilities[MIGRATION_CAPABILITY_MULTIFD];
 }
 
+bool migrate_use_main_zero_page(void)
+{
+//MigrationState *s;
+
+//s = migrate_get_current();
+
+// We will enable this when we add the right code.
+// return s->enabled_capabilities[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE];
+return true;
+}
+
 bool migrate_pause_before_switchover(void)
 {
 MigrationState *s = migrate_get_current();
diff --git a/migration/options.h b/migration/options.h
index 246c160aee..c901eb57c6 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -88,6 +88,7 @@ int migrate_multifd_channels(void);
 MultiFDCompression migrate_multifd_compression(void);
 int migrate_multifd_zlib_level(void);
 int migrate_multifd_zstd_level(void);
+bool migrate_use_main_zero_page(void);
 uint8_t migrate_throttle_trigger_threshold(void);
 const char *migrate_tls_authz(void);
 const char *migrate_tls_creds(void);
diff --git a/qapi/migration.json b/qapi/migration.json
index 975761eebd..09e4393591 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -531,6 +531,12 @@
 # and can result in more stable read performance.  Requires KVM
 # with accelerator property "dirty-ring-size" set.  (Since 8.1)
 #
+#
+# @main-zero-page: If enabled, the detection of zero pages will be
+#  done on the main thread.  Otherwise it is done on
+#  the multifd threads.
+#  (since 8.2)
+#
 # Features:
 #
 # @deprecated: Member @block is deprecated.  Use blockdev-mirror with
@@ -555,7 +561,7 @@
{ 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
'validate-uuid', 'background-snapshot',
'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
-   'dirty-limit'] }
+   'dirty-limit', 'main-zero-page'] }
 
 ##
 # @MigrationCapabilityStatus:
-- 
2.30.2




[PATCH v2 12/20] migration/multifd: Add new migration option for multifd DSA offloading.

2023-11-13 Thread Hao Xiang
Intel DSA offloading is an optional feature that turns on if the
proper hardware and software stack is available. To turn on
DSA offloading in multifd live migration:

multifd-dsa-accel="[dsa_dev_path1] [dsa_dev_path2] ... [dsa_dev_pathX]"

This feature is turned off by default.
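For example, assuming two DSA work queues have already been configured through the idxd driver (the device paths below are purely illustrative), the parameter can be set from the HMP monitor before starting the migration:

```shell
# Source-side QEMU monitor; /dev/dsa/wq0.0 and /dev/dsa/wq1.0 are
# example user-space-openable work queues, not fixed names.
(qemu) migrate_set_capability multifd on
(qemu) migrate_set_parameter multifd-dsa-accel "/dev/dsa/wq0.0 /dev/dsa/wq1.0"
```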

Signed-off-by: Hao Xiang 
---
 migration/migration-hmp-cmds.c |  8 
 migration/options.c| 28 
 migration/options.h|  1 +
 qapi/migration.json| 17 ++---
 scripts/meson-buildoptions.sh  |  6 +++---
 5 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 86ae832176..d9451744dd 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -353,6 +353,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
 monitor_printf(mon, "%s: '%s'\n",
 MigrationParameter_str(MIGRATION_PARAMETER_TLS_AUTHZ),
 params->tls_authz);
+monitor_printf(mon, "%s: %s\n",
+MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_DSA_ACCEL),
+params->multifd_dsa_accel);
 
 if (params->has_block_bitmap_mapping) {
 const BitmapMigrationNodeAliasList *bmnal;
@@ -615,6 +618,11 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
 p->has_block_incremental = true;
 visit_type_bool(v, param, &p->block_incremental, &err);
 break;
+case MIGRATION_PARAMETER_MULTIFD_DSA_ACCEL:
+p->multifd_dsa_accel = g_new0(StrOrNull, 1);
+p->multifd_dsa_accel->type = QTYPE_QSTRING;
+visit_type_str(v, param, &p->multifd_dsa_accel->u.s, &err);
+break;
 case MIGRATION_PARAMETER_MULTIFD_CHANNELS:
 p->has_multifd_channels = true;
 visit_type_uint8(v, param, &p->multifd_channels, &err);
diff --git a/migration/options.c b/migration/options.c
index 97d121d4d7..6e424b5d63 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -179,6 +179,8 @@ Property migration_properties[] = {
 DEFINE_PROP_MIG_MODE("mode", MigrationState,
   parameters.mode,
   MIG_MODE_NORMAL),
+DEFINE_PROP_STRING("multifd-dsa-accel", MigrationState,
+   parameters.multifd_dsa_accel),
 
 /* Migration capabilities */
 DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
@@ -901,6 +903,13 @@ const char *migrate_tls_creds(void)
 return s->parameters.tls_creds;
 }
 
+const char *migrate_multifd_dsa_accel(void)
+{
+MigrationState *s = migrate_get_current();
+
+return s->parameters.multifd_dsa_accel;
+}
+
 const char *migrate_tls_hostname(void)
 {
 MigrationState *s = migrate_get_current();
@@ -1025,6 +1034,7 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
 params->vcpu_dirty_limit = s->parameters.vcpu_dirty_limit;
 params->has_mode = true;
 params->mode = s->parameters.mode;
+params->multifd_dsa_accel = s->parameters.multifd_dsa_accel;
 
 return params;
 }
@@ -1033,6 +1043,7 @@ void migrate_params_init(MigrationParameters *params)
 {
 params->tls_hostname = g_strdup("");
 params->tls_creds = g_strdup("");
+params->multifd_dsa_accel = g_strdup("");
 
 /* Set has_* up only for parameter checks */
 params->has_compress_level = true;
@@ -1362,6 +1373,11 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
 if (params->has_mode) {
 dest->mode = params->mode;
 }
+
+if (params->multifd_dsa_accel) {
+assert(params->multifd_dsa_accel->type == QTYPE_QSTRING);
+dest->multifd_dsa_accel = params->multifd_dsa_accel->u.s;
+}
 }
 
 static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
@@ -1506,6 +1522,12 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
 if (params->has_mode) {
 s->parameters.mode = params->mode;
 }
+
+if (params->multifd_dsa_accel) {
+g_free(s->parameters.multifd_dsa_accel);
+assert(params->multifd_dsa_accel->type == QTYPE_QSTRING);
+s->parameters.multifd_dsa_accel = g_strdup(params->multifd_dsa_accel->u.s);
+}
 }
 
 void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
@@ -1531,6 +1553,12 @@ void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
 params->tls_authz->type = QTYPE_QSTRING;
 params->tls_authz->u.s = strdup("");
 }
+if (params->multifd_dsa_accel
+&& params->multifd_dsa_accel->type == QTYPE_QNULL) {
+qobject_unref(params->multifd_dsa_accel->u.n);
+params->multifd_dsa_accel->type = QTYPE_QSTRING;
+params->multifd_dsa_accel->u.s = strdup("");
+}
 
 migrate_params_test_apply(params, &tmp);
 
diff --git a/migration/options.h b/migration/options.h
index c901eb57c6..56100961a9 100644
--- a/migration/options.h
+++ b/migration/op

[PATCH v2 03/20] multifd: Zero pages transmission

2023-11-13 Thread Hao Xiang
From: Juan Quintela 

This implements the zero page detection and handling.

Signed-off-by: Juan Quintela 
---
 migration/multifd.c | 41 +++--
 migration/multifd.h |  5 +
 2 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index d28ef0028b..1b994790d5 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -11,6 +11,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/cutils.h"
 #include "qemu/rcu.h"
 #include "exec/target_page.h"
 #include "sysemu/sysemu.h"
@@ -279,6 +280,12 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
 
 packet->offset[i] = cpu_to_be64(temp);
 }
+for (i = 0; i < p->zero_num; i++) {
+/* there are architectures where ram_addr_t is 32 bit */
+uint64_t temp = p->zero[i];
+
+packet->offset[p->normal_num + i] = cpu_to_be64(temp);
+}
 }
 
 static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
@@ -361,6 +368,18 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
 p->normal[i] = offset;
 }
 
+for (i = 0; i < p->zero_num; i++) {
+uint64_t offset = be64_to_cpu(packet->offset[p->normal_num + i]);
+
+if (offset > (p->block->used_length - p->page_size)) {
+error_setg(errp, "multifd: offset too long %" PRIu64
+   " (max " RAM_ADDR_FMT ")",
+   offset, p->block->used_length);
+return -1;
+}
+p->zero[i] = offset;
+}
+
 return 0;
 }
 
@@ -664,6 +683,8 @@ static void *multifd_send_thread(void *opaque)
 MultiFDSendParams *p = opaque;
 MigrationThread *thread = NULL;
 Error *local_err = NULL;
+/* qemu older than 8.2 don't understand zero page on multifd channel */
+bool use_zero_page = !migrate_use_main_zero_page();
 int ret = 0;
 bool use_zero_copy_send = migrate_zero_copy_send();
 
@@ -689,6 +710,7 @@ static void *multifd_send_thread(void *opaque)
 qemu_mutex_lock(&p->mutex);
 
 if (p->pending_job) {
+RAMBlock *rb = p->pages->block;
 uint64_t packet_num = p->packet_num;
 uint32_t flags;
 p->normal_num = 0;
@@ -701,8 +723,16 @@ static void *multifd_send_thread(void *opaque)
 }
 
 for (int i = 0; i < p->pages->num; i++) {
-p->normal[p->normal_num] = p->pages->offset[i];
-p->normal_num++;
+uint64_t offset = p->pages->offset[i];
+if (use_zero_page &&
+buffer_is_zero(rb->host + offset, p->page_size)) {
+p->zero[p->zero_num] = offset;
+p->zero_num++;
+ram_release_page(rb->idstr, offset);
+} else {
+p->normal[p->normal_num] = offset;
+p->normal_num++;
+}
 }
 
 if (p->normal_num) {
@@ -1156,6 +1186,13 @@ static void *multifd_recv_thread(void *opaque)
 }
 }
 
+for (int i = 0; i < p->zero_num; i++) {
+void *page = p->host + p->zero[i];
+if (!buffer_is_zero(page, p->page_size)) {
+memset(page, 0, p->page_size);
+}
+}
+
 if (flags & MULTIFD_FLAG_SYNC) {
 qemu_sem_post(&multifd_recv_state->sem_sync);
 qemu_sem_wait(&p->sem_sync);
diff --git a/migration/multifd.h b/migration/multifd.h
index d587b0e19c..13762900d4 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -53,6 +53,11 @@ typedef struct {
 uint32_t unused32[1];/* Reserved for future use */
 uint64_t unused64[3];/* Reserved for future use */
 char ramblock[256];
+/*
+ * This array contains the pointers to:
+ *  - normal pages (initial normal_pages entries)
+ *  - zero pages (following zero_pages entries)
+ */
 uint64_t offset[];
 } __attribute__((packed)) MultiFDPacket_t;
 
-- 
2.30.2




[PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration.

2023-11-13 Thread Hao Xiang
v2
* Rebase on top of 3e01f1147a16ca566694b97eafc941d62fa1e8d8.
* Leave Juan's changes in their original form instead of squashing them.
* Add a new commit to refactor the multifd_send_thread function to prepare for 
introducing the DSA offload functionality.
* Use page count to configure multifd-packet-size option.
* Don't use the FLAKY flag in DSA tests.
* Test if the DSA integration test is set up correctly and skip it if not.
* Fixed broken link in the previous patch cover.

* Background:

I posted an RFC about DSA offloading in QEMU:
https://patchew.org/QEMU/20230529182001.2232069-1-hao.xi...@bytedance.com/

This patchset implements the DSA offloading on zero page checking in
multifd live migration code path.

* Overview:

Intel Data Streaming Accelerator (DSA) is introduced in Intel's 4th generation
Xeon server, aka Sapphire Rapids.
https://cdrdv2-public.intel.com/671116/341204-intel-data-streaming-accelerator-spec.pdf
https://www.intel.com/content/www/us/en/content-details/759709/intel-data-streaming-accelerator-user-guide.html
One of the things DSA can do is to offload memory comparison workload from
CPU to DSA accelerator hardware. This patchset implements a solution to offload
QEMU's zero page checking from CPU to DSA accelerator hardware. We gain
two benefits from this change:
1. Reduces CPU usage in multifd live migration workflow across all use
cases.
2. Reduces migration total time in some use cases. 

* Design:

These are the logical steps to perform DSA offloading:
1. Configure DSA accelerators and create user space openable DSA work
queues via the idxd driver.
2. Map DSA's work queue into a user space address space.
3. Fill an in-memory task descriptor to describe the memory operation.
4. Use dedicated CPU instruction _enqcmd to queue a task descriptor to
the work queue.
5. Poll the task descriptor's completion status field until the task
completes.
6. Check return status.
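The six steps above can be sketched roughly as follows. Everything here is simulated: real code maps an idxd work queue, fills spec-defined descriptors, and submits with _enqcmd, so the structures and helper names below are stand-ins, not QEMU or idxd API.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated completion record; the real layout is defined by the
 * Intel DSA specification. */
enum { COMP_PENDING = 0, COMP_SUCCESS = 1 };

struct fake_completion {
    volatile uint8_t status;     /* written by the "device" on completion */
};

/* Simulated in-memory task descriptor (step 3). */
struct fake_desc {
    const void *src;
    uint32_t xfer_size;
    struct fake_completion *comp;
};

/* Step 4 stand-in: a real implementation issues _enqcmd to the mapped
 * work queue; this fake "device" completes the task immediately. */
static void fake_submit(struct fake_desc *d)
{
    d->comp->status = COMP_SUCCESS;
}

/* Step 5: busy-poll the completion record's status field. On x86 a
 * _mm_pause() belongs inside the loop to relax the polling core. */
static uint8_t poll_status(struct fake_completion *c)
{
    while (c->status == COMP_PENDING) {
        /* _mm_pause(); */
    }
    return c->status;
}
```

After polling returns (step 6), the caller inspects the status: on success it reads the result; on a partial completion such as a page fault, this series falls back to buffer_is_zero() on the CPU.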

The memory operation is now entirely done by the accelerator hardware, but
the new workflow introduces overheads: the extra CPU cost of preparing and
submitting the task descriptors, and the extra CPU cost of polling for
completion. The design is built around minimizing these two overheads.

1. In order to reduce the overhead of task preparation and submission,
we use batch descriptors. A batch descriptor contains N individual
zero page checking tasks, where the default N is 128 (default packet size
/ page size), and we can increase N by raising the packet size via a new
migration option.
2. The multifd sender threads prepare and submit batch tasks to the DSA
hardware and wait on a synchronization object for task completion.
Whenever a DSA task is submitted, the task structure is added to a
thread-safe queue. It's safe for multiple multifd sender threads to
submit tasks concurrently.
3. Multiple DSA hardware devices can be used. During multifd initialization,
every sender thread will be assigned a DSA device to work with. We
use a round-robin scheme to evenly distribute the work across all used
DSA devices.
4. Use a dedicated thread, dsa_completion, to perform busy polling for all
DSA task completions. The thread keeps dequeuing DSA tasks from the
thread-safe queue. The thread blocks when there is no outstanding DSA
task. When polling for completion of a DSA task, the thread uses the CPU
instruction _mm_pause between the iterations of the busy loop to save some
CPU power as well as to free up core resources for the sibling hyperthread.
5. The DSA accelerator can encounter errors. The most common error is a
page fault. We have tested using the devices to handle page faults, but
performance is bad. Right now, if DSA hits a page fault, we fall back to
the CPU to complete the rest of the work. The CPU fallback is done in
the multifd sender thread.
6. Added a new migration option multifd-dsa-accel to set the DSA device
path. If set, the multifd workflow will leverage the DSA devices for
offloading.
7. Added a new migration option multifd-normal-page-ratio to make
multifd live migration easier to test. Setting a normal page ratio will
make live migration recognize a zero page as a normal page and send
the entire payload over the network. If we want to send a large network
payload and analyze throughput, this option is useful.
8. Added a new migration option multifd-packet-size. This can increase
the number of pages being zero page checked and sent over the network.
The extra synchronization between the sender threads and the dsa
completion thread is an overhead. Using a large packet size can reduce
that overhead.
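Items 1 and 3 above boil down to simple arithmetic; here is a sketch with illustrative helper names (these are not the series' actual functions):

```c
#include <assert.h>
#include <stddef.h>

/* Item 1: one zero-page-check task per page, so N = packet size / page
 * size; that gives the default N of 128 with a 512KB packet and
 * 4KB pages. */
static size_t batch_task_count(size_t packet_size, size_t page_size)
{
    return packet_size / page_size;
}

/* Item 3: round-robin assignment of the configured DSA devices to the
 * multifd sender threads at initialization time. */
static size_t dsa_device_for_thread(size_t thread_idx, size_t num_devices)
{
    return thread_idx % num_devices;
}
```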

* Performance:

We use two Intel 4th generation Xeon servers for testing.

Architecture:x86_64
CPU(s):  192
Thread(s) per core:  2
Core(s) per socket:  48
Socket(s):   2
NUMA node(s):2
Vendor ID:   GenuineIntel
CPU family:  6
Model:   143
Model name:  Intel(R) Xeon(R) Platinum 8457C
Stepping:8
CPU MHz: 2538.624
CPU max MHz: 3800.

Re: [PATCH v2 0/2] net: Update MemReentrancyGuard for NIC

2023-11-13 Thread Jason Wang
On Thu, Sep 21, 2023 at 3:16 PM Akihiko Odaki  wrote:
>
> On 2023/06/01 12:18, Akihiko Odaki wrote:
> > Recently MemReentrancyGuard was added to DeviceState to record that the
> > device is engaging in I/O. The network device backend needs to update it
> > when delivering a packet to a device.
> >
> > This implementation follows what bottom half does, but it does not add
> > a tracepoint for the case that the network device backend started
> > delivering a packet to a device which is already engaging in I/O. This
> > is because such reentrancy frequently happens for
> > qemu_flush_queued_packets() and is insignificant.
> >
> > This series consists of two patches. The first patch makes a bulk change to
> > add a new parameter to qemu_new_nic() and does not contain behavioral 
> > changes.
> > The second patch actually implements MemReentrancyGuard update.
> >
> > V1 -> V2: Added the 'Fixes: CVE-2023-3019' tag
> >
> > Akihiko Odaki (2):
> >net: Provide MemReentrancyGuard * to qemu_new_nic()
> >net: Update MemReentrancyGuard for NIC
> >
> >   include/net/net.h |  2 ++
> >   hw/net/allwinner-sun8i-emac.c |  3 ++-
> >   hw/net/allwinner_emac.c   |  3 ++-
> >   hw/net/cadence_gem.c  |  3 ++-
> >   hw/net/dp8393x.c  |  3 ++-
> >   hw/net/e1000.c|  3 ++-
> >   hw/net/e1000e.c   |  2 +-
> >   hw/net/eepro100.c |  4 +++-
> >   hw/net/etraxfs_eth.c  |  3 ++-
> >   hw/net/fsl_etsec/etsec.c  |  3 ++-
> >   hw/net/ftgmac100.c|  3 ++-
> >   hw/net/i82596.c   |  2 +-
> >   hw/net/igb.c  |  2 +-
> >   hw/net/imx_fec.c  |  2 +-
> >   hw/net/lan9118.c  |  3 ++-
> >   hw/net/mcf_fec.c  |  3 ++-
> >   hw/net/mipsnet.c  |  3 ++-
> >   hw/net/msf2-emac.c|  3 ++-
> >   hw/net/mv88w8618_eth.c|  3 ++-
> >   hw/net/ne2000-isa.c   |  3 ++-
> >   hw/net/ne2000-pci.c   |  3 ++-
> >   hw/net/npcm7xx_emc.c  |  3 ++-
> >   hw/net/opencores_eth.c|  3 ++-
> >   hw/net/pcnet.c|  3 ++-
> >   hw/net/rocker/rocker_fp.c |  4 ++--
> >   hw/net/rtl8139.c  |  3 ++-
> >   hw/net/smc91c111.c|  3 ++-
> >   hw/net/spapr_llan.c   |  3 ++-
> >   hw/net/stellaris_enet.c   |  3 ++-
> >   hw/net/sungem.c   |  2 +-
> >   hw/net/sunhme.c   |  3 ++-
> >   hw/net/tulip.c|  3 ++-
> >   hw/net/virtio-net.c   |  6 --
> >   hw/net/vmxnet3.c  |  2 +-
> >   hw/net/xen_nic.c  |  4 ++--
> >   hw/net/xgmac.c|  3 ++-
> >   hw/net/xilinx_axienet.c   |  3 ++-
> >   hw/net/xilinx_ethlite.c   |  3 ++-
> >   hw/usb/dev-network.c  |  3 ++-
> >   net/net.c | 15 +++
> >   40 files changed, 90 insertions(+), 41 deletions(-)
> >
>
> Hi Jason,
>
> Can you review this series?

For some reason it falls through the cracks.

I've queued this for rc1.

Thanks

>
> Regards,
> Akihiko Odaki
>




RE: [PATCH v5 04/11] hw/net: Add NPCMXXX GMAC device

2023-11-13 Thread kft...@nuvoton.com


-Original Message-
From: Nabih Estefan 
Sent: Saturday, October 28, 2023 1:55 AM
To: peter.mayd...@linaro.org
Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org; CS20 KFTing 
; wuhao...@google.com; jasonw...@redhat.com; IS20 Avi 
Fishman ; nabiheste...@google.com; CS20 KWLiu 
; IS20 Tomer Maimon ; IN20 Hila 
Miranda-Kuzi 
Subject: [PATCH v5 04/11] hw/net: Add NPCMXXX GMAC device



From: Hao Wu 

This patch implements the basic registers of the GMAC device and sets up
registers for networking functionality.

Tested:
The following message shows up with the change:
Broadcom BCM54612E stmmac-0:00: attached PHY driver [Broadcom BCM54612E] 
(mii_bus:phy_addr=stmmac-0:00, irq=POLL) stmmaceth f0802000.eth eth0: Link is 
Up - 1Gbps/Full - flow control rx/tx

Change-Id: I9d9ad2533eb6b9bdc6979e6d823aea3a4dd052fa
Signed-off-by: Hao Wu 
Signed-off-by: Nabih Estefan Diaz 
---
 hw/net/meson.build |   2 +-
 hw/net/npcm_gmac.c | 395 +
 hw/net/trace-events|  11 ++
 include/hw/net/npcm_gmac.h | 170 
 4 files changed, 577 insertions(+), 1 deletion(-)
 create mode 100644 hw/net/npcm_gmac.c
 create mode 100644 include/hw/net/npcm_gmac.h

diff --git a/hw/net/meson.build b/hw/net/meson.build
index 2632634df3..8389a134d5 100644
--- a/hw/net/meson.build
+++ b/hw/net/meson.build
@@ -38,7 +38,7 @@ system_ss.add(when: 'CONFIG_I82596_COMMON', if_true: files('i82596.c'))
 system_ss.add(when: 'CONFIG_SUNHME', if_true: files('sunhme.c'))
 system_ss.add(when: 'CONFIG_FTGMAC100', if_true: files('ftgmac100.c'))
 system_ss.add(when: 'CONFIG_SUNGEM', if_true: files('sungem.c'))
-system_ss.add(when: 'CONFIG_NPCM7XX', if_true: files('npcm7xx_emc.c'))
+system_ss.add(when: 'CONFIG_NPCM7XX', if_true: files('npcm7xx_emc.c',
+'npcm_gmac.c'))

 system_ss.add(when: 'CONFIG_ETRAXFS', if_true: files('etraxfs_eth.c'))
 system_ss.add(when: 'CONFIG_COLDFIRE', if_true: files('mcf_fec.c')) diff --git 
a/hw/net/npcm_gmac.c b/hw/net/npcm_gmac.c new file mode 100644 index 
00..5ce632858d
--- /dev/null
+++ b/hw/net/npcm_gmac.c
@@ -0,0 +1,395 @@
+/*
+ * Nuvoton NPCM7xx/8xx GMAC Module
+ *
+ * Copyright 2022 Google LLC
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
+ * for more details.
+ *
+ * Unsupported/unimplemented features:
+ * - MII is not implemented, MII_ADDR.BUSY and MII_DATA always return zero
+ * - Precision timestamp (PTP) is not implemented.
+ */
+
+#include "qemu/osdep.h"
+
+#include "hw/registerfields.h"
+#include "hw/net/mii.h"
+#include "hw/net/npcm_gmac.h"
+#include "migration/vmstate.h"
+#include "qemu/log.h"
+#include "qemu/units.h"
+#include "sysemu/dma.h"
+#include "trace.h"
+
+REG32(NPCM_DMA_BUS_MODE, 0x1000)
+REG32(NPCM_DMA_XMT_POLL_DEMAND, 0x1004)
+REG32(NPCM_DMA_RCV_POLL_DEMAND, 0x1008)
+REG32(NPCM_DMA_RCV_BASE_ADDR, 0x100c)
+REG32(NPCM_DMA_TX_BASE_ADDR, 0x1010)
+REG32(NPCM_DMA_STATUS, 0x1014)
+REG32(NPCM_DMA_CONTROL, 0x1018)
+REG32(NPCM_DMA_INTR_ENA, 0x101c)
+REG32(NPCM_DMA_MISSED_FRAME_CTR, 0x1020)
+REG32(NPCM_DMA_HOST_TX_DESC, 0x1048)
+REG32(NPCM_DMA_HOST_RX_DESC, 0x104c)
+REG32(NPCM_DMA_CUR_TX_BUF_ADDR, 0x1050)
+REG32(NPCM_DMA_CUR_RX_BUF_ADDR, 0x1054)
+REG32(NPCM_DMA_HW_FEATURE, 0x1058)
+
+REG32(NPCM_GMAC_MAC_CONFIG, 0x0)
+REG32(NPCM_GMAC_FRAME_FILTER, 0x4)
+REG32(NPCM_GMAC_HASH_HIGH, 0x8)
+REG32(NPCM_GMAC_HASH_LOW, 0xc)
+REG32(NPCM_GMAC_MII_ADDR, 0x10)
+REG32(NPCM_GMAC_MII_DATA, 0x14)
+REG32(NPCM_GMAC_FLOW_CTRL, 0x18)
+REG32(NPCM_GMAC_VLAN_FLAG, 0x1c)
+REG32(NPCM_GMAC_VERSION, 0x20)
+REG32(NPCM_GMAC_WAKEUP_FILTER, 0x28)
+REG32(NPCM_GMAC_PMT, 0x2c)
+REG32(NPCM_GMAC_LPI_CTRL, 0x30)
+REG32(NPCM_GMAC_TIMER_CTRL, 0x34)
+REG32(NPCM_GMAC_INT_STATUS, 0x38)
+REG32(NPCM_GMAC_INT_MASK, 0x3c)
+REG32(NPCM_GMAC_MAC0_ADDR_HI, 0x40)
+REG32(NPCM_GMAC_MAC0_ADDR_LO, 0x44)
+REG32(NPCM_GMAC_MAC1_ADDR_HI, 0x48)
+REG32(NPCM_GMAC_MAC1_ADDR_LO, 0x4c)
+REG32(NPCM_GMAC_MAC2_ADDR_HI, 0x50)
+REG32(NPCM_GMAC_MAC2_ADDR_LO, 0x54)
+REG32(NPCM_GMAC_MAC3_ADDR_HI, 0x58)
+REG32(NPCM_GMAC_MAC3_ADDR_LO, 0x5c)
+REG32(NPCM_GMAC_RGMII_STATUS, 0xd8)
+REG32(NPCM_GMAC_WATCHDOG, 0xdc)
+REG32(NPCM_GMAC_PTP_TCR, 0x700)
+REG32(NPCM_GMAC_PTP_SSIR, 0x704)
+REG32(NPCM_GMAC_PTP_STSR, 0x708)
+REG32(NPCM_GMAC_PTP_STNSR, 0x70c)
+REG32(NPCM_GMAC_PTP_STSUR, 0x710)
+REG32(NPCM_GMAC_PTP_STNSUR, 0x714)
+REG32(NPCM_GMAC_PTP_TAR, 0x718)
+REG32(NPCM_GMAC_PTP_TTSR, 0x71c)
+
+/* Register Fields */
+#define NPCM_GMAC_MII_ADDR_BUSY BIT(0)
+#define NPCM_G

[PULL 2/2] igb: Add Function Level Reset to PF and VF

2023-11-13 Thread Jason Wang
From: Cédric Le Goater 

The Intel 82576EB GbE Controller says that the Physical and Virtual
Functions support Function Level Reset. Add the capability to the PF
device model using device property "x-pcie-flr-init" which is "on" by
default and "off" for machines <= 8.1 to preserve compatibility.

The FLR capability of the VF model is defined according to the FLR
property of the PF, this to avoid adding an extra compatibility
property.

Cc: Sriram Yagnaraman 
Fixes: 3a977deebe6b ("Intrdocue igb device emulation")
Reviewed-by: Akihiko Odaki 
Tested-by: Akihiko Odaki 
Signed-off-by: Cédric Le Goater 
Signed-off-by: Jason Wang 
---
 hw/core/machine.c | 3 ++-
 hw/net/igb.c  | 9 +
 hw/net/igbvf.c| 9 +
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 50edaab..0c17398 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -35,7 +35,8 @@
 GlobalProperty hw_compat_8_1[] = {
 { TYPE_PCI_BRIDGE, "x-pci-express-writeable-slt-bug", "true" },
 { "ramfb", "x-migrate", "off" },
-{ "vfio-pci-nohotplug", "x-ramfb-migrate", "off" }
+{ "vfio-pci-nohotplug", "x-ramfb-migrate", "off" },
+{ "igb", "x-pcie-flr-init", "off" },
 };
 const size_t hw_compat_8_1_len = G_N_ELEMENTS(hw_compat_8_1);
 
diff --git a/hw/net/igb.c b/hw/net/igb.c
index e70a66e..dfb722b 100644
--- a/hw/net/igb.c
+++ b/hw/net/igb.c
@@ -78,6 +78,7 @@ struct IGBState {
 uint32_t ioaddr;
 
 IGBCore core;
+bool has_flr;
 };
 
 #define IGB_CAP_SRIOV_OFFSET(0x160)
@@ -101,6 +102,9 @@ static void igb_write_config(PCIDevice *dev, uint32_t addr,
 
 trace_igb_write_config(addr, val, len);
 pci_default_write_config(dev, addr, val, len);
+if (s->has_flr) {
+pcie_cap_flr_write_config(dev, addr, val, len);
+}
 
 if (range_covers_byte(addr, len, PCI_COMMAND) &&
 (dev->config[PCI_COMMAND] & PCI_COMMAND_MASTER)) {
@@ -433,6 +437,10 @@ static void igb_pci_realize(PCIDevice *pci_dev, Error **errp)
 }
 
 /* PCIe extended capabilities (in order) */
+if (s->has_flr) {
+pcie_cap_flr_init(pci_dev);
+}
+
 if (pcie_aer_init(pci_dev, 1, 0x100, 0x40, errp) < 0) {
 hw_error("Failed to initialize AER capability");
 }
@@ -588,6 +596,7 @@ static const VMStateDescription igb_vmstate = {
 
 static Property igb_properties[] = {
 DEFINE_NIC_PROPERTIES(IGBState, conf),
+DEFINE_PROP_BOOL("x-pcie-flr-init", IGBState, has_flr, true),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/net/igbvf.c b/hw/net/igbvf.c
index 07343fa..94a4e88 100644
--- a/hw/net/igbvf.c
+++ b/hw/net/igbvf.c
@@ -204,6 +204,10 @@ static void igbvf_write_config(PCIDevice *dev, uint32_t addr, uint32_t val,
 {
 trace_igbvf_write_config(addr, val, len);
 pci_default_write_config(dev, addr, val, len);
+if (object_property_get_bool(OBJECT(pcie_sriov_get_pf(dev)),
+ "x-pcie-flr-init", &error_abort)) {
+pcie_cap_flr_write_config(dev, addr, val, len);
+}
 }
 
 static uint64_t igbvf_mmio_read(void *opaque, hwaddr addr, unsigned size)
@@ -266,6 +270,11 @@ static void igbvf_pci_realize(PCIDevice *dev, Error **errp)
 hw_error("Failed to initialize PCIe capability");
 }
 
+if (object_property_get_bool(OBJECT(pcie_sriov_get_pf(dev)),
+ "x-pcie-flr-init", &error_abort)) {
+pcie_cap_flr_init(dev);
+}
+
 if (pcie_aer_init(dev, 1, 0x100, 0x40, errp) < 0) {
 hw_error("Failed to initialize AER capability");
 }
-- 
2.7.4




[PULL 0/2] Net patches

2023-11-13 Thread Jason Wang
The following changes since commit 69680740eafa1838527c90155a7432d51b8ff203:

  Merge tag 'qdev-array-prop' of https://repo.or.cz/qemu/kevin into staging 
(2023-11-11 11:23:25 +0800)

are available in the git repository at:

  https://github.com/jasowang/qemu.git tags/net-pull-request

for you to fetch changes up to d90014fc337ab77f37285b1a30fd4f545056be0a:

  igb: Add Function Level Reset to PF and VF (2023-11-13 15:33:37 +0800)




Cédric Le Goater (2):
  igb: Add a VF reset handler
  igb: Add Function Level Reset to PF and VF

 hw/core/machine.c   |  3 ++-
 hw/net/igb.c| 15 +++
 hw/net/igb_common.h |  1 +
 hw/net/igb_core.c   |  6 --
 hw/net/igb_core.h   |  3 +++
 hw/net/igbvf.c  | 19 +++
 hw/net/trace-events |  1 +
 7 files changed, 45 insertions(+), 3 deletions(-)





[PULL 1/2] igb: Add a VF reset handler

2023-11-13 Thread Jason Wang
From: Cédric Le Goater 

Export the igb_vf_reset() helper routine from the PF model to let the
IGBVF model implement its own device reset.

Cc: Akihiko Odaki 
Suggested-by: Sriram Yagnaraman 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Cédric Le Goater 
Signed-off-by: Jason Wang 
---
 hw/net/igb.c|  6 ++
 hw/net/igb_common.h |  1 +
 hw/net/igb_core.c   |  6 --
 hw/net/igb_core.h   |  3 +++
 hw/net/igbvf.c  | 10 ++
 hw/net/trace-events |  1 +
 6 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/hw/net/igb.c b/hw/net/igb.c
index 8ff832a..e70a66e 100644
--- a/hw/net/igb.c
+++ b/hw/net/igb.c
@@ -122,6 +122,12 @@ igb_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
 igb_core_write(&s->core, addr, val, size);
 }
 
+void igb_vf_reset(void *opaque, uint16_t vfn)
+{
+IGBState *s = opaque;
+igb_core_vf_reset(&s->core, vfn);
+}
+
 static bool
 igb_io_get_reg_index(IGBState *s, uint32_t *idx)
 {
diff --git a/hw/net/igb_common.h b/hw/net/igb_common.h
index 5c261ba..b316a5b 100644
--- a/hw/net/igb_common.h
+++ b/hw/net/igb_common.h
@@ -152,5 +152,6 @@ enum {
 
 uint64_t igb_mmio_read(void *opaque, hwaddr addr, unsigned size);
 void igb_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size);
+void igb_vf_reset(void *opaque, uint16_t vfn);
 
 #endif
diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c
index f6a5e23..2a7a11a 100644
--- a/hw/net/igb_core.c
+++ b/hw/net/igb_core.c
@@ -2477,11 +2477,13 @@ static void igb_set_vfmailbox(IGBCore *core, int index, uint32_t val)
 }
 }
 
-static void igb_vf_reset(IGBCore *core, uint16_t vfn)
+void igb_core_vf_reset(IGBCore *core, uint16_t vfn)
 {
 uint16_t qn0 = vfn;
 uint16_t qn1 = vfn + IGB_NUM_VM_POOLS;
 
+trace_igb_core_vf_reset(vfn);
+
 /* disable Rx and Tx for the VF*/
 core->mac[RXDCTL0 + (qn0 * 16)] &= ~E1000_RXDCTL_QUEUE_ENABLE;
 core->mac[RXDCTL0 + (qn1 * 16)] &= ~E1000_RXDCTL_QUEUE_ENABLE;
@@ -2560,7 +2562,7 @@ static void igb_set_vtctrl(IGBCore *core, int index, uint32_t val)
 
 if (val & E1000_CTRL_RST) {
 vfn = (index - PVTCTRL0) / 0x40;
-igb_vf_reset(core, vfn);
+igb_core_vf_reset(core, vfn);
 }
 }
 
diff --git a/hw/net/igb_core.h b/hw/net/igb_core.h
index 9cbbfd5..bf8c46f 100644
--- a/hw/net/igb_core.h
+++ b/hw/net/igb_core.h
@@ -130,6 +130,9 @@ igb_core_set_link_status(IGBCore *core);
 void
 igb_core_pci_uninit(IGBCore *core);
 
+void
+igb_core_vf_reset(IGBCore *core, uint16_t vfn);
+
 bool
 igb_can_receive(IGBCore *core);
 
diff --git a/hw/net/igbvf.c b/hw/net/igbvf.c
index d55e1e8..07343fa 100644
--- a/hw/net/igbvf.c
+++ b/hw/net/igbvf.c
@@ -273,6 +273,13 @@ static void igbvf_pci_realize(PCIDevice *dev, Error **errp)
 pcie_ari_init(dev, 0x150);
 }
 
+static void igbvf_qdev_reset_hold(Object *obj)
+{
+PCIDevice *vf = PCI_DEVICE(obj);
+
+igb_vf_reset(pcie_sriov_get_pf(vf), pcie_sriov_vf_number(vf));
+}
+
 static void igbvf_pci_uninit(PCIDevice *dev)
 {
 IgbVfState *s = IGBVF(dev);
@@ -287,6 +294,7 @@ static void igbvf_class_init(ObjectClass *class, void *data)
 {
 DeviceClass *dc = DEVICE_CLASS(class);
 PCIDeviceClass *c = PCI_DEVICE_CLASS(class);
+ResettableClass *rc = RESETTABLE_CLASS(class);
 
 c->realize = igbvf_pci_realize;
 c->exit = igbvf_pci_uninit;
@@ -295,6 +303,8 @@ static void igbvf_class_init(ObjectClass *class, void *data)
 c->revision = 1;
 c->class_id = PCI_CLASS_NETWORK_ETHERNET;
 
+rc->phases.hold = igbvf_qdev_reset_hold;
+
 dc->desc = "Intel 82576 Virtual Function";
 dc->user_creatable = false;
 
diff --git a/hw/net/trace-events b/hw/net/trace-events
index 3097742..387e32e 100644
--- a/hw/net/trace-events
+++ b/hw/net/trace-events
@@ -274,6 +274,7 @@ igb_core_mdic_read(uint32_t addr, uint32_t data) "MDIC READ: PHY[%u] = 0x%x"
 igb_core_mdic_read_unhandled(uint32_t addr) "MDIC READ: PHY[%u] UNHANDLED"
 igb_core_mdic_write(uint32_t addr, uint32_t data) "MDIC WRITE: PHY[%u] = 0x%x"
 igb_core_mdic_write_unhandled(uint32_t addr) "MDIC WRITE: PHY[%u] UNHANDLED"
+igb_core_vf_reset(uint16_t vfn) "VF%d"
 
 igb_link_set_ext_params(bool asd_check, bool speed_select_bypass, bool pfrstd) "Set extended link params: ASD check: %d, Speed select bypass: %d, PF reset done: %d"
 
-- 
2.7.4




RE: [PATCH v5 10/20] vfio/pci: Make vfio cdev pre-openable by passing a file handle

2023-11-13 Thread Duan, Zhenzhong


>-Original Message-
>From: Cédric Le Goater 
>Sent: Monday, November 13, 2023 7:08 PM
>Subject: Re: [PATCH v5 10/20] vfio/pci: Make vfio cdev pre-openable by passing 
>a
>file handle
>
>On 11/13/23 04:00, Duan, Zhenzhong wrote:
>>
>>
>>> -Original Message-
>>> From: Cédric Le Goater 
>>> Sent: Friday, November 10, 2023 6:53 PM
>>> Subject: Re: [PATCH v5 10/20] vfio/pci: Make vfio cdev pre-openable by
>passing a
>>> file handle
>>>
>>> On 11/9/23 12:45, Zhenzhong Duan wrote:
 This gives management tools like libvirt a chance to open the vfio
 cdev with privilege and pass FD to qemu. This way qemu never needs
 to have privilege to open a VFIO or iommu cdev node.

 Together with the earlier support of pre-opening /dev/iommu device,
 now we have full support of passing a vfio device to unprivileged
 qemu by management tool. This mode is no more considered for the
 legacy backend. So let's remove the "TODO" comment.

 Add a helper function vfio_device_get_name() to check fd and get
 device name, it will also be used by other vfio devices.

 There is no easy way to check if a device is mdev with FD passing,
 so fail the x-balloon-allowed check unconditionally in this case.

 There is also no easy way to get BDF as name with FD passing, so
 we fake a name by VFIO_FD[fd].

 Signed-off-by: Zhenzhong Duan 
 ---
include/hw/vfio/vfio-common.h |  1 +
hw/vfio/helpers.c | 34 +
hw/vfio/iommufd.c | 12 +++
hw/vfio/pci.c | 40 ---
4 files changed, 71 insertions(+), 16 deletions(-)

 diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>common.h
 index 3dac5c167e..960a14e8d8 100644
 --- a/include/hw/vfio/vfio-common.h
 +++ b/include/hw/vfio/vfio-common.h
 @@ -238,6 +238,7 @@ struct vfio_info_cap_header *
vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
struct vfio_info_cap_header *
vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id);
 +int vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
#endif

bool vfio_migration_realize(VFIODevice *vbasedev, Error **errp);
 diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
 index 168847e7c5..d80aa58719 100644
 --- a/hw/vfio/helpers.c
 +++ b/hw/vfio/helpers.c
 @@ -20,6 +20,7 @@
 */

#include "qemu/osdep.h"
 +#include CONFIG_DEVICES /* CONFIG_IOMMUFD */
#include 

#include "hw/vfio/vfio-common.h"
 @@ -609,3 +610,36 @@ bool vfio_has_region_cap(VFIODevice *vbasedev,
>int
>>> region, uint16_t cap_type)

return ret;
}
 +
 +int vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
 +{
 +struct stat st;
 +
 +if (vbasedev->fd < 0) {
 +if (stat(vbasedev->sysfsdev, &st) < 0) {
 +error_setg_errno(errp, errno, "no such host device");
 +error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->sysfsdev);
 +return -errno;
 +}
 +/* User may specify a name, e.g: VFIO platform device */
 +if (!vbasedev->name) {
 +vbasedev->name = g_path_get_basename(vbasedev->sysfsdev);
 +}
 +}
 +#ifdef CONFIG_IOMMUFD
 +else {
 +if (!vbasedev->iommufd) {
>>>
>>>
>>> Can we handle with this case without CONFIG_IOMMUFD, simply by
>>> testing vbasedev->iommufd ?
>>
>> Sure, will do.
>>
>>>
 +error_setg(errp, "Use FD passing only with iommufd backend");
 +return -EINVAL;
 +}
 +/*
 + * Give a name with fd so any function printing out vbasedev->name
 + * will not break.
 + */
 +if (!vbasedev->name) {
 +vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
 +}
 +}
 +#endif
 +return 0;
 +}
 diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
 index 44dc6848bf..fd30477275 100644
 --- a/hw/vfio/iommufd.c
 +++ b/hw/vfio/iommufd.c
 @@ -326,11 +326,15 @@ static int iommufd_attach_device(const char
>*name,
>>> VFIODevice *vbasedev,
uint32_t ioas_id;
Error *err = NULL;

 -devfd = iommufd_cdev_getfd(vbasedev->sysfsdev, errp);
 -if (devfd < 0) {
 -return devfd;
 +if (vbasedev->fd < 0) {
 +devfd = iommufd_cdev_getfd(vbasedev->sysfsdev, errp);
 +if (devfd < 0) {
 +return devfd;
 +}
 +vbasedev->fd = devfd;
 +} else {
 +devfd = vbasedev->fd;
}
 -vbasedev->fd = devfd;

ret = iommufd_connect_and_bind(vbasedev, errp);
if (ret) {
 diff --git a/hw

RE: [PATCH v5 03/20] vfio/iommufd: Implement the iommufd backend

2023-11-13 Thread Duan, Zhenzhong


>-Original Message-
>From: Cédric Le Goater 
>Sent: Monday, November 13, 2023 7:05 PM
>Subject: Re: [PATCH v5 03/20] vfio/iommufd: Implement the iommufd backend
>
>On 11/10/23 11:18, Duan, Zhenzhong wrote:
>>
>>
>>> -Original Message-
>>> From: Cédric Le Goater 
>>> Sent: Friday, November 10, 2023 5:34 PM
>>> Subject: Re: [PATCH v5 03/20] vfio/iommufd: Implement the iommufd
>backend
>>>
>>> On 11/9/23 12:45, Zhenzhong Duan wrote:
 From: Yi Liu 

 Add the iommufd backend. The IOMMUFD container class is implemented
 based on the new /dev/iommu user API. This backend obviously depends
 on CONFIG_IOMMUFD.

 So far, the iommufd backend doesn't support dirty page sync yet due
 to missing support in the host kernel.

 Co-authored-by: Eric Auger 
 Signed-off-by: Yi Liu 
 Signed-off-by: Zhenzhong Duan 
 ---
 v5: Switch to IOAS attach/detach and hide hwpt

include/hw/vfio/vfio-common.h |  11 +
hw/vfio/common.c  |  20 +-
hw/vfio/iommufd.c | 429 ++
hw/vfio/meson.build   |   3 +
hw/vfio/trace-events  |  10 +
5 files changed, 469 insertions(+), 4 deletions(-)
create mode 100644 hw/vfio/iommufd.c

 diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>common.h
 index 24ecc0e7ee..3dac5c167e 100644
 --- a/include/hw/vfio/vfio-common.h
 +++ b/include/hw/vfio/vfio-common.h
 @@ -89,6 +89,14 @@ typedef struct VFIOHostDMAWindow {
QLIST_ENTRY(VFIOHostDMAWindow) hostwin_next;
} VFIOHostDMAWindow;

 +typedef struct IOMMUFDBackend IOMMUFDBackend;
 +
 +typedef struct VFIOIOMMUFDContainer {
 +VFIOContainerBase bcontainer;
 +IOMMUFDBackend *be;
 +uint32_t ioas_id;
 +} VFIOIOMMUFDContainer;
 +
typedef struct VFIODeviceOps VFIODeviceOps;

typedef struct VFIODevice {
 @@ -116,6 +124,8 @@ typedef struct VFIODevice {
OnOffAuto pre_copy_dirty_page_tracking;
bool dirty_pages_supported;
bool dirty_tracking;
 +int devid;
 +IOMMUFDBackend *iommufd;
} VFIODevice;

struct VFIODeviceOps {
 @@ -201,6 +211,7 @@ typedef QLIST_HEAD(VFIODeviceList, VFIODevice)
>>> VFIODeviceList;
extern VFIOGroupList vfio_group_list;
extern VFIODeviceList vfio_device_list;
extern const VFIOIOMMUOps vfio_legacy_ops;
 +extern const VFIOIOMMUOps vfio_iommufd_ops;
extern const MemoryListener vfio_memory_listener;
extern int vfio_kvm_device_fd;

 diff --git a/hw/vfio/common.c b/hw/vfio/common.c
 index 572ae7c934..3b7e11158f 100644
 --- a/hw/vfio/common.c
 +++ b/hw/vfio/common.c
 @@ -19,6 +19,7 @@
 */

#include "qemu/osdep.h"
 +#include CONFIG_DEVICES /* CONFIG_IOMMUFD */
#include 
#ifdef CONFIG_KVM
#include 
 @@ -1462,10 +1463,13 @@ VFIOAddressSpace
>>> *vfio_get_address_space(AddressSpace *as)

void vfio_put_address_space(VFIOAddressSpace *space)
{
 -if (QLIST_EMPTY(&space->containers)) {
 -QLIST_REMOVE(space, list);
 -g_free(space);
 +if (!QLIST_EMPTY(&space->containers)) {
 +return;
>>>
>>> I think this change deserves to be in a separate patch, even if simple.
>>> Is there some relation with iommufd ? This is not clear.
>>
>> OK, will do. It's unrelated to iommufd, just avoid unnecessary check below.
>>
>>>
}
 +
 +QLIST_REMOVE(space, list);
 +g_free(space);
 +
if (QLIST_EMPTY(&vfio_address_spaces)) {
qemu_unregister_reset(vfio_reset_handler, NULL);
}
 @@ -1498,8 +1502,16 @@ retry:
int vfio_attach_device(char *name, VFIODevice *vbasedev,
   AddressSpace *as, Error **errp)
{
 -const VFIOIOMMUOps *ops = &vfio_legacy_ops;
 +const VFIOIOMMUOps *ops;

 +#ifdef CONFIG_IOMMUFD
 +if (vbasedev->iommufd) {
 +ops = &vfio_iommufd_ops;
 +} else
 +#endif
 +{
 +ops = &vfio_legacy_ops;
 +}
>>>
>>> Simply adding :
>>>
>>>   +#ifdef CONFIG_IOMMUFD
>>>   +if (vbasedev->iommufd) {
>>>   +ops = &vfio_iommufd_ops;
>>>   +}
>>>   +#endif
>>>
>>> would have the same effect with less change.
>>
>> Indeed, will do.
>>
>>>
>>> That said, it would also be nice to find a way to avoid the use of
>>> CONFIG_IOMMUFD in hw/vfio/common.c. May be with a helper returning
>>> 'const VFIOIOMMUOps *'. This is minor. Still, I find some redundancy
>>> with vfio_container_init() and I don't a good alternative yet :)
>>
>> Sure, will do, guess you mean a helper function in hw/vfio/helpers.c with
>> CONFIG_IOMMUFD check?
>
>Yes. That was the idea. I took a look and the benefits are minimal.
>I am not s

[PATCH v5] target/riscv: update checks on writing pmpcfg for Smepmp to version 1.0

2023-11-13 Thread Alvin Chang via
The current checks on writing pmpcfg for Smepmp follow Smepmp version
0.9.1. However, the Smepmp specification has since been ratified, and
there are some differences between versions 0.9.1 and 1.0. This commit
updates the checks on writing pmpcfg to follow Smepmp version 1.0.

When mseccfg.MML is set, the constraints to modify PMP rules are:
1. Locked rules cannot be removed or modified until a PMP reset, unless
   mseccfg.RLB is set.
2. From Smepmp specification version 1.0, chapter 2 section 4b:
   Adding a rule with executable privileges that either is M-mode-only
   or a locked Shared-Region is not possible and such pmpcfg writes are
   ignored, leaving pmpcfg unchanged.

The commit translates the pmpcfg value into an index into the Smepmp
truth table, and checks the rules against the aforementioned
specification changes.

Signed-off-by: Alvin Chang 
---
Changes from v4: Rebase on master.

Changes from v3: Modify "epmp_operation" to "smepmp_operation".

Changes from v2: Adopt switch case ranges and numerical order.

Changes from v1: Convert ePMP over to Smepmp.

 target/riscv/pmp.c | 40 
 1 file changed, 32 insertions(+), 8 deletions(-)

diff --git a/target/riscv/pmp.c b/target/riscv/pmp.c
index 162e88a90a..4069514069 100644
--- a/target/riscv/pmp.c
+++ b/target/riscv/pmp.c
@@ -102,16 +102,40 @@ static bool pmp_write_cfg(CPURISCVState *env, uint32_t pmp_index, uint8_t val)
 locked = false;
 }
 
-/* mseccfg.MML is set */
-if (MSECCFG_MML_ISSET(env)) {
-/* not adding execute bit */
-if ((val & PMP_LOCK) != 0 && (val & PMP_EXEC) != PMP_EXEC) {
+/*
+ * mseccfg.MML is set. Locked rules cannot be removed or modified
+ * until a PMP reset. Besides, from Smepmp specification version 1.0,
+ * chapter 2 section 4b says:
+ * Adding a rule with executable privileges that either is
+ * M-mode-only or a locked Shared-Region is not possible and such
+ * pmpcfg writes are ignored, leaving pmpcfg unchanged.
+ */
+if (MSECCFG_MML_ISSET(env) && !pmp_is_locked(env, pmp_index)) {
+/*
+ * Convert the PMP permissions to match the truth table in the
+ * Smepmp spec.
+ */
+const uint8_t smepmp_operation =
+((val & PMP_LOCK) >> 4) | ((val & PMP_READ) << 2) |
+(val & PMP_WRITE) | ((val & PMP_EXEC) >> 2);
+
+switch (smepmp_operation) {
+case 0 ... 8:
 locked = false;
-}
-/* shared region and not adding X bit */
-if ((val & PMP_LOCK) != PMP_LOCK &&
-(val & 0x7) != (PMP_WRITE | PMP_EXEC)) {
+break;
+case 9 ... 11:
+break;
+case 12:
+locked = false;
+break;
+case 13:
+break;
+case 14:
+case 15:
 locked = false;
+break;
+default:
+g_assert_not_reached();
 }
 }
 } else {
-- 
2.34.1




[PATCH v5 06/14] tpm-sysbus: add plug handler for TPM on SysBus

2023-11-13 Thread Joelle van Dyne
TPM needs to know its own base address in order to generate its DSDT
device entry.

Signed-off-by: Joelle van Dyne 
Reviewed-by: Stefan Berger 
---
 include/sysemu/tpm.h |  4 
 hw/tpm/tpm-sysbus.c  | 47 
 hw/tpm/meson.build   |  1 +
 3 files changed, 52 insertions(+)
 create mode 100644 hw/tpm/tpm-sysbus.c

diff --git a/include/sysemu/tpm.h b/include/sysemu/tpm.h
index 1ee568b3b6..ffd300e607 100644
--- a/include/sysemu/tpm.h
+++ b/include/sysemu/tpm.h
@@ -12,6 +12,8 @@
 #ifndef QEMU_TPM_H
 #define QEMU_TPM_H
 
+#include "qemu/osdep.h"
+#include "exec/hwaddr.h"
 #include "qapi/qapi-types-tpm.h"
 #include "qom/object.h"
 
@@ -78,6 +80,8 @@ static inline TPMVersion tpm_get_version(TPMIf *ti)
 return TPM_IF_GET_CLASS(ti)->get_version(ti);
 }
 
+void tpm_sysbus_plug(TPMIf *tpmif, Object *pbus, hwaddr pbus_base);
+
 #else /* CONFIG_TPM */
 
 #define tpm_init()  (0)
diff --git a/hw/tpm/tpm-sysbus.c b/hw/tpm/tpm-sysbus.c
new file mode 100644
index 00..732ce34c73
--- /dev/null
+++ b/hw/tpm/tpm-sysbus.c
@@ -0,0 +1,47 @@
+/*
+ * tpm-sysbus.c - Support functions for SysBus TPM devices
+ *
+ * Copyright (c) 2023 QEMU contributors
+ *
+ * Authors:
+ *   Joelle van Dyne 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#include "sysemu/tpm.h"
+#include "hw/platform-bus.h"
+#include "hw/sysbus.h"
+#include "qapi/error.h"
+
+/**
+ * Called from a machine's pre_plug handler to set the device's physical addr.
+ */
+void tpm_sysbus_plug(TPMIf *tpmif, Object *pbus, hwaddr pbus_base)
+{
+PlatformBusDevice *pbusdev = PLATFORM_BUS_DEVICE(pbus);
+SysBusDevice *sbdev = SYS_BUS_DEVICE(tpmif);
+MemoryRegion *sbdev_mr;
+hwaddr tpm_base;
+uint64_t tpm_size;
+
+/* exit early if TPM is not a sysbus device */
+if (!object_dynamic_cast(OBJECT(tpmif), TYPE_SYS_BUS_DEVICE)) {
+return;
+}
+
+assert(object_dynamic_cast(pbus, TYPE_PLATFORM_BUS_DEVICE));
+
+tpm_base = platform_bus_get_mmio_addr(pbusdev, sbdev, 0);
+assert(tpm_base != -1);
+
+tpm_base += pbus_base;
+
+sbdev_mr = sysbus_mmio_get_region(sbdev, 0);
+tpm_size = memory_region_size(sbdev_mr);
+
+object_property_set_uint(OBJECT(sbdev), "x-baseaddr",
+ tpm_base, &error_abort);
+object_property_set_uint(OBJECT(sbdev), "x-size",
+ tpm_size, &error_abort);
+}
diff --git a/hw/tpm/meson.build b/hw/tpm/meson.build
index cb8204d5bc..3060ac05e8 100644
--- a/hw/tpm/meson.build
+++ b/hw/tpm/meson.build
@@ -1,6 +1,7 @@
 system_ss.add(when: 'CONFIG_TPM_TIS', if_true: files('tpm_tis_common.c'))
 system_ss.add(when: 'CONFIG_TPM_TIS_ISA', if_true: files('tpm_tis_isa.c'))
 system_ss.add(when: 'CONFIG_TPM_TIS_SYSBUS', if_true: files('tpm_tis_sysbus.c'))
+system_ss.add(when: 'CONFIG_TPM_TIS_SYSBUS', if_true: files('tpm-sysbus.c'))
 system_ss.add(when: 'CONFIG_TPM_TIS_I2C', if_true: files('tpm_tis_i2c.c'))
 system_ss.add(when: 'CONFIG_TPM_CRB', if_true: files('tpm_crb.c'))
 system_ss.add(when: 'CONFIG_TPM_CRB', if_true: files('tpm_crb_common.c'))
-- 
2.41.0




[PATCH] linux-headers: Synchronize linux headers from linux v6.7.0-rc1

2023-11-13 Thread Tianrui Zhao
Use scripts/update-linux-headers.sh to synchronize the Linux headers
with Linux v6.7.0-rc1. The main goal is to bring in the LoongArch
headers, on which the upcoming LoongArch KVM support will be based.

Signed-off-by: Tianrui Zhao 
---
 include/standard-headers/drm/drm_fourcc.h |   2 +
 include/standard-headers/linux/pci_regs.h |  24 ++-
 include/standard-headers/linux/vhost_types.h  |   7 +
 .../standard-headers/linux/virtio_config.h|   5 +
 linux-headers/asm-arm64/kvm.h |  32 
 linux-headers/asm-generic/unistd.h|  14 +-
 linux-headers/asm-loongarch/bitsperlong.h |   1 +
 linux-headers/asm-loongarch/kvm.h | 108 +++
 linux-headers/asm-loongarch/mman.h|   1 +
 linux-headers/asm-loongarch/unistd.h  |   5 +
 linux-headers/asm-mips/unistd_n32.h   |   4 +
 linux-headers/asm-mips/unistd_n64.h   |   4 +
 linux-headers/asm-mips/unistd_o32.h   |   4 +
 linux-headers/asm-powerpc/unistd_32.h |   4 +
 linux-headers/asm-powerpc/unistd_64.h |   4 +
 linux-headers/asm-riscv/kvm.h |  12 ++
 linux-headers/asm-s390/unistd_32.h|   4 +
 linux-headers/asm-s390/unistd_64.h|   4 +
 linux-headers/asm-x86/unistd_32.h |   4 +
 linux-headers/asm-x86/unistd_64.h |   3 +
 linux-headers/asm-x86/unistd_x32.h|   3 +
 linux-headers/linux/iommufd.h | 180 +-
 linux-headers/linux/kvm.h |  11 ++
 linux-headers/linux/psp-sev.h |   1 +
 linux-headers/linux/stddef.h  |   7 +
 linux-headers/linux/userfaultfd.h |   9 +-
 linux-headers/linux/vfio.h|  47 +++--
 linux-headers/linux/vhost.h   |   8 +
 28 files changed, 486 insertions(+), 26 deletions(-)
 create mode 100644 linux-headers/asm-loongarch/bitsperlong.h
 create mode 100644 linux-headers/asm-loongarch/kvm.h
 create mode 100644 linux-headers/asm-loongarch/mman.h
 create mode 100644 linux-headers/asm-loongarch/unistd.h

diff --git a/include/standard-headers/drm/drm_fourcc.h b/include/standard-headers/drm/drm_fourcc.h
index 72279f4d25..3afb70160f 100644
--- a/include/standard-headers/drm/drm_fourcc.h
+++ b/include/standard-headers/drm/drm_fourcc.h
@@ -322,6 +322,8 @@ extern "C" {
  * index 1 = Cr:Cb plane, [39:0] Cr1:Cb1:Cr0:Cb0 little endian
  */
 #define DRM_FORMAT_NV15    fourcc_code('N', 'V', '1', '5') /* 2x2 subsampled Cr:Cb plane */
+#define DRM_FORMAT_NV20    fourcc_code('N', 'V', '2', '0') /* 2x1 subsampled Cr:Cb plane */
+#define DRM_FORMAT_NV30    fourcc_code('N', 'V', '3', '0') /* non-subsampled Cr:Cb plane */
 
 /*
  * 2 plane YCbCr MSB aligned
diff --git a/include/standard-headers/linux/pci_regs.h b/include/standard-headers/linux/pci_regs.h
index e5f558d964..a39193213f 100644
--- a/include/standard-headers/linux/pci_regs.h
+++ b/include/standard-headers/linux/pci_regs.h
@@ -80,6 +80,7 @@
 #define  PCI_HEADER_TYPE_NORMAL    0
 #define  PCI_HEADER_TYPE_BRIDGE    1
 #define  PCI_HEADER_TYPE_CARDBUS   2
+#define  PCI_HEADER_TYPE_MFD   0x80    /* Multi-Function Device (possible) */
 
 #define PCI_BIST   0x0f/* 8 bits */
 #define  PCI_BIST_CODE_MASK    0x0f    /* Return result */
@@ -637,6 +638,7 @@
 #define PCI_EXP_RTCAP  0x1e/* Root Capabilities */
 #define  PCI_EXP_RTCAP_CRSVIS  0x0001  /* CRS Software Visibility capability */
 #define PCI_EXP_RTSTA  0x20/* Root Status */
+#define  PCI_EXP_RTSTA_PME_RQ_ID 0x0000ffff /* PME Requester ID */
 #define  PCI_EXP_RTSTA_PME 0x0001 /* PME status */
 #define  PCI_EXP_RTSTA_PENDING 0x0002 /* PME pending */
 /*
@@ -930,12 +932,13 @@
 
 /* Process Address Space ID */
 #define PCI_PASID_CAP  0x04/* PASID feature register */
-#define  PCI_PASID_CAP_EXEC0x02/* Exec permissions Supported */
-#define  PCI_PASID_CAP_PRIV0x04/* Privilege Mode Supported */
+#define  PCI_PASID_CAP_EXEC0x0002  /* Exec permissions Supported */
+#define  PCI_PASID_CAP_PRIV0x0004  /* Privilege Mode Supported */
+#define  PCI_PASID_CAP_WIDTH   0x1f00
 #define PCI_PASID_CTRL 0x06/* PASID control register */
-#define  PCI_PASID_CTRL_ENABLE 0x01/* Enable bit */
-#define  PCI_PASID_CTRL_EXEC   0x02/* Exec permissions Enable */
-#define  PCI_PASID_CTRL_PRIV   0x04/* Privilege Mode Enable */
+#define  PCI_PASID_CTRL_ENABLE 0x0001  /* Enable bit */
+#define  PCI_PASID_CTRL_EXEC   0x0002  /* Exec permissions Enable */
+#define  PCI_PASID_CTRL_PRIV   0x0004  /* Privilege Mode Enable */
 #define PCI_EXT_CAP_PASID_SIZEOF   8
 
 /* Single Root I/O Virtualization */
@@ -975,6 +978,8 @@
 #define  PCI_LTR_VALUE_MASK0x03ff
 #define  PCI_LTR_SCALE_MASK0x1c00
 #define  PCI_LTR_SCALE_SHIFT   10
+#define  PCI_LTR_NOSNOOP_VALUE 0x03ff /* Max No-Snoop Latency Value */

[PATCH v5 11/14] tpm_crb_sysbus: introduce TPM CRB SysBus device

2023-11-13 Thread Joelle van Dyne
This SysBus variant of the CRB interface supports dynamically locating
the MMIO interface so that Virt machines can use it. This interface
is currently the only one supported by QEMU that works on Windows 11
ARM64 as 'tpm-tis-device' does not work with current Windows drivers.
We largely follow that device as a template.

To try out this device with Windows 11 before OVMF is updated, you
will need to modify `sysbus-fdt.c` and change the added line from:

```c
TYPE_BINDING(TYPE_TPM_CRB_SYSBUS, no_fdt_node),
```

to

```c
TYPE_BINDING(TYPE_TPM_CRB_SYSBUS, add_tpm_tis_fdt_node),
```

This change was not included because it can confuse Linux (although
from testing, it seems like Linux is able to properly ignore the
device from the TPM TIS driver and recognize it from the ACPI device
in the TPM CRB driver). A proper fix would require OVMF to recognize
the ACPI device and not depend on the FDT node for recognizing TPM.

The command line to try out this device with SWTPM is:

```
$ qemu-system-aarch64 \
-chardev socket,id=chrtpm0,path=tpm.sock \
-tpmdev emulator,id=tpm0,chardev=chrtpm0 \
-device tpm-crb-device,tpmdev=tpm0
```

along with SWTPM:

```
$ swtpm \
--ctrl type=unixio,path=tpm.sock,terminate \
--tpmstate backend-uri=file://tpm.data \
--tpm2
```

Signed-off-by: Joelle van Dyne 
Reviewed-by: Stefan Berger 
---
 docs/specs/tpm.rst  |   1 +
 include/sysemu/tpm.h|   3 +
 hw/acpi/aml-build.c |   7 +-
 hw/arm/virt.c   |   1 +
 hw/core/sysbus-fdt.c|   1 +
 hw/loongarch/virt.c |   1 +
 hw/riscv/virt.c |   1 +
 hw/tpm/tpm_crb_sysbus.c | 162 
 hw/arm/Kconfig  |   1 +
 hw/loongarch/Kconfig|   1 +
 hw/riscv/Kconfig|   1 +
 hw/tpm/Kconfig  |   5 ++
 hw/tpm/meson.build  |   3 +
 13 files changed, 187 insertions(+), 1 deletion(-)
 create mode 100644 hw/tpm/tpm_crb_sysbus.c

diff --git a/docs/specs/tpm.rst b/docs/specs/tpm.rst
index 2bc29c9804..95aeb49220 100644
--- a/docs/specs/tpm.rst
+++ b/docs/specs/tpm.rst
@@ -46,6 +46,7 @@ operating system.
 QEMU files related to TPM CRB interface:
  - ``hw/tpm/tpm_crb.c``
  - ``hw/tpm/tpm_crb_common.c``
+ - ``hw/tpm/tpm_crb_sysbus.c``
 
 SPAPR interface
 ---
diff --git a/include/sysemu/tpm.h b/include/sysemu/tpm.h
index ffd300e607..bab30fa546 100644
--- a/include/sysemu/tpm.h
+++ b/include/sysemu/tpm.h
@@ -49,6 +49,7 @@ struct TPMIfClass {
 #define TYPE_TPM_TIS_ISA"tpm-tis"
 #define TYPE_TPM_TIS_SYSBUS "tpm-tis-device"
 #define TYPE_TPM_CRB"tpm-crb"
+#define TYPE_TPM_CRB_SYSBUS "tpm-crb-device"
 #define TYPE_TPM_SPAPR  "tpm-spapr"
 #define TYPE_TPM_TIS_I2C"tpm-tis-i2c"
 
@@ -58,6 +59,8 @@ struct TPMIfClass {
 object_dynamic_cast(OBJECT(chr), TYPE_TPM_TIS_SYSBUS)
 #define TPM_IS_CRB(chr) \
 object_dynamic_cast(OBJECT(chr), TYPE_TPM_CRB)
+#define TPM_IS_CRB_SYSBUS(chr)  \
+object_dynamic_cast(OBJECT(chr), TYPE_TPM_CRB_SYSBUS)
 #define TPM_IS_SPAPR(chr)   \
 object_dynamic_cast(OBJECT(chr), TYPE_TPM_SPAPR)
 #define TPM_IS_TIS_I2C(chr)  \
diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index af66bde0f5..acc654382e 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -31,6 +31,7 @@
 #include "hw/pci/pci_bus.h"
 #include "hw/pci/pci_bridge.h"
 #include "qemu/cutils.h"
+#include "qom/object.h"
 
 static GArray *build_alloc_array(void)
 {
@@ -2218,7 +2219,7 @@ void build_tpm2(GArray *table_data, BIOSLinker *linker, GArray *tcpalog,
 {
 uint8_t start_method_params[12] = {};
 unsigned log_addr_offset;
-uint64_t control_area_start_address;
+uint64_t baseaddr, control_area_start_address;
 TPMIf *tpmif = tpm_find();
 uint32_t start_method;
 AcpiTable table = { .sig = "TPM2", .rev = 4,
@@ -2236,6 +2237,10 @@ void build_tpm2(GArray *table_data, BIOSLinker *linker, GArray *tcpalog,
 } else if (TPM_IS_CRB(tpmif)) {
 control_area_start_address = TPM_CRB_ADDR_CTRL;
 start_method = TPM2_START_METHOD_CRB;
+} else if (TPM_IS_CRB_SYSBUS(tpmif)) {
+baseaddr = object_property_get_uint(OBJECT(tpmif), "x-baseaddr", NULL);
+control_area_start_address = baseaddr + A_CRB_CTRL_REQ;
+start_method = TPM2_START_METHOD_CRB;
 } else {
 g_assert_not_reached();
 }
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 36e2506420..e6152a2f51 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2955,6 +2955,7 @@ static void virt_machine_class_init(ObjectClass *oc, void *data)
 machine_class_allow_dynamic_sysbus_dev(mc, TYPE_VFIO_PLATFORM);
 #ifdef CONFIG_TPM
 machine_class_allow_dynamic_sysbus_dev(mc, TYPE_TPM_TIS_SYSBUS);
+machine_class_allow_dynamic_sysbus_dev(mc, TYPE_TPM_CRB_SYSBUS);
 #endif
 mc->block_default_type = IF_VIRTIO;
 mc->no_cdrom = 1;
diff --git a/hw/core/sysbu

[PATCH v5 04/14] tpm_crb: use a single read-as-mem/write-as-mmio mapping

2023-11-13 Thread Joelle van Dyne
On Apple Silicon, when Windows performs a LDP on the CRB MMIO space,
the exception is not decoded by hardware and we cannot trap the MMIO
read. This led to the idea from @agraf to use the same mapping type as
ROM devices: namely that reads should be seen as memory type and
writes should trap as MMIO.

Once that was done, the second memory mapping of the command buffer
region became redundant and was removed.

A note about the removal of the read trap for `CRB_LOC_STATE`:
The only usage was to return the most up-to-date value for
`tpmEstablished`. However, `tpmEstablished` is only cleared when a
TPM2_HashStart operation is called which only exists for locality 4.
We do not handle locality 4. Indeed, the comment for the write handler
of `CRB_LOC_CTRL` makes the same argument for why it is not calling
the backend to reset the `tpmEstablished` bit (to 1).
As this bit is unused, we do not need to worry about updating it for
reads.

In order to maintain migration compatibility with older versions of
QEMU, we store a copy of the register data and command data which is
used only during save/restore.

Signed-off-by: Joelle van Dyne 
---
 hw/tpm/tpm_crb.h|   5 +-
 hw/tpm/tpm_crb.c|  30 -
 hw/tpm/tpm_crb_common.c | 145 +++-
 3 files changed, 114 insertions(+), 66 deletions(-)

diff --git a/hw/tpm/tpm_crb.h b/hw/tpm/tpm_crb.h
index da3a0cf256..36863e1664 100644
--- a/hw/tpm/tpm_crb.h
+++ b/hw/tpm/tpm_crb.h
@@ -26,9 +26,7 @@
 typedef struct TPMCRBState {
 TPMBackend *tpmbe;
 TPMBackendCmd cmd;
-uint32_t regs[TPM_CRB_R_MAX];
 MemoryRegion mmio;
-MemoryRegion cmdmem;
 
 size_t be_buffer_size;
 
@@ -72,5 +70,8 @@ enum TPMVersion tpm_crb_get_version(TPMCRBState *s);
 int tpm_crb_pre_save(TPMCRBState *s);
 void tpm_crb_reset(TPMCRBState *s, uint64_t baseaddr);
 void tpm_crb_init_memory(Object *obj, TPMCRBState *s, Error **errp);
+void tpm_crb_mem_save(TPMCRBState *s, uint32_t *saved_regs, void *saved_cmdmem);
+void tpm_crb_mem_load(TPMCRBState *s, const uint32_t *saved_regs,
+  const void *saved_cmdmem);
 
 #endif /* TPM_TPM_CRB_H */
diff --git a/hw/tpm/tpm_crb.c b/hw/tpm/tpm_crb.c
index 598c3e0161..99c64dd72a 100644
--- a/hw/tpm/tpm_crb.c
+++ b/hw/tpm/tpm_crb.c
@@ -37,6 +37,10 @@ struct CRBState {
 DeviceState parent_obj;
 
 TPMCRBState state;
+
+/* These states are only for migration */
+uint32_t saved_regs[TPM_CRB_R_MAX];
+MemoryRegion saved_cmdmem;
 };
 typedef struct CRBState CRBState;
 
@@ -57,18 +61,36 @@ static enum TPMVersion tpm_crb_none_get_version(TPMIf *ti)
 return tpm_crb_get_version(&s->state);
 }
 
+/**
+ * For migrating to an older version of QEMU
+ */
 static int tpm_crb_none_pre_save(void *opaque)
 {
 CRBState *s = opaque;
+void *saved_cmdmem = memory_region_get_ram_ptr(&s->saved_cmdmem);
 
+tpm_crb_mem_save(&s->state, s->saved_regs, saved_cmdmem);
 return tpm_crb_pre_save(&s->state);
 }
 
+/**
+ * For migrating from an older version of QEMU
+ */
+static int tpm_crb_none_post_load(void *opaque, int version_id)
+{
+CRBState *s = opaque;
+void *saved_cmdmem = memory_region_get_ram_ptr(&s->saved_cmdmem);
+
+tpm_crb_mem_load(&s->state, s->saved_regs, saved_cmdmem);
+return 0;
+}
+
 static const VMStateDescription vmstate_tpm_crb_none = {
 .name = "tpm-crb",
 .pre_save = tpm_crb_none_pre_save,
+.post_load = tpm_crb_none_post_load,
 .fields = (VMStateField[]) {
-VMSTATE_UINT32_ARRAY(state.regs, CRBState, TPM_CRB_R_MAX),
+VMSTATE_UINT32_ARRAY(saved_regs, CRBState, TPM_CRB_R_MAX),
 VMSTATE_END_OF_LIST(),
 }
 };
@@ -101,10 +123,12 @@ static void tpm_crb_none_realize(DeviceState *dev, Error **errp)
 
 tpm_crb_init_memory(OBJECT(s), &s->state, errp);
 
+/* only used for migration */
+memory_region_init_ram(&s->saved_cmdmem, OBJECT(s),
+"tpm-crb-cmd", CRB_CTRL_CMD_SIZE, errp);
+
 memory_region_add_subregion(get_system_memory(),
 TPM_CRB_ADDR_BASE, &s->state.mmio);
-memory_region_add_subregion(get_system_memory(),
-TPM_CRB_ADDR_BASE + sizeof(s->state.regs), &s->state.cmdmem);
 
 if (s->state.ppi_enabled) {
 memory_region_add_subregion(get_system_memory(),
diff --git a/hw/tpm/tpm_crb_common.c b/hw/tpm/tpm_crb_common.c
index bee0b71fee..f96a8cf299 100644
--- a/hw/tpm/tpm_crb_common.c
+++ b/hw/tpm/tpm_crb_common.c
@@ -31,31 +31,12 @@
 #include "qom/object.h"
 #include "tpm_crb.h"
 
-static uint64_t tpm_crb_mmio_read(void *opaque, hwaddr addr,
-  unsigned size)
+static uint8_t tpm_crb_get_active_locty(TPMCRBState *s, uint32_t *regs)
 {
-TPMCRBState *s = opaque;
-void *regs = (void *)&s->regs + (addr & ~3);
-unsigned offset = addr & 3;
-uint32_t val = *(uint32_t *)regs >> (8 * offset);
-
-switch (addr) {
-case A_CRB_LOC_STATE:
-val |= !tpm_backend_get_tpm_established_flag(s->tpmbe);
-break;
-}

[PATCH v5 02/14] tpm_crb: CTRL_RSP_ADDR is 64-bits wide

2023-11-13 Thread Joelle van Dyne
The register is actually 64 bits wide, but to make this clearer than
the specification, we define two 32-bit registers,
CTRL_RSP_LADDR and CTRL_RSP_HADDR, to match the CTRL_CMD_* naming.
This deviates from the spec but is much clearer.

Previously, the only CRB device used a fixed system address, so this
was not an issue. However, once we support a SysBus CRB device, the
address can be anywhere in the 64-bit space.

Signed-off-by: Joelle van Dyne 
Reviewed-by: Stefan Berger 
---
 include/hw/acpi/tpm.h  | 3 ++-
 hw/tpm/tpm_crb_common.c| 3 ++-
 tests/qtest/tpm-crb-test.c | 2 +-
 tests/qtest/tpm-util.c | 2 +-
 4 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/hw/acpi/tpm.h b/include/hw/acpi/tpm.h
index 579c45f5ba..f60bfe2789 100644
--- a/include/hw/acpi/tpm.h
+++ b/include/hw/acpi/tpm.h
@@ -174,7 +174,8 @@ REG32(CRB_CTRL_CMD_SIZE, 0x58)
 REG32(CRB_CTRL_CMD_LADDR, 0x5C)
 REG32(CRB_CTRL_CMD_HADDR, 0x60)
 REG32(CRB_CTRL_RSP_SIZE, 0x64)
-REG32(CRB_CTRL_RSP_ADDR, 0x68)
+REG32(CRB_CTRL_RSP_LADDR, 0x68)
+REG32(CRB_CTRL_RSP_HADDR, 0x6C)
 REG32(CRB_DATA_BUFFER, 0x80)
 
 #define TPM_CRB_ADDR_BASE   0xFED40000
diff --git a/hw/tpm/tpm_crb_common.c b/hw/tpm/tpm_crb_common.c
index fa463f295f..01b35808f6 100644
--- a/hw/tpm/tpm_crb_common.c
+++ b/hw/tpm/tpm_crb_common.c
@@ -197,7 +197,8 @@ void tpm_crb_reset(TPMCRBState *s, uint64_t baseaddr)
 s->regs[R_CRB_CTRL_CMD_LADDR] = (uint32_t)baseaddr;
 s->regs[R_CRB_CTRL_CMD_HADDR] = (uint32_t)(baseaddr >> 32);
 s->regs[R_CRB_CTRL_RSP_SIZE] = CRB_CTRL_CMD_SIZE;
-s->regs[R_CRB_CTRL_RSP_ADDR] = (uint32_t)baseaddr;
+s->regs[R_CRB_CTRL_RSP_LADDR] = (uint32_t)baseaddr;
+s->regs[R_CRB_CTRL_RSP_HADDR] = (uint32_t)(baseaddr >> 32);
 
 s->be_buffer_size = MIN(tpm_backend_get_buffer_size(s->tpmbe),
 CRB_CTRL_CMD_SIZE);
diff --git a/tests/qtest/tpm-crb-test.c b/tests/qtest/tpm-crb-test.c
index 396ae3f91c..9d30fe8293 100644
--- a/tests/qtest/tpm-crb-test.c
+++ b/tests/qtest/tpm-crb-test.c
@@ -28,7 +28,7 @@ static void tpm_crb_test(const void *data)
 uint32_t csize = readl(TPM_CRB_ADDR_BASE + A_CRB_CTRL_CMD_SIZE);
 uint64_t caddr = readq(TPM_CRB_ADDR_BASE + A_CRB_CTRL_CMD_LADDR);
 uint32_t rsize = readl(TPM_CRB_ADDR_BASE + A_CRB_CTRL_RSP_SIZE);
-uint64_t raddr = readq(TPM_CRB_ADDR_BASE + A_CRB_CTRL_RSP_ADDR);
+uint64_t raddr = readq(TPM_CRB_ADDR_BASE + A_CRB_CTRL_RSP_LADDR);
 uint8_t locstate = readb(TPM_CRB_ADDR_BASE + A_CRB_LOC_STATE);
 uint32_t locctrl = readl(TPM_CRB_ADDR_BASE + A_CRB_LOC_CTRL);
 uint32_t locsts = readl(TPM_CRB_ADDR_BASE + A_CRB_LOC_STS);
diff --git a/tests/qtest/tpm-util.c b/tests/qtest/tpm-util.c
index 1c0319e6e7..dd02057fc0 100644
--- a/tests/qtest/tpm-util.c
+++ b/tests/qtest/tpm-util.c
@@ -25,7 +25,7 @@ void tpm_util_crb_transfer(QTestState *s,
unsigned char *rsp, size_t rsp_size)
 {
 uint64_t caddr = qtest_readq(s, TPM_CRB_ADDR_BASE + A_CRB_CTRL_CMD_LADDR);
-uint64_t raddr = qtest_readq(s, TPM_CRB_ADDR_BASE + A_CRB_CTRL_RSP_ADDR);
+uint64_t raddr = qtest_readq(s, TPM_CRB_ADDR_BASE + A_CRB_CTRL_RSP_LADDR);
 
 qtest_writeb(s, TPM_CRB_ADDR_BASE + A_CRB_LOC_CTRL, 1);
 
-- 
2.41.0




[PATCH v5 08/14] hw/loongarch/virt: connect TPM to platform bus

2023-11-13 Thread Joelle van Dyne
Signed-off-by: Joelle van Dyne 
Reviewed-by: Stefan Berger 
---
 hw/loongarch/virt.c  | 7 +++
 hw/loongarch/Kconfig | 1 +
 2 files changed, 8 insertions(+)

diff --git a/hw/loongarch/virt.c b/hw/loongarch/virt.c
index 4b7dc67a2d..feed0f8bbf 100644
--- a/hw/loongarch/virt.c
+++ b/hw/loongarch/virt.c
@@ -1004,6 +1004,13 @@ static void loongarch_machine_device_plug_cb(HotplugHandler *hotplug_dev,
 } else if (memhp_type_supported(dev)) {
 virt_mem_plug(hotplug_dev, dev, errp);
 }
+
+#ifdef CONFIG_TPM
+if (object_dynamic_cast(OBJECT(dev), TYPE_TPM_IF)) {
+tpm_sysbus_plug(TPM_IF(dev), OBJECT(lams->platform_bus_dev),
+VIRT_PLATFORM_BUS_BASEADDRESS);
+}
+#endif
 }
 
 static HotplugHandler *virt_machine_get_hotplug_handler(MachineState *machine,
diff --git a/hw/loongarch/Kconfig b/hw/loongarch/Kconfig
index 5727efed6d..25da190ffc 100644
--- a/hw/loongarch/Kconfig
+++ b/hw/loongarch/Kconfig
@@ -5,6 +5,7 @@ config LOONGARCH_VIRT
 imply VIRTIO_VGA
 imply PCI_DEVICES
 imply NVDIMM
+imply TPM_TIS_SYSBUS
 select SERIAL
 select VIRTIO_PCI
 select PLATFORM_BUS
-- 
2.41.0




[PATCH v5 03/14] tpm_ppi: refactor memory space initialization

2023-11-13 Thread Joelle van Dyne
Instead of calling `memory_region_add_subregion` directly, we defer to
the caller to do it. This allows us to re-use the code for a SysBus
device.

Signed-off-by: Joelle van Dyne 
Reviewed-by: Stefan Berger 
---
 hw/tpm/tpm_ppi.h| 10 +++---
 hw/tpm/tpm_crb.c|  4 ++--
 hw/tpm/tpm_crb_common.c |  3 +++
 hw/tpm/tpm_ppi.c|  5 +
 hw/tpm/tpm_tis_isa.c|  5 +++--
 5 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/hw/tpm/tpm_ppi.h b/hw/tpm/tpm_ppi.h
index bf5d4a300f..30863c6438 100644
--- a/hw/tpm/tpm_ppi.h
+++ b/hw/tpm/tpm_ppi.h
@@ -20,17 +20,13 @@ typedef struct TPMPPI {
 } TPMPPI;
 
 /**
- * tpm_ppi_init:
+ * tpm_ppi_init_memory:
  * @tpmppi: a TPMPPI
- * @m: the address-space / MemoryRegion to use
- * @addr: the address of the PPI region
  * @obj: the owner object
  *
- * Register the TPM PPI memory region at @addr on the given address
- * space for the object @obj.
+ * Creates the TPM PPI memory region.
  **/
-void tpm_ppi_init(TPMPPI *tpmppi, MemoryRegion *m,
-  hwaddr addr, Object *obj);
+void tpm_ppi_init_memory(TPMPPI *tpmppi, Object *obj);
 
 /**
  * tpm_ppi_reset:
diff --git a/hw/tpm/tpm_crb.c b/hw/tpm/tpm_crb.c
index 3ef4977fb5..598c3e0161 100644
--- a/hw/tpm/tpm_crb.c
+++ b/hw/tpm/tpm_crb.c
@@ -107,8 +107,8 @@ static void tpm_crb_none_realize(DeviceState *dev, Error **errp)
 TPM_CRB_ADDR_BASE + sizeof(s->state.regs), &s->state.cmdmem);
 
 if (s->state.ppi_enabled) {
-tpm_ppi_init(&s->state.ppi, get_system_memory(),
- TPM_PPI_ADDR_BASE, OBJECT(s));
+memory_region_add_subregion(get_system_memory(),
+TPM_PPI_ADDR_BASE, &s->state.ppi.ram);
 }
 
 if (xen_enabled()) {
diff --git a/hw/tpm/tpm_crb_common.c b/hw/tpm/tpm_crb_common.c
index 01b35808f6..bee0b71fee 100644
--- a/hw/tpm/tpm_crb_common.c
+++ b/hw/tpm/tpm_crb_common.c
@@ -214,4 +214,7 @@ void tpm_crb_init_memory(Object *obj, TPMCRBState *s, Error **errp)
 "tpm-crb-mmio", sizeof(s->regs));
 memory_region_init_ram(&s->cmdmem, obj,
 "tpm-crb-cmd", CRB_CTRL_CMD_SIZE, errp);
+if (s->ppi_enabled) {
+tpm_ppi_init_memory(&s->ppi, obj);
+}
 }
diff --git a/hw/tpm/tpm_ppi.c b/hw/tpm/tpm_ppi.c
index 7f74e26ec6..40cab59afa 100644
--- a/hw/tpm/tpm_ppi.c
+++ b/hw/tpm/tpm_ppi.c
@@ -44,14 +44,11 @@ void tpm_ppi_reset(TPMPPI *tpmppi)
 }
 }
 
-void tpm_ppi_init(TPMPPI *tpmppi, MemoryRegion *m,
-  hwaddr addr, Object *obj)
+void tpm_ppi_init_memory(TPMPPI *tpmppi, Object *obj)
 {
 tpmppi->buf = qemu_memalign(qemu_real_host_page_size(),
 HOST_PAGE_ALIGN(TPM_PPI_ADDR_SIZE));
 memory_region_init_ram_device_ptr(&tpmppi->ram, obj, "tpm-ppi",
   TPM_PPI_ADDR_SIZE, tpmppi->buf);
 vmstate_register_ram(&tpmppi->ram, DEVICE(obj));
-
-memory_region_add_subregion(m, addr, &tpmppi->ram);
 }
diff --git a/hw/tpm/tpm_tis_isa.c b/hw/tpm/tpm_tis_isa.c
index 0367401586..d596f38c0f 100644
--- a/hw/tpm/tpm_tis_isa.c
+++ b/hw/tpm/tpm_tis_isa.c
@@ -134,8 +134,9 @@ static void tpm_tis_isa_realizefn(DeviceState *dev, Error **errp)
 TPM_TIS_ADDR_BASE, &s->mmio);
 
 if (s->ppi_enabled) {
-tpm_ppi_init(&s->ppi, isa_address_space(ISA_DEVICE(dev)),
- TPM_PPI_ADDR_BASE, OBJECT(dev));
+tpm_ppi_init_memory(&s->ppi, OBJECT(dev));
+memory_region_add_subregion(isa_address_space(ISA_DEVICE(dev)),
+TPM_PPI_ADDR_BASE, &s->ppi.ram);
 }
 }
 
-- 
2.41.0




[PATCH v5 12/14] tests: acpi: implement TPM CRB tests for ARM virt

2023-11-13 Thread Joelle van Dyne
Signed-off-by: Joelle van Dyne 
Reviewed-by: Stefan Berger 
---
 tests/qtest/bios-tables-test.c | 43 --
 1 file changed, 41 insertions(+), 2 deletions(-)

diff --git a/tests/qtest/bios-tables-test.c b/tests/qtest/bios-tables-test.c
index 71af5cf69f..bb4ebf00c1 100644
--- a/tests/qtest/bios-tables-test.c
+++ b/tests/qtest/bios-tables-test.c
@@ -1447,6 +1447,28 @@ static void test_acpi_piix4_tcg_numamem(void)
 
 uint64_t tpm_tis_base_addr;
 
+static test_data tcg_tpm_test_data(const char *machine)
+{
+if (g_strcmp0(machine, "virt") == 0) {
+test_data data = {
+.machine = "virt",
+.tcg_only = true,
+.uefi_fl1 = "pc-bios/edk2-aarch64-code.fd",
+.uefi_fl2 = "pc-bios/edk2-arm-vars.fd",
+.cd = "tests/data/uefi-boot-images/bios-tables-test.aarch64.iso.qcow2",
+.ram_start = 0x40000000ULL,
+.scan_len = 128ULL * 1024 * 1024,
+};
+return data;
+} else {
+test_data data = {
+.machine = machine,
+};
+return data;
+}
+}
+
 static void test_acpi_tcg_tpm(const char *machine, const char *tpm_if,
   uint64_t base, enum TPMVersion tpm_version)
 {
@@ -1454,7 +1476,7 @@ static void test_acpi_tcg_tpm(const char *machine, const char *tpm_if,
   machine, tpm_if);
 char *tmp_path = g_dir_make_tmp(tmp_dir_name, NULL);
 TPMTestState test;
-test_data data = {};
+test_data data = tcg_tpm_test_data(machine);
 GThread *thread;
 const char *suffix = tpm_version == TPM_VERSION_2_0 ? "tpm2" : "tpm12";
 char *args, *variant = g_strdup_printf(".%s.%s", tpm_if, suffix);
@@ -1474,13 +1496,14 @@ static void test_acpi_tcg_tpm(const char *machine, const char *tpm_if,
 thread = g_thread_new(NULL, tpm_emu_ctrl_thread, &test);
 tpm_emu_test_wait_cond(&test);
 
-data.machine = machine;
 data.variant = variant;
 
 args = g_strdup_printf(
+" %s"
 " -chardev socket,id=chr,path=%s"
 " -tpmdev emulator,id=dev,chardev=chr"
 " -device tpm-%s,tpmdev=dev",
+g_strcmp0(machine, "virt") == 0 ? "-cpu cortex-a57" : "",
 test.addr->u.q_unix.path, tpm_if);
 
 test_acpi_one(args, &data);
@@ -1506,6 +1529,16 @@ static void test_acpi_q35_tcg_tpm12_tis(void)
 test_acpi_tcg_tpm("q35", "tis", 0xFED40000, TPM_VERSION_1_2);
 }
 
+static void test_acpi_q35_tcg_tpm2_crb(void)
+{
+test_acpi_tcg_tpm("q35", "crb", 0xFED40000, TPM_VERSION_2_0);
+}
+
+static void test_acpi_virt_tcg_tpm2_crb(void)
+{
+test_acpi_tcg_tpm("virt", "crb-device", 0xFED40000, TPM_VERSION_2_0);
+}
+
 static void test_acpi_tcg_dimm_pxm(const char *machine)
 {
 test_data data = {};
@@ -2212,6 +2245,9 @@ int main(int argc, char *argv[])
 qtest_add_func("acpi/q35/tpm12-tis",
test_acpi_q35_tcg_tpm12_tis);
 }
+if (tpm_model_is_available("-machine q35", "tpm-crb")) {
+qtest_add_func("acpi/q35/tpm2-crb", test_acpi_q35_tcg_tpm2_crb);
+}
 qtest_add_func("acpi/q35/bridge", test_acpi_q35_tcg_bridge);
 qtest_add_func("acpi/q35/no-acpi-hotplug",
test_acpi_q35_tcg_no_acpi_hotplug);
@@ -2301,6 +2337,9 @@ int main(int argc, char *argv[])
 qtest_add_func("acpi/virt/viot", test_acpi_virt_viot);
 }
 }
+if (tpm_model_is_available("-machine virt", "tpm-crb")) {
+qtest_add_func("acpi/virt/tpm2-crb", test_acpi_virt_tcg_tpm2_crb);
+}
 }
 ret = g_test_run();
 boot_sector_cleanup(disk);
-- 
2.41.0




[PATCH v5 01/14] tpm_crb: refactor common code

2023-11-13 Thread Joelle van Dyne
In preparation for the SysBus variant, we factor out the common code,
styled after the TPM TIS devices.

To maintain compatibility, we do not rename the existing tpm-crb
device.

Signed-off-by: Joelle van Dyne 
Reviewed-by: Stefan Berger 
---
 docs/specs/tpm.rst  |   1 +
 hw/tpm/tpm_crb.h|  76 +++
 hw/tpm/tpm_crb.c| 270 ++--
 hw/tpm/tpm_crb_common.c | 216 
 hw/tpm/meson.build  |   1 +
 hw/tpm/trace-events |   2 +-
 6 files changed, 331 insertions(+), 235 deletions(-)
 create mode 100644 hw/tpm/tpm_crb.h
 create mode 100644 hw/tpm/tpm_crb_common.c

diff --git a/docs/specs/tpm.rst b/docs/specs/tpm.rst
index efe124a148..2bc29c9804 100644
--- a/docs/specs/tpm.rst
+++ b/docs/specs/tpm.rst
@@ -45,6 +45,7 @@ operating system.
 
 QEMU files related to TPM CRB interface:
  - ``hw/tpm/tpm_crb.c``
+ - ``hw/tpm/tpm_crb_common.c``
 
 SPAPR interface
 ---
diff --git a/hw/tpm/tpm_crb.h b/hw/tpm/tpm_crb.h
new file mode 100644
index 00..da3a0cf256
--- /dev/null
+++ b/hw/tpm/tpm_crb.h
@@ -0,0 +1,76 @@
+/*
+ * tpm_crb.h - QEMU's TPM CRB interface emulator
+ *
+ * Copyright (c) 2018 Red Hat, Inc.
+ *
+ * Authors:
+ *   Marc-André Lureau 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ * tpm_crb is a device for TPM 2.0 Command Response Buffer (CRB) Interface
+ * as defined in TCG PC Client Platform TPM Profile (PTP) Specification
+ * Family “2.0” Level 00 Revision 01.03 v22
+ */
+#ifndef TPM_TPM_CRB_H
+#define TPM_TPM_CRB_H
+
+#include "exec/memory.h"
+#include "hw/acpi/tpm.h"
+#include "sysemu/tpm_backend.h"
+#include "tpm_ppi.h"
+
+#define CRB_CTRL_CMD_SIZE (TPM_CRB_ADDR_SIZE - A_CRB_DATA_BUFFER)
+
+typedef struct TPMCRBState {
+TPMBackend *tpmbe;
+TPMBackendCmd cmd;
+uint32_t regs[TPM_CRB_R_MAX];
+MemoryRegion mmio;
+MemoryRegion cmdmem;
+
+size_t be_buffer_size;
+
+bool ppi_enabled;
+TPMPPI ppi;
+} TPMCRBState;
+
+#define CRB_INTF_TYPE_CRB_ACTIVE 0b1
+#define CRB_INTF_VERSION_CRB 0b1
+#define CRB_INTF_CAP_LOCALITY_0_ONLY 0b0
+#define CRB_INTF_CAP_IDLE_FAST 0b0
+#define CRB_INTF_CAP_XFER_SIZE_64 0b11
+#define CRB_INTF_CAP_FIFO_NOT_SUPPORTED 0b0
+#define CRB_INTF_CAP_CRB_SUPPORTED 0b1
+#define CRB_INTF_IF_SELECTOR_CRB 0b1
+
+enum crb_loc_ctrl {
+CRB_LOC_CTRL_REQUEST_ACCESS = BIT(0),
+CRB_LOC_CTRL_RELINQUISH = BIT(1),
+CRB_LOC_CTRL_SEIZE = BIT(2),
+CRB_LOC_CTRL_RESET_ESTABLISHMENT_BIT = BIT(3),
+};
+
+enum crb_ctrl_req {
+CRB_CTRL_REQ_CMD_READY = BIT(0),
+CRB_CTRL_REQ_GO_IDLE = BIT(1),
+};
+
+enum crb_start {
+CRB_START_INVOKE = BIT(0),
+};
+
+enum crb_cancel {
+CRB_CANCEL_INVOKE = BIT(0),
+};
+
+#define TPM_CRB_NO_LOCALITY 0xff
+
+void tpm_crb_request_completed(TPMCRBState *s, int ret);
+enum TPMVersion tpm_crb_get_version(TPMCRBState *s);
+int tpm_crb_pre_save(TPMCRBState *s);
+void tpm_crb_reset(TPMCRBState *s, uint64_t baseaddr);
+void tpm_crb_init_memory(Object *obj, TPMCRBState *s, Error **errp);
+
+#endif /* TPM_TPM_CRB_H */
diff --git a/hw/tpm/tpm_crb.c b/hw/tpm/tpm_crb.c
index ea930da545..3ef4977fb5 100644
--- a/hw/tpm/tpm_crb.c
+++ b/hw/tpm/tpm_crb.c
@@ -31,257 +31,62 @@
 #include "tpm_ppi.h"
 #include "trace.h"
 #include "qom/object.h"
+#include "tpm_crb.h"
 
 struct CRBState {
 DeviceState parent_obj;
 
-TPMBackend *tpmbe;
-TPMBackendCmd cmd;
-uint32_t regs[TPM_CRB_R_MAX];
-MemoryRegion mmio;
-MemoryRegion cmdmem;
-
-size_t be_buffer_size;
-
-bool ppi_enabled;
-TPMPPI ppi;
+TPMCRBState state;
 };
 typedef struct CRBState CRBState;
 
 DECLARE_INSTANCE_CHECKER(CRBState, CRB,
  TYPE_TPM_CRB)
 
-#define CRB_INTF_TYPE_CRB_ACTIVE 0b1
-#define CRB_INTF_VERSION_CRB 0b1
-#define CRB_INTF_CAP_LOCALITY_0_ONLY 0b0
-#define CRB_INTF_CAP_IDLE_FAST 0b0
-#define CRB_INTF_CAP_XFER_SIZE_64 0b11
-#define CRB_INTF_CAP_FIFO_NOT_SUPPORTED 0b0
-#define CRB_INTF_CAP_CRB_SUPPORTED 0b1
-#define CRB_INTF_IF_SELECTOR_CRB 0b1
-
-#define CRB_CTRL_CMD_SIZE (TPM_CRB_ADDR_SIZE - A_CRB_DATA_BUFFER)
-
-enum crb_loc_ctrl {
-CRB_LOC_CTRL_REQUEST_ACCESS = BIT(0),
-CRB_LOC_CTRL_RELINQUISH = BIT(1),
-CRB_LOC_CTRL_SEIZE = BIT(2),
-CRB_LOC_CTRL_RESET_ESTABLISHMENT_BIT = BIT(3),
-};
-
-enum crb_ctrl_req {
-CRB_CTRL_REQ_CMD_READY = BIT(0),
-CRB_CTRL_REQ_GO_IDLE = BIT(1),
-};
-
-enum crb_start {
-CRB_START_INVOKE = BIT(0),
-};
-
-enum crb_cancel {
-CRB_CANCEL_INVOKE = BIT(0),
-};
-
-#define TPM_CRB_NO_LOCALITY 0xff
-
-static uint64_t tpm_crb_mmio_read(void *opaque, hwaddr addr,
-  unsigned size)
-{
-CRBState *s = CRB(opaque);
-void *regs = (void *)&s->regs + (addr & ~3);
-unsigned offset = addr & 3;
-uint32_t val = *(uint32_t *)regs >> (8 * offset);
-
-switch (addr) {
-case A_CRB_LOC_STATE:
-val |= !tpm_backend_get_tpm_established_flag(

[PATCH v5 05/14] tpm_crb: move ACPI table building to device interface

2023-11-13 Thread Joelle van Dyne
This logic is similar to that of the TPM TIS ISA device. Since TPM CRB
can only support TPM 2.0 backends, we check for this in realize.

Signed-off-by: Joelle van Dyne 
---
 hw/tpm/tpm_crb.h|  2 ++
 hw/i386/acpi-build.c| 16 +---
 hw/tpm/tpm_crb.c| 16 
 hw/tpm/tpm_crb_common.c | 19 +++
 4 files changed, 38 insertions(+), 15 deletions(-)

diff --git a/hw/tpm/tpm_crb.h b/hw/tpm/tpm_crb.h
index 36863e1664..e6a86e3fd1 100644
--- a/hw/tpm/tpm_crb.h
+++ b/hw/tpm/tpm_crb.h
@@ -73,5 +73,7 @@ void tpm_crb_init_memory(Object *obj, TPMCRBState *s, Error **errp);
 void tpm_crb_mem_save(TPMCRBState *s, uint32_t *saved_regs, void *saved_cmdmem);
 void tpm_crb_mem_load(TPMCRBState *s, const uint32_t *saved_regs,
   const void *saved_cmdmem);
+void tpm_crb_build_aml(TPMIf *ti, Aml *scope, uint32_t baseaddr, uint32_t size,
+   bool build_ppi);
 
 #endif /* TPM_TPM_CRB_H */
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 80db183b78..7491cee2af 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -1792,21 +1792,7 @@ build_dsdt(GArray *table_data, BIOSLinker *linker,
 
 #ifdef CONFIG_TPM
 if (TPM_IS_CRB(tpm)) {
-dev = aml_device("TPM");
-aml_append(dev, aml_name_decl("_HID", aml_string("MSFT0101")));
-aml_append(dev, aml_name_decl("_STR",
-  aml_string("TPM 2.0 Device")));
-crs = aml_resource_template();
-aml_append(crs, aml_memory32_fixed(TPM_CRB_ADDR_BASE,
-   TPM_CRB_ADDR_SIZE, AML_READ_WRITE));
-aml_append(dev, aml_name_decl("_CRS", crs));
-
-aml_append(dev, aml_name_decl("_STA", aml_int(0xf)));
-aml_append(dev, aml_name_decl("_UID", aml_int(1)));
-
-tpm_build_ppi_acpi(tpm, dev);
-
-aml_append(sb_scope, dev);
+call_dev_aml_func(DEVICE(tpm), scope);
 }
 #endif
 
diff --git a/hw/tpm/tpm_crb.c b/hw/tpm/tpm_crb.c
index 99c64dd72a..8d57295b15 100644
--- a/hw/tpm/tpm_crb.c
+++ b/hw/tpm/tpm_crb.c
@@ -19,6 +19,8 @@
 #include "qemu/module.h"
 #include "qapi/error.h"
 #include "exec/address-spaces.h"
+#include "hw/acpi/acpi_aml_interface.h"
+#include "hw/acpi/tpm.h"
 #include "hw/qdev-properties.h"
 #include "hw/pci/pci_ids.h"
 #include "hw/acpi/tpm.h"
@@ -121,6 +123,11 @@ static void tpm_crb_none_realize(DeviceState *dev, Error **errp)
 return;
 }
 
+if (tpm_crb_none_get_version(TPM_IF(s)) != TPM_VERSION_2_0) {
+error_setg(errp, "TPM CRB only supports TPM 2.0 backends");
+return;
+}
+
 tpm_crb_init_memory(OBJECT(s), &s->state, errp);
 
 /* only used for migration */
@@ -142,10 +149,17 @@ static void tpm_crb_none_realize(DeviceState *dev, Error **errp)
 }
 }
 
+static void build_tpm_crb_none_aml(AcpiDevAmlIf *adev, Aml *scope)
+{
+tpm_crb_build_aml(TPM_IF(adev), scope, TPM_CRB_ADDR_BASE, TPM_CRB_ADDR_SIZE,
+  true);
+}
+
 static void tpm_crb_none_class_init(ObjectClass *klass, void *data)
 {
 DeviceClass *dc = DEVICE_CLASS(klass);
 TPMIfClass *tc = TPM_IF_CLASS(klass);
+AcpiDevAmlIfClass *adevc = ACPI_DEV_AML_IF_CLASS(klass);
 
 dc->realize = tpm_crb_none_realize;
 device_class_set_props(dc, tpm_crb_none_properties);
@@ -154,6 +168,7 @@ static void tpm_crb_none_class_init(ObjectClass *klass, void *data)
 tc->model = TPM_MODEL_TPM_CRB;
 tc->get_version = tpm_crb_none_get_version;
 tc->request_completed = tpm_crb_none_request_completed;
+adevc->build_dev_aml = build_tpm_crb_none_aml;
 
 set_bit(DEVICE_CATEGORY_MISC, dc->categories);
 }
@@ -166,6 +181,7 @@ static const TypeInfo tpm_crb_none_info = {
 .class_init  = tpm_crb_none_class_init,
 .interfaces = (InterfaceInfo[]) {
 { TYPE_TPM_IF },
+{ TYPE_ACPI_DEV_AML_IF },
 { }
 }
 };
diff --git a/hw/tpm/tpm_crb_common.c b/hw/tpm/tpm_crb_common.c
index f96a8cf299..09ca55eece 100644
--- a/hw/tpm/tpm_crb_common.c
+++ b/hw/tpm/tpm_crb_common.c
@@ -241,3 +241,22 @@ void tpm_crb_mem_load(TPMCRBState *s, const uint32_t *saved_regs,
 memcpy(regs, saved_regs, A_CRB_DATA_BUFFER);
 memcpy(&regs[R_CRB_DATA_BUFFER], saved_cmdmem, CRB_CTRL_CMD_SIZE);
 }
+
+void tpm_crb_build_aml(TPMIf *ti, Aml *scope, uint32_t baseaddr, uint32_t size,
+   bool build_ppi)
+{
+Aml *dev, *crs;
+
+dev = aml_device("TPM");
+aml_append(dev, aml_name_decl("_HID", aml_string("MSFT0101")));
+aml_append(dev, aml_name_decl("_STR", aml_string("TPM 2.0 Device")));
+aml_append(dev, aml_name_decl("_UID", aml_int(1)));
+aml_append(dev, aml_name_decl("_STA", aml_int(0xF)));
+crs = aml_resource_template();
+aml_append(crs, aml_memory32_fixed(baseaddr, size, AML_READ_WRITE));
+aml_append(dev, aml_name_decl("_CRS", crs));
+if (build_ppi) {
+tpm_build_ppi_acpi(ti, dev);
+}
+aml_append(scope, dev);
+}
-- 
2.41.0




[PATCH v5 10/14] tests: acpi: prepare for TPM CRB tests

2023-11-13 Thread Joelle van Dyne
Signed-off-by: Joelle van Dyne 
Reviewed-by: Stefan Berger 
---
 tests/qtest/bios-tables-test-allowed-diff.h | 4 
 tests/data/acpi/q35/DSDT.crb.tpm2   | 0
 tests/data/acpi/q35/TPM2.crb.tpm2   | 0
 tests/data/acpi/virt/DSDT.crb-device.tpm2   | 0
 tests/data/acpi/virt/TPM2.crb-device.tpm2   | 0
 5 files changed, 4 insertions(+)
 create mode 100644 tests/data/acpi/q35/DSDT.crb.tpm2
 create mode 100644 tests/data/acpi/q35/TPM2.crb.tpm2
 create mode 100644 tests/data/acpi/virt/DSDT.crb-device.tpm2
 create mode 100644 tests/data/acpi/virt/TPM2.crb-device.tpm2

diff --git a/tests/qtest/bios-tables-test-allowed-diff.h b/tests/qtest/bios-tables-test-allowed-diff.h
index dfb8523c8b..c2d1924c2f 100644
--- a/tests/qtest/bios-tables-test-allowed-diff.h
+++ b/tests/qtest/bios-tables-test-allowed-diff.h
@@ -1 +1,5 @@
 /* List of comma-separated changed AML files to ignore */
+"tests/data/acpi/q35/DSDT.crb.tpm2",
+"tests/data/acpi/q35/TPM2.crb.tpm2",
+"tests/data/acpi/virt/DSDT.crb.tpm2",
+"tests/data/acpi/virt/TPM2.crb.tpm2",
diff --git a/tests/data/acpi/q35/DSDT.crb.tpm2 b/tests/data/acpi/q35/DSDT.crb.tpm2
new file mode 100644
index 00..e69de29bb2
diff --git a/tests/data/acpi/q35/TPM2.crb.tpm2 b/tests/data/acpi/q35/TPM2.crb.tpm2
new file mode 100644
index 00..e69de29bb2
diff --git a/tests/data/acpi/virt/DSDT.crb-device.tpm2 b/tests/data/acpi/virt/DSDT.crb-device.tpm2
new file mode 100644
index 00..e69de29bb2
diff --git a/tests/data/acpi/virt/TPM2.crb-device.tpm2 b/tests/data/acpi/virt/TPM2.crb-device.tpm2
new file mode 100644
index 00..e69de29bb2
-- 
2.41.0




[PATCH v5 13/14] tests: acpi: updated expected blobs for TPM CRB

2023-11-13 Thread Joelle van Dyne
Signed-off-by: Joelle van Dyne 
Tested-by: Stefan Berger 
---
 tests/qtest/bios-tables-test-allowed-diff.h |   4 
 tests/data/acpi/q35/DSDT.crb.tpm2   | Bin 0 -> 8355 bytes
 tests/data/acpi/q35/TPM2.crb.tpm2   | Bin 0 -> 76 bytes
 tests/data/acpi/virt/DSDT.crb-device.tpm2   | Bin 0 -> 5276 bytes
 tests/data/acpi/virt/TPM2.crb-device.tpm2   | Bin 0 -> 76 bytes
 5 files changed, 4 deletions(-)

diff --git a/tests/qtest/bios-tables-test-allowed-diff.h b/tests/qtest/bios-tables-test-allowed-diff.h
index c2d1924c2f..dfb8523c8b 100644
--- a/tests/qtest/bios-tables-test-allowed-diff.h
+++ b/tests/qtest/bios-tables-test-allowed-diff.h
@@ -1,5 +1 @@
 /* List of comma-separated changed AML files to ignore */
-"tests/data/acpi/q35/DSDT.crb.tpm2",
-"tests/data/acpi/q35/TPM2.crb.tpm2",
-"tests/data/acpi/virt/DSDT.crb.tpm2",
-"tests/data/acpi/virt/TPM2.crb.tpm2",
diff --git a/tests/data/acpi/q35/DSDT.crb.tpm2 b/tests/data/acpi/q35/DSDT.crb.tpm2
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..fb89ae0ac6d4346e33156e9e4d3718698a0a1a8e 100644
GIT binary patch
literal 8355
zcmb7JOKcm*8J^`sS}m8-l3H7S#U`8>eWXYzJ1?|oP;!?qOOz=tWv8G4E+waxod8)R
zF%Ty(AS*!P_|$}T&?B8HKyU4-*Ba=hz4_JvJ@wE_u0;`_qJIC(jyR
z|M_Nj=UdMBf#3Umi815V>Lsrkl&WuZyb?YJV~mdJ*J)+0vi^==Z48WDDr5BTC~XclT0V1EX9t%8FLUoL=J{8a$7|Wqc45(S`t5&S`0mW9UwnDx
z{mR3i|KnHp-m)?PoX4+;-wP3ag&&31>2U0PF}iNtCOSX2JYM`_#7~Phht5PHwLGvz
z6Qx?-d&^zP*8HHIAHOoX!J`}kkI_Fg!iy?@uC#YS-
zrTUv~WpJ%1@T%q7MVzRvwYx^{k)ToFRo6D!rB2I#qtrL5tKJH8&vm@o#Z>=UiuU)T
zZ9+u1jO&bY^nXCjd(3^l0?srP<%;MljIp6xo$K)N^k)t=pcpkO!Uqsv2HV7CxINlr
zqfHxQv(Ii1jp6O#EyJ39I>6&|+UO?|M4RF=n4OkaXRbZKuMuri#S%~yOl!FkU<(jlNIwB^beO&;Npl_0M3hZoCl~3
ziHZCio8nAh*I0HosF6cTDsyZD_r=#g~be#xQodr#2LDN~#bs|)C
z7B!tkO=nTpiBQ$KsOenPbS~;T5vn>}O{c5rbakBwRh>sPokujCM|7PCRh>sQokumD
zM|GVDRh_UgF=z2vX-U($r0Ybe>O7|DJf`V9rt3tg>O9Vwm3SsR&Y9JCvO6xA-qVnn
zevCb#F;8gB6FL*2$~>ttPioAQIuoJFJf$&DY0OhP6QRm{Mq@ssF`vFwO=K`)$^IEADxgmc6rq#`0~J_lpbC@>R6w5?C_?l`8mPd=5!FOi>6pZnBSr>_
z5Iy2p7^uL;QLK?O$v_2EhN?~&s7TA1Fi-_b28vL+A{i(`sS^e&u$&14RiI>`0?KQW
zfg+STVW0xbnJ`cVN(L&RoJj_XQ0jz%3M^;BKouw%sDN@N87M+|Jz<~%%b74x1xf}g
zpqxnticso=feI{V!ax-$8K{7ACK)I~sS^e&u$&14RiI>`0?L_Wpa`W-7^uKf}LawZulLa7r5DzKah
z16818paRO7WS|J8P8g`bawZH^fs%m=C})y^B9uB|paRR8Fi-_b1}dPONd}5g>V$y`
zEN8+%6(|{~fN~}oC_Az;Y%GRDqI#3Mglifg+ST
zVW0xbnJ`cVN(L&RoJj_XQ0jz%3M^;BKouw%sDN@N87M-j69y`}VUmF=OcK^?eeK12m6?d_<
zj{pDTxsR-!ZMJ94?O8eZrPjLForCRm%Y}I>_t^}a<4Xy**ga~qviNRAA8lI;jE<0~
zTkh|!&cf#_awW!I5bG}{N(Y6b*5YULY%UFlVwi&&W>a>HxeJ4!S7Ce9g-&<9;uZ#e
zD`2c%0eSC#5jUcHL`snx6Q^y=0A
zZkx1=wHT~I#oDdZA>|dr0>%4qDQNDga`FdQwkt{!Ri1H1ke1n&7
zB+54qDBp<7H6%l<>Y}^13d0xaZ+z{XZRzJ
zA9}9ibjioqD(LC(zA%wav`tMn@mv=5ba;uFNGIB+rki-q7WH&^vzSOH+NP$Pcy3%h
z9bPvk(uuaI=_a0oBYHZ#gG{6oZBx@t+}B6-ba*kDNGIB+rkl7=m-KXaTbW2F+Vpfz
z+Z4B9w`6y(g5bLpfY&`$@Xvls$wAsJ@o85ys!qRAYy?VG&^FA-|#xCEAc~m7dzp*X5f9Ho3R9MOD)Yc5IwH6p&
zw|&{b%6pl<>IO@DUfaj&evy!AFQ~1S0QW1s5|*u7Yb`Tk)QE@g!d1R8fDVaH#%h+!
z)D5w%l64DSul~!_*cxrKPdrGy?lxzzZBUu(KYR7Xj4G4_(7J!J8O0*n2^l3%kc7xu
zzIL4Kd4LSlTdQ3uruHMY6&csQ@{6NuM#Qc~
zMi{Z-SF84KMxk+k3r%6Pl`P2xCmV55#!L5;t+*^(UytWTLu(&pzK*7yA3rxSa&+CJ
zt-I96A-g$5uO7TQet81M?+jeNEh`;O3=B?!cXNxj-D(-J??wqX*%n=LXxr*9PZu|l
z3;nsdIenPhbKa$(XCE-k)9;pv{209G`joMWtW>gwo+j-P81RGuqX`
zeQoa1-Hj)pUFj8amdUViL9fH^JT?}4ITFLRuitP_;^DzGFsPN!v-pXp2Z`<}r+q{`
zS$sc;Pa+p{)}Qa@SqgI$KKt}#G>pggW7{y%ZrIp?VeC7cer!L9^VmLO>_2>SkDsTv
z>HU3ro2E~SY1@7#cDDW~&agjdXC7id@OyFQ;p_LF$5vsSO|+3Z+7^RQ?L#r9pIt8l
zonp_G?>ts8-HEA;+LbvBlXS0Q<;1kf=djXDX~w`FWPkT!rqk?n`COPtfpHcQS7~x{+9&8L_IGnZxjZlj6~7BLKMu;Ti2zs3VDRu@*=N|@
z#KC!aaDfh>+zska!DnbbZ*|vGR%qFdm*GYFcYgX}n#vH8&KmSD2mi>{tMuj3mv1td
z*?NtR>-5#2ucq1GeQBlYqcdVlJPl7IO|f|$vyL>3kcG^^?RJe_!|&M?zpBr*FKs+w
zE#Rd_VVPF;ENvb4ch9eOddo6*2IB<>!{00g>sa}Q@j?27v}vB*;hE2Sm)cJ_S)iwL
z9;Y9tnR(XXoO9it_oO#D)FJ2PsUsFK!#v9j>drz?uf*e?Vi-zlsKyOxG&nXrlX!Wk
HVR!t0z{JvO

literal 0
HcmV?d1

diff --git a/tests/data/acpi/q35/TPM2.crb.tpm2 b/tests/data/acpi/q35/TPM2.crb.tpm2
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..9d89c5becc0fc4558a0556daf46e152181f2eb51 100644
GIT binary patch
literal 76
zcmWFu@HO&bU|?Wja`Jcf2v%^42yj*a0!E-1hz+7az=7e)KM>6hB2WNG#ec9c0JCKY
A0RR91

literal 0
HcmV?d1

diff --git a/tests/data/acpi/virt/DSDT.crb-device.tpm2 b/tests/data/acpi/virt/DSDT.crb-device.tpm2
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..1b3a8ef4cb3ab093402f4bdb2a349b3d0d015ad5 100644
GIT binary patch
literal 5276
zcmZvg%WoT16o>D`lh__VVmr?JVHZlpGaB1Xla{u`9y^IkoEVSWAf=KkRYjC+Dp6G`
z6;jBeh3;r1RxE-P3H}TuR_xfZV9kbqfF0&{=guVOsAr^%=gi#m&Hcv5@$qf?&Hj%?
zrAB^f?0Q>%x$$Y&D`T^iQu8F~^Y@MZ&m7
z8Dg384@p$&Q-tv$Wyp1!mgX@-7}qI7uG5Ufm?MlElp)t?R$?p=#!bqQ>vUXVED^>>
zlp)t?PGXb^W1TYOI?YRrHwdFn8FHN#B*sO;Xi$b+rxOz65@C3RQI+eoC^6n9j3#Bs
zbvh|A-X)9{Wyp0pB{AM7j19_=>vURTEEC2iWyp0pBQadU*rE)%PG==Xl`z_rA=l}g
z#JEftwWPqgrkH1nd8W80Lh6}jo@wTZmXCTOq@Ee(nPHw8?un3kW|?P}d7?$6o(QSuIP)B5
zp5xpTA@#)18B^my73Y{IT1x7Pkb35sXP$ZHxhF#ESzw+8=7|=RdLpEr6U=jhc}{Ro
zgwzv16p3e%d7@>do(QSuB=ekPo|D`YA@!VMo>R;dEiUy$NIj>S=QQ)2=AH4mY8}Xq@J_PbC!9|a!-WRbB=k=F;BG6)Dt1~oM)c%%yXW5BBY)R%yWTxqUENZ
z2&v~H^IT+}i`)|-^;}|}OUx53

[PATCH v5 14/14] tests: add TPM-CRB sysbus tests for aarch64

2023-11-13 Thread Joelle van Dyne
- Factor out common test code from tpm-crb-test.c -> tpm-tests.c
- Store device addr in `tpm_device_base_addr` (unify with TIS tests)
- Add new tests for aarch64

Signed-off-by: Joelle van Dyne 
Reviewed-by: Stefan Berger 
---
 tests/qtest/tpm-tests.h |   2 +
 tests/qtest/tpm-util.h  |   4 +-
 tests/qtest/bios-tables-test.c  |   4 +-
 tests/qtest/tpm-crb-device-swtpm-test.c |  72 ++
 tests/qtest/tpm-crb-device-test.c   |  71 ++
 tests/qtest/tpm-crb-swtpm-test.c|   2 +
 tests/qtest/tpm-crb-test.c  | 121 +---
 tests/qtest/tpm-tests.c | 121 
 tests/qtest/tpm-tis-device-swtpm-test.c |   2 +-
 tests/qtest/tpm-tis-device-test.c   |   2 +-
 tests/qtest/tpm-tis-i2c-test.c  |   3 +
 tests/qtest/tpm-tis-swtpm-test.c|   2 +-
 tests/qtest/tpm-tis-test.c  |   2 +-
 tests/qtest/tpm-util.c  |  16 ++--
 tests/qtest/meson.build |   4 +
 15 files changed, 295 insertions(+), 133 deletions(-)
 create mode 100644 tests/qtest/tpm-crb-device-swtpm-test.c
 create mode 100644 tests/qtest/tpm-crb-device-test.c

diff --git a/tests/qtest/tpm-tests.h b/tests/qtest/tpm-tests.h
index 07ba60d26e..c1bfb2f914 100644
--- a/tests/qtest/tpm-tests.h
+++ b/tests/qtest/tpm-tests.h
@@ -24,4 +24,6 @@ void tpm_test_swtpm_migration_test(const char *src_tpm_path,
const char *ifmodel,
const char *machine_options);
 
+void tpm_test_crb(const void *data);
+
 #endif /* TESTS_TPM_TESTS_H */
diff --git a/tests/qtest/tpm-util.h b/tests/qtest/tpm-util.h
index 0cb28dd6e5..c99380684e 100644
--- a/tests/qtest/tpm-util.h
+++ b/tests/qtest/tpm-util.h
@@ -15,10 +15,10 @@
 
 #include "io/channel-socket.h"
 
-extern uint64_t tpm_tis_base_addr;
+extern uint64_t tpm_device_base_addr;
 
 #define TIS_REG(LOCTY, REG) \
-(tpm_tis_base_addr + ((LOCTY) << 12) + REG)
+(tpm_device_base_addr + ((LOCTY) << 12) + REG)
 
 typedef void (tx_func)(QTestState *s,
const unsigned char *req, size_t req_size,
diff --git a/tests/qtest/bios-tables-test.c b/tests/qtest/bios-tables-test.c
index bb4ebf00c1..01e0a4aa00 100644
--- a/tests/qtest/bios-tables-test.c
+++ b/tests/qtest/bios-tables-test.c
@@ -1445,7 +1445,7 @@ static void test_acpi_piix4_tcg_numamem(void)
 free_test_data(&data);
 }
 
-uint64_t tpm_tis_base_addr;
+uint64_t tpm_device_base_addr;
 
 static test_data tcg_tpm_test_data(const char *machine)
 {
@@ -1481,7 +1481,7 @@ static void test_acpi_tcg_tpm(const char *machine, const 
char *tpm_if,
 const char *suffix = tpm_version == TPM_VERSION_2_0 ? "tpm2" : "tpm12";
 char *args, *variant = g_strdup_printf(".%s.%s", tpm_if, suffix);
 
-tpm_tis_base_addr = base;
+tpm_device_base_addr = base;
 
 module_call_init(MODULE_INIT_QOM);
 
diff --git a/tests/qtest/tpm-crb-device-swtpm-test.c 
b/tests/qtest/tpm-crb-device-swtpm-test.c
new file mode 100644
index 00..332add5ca6
--- /dev/null
+++ b/tests/qtest/tpm-crb-device-swtpm-test.c
@@ -0,0 +1,72 @@
+/*
+ * QTest testcase for TPM CRB talking to external swtpm and swtpm migration
+ *
+ * Copyright (c) 2018 IBM Corporation
+ *  with parts borrowed from migration-test.c that is:
+ * Copyright (c) 2016-2018 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *   Stefan Berger 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+
+#include "libqtest.h"
+#include "qemu/module.h"
+#include "tpm-tests.h"
+#include "hw/acpi/tpm.h"
+
+uint64_t tpm_device_base_addr = 0xc00;
+#define MACHINE_OPTIONS "-machine virt,gic-version=max -accel tcg"
+
+typedef struct TestState {
+char *src_tpm_path;
+char *dst_tpm_path;
+char *uri;
+} TestState;
+
+static void tpm_crb_swtpm_test(const void *data)
+{
+const TestState *ts = data;
+
+tpm_test_swtpm_test(ts->src_tpm_path, tpm_util_crb_transfer,
+"tpm-crb-device", MACHINE_OPTIONS);
+}
+
+static void tpm_crb_swtpm_migration_test(const void *data)
+{
+const TestState *ts = data;
+
+tpm_test_swtpm_migration_test(ts->src_tpm_path, ts->dst_tpm_path, ts->uri,
+  tpm_util_crb_transfer, "tpm-crb-device",
+  MACHINE_OPTIONS);
+}
+
+int main(int argc, char **argv)
+{
+int ret;
+TestState ts = { 0 };
+
+ts.src_tpm_path = g_dir_make_tmp("qemu-tpm-crb-swtpm-test.XXXXXX", NULL);
+ts.dst_tpm_path = g_dir_make_tmp("qemu-tpm-crb-swtpm-test.XXXXXX", NULL);
+ts.uri = g_strdup_printf("unix:%s/migsocket", ts.src_tpm_path);
+
+module_call_init(MODULE_INIT_QOM);
+g_test_init(&argc, &argv, NULL);
+
+qtest_add_data_func("/tpm/crb-swtpm/test", &ts, tpm_crb_swtpm_test);
+qtest_add_data_func("/tpm/crb-swtpm-migration/test", &ts,

[PATCH v5 09/14] tpm_tis_sysbus: move DSDT AML generation to device

2023-11-13 Thread Joelle van Dyne
This removes redundant ACPI table generation code from the different
machine types. Additionally, this will allow us to support different
TPM interfaces with the same AML logic. Finally, this matches the
TPM TIS ISA implementation.

Ideally, we would be able to call `qbus_build_aml` and avoid any TPM
specific code in the ACPI table generation. However, currently we
still have to call `build_tpm2` anyway, and it does not look like
most other ACPI devices support the `ACPI_DEV_AML_IF` interface.

Signed-off-by: Joelle van Dyne 
Reviewed-by: Stefan Berger 
---
 hw/arm/virt-acpi-build.c  | 38 ++
 hw/loongarch/acpi-build.c | 38 ++
 hw/tpm/tpm_tis_sysbus.c   | 37 +
 3 files changed, 41 insertions(+), 72 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 8bc35a483c..499d30eb5d 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -35,6 +35,7 @@
 #include "target/arm/cpu.h"
 #include "hw/acpi/acpi-defs.h"
 #include "hw/acpi/acpi.h"
+#include "hw/acpi/acpi_aml_interface.h"
 #include "hw/nvram/fw_cfg.h"
 #include "hw/acpi/bios-linker-loader.h"
 #include "hw/acpi/aml-build.h"
@@ -208,41 +209,6 @@ static void acpi_dsdt_add_gpio(Aml *scope, const 
MemMapEntry *gpio_memmap,
 aml_append(scope, dev);
 }
 
-#ifdef CONFIG_TPM
-static void acpi_dsdt_add_tpm(Aml *scope, VirtMachineState *vms)
-{
-PlatformBusDevice *pbus = PLATFORM_BUS_DEVICE(vms->platform_bus_dev);
-hwaddr pbus_base = vms->memmap[VIRT_PLATFORM_BUS].base;
-SysBusDevice *sbdev = SYS_BUS_DEVICE(tpm_find());
-MemoryRegion *sbdev_mr;
-hwaddr tpm_base;
-
-if (!sbdev) {
-return;
-}
-
-tpm_base = platform_bus_get_mmio_addr(pbus, sbdev, 0);
-assert(tpm_base != -1);
-
-tpm_base += pbus_base;
-
-sbdev_mr = sysbus_mmio_get_region(sbdev, 0);
-
-Aml *dev = aml_device("TPM0");
-aml_append(dev, aml_name_decl("_HID", aml_string("MSFT0101")));
-aml_append(dev, aml_name_decl("_STR", aml_string("TPM 2.0 Device")));
-aml_append(dev, aml_name_decl("_UID", aml_int(0)));
-
-Aml *crs = aml_resource_template();
-aml_append(crs,
-   aml_memory32_fixed(tpm_base,
-  (uint32_t)memory_region_size(sbdev_mr),
-  AML_READ_WRITE));
-aml_append(dev, aml_name_decl("_CRS", crs));
-aml_append(scope, dev);
-}
-#endif
-
 #define ID_MAPPING_ENTRY_SIZE 20
 #define SMMU_V3_ENTRY_SIZE 68
 #define ROOT_COMPLEX_ENTRY_SIZE 36
@@ -891,7 +857,7 @@ build_dsdt(GArray *table_data, BIOSLinker *linker, 
VirtMachineState *vms)
 
 acpi_dsdt_add_power_button(scope);
 #ifdef CONFIG_TPM
-acpi_dsdt_add_tpm(scope, vms);
+call_dev_aml_func(DEVICE(tpm_find()), scope);
 #endif
 
 aml_append(dsdt, scope);
diff --git a/hw/loongarch/acpi-build.c b/hw/loongarch/acpi-build.c
index ae292fc543..1969bfc8f9 100644
--- a/hw/loongarch/acpi-build.c
+++ b/hw/loongarch/acpi-build.c
@@ -14,6 +14,7 @@
 #include "target/loongarch/cpu.h"
 #include "hw/acpi/acpi-defs.h"
 #include "hw/acpi/acpi.h"
+#include "hw/acpi/acpi_aml_interface.h"
 #include "hw/nvram/fw_cfg.h"
 #include "hw/acpi/bios-linker-loader.h"
 #include "migration/vmstate.h"
@@ -328,41 +329,6 @@ static void build_flash_aml(Aml *scope, 
LoongArchMachineState *lams)
 aml_append(scope, dev);
 }
 
-#ifdef CONFIG_TPM
-static void acpi_dsdt_add_tpm(Aml *scope, LoongArchMachineState *vms)
-{
-PlatformBusDevice *pbus = PLATFORM_BUS_DEVICE(vms->platform_bus_dev);
-hwaddr pbus_base = VIRT_PLATFORM_BUS_BASEADDRESS;
-SysBusDevice *sbdev = SYS_BUS_DEVICE(tpm_find());
-MemoryRegion *sbdev_mr;
-hwaddr tpm_base;
-
-if (!sbdev) {
-return;
-}
-
-tpm_base = platform_bus_get_mmio_addr(pbus, sbdev, 0);
-assert(tpm_base != -1);
-
-tpm_base += pbus_base;
-
-sbdev_mr = sysbus_mmio_get_region(sbdev, 0);
-
-Aml *dev = aml_device("TPM0");
-aml_append(dev, aml_name_decl("_HID", aml_string("MSFT0101")));
-aml_append(dev, aml_name_decl("_STR", aml_string("TPM 2.0 Device")));
-aml_append(dev, aml_name_decl("_UID", aml_int(0)));
-
-Aml *crs = aml_resource_template();
-aml_append(crs,
-   aml_memory32_fixed(tpm_base,
-  (uint32_t)memory_region_size(sbdev_mr),
-  AML_READ_WRITE));
-aml_append(dev, aml_name_decl("_CRS", crs));
-aml_append(scope, dev);
-}
-#endif
-
 /* build DSDT */
 static void
 build_dsdt(GArray *table_data, BIOSLinker *linker, MachineState *machine)
@@ -379,7 +345,7 @@ build_dsdt(GArray *table_data, BIOSLinker *linker, 
MachineState *machine)
 build_la_ged_aml(dsdt, machine);
 build_flash_aml(dsdt, lams);
 #ifdef CONFIG_TPM
-acpi_dsdt_add_tpm(dsdt, lams);
+call_dev_aml_func(DEVICE(tpm_find()), dsdt);
 #endif
 /* System State Package */
 scope = aml_scope("\\");
diff --g

[PATCH v5 00/14] tpm: introduce TPM CRB SysBus device

2023-11-13 Thread Joelle van Dyne
The impetus for this patch set is to get TPM 2.0 working on Windows 11 ARM64.
Windows' tpm.sys does not seem to work on a TPM TIS device (as verified with
VMWare's implementation). However, the current TPM CRB device uses a fixed
system bus address that is reserved for RAM in ARM64 Virt machines.

In the process of adding the TPM CRB SysBus device, we also went ahead and
cleaned up some of the existing TPM hardware code and fixed some bugs. We used
the TPM TIS devices as a template for the TPM CRB devices and refactored out
common code. We moved the ACPI DSDT generation to the device in order to handle
dynamic base address requirements as well as reduce redundant code in different
machine ACPI generation. We also changed the tpm_crb device to use the ISA bus
instead of depending on the default system bus, as the device was only built
for the PC configuration.

Another change is that the TPM CRB registers are now mapped in the same way
that the pflash ROM devices are mapped: the region is memory whose writes are
trapped as MMIO accesses. This was needed because Apple Silicon does not
decode LDP (AArch64 load pair of registers) accesses, which caused page
faults. @agraf suggested that we do this to avoid having to do AArch64
decoding in the HVF backend's fault handler.

Unfortunately, it seems like the LDP fault still happens on HVF but the issue
seems to be in the HVF backend which needs to be fixed in a separate patch.

One last thing that's needed to get Windows 11 to recognize the TPM 2.0 device
is for the OVMF firmware to setup the TPM device. Currently, OVMF for ARM64 Virt
only recognizes the TPM TIS device through an FDT entry. A workaround is to
falsely identify the TPM CRB device as a TPM TIS device in the FDT node but this
causes issues for Linux. A proper fix would involve adding an ACPI device driver
in OVMF.

This has been tested on ARM64 with `tpm-crb-device` and on x86_64 with
`tpm-crb`. Additional testing should be performed on other architectures (RISCV
and Loongarch for example) as well as migration cases.

v5:
- Fixed a typo in "tpm_crb: use a single read-as-mem/write-as-mmio mapping"
- Fixed ACPI tables not being created for pc CRB device

v4:
- Fixed broken test blobs

v3:
- Support backwards and forwards migration of existing tpm-crb device
- Dropped patch which moved tpm-crb to ISA bus due to migration concerns
- Unified `tpm_sysbus_plug` handler for ARM and Loongarch
- Added ACPI table tests for tpm-crb-device
- Refactored TPM CRB tests to run on tpm-crb-device for ARM Virt

v2:
- Fixed an issue where VMstate restore from an older version failed due to name
  collision of the memory block.
- In the ACPI table generation for CRB devices, the check for TPM 2.0 backend is
  moved to the device realize as CRB does not support TPM 1.0. It will error in
  that case.
- Dropped the patch to fix crash when PPI is enabled on TIS SysBus device since
  a separate patch submitted by Stefan Berger disables such an option.
- Fixed an issue where we default tpmEstablished=0 when it should be 1.
- In TPM CRB SysBus's ACPI entry, we accidentally changed _UID from 0 to 1. This
  shouldn't be an issue but we changed it back just in case.
- Added a patch to migrate saved VMstate from an older version with the regs
  saved separately instead of as a RAM block.

Joelle van Dyne (14):
  tpm_crb: refactor common code
  tpm_crb: CTRL_RSP_ADDR is 64-bits wide
  tpm_ppi: refactor memory space initialization
  tpm_crb: use a single read-as-mem/write-as-mmio mapping
  tpm_crb: move ACPI table building to device interface
  tpm-sysbus: add plug handler for TPM on SysBus
  hw/arm/virt: connect TPM to platform bus
  hw/loongarch/virt: connect TPM to platform bus
  tpm_tis_sysbus: move DSDT AML generation to device
  tests: acpi: prepare for TPM CRB tests
  tpm_crb_sysbus: introduce TPM CRB SysBus device
  tests: acpi: implement TPM CRB tests for ARM virt
  tests: acpi: updated expected blobs for TPM CRB
  tests: add TPM-CRB sysbus tests for aarch64

 docs/specs/tpm.rst|   2 +
 hw/tpm/tpm_crb.h  |  79 ++
 hw/tpm/tpm_ppi.h  |  10 +-
 include/hw/acpi/tpm.h |   3 +-
 include/sysemu/tpm.h  |   7 +
 tests/qtest/tpm-tests.h   |   2 +
 tests/qtest/tpm-util.h|   4 +-
 hw/acpi/aml-build.c   |   7 +-
 hw/arm/virt-acpi-build.c  |  38 +--
 hw/arm/virt.c |   8 +
 hw/core/sysbus-fdt.c  |   1 +
 hw/i386/acpi-build.c  |  16 +-
 hw/loongarch/acpi-build.c |  38 +--
 hw/loongarch/virt.c   |   8 +
 hw/riscv/virt.c   |   1 +
 hw/tpm/tpm-sysbus.c   |  47 
 hw/tpm/tpm_crb.c  | 302 ++
 hw/tpm/tpm_crb_common.c   | 262 +++
 hw/tpm/tpm_crb_sysbus.c   

[PATCH v5 07/14] hw/arm/virt: connect TPM to platform bus

2023-11-13 Thread Joelle van Dyne
Signed-off-by: Joelle van Dyne 
Reviewed-by: Stefan Berger 
---
 hw/arm/virt.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 85e3c5ba9d..36e2506420 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2811,6 +2811,13 @@ static void virt_machine_device_plug_cb(HotplugHandler 
*hotplug_dev,
 vms->virtio_iommu_bdf = pci_get_bdf(pdev);
 create_virtio_iommu_dt_bindings(vms);
 }
+
+#ifdef CONFIG_TPM
+if (object_dynamic_cast(OBJECT(dev), TYPE_TPM_IF)) {
+tpm_sysbus_plug(TPM_IF(dev), OBJECT(vms->platform_bus_dev),
+vms->memmap[VIRT_PLATFORM_BUS].base);
+}
+#endif
 }
 
 static void virt_dimm_unplug_request(HotplugHandler *hotplug_dev,
-- 
2.41.0




Re: [RFC PATCH 1/2] migration: Report error in incoming migration

2023-11-13 Thread Fabiano Rosas
Peter Xu  writes:

> On Fri, Nov 10, 2023 at 07:58:00AM -0300, Fabiano Rosas wrote:
>> Peter Xu  writes:
>> 
>> > On Thu, Nov 09, 2023 at 01:58:55PM -0300, Fabiano Rosas wrote:
>> >> We're not currently reporting the errors set with migrate_set_error()
>> >> when incoming migration fails.
>> >> 
>> >> Signed-off-by: Fabiano Rosas 
>> >> ---
>> >>  migration/migration.c | 7 +++
>> >>  1 file changed, 7 insertions(+)
>> >> 
>> >> diff --git a/migration/migration.c b/migration/migration.c
>> >> index 28a34c9068..cca32c553c 100644
>> >> --- a/migration/migration.c
>> >> +++ b/migration/migration.c
>> >> @@ -698,6 +698,13 @@ process_incoming_migration_co(void *opaque)
>> >>  }
>> >>  
>> >>  if (ret < 0) {
>> >> +MigrationState *s = migrate_get_current();
>> >> +
>> >> +if (migrate_has_error(s)) {
>> >> +WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
>> >> +error_report_err(s->error);
>> >> +}
>> >> +}
>> >
>> > What's the major benefit of dumping this explicitly?
>> 
>> This is incoming migration, so there's no centralized error reporting
>> aside from the useless "load of migration failed: -5". If the code has
>> not called error_report we just never see the error message.
>> 
>> > And this is not relevant to the multifd problem, correct?
>> 
>> Yes, I'm being sneaky.
>
> Trying to sneak one patch into a 2 patch series is prone to be exposed and
> lose the effect. :-)
>
> I remember we had the verbose error before. Was that lost since some
> commit?  In all cases, feel free to post that separately if you think we
> should get it back.
>
> The multifd fixes do not look like a regression either for this release. If
> so, both of them may be better next release's material?

People have complained about it on IRC and I hit it twice in a week. I
would call it a regression. However, we _do_ have an indication that it
might have been there all along since someone already tried to fix a
very similar issue, maybe even the same one. So I'm fine with punting to
the next release.



Re: [RFC PATCH 2/2] migration/multifd: Move semaphore release into main thread

2023-11-13 Thread Fabiano Rosas
Peter Xu  writes:

> On Fri, Nov 10, 2023 at 09:05:41AM -0300, Fabiano Rosas wrote:
>
> [...]
>
>> > Then assuming we have a clear model with all these threads issue fixed (no
>> > matter whether we'd shrink 2N threads into N threads), then what we need to
>> > do, IMHO, is making sure to join() all of them before destroying anything
>> > (say, per-channel MultiFDSendParams).  Then when we destroy everything
>> > safely, either mutex/sem/etc..  Because no one will race us anymore.
>> 
>> This doesn't address the race. There's a data dependency between the
>> multifd channels and the migration thread around the channels_ready
>> semaphore. So we cannot join the migration thread because it could be
>> stuck waiting for the semaphore, which means we cannot join+cleanup the
>> channel thread because the semaphore is still being used.
>
> I think this is the major part of confusion, on why this can happen.
>
> The problem is afaik multifd_save_cleanup() is only called by
> migrate_fd_cleanup(), which is further only called in:
>
>   1) migrate_fd_cleanup_bh()
>   2) migrate_fd_connect()
>
> For 1): it's only run when migration comletes/fails/etc. (in all cases,
> right before it quits..) and then kicks off migrate_fd_cleanup_schedule().
> So migration thread shouldn't be stuck, afaiu, or it won't be able to kick
> that BH.
>
> For 2): it's called by the main thread, where migration thread should have
> not yet been created.
>
> With that, I don't see how migrate_fd_cleanup() would need to worry about
> migration thread
>
> Did I miss something?

There are two points:

1) multifd_new_send_channel_async() doesn't set an Error. Even if
multifd_channel_connect() fails, we'll still continue with
migrate_fd_connect(). I don't see any code that looks at the migration
error (s->error).

2) the TLS handshake thread of one of the channels could simply not get
any chance to run until something else fails and we reach
multifd_save_cleanup() from the BH path.

This second point in particular is why I don't think simply joining the
TLS thread will avoid the race. There's nothing linking the multifd
channels together, we could have 7 of them operational and a 8th one
still going through the TLS handshake.

That said, I'm not sure about the exact path we take to reach the bug
situation. It's very hard to reproduce so I'm relying entirely on code
inspection.



Re: [PATCH v4 04/14] tpm_crb: use a single read-as-mem/write-as-mmio mapping

2023-11-13 Thread Joelle van Dyne
On Wed, Nov 1, 2023 at 2:25 PM Stefan Berger  wrote:
>
>
>
> On 10/31/23 00:00, Joelle van Dyne wrote:
> > On Apple Silicon, when Windows performs a LDP on the CRB MMIO space,
> > the exception is not decoded by hardware and we cannot trap the MMIO
> > read. This led to the idea from @agraf to use the same mapping type as
> > ROM devices: namely that reads should be seen as memory type and
> > writes should trap as MMIO.
> >
> > Once that was done, the second memory mapping of the command buffer
> > region was redundant and was removed.
> >
> > A note about the removal of the read trap for `CRB_LOC_STATE`:
> > The only usage was to return the most up-to-date value for
> > `tpmEstablished`. However, `tpmEstablished` is only cleared when a
> > TPM2_HashStart operation is called which only exists for locality 4.
> > We do not handle locality 4. Indeed, the comment for the write handler
> > of `CRB_LOC_CTRL` makes the same argument for why it is not calling
> > the backend to reset the `tpmEstablished` bit (to 1).
> > As this bit is unused, we do not need to worry about updating it for
> > reads.
> >
> > In order to maintain migration compatibility with older versions of
> > QEMU, we store a copy of the register data and command data which is
> > used only during save/restore.
> >
> > Signed-off-by: Joelle van Dyne 
> > ---
>
> > diff --git a/hw/tpm/tpm_crb_common.c b/hw/tpm/tpm_crb_common.c
> > index bee0b71fee..605e8576e9 100644
> > --- a/hw/tpm/tpm_crb_common.c
> > +++ b/hw/tpm/tpm_crb_common.c
> > @@ -31,31 +31,12 @@
> >   #include "qom/object.h"
> >   #include "tpm_crb.h"
> >
> > -static uint64_t tpm_crb_mmio_read(void *opaque, hwaddr addr,
> > -  unsigned size)
> > +static uint8_t tpm_crb_get_active_locty(TPMCRBState *s, uint32_t *regs)
> >   {
> > -TPMCRBState *s = opaque;
> > -void *regs = (void *)&s->regs + (addr & ~3);
> > -unsigned offset = addr & 3;
> > -uint32_t val = *(uint32_t *)regs >> (8 * offset);
> > -
> > -switch (addr) {
> > -case A_CRB_LOC_STATE:
> > -val |= !tpm_backend_get_tpm_established_flag(s->tpmbe);
> > -break;
> > -}
> > -
> > -trace_tpm_crb_mmio_read(addr, size, val);
> > -
> > -return val;
> > -}
> > -
> > -static uint8_t tpm_crb_get_active_locty(TPMCRBState *s)
> > -{
> > -if (!ARRAY_FIELD_EX32(s->regs, CRB_LOC_STATE, locAssigned)) {
> > +if (!ARRAY_FIELD_EX32(regs, CRB_LOC_STATE, locAssigned)) {
> >   return TPM_CRB_NO_LOCALITY;
> >   }
> > -return ARRAY_FIELD_EX32(s->regs, CRB_LOC_STATE, activeLocality);
> > +return ARRAY_FIELD_EX32(regs, CRB_LOC_STATE, activeLocality);
> >   }
> >
> >   static void tpm_crb_mmio_write(void *opaque, hwaddr addr,
> > @@ -63,35 +44,47 @@ static void tpm_crb_mmio_write(void *opaque, hwaddr 
> > addr,
> >   {
> >   TPMCRBState *s = opaque;
> >   uint8_t locty =  addr >> 12;
> > +uint32_t *regs;
> > +void *mem;
> >
> >   trace_tpm_crb_mmio_write(addr, size, val);
> > +regs = memory_region_get_ram_ptr(&s->mmio);
> > +mem = ®s[R_CRB_DATA_BUFFER];
> > +assert(regs);
> > +
> > +if (addr >= A_CRB_DATA_BUFFER) {
>
>
> Can you write here /* receive TPM command bytes */ ?
Will do.

>
>
> > +assert(addr + size <= TPM_CRB_ADDR_SIZE);
> > +assert(size <= sizeof(val));
> > +memcpy(mem + addr - A_CRB_DATA_BUFFER, &val, size);
>
> > +memory_region_set_dirty(&s->mmio, addr, size);
> > +return;
> > +}
> >
> >   switch (addr) {
> >   case A_CRB_CTRL_REQ:
> >   switch (val) {
> >   case CRB_CTRL_REQ_CMD_READY:
> > -ARRAY_FIELD_DP32(s->regs, CRB_CTRL_STS,
> > +ARRAY_FIELD_DP32(regs, CRB_CTRL_STS,
> >tpmIdle, 0);
> >   break;
> >   case CRB_CTRL_REQ_GO_IDLE:
> > -ARRAY_FIELD_DP32(s->regs, CRB_CTRL_STS,
> > +ARRAY_FIELD_DP32(regs, CRB_CTRL_STS,
> >tpmIdle, 1);
> >   break;
> >   }
> >   break;
> >   case A_CRB_CTRL_CANCEL:
> >   if (val == CRB_CANCEL_INVOKE &&
> > -s->regs[R_CRB_CTRL_START] & CRB_START_INVOKE) {
> > +regs[R_CRB_CTRL_START] & CRB_START_INVOKE) {
> >   tpm_backend_cancel_cmd(s->tpmbe);
> >   }
> >   break;
> >   case A_CRB_CTRL_START:
> >   if (val == CRB_START_INVOKE &&
> > -!(s->regs[R_CRB_CTRL_START] & CRB_START_INVOKE) &&
> > -tpm_crb_get_active_locty(s) == locty) {
> > -void *mem = memory_region_get_ram_ptr(&s->cmdmem);
> > +!(regs[R_CRB_CTRL_START] & CRB_START_INVOKE) &&
> > +tpm_crb_get_active_locty(s, regs) == locty) {
> >
> > -s->regs[R_CRB_CTRL_START] |= CRB_START_INVOKE;
> > +regs[R_CRB_CTRL_START] |= CRB_START_INVOKE;
> >   s->cmd = (TPMBackendCmd) {
> >   .in = mem,
> >   

RE: [PATCH v5 03/11] hw/misc: Add qtest for NPCM7xx PCI Mailbox

2023-11-13 Thread kft...@nuvoton.com


-Original Message-
From: Nabih Estefan 
Sent: Saturday, October 28, 2023 1:55 AM
To: peter.mayd...@linaro.org
Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org; CS20 KFTing 
; wuhao...@google.com; jasonw...@redhat.com; IS20 Avi 
Fishman ; nabiheste...@google.com; CS20 KWLiu 
; IS20 Tomer Maimon ; IN20 Hila 
Miranda-Kuzi 
Subject: [PATCH v5 03/11] hw/misc: Add qtest for NPCM7xx PCI Mailbox


From: Hao Wu 

This patch adds a qtest for the NPCM7xx PCI Mailbox module.
It sends read and write requests to the module, and verifies that the module 
contains the correct data after the requests.

Change-Id: Id7a4b3cbea564383b94d507552dfd16f6b5127d1
Signed-off-by: Hao Wu 
Signed-off-by: Nabih Estefan 
---
 tests/qtest/meson.build |   1 +
 tests/qtest/npcm7xx_pci_mbox-test.c | 238 
 2 files changed, 239 insertions(+)
 create mode 100644 tests/qtest/npcm7xx_pci_mbox-test.c

diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build index 
d6022ebd64..daec219a32 100644
--- a/tests/qtest/meson.build
+++ b/tests/qtest/meson.build
@@ -183,6 +183,7 @@ qtests_sparc64 = \
 qtests_npcm7xx = \
   ['npcm7xx_adc-test',
'npcm7xx_gpio-test',
+   'npcm7xx_pci_mbox-test',
'npcm7xx_pwm-test',
'npcm7xx_rng-test',
'npcm7xx_sdhci-test',
diff --git a/tests/qtest/npcm7xx_pci_mbox-test.c 
b/tests/qtest/npcm7xx_pci_mbox-test.c
new file mode 100644
index 00..24eec18e3c
--- /dev/null
+++ b/tests/qtest/npcm7xx_pci_mbox-test.c
@@ -0,0 +1,238 @@
+/*
+ * QTests for Nuvoton NPCM7xx PCI Mailbox Modules.
+ *
+ * Copyright 2021 Google LLC
+ *
+ * This program is free software; you can redistribute it and/or modify
+it
+ * under the terms of the GNU General Public License as published by
+the
+ * Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
+ * for more details.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/bitops.h"
+#include "qapi/qmp/qdict.h"
+#include "qapi/qmp/qnum.h"
+#include "libqtest-single.h"
+
+#define PCI_MBOX_BA 0xf0848000
+#define PCI_MBOX_IRQ8
+
+/* register offset */
+#define PCI_MBOX_STAT   0x00
+#define PCI_MBOX_CTL0x04
+#define PCI_MBOX_CMD0x08
+
+#define CODE_OK 0x00
+#define CODE_INVALID_OP 0xa0
+#define CODE_INVALID_SIZE   0xa1
+#define CODE_ERROR  0xff
+
+#define OP_READ 0x01
+#define OP_WRITE0x02
+#define OP_INVALID  0x41
+
+
+static int sock;
+static int fd;
+
+/*
+ * Create a local TCP socket with any port, then save off the port we got.
+ */
+static in_port_t open_socket(void)
+{
+struct sockaddr_in myaddr;
+socklen_t addrlen;
+
+myaddr.sin_family = AF_INET;
+myaddr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+myaddr.sin_port = 0;
+sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
+g_assert(sock != -1);
+g_assert(bind(sock, (struct sockaddr *) &myaddr, sizeof(myaddr)) != -1);
+addrlen = sizeof(myaddr);
+g_assert(getsockname(sock, (struct sockaddr *) &myaddr , &addrlen) != -1);
+g_assert(listen(sock, 1) != -1);
+return ntohs(myaddr.sin_port);
+}
+
+static void setup_fd(void)
+{
+fd_set readfds;
+
+FD_ZERO(&readfds);
+FD_SET(sock, &readfds);
+g_assert(select(sock + 1, &readfds, NULL, NULL, NULL) == 1);
+
+fd = accept(sock, NULL, 0);
+g_assert(fd >= 0);
+}
+
+static uint8_t read_response(uint8_t *buf, size_t len) {
+uint8_t code;
+ssize_t ret = read(fd, &code, 1);
+
+if (ret == -1) {
+return CODE_ERROR;
+}
+if (code != CODE_OK) {
+return code;
+}
+g_test_message("response code: %x", code);
+if (len > 0) {
+ret = read(fd, buf, len);
+if (ret < len) {
+return CODE_ERROR;
+}
+}
+return CODE_OK;
+}
+
+static void receive_data(uint64_t offset, uint8_t *buf, size_t len) {
+uint8_t op = OP_READ;
+uint8_t code;
+ssize_t rv;
+
+while (len > 0) {
+uint8_t size;
+
+if (len >= 8) {
+size = 8;
+} else if (len >= 4) {
+size = 4;
+} else if (len >= 2) {
+size = 2;
+} else {
+size = 1;
+}
+
+g_test_message("receiving %u bytes", size);
+/* Write op */
+rv = write(fd, &op, 1);
+g_assert_cmpint(rv, ==, 1);
+/* Write offset */
+rv = write(fd, (uint8_t *)&offset, sizeof(uint64_t));
+g_assert_cmpint(rv, ==, sizeof(uint64_t));
+/* Write size */
+g_assert_cmpint(write(fd, &size, 1), ==, 1);
+
+/* Read data and Expect response */
+code = 

[PATCH 1/2] vhost: Add worker backend callouts

2023-11-13 Thread Mike Christie
This adds the vhost backend callouts for the worker ioctls added in the
6.4 linux kernel commit:

c1ecd8e95007 ("vhost: allow userspace to create workers")

Signed-off-by: Mike Christie 
---
 hw/virtio/vhost-backend.c | 28 
 include/hw/virtio/vhost-backend.h | 14 ++
 2 files changed, 42 insertions(+)

diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
index 17f3fc6a0823..833804dd40f2 100644
--- a/hw/virtio/vhost-backend.c
+++ b/hw/virtio/vhost-backend.c
@@ -158,6 +158,30 @@ static int vhost_kernel_set_vring_busyloop_timeout(struct 
vhost_dev *dev,
 return vhost_kernel_call(dev, VHOST_SET_VRING_BUSYLOOP_TIMEOUT, s);
 }
 
+static int vhost_kernel_new_worker(struct vhost_dev *dev,
+   struct vhost_worker_state *worker)
+{
+return vhost_kernel_call(dev, VHOST_NEW_WORKER, worker);
+}
+
+static int vhost_kernel_free_worker(struct vhost_dev *dev,
+struct vhost_worker_state *worker)
+{
+return vhost_kernel_call(dev, VHOST_FREE_WORKER, worker);
+}
+
+static int vhost_kernel_attach_vring_worker(struct vhost_dev *dev,
+struct vhost_vring_worker *worker)
+{
+return vhost_kernel_call(dev, VHOST_ATTACH_VRING_WORKER, worker);
+}
+
+static int vhost_kernel_get_vring_worker(struct vhost_dev *dev,
+ struct vhost_vring_worker *worker)
+{
+return vhost_kernel_call(dev, VHOST_GET_VRING_WORKER, worker);
+}
+
 static int vhost_kernel_set_features(struct vhost_dev *dev,
  uint64_t features)
 {
@@ -313,6 +337,10 @@ const VhostOps kernel_ops = {
 .vhost_set_vring_err = vhost_kernel_set_vring_err,
 .vhost_set_vring_busyloop_timeout =
 vhost_kernel_set_vring_busyloop_timeout,
+.vhost_get_vring_worker = vhost_kernel_get_vring_worker,
+.vhost_attach_vring_worker = vhost_kernel_attach_vring_worker,
+.vhost_new_worker = vhost_kernel_new_worker,
+.vhost_free_worker = vhost_kernel_free_worker,
 .vhost_set_features = vhost_kernel_set_features,
 .vhost_get_features = vhost_kernel_get_features,
 .vhost_set_backend_cap = vhost_kernel_set_backend_cap,
diff --git a/include/hw/virtio/vhost-backend.h 
b/include/hw/virtio/vhost-backend.h
index 96ccc18cd33b..9f16d0884e8f 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -33,6 +33,8 @@ struct vhost_memory;
 struct vhost_vring_file;
 struct vhost_vring_state;
 struct vhost_vring_addr;
+struct vhost_vring_worker;
+struct vhost_worker_state;
 struct vhost_scsi_target;
 struct vhost_iotlb_msg;
 struct vhost_virtqueue;
@@ -73,6 +75,14 @@ typedef int (*vhost_set_vring_err_op)(struct vhost_dev *dev,
   struct vhost_vring_file *file);
 typedef int (*vhost_set_vring_busyloop_timeout_op)(struct vhost_dev *dev,
struct vhost_vring_state 
*r);
+typedef int (*vhost_attach_vring_worker_op)(struct vhost_dev *dev,
+struct vhost_vring_worker *worker);
+typedef int (*vhost_get_vring_worker_op)(struct vhost_dev *dev,
+ struct vhost_vring_worker *worker);
+typedef int (*vhost_new_worker_op)(struct vhost_dev *dev,
+   struct vhost_worker_state *worker);
+typedef int (*vhost_free_worker_op)(struct vhost_dev *dev,
+struct vhost_worker_state *worker);
 typedef int (*vhost_set_features_op)(struct vhost_dev *dev,
  uint64_t features);
 typedef int (*vhost_get_features_op)(struct vhost_dev *dev,
@@ -151,6 +161,10 @@ typedef struct VhostOps {
 vhost_set_vring_call_op vhost_set_vring_call;
 vhost_set_vring_err_op vhost_set_vring_err;
 vhost_set_vring_busyloop_timeout_op vhost_set_vring_busyloop_timeout;
+vhost_new_worker_op vhost_new_worker;
+vhost_free_worker_op vhost_free_worker;
+vhost_get_vring_worker_op vhost_get_vring_worker;
+vhost_attach_vring_worker_op vhost_attach_vring_worker;
 vhost_set_features_op vhost_set_features;
 vhost_get_features_op vhost_get_features;
 vhost_set_backend_cap_op vhost_set_backend_cap;
-- 
2.34.1




[PATCH 0/2] vhost-scsi: Support worker ioctls

2023-11-13 Thread Mike Christie
The following patches allow users to configure the vhost worker threads
for vhost-scsi. With vhost-net we get a worker thread per rx/tx virtqueue
pair, but for vhost-scsi we get one worker for all virtqueues. This
becomes a bottleneck after 2 queues are used.

In the upstream linux kernel commit:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/vhost/vhost.c?id=c1ecd8e9500797748ae4f79657971955d452d69d

we enabled the vhost layer to be able to create a worker thread and
attach it to a virtqueue.

This patchset adds support to vhost-scsi to use these ioctls so we are
no longer limited to the single worker.





[PATCH 2/2] vhost-scsi: Add support for a worker thread per virtqueue

2023-11-13 Thread Mike Christie
This adds support for vhost-scsi to be able to create a worker thread
per virtqueue. Right now for vhost-net we get a worker thread per
tx/rx virtqueue pair which scales nicely as we add more virtqueues and
CPUs, but for scsi we get the single worker thread that's shared by all
virtqueues. When trying to send IO to more than 2 virtqueues, the single
thread becomes a bottleneck.

This patch adds a new setting, virtqueue_workers, which can be set to:

1: Existing behavior where we get the single thread.
-1: Create a worker per IO virtqueue.

Signed-off-by: Mike Christie 
---
 hw/scsi/vhost-scsi.c| 68 +
 include/hw/virtio/virtio-scsi.h |  1 +
 2 files changed, 69 insertions(+)

diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c
index 3126df9e1d9d..5cf669b6563b 100644
--- a/hw/scsi/vhost-scsi.c
+++ b/hw/scsi/vhost-scsi.c
@@ -31,6 +31,9 @@
 #include "qemu/cutils.h"
 #include "sysemu/sysemu.h"
 
+#define VHOST_SCSI_WORKER_PER_VQ-1
+#define VHOST_SCSI_WORKER_DEF1
+
 /* Features supported by host kernel. */
 static const int kernel_feature_bits[] = {
 VIRTIO_F_NOTIFY_ON_EMPTY,
@@ -165,6 +168,62 @@ static const VMStateDescription vmstate_virtio_vhost_scsi 
= {
 .pre_save = vhost_scsi_pre_save,
 };
 
+static int vhost_scsi_set_workers(VHostSCSICommon *vsc, int workers_cnt)
+{
+struct vhost_dev *dev = &vsc->dev;
+struct vhost_vring_worker vq_worker;
+struct vhost_worker_state worker;
+int i, ret;
+
+/* Use default worker */
+if (workers_cnt == VHOST_SCSI_WORKER_DEF ||
+dev->nvqs == VHOST_SCSI_VQ_NUM_FIXED + 1) {
+return 0;
+}
+
+if (workers_cnt != VHOST_SCSI_WORKER_PER_VQ) {
+return -EINVAL;
+}
+
+/*
+ * ctl/evt share the first worker since it will be rare for them
+ * to send cmds while IO is running.
+ */
+for (i = VHOST_SCSI_VQ_NUM_FIXED + 1; i < dev->nvqs; i++) {
+memset(&worker, 0, sizeof(worker));
+
+ret = dev->vhost_ops->vhost_new_worker(dev, &worker);
+if (ret == -ENOTTY) {
+/*
+ * worker ioctls are not implemented so just ignore
+ * and continue device setup.
+ */
+ret = 0;
+break;
+} else if (ret) {
+break;
+}
+
+memset(&vq_worker, 0, sizeof(vq_worker));
+vq_worker.worker_id = worker.worker_id;
+vq_worker.index = i;
+
+ret = dev->vhost_ops->vhost_attach_vring_worker(dev, &vq_worker);
+if (ret == -ENOTTY) {
+/*
+ * It's a bug for the kernel to have supported the worker creation
+ * ioctl but not attach.
+ */
+dev->vhost_ops->vhost_free_worker(dev, &worker);
+break;
+} else if (ret) {
+break;
+}
+}
+
+return ret;
+}
+
 static void vhost_scsi_realize(DeviceState *dev, Error **errp)
 {
 VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(dev);
@@ -232,6 +291,13 @@ static void vhost_scsi_realize(DeviceState *dev, Error 
**errp)
 goto free_vqs;
 }
 
+ret = vhost_scsi_set_workers(vsc, vs->conf.virtqueue_workers);
+if (ret < 0) {
+error_setg(errp, "vhost-scsi: vhost worker setup failed: %s",
+   strerror(-ret));
+goto free_vqs;
+}
+
 /* At present, channel and lun both are 0 for bootable vhost-scsi disk */
 vsc->channel = 0;
 vsc->lun = 0;
@@ -297,6 +363,8 @@ static Property vhost_scsi_properties[] = {
  VIRTIO_SCSI_F_T10_PI,
  false),
 DEFINE_PROP_BOOL("migratable", VHostSCSICommon, migratable, false),
+DEFINE_PROP_INT32("virtqueue_workers", VirtIOSCSICommon,
+  conf.virtqueue_workers, VHOST_SCSI_WORKER_DEF),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index 779568ab5d28..f70624ece564 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -51,6 +51,7 @@ typedef struct virtio_scsi_config VirtIOSCSIConfig;
 struct VirtIOSCSIConf {
 uint32_t num_queues;
 uint32_t virtqueue_size;
+int virtqueue_workers;
 bool seg_max_adjust;
 uint32_t max_sectors;
 uint32_t cmd_per_lun;
-- 
2.34.1




Add new qmp json is not setting command errp

2023-11-13 Thread bambooza
I am attempting to add a new json file for qmp.  My issue is that errp is
null so when I call error_setg qemu crashes.

Specifically

QEMU: F 00:00:1699907518.054094 2802908 logging.cc:57] assert.h
assertion failed at qemu/util/error.c:59 in void error_setv(Error **, const
char *, int, const char *, ErrorClass, const char *, struct __va_list_tag
*, const char *): *errp == NULL
I am not sure what I'm missing.  The generated files for command, event,
types, visit and view are created.

The function gets called and the passed-in value is correctly populated;
it's just that errp is null.


Re: [PATCH v2 3/3] hw/ide/via: implement legacy/native mode switching

2023-11-13 Thread BALATON Zoltan

On Mon, 13 Nov 2023, Mark Cave-Ayland wrote:

On 07/11/2023 10:43, Kevin Wolf wrote:

Am 06.11.2023 um 17:13 hat BALATON Zoltan geschrieben:

On Mon, 6 Nov 2023, Kevin Wolf wrote:

Am 25.10.2023 um 00:40 hat Mark Cave-Ayland geschrieben:
Allow the VIA IDE controller to switch between both legacy and native 
modes by
calling pci_ide_update_mode() to reconfigure the device whenever 
PCI_CLASS_PROG

is updated.

This patch moves the initial setting of PCI_CLASS_PROG from 
via_ide_realize() to
via_ide_reset(), and removes the direct setting of PCI_INTERRUPT_PIN 
during PCI
bus reset since this is now managed by pci_ide_update_mode(). This 
ensures that
the device configuration is always consistent with respect to the 
currently

selected mode.

Signed-off-by: Mark Cave-Ayland 
Tested-by: BALATON Zoltan 
Tested-by: Bernhard Beschow 


As I already noted in patch 1, the interrupt handling seems to be wrong
here, it continues to use the ISA IRQ in via_ide_set_irq() even after
switching to native mode.


That's a peculiarity of this via-ide device. It always uses 14/15 legacy
interrupts even in native mode and guests expect that so using native
interrupts would break pegasos2 guests. This was discussed and tested
extensively before.


This definitely needs a comment to explain the situation then because
this is in violation of the spec. If real hardware behaves like this,
it's what we should do, of course, but it's certainly unexpected and we
should explicitly document it to avoid breaking it later when someone
touches the code who doesn't know about this peculiarity.


It's a little bit more complicated than this: in native mode it is possible 
to route the IRQs for each individual channel to a small select number of 
IRQs by configuring special registers on the VIA.


That's documented for the VT82c686B, but the VT8231 doc says other values are 
reserved and only IRQ 14/15 are valid. So even if it worked, nothing uses it, 
so we don't have to be concerned about it, and just using these hard-coded 
14/15 values is enough. It's probably not worth trying to emulate chip 
functions that no guest uses, especially when we're not sure how the real 
chip works, as we can't test on a real machine.


Regards,
BALATON Zoltan

The complication here is that it isn't immediately obvious how the QEMU PCI 
routing code can do this - I did post about this at 
https://lists.gnu.org/archive/html/qemu-devel/2023-10/msg10552.html asking 
the best way to resolve this, but haven't had any replies yet.


Fortunately it seems that all the guests tested so far stick with the IRQ 
14/15 defaults which is why this happens to work, so short-term this is a 
lower priority when looking at consolidating the switching logic.



ATB,

Mark.






Re: [PATCH v2 1/3] ide/pci.c: introduce pci_ide_update_mode() function

2023-11-13 Thread BALATON Zoltan

On Mon, 13 Nov 2023, Mark Cave-Ayland wrote:

On 07/11/2023 11:11, Kevin Wolf wrote:

Am 06.11.2023 um 23:41 hat Mark Cave-Ayland geschrieben:

On 06/11/2023 14:12, Kevin Wolf wrote:

Hi Kevin,

Thanks for taking the time to review this. I'll reply inline below.


Am 25.10.2023 um 00:40 hat Mark Cave-Ayland geschrieben:

This function reads the value of the PCI_CLASS_PROG register for PCI IDE
controllers and configures the PCI BARs and/or IDE ioports accordingly.

In the case where we switch to legacy mode, the PCI BARs are set to 
return zero
(as suggested in the "PCI IDE Controller" specification), the legacy IDE 
ioports
are enabled, and the PCI interrupt pin cleared to indicate legacy IRQ 
routing.


Conversely when we switch to native mode, the legacy IDE ioports are 
disabled
and the PCI interrupt pin set to indicate native IRQ routing. The 
contents of
the PCI BARs are unspecified, but this is not an issue since if a PCI 
IDE
controller has been switched to native mode then its BARs will need to 
be

programmed.

Signed-off-by: Mark Cave-Ayland 
Tested-by: BALATON Zoltan 
Tested-by: Bernhard Beschow 
---
   hw/ide/pci.c | 90 


   include/hw/ide/pci.h |  1 +
   2 files changed, 91 insertions(+)

diff --git a/hw/ide/pci.c b/hw/ide/pci.c
index a25b352537..5be643b460 100644
--- a/hw/ide/pci.c
+++ b/hw/ide/pci.c
@@ -104,6 +104,96 @@ const MemoryRegionOps pci_ide_data_le_ops = {
   .endianness = DEVICE_LITTLE_ENDIAN,
   };
+static const MemoryRegionPortio ide_portio_list[] = {
+{ 0, 8, 1, .read = ide_ioport_read, .write = ide_ioport_write },
+{ 0, 1, 2, .read = ide_data_readw, .write = ide_data_writew },
+{ 0, 1, 4, .read = ide_data_readl, .write = ide_data_writel },
+PORTIO_END_OF_LIST(),
+};
+
+static const MemoryRegionPortio ide_portio2_list[] = {
+{ 0, 1, 1, .read = ide_status_read, .write = ide_ctrl_write },
+PORTIO_END_OF_LIST(),
+};


This is duplicated from hw/ide/ioport.c. I think it would be better to
use the arrays already defined there, ideally by calling ioport.c
functions to setup and release the I/O ports.


The tricky part here is that hw/ide/ioport.c is defined for CONFIG_ISA, 
and

so if we did that then all PCI IDE controllers would become dependent upon
ISA too, regardless of whether they implement compatibility mode or not.
What do you think is the best solution here? Perhaps moving
ide_init_ioport() to a more ISA-specific place? I know that both myself 
and

Phil have considered whether ide_init_ioport() should be replaced by
something else further down the line.


Hm, yes, I didn't think about this.

Splitting ioport.c is one option, but even the port lists are really
made for ISA, so the whole file is really ISA related.

On the other hand, pci_ide_update_mode() isn't really a pure PCI
function, it's at the intersection of PCI and ISA. Can we just #ifdef it
out if ISA isn't built? Devices that don't support compatibility mode
should never try to call pci_ide_update_mode().


In terms of the QEMU modelling, the PCI IDE controllers are modelled as a 
PCIDevice rather than an ISADevice and that's why ide_init_ioport() doesn't 
really make sense in PCI IDE controllers. Currently its only PCIDevice user 
is hw/ide/piix.c and that passes ISADevice as NULL, because there is no 
underlying ISADevice.


The only ISADevice user is in hw/ide/isa.c so I think a better solution here 
would be to inline ide_init_ioport() into isa_ide_realizefn() and then add a 
separate function for PCI IDE controllers which is what I've attempted to do 
here.


How about moving ide_portio_list[] and ide_portio_list2[] to hw/ide/core.c 
instead? The definitions in include/hw/ide/internal.h already have a 
dependency on PortioList so there should be no issue, and it allows them to 
be shared between both PCI and ISA devices.


That's where these came from in commit 83d14054f9555 and the reason was to 
get rid of the ISA dependency for machines that don't need it. Would it be 
possible to make a function that only registers the portio stuff (e.g. 
ide_register_ports) that's not dependent on either ISADevice or 
PCIIDEState but takes an IDEBus? That could be used by both ide-isa and 
PCI devices. This would just do the portio stuff for a single bus and you 
could call it twice from your pci_ide_update_mode function, then the 
portio_list arrays can remain static to core.c. That seems better than 
duplicating code and exporting these arrays.



+void pci_ide_update_mode(PCIIDEState *s)
+{
+PCIDevice *d = PCI_DEVICE(s);
+uint8_t mode = d->config[PCI_CLASS_PROG];
+
+switch (mode & 0xf) {
+case 0xa:
+/* Both channels legacy mode */


Why is it ok to handle only the case where both channels are set to the
same mode? The spec describes mixed-mode setups, too, and doesn't seem
to allow ignoring a mode change if it's only for one of the channels.


Certainly that can be done: only both channels were implemented initially

Re: [PATCH v4 02/33] hw/cpu: Call object_class_is_abstract() once in cpu_class_by_name()

2023-11-13 Thread Gavin Shan



On 11/7/23 00:40, Igor Mammedov wrote:

On Thu,  2 Nov 2023 10:24:29 +1000
Gavin Shan  wrote:


From: Philippe Mathieu-Daudé 

Let CPUClass::class_by_name() handlers return abstract classes,
and filter them once in the public cpu_class_by_name() method.

Signed-off-by: Philippe Mathieu-Daudé 
Reviewed-by: Gavin Shan 
---
  hw/core/cpu-common.c   | 8 +++-
  include/hw/core/cpu.h  | 7 ---
  target/alpha/cpu.c | 2 +-
  target/arm/cpu.c   | 3 +--
  target/avr/cpu.c   | 3 +--
  target/cris/cpu.c  | 3 +--
  target/hexagon/cpu.c   | 3 +--
  target/loongarch/cpu.c | 3 +--
  target/m68k/cpu.c  | 3 +--
  target/openrisc/cpu.c  | 3 +--
  target/riscv/cpu.c | 3 +--
  target/rx/cpu.c| 5 +
  target/sh4/cpu.c   | 3 ---
  target/tricore/cpu.c   | 3 +--
  target/xtensa/cpu.c| 3 +--
  15 files changed, 23 insertions(+), 32 deletions(-)

diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c
index bab8942c30..bca0323e9f 100644
--- a/hw/core/cpu-common.c
+++ b/hw/core/cpu-common.c
@@ -150,9 +150,15 @@ static bool cpu_common_has_work(CPUState *cs)
  ObjectClass *cpu_class_by_name(const char *typename, const char *cpu_model)
  {
  CPUClass *cc = CPU_CLASS(object_class_by_name(typename));
+ObjectClass *oc;
  
  assert(cpu_model && cc->class_by_name);

-return cc->class_by_name(cpu_model);
+oc = cc->class_by_name(cpu_model);
+if (oc && !object_class_is_abstract(oc)) {
+return oc;
+}
+
+return NULL;
  }
  
  static void cpu_common_parse_features(const char *typename, char *features,

diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index 18593db5b2..ee85aafdf5 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -102,7 +102,7 @@ struct SysemuCPUOps;
  /**
   * CPUClass:
   * @class_by_name: Callback to map -cpu command line model name to an
- * instantiatable CPU type.
+ * instantiatable CPU type.
   * @parse_features: Callback to parse command line arguments.
   * @reset_dump_flags: #CPUDumpFlags to use for reset logging.
   * @has_work: Callback for checking if there is work to do.
@@ -772,9 +772,10 @@ void cpu_reset(CPUState *cpu);
   * @typename: The CPU base type.
   * @cpu_model: The model string without any parameters.
   *
- * Looks up a CPU #ObjectClass matching name @cpu_model.
+ * Looks up a concrete CPU #ObjectClass matching name @cpu_model.
   *
- * Returns: A #CPUClass or %NULL if not matching class is found.
+ * Returns: A concrete #CPUClass or %NULL if no matching class is found
+ *  or if the matching class is abstract.
   */
  ObjectClass *cpu_class_by_name(const char *typename, const char *cpu_model);
  
diff --git a/target/alpha/cpu.c b/target/alpha/cpu.c

index c7ae4d6a41..9436859c7b 100644
--- a/target/alpha/cpu.c
+++ b/target/alpha/cpu.c
@@ -126,7 +126,7 @@ static ObjectClass *alpha_cpu_class_by_name(const char 
*cpu_model)
  int i;
  
  oc = object_class_by_name(cpu_model);

-if (oc != NULL && object_class_dynamic_cast(oc, TYPE_ALPHA_CPU) != NULL &&

I'd split 'oc != NULL &&' into a separate patch



Agree. It's a good idea, but this patch has been merged as:

3a9d0d7b64 hw/cpu: Call object_class_is_abstract() once in cpu_class_by_name()


+if (object_class_dynamic_cast(oc, TYPE_ALPHA_CPU) &&
  !object_class_is_abstract(oc)) {


stray abstract check leftover??



Nope, it's intentional. We will fall back to @alpha_cpu_aliases if the
CPU class corresponding to @cpu_model is abstract.


  return oc;
  }
diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index 954328d72a..8c622d6b59 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -2399,8 +2399,7 @@ static ObjectClass *arm_cpu_class_by_name(const char 
*cpu_model)
  oc = object_class_by_name(typename);
  g_strfreev(cpuname);
  g_free(typename);
-if (!oc || !object_class_dynamic_cast(oc, TYPE_ARM_CPU) ||
-object_class_is_abstract(oc)) {
+if (!object_class_dynamic_cast(oc, TYPE_ARM_CPU)) {
  return NULL;
  }
  return oc;
diff --git a/target/avr/cpu.c b/target/avr/cpu.c
index 14d8b9d1f0..113d522f75 100644
--- a/target/avr/cpu.c
+++ b/target/avr/cpu.c
@@ -157,8 +157,7 @@ static ObjectClass *avr_cpu_class_by_name(const char 
*cpu_model)
  ObjectClass *oc;
  
  oc = object_class_by_name(cpu_model);

-if (object_class_dynamic_cast(oc, TYPE_AVR_CPU) == NULL ||
-object_class_is_abstract(oc)) {
+if (!object_class_dynamic_cast(oc, TYPE_AVR_CPU)) {
  oc = NULL;
  }
  return oc;
diff --git a/target/cris/cpu.c b/target/cris/cpu.c
index be4a44c218..1cb431cd46 100644
--- a/target/cris/cpu.c
+++ b/target/cris/cpu.c
@@ -95,8 +95,7 @@ static ObjectClass *cris_cpu_class_by_name(const char 
*cpu_model)
  typename = g_strdup_printf(CRIS_CPU_TYPE_NAME("%s"), cpu_model);
  oc = object_class_by_name(typename);
  g_free(typename);
-if (oc != NULL && (!object_class_dynamic_cast(oc, TYPE_CRIS_CPU) ||
-   

Re: [PATCH v4 01/33] target/alpha: Tidy up alpha_cpu_class_by_name()

2023-11-13 Thread Gavin Shan

On 11/7/23 00:22, Igor Mammedov wrote:

On Thu,  2 Nov 2023 10:24:28 +1000
Gavin Shan  wrote:


From: Philippe Mathieu-Daudé 

For target/alpha, the default CPU model name is "ev67". The default
CPU model is used when no matching CPU model is found. The conditions
to fall back to the default CPU model can be combined so that the code
looks a bit simplified.


default cpu should be specified by board not by target internals.



Yes, MachineClass::default_cpu_type is used to specify the default CPU type.
I will improve the changelog in the next revision to avoid the confusion.


Signed-off-by: Philippe Mathieu-Daudé 
Reviewed-by: Gavin Shan 
---
  target/alpha/cpu.c | 7 ++-
  1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/target/alpha/cpu.c b/target/alpha/cpu.c
index 51b7d8d1bf..c7ae4d6a41 100644
--- a/target/alpha/cpu.c
+++ b/target/alpha/cpu.c
@@ -142,13 +142,10 @@ static ObjectClass *alpha_cpu_class_by_name(const char 
*cpu_model)
  typename = g_strdup_printf(ALPHA_CPU_TYPE_NAME("%s"), cpu_model);
  oc = object_class_by_name(typename);
  g_free(typename);
-if (oc != NULL && object_class_is_abstract(oc)) {
-oc = NULL;
-}
  
  /* TODO: remove match everything nonsense */


Let's do ^ instead of just shifting code around.
It will break users that specify junk as input, but that's clearly a
user error, so garbage in => error out.



Ok. The whole chunk of code to fall back to 'ev67' will be dropped
in next revision.




-/* Default to ev67; no reason not to emulate insns by default. */
-if (!oc) {
+if (!oc || object_class_is_abstract(oc)) {
+/* Default to ev67, no reason not to emulate insns by default */
  oc = object_class_by_name(ALPHA_CPU_TYPE_NAME("ev67"));
  }
  


Thanks,
Gavin




Re: [PATCH v2 13/17] hw/cxl: Add support for device sanitation

2023-11-13 Thread Hyeonggon Yoo
On Tue, Oct 24, 2023 at 1:14 AM Jonathan Cameron
 wrote:
>
> From: Davidlohr Bueso 
>
> Make use of the background operations through the sanitize command, per CXL
> 3.0 specs. Traditionally run times can be rather long, depending on the
> size of the media.
>
> Estimate times based on:
>  https://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf
>
> Signed-off-by: Davidlohr Bueso 
> Signed-off-by: Jonathan Cameron 
> ---
>  include/hw/cxl/cxl_device.h |  17 +
>  hw/cxl/cxl-mailbox-utils.c  | 140 
>  hw/mem/cxl_type3.c  |  10 +++
>  3 files changed, 167 insertions(+)
>
> diff --git a/include/hw/cxl/cxl_device.h b/include/hw/cxl/cxl_device.h
> index 2a813c..70aca9024c 100644
> --- a/include/hw/cxl/cxl_device.h
> +++ b/include/hw/cxl/cxl_device.h
> @@ -343,6 +343,23 @@ REG64(CXL_MEM_DEV_STS, 0)
>  FIELD(CXL_MEM_DEV_STS, MBOX_READY, 4, 1)
>  FIELD(CXL_MEM_DEV_STS, RESET_NEEDED, 5, 3)
>
> +static inline void __toggle_media(CXLDeviceState *cxl_dstate, int val)
> +{
> +uint64_t dev_status_reg;
> +
> +dev_status_reg = FIELD_DP64(0, CXL_MEM_DEV_STS, MEDIA_STATUS, val);
> +cxl_dstate->mbox_reg_state64[R_CXL_MEM_DEV_STS] = dev_status_reg;
> +}
> +#define cxl_dev_disable_media(cxlds)\
> +do { __toggle_media((cxlds), 0x3); } while (0)
> +#define cxl_dev_enable_media(cxlds) \
> +do { __toggle_media((cxlds), 0x1); } while (0)

Before this patch, it was assumed that "Media Status" and "Mailbox
Interface Ready" were always 1,
so mdev_reg_read() always returned 1 for both of them regardless of
the register values.

I think changes like the ones below are needed now that this assumption is
broken? Please note that it's only build-tested :)

Thanks,
Hyeonggon

diff --git a/hw/cxl/cxl-device-utils.c b/hw/cxl/cxl-device-utils.c
index 61a3c4dc2e..b6ada2fd6a 100644
--- a/hw/cxl/cxl-device-utils.c
+++ b/hw/cxl/cxl-device-utils.c
@@ -229,12 +229,9 @@ static void mailbox_reg_write(void *opaque,
hwaddr offset, uint64_t value,

 static uint64_t mdev_reg_read(void *opaque, hwaddr offset, unsigned size)
 {
-uint64_t retval = 0;
-
-retval = FIELD_DP64(retval, CXL_MEM_DEV_STS, MEDIA_STATUS, 1);
-retval = FIELD_DP64(retval, CXL_MEM_DEV_STS, MBOX_READY, 1);
+CXLDeviceState *cxl_dstate = opaque;

-return retval;
+return cxl_dstate->mbox_reg_state64[R_CXL_MEM_DEV_STS];
 }

 static void ro_reg_write(void *opaque, hwaddr offset, uint64_t value,
@@ -371,7 +368,13 @@ static void
mailbox_reg_init_common(CXLDeviceState *cxl_dstate)
 cxl_dstate->mbox_msi_n = msi_n;
 }

-static void memdev_reg_init_common(CXLDeviceState *cxl_dstate) { }
+static void memdev_reg_init_common(CXLDeviceState *cxl_dstate) {
+uint64_t memdev_status_reg;
+
+memdev_status_reg = FIELD_DP64(0, CXL_MEM_DEV_STS, MEDIA_STATUS, 1);
+memdev_status_reg = FIELD_DP64(memdev_status_reg,
CXL_MEM_DEV_STS, MBOX_READY, 1);
+cxl_dstate->mbox_reg_state64[R_CXL_MEM_DEV_STS] = memdev_status_reg;
+}

 void cxl_device_register_init_t3(CXLType3Dev *ct3d)
 {
diff --git a/include/hw/cxl/cxl_device.h b/include/hw/cxl/cxl_device.h
index 61b7f897f7..61f8f83ddf 100644
--- a/include/hw/cxl/cxl_device.h
+++ b/include/hw/cxl/cxl_device.h
@@ -351,9 +351,9 @@ REG64(CXL_MEM_DEV_STS, 0)

 static inline void __toggle_media(CXLDeviceState *cxl_dstate, int val)
 {
-uint64_t dev_status_reg;
+uint64_t dev_status_reg = cxl_dstate->mbox_reg_state64[R_CXL_MEM_DEV_STS];

-dev_status_reg = FIELD_DP64(0, CXL_MEM_DEV_STS, MEDIA_STATUS, val);
+dev_status_reg = FIELD_DP64(dev_status_reg, CXL_MEM_DEV_STS,
MEDIA_STATUS, val);
 cxl_dstate->mbox_reg_state64[R_CXL_MEM_DEV_STS] = dev_status_reg;
 }
 #define cxl_dev_disable_media(cxlds)\



[PATCH v2] test/qtest: Add API functions to capture IRQ toggling

2023-11-13 Thread Gustavo Romero
Currently, the QTest API does not provide a function to capture when an
IRQ line is raised or lowered, although the QTest Protocol already
reports such IRQ transitions. As a consequence, it is also not possible
to capture when an IRQ line is toggled. Functions like qtest_get_irq()
only read the current state of the intercepted IRQ lines, which is
already high (or low) when the function is called if the IRQ line is
toggled. Therefore, these functions miss the IRQ line state transitions.

This commit introduces two new API functions:
qtest_get_irq_raised_counter() and qtest_get_irq_lowered_counter().
These functions allow capturing the number of times an observed IRQ line
transitioned from low to high state or from high to low state,
respectively.

When used together, these new API functions then allow checking if one
or more pulses were generated (indicating if the IRQ line was toggled).

Signed-off-by: Gustavo Romero 
---
 tests/qtest/libqtest.c | 24 
 tests/qtest/libqtest.h | 28 
 2 files changed, 52 insertions(+)

diff --git a/tests/qtest/libqtest.c b/tests/qtest/libqtest.c
index f33a210861..6ada4cae6e 100644
--- a/tests/qtest/libqtest.c
+++ b/tests/qtest/libqtest.c
@@ -82,6 +82,8 @@ struct QTestState
 int expected_status;
 bool big_endian;
 bool irq_level[MAX_IRQ];
+uint64_t irq_raised_counter[MAX_IRQ];
+uint64_t irq_lowered_counter[MAX_IRQ];
 GString *rx;
 QTestTransportOps ops;
 GList *pending_events;
@@ -498,6 +500,8 @@ static QTestState *qtest_init_internal(const char *qemu_bin,
 s->rx = g_string_new("");
 for (i = 0; i < MAX_IRQ; i++) {
 s->irq_level[i] = false;
+s->irq_raised_counter[i] = 0;
+s->irq_lowered_counter[i] = 0;
 }
 
 /*
@@ -689,8 +693,10 @@ redo:
 g_assert_cmpint(irq, <, MAX_IRQ);
 
 if (strcmp(words[1], "raise") == 0) {
+s->irq_raised_counter[irq]++;
 s->irq_level[irq] = true;
 } else {
+s->irq_lowered_counter[irq]++;
 s->irq_level[irq] = false;
 }
 
@@ -980,6 +986,22 @@ bool qtest_get_irq(QTestState *s, int num)
 return s->irq_level[num];
 }
 
+uint64_t qtest_get_irq_raised_counter(QTestState *s, int num)
+{
+/* dummy operation in order to make sure irq is up to date */
+qtest_inb(s, 0);
+
+return s->irq_raised_counter[num];
+}
+
+uint64_t qtest_get_irq_lowered_counter(QTestState *s, int num)
+{
+/* dummy operation in order to make sure irq is up to date */
+qtest_inb(s, 0);
+
+return s->irq_lowered_counter[num];
+}
+
 void qtest_module_load(QTestState *s, const char *prefix, const char *libname)
 {
 qtest_sendf(s, "module_load %s %s\n", prefix, libname);
@@ -1799,6 +1821,8 @@ QTestState *qtest_inproc_init(QTestState **s, bool log, 
const char* arch,
 qts->wstatus = 0;
 for (int i = 0; i < MAX_IRQ; i++) {
 qts->irq_level[i] = false;
+qts->irq_raised_counter[i] = 0;
+qts->irq_lowered_counter[i] = 0;
 }
 
 qtest_client_set_rx_handler(qts, qtest_client_inproc_recv_line);
diff --git a/tests/qtest/libqtest.h b/tests/qtest/libqtest.h
index 6e3d3525bf..a2a16914dc 100644
--- a/tests/qtest/libqtest.h
+++ b/tests/qtest/libqtest.h
@@ -364,6 +364,34 @@ void qtest_module_load(QTestState *s, const char *prefix, 
const char *libname);
  */
 bool qtest_get_irq(QTestState *s, int num);
 
+/**
+ * qtest_get_irq_raised_counter:
+ * @s: #QTestState instance to operate on.
+ * @num: Interrupt to observe.
+ *
+ * This function can be used in conjunction with the
+ * qtest_get_irq_lowered_counter() to check if one or more pulses were
+ * generated on the observed interrupt.
+ *
+ * Returns: The number of times IRQ @num was raised, i.e., transitioned from
+ * a low state (false) to a high state (true).
+ */
+uint64_t qtest_get_irq_raised_counter(QTestState *s, int num);
+
+/**
+ * qtest_get_irq_lowered_counter:
+ * @s: #QTestState instance to operate on.
+ * @num: Interrupt to observe.
+ *
+ * This function can be used in conjunction with the
+ * qtest_get_irq_raised_counter() to check if one or more pulses were
+ * generated on the observed interrupt.
+ *
+ * Returns: The number of times IRQ @num was lowered, i.e., transitioned from
+ * a high state (true) to a low state (false).
+ */
+uint64_t qtest_get_irq_lowered_counter(QTestState *s, int num);
+
 /**
  * qtest_irq_intercept_in:
  * @s: #QTestState instance to operate on.
-- 
2.34.1




Re: [RFC PATCH v2 1/4] migration/multifd: Stop setting p->ioc before connecting

2023-11-13 Thread Peter Xu
On Fri, Nov 10, 2023 at 05:02:38PM -0300, Fabiano Rosas wrote:
> This is being shadowed by the assignments at
> multifd_channel_connect() and multifd_tls_channel_connect().
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH] spelling: hw/audio/virtio-snd.c: initalize

2023-11-13 Thread Philippe Mathieu-Daudé

On 13/11/23 22:20, Michael Tokarev wrote:

Fixes: eb9ad377bb94 "virtio-sound: handle control messages and streams"
Signed-off-by: Michael Tokarev 
---
  hw/audio/virtio-snd.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)


Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH v2 2/2] s390x/pci: only limit DMA aperture if vfio DMA limit reported

2023-11-13 Thread Matthew Rosato
On 11/13/23 4:24 PM, Michael Tokarev wrote:
> 10.11.2023 20:51, Matthew Rosato wrote:
>> If the host kernel lacks vfio DMA limit reporting, do not attempt
>> to shrink the guest DMA aperture.
>>
>> Fixes: df202e3ff3 ("s390x/pci: shrink DMA aperture to be bound by vfio DMA 
>> limit")
>> Signed-off-by: Matthew Rosato 
> 
> Is this stable-8.1 material?
> 
> Thanks,
> 
> /mjt
> 

Yes, I believe it is (sorry, should have added CC stable)

If you have a host kernel that doesn't report the vfio DMA limit, the resulting 
PCI device will be rendered unusable in the s390x guest due to this bug.

Thanks,
Matt



[PATCH for-9.0 2/6] target/riscv/tcg: do not use "!generic" CPU checks

2023-11-13 Thread Daniel Henrique Barboza
Our current logic in the getters/setters of MISA and multi-letter extensions
works because we have only 2 CPU types, generic and vendor, and by using
"!generic" we're implying that we're talking about vendor CPUs. When adding
a third CPU type this logic will break, so let's handle it beforehand.

In set_misa_ext_cfg() and set_multi_ext_cfg(), check for "vendor" cpu instead
of "not generic". The "generic CPU" checks remaining are from
riscv_cpu_add_misa_properties() and cpu_add_multi_ext_prop() before
applying default values for the extensions.

This leaves us with:

- vendor CPUs will not allow extension enablement, all other CPUs will;

- generic CPUs will inherit default values for extensions, all others
  won't.

And now we can add a new, third CPU type, that will allow extensions to
be enabled and will not inherit defaults, without changing the existing
logic.

Signed-off-by: Daniel Henrique Barboza 
Reviewed-by: Andrew Jones 
Reviewed-by: Alistair Francis 
---
 target/riscv/tcg/tcg-cpu.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/target/riscv/tcg/tcg-cpu.c b/target/riscv/tcg/tcg-cpu.c
index 08adad304d..304211169e 100644
--- a/target/riscv/tcg/tcg-cpu.c
+++ b/target/riscv/tcg/tcg-cpu.c
@@ -654,6 +654,11 @@ static bool riscv_cpu_is_generic(Object *cpu_obj)
 return object_dynamic_cast(cpu_obj, TYPE_RISCV_DYNAMIC_CPU) != NULL;
 }
 
+static bool riscv_cpu_is_vendor(Object *cpu_obj)
+{
+return object_dynamic_cast(cpu_obj, TYPE_RISCV_VENDOR_CPU) != NULL;
+}
+
 /*
  * We'll get here via the following path:
  *
@@ -722,7 +727,7 @@ static void cpu_set_misa_ext_cfg(Object *obj, Visitor *v, 
const char *name,
 target_ulong misa_bit = misa_ext_cfg->misa_bit;
 RISCVCPU *cpu = RISCV_CPU(obj);
 CPURISCVState *env = &cpu->env;
-bool generic_cpu = riscv_cpu_is_generic(obj);
+bool vendor_cpu = riscv_cpu_is_vendor(obj);
 bool prev_val, value;
 
 if (!visit_type_bool(v, name, &value, errp)) {
@@ -736,7 +741,7 @@ static void cpu_set_misa_ext_cfg(Object *obj, Visitor *v, 
const char *name,
 }
 
 if (value) {
-if (!generic_cpu) {
+if (vendor_cpu) {
 g_autofree char *cpuname = riscv_cpu_get_name(cpu);
 error_setg(errp, "'%s' CPU does not allow enabling extensions",
cpuname);
@@ -841,7 +846,7 @@ static void cpu_set_multi_ext_cfg(Object *obj, Visitor *v, 
const char *name,
 {
 const RISCVCPUMultiExtConfig *multi_ext_cfg = opaque;
 RISCVCPU *cpu = RISCV_CPU(obj);
-bool generic_cpu = riscv_cpu_is_generic(obj);
+bool vendor_cpu = riscv_cpu_is_vendor(obj);
 bool prev_val, value;
 
 if (!visit_type_bool(v, name, &value, errp)) {
@@ -865,7 +870,7 @@ static void cpu_set_multi_ext_cfg(Object *obj, Visitor *v, 
const char *name,
 return;
 }
 
-if (value && !generic_cpu) {
+if (value && vendor_cpu) {
 g_autofree char *cpuname = riscv_cpu_get_name(cpu);
 error_setg(errp, "'%s' CPU does not allow enabling extensions",
cpuname);
-- 
2.41.0




[PATCH for-9.0 3/6] target/riscv/tcg: update priv_ver on user_set extensions

2023-11-13 Thread Daniel Henrique Barboza
We'll add a new bare CPU type that won't have any default priv_ver. This
means that the CPU will default to priv_ver = 0, i.e. 1.10.0.

At the same time we'll allow these CPUs to enable extensions at will, but
then, if the extension has a priv_ver newer than 1.10, we'll end up
disabling it. Users will then need to manually set priv_ver to something
other than 1.10 to enable the extensions they want, which is not ideal.

Change the setter() of extensions to allow user-enabled extensions to
bump the priv_ver of the CPU. This will make it convenient for users to
enable extensions for CPUs that don't set a default priv_ver.

This change does not affect any existing CPU: vendor CPUs do not allow
extensions to be enabled, and generic CPUs are already set to priv_ver
LATEST.

Signed-off-by: Daniel Henrique Barboza 
Reviewed-by: Andrew Jones 
---
 target/riscv/tcg/tcg-cpu.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/target/riscv/tcg/tcg-cpu.c b/target/riscv/tcg/tcg-cpu.c
index 304211169e..c63b2adb5b 100644
--- a/target/riscv/tcg/tcg-cpu.c
+++ b/target/riscv/tcg/tcg-cpu.c
@@ -114,6 +114,26 @@ static int cpu_cfg_ext_get_min_version(uint32_t ext_offset)
 g_assert_not_reached();
 }
 
+static void cpu_validate_multi_ext_priv_ver(CPURISCVState *env,
+uint32_t ext_offset)
+{
+int ext_priv_ver;
+
+if (env->priv_ver == PRIV_VERSION_LATEST) {
+return;
+}
+
+ext_priv_ver = cpu_cfg_ext_get_min_version(ext_offset);
+
+if (env->priv_ver < ext_priv_ver) {
+/*
+ * Note: the 'priv_spec' command line option, if present,
+ * will take precedence over this priv_ver bump.
+ */
+env->priv_ver = ext_priv_ver;
+}
+}
+
 static void cpu_cfg_ext_auto_update(RISCVCPU *cpu, uint32_t ext_offset,
 bool value)
 {
@@ -748,6 +768,14 @@ static void cpu_set_misa_ext_cfg(Object *obj, Visitor *v, 
const char *name,
 return;
 }
 
+if (misa_bit == RVH && env->priv_ver < PRIV_VERSION_1_12_0) {
+/*
+ * Note: the 'priv_spec' command line option, if present,
+ * will take precedence over this priv_ver bump.
+ */
+env->priv_ver = PRIV_VERSION_1_12_0;
+}
+
 env->misa_ext |= misa_bit;
 env->misa_ext_mask |= misa_bit;
 } else {
@@ -877,6 +905,10 @@ static void cpu_set_multi_ext_cfg(Object *obj, Visitor *v, 
const char *name,
 return;
 }
 
+if (value) {
+cpu_validate_multi_ext_priv_ver(&cpu->env, multi_ext_cfg->offset);
+}
+
 isa_ext_update_enabled(cpu, multi_ext_cfg->offset, value);
 }
 
-- 
2.41.0




[PATCH for-9.0 5/6] target/riscv: add rv32i CPU

2023-11-13 Thread Daniel Henrique Barboza
Add a bare bones 32 bit CPU, like we already did with rv64i, to ease the
pain of users trying to build a CPU from scratch and having to disable
the defaults we have with the regular rv32 CPU.

See:

https://lore.kernel.org/qemu-riscv/258be47f-97be-4308-bed5-dc34ef7ff954@Spark/

For a use case where the existence of rv32i would make things simpler.

Signed-off-by: Daniel Henrique Barboza 
---
 target/riscv/cpu-qom.h |  1 +
 target/riscv/cpu.c | 23 +++
 2 files changed, 24 insertions(+)

diff --git a/target/riscv/cpu-qom.h b/target/riscv/cpu-qom.h
index 4d1aa54311..f345c17e69 100644
--- a/target/riscv/cpu-qom.h
+++ b/target/riscv/cpu-qom.h
@@ -34,6 +34,7 @@
 #define TYPE_RISCV_CPU_BASE32   RISCV_CPU_TYPE_NAME("rv32")
 #define TYPE_RISCV_CPU_BASE64   RISCV_CPU_TYPE_NAME("rv64")
 #define TYPE_RISCV_CPU_BASE128  RISCV_CPU_TYPE_NAME("x-rv128")
+#define TYPE_RISCV_CPU_RV32IRISCV_CPU_TYPE_NAME("rv32i")
 #define TYPE_RISCV_CPU_RV64IRISCV_CPU_TYPE_NAME("rv64i")
 #define TYPE_RISCV_CPU_IBEX RISCV_CPU_TYPE_NAME("lowrisc-ibex")
 #define TYPE_RISCV_CPU_SHAKTI_C RISCV_CPU_TYPE_NAME("shakti-c")
diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index a52bf1e33c..55cf114b61 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -666,6 +666,28 @@ static void rv32_imafcu_nommu_cpu_init(Object *obj)
 cpu->cfg.ext_zicsr = true;
 cpu->cfg.pmp = true;
 }
+
+static void rv32i_bare_cpu_init(Object *obj)
+{
+CPURISCVState *env = &RISCV_CPU(obj)->env;
+riscv_cpu_set_misa(env, MXL_RV32, RVI);
+
+/* Remove the defaults from the parent class */
+RISCV_CPU(obj)->cfg.ext_zicntr = false;
+RISCV_CPU(obj)->cfg.ext_zihpm = false;
+
+/* Set to QEMU's first supported priv version */
+env->priv_ver = PRIV_VERSION_1_10_0;
+
+/*
+ * Support all available satp_mode settings. The default
+ * value will be set to MBARE if the user doesn't set
+ * satp_mode manually (see set_satp_mode_default()).
+ */
+#ifndef CONFIG_USER_ONLY
+set_satp_mode_max_supported(RISCV_CPU(obj), VM_1_10_SV32);
+#endif
+}
 #endif
 
 static ObjectClass *riscv_cpu_class_by_name(const char *cpu_model)
@@ -1860,6 +1882,7 @@ static const TypeInfo riscv_cpu_type_infos[] = {
 DEFINE_VENDOR_CPU(TYPE_RISCV_CPU_SIFIVE_E31,  rv32_sifive_e_cpu_init),
 DEFINE_VENDOR_CPU(TYPE_RISCV_CPU_SIFIVE_E34,  rv32_imafcu_nommu_cpu_init),
 DEFINE_VENDOR_CPU(TYPE_RISCV_CPU_SIFIVE_U34,  rv32_sifive_u_cpu_init),
+DEFINE_BARE_CPU(TYPE_RISCV_CPU_RV32I, rv32i_bare_cpu_init),
 #elif defined(TARGET_RISCV64)
 DEFINE_DYNAMIC_CPU(TYPE_RISCV_CPU_BASE64,   rv64_base_cpu_init),
 DEFINE_VENDOR_CPU(TYPE_RISCV_CPU_SIFIVE_E51,  rv64_sifive_e_cpu_init),
-- 
2.41.0




[PATCH for-9.0 6/6] target/riscv: add rv32e/rv64e CPUs

2023-11-13 Thread Daniel Henrique Barboza
In our internals we'll never allow RVI and RVE to be enabled at the same
time, and we require either RVI or RVE to be enabled to proceed with
machine boot. All the CPUs we have enable RVI by default.

This means that if one wants to create an embedded CPU, they'll need to
disable RVI first and then enable RVE, e.g.:

-cpu rv64i,i=false,e=true

Let's add two RVE CPUs to ease the burden when working with embedded
CPUs in QEMU.
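The RVI/RVE exclusivity rule described above can be sketched as a small
stand-alone model. The `RV()` macro mirrors QEMU's misa helper, but the
validation function is a hypothetical stand-in for the real config checks,
not QEMU code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors QEMU's RV() macro: one misa bit per single-letter extension */
#define RV(x) ((uint32_t)1 << ((x) - 'A'))

#define RVI RV('I')
#define RVE RV('E')

/*
 * Hypothetical stand-in for the validation described above: a CPU must
 * enable exactly one of RVI (full base) or RVE (embedded base).
 */
static bool misa_base_is_valid(uint32_t misa)
{
    bool has_i = (misa & RVI) != 0;
    bool has_e = (misa & RVE) != 0;

    return has_i != has_e; /* one or the other, never both or neither */
}
```

With this model, a plain rv64i or rv64e misa value passes, while enabling
both bases (or neither) is rejected, matching the boot requirement above.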

Signed-off-by: Daniel Henrique Barboza 
---
 target/riscv/cpu-qom.h |  2 ++
 target/riscv/cpu.c | 46 ++
 2 files changed, 48 insertions(+)

diff --git a/target/riscv/cpu-qom.h b/target/riscv/cpu-qom.h
index f345c17e69..34d1034cfc 100644
--- a/target/riscv/cpu-qom.h
+++ b/target/riscv/cpu-qom.h
@@ -36,6 +36,8 @@
 #define TYPE_RISCV_CPU_BASE128  RISCV_CPU_TYPE_NAME("x-rv128")
 #define TYPE_RISCV_CPU_RV32IRISCV_CPU_TYPE_NAME("rv32i")
 #define TYPE_RISCV_CPU_RV64IRISCV_CPU_TYPE_NAME("rv64i")
+#define TYPE_RISCV_CPU_RV32ERISCV_CPU_TYPE_NAME("rv32e")
+#define TYPE_RISCV_CPU_RV64ERISCV_CPU_TYPE_NAME("rv64e")
 #define TYPE_RISCV_CPU_IBEX RISCV_CPU_TYPE_NAME("lowrisc-ibex")
 #define TYPE_RISCV_CPU_SHAKTI_C RISCV_CPU_TYPE_NAME("shakti-c")
 #define TYPE_RISCV_CPU_SIFIVE_E31   RISCV_CPU_TYPE_NAME("sifive-e31")
diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index 55cf114b61..7d5ff7a0aa 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -585,6 +585,28 @@ static void rv64i_bare_cpu_init(Object *obj)
 set_satp_mode_max_supported(RISCV_CPU(obj), VM_1_10_SV64);
 #endif
 }
+
+static void rv64e_bare_cpu_init(Object *obj)
+{
+CPURISCVState *env = &RISCV_CPU(obj)->env;
+riscv_cpu_set_misa(env, MXL_RV64, RVE);
+
+/* Remove the defaults from the parent class */
+RISCV_CPU(obj)->cfg.ext_zicntr = false;
+RISCV_CPU(obj)->cfg.ext_zihpm = false;
+
+/* Set to QEMU's first supported priv version */
+env->priv_ver = PRIV_VERSION_1_10_0;
+
+/*
+ * Support all available satp_mode settings. The default
+ * value will be set to MBARE if the user doesn't set
+ * satp_mode manually (see set_satp_mode_default()).
+ */
+#ifndef CONFIG_USER_ONLY
+set_satp_mode_max_supported(RISCV_CPU(obj), VM_1_10_SV64);
+#endif
+}
 #else
 static void rv32_base_cpu_init(Object *obj)
 {
@@ -688,6 +710,28 @@ static void rv32i_bare_cpu_init(Object *obj)
 set_satp_mode_max_supported(RISCV_CPU(obj), VM_1_10_SV32);
 #endif
 }
+
+static void rv32e_bare_cpu_init(Object *obj)
+{
+CPURISCVState *env = &RISCV_CPU(obj)->env;
+riscv_cpu_set_misa(env, MXL_RV32, RVE);
+
+/* Remove the defaults from the parent class */
+RISCV_CPU(obj)->cfg.ext_zicntr = false;
+RISCV_CPU(obj)->cfg.ext_zihpm = false;
+
+/* Set to QEMU's first supported priv version */
+env->priv_ver = PRIV_VERSION_1_10_0;
+
+/*
+ * Support all available satp_mode settings. The default
+ * value will be set to MBARE if the user doesn't set
+ * satp_mode manually (see set_satp_mode_default()).
+ */
+#ifndef CONFIG_USER_ONLY
+set_satp_mode_max_supported(RISCV_CPU(obj), VM_1_10_SV32);
+#endif
+}
 #endif
 
 static ObjectClass *riscv_cpu_class_by_name(const char *cpu_model)
@@ -1883,6 +1927,7 @@ static const TypeInfo riscv_cpu_type_infos[] = {
 DEFINE_VENDOR_CPU(TYPE_RISCV_CPU_SIFIVE_E34,  rv32_imafcu_nommu_cpu_init),
 DEFINE_VENDOR_CPU(TYPE_RISCV_CPU_SIFIVE_U34,  rv32_sifive_u_cpu_init),
 DEFINE_BARE_CPU(TYPE_RISCV_CPU_RV32I, rv32i_bare_cpu_init),
+DEFINE_BARE_CPU(TYPE_RISCV_CPU_RV32E, rv32e_bare_cpu_init),
 #elif defined(TARGET_RISCV64)
 DEFINE_DYNAMIC_CPU(TYPE_RISCV_CPU_BASE64,   rv64_base_cpu_init),
 DEFINE_VENDOR_CPU(TYPE_RISCV_CPU_SIFIVE_E51,  rv64_sifive_e_cpu_init),
@@ -1892,6 +1937,7 @@ static const TypeInfo riscv_cpu_type_infos[] = {
 DEFINE_VENDOR_CPU(TYPE_RISCV_CPU_VEYRON_V1,   rv64_veyron_v1_cpu_init),
 DEFINE_DYNAMIC_CPU(TYPE_RISCV_CPU_BASE128,  rv128_base_cpu_init),
 DEFINE_BARE_CPU(TYPE_RISCV_CPU_RV64I, rv64i_bare_cpu_init),
+DEFINE_BARE_CPU(TYPE_RISCV_CPU_RV64E, rv64e_bare_cpu_init),
 #endif
 };
 
-- 
2.41.0




[PATCH for-9.0 4/6] target/riscv: add rv64i CPU

2023-11-13 Thread Daniel Henrique Barboza
We don't have any form of a 'bare bones' CPU: rv64, our default CPU,
comes with a lot of defaults. This is fine for most regular uses but
it's not suitable when more control over what is actually loaded in the
CPU is required.

A bare-bones CPU would be annoying to deal with were it not for profile
support, a way to load a multitude of extensions with a single flag.
Profile support is going to be implemented shortly, so let's add a CPU
for it.

The new 'rv64i' CPU will have only RVI loaded. It is inspired by the
profile specification, which dictates, for RVA22U64 [1]:

"RVA22U64 Mandatory Base
 RV64I is the mandatory base ISA for RVA22U64"

And so it seems that RV64I is the mandatory base ISA for all profiles
listed in [1], making it an ideal CPU to use with profile support.

rv64i is a CPU of type TYPE_RISCV_BARE_CPU. It has a mix of features
from pre-existing CPUs:

- it allows extensions to be enabled, like generic CPUs;
- it will not inherit extension defaults, like vendor CPUs.

This is the minimum extension set to boot OpenSBI and buildroot using
rv64i:

./build/qemu-system-riscv64 -nographic -M virt \
-cpu rv64i,sv39=true,g=true,c=true,s=true,u=true

Our minimal riscv,isa in this case will be:

 # cat /proc/device-tree/cpus/cpu@0/riscv,isa
rv64imafdc_zicntr_zicsr_zifencei_zihpm_zca_zcd#
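As a rough illustration of where that string comes from: the single-letter
part of riscv,isa is just the enabled misa bits printed in canonical order.
The sketch below is a simplified model (the letter ordering is approximate,
and QEMU's real riscv_isa_string() also appends the multi-letter extensions
such as _zicntr), not the actual implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define RV(x) ((uint32_t)1 << ((x) - 'A'))

/*
 * Simplified model: emit "rv64" plus one lowercase letter per enabled
 * misa bit, in (approximate) canonical order. The real code then appends
 * the multi-letter Z-extensions separated by underscores.
 */
static void misa_to_isa_string(uint32_t misa, char *buf, size_t len)
{
    static const char order[] = "iemafdqcbjtpvh";
    size_t pos = (size_t)snprintf(buf, len, "rv64");

    for (const char *p = order; *p != '\0'; p++) {
        if ((misa & RV(*p - 'a' + 'A')) && pos + 1 < len) {
            buf[pos++] = *p;
        }
    }
    buf[pos] = '\0';
}
```

Feeding it the IMAFDC bits from the boot example above yields the
"rv64imafdc" prefix of the riscv,isa string shown.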

[1] https://github.com/riscv/riscv-profiles/blob/main/profiles.adoc

Signed-off-by: Daniel Henrique Barboza 
Reviewed-by: Andrew Jones 
---
 target/riscv/cpu-qom.h |  2 ++
 target/riscv/cpu.c | 46 ++
 2 files changed, 48 insertions(+)

diff --git a/target/riscv/cpu-qom.h b/target/riscv/cpu-qom.h
index ca7dd509e3..4d1aa54311 100644
--- a/target/riscv/cpu-qom.h
+++ b/target/riscv/cpu-qom.h
@@ -24,6 +24,7 @@
 #define TYPE_RISCV_CPU "riscv-cpu"
 #define TYPE_RISCV_DYNAMIC_CPU "riscv-dynamic-cpu"
 #define TYPE_RISCV_VENDOR_CPU "riscv-vendor-cpu"
+#define TYPE_RISCV_BARE_CPU "riscv-bare-cpu"
 
 #define RISCV_CPU_TYPE_SUFFIX "-" TYPE_RISCV_CPU
 #define RISCV_CPU_TYPE_NAME(name) (name RISCV_CPU_TYPE_SUFFIX)
@@ -33,6 +34,7 @@
 #define TYPE_RISCV_CPU_BASE32   RISCV_CPU_TYPE_NAME("rv32")
 #define TYPE_RISCV_CPU_BASE64   RISCV_CPU_TYPE_NAME("rv64")
 #define TYPE_RISCV_CPU_BASE128  RISCV_CPU_TYPE_NAME("x-rv128")
+#define TYPE_RISCV_CPU_RV64IRISCV_CPU_TYPE_NAME("rv64i")
 #define TYPE_RISCV_CPU_IBEX RISCV_CPU_TYPE_NAME("lowrisc-ibex")
 #define TYPE_RISCV_CPU_SHAKTI_C RISCV_CPU_TYPE_NAME("shakti-c")
 #define TYPE_RISCV_CPU_SIFIVE_E31   RISCV_CPU_TYPE_NAME("sifive-e31")
diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index 220113408e..a52bf1e33c 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -370,6 +370,17 @@ static void set_satp_mode_max_supported(RISCVCPU *cpu,
 /* Set the satp mode to the max supported */
 static void set_satp_mode_default_map(RISCVCPU *cpu)
 {
+/*
+ * Bare CPUs do not default to the max available.
+ * Users must set a valid satp_mode in the command
+ * line.
+ */
+if (object_dynamic_cast(OBJECT(cpu), TYPE_RISCV_BARE_CPU) != NULL) {
+warn_report("No satp mode set. Defaulting to 'bare'");
+cpu->cfg.satp_mode.map = (1 << VM_1_10_MBARE);
+return;
+}
+
 cpu->cfg.satp_mode.map = cpu->cfg.satp_mode.supported;
 }
 #endif
@@ -552,6 +563,28 @@ static void rv128_base_cpu_init(Object *obj)
 set_satp_mode_max_supported(RISCV_CPU(obj), VM_1_10_SV57);
 #endif
 }
+
+static void rv64i_bare_cpu_init(Object *obj)
+{
+CPURISCVState *env = &RISCV_CPU(obj)->env;
+riscv_cpu_set_misa(env, MXL_RV64, RVI);
+
+/* Remove the defaults from the parent class */
+RISCV_CPU(obj)->cfg.ext_zicntr = false;
+RISCV_CPU(obj)->cfg.ext_zihpm = false;
+
+/* Set to QEMU's first supported priv version */
+env->priv_ver = PRIV_VERSION_1_10_0;
+
+/*
+ * Support all available satp_mode settings. The default
+ * value will be set to MBARE if the user doesn't set
+ * satp_mode manually (see set_satp_mode_default()).
+ */
+#ifndef CONFIG_USER_ONLY
+set_satp_mode_max_supported(RISCV_CPU(obj), VM_1_10_SV64);
+#endif
+}
 #else
 static void rv32_base_cpu_init(Object *obj)
 {
@@ -1785,6 +1818,13 @@ void riscv_cpu_list(void)
 .instance_init = initfn  \
 }
 
+#define DEFINE_BARE_CPU(type_name, initfn) \
+{  \
+.name = type_name, \
+.parent = TYPE_RISCV_BARE_CPU, \
+.instance_init = initfn\
+}
+
 static const TypeInfo riscv_cpu_type_infos[] = {
 {
 .name = TYPE_RISCV_CPU,
@@ -1807,6 +1847,11 @@ static const TypeInfo riscv_cpu_type_infos[] = {
 .parent = TYPE_RISCV_CPU,
 .abstract = true,
 },
+{
+.name = TYPE_RISCV_BARE_CPU,
+.parent = TYPE_RISCV_CPU,
+.abstract = true,
+},
 DEFINE_DYNAMIC_CPU(TYPE_RISCV_CPU_ANY,  riscv_any_cpu_init),
 DEFINE_DYNAMI

[PATCH for-9.0 0/6] riscv: rv32i,rv32e,rv64i and rv64e CPUs

2023-11-13 Thread Daniel Henrique Barboza
Hi,

This series adds canonical/bare-bones RISC-V CPUs to QEMU. The idea is
to allow users to create a CPU from scratch without having to deal with,
or disable, existing defaults.

A bare-bones CPU will avoid scenarios like the one described here:

https://lore.kernel.org/qemu-riscv/258be47f-97be-4308-bed5-dc34ef7ff954@Spark/

Where one has to disable a bunch of defaults from the rv32 CPU to be
able to use the desired configuration. After this series, the case from
the link above:

-cpu rv32,g=false,f=false,v=false,d=false,e=false,h=false,(... desired
setup)

Will be expressed as:

-cpu rv32i,(... desired setup)

Note that the idea isn't new. The rv64i CPU was already presented in the
rva22u64 profile series [1]. That series didn't make it for 8.2, so I'm
picking patches 1-4 (already reviewed and acked) and re-posting for this
work. In case this series is accepted first I'll rebase and re-send the
profile series.

I'm also adding RVE CPUs, rv32e and rv64e. The reason is that we can't
enable I and E at the same time, and all default CPUs have I by default,
so we would need to do something like 'rv32i,i=false,e=true' to have a
base RVE 32-bit CPU.

[1] 
https://lore.kernel.org/qemu-riscv/20231103134629.561732-1-dbarb...@ventanamicro.com/

Daniel Henrique Barboza (6):
  target/riscv: create TYPE_RISCV_VENDOR_CPU
  target/riscv/tcg: do not use "!generic" CPU checks
  target/riscv/tcg: update priv_ver on user_set extensions
  target/riscv: add rv64i CPU
  target/riscv: add rv32i CPU
  target/riscv: add rv32e/rv64e CPUs

 target/riscv/cpu-qom.h |   6 ++
 target/riscv/cpu.c | 145 ++---
 target/riscv/tcg/tcg-cpu.c |  45 +++-
 3 files changed, 183 insertions(+), 13 deletions(-)

-- 
2.41.0




[PATCH for-9.0 1/6] target/riscv: create TYPE_RISCV_VENDOR_CPU

2023-11-13 Thread Daniel Henrique Barboza
We want to add a new CPU type for bare CPUs that will inherit specific
traits of the 2 existing types:

- it will allow for extensions to be enabled/disabled, like generic
  CPUs;

- it will NOT inherit defaults, like vendor CPUs.

We can meet these conditions by adding an explicit type for the
existing vendor CPUs and changing the existing logic to not imply that
"not generic" means vendor CPUs.

Let's add the "vendor" CPU type first.

Signed-off-by: Daniel Henrique Barboza 
Reviewed-by: Andrew Jones 
Reviewed-by: Alistair Francis 
---
 target/riscv/cpu-qom.h |  1 +
 target/riscv/cpu.c | 30 +-
 2 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/target/riscv/cpu-qom.h b/target/riscv/cpu-qom.h
index 91b3361dec..ca7dd509e3 100644
--- a/target/riscv/cpu-qom.h
+++ b/target/riscv/cpu-qom.h
@@ -23,6 +23,7 @@
 
 #define TYPE_RISCV_CPU "riscv-cpu"
 #define TYPE_RISCV_DYNAMIC_CPU "riscv-dynamic-cpu"
+#define TYPE_RISCV_VENDOR_CPU "riscv-vendor-cpu"
 
 #define RISCV_CPU_TYPE_SUFFIX "-" TYPE_RISCV_CPU
 #define RISCV_CPU_TYPE_NAME(name) (name RISCV_CPU_TYPE_SUFFIX)
diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index 83c7c0cf07..220113408e 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -1778,6 +1778,13 @@ void riscv_cpu_list(void)
 .instance_init = initfn   \
 }
 
+#define DEFINE_VENDOR_CPU(type_name, initfn) \
+{\
+.name = type_name,   \
+.parent = TYPE_RISCV_VENDOR_CPU, \
+.instance_init = initfn  \
+}
+
 static const TypeInfo riscv_cpu_type_infos[] = {
 {
 .name = TYPE_RISCV_CPU,
@@ -1795,21 +1802,26 @@ static const TypeInfo riscv_cpu_type_infos[] = {
 .parent = TYPE_RISCV_CPU,
 .abstract = true,
 },
+{
+.name = TYPE_RISCV_VENDOR_CPU,
+.parent = TYPE_RISCV_CPU,
+.abstract = true,
+},
 DEFINE_DYNAMIC_CPU(TYPE_RISCV_CPU_ANY,  riscv_any_cpu_init),
 DEFINE_DYNAMIC_CPU(TYPE_RISCV_CPU_MAX,  riscv_max_cpu_init),
 #if defined(TARGET_RISCV32)
 DEFINE_DYNAMIC_CPU(TYPE_RISCV_CPU_BASE32,   rv32_base_cpu_init),
-DEFINE_CPU(TYPE_RISCV_CPU_IBEX, rv32_ibex_cpu_init),
-DEFINE_CPU(TYPE_RISCV_CPU_SIFIVE_E31,   rv32_sifive_e_cpu_init),
-DEFINE_CPU(TYPE_RISCV_CPU_SIFIVE_E34,   rv32_imafcu_nommu_cpu_init),
-DEFINE_CPU(TYPE_RISCV_CPU_SIFIVE_U34,   rv32_sifive_u_cpu_init),
+DEFINE_VENDOR_CPU(TYPE_RISCV_CPU_IBEX,rv32_ibex_cpu_init),
+DEFINE_VENDOR_CPU(TYPE_RISCV_CPU_SIFIVE_E31,  rv32_sifive_e_cpu_init),
+DEFINE_VENDOR_CPU(TYPE_RISCV_CPU_SIFIVE_E34,  rv32_imafcu_nommu_cpu_init),
+DEFINE_VENDOR_CPU(TYPE_RISCV_CPU_SIFIVE_U34,  rv32_sifive_u_cpu_init),
 #elif defined(TARGET_RISCV64)
 DEFINE_DYNAMIC_CPU(TYPE_RISCV_CPU_BASE64,   rv64_base_cpu_init),
-DEFINE_CPU(TYPE_RISCV_CPU_SIFIVE_E51,   rv64_sifive_e_cpu_init),
-DEFINE_CPU(TYPE_RISCV_CPU_SIFIVE_U54,   rv64_sifive_u_cpu_init),
-DEFINE_CPU(TYPE_RISCV_CPU_SHAKTI_C, rv64_sifive_u_cpu_init),
-DEFINE_CPU(TYPE_RISCV_CPU_THEAD_C906,   rv64_thead_c906_cpu_init),
-DEFINE_CPU(TYPE_RISCV_CPU_VEYRON_V1,rv64_veyron_v1_cpu_init),
+DEFINE_VENDOR_CPU(TYPE_RISCV_CPU_SIFIVE_E51,  rv64_sifive_e_cpu_init),
+DEFINE_VENDOR_CPU(TYPE_RISCV_CPU_SIFIVE_U54,  rv64_sifive_u_cpu_init),
+DEFINE_VENDOR_CPU(TYPE_RISCV_CPU_SHAKTI_C,rv64_sifive_u_cpu_init),
+DEFINE_VENDOR_CPU(TYPE_RISCV_CPU_THEAD_C906,  rv64_thead_c906_cpu_init),
+DEFINE_VENDOR_CPU(TYPE_RISCV_CPU_VEYRON_V1,   rv64_veyron_v1_cpu_init),
 DEFINE_DYNAMIC_CPU(TYPE_RISCV_CPU_BASE128,  rv128_base_cpu_init),
 #endif
 };
-- 
2.41.0




Re: [PATCH] spelling: hw/audio/virtio-snd.c: initalize

2023-11-13 Thread Stefan Weil via

On 13.11.23 at 22:20, Michael Tokarev wrote:

Fixes: eb9ad377bb94 "virtio-sound: handle control messages and streams"
Signed-off-by: Michael Tokarev 
---
  hw/audio/virtio-snd.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/audio/virtio-snd.c b/hw/audio/virtio-snd.c
index a18a9949a7..2fe966e311 100644
--- a/hw/audio/virtio-snd.c
+++ b/hw/audio/virtio-snd.c
@@ -1126,7 +1126,7 @@ static void virtio_snd_realize(DeviceState *dev, Error 
**errp)
  status = virtio_snd_set_pcm_params(vsnd, i, &default_params);
  if (status != cpu_to_le32(VIRTIO_SND_S_OK)) {
  error_setg(errp,
-   "Can't initalize stream params, device responded with 
%s.",
+   "Can't initialize stream params, device responded with 
%s.",
 print_code(status));
  return;
  }


Reviewed-by: Stefan Weil 

Thanks,
Stefan



Re: [PATCH v2 1/3] ide/pci.c: introduce pci_ide_update_mode() function

2023-11-13 Thread Mark Cave-Ayland

On 07/11/2023 11:11, Kevin Wolf wrote:


On 06.11.2023 at 23:41, Mark Cave-Ayland wrote:

On 06/11/2023 14:12, Kevin Wolf wrote:

Hi Kevin,

Thanks for taking the time to review this. I'll reply inline below.


On 25.10.2023 at 00:40, Mark Cave-Ayland wrote:

This function reads the value of the PCI_CLASS_PROG register for PCI IDE
controllers and configures the PCI BARs and/or IDE ioports accordingly.

In the case where we switch to legacy mode, the PCI BARs are set to return zero
(as suggested in the "PCI IDE Controller" specification), the legacy IDE ioports
are enabled, and the PCI interrupt pin cleared to indicate legacy IRQ routing.

Conversely when we switch to native mode, the legacy IDE ioports are disabled
and the PCI interrupt pin set to indicate native IRQ routing. The contents of
the PCI BARs are unspecified, but this is not an issue since if a PCI IDE
controller has been switched to native mode then its BARs will need to be
programmed.

Signed-off-by: Mark Cave-Ayland 
Tested-by: BALATON Zoltan 
Tested-by: Bernhard Beschow 
---
   hw/ide/pci.c | 90 
   include/hw/ide/pci.h |  1 +
   2 files changed, 91 insertions(+)

diff --git a/hw/ide/pci.c b/hw/ide/pci.c
index a25b352537..5be643b460 100644
--- a/hw/ide/pci.c
+++ b/hw/ide/pci.c
@@ -104,6 +104,96 @@ const MemoryRegionOps pci_ide_data_le_ops = {
   .endianness = DEVICE_LITTLE_ENDIAN,
   };
+static const MemoryRegionPortio ide_portio_list[] = {
+{ 0, 8, 1, .read = ide_ioport_read, .write = ide_ioport_write },
+{ 0, 1, 2, .read = ide_data_readw, .write = ide_data_writew },
+{ 0, 1, 4, .read = ide_data_readl, .write = ide_data_writel },
+PORTIO_END_OF_LIST(),
+};
+
+static const MemoryRegionPortio ide_portio2_list[] = {
+{ 0, 1, 1, .read = ide_status_read, .write = ide_ctrl_write },
+PORTIO_END_OF_LIST(),
+};


This is duplicated from hw/ide/ioport.c. I think it would be better to
use the arrays already defined there, ideally by calling ioport.c
functions to setup and release the I/O ports.


The tricky part here is that hw/ide/ioport.c is defined for CONFIG_ISA, and
so if we did that then all PCI IDE controllers would become dependent upon
ISA too, regardless of whether they implement compatibility mode or not.
What do you think is the best solution here? Perhaps moving
ide_init_ioport() to a more ISA-specific place? I know that both Phil and
I have considered whether ide_init_ioport() should be replaced by
something else further down the line.


Hm, yes, I didn't think about this.

Splitting ioport.c is one option, but even the port lists are really
made for ISA, so the whole file is really ISA related.

On the other hand, pci_ide_update_mode() isn't really a pure PCI
function, it's at the intersection of PCI and ISA. Can we just #ifdef it
out if ISA isn't built? Devices that don't support compatibility mode
should never try to call pci_ide_update_mode().


In terms of the QEMU modelling, the PCI IDE controllers are modelled as a PCIDevice 
rather than an ISADevice and that's why ide_init_ioport() doesn't really make sense 
in PCI IDE controllers. Currently its only PCIDevice user is hw/ide/piix.c and that 
passes ISADevice as NULL, because there is no underlying ISADevice.


The only ISADevice user is in hw/ide/isa.c so I think a better solution here would be 
to inline ide_init_ioport() into isa_ide_realizefn() and then add a separate function 
for PCI IDE controllers which is what I've attempted to do here.


How about moving ide_portio_list[] and ide_portio_list2[] to hw/ide/core.c instead? 
The definitions in include/hw/ide/internal.h already have a dependency on PortioList 
so there should be no issue, and it allows them to be shared between both PCI and ISA 
devices.



+void pci_ide_update_mode(PCIIDEState *s)
+{
+PCIDevice *d = PCI_DEVICE(s);
+uint8_t mode = d->config[PCI_CLASS_PROG];
+
+switch (mode & 0xf) {
+case 0xa:
+/* Both channels legacy mode */


Why is it ok to handle only the case where both channels are set to the
same mode? The spec describes mixed-mode setups, too, and doesn't seem
to allow ignoring a mode change if it's only for one of the channels.


Certainly that can be done: only the both-channels case was implemented
initially because that was the test case immediately available using the
VIA. I can have a look at implementing the two channels separately in v2.


I don't think it would make the code more complicated, so it feels like
implementing it right away would be nice.

On the other hand, if you want to see this in 8.2, I'm happy to merge
this part as it is and then we can improve it on top.


I think this helps Zoltan boot AmigaOS on the new AmigaOne machine, and I am 
certainly planning more work in this area during the 9.0 cycle.



+
+/* Zero BARs */
+pci_set_long(d->config + PCI_BASE_ADDRESS_0, 0x0);
+pci_set_long(d->config + PCI_BASE_ADDRESS_1, 0x0);
+pci_set

Re: [PATCH v2 2/2] s390x/pci: only limit DMA aperture if vfio DMA limit reported

2023-11-13 Thread Michael Tokarev

10.11.2023 20:51, Matthew Rosato wrote:

If the host kernel lacks vfio DMA limit reporting, do not attempt
to shrink the guest DMA aperture.

Fixes: df202e3ff3 ("s390x/pci: shrink DMA aperture to be bound by vfio DMA 
limit")
Signed-off-by: Matthew Rosato 


Is this stable-8.1 material?

Thanks,

/mjt


---
  hw/s390x/s390-pci-vfio.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/s390x/s390-pci-vfio.c b/hw/s390x/s390-pci-vfio.c
index e28573b593..7dbbc76823 100644
--- a/hw/s390x/s390-pci-vfio.c
+++ b/hw/s390x/s390-pci-vfio.c
@@ -136,7 +136,7 @@ static void s390_pci_read_base(S390PCIBusDevice *pbdev,
   * to the guest based upon the vfio DMA limit.
   */
  vfio_size = pbdev->iommu->max_dma_limit << TARGET_PAGE_BITS;
-if (vfio_size < (cap->end_dma - cap->start_dma + 1)) {
+if (vfio_size > 0 && vfio_size < cap->end_dma - cap->start_dma + 1) {
  pbdev->zpci_fn.edma = cap->start_dma + vfio_size - 1;
  }
  }





[PATCH] spelling: hw/audio/virtio-snd.c: initalize

2023-11-13 Thread Michael Tokarev
Fixes: eb9ad377bb94 "virtio-sound: handle control messages and streams"
Signed-off-by: Michael Tokarev 
---
 hw/audio/virtio-snd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/audio/virtio-snd.c b/hw/audio/virtio-snd.c
index a18a9949a7..2fe966e311 100644
--- a/hw/audio/virtio-snd.c
+++ b/hw/audio/virtio-snd.c
@@ -1126,7 +1126,7 @@ static void virtio_snd_realize(DeviceState *dev, Error 
**errp)
 status = virtio_snd_set_pcm_params(vsnd, i, &default_params);
 if (status != cpu_to_le32(VIRTIO_SND_S_OK)) {
 error_setg(errp,
-   "Can't initalize stream params, device responded with 
%s.",
+   "Can't initialize stream params, device responded with 
%s.",
print_code(status));
 return;
 }
-- 
2.39.2




[PATCH] spelling: qapi/migration.json: transfering

2023-11-13 Thread Michael Tokarev
Fixes: 074dbce5fcce "migration: New migrate and migrate-incoming argument 
'channels'"
Signed-off-by: Michael Tokarev 
---
 qapi/migration.json | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index 975761eebd..eb2f883513 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -1658,7 +1658,7 @@
 #
 # Migration stream channel parameters.
 #
-# @channel-type: Channel type for transfering packet information.
+# @channel-type: Channel type for transferring packet information.
 #
 # @addr: Migration endpoint configuration on destination interface.
 #
-- 
2.39.2




Re: Instruction virtual address in TCG Plugins

2023-11-13 Thread Alex Bennée
Mikhail Tyutin  writes:

> Greetings,
>
> What is the right way to get virtual address of either translation block or 
> instruction inside of TCG plugin? Does
> plugin API allow that or it needs some extension?
>
> So far I use qemu_plugin_tb_vaddr() inside of my block translation callback 
> to get block virtual address and then
> pass it as 'userdata' argument into qemu_plugin_register_vcpu_tb_exec_cb(). I 
> use it later during code execution.
> It works well for user-mode emulation, but sometimes leads to
> incorrect addresses in system-mode emulation.

You can use qemu_plugin_insn_vaddr and qemu_plugin_insn_haddr. But
you're right: something under one vaddr can be executed under another
with overlapping mappings. The haddr should be stable, though, I think.

> I suspect it is because of memory mappings by guest OS that changes virtual 
> addresses for that block.
>
> I also looked at gen_empty_udata_cb() function and considered to extend 
> plugin API to pass a program counter
> value as additional callback argument. I thought it would always give me 
> valid virtual address of an instruction.
> Unfortunately, I didn't find a way to get value of that register in 
> architecture agnostic way (it is 'pc' member in
> CPUArchState structure).

When we merge the register api you should be able to do that. Although
during testing I realised that PC acted funny compared to everything
else because we don't actually update the shadow register every
instruction.

>
> ---
> Mikhail

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



Re: [PATCH v2 3/3] hw/ide/via: implement legacy/native mode switching

2023-11-13 Thread Mark Cave-Ayland

On 07/11/2023 10:43, Kevin Wolf wrote:


On 06.11.2023 at 17:13, BALATON Zoltan wrote:

On Mon, 6 Nov 2023, Kevin Wolf wrote:

On 25.10.2023 at 00:40, Mark Cave-Ayland wrote:

Allow the VIA IDE controller to switch between both legacy and native modes by
calling pci_ide_update_mode() to reconfigure the device whenever PCI_CLASS_PROG
is updated.

This patch moves the initial setting of PCI_CLASS_PROG from via_ide_realize() to
via_ide_reset(), and removes the direct setting of PCI_INTERRUPT_PIN during PCI
bus reset since this is now managed by pci_ide_update_mode(). This ensures that
the device configuration is always consistent with respect to the currently
selected mode.

Signed-off-by: Mark Cave-Ayland 
Tested-by: BALATON Zoltan 
Tested-by: Bernhard Beschow 


As I already noted in patch 1, the interrupt handling seems to be wrong
here, it continues to use the ISA IRQ in via_ide_set_irq() even after
switching to native mode.


That's a peculiarity of this via-ide device. It always uses 14/15 legacy
interrupts even in native mode and guests expect that so using native
interrupts would break pegasos2 guests. This was discussed and tested
extensively before.


This definitely needs a comment to explain the situation then because
this is in violation of the spec. If real hardware behaves like this,
it's what we should do, of course, but it's certainly unexpected and we
should explicitly document it to avoid breaking it later when someone
touches the code who doesn't know about this peculiarity.


It's a little bit more complicated than this: in native mode it is possible to route 
the IRQs for each individual channel to a small set of IRQs by configuring 
special registers on the VIA.


The complication here is that it isn't immediately obvious how the QEMU PCI routing 
code can do this - I did post about this at 
https://lists.gnu.org/archive/html/qemu-devel/2023-10/msg10552.html asking the best 
way to resolve this, but haven't had any replies yet.


Fortunately it seems that all the guests tested so far stick with the IRQ 14/15 
defaults, which is why this happens to work, so short-term this is a lower priority 
when looking at consolidating the switching logic.



ATB,

Mark.




Re: [PATCH] MAINTAINERS: update virtio-fs mailing list address

2023-11-13 Thread Stefan Hajnoczi
Applied.

Stefan



Re: [QEMU][PATCHv2 0/8] Xen: support grant mappings.

2023-11-13 Thread David Woodhouse
On Fri, 2023-10-27 at 07:27 +0200, Juergen Gross wrote:
> On 26.10.23 22:56, Stefano Stabellini wrote:
> > On Thu, 26 Oct 2023, David Woodhouse wrote:
> > > On Thu, 2023-10-26 at 13:36 -0700, Stefano Stabellini wrote:
> > > > 
> > > > > This seems like a lot of code to replace that simpler option... is
> > > > > there a massive performance win from doing it this way? Would we want
> > > > > to use this trick for the Xen PV backends (qdisk, qnic) *too*? Might 
> > > > > it
> > > > > make sense to introduce the simple version and *then* the 
> > > > > optimisation,
> > > > > with some clear benchmarking to show the win?
> > > > 
> > > > This is not done for performance but for safety (as in safety
> > > > certifications, ISO 26262, etc.). This is to enable unprivileged virtio
> > > > backends running in a DomU. By unprivileged I mean a virtio backend that
> > > > is unable to map arbitrary memory (the xenforeignmemory interface is
> > > > prohibited).
> > > > 
> > > > The goal is to run Xen on safety-critical systems such as cars,
> > > > industrial robots and more. In this configuration there is no
> > > > traditional Dom0 in the system at all. If you  would like to know more:
> > > > https://www.youtube.com/watch?v=tisljY6Bqv0&list=PLYyw7IQjL-zHtpYtMpFR3KYdRn0rcp5Xn&index=8
> > > 
> > > Yeah, I understand why we're using grant mappings instead of just
> > > directly having access via foreignmem mappings. That wasn't what I was
> > > confused about.
> > > 
> > > What I haven't worked out is why we're implementing this through an
> > > automatically-populated MemoryRegion in QEMU, rather than just using
> > > grant mapping ops like we always have.
> > > 
> > > It seems like a lot of complexity just to avoid calling
> > > qemu_xen_gnttab_map_refs() from the virtio backend.
> > 
> > I think there are two questions here. One question is "Why do we need
> > all the new grant mapping code added to xen-mapcache.c in patch #7?
> > Can't we use qemu_xen_gnttab_map_refs() instead?"
> 
> The main motivation was to _avoid_ having to change all the backends.
> 
> My implementation enables _all_ qemu based virtio backends to use grant
> mappings. And if a new backend is added to qemu, there will be no change
> required to make it work with grants.

I'm not really convinced I buy that. This is a lot of complexity, and
don't backends need to call an appropriate mapping function to map via
an IOMMU if it's present anyway? Make them call a helper where you can
do this in one place directly instead of through a fake MemoryRegion,
and you're done, surely? 




[PATCH] migration: fix coverity migrate_mode finding

2023-11-13 Thread Steve Sistare
Coverity diagnoses a possible out-of-range array index here ...

static GSList *migration_blockers[MIG_MODE__MAX];

fill_source_migration_info() {
GSList *cur_blocker = migration_blockers[migrate_mode()];

... because it does not know that MIG_MODE__MAX will never be returned as
a migration mode.  To fix, assert so in migrate_mode().
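The shape of the fix — validate an enum before it is used as an array
index, so both static analysis and readers see the bound — looks like this
in isolation. The enum values mirror the real MigMode but are illustrative
here:

```c
#include <assert.h>
#include <string.h>

/* Stand-in for the QAPI-generated enum; the __MAX value is a count */
typedef enum MigMode {
    MIG_MODE_NORMAL,
    MIG_MODE_CPR_REBOOT,
    MIG_MODE__MAX,
} MigMode;

static const char *const mode_names[MIG_MODE__MAX] = {
    "normal",
    "cpr-reboot",
};

/*
 * Asserting the range before returning gives every caller a value that
 * is provably a valid index into MIG_MODE__MAX-sized arrays.
 */
static MigMode checked_mode(int raw)
{
    MigMode mode = (MigMode)raw;

    assert(mode >= 0 && mode < MIG_MODE__MAX);
    return mode;
}

static const char *mode_name(int raw)
{
    return mode_names[checked_mode(raw)];
}
```

This is the same idea as the patch below: the assert lives in the accessor,
so all array users downstream inherit the guarantee.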

Fixes: fa3673e497a1 ("migration: per-mode blockers")

Reported-by: Peter Maydell 
Suggested-by: Peter Maydell 
Signed-off-by: Steve Sistare 
---
 migration/options.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/migration/options.c b/migration/options.c
index 8d8ec73..3e3e0b9 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -833,8 +833,10 @@ uint64_t migrate_max_postcopy_bandwidth(void)
 MigMode migrate_mode(void)
 {
 MigrationState *s = migrate_get_current();
+MigMode mode = s->parameters.mode;
 
-return s->parameters.mode;
+assert(mode >= 0 && mode < MIG_MODE__MAX);
+return mode;
 }
 
 int migrate_multifd_channels(void)
-- 
1.8.3.1




[PATCH V7 7/8] gdbstub: Add helper function to unregister GDB register space

2023-11-13 Thread Salil Mehta via
Add a common function to help unregister the GDB register space. This
shall be done in the context of CPU unrealization.

Signed-off-by: Salil Mehta 
Tested-by: Vishnu Pajjuri 
Reviewed-by: Gavin Shan 
Tested-by: Xianglai Li 
Tested-by: Miguel Luis 
Reviewed-by: Shaoqin Huang 
---
 gdbstub/gdbstub.c  | 12 
 include/exec/gdbstub.h |  5 +
 2 files changed, 17 insertions(+)

diff --git a/gdbstub/gdbstub.c b/gdbstub/gdbstub.c
index b1532118d1..7bd6d45857 100644
--- a/gdbstub/gdbstub.c
+++ b/gdbstub/gdbstub.c
@@ -498,6 +498,18 @@ void gdb_register_coprocessor(CPUState *cpu,
 }
 }
 
+void gdb_unregister_coprocessor_all(CPUState *cpu)
+{
+/*
+ * Safe to nuke everything. GDBRegisterState::xml is static const char so
+ * it won't be freed
+ */
+g_array_free(cpu->gdb_regs, true);
+
+cpu->gdb_regs = NULL;
+cpu->gdb_num_g_regs = 0;
+}
+
 static void gdb_process_breakpoint_remove_all(GDBProcess *p)
 {
 CPUState *cpu = gdb_get_first_cpu_in_process(p);
diff --git a/include/exec/gdbstub.h b/include/exec/gdbstub.h
index 1a01c35f8e..3744257ed3 100644
--- a/include/exec/gdbstub.h
+++ b/include/exec/gdbstub.h
@@ -32,6 +32,11 @@ typedef int (*gdb_set_reg_cb)(CPUArchState *env, uint8_t *buf, int reg);
 void gdb_register_coprocessor(CPUState *cpu,
   gdb_get_reg_cb get_reg, gdb_set_reg_cb set_reg,
   int num_regs, const char *xml, int g_pos);
+/**
+ * gdb_unregister_coprocessor_all() - unregisters supplemental set of registers
+ * @cpu - the CPU associated with registers
+ */
+void gdb_unregister_coprocessor_all(CPUState *cpu);
 
 /**
  * gdbserver_start: start the gdb server
-- 
2.34.1
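
The ownership rule behind the patch above — free the register array wholesale while leaving the statically allocated xml strings alone — can be modeled in plain C. This is an illustrative sketch, not QEMU's GArray-based implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for GDBRegisterState and CPUState. The xml field
 * points at a static const string, so it is never freed individually. */
typedef struct {
    const char *xml;   /* static const char *, not owned by the array */
    int base_reg;
} Reg;

typedef struct {
    Reg *gdb_regs;
    int gdb_num_g_regs;
} Cpu;

/* Mirrors gdb_unregister_coprocessor_all(): one free for the whole
 * array, then reset the bookkeeping fields. */
static void unregister_all(Cpu *cpu)
{
    free(cpu->gdb_regs);
    cpu->gdb_regs = NULL;
    cpu->gdb_num_g_regs = 0;
}
```

Resetting `gdb_num_g_regs` matters: a later re-registration (e.g. on CPU hotplug) must start from a clean slate.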




[PATCH V7 8/8] docs/specs/acpi_hw_reduced_hotplug: Add the CPU Hotplug Event Bit

2023-11-13 Thread Salil Mehta via
The GED interface is used by many hotplug events, like memory hotplug and
NVDIMM hotplug, and by non-hotplug events, like the system power down event.
Each of these can be selected using a bit in the 32-bit GED IO interface. A
bit has been reserved for the CPU hotplug event.

Signed-off-by: Salil Mehta 
---
 docs/specs/acpi_hw_reduced_hotplug.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/specs/acpi_hw_reduced_hotplug.rst b/docs/specs/acpi_hw_reduced_hotplug.rst
index 0bd3f9399f..3acd6fcd8b 100644
--- a/docs/specs/acpi_hw_reduced_hotplug.rst
+++ b/docs/specs/acpi_hw_reduced_hotplug.rst
@@ -64,7 +64,8 @@ GED IO interface (4 byte access)
0: Memory hotplug event
1: System power down event
2: NVDIMM hotplug event
-3-31: Reserved
+   3: CPU hotplug event
+4-31: Reserved
 
 **write_access:**
 
-- 
2.34.1
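
A guest-side consumer of the selector documented above would test these bits to dispatch the event. The sketch below is illustrative only — the macro and function names are invented here, and real guests do this in AML/kernel code rather than C like this.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Event bits in the 4-byte GED IO selector, matching the table above.
 * Macro names are illustrative, not QEMU's. */
#define GED_MEM_HOTPLUG_EVT     (1u << 0)
#define GED_PWR_DOWN_EVT        (1u << 1)
#define GED_NVDIMM_HOTPLUG_EVT  (1u << 2)
#define GED_CPU_HOTPLUG_EVT     (1u << 3)   /* newly reserved bit */

/* Map the first set selector bit to a printable event name. */
static const char *ged_event_name(uint32_t sel)
{
    if (sel & GED_MEM_HOTPLUG_EVT)    return "memory hotplug";
    if (sel & GED_PWR_DOWN_EVT)       return "power down";
    if (sel & GED_NVDIMM_HOTPLUG_EVT) return "nvdimm hotplug";
    if (sel & GED_CPU_HOTPLUG_EVT)    return "cpu hotplug";
    return NULL;   /* bits 4-31 are reserved */
}
```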




[PATCH V7 6/8] physmem: Add helper function to destroy CPU AddressSpace

2023-11-13 Thread Salil Mehta via
Virtual CPU hot-unplug leads to the unrealization of a CPU object. This also
involves destruction of the CPU AddressSpace. Add a common function to help
destroy the CPU AddressSpace.

Signed-off-by: Salil Mehta 
Tested-by: Vishnu Pajjuri 
Reviewed-by: Gavin Shan 
Tested-by: Xianglai Li 
Tested-by: Miguel Luis 
Reviewed-by: Shaoqin Huang 
---
 include/exec/cpu-common.h |  8 
 include/hw/core/cpu.h |  1 +
 system/physmem.c  | 29 +
 3 files changed, 38 insertions(+)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 605b160a7e..a930e49e02 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -127,6 +127,14 @@ size_t qemu_ram_pagesize_largest(void);
  */
 void cpu_address_space_init(CPUState *cpu, int asidx,
 const char *prefix, MemoryRegion *mr);
+/**
+ * cpu_address_space_destroy:
+ * @cpu: CPU for which address space needs to be destroyed
+ * @asidx: integer index of this address space
+ *
+ * Note that with KVM only one address space is supported.
+ */
+void cpu_address_space_destroy(CPUState *cpu, int asidx);
 
 void cpu_physical_memory_rw(hwaddr addr, void *buf,
 hwaddr len, bool is_write);
diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index 3968369554..708b6b48de 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -496,6 +496,7 @@ struct CPUState {
 QSIMPLEQ_HEAD(, qemu_work_item) work_list;
 
 CPUAddressSpace *cpu_ases;
+int cpu_ases_count;
 int num_ases;
 AddressSpace *as;
 MemoryRegion *memory;
diff --git a/system/physmem.c b/system/physmem.c
index edc3ed8ab9..a16f1d4056 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -761,6 +761,7 @@ void cpu_address_space_init(CPUState *cpu, int asidx,
 
 if (!cpu->cpu_ases) {
 cpu->cpu_ases = g_new0(CPUAddressSpace, cpu->num_ases);
+cpu->cpu_ases_count = cpu->num_ases;
 }
 
 newas = &cpu->cpu_ases[asidx];
@@ -774,6 +775,34 @@ void cpu_address_space_init(CPUState *cpu, int asidx,
 }
 }
 
+void cpu_address_space_destroy(CPUState *cpu, int asidx)
+{
+CPUAddressSpace *cpuas;
+
+assert(cpu->cpu_ases);
+assert(asidx >= 0 && asidx < cpu->num_ases);
+/* KVM cannot currently support multiple address spaces. */
+assert(asidx == 0 || !kvm_enabled());
+
+cpuas = &cpu->cpu_ases[asidx];
+if (tcg_enabled()) {
+memory_listener_unregister(&cpuas->tcg_as_listener);
+}
+
+address_space_destroy(cpuas->as);
+g_free_rcu(cpuas->as, rcu);
+
+if (asidx == 0) {
+/* reset the convenience alias for address space 0 */
+cpu->as = NULL;
+}
+
+if (--cpu->cpu_ases_count == 0) {
+g_free(cpu->cpu_ases);
+cpu->cpu_ases = NULL;
+}
+}
+
 AddressSpace *cpu_get_address_space(CPUState *cpu, int asidx)
 {
 /* Return the AddressSpace corresponding to the specified index */
-- 
2.34.1
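
The count-based teardown in cpu_address_space_destroy() — the array of per-CPU address spaces is freed only when the last index has been destroyed — can be sketched standalone. Types and names below are simplified stand-ins for QEMU's CPUState/CPUAddressSpace, under the assumption that each index is destroyed exactly once.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int live;
} AS;

typedef struct {
    AS *ases;    /* array allocated at init, like cpu->cpu_ases */
    int count;   /* like cpu->cpu_ases_count */
    int num;     /* like cpu->num_ases */
} Cpu;

static void cpu_as_init(Cpu *c, int num)
{
    c->ases = calloc(num, sizeof(AS));
    c->count = c->num = num;
    for (int i = 0; i < num; i++) {
        c->ases[i].live = 1;
    }
}

/* Mirrors cpu_address_space_destroy(): tear down one index, and free
 * the whole array when the last one goes away. */
static void cpu_as_destroy(Cpu *c, int idx)
{
    assert(c->ases && idx >= 0 && idx < c->num);
    c->ases[idx].live = 0;
    if (--c->count == 0) {
        free(c->ases);
        c->ases = NULL;
    }
}
```

The separate counter is needed because indices may be destroyed in any order, so no single call site knows it is the last.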




[PATCH V7 5/8] hw/acpi: Update CPUs AML with cpu-(ctrl)dev change

2023-11-13 Thread Salil Mehta via
The CPUs control device (\\_SB.PCI0) register interface for the x86 arch is
IO-port based, and the existing CPUs AML code assumes the _CRS object
evaluates to a system resource which describes an IO port address. But on the
ARM arch, the CPUs control device (\\_SB.PRES) register interface is
memory-mapped, hence the _CRS object should evaluate to a system resource
which describes a memory-mapped base address. Update the build CPUs AML
function to accept both IO/MEMORY region spaces and update the _CRS object
accordingly.

On x86, CPU hotplug uses the Generic ACPI GPE Block Bit 2 (GPE.2) event
handler to notify OSPM about any CPU hot(un)plug events. The latest CPU
hotplug is based on the ACPI Generic Event Device framework and uses the ACPI
GED device for the same. Not all architectures support a GPE-based CPU
hotplug event handler. Hence, make the AML for the GPE.2 event handler
conditional.

Co-developed-by: Keqian Zhu 
Signed-off-by: Keqian Zhu 
Signed-off-by: Salil Mehta 
Reviewed-by: Gavin Shan 
Tested-by: Vishnu Pajjuri 
Reviewed-by: Jonathan Cameron 
Tested-by: Xianglai Li 
Tested-by: Miguel Luis 
Reviewed-by: Shaoqin Huang 
---
 hw/acpi/cpu.c | 23 ---
 hw/i386/acpi-build.c  |  3 ++-
 include/hw/acpi/cpu.h |  5 +++--
 3 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/hw/acpi/cpu.c b/hw/acpi/cpu.c
index de1f9295dc..5b0eaad1c5 100644
--- a/hw/acpi/cpu.c
+++ b/hw/acpi/cpu.c
@@ -339,9 +339,10 @@ const VMStateDescription vmstate_cpu_hotplug = {
 #define CPU_FW_EJECT_EVENT "CEJF"
 
 void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
-build_madt_cpu_fn build_madt_cpu, hwaddr io_base,
+build_madt_cpu_fn build_madt_cpu, hwaddr base_addr,
 const char *res_root,
-const char *event_handler_method)
+const char *event_handler_method,
+AmlRegionSpace rs)
 {
 Aml *ifctx;
 Aml *field;
@@ -366,13 +367,19 @@ void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
 aml_append(cpu_ctrl_dev, aml_mutex(CPU_LOCK, 0));
 
 crs = aml_resource_template();
-aml_append(crs, aml_io(AML_DECODE16, io_base, io_base, 1,
+if (rs == AML_SYSTEM_IO) {
+aml_append(crs, aml_io(AML_DECODE16, base_addr, base_addr, 1,
ACPI_CPU_HOTPLUG_REG_LEN));
+} else {
+aml_append(crs, aml_memory32_fixed(base_addr,
+   ACPI_CPU_HOTPLUG_REG_LEN, AML_READ_WRITE));
+}
+
 aml_append(cpu_ctrl_dev, aml_name_decl("_CRS", crs));
 
 /* declare CPU hotplug MMIO region with related access fields */
 aml_append(cpu_ctrl_dev,
-aml_operation_region("PRST", AML_SYSTEM_IO, aml_int(io_base),
+aml_operation_region("PRST", rs, aml_int(base_addr),
  ACPI_CPU_HOTPLUG_REG_LEN));
 
 field = aml_field("PRST", AML_BYTE_ACC, AML_NOLOCK,
@@ -696,9 +703,11 @@ void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
 aml_append(sb_scope, cpus_dev);
 aml_append(table, sb_scope);
 
-method = aml_method(event_handler_method, 0, AML_NOTSERIALIZED);
-aml_append(method, aml_call0("\\_SB.CPUS." CPU_SCAN_METHOD));
-aml_append(table, method);
+if (event_handler_method) {
+method = aml_method(event_handler_method, 0, AML_NOTSERIALIZED);
+aml_append(method, aml_call0("\\_SB.CPUS." CPU_SCAN_METHOD));
+aml_append(table, method);
+}
 
 g_free(cphp_res_path);
 }
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 3f2b27cf75..f9f31f9db5 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -1550,7 +1550,8 @@ build_dsdt(GArray *table_data, BIOSLinker *linker,
 .fw_unplugs_cpu = pm->smi_on_cpu_unplug,
 };
 build_cpus_aml(dsdt, machine, opts, pc_madt_cpu_entry,
-   pm->cpu_hp_io_base, "\\_SB.PCI0", "\\_GPE._E02");
+   pm->cpu_hp_io_base, "\\_SB.PCI0", "\\_GPE._E02",
+   AML_SYSTEM_IO);
 }
 
 if (pcms->memhp_io_base && nr_mem) {
diff --git a/include/hw/acpi/cpu.h b/include/hw/acpi/cpu.h
index bc901660fb..b521a4e0de 100644
--- a/include/hw/acpi/cpu.h
+++ b/include/hw/acpi/cpu.h
@@ -60,9 +60,10 @@ typedef void (*build_madt_cpu_fn)(int uid, const CPUArchIdList *apic_ids,
   GArray *entry, bool force_enabled);
 
 void build_cpus_aml(Aml *table, MachineState *machine, CPUHotplugFeatures opts,
-build_madt_cpu_fn build_madt_cpu, hwaddr io_base,
+build_madt_cpu_fn build_madt_cpu, hwaddr base_addr,
 const char *res_root,
-const char *event_handler_method);
+const char *event_handler_method,
+AmlRegionSpace rs);
 
 void acpi_cpu_ospm_status(CPUHotplugState *cpu_st, ACPIOSTInfoList ***list);
 
-

[PATCH V7 4/8] hw/acpi: Update GED _EVT method AML with CPU scan

2023-11-13 Thread Salil Mehta via
OSPM evaluates the _EVT method to map the event. The CPU hotplug event
eventually results in the start of the CPU scan. The scan figures out the CPU
and the kind of event (plug/unplug) and notifies it back to the guest. Update
the GED AML _EVT method with the call to \\_SB.CPUS.CSCN.

Also, the macro CPU_SCAN_METHOD might be referred to in other places, like
during GED initialization, so it makes sense to place its definition in some
common header file like cpu_hotplug.h. But doing this can cause a compilation
break because of the conflicting macro definitions present in cpu.c and
cpu_hotplug.c, and because both these files get compiled due to historic
reasons of the x86 world, i.e. the decision to use the legacy (GPE.2) or
modern (GED) CPU hotplug interface happens at runtime [1]. To mitigate the
above, for now, declare a new common macro ACPI_CPU_SCAN_METHOD for the CPU
scan method instead.
(This needs a separate discussion later on for clean-up.)

Reference:
[1] https://lore.kernel.org/qemu-devel/1463496205-251412-24-git-send-email-imamm...@redhat.com/

Co-developed-by: Keqian Zhu 
Signed-off-by: Keqian Zhu 
Signed-off-by: Salil Mehta 
Reviewed-by: Jonathan Cameron 
Reviewed-by: Gavin Shan 
Tested-by: Vishnu Pajjuri 
Tested-by: Xianglai Li 
Tested-by: Miguel Luis 
Reviewed-by: Shaoqin Huang 
---
 hw/acpi/cpu.c  | 2 +-
 hw/acpi/generic_event_device.c | 4 
 include/hw/acpi/cpu_hotplug.h  | 2 ++
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/acpi/cpu.c b/hw/acpi/cpu.c
index 4b24a25003..de1f9295dc 100644
--- a/hw/acpi/cpu.c
+++ b/hw/acpi/cpu.c
@@ -323,7 +323,7 @@ const VMStateDescription vmstate_cpu_hotplug = {
 #define CPUHP_RES_DEVICE  "PRES"
 #define CPU_LOCK  "CPLK"
 #define CPU_STS_METHOD"CSTA"
-#define CPU_SCAN_METHOD   "CSCN"
+#define CPU_SCAN_METHOD   ACPI_CPU_SCAN_METHOD
 #define CPU_NOTIFY_METHOD "CTFY"
 #define CPU_EJECT_METHOD  "CEJ0"
 #define CPU_OST_METHOD"COST"
diff --git a/hw/acpi/generic_event_device.c b/hw/acpi/generic_event_device.c
index 57b0c2815b..f547b96d74 100644
--- a/hw/acpi/generic_event_device.c
+++ b/hw/acpi/generic_event_device.c
@@ -109,6 +109,10 @@ void build_ged_aml(Aml *table, const char *name, HotplugHandler *hotplug_dev,
 aml_append(if_ctx, aml_call0(MEMORY_DEVICES_CONTAINER "."
  MEMORY_SLOT_SCAN_METHOD));
 break;
+case ACPI_GED_CPU_HOTPLUG_EVT:
+aml_append(if_ctx, aml_call0(ACPI_CPU_CONTAINER "."
+ ACPI_CPU_SCAN_METHOD));
+break;
 case ACPI_GED_PWR_DOWN_EVT:
 aml_append(if_ctx,
aml_notify(aml_name(ACPI_POWER_BUTTON_DEVICE),
diff --git a/include/hw/acpi/cpu_hotplug.h b/include/hw/acpi/cpu_hotplug.h
index 48b291e45e..ef631750b4 100644
--- a/include/hw/acpi/cpu_hotplug.h
+++ b/include/hw/acpi/cpu_hotplug.h
@@ -20,6 +20,8 @@
 #include "hw/acpi/cpu.h"
 
 #define ACPI_CPU_HOTPLUG_REG_LEN 12
+#define ACPI_CPU_SCAN_METHOD "CSCN"
+#define ACPI_CPU_CONTAINER "\\_SB.CPUS"
 
 typedef struct AcpiCpuHotplug {
 Object *device;
-- 
2.34.1




[PATCH V7 3/8] hw/acpi: Update ACPI GED framework to support vCPU Hotplug

2023-11-13 Thread Salil Mehta via
ACPI GED (as described in the ACPI 6.4 spec) uses an interrupt listed in the
_CRS object of GED to intimate OSPM about an event. OSPM then demultiplexes
the notified event by evaluating the ACPI _EVT method to know the type of
event. Use ACPI GED to also notify the guest kernel about any CPU hot(un)plug
events.

ACPI CPU hotplug related initialization should only happen if
ACPI_CPU_HOTPLUG support has been enabled for the particular architecture.
Add a cpu_hotplug_hw_init() stub to avoid a compilation break.

Co-developed-by: Keqian Zhu 
Signed-off-by: Keqian Zhu 
Signed-off-by: Salil Mehta 
Reviewed-by: Jonathan Cameron 
Reviewed-by: Gavin Shan 
Reviewed-by: David Hildenbrand 
Reviewed-by: Shaoqin Huang 
Tested-by: Vishnu Pajjuri 
Tested-by: Xianglai Li 
Tested-by: Miguel Luis 
---
 hw/acpi/acpi-cpu-hotplug-stub.c|  6 ++
 hw/acpi/generic_event_device.c | 17 +
 include/hw/acpi/generic_event_device.h |  4 
 3 files changed, 27 insertions(+)

diff --git a/hw/acpi/acpi-cpu-hotplug-stub.c b/hw/acpi/acpi-cpu-hotplug-stub.c
index 3fc4b14c26..c6c61bb9cd 100644
--- a/hw/acpi/acpi-cpu-hotplug-stub.c
+++ b/hw/acpi/acpi-cpu-hotplug-stub.c
@@ -19,6 +19,12 @@ void legacy_acpi_cpu_hotplug_init(MemoryRegion *parent, Object *owner,
 return;
 }
 
+void cpu_hotplug_hw_init(MemoryRegion *as, Object *owner,
+ CPUHotplugState *state, hwaddr base_addr)
+{
+return;
+}
+
 void acpi_cpu_ospm_status(CPUHotplugState *cpu_st, ACPIOSTInfoList ***list)
 {
 return;
diff --git a/hw/acpi/generic_event_device.c b/hw/acpi/generic_event_device.c
index a3d31631fe..57b0c2815b 100644
--- a/hw/acpi/generic_event_device.c
+++ b/hw/acpi/generic_event_device.c
@@ -12,6 +12,7 @@
 #include "qemu/osdep.h"
 #include "qapi/error.h"
 #include "hw/acpi/acpi.h"
+#include "hw/acpi/cpu.h"
 #include "hw/acpi/generic_event_device.h"
 #include "hw/irq.h"
 #include "hw/mem/pc-dimm.h"
@@ -25,6 +26,7 @@ static const uint32_t ged_supported_events[] = {
 ACPI_GED_MEM_HOTPLUG_EVT,
 ACPI_GED_PWR_DOWN_EVT,
 ACPI_GED_NVDIMM_HOTPLUG_EVT,
+ACPI_GED_CPU_HOTPLUG_EVT,
 };
 
 /*
@@ -234,6 +236,8 @@ static void acpi_ged_device_plug_cb(HotplugHandler *hotplug_dev,
 } else {
 acpi_memory_plug_cb(hotplug_dev, &s->memhp_state, dev, errp);
 }
+} else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
+acpi_cpu_plug_cb(hotplug_dev, &s->cpuhp_state, dev, errp);
 } else {
 error_setg(errp, "virt: device plug request for unsupported device"
" type: %s", object_get_typename(OBJECT(dev)));
@@ -248,6 +252,8 @@ static void acpi_ged_unplug_request_cb(HotplugHandler *hotplug_dev,
 if ((object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM) &&
!(object_dynamic_cast(OBJECT(dev), TYPE_NVDIMM {
 acpi_memory_unplug_request_cb(hotplug_dev, &s->memhp_state, dev, errp);
+} else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
+acpi_cpu_unplug_request_cb(hotplug_dev, &s->cpuhp_state, dev, errp);
 } else {
 error_setg(errp, "acpi: device unplug request for unsupported device"
" type: %s", object_get_typename(OBJECT(dev)));
@@ -261,6 +267,8 @@ static void acpi_ged_unplug_cb(HotplugHandler *hotplug_dev,
 
 if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
 acpi_memory_unplug_cb(&s->memhp_state, dev, errp);
+} else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
+acpi_cpu_unplug_cb(&s->cpuhp_state, dev, errp);
 } else {
 error_setg(errp, "acpi: device unplug for unsupported device"
" type: %s", object_get_typename(OBJECT(dev)));
@@ -272,6 +280,7 @@ static void acpi_ged_ospm_status(AcpiDeviceIf *adev, ACPIOSTInfoList ***list)
 AcpiGedState *s = ACPI_GED(adev);
 
 acpi_memory_ospm_status(&s->memhp_state, list);
+acpi_cpu_ospm_status(&s->cpuhp_state, list);
 }
 
 static void acpi_ged_send_event(AcpiDeviceIf *adev, AcpiEventStatusBits ev)
@@ -286,6 +295,8 @@ static void acpi_ged_send_event(AcpiDeviceIf *adev, AcpiEventStatusBits ev)
 sel = ACPI_GED_PWR_DOWN_EVT;
 } else if (ev & ACPI_NVDIMM_HOTPLUG_STATUS) {
 sel = ACPI_GED_NVDIMM_HOTPLUG_EVT;
+} else if (ev & ACPI_CPU_HOTPLUG_STATUS) {
+sel = ACPI_GED_CPU_HOTPLUG_EVT;
 } else {
 /* Unknown event. Return without generating interrupt. */
 warn_report("GED: Unsupported event %d. No irq injected", ev);
@@ -400,6 +411,12 @@ static void acpi_ged_initfn(Object *obj)
 memory_region_init_io(&ged_st->regs, obj, &ged_regs_ops, ged_st,
   TYPE_ACPI_GED "-regs", ACPI_GED_REG_COUNT);
 sysbus_init_mmio(sbd, &ged_st->regs);
+
+memory_region_init(&s->container_cpuhp, OBJECT(dev), "cpuhp container",
+   ACPI_CPU_HOTPLUG_REG_LEN);
+sysbus_init_mmio(SYS_BUS_DEVICE(dev), &s->container_cpuhp);
+cpu_hotplug_hw_init(&s->container_cpuhp, OBJECT(dev),
+  

[PATCH V7 2/8] hw/acpi: Move CPU ctrl-dev MMIO region len macro to common header file

2023-11-13 Thread Salil Mehta via
The CPU ctrl-dev MMIO region length could be used in ACPI GED and various
other architecture-specific places. Move the ACPI_CPU_HOTPLUG_REG_LEN macro
to a more appropriate common header file.

Signed-off-by: Salil Mehta 
Reviewed-by: Alex Bennée 
Reviewed-by: Jonathan Cameron 
Reviewed-by: Gavin Shan 
Reviewed-by: David Hildenbrand 
Reviewed-by: Shaoqin Huang 
Tested-by: Vishnu Pajjuri 
Tested-by: Xianglai Li 
Tested-by: Miguel Luis 
---
 hw/acpi/cpu.c | 2 +-
 include/hw/acpi/cpu_hotplug.h | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/acpi/cpu.c b/hw/acpi/cpu.c
index 011d2c6c2d..4b24a25003 100644
--- a/hw/acpi/cpu.c
+++ b/hw/acpi/cpu.c
@@ -1,13 +1,13 @@
 #include "qemu/osdep.h"
 #include "migration/vmstate.h"
 #include "hw/acpi/cpu.h"
+#include "hw/acpi/cpu_hotplug.h"
 #include "hw/core/cpu.h"
 #include "qapi/error.h"
 #include "qapi/qapi-events-acpi.h"
 #include "trace.h"
 #include "sysemu/numa.h"
 
-#define ACPI_CPU_HOTPLUG_REG_LEN 12
 #define ACPI_CPU_SELECTOR_OFFSET_WR 0
 #define ACPI_CPU_FLAGS_OFFSET_RW 4
 #define ACPI_CPU_CMD_OFFSET_WR 5
diff --git a/include/hw/acpi/cpu_hotplug.h b/include/hw/acpi/cpu_hotplug.h
index 3b932a..48b291e45e 100644
--- a/include/hw/acpi/cpu_hotplug.h
+++ b/include/hw/acpi/cpu_hotplug.h
@@ -19,6 +19,8 @@
 #include "hw/hotplug.h"
 #include "hw/acpi/cpu.h"
 
+#define ACPI_CPU_HOTPLUG_REG_LEN 12
+
 typedef struct AcpiCpuHotplug {
 Object *device;
 MemoryRegion io;
-- 
2.34.1




Re: [PATCH-for-9.0 01/10] sysemu/xen: Forbid using Xen headers in user emulation

2023-11-13 Thread David Woodhouse
On Mon, 2023-11-13 at 16:21 +0100, Philippe Mathieu-Daudé wrote:
> Xen is a system specific accelerator, it makes no sense
> to include its headers in user emulation.
> 
> Signed-off-by: Philippe Mathieu-Daudé 

Reviewed-by: David Woodhouse 






[PATCH V7 1/8] accel/kvm: Extract common KVM vCPU {creation, parking} code

2023-11-13 Thread Salil Mehta via
KVM vCPU creation is done once during vCPU realization, when the QEMU vCPU
thread is spawned. This is common to all architectures as of now.

Hot-unplug of a vCPU results in the destruction of the vCPU object in QOM,
but the corresponding KVM vCPU object in the host KVM is not destroyed, as
KVM doesn't support vCPU removal. Therefore, its representative KVM vCPU
object/context in QEMU is parked.

Refactor the architecture-common logic so that some APIs can be reused by the
vCPU hotplug code of architectures like ARM, Loongson, etc. Update the
new/old APIs with trace events instead of DPRINTF. No functional change is
intended here.

Signed-off-by: Salil Mehta 
Reviewed-by: Gavin Shan 
Tested-by: Vishnu Pajjuri 
Reviewed-by: Jonathan Cameron 
Tested-by: Xianglai Li 
Tested-by: Miguel Luis 
Reviewed-by: Shaoqin Huang 
---
 accel/kvm/kvm-all.c| 64 --
 accel/kvm/trace-events |  4 +++
 include/sysemu/kvm.h   | 16 +++
 3 files changed, 69 insertions(+), 15 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 72e1d1141c..bfa7816aaa 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -137,6 +137,7 @@ static QemuMutex kml_slots_lock;
 #define kvm_slots_unlock()  qemu_mutex_unlock(&kml_slots_lock)
 
 static void kvm_slot_init_dirty_bitmap(KVMSlot *mem);
+static int kvm_get_vcpu(KVMState *s, unsigned long vcpu_id);
 
 static inline void kvm_resample_fd_remove(int gsi)
 {
@@ -320,14 +321,53 @@ err:
 return ret;
 }
 
+void kvm_park_vcpu(CPUState *cpu)
+{
+struct KVMParkedVcpu *vcpu;
+
+trace_kvm_park_vcpu(cpu->cpu_index, kvm_arch_vcpu_id(cpu));
+
+vcpu = g_malloc0(sizeof(*vcpu));
+vcpu->vcpu_id = kvm_arch_vcpu_id(cpu);
+vcpu->kvm_fd = cpu->kvm_fd;
+QLIST_INSERT_HEAD(&kvm_state->kvm_parked_vcpus, vcpu, node);
+}
+
+int kvm_create_vcpu(CPUState *cpu)
+{
+unsigned long vcpu_id = kvm_arch_vcpu_id(cpu);
+KVMState *s = kvm_state;
+int kvm_fd;
+
+trace_kvm_create_vcpu(cpu->cpu_index, kvm_arch_vcpu_id(cpu));
+
+/* check if the KVM vCPU already exist but is parked */
+kvm_fd = kvm_get_vcpu(s, vcpu_id);
+if (kvm_fd < 0) {
+/* vCPU not parked: create a new KVM vCPU */
+kvm_fd = kvm_vm_ioctl(s, KVM_CREATE_VCPU, vcpu_id);
+if (kvm_fd < 0) {
+error_report("KVM_CREATE_VCPU IOCTL failed for vCPU %lu", vcpu_id);
+return kvm_fd;
+}
+}
+
+cpu->kvm_fd = kvm_fd;
+cpu->kvm_state = s;
+cpu->vcpu_dirty = true;
+cpu->dirty_pages = 0;
+cpu->throttle_us_per_full = 0;
+
+return 0;
+}
+
 static int do_kvm_destroy_vcpu(CPUState *cpu)
 {
 KVMState *s = kvm_state;
 long mmap_size;
-struct KVMParkedVcpu *vcpu = NULL;
 int ret = 0;
 
-DPRINTF("kvm_destroy_vcpu\n");
+trace_kvm_destroy_vcpu(cpu->cpu_index, kvm_arch_vcpu_id(cpu));
 
 ret = kvm_arch_destroy_vcpu(cpu);
 if (ret < 0) {
@@ -353,10 +393,7 @@ static int do_kvm_destroy_vcpu(CPUState *cpu)
 }
 }
 
-vcpu = g_malloc0(sizeof(*vcpu));
-vcpu->vcpu_id = kvm_arch_vcpu_id(cpu);
-vcpu->kvm_fd = cpu->kvm_fd;
-QLIST_INSERT_HEAD(&kvm_state->kvm_parked_vcpus, vcpu, node);
+kvm_park_vcpu(cpu);
 err:
 return ret;
 }
@@ -377,6 +414,8 @@ static int kvm_get_vcpu(KVMState *s, unsigned long vcpu_id)
 if (cpu->vcpu_id == vcpu_id) {
 int kvm_fd;
 
+trace_kvm_get_vcpu(vcpu_id);
+
 QLIST_REMOVE(cpu, node);
 kvm_fd = cpu->kvm_fd;
 g_free(cpu);
@@ -384,7 +423,7 @@ static int kvm_get_vcpu(KVMState *s, unsigned long vcpu_id)
 }
 }
 
-return kvm_vm_ioctl(s, KVM_CREATE_VCPU, (void *)vcpu_id);
+return -ENOENT;
 }
 
 int kvm_init_vcpu(CPUState *cpu, Error **errp)
@@ -395,19 +434,14 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
 
 trace_kvm_init_vcpu(cpu->cpu_index, kvm_arch_vcpu_id(cpu));
 
-ret = kvm_get_vcpu(s, kvm_arch_vcpu_id(cpu));
+ret = kvm_create_vcpu(cpu);
 if (ret < 0) {
-error_setg_errno(errp, -ret, "kvm_init_vcpu: kvm_get_vcpu failed (%lu)",
+error_setg_errno(errp, -ret,
+ "kvm_init_vcpu: kvm_create_vcpu failed (%lu)",
  kvm_arch_vcpu_id(cpu));
 goto err;
 }
 
-cpu->kvm_fd = ret;
-cpu->kvm_state = s;
-cpu->vcpu_dirty = true;
-cpu->dirty_pages = 0;
-cpu->throttle_us_per_full = 0;
-
 mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);
 if (mmap_size < 0) {
 ret = mmap_size;
diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
index 399aaeb0ec..cdd0c95c09 100644
--- a/accel/kvm/trace-events
+++ b/accel/kvm/trace-events
@@ -9,6 +9,10 @@ kvm_device_ioctl(int fd, int type, void *arg) "dev fd %d, type 0x%x, arg %p"
 kvm_failed_reg_get(uint64_t id, const char *msg) "Warning: Unable to retrieve ONEREG %" PRIu64 " from KVM: %s"
 kvm_failed_reg_set(uint64_t id, const char *msg) "Warning: Unable to set ONEREG %" PRIu64 " to
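
The park/lookup flow introduced by kvm_park_vcpu() and kvm_get_vcpu() above can be modeled standalone: parked vCPU fds sit on a list keyed by vcpu_id and are handed back on re-plug. Plain-C stand-ins below, not QEMU's structures or QLIST macros.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for KVMParkedVcpu. */
typedef struct Parked {
    unsigned long vcpu_id;
    int kvm_fd;
    struct Parked *next;
} Parked;

static Parked *parked_list;

/* Like kvm_park_vcpu(): stash the fd instead of closing it, since KVM
 * doesn't support destroying a vCPU. */
static void park_vcpu(unsigned long vcpu_id, int kvm_fd)
{
    Parked *p = calloc(1, sizeof(*p));
    p->vcpu_id = vcpu_id;
    p->kvm_fd = kvm_fd;
    p->next = parked_list;
    parked_list = p;
}

/* Like the patched kvm_get_vcpu(): return the parked fd if present,
 * otherwise -1 (QEMU returns -ENOENT) so the caller creates a new one. */
static int get_parked_vcpu(unsigned long vcpu_id)
{
    for (Parked **pp = &parked_list; *pp; pp = &(*pp)->next) {
        if ((*pp)->vcpu_id == vcpu_id) {
            Parked *p = *pp;
            int fd = p->kvm_fd;
            *pp = p->next;   /* unlink before freeing the node */
            free(p);
            return fd;
        }
    }
    return -1;
}
```

This is why the patch changes kvm_get_vcpu() to return -ENOENT on a miss: creation moves into kvm_create_vcpu(), which only falls back to KVM_CREATE_VCPU when no parked fd exists.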

[PATCH V7 0/8] Add architecture agnostic code to support vCPU Hotplug

2023-11-13 Thread Salil Mehta via
Virtual CPU hotplug support is being added across various architectures[1][3].
This series adds various code bits common across all architectures:

1. vCPU creation and Parking code refactor [Patch 1]
2. Update ACPI GED framework to support vCPU Hotplug [Patch 2,3]
3. ACPI CPUs AML code change [Patch 4,5]
4. Helper functions to support unrealization of CPU objects [Patch 6,7]
5. Docs [Patch 8]


Repository:

[*] https://github.com/salil-mehta/qemu.git virt-cpuhp-armv8/rfc-v2.common.v7


Revision History:

Patch-set  V6 -> V7
1. Addressed Alex Bennée's comments
   - Updated the docs
2. Addressed Igor Mammedov's comments
   - Merged patches [Patch V6 3/9] & [Patch V6 7/9] with [Patch V6 4/9]
   - Updated commit-log of [Patch V6 1/9] and [Patch V6 5/9] 
3. Added Shaoqin Huang's Reviewed-by tags for whole series.
Link: https://lore.kernel.org/qemu-devel/20231013105129.25648-1-salil.me...@huawei.com/

Patch-set  V5 -> V6
1. Addressed Gavin Shan's comments
   - Fixed the assert() ranges of address spaces
   - Rebased the patch-set to latest changes in the qemu.git
   - Added Reviewed-by tags for patches {8,9}
2. Addressed Jonathan Cameron's comments
   - Updated commit-log for [Patch V5 1/9] with mention of trace events
   - Added Reviewed-by tags for patches {1,5}
3. Added Tested-by tags from Xianglai Li
4. Fixed checkpatch.pl error "Qemu -> QEMU" in [Patch V5 1/9] 
Link: https://lore.kernel.org/qemu-devel/20231011194355.15628-1-salil.me...@huawei.com/

Patch-set  V4 -> V5
1. Addressed Gavin Shan's comments
   - Fixed the trace events print string for kvm_{create,get,park,destroy}_vcpu
   - Added Reviewed-by tag for patch {1}
2. Added Shaoqin Huang's Reviewed-by tags for Patches {2,3}
3. Added Tested-by Tag from Vishnu Pajjuri to the patch-set
4. Dropped the ARM specific [Patch V4 10/10]
Link: https://lore.kernel.org/qemu-devel/20231009203601.17584-1-salil.me...@huawei.com/

Patch-set  V3 -> V4
1. Addressed David Hildenbrand's comments
   - Fixed the wrong doc comment of kvm_park_vcpu API prototype
   - Added Reviewed-by tags for patches {2,4}
Link: https://lore.kernel.org/qemu-devel/20231009112812.10612-1-salil.me...@huawei.com/

Patch-set  V2 -> V3
1. Addressed Jonathan Cameron's comments
   - Fixed 'vcpu-id' type wrongly changed from 'unsigned long' to 'integer'
   - Removed unnecessary use of variable 'vcpu_id' in kvm_park_vcpu
   - Updated [Patch V2 3/10] commit-log with details of ACPI_CPU_SCAN_METHOD 
macro
   - Updated [Patch V2 5/10] commit-log with details of conditional event 
handler method
   - Added Reviewed-by tags for patches {2,3,4,6,7}
2. Addressed Gavin Shan's comments
   - Remove unnecessary use of variable 'vcpu_id' in kvm_par_vcpu
   - Fixed return value in kvm_get_vcpu from -1 to -ENOENT
   - Reset the value of 'gdb_num_g_regs' in gdb_unregister_coprocessor_all
   - Fixed the kvm_{create,park}_vcpu prototypes docs
   - Added Reviewed-by tags for patches {2,3,4,5,6,7,9,10}
3. Addressed one earlier missed comment by Alex Bennée in RFC V1
   - Added traces instead of DPRINTF in the newly added and some existing 
functions
Link: https://lore.kernel.org/qemu-devel/20230930001933.2660-1-salil.me...@huawei.com/

Patch-set V1 -> V2
1. Addressed Alex Bennée's comments
   - Refactored the kvm_create_vcpu logic to get rid of goto
   - Added the docs for kvm_{create,park}_vcpu prototypes
   - Splitted the gdbstub and AddressSpace destruction change into separate 
patches
   - Added Reviewed-by tags for patches {2,10}
Link: https://lore.kernel.org/qemu-devel/20230929124304.13672-1-salil.me...@huawei.com/

References:

[1] https://lore.kernel.org/qemu-devel/20230926100436.28284-1-salil.me...@huawei.com/
[2] https://lore.kernel.org/all/20230913163823.7880-1-james.mo...@arm.com/
[3] https://lore.kernel.org/qemu-devel/cover.1695697701.git.lixiang...@loongson.cn/



Salil Mehta (8):
  accel/kvm: Extract common KVM vCPU {creation,parking} code
  hw/acpi: Move CPU ctrl-dev MMIO region len macro to common header file
  hw/acpi: Update ACPI GED framework to support vCPU Hotplug
  hw/acpi: Update GED _EVT method AML with CPU scan
  hw/acpi: Update CPUs AML with cpu-(ctrl)dev change
  physmem: Add helper function to destroy CPU AddressSpace
  gdbstub: Add helper function to unregister GDB register space
  docs/specs/acpi_hw_reduced_hotplug: Add the CPU Hotplug Event Bit

 accel/kvm/kvm-all.c| 64 --
 accel/kvm/trace-events |  4 ++
 docs/specs/acpi_hw_reduced_hotplug.rst |  3 +-
 gdbstub/gdbstub.c  | 12 +
 hw/acpi/acpi-cpu-hotplug-stub.c|  6 +++
 hw/acpi/cpu.c  | 27 +++
 hw/acpi/generic_event_device.c | 21 +
 hw/i386/acpi-build.c   |  3 +-
 include/exec/cpu-common.h  |  8 
 include/exec/gdbstub.h |  5 ++
 include/hw/acpi/cpu.h  |  5 +-
 include/hw/acpi/cpu_hotplug.h  |  4 ++
 include/hw/acpi/ge

Re: [PATCH-for-9.0 10/10] hw/xen: Have most of Xen files become target-agnostic

2023-11-13 Thread David Woodhouse
On Mon, 2023-11-13 at 16:21 +0100, Philippe Mathieu-Daudé wrote:
> Previous commits re-organized the target-specific bits
> from Xen files. We can now build the common files once
> instead of per-target.
> 
> Signed-off-by: Philippe Mathieu-Daudé 

Reviewed-by: David Woodhouse 





Re: [PATCH-for-9.0 09/10] hw/xen: Extract 'xen_igd.h' from 'xen_pt.h'

2023-11-13 Thread David Woodhouse
On Mon, 2023-11-13 at 16:21 +0100, Philippe Mathieu-Daudé wrote:
> "hw/xen/xen_pt.h" requires "hw/xen/xen_native.h" which is target
> specific. It also declares IGD methods, which are not target
> specific.
> 
> Target-agnostic code can use IGD methods. To allow that, extract
> these methods into a new "hw/xen/xen_igd.h" header.
> 
> Signed-off-by: Philippe Mathieu-Daudé 

Reviewed-by: David Woodhouse 

> ---
> What license for the new "hw/xen/xen_igd.h" header?

The existing xen_pt.h came in with xen_pt.c (GPLv2) in commit
eaab4d60d. I think it has to be GPLv2 (and not later) just like
xen_pt.c?




