date:20200520

On 14/05/2020 14.37, Janosch Frank wrote:
> Why should we do conversion of a ebcdic value if we have a handy table
> where we coul look up the ascii value instead?

s/coul/could/

> Signed-off-by: Janosch Frank 
> Reviewed-by: David Hildenbrand 
> ---
>  pc-bios/s390-ccw/bootmap.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/pc-bios/s390-ccw/bootmap.c b/pc-bios/s390-ccw/bootmap.c
> index d13b7cbd15..97205674e5 100644
> --- a/pc-bios/s390-ccw/bootmap.c
> +++ b/pc-bios/s390-ccw/bootmap.c
> @@ -328,9 +328,7 @@ static void print_eckd_ldl_msg(ECKD_IPL_mode_t mode)
>  msg[0] = '2';
>  break;
>  default:
> -msg[0] = vlbl->LDL_version;
> -msg[0] &= 0x0f; /* convert EBCDIC   */
> -msg[0] |= 0x30; /* to ASCII (digit) */
> +msg[0] = ebc2asc[vlbl->LDL_version];
>  msg[1] = '?';
>  break;
>  }
> 

Reviewed-by: Thomas Huth
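
For readers comparing the two approaches, here is a minimal sketch. It
assumes the label byte is an EBCDIC digit (0xF0..0xF9). The ten-entry table
is a stand-in invented for this example, not the firmware's ebc2asc[] array,
which covers every code point.

```c
#include <assert.h>
#include <stdint.h>

/* The mask trick relies on every ASCII digit living in the 0x30 zone. */
static_assert(('7' & 0xf0) == 0x30, "ASCII digit zone is 0x30");

/* Old approach: keep the low nibble, then force the ASCII digit zone. */
static char ebcdic_digit_by_mask(uint8_t ebc)
{
    return (char)((ebc & 0x0f) | 0x30);
}

/*
 * New approach: table lookup. This stand-in table only knows the ten
 * EBCDIC digits and answers '?' for anything else (an assumption of
 * this sketch, not the behaviour of the real table).
 */
static char ebcdic_digit_by_table(uint8_t ebc)
{
    static const char digits[10] = {
        '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
    };

    if (ebc >= 0xf0 && ebc <= 0xf9) {
        return digits[ebc - 0xf0];
    }
    return '?';
}
```

Both functions agree on digits. They differ on anything else: the mask
version turns EBCDIC 'A' (0xC1) into the digit '1', while a lookup can
return something sensible.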

Re: [PATCH v2 6/9] pc-bios: s390x: Move panic() into header and add infinite loop

On 14/05/2020 14.37, Janosch Frank wrote:
> panic() was defined for the ccw and net bios, i.e. twice, so it's
> cleaner to rather put it into the header.
> 
> Also let's add an infinite loop into the assembly of disabled_wait() so
> the caller doesn't need to take care of it.
> 
> Signed-off-by: Janosch Frank 
> Reviewed-by: Pierre Morel 
> Reviewed-by: David Hildenbrand 
> ---
>  pc-bios/s390-ccw/main.c | 7 ---
>  pc-bios/s390-ccw/netmain.c  | 8 
>  pc-bios/s390-ccw/s390-ccw.h | 9 +++--
>  pc-bios/s390-ccw/start.S| 5 +++--
>  4 files changed, 10 insertions(+), 19 deletions(-)

Reviewed-by: Thomas Huth

Re: [PATCH v2 5/9] pc-bios: s390x: Use PSW masks where possible

On 14/05/2020 14.37, Janosch Frank wrote:
> Let's move some of the PSW mask defines into s390-arch.h and use them
> in jump2ipl.c
> 
> Signed-off-by: Janosch Frank 
> Reviewed-by: David Hildenbrand 
> ---
>  pc-bios/s390-ccw/jump2ipl.c  | 10 --
>  pc-bios/s390-ccw/s390-arch.h |  2 ++
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/pc-bios/s390-ccw/jump2ipl.c b/pc-bios/s390-ccw/jump2ipl.c
> index 4eba2510b0..767012bf0c 100644
> --- a/pc-bios/s390-ccw/jump2ipl.c
> +++ b/pc-bios/s390-ccw/jump2ipl.c
> @@ -8,12 +8,10 @@
>  
>  #include "libc.h"
>  #include "s390-ccw.h"
> +#include "s390-arch.h"
>  
>  #define KERN_IMAGE_START 0x01UL
> -#define PSW_MASK_64 0x0001ULL
> -#define PSW_MASK_32 0x8000ULL
> -#define PSW_MASK_SHORTPSW 0x0008ULL
> -#define RESET_PSW_MASK (PSW_MASK_SHORTPSW | PSW_MASK_32 | PSW_MASK_64)
> +#define RESET_PSW_MASK (PSW_MASK_SHORTPSW | PSW_MASK_64)
>  
>  typedef struct ResetInfo {
>  uint64_t ipl_psw;
> @@ -54,7 +52,7 @@ void jump_to_IPL_code(uint64_t address)
>  
>  current->ipl_psw = (uint64_t) _to_IPL_2;
>  current->ipl_psw |= RESET_PSW_MASK;
> -current->ipl_continue = address & 0x7fff;
> +current->ipl_continue = address & PSW_MASK_SHORT_ADDR;
>  
>  debug_print_int("set IPL addr to", current->ipl_continue);
>  
> @@ -86,7 +84,7 @@ void jump_to_low_kernel(void)
>  
>  /* Trying to get PSW at zero address */
>  if (*((uint64_t *)0) & RESET_PSW_MASK) {
> -jump_to_IPL_code((*((uint64_t *)0)) & 0x7fff);
> +jump_to_IPL_code((*((uint64_t *)0)) & PSW_MASK_SHORT_ADDR);
>  }
>  
>  /* No other option left, so use the Linux kernel start address */
> diff --git a/pc-bios/s390-ccw/s390-arch.h b/pc-bios/s390-ccw/s390-arch.h
> index 73852029d4..6da44d4436 100644
> --- a/pc-bios/s390-ccw/s390-arch.h
> +++ b/pc-bios/s390-ccw/s390-arch.h
> @@ -26,9 +26,11 @@ _Static_assert(sizeof(struct PSWLegacy) == 8, "PSWLegacy 
> size incorrect");
>  
>  /* s390 psw bit masks */
>  #define PSW_MASK_IOINT  0x0200ULL
> +#define PSW_MASK_SHORTPSW   0x0008ULL
>  #define PSW_MASK_WAIT   0x0002ULL
>  #define PSW_MASK_EAMODE 0x0001ULL
>  #define PSW_MASK_BAMODE 0x8000ULL
> +#define PSW_MASK_SHORT_ADDR 0x7fffULL

Please also mention that new define in the patch description.

 Thomas
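
A small sketch of how the masks in this patch fit together. The constant
values are written out in full as an assumption based on the z/Architecture
PSW layout and should be checked against s390-arch.h. The two helper
functions are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, taken from the z/Architecture PSW bit positions. */
#define PSW_MASK_SHORTPSW   0x0008000000000000ULL
#define PSW_MASK_64         0x0000000100000000ULL
#define PSW_MASK_SHORT_ADDR 0x000000007fffffffULL
#define RESET_PSW_MASK      (PSW_MASK_SHORTPSW | PSW_MASK_64)

/* Hypothetical helper: build a short PSW that resumes at 'entry'. */
static uint64_t make_short_psw(uint64_t entry)
{
    /* A short PSW only has room for a 31-bit address. */
    assert((entry & ~PSW_MASK_SHORT_ADDR) == 0);
    return entry | RESET_PSW_MASK;
}

/* Hypothetical helper: recover the address, dropping the mask bits. */
static uint64_t short_psw_addr(uint64_t psw)
{
    return psw & PSW_MASK_SHORT_ADDR;
}
```

Naming the address mask makes the intent of the two "& ..." expressions in
jump2ipl.c visible at the call site.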

Re: [PATCH v2 4/9] pc-bios: s390x: Rename and use PSW_MASK_ZMODE constant

On 21/05/2020 07.44, Thomas Huth wrote:
> On 14/05/2020 14.37, Janosch Frank wrote:
>> ZMODE has a lot of ambiguity with the ESAME architecture mode, but is
>> actually 64 bit addressing.
>>
>> Signed-off-by: Janosch Frank 
>> Reviewed-by: Pierre Morel 
>> Reviewed-by: David Hildenbrand 
>> ---
>>  pc-bios/s390-ccw/dasd-ipl.c  | 3 +--
>>  pc-bios/s390-ccw/s390-arch.h | 2 +-
>>  2 files changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/pc-bios/s390-ccw/dasd-ipl.c b/pc-bios/s390-ccw/dasd-ipl.c
>> index 0fc879bb8e..b932531e6f 100644
>> --- a/pc-bios/s390-ccw/dasd-ipl.c
>> +++ b/pc-bios/s390-ccw/dasd-ipl.c
>> @@ -229,7 +229,6 @@ void dasd_ipl(SubChannelId schid, uint16_t cutype)
>>  run_ipl2(schid, cutype, ipl2_addr);
>>  
>>  /* Transfer control to the guest operating system */
>> -pswl->mask |= PSW_MASK_EAMODE;   /* Force z-mode */
>> -pswl->addr |= PSW_MASK_BAMODE;   /* ...  */
>> +pswl->mask |= PSW_MASK_64;   /* Force 64 bit addressing */
> 
> This is not only a rename (as announced in the subject), but also a
> change in behavior since you now do not change pswl->addr anymore. So
> this is even a bug fix? Could you please mention this in the patch
> description, too?

Ah, wait, pswl is of type PSWLegacy, and ->mask and ->addr are of type
uint32_t here! So it seems wrong to use a 64-bit value for mask here,
doesn't it?

 Thomas
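
To illustrate the concern, a minimal sketch of what happens when a 64-bit
constant is OR'ed into a 32-bit field. The struct is a stand-in for
PSWLegacy, the constant values are assumptions based on the 64-bit PSW
layout, and or_into_u32() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the 8-byte legacy PSW: two 32-bit halves. */
struct PSWLegacy {
    uint32_t mask;
    uint32_t addr;
};

static_assert(sizeof(struct PSWLegacy) == 8, "PSWLegacy size incorrect");

/* Assumed values: bit positions within a full 64-bit PSW. */
#define PSW_MASK_64     0x0000000100000000ULL
#define PSW_MASK_BAMODE 0x0000000080000000ULL

/*
 * The OR is evaluated in 64 bits, then the result is truncated to 32 bits
 * when it is stored back into the field.
 */
static uint32_t or_into_u32(uint32_t field, uint64_t bits)
{
    return (uint32_t)(field | bits);
}
```

With these values, bit 32 of the constant does not fit into a uint32_t, so
the statement changes nothing, while the BAMODE constant survives because
it sits in the low 32 bits.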

Re: [PATCH v2 4/9] pc-bios: s390x: Rename and use PSW_MASK_ZMODE constant

On 14/05/2020 14.37, Janosch Frank wrote:
> ZMODE has a lot of ambiguity with the ESAME architecture mode, but is
> actually 64 bit addressing.
> 
> Signed-off-by: Janosch Frank 
> Reviewed-by: Pierre Morel 
> Reviewed-by: David Hildenbrand 
> ---
>  pc-bios/s390-ccw/dasd-ipl.c  | 3 +--
>  pc-bios/s390-ccw/s390-arch.h | 2 +-
>  2 files changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/pc-bios/s390-ccw/dasd-ipl.c b/pc-bios/s390-ccw/dasd-ipl.c
> index 0fc879bb8e..b932531e6f 100644
> --- a/pc-bios/s390-ccw/dasd-ipl.c
> +++ b/pc-bios/s390-ccw/dasd-ipl.c
> @@ -229,7 +229,6 @@ void dasd_ipl(SubChannelId schid, uint16_t cutype)
>  run_ipl2(schid, cutype, ipl2_addr);
>  
>  /* Transfer control to the guest operating system */
> -pswl->mask |= PSW_MASK_EAMODE;   /* Force z-mode */
> -pswl->addr |= PSW_MASK_BAMODE;   /* ...  */
> +pswl->mask |= PSW_MASK_64;   /* Force 64 bit addressing */

This is not only a rename (as announced in the subject), but also a
change in behavior since you now do not change pswl->addr anymore. So
this is even a bug fix? Could you please mention this in the patch
description, too?

 Thanks,
  Thomas

Re: [PATCH Kernel v22 0/8] Add UAPIs to support migration for VFIO devices

2020-05-20 Thread Yan Zhao

On Wed, May 20, 2020 at 10:46:12AM -0600, Alex Williamson wrote:
> On Wed, 20 May 2020 19:10:07 +0530
> Kirti Wankhede  wrote:
> 
> > On 5/20/2020 8:25 AM, Yan Zhao wrote:
> > > On Tue, May 19, 2020 at 10:58:04AM -0600, Alex Williamson wrote:  
> > >> Hi folks,
> > >>
> > >> My impression is that we're getting pretty close to a workable
> > >> implementation here with v22 plus respins of patches 5, 6, and 8.  We
> > >> also have a matching QEMU series and a proposal for a new i40e
> > >> consumer, as well as I assume GVT-g updates happening internally at
> > >> Intel.  I expect all of the latter needs further review and discussion,
> > >> but we should be at the point where we can validate these proposed
> > >> kernel interfaces.  Therefore I'd like to make a call for reviews so
> > >> that we can get this wrapped up for the v5.8 merge window.  I know
> > >> Connie has some outstanding documentation comments and I'd like to make
> > >> sure everyone has an opportunity to check that their comments have been
> > >> addressed and we don't discover any new blocking issues.  Please send
> > >> your Acked-by/Reviewed-by/Tested-by tags if you're satisfied with this
> > >> interface and implementation.  Thanks!
> > >>  
> > > hi Alex and Kirti,
> > > after porting to qemu v22 and kernel v22, it is found out that
> > > it can not even pass basic live migration test with error like
> > > 
> > > "Failed to get dirty bitmap for iova: 0xca000 size: 0x3000 err: 22"
> > >   
> > 
> > Thanks for testing Yan.
> > I think last moment change in below cause this failure
> > 
> > https://lore.kernel.org/kvm/1589871178-8282-1-git-send-email-kwankh...@nvidia.com/
> > 
> >  >  if (dma->iova > iova + size)
> >  >  break;  
> > 
> > Surprisingly with my basic testing with 2G sys mem QEMU didn't raise 
> > abort on g_free, but I do hit this with large sys mem.
> > With above change, that function iterated through next vfio_dma as well. 
> > Check should be as below:
> > 
> > -   if (dma->iova > iova + size)
> > +   if (dma->iova > iova + size -1)
> 
> 
> Or just:
> 
>   if (dma->iova >= iova + size)
> 
> Thanks,
> Alex
> 
> 
> >  break;
> > 
> > Another fix is in QEMU.
> > https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg04751.html
> > 
> >  > > +range->bitmap.size = ROUND_UP(pages, 64) / 8;  
> >  >
> >  > ROUND_UP(npages/8, sizeof(u64))?
> >  >  
> > 
> > If npages < 8, npages/8 is 0 and ROUND_UP(0, 8) returns 0.
> > 
> > Changing it as below
> > 
> > -range->bitmap.size = ROUND_UP(pages / 8, sizeof(uint64_t));
> > +range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * 
> > BITS_PER_BYTE) /
> > + BITS_PER_BYTE;
> > 
> > I'm updating patches with these fixes and Cornelia's suggestion soon.
> > 
> > Due to short of time I may not be able to address all the concerns 
> > raised on previous versions of QEMU, I'm trying make QEMU side code 
> > available for testing for others with latest kernel changes. Don't 
> > worry, I will revisit comments on QEMU patches. Right now first priority 
> > is to test kernel UAPI and prepare kernel patches for 5.8
> > 
>
Hi Kirti,
after updating kernel/QEMU to v23, I still hit the two kinds of errors
below in a basic migration test.
(The guest VM size is 2G for all reported bugs.)

"Failed to get dirty bitmap for iova: 0xfe011000 size: 0x3fb0 err: 22"

or 

"qemu-system-x86_64-lm: vfio_load_state: Error allocating buffer
qemu-system-x86_64-lm: error while loading state section id 49(vfio)
qemu-system-x86_64-lm: load of migration failed: Cannot allocate memory"


Thanks
Yan
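
The two fixes discussed in the quoted text can be sketched in isolation.
This is a stand-alone model, not the kernel or QEMU code: the function
names are hypothetical and ROUND_UP here is a generic divide-and-multiply
version written for the example.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_BYTE 8
#define ROUND_UP(n, d) ((((n) + (d) - 1) / (d)) * (d))

/* Original check: misses a mapping that starts exactly at iova + size. */
static int dma_past_range_buggy(uint64_t dma_iova, uint64_t iova,
                                uint64_t size)
{
    return dma_iova > iova + size;
}

/* Suggested check: the requested range is [iova, iova + size). */
static int dma_past_range(uint64_t dma_iova, uint64_t iova, uint64_t size)
{
    assert(size > 0);
    return dma_iova >= iova + size; /* same as > iova + size - 1 */
}

/* Dividing first truncates: fewer than 8 pages gives a 0-byte bitmap. */
static uint64_t bitmap_bytes_truncating(uint64_t pages)
{
    return ROUND_UP(pages / BITS_PER_BYTE, sizeof(uint64_t));
}

/* Round the page count up to a multiple of 64 bits, then convert. */
static uint64_t bitmap_bytes(uint64_t pages)
{
    return ROUND_UP(pages, sizeof(uint64_t) * BITS_PER_BYTE) / BITS_PER_BYTE;
}
```

For the range from the first report (iova 0xca000, size 0x3000, i.e. three
4K pages), a mapping starting at 0xcd000 is adjacent to the range rather
than inside it, and the bitmap needs 8 bytes rather than 0.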

Re: [PATCH v2 1/2] spapr: Add associativity reference point count to machine info

On Thu, May 21, 2020 at 01:34:37AM +0200, Greg Kurz wrote:
> On Mon, 18 May 2020 16:44:17 -0500
> Reza Arbab  wrote:
> 
> > Make the number of NUMA associativity reference points a
> > machine-specific value, using the currently assumed default (two
> > reference points). This preps the next patch to conditionally change it.
> > 
> > Signed-off-by: Reza Arbab 
> > ---
> >  hw/ppc/spapr.c | 6 +-
> >  include/hw/ppc/spapr.h | 1 +
> >  2 files changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index c18eab0a2305..88b4a1f17716 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -889,10 +889,12 @@ static int spapr_dt_rng(void *fdt)
> >  static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
> >  {
> >  MachineState *ms = MACHINE(spapr);
> > +SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(ms);
> >  int rtas;
> >  GString *hypertas = g_string_sized_new(256);
> >  GString *qemu_hypertas = g_string_sized_new(256);
> >  uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
> > +uint32_t nr_refpoints;
> >  uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
> >  memory_region_size((spapr)->device_memory->mr);
> >  uint32_t lrdr_capacity[] = {
> > @@ -944,8 +946,9 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, 
> > void *fdt)
> >   qemu_hypertas->str, qemu_hypertas->len));
> >  g_string_free(qemu_hypertas, TRUE);
> >  
> > +nr_refpoints = MIN(smc->nr_assoc_refpoints, ARRAY_SIZE(refpoints));
> 
> Having the machine requesting more reference points than available
> would clearly be a bug. I'd rather add an assert() than silently
> clipping to the size of refpoints[].

Actually, I think this "num reference points" thing is a false
abstraction.  It's selecting a number of entries from a list of
reference points that's fixed.  The number of things we could do
simply by changing the machine property and not the array is pretty
small.

I think it would be simpler to just have a boolean in the machine
class.

> >  _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
> > - refpoints, sizeof(refpoints)));
> > + refpoints, nr_refpoints * sizeof(uint32_t)));
> >  
> 
> Size can be expressed without yet another explicit reference to the
> uint32_t type:
> 
> nr_refpoints * sizeof(refpoints[0])
> 
> >  _FDT(fdt_setprop(fdt, rtas, "ibm,max-associativity-domains",
> >   maxdomains, sizeof(maxdomains)));
> > @@ -4541,6 +4544,7 @@ static void spapr_machine_class_init(ObjectClass *oc, 
> > void *data)
> >  smc->linux_pci_probe = true;
> >  smc->smp_threads_vsmt = true;
> >  smc->nr_xirqs = SPAPR_NR_XIRQS;
> > +smc->nr_assoc_refpoints = 2;
> 
> When adding a new setting for the default machine type, we usually
> take care of older machine types at the same time, ie. folding this
> patch into the next one. Both patches are simple enough that it should
> be okay and this would avoid this line to be touched again.
> 
> >  xfc->match_nvt = spapr_match_nvt;
> >  }
> >  
> > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> > index e579eaf28c05..abaf9a92adc0 100644
> > --- a/include/hw/ppc/spapr.h
> > +++ b/include/hw/ppc/spapr.h
> > @@ -129,6 +129,7 @@ struct SpaprMachineClass {
> >  bool linux_pci_probe;
> >  bool smp_threads_vsmt; /* set VSMT to smp_threads by default */
> >  hwaddr rma_limit;  /* clamp the RMA to this size */
> > +uint32_t nr_assoc_refpoints;
> >  
> >  void (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
> >uint64_t *buid, hwaddr *pio, 
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
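
A rough sketch of how the suggestions in this thread could combine: a
boolean instead of a count, an assert instead of MIN(), and
sizeof(refpoints[0]) instead of naming the type. The flag name and the
helper are hypothetical, the array includes the third entry proposed in
patch 2/2, and the cpu_to_be32() conversion is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Hypothetical helper: returns how many bytes of refpoints[] would be
 * passed to fdt_setprop(). 'legacy_refpoints' stands in for a boolean
 * in the machine class that older machine types set to true.
 */
static size_t refpoints_prop_len(bool legacy_refpoints)
{
    static const uint32_t refpoints[] = { 0x4, 0x4, 0x2 };
    size_t nr = legacy_refpoints ? 2 : ARRAY_SIZE(refpoints);

    /* Asking for more entries than exist would be a programming error. */
    assert(nr <= ARRAY_SIZE(refpoints));

    return nr * sizeof(refpoints[0]);
}
```

Older machine types would expose the first two entries (8 bytes) and new
ones the whole array (12 bytes).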



Re: [PATCH v2 2/2] spapr: Add a new level of NUMA for GPUs

On Thu, May 21, 2020 at 01:36:16AM +0200, Greg Kurz wrote:
> On Mon, 18 May 2020 16:44:18 -0500
> Reza Arbab  wrote:
> 
> > NUMA nodes corresponding to GPU memory currently have the same
> > affinity/distance as normal memory nodes. Add a third NUMA associativity
> > reference point enabling us to give GPU nodes more distance.
> > 
> > This is guest visible information, which shouldn't change under a
> > running guest across migration between different qemu versions, so make
> > the change effective only in new (pseries > 5.0) machine types.
> > 
> > Before, `numactl -H` output in a guest with 4 GPUs (nodes 2-5):
> > 
> > node distances:
> > node   0   1   2   3   4   5
> >   0:  10  40  40  40  40  40
> >   1:  40  10  40  40  40  40
> >   2:  40  40  10  40  40  40
> >   3:  40  40  40  10  40  40
> >   4:  40  40  40  40  10  40
> >   5:  40  40  40  40  40  10
> > 
> > After:
> > 
> > node distances:
> > node   0   1   2   3   4   5
> >   0:  10  40  80  80  80  80
> >   1:  40  10  80  80  80  80
> >   2:  80  80  10  80  80  80
> >   3:  80  80  80  10  80  80
> >   4:  80  80  80  80  10  80
> >   5:  80  80  80  80  80  10
> > 
> > These are the same distances as on the host, mirroring the change made
> > to host firmware in skiboot commit f845a648b8cb ("numa/associativity:
> > Add a new level of NUMA for GPU's").
> > 
> > Signed-off-by: Reza Arbab 
> > ---
> >  hw/ppc/spapr.c | 11 +--
> >  hw/ppc/spapr_pci_nvlink2.c |  2 +-
> >  2 files changed, 10 insertions(+), 3 deletions(-)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 88b4a1f17716..1d9193d5ee49 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -893,7 +893,11 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, 
> > void *fdt)
> >  int rtas;
> >  GString *hypertas = g_string_sized_new(256);
> >  GString *qemu_hypertas = g_string_sized_new(256);
> > -uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
> > +uint32_t refpoints[] = {
> > +cpu_to_be32(0x4),
> > +cpu_to_be32(0x4),
> > +cpu_to_be32(0x2),
> > +};
> >  uint32_t nr_refpoints;
> >  uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
> >  memory_region_size((spapr)->device_memory->mr);
> > @@ -4544,7 +4548,7 @@ static void spapr_machine_class_init(ObjectClass *oc, 
> > void *data)
> >  smc->linux_pci_probe = true;
> >  smc->smp_threads_vsmt = true;
> >  smc->nr_xirqs = SPAPR_NR_XIRQS;
> > -smc->nr_assoc_refpoints = 2;
> > +smc->nr_assoc_refpoints = 3;
> >  xfc->match_nvt = spapr_match_nvt;
> >  }
> >  
> > @@ -4611,8 +4615,11 @@ DEFINE_SPAPR_MACHINE(5_1, "5.1", true);
> >   */
> >  static void spapr_machine_5_0_class_options(MachineClass *mc)
> >  {
> > +SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> > +
> >  spapr_machine_5_1_class_options(mc);
> >  compat_props_add(mc->compat_props, hw_compat_5_0, hw_compat_5_0_len);
> > +smc->nr_assoc_refpoints = 2;
> >  }
> >  
> >  DEFINE_SPAPR_MACHINE(5_0, "5.0", false);
> > diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
> > index 8332d5694e46..247fd48731e2 100644
> > --- a/hw/ppc/spapr_pci_nvlink2.c
> > +++ b/hw/ppc/spapr_pci_nvlink2.c
> > @@ -362,7 +362,7 @@ void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState 
> > *sphb, void *fdt)
> >  uint32_t associativity[] = {
> >  cpu_to_be32(0x4),
> >  SPAPR_GPU_NUMA_ID,
> > -SPAPR_GPU_NUMA_ID,
> > +cpu_to_be32(nvslot->numa_id),
> 
> This is a guest visible change. It should theoretically be controlled
> with a compat property of the PHB (look for "static GlobalProperty" in
> spapr.c). But since this code is only used for GPU passthrough and we
> don't support migration of such devices, I guess it's okay. Maybe just
> mention it in the changelog.

Yeah, we might get away with it, but it shouldn't be too hard to get this
right, so let's do it.

> 
> >  SPAPR_GPU_NUMA_ID,
> >  cpu_to_be32(nvslot->numa_id)
> >  };
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
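
To see why a third reference point changes the numactl table quoted above,
here is a small model of the distance calculation. It follows my
understanding of how a Linux guest derives distances (start at 10, double
for every reference point at which the two nodes differ, stop at the first
match), so treat it as an assumption rather than a specification. GPU_DOMAIN
is a placeholder for SPAPR_GPU_NUMA_ID, and byte order is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCAL_DISTANCE 10
#define GPU_DOMAIN     1u /* placeholder value for SPAPR_GPU_NUMA_ID */

/* ibm,associativity style array: [0] is the count, [1..4] are domains. */
typedef struct Node {
    uint32_t assoc[5];
} Node;

static Node mem_node(uint32_t id)
{
    Node n = {{ 4, 0, 0, 0, id }};
    return n;
}

/* GPU node layout as proposed in the patch: numa_id also at index 2. */
static Node gpu_node(uint32_t id)
{
    Node n = {{ 4, GPU_DOMAIN, id, GPU_DOMAIN, id }};
    return n;
}

static const uint32_t refpoints_old[] = { 4, 4 };
static const uint32_t refpoints_new[] = { 4, 4, 2 };

static int node_distance(Node a, Node b, const uint32_t *refpoints,
                         size_t nr)
{
    int distance = LOCAL_DISTANCE;

    for (size_t i = 0; i < nr; i++) {
        assert(refpoints[i] >= 1 && refpoints[i] <= 4);
        if (a.assoc[refpoints[i]] == b.assoc[refpoints[i]]) {
            break; /* same domain at this level */
        }
        distance *= 2; /* one NUMA level further apart */
    }
    return distance;
}
```

Two normal memory nodes share domain 0 at index 2, so they stop at 40. A
GPU node differs from every other node at index 2 as well, which gives 80.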



[PATCH v4 08/10] Support adding individual regions in libvhost-user

When the VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS is enabled, qemu will
transmit memory regions to a backend individually using the new message
VHOST_USER_ADD_MEM_REG. With this change vhost-user backends built with
libvhost-user can now map in new memory regions when VHOST_USER_ADD_MEM_REG
messages are received.

Qemu only sends VHOST_USER_ADD_MEM_REG messages when the
VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS feature is negotiated, and
since it is not yet supported in libvhost-user, this new functionality
is not yet used.

Signed-off-by: Raphael Norwitz 
---
 contrib/libvhost-user/libvhost-user.c | 103 ++
 contrib/libvhost-user/libvhost-user.h |   7 +++
 2 files changed, 110 insertions(+)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 9f039b7..d8ee7a2 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -138,6 +138,7 @@ vu_request_to_string(unsigned int req)
 REQ(VHOST_USER_GPU_SET_SOCKET),
 REQ(VHOST_USER_VRING_KICK),
 REQ(VHOST_USER_GET_MAX_MEM_SLOTS),
+REQ(VHOST_USER_ADD_MEM_REG),
 REQ(VHOST_USER_MAX),
 };
 #undef REQ
@@ -663,6 +664,106 @@ generate_faults(VuDev *dev) {
 }
 
 static bool
+vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
+int i;
+bool track_ramblocks = dev->postcopy_listening;
+VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = &m;
+VuDevRegion *dev_region = &dev->regions[dev->nregions];
+void *mmap_addr;
+
+/*
+ * If we are in postcopy mode and we receive a u64 payload with a 0 value
+ * we know all the postcopy client bases have been received, and we
+ * should start generating faults.
+ */
+if (track_ramblocks &&
+vmsg->size == sizeof(vmsg->payload.u64) &&
+vmsg->payload.u64 == 0) {
+(void)generate_faults(dev);
+return false;
+}
+
+DPRINT("Adding region: %d\n", dev->nregions);
+DPRINT("guest_phys_addr: 0x%016"PRIx64"\n",
+   msg_region->guest_phys_addr);
+DPRINT("memory_size: 0x%016"PRIx64"\n",
+   msg_region->memory_size);
+DPRINT("userspace_addr   0x%016"PRIx64"\n",
+   msg_region->userspace_addr);
+DPRINT("mmap_offset  0x%016"PRIx64"\n",
+   msg_region->mmap_offset);
+
+dev_region->gpa = msg_region->guest_phys_addr;
+dev_region->size = msg_region->memory_size;
+dev_region->qva = msg_region->userspace_addr;
+dev_region->mmap_offset = msg_region->mmap_offset;
+
+/*
+ * We don't use offset argument of mmap() since the
+ * mapped address has to be page aligned, and we use huge
+ * pages.
+ */
+if (track_ramblocks) {
+/*
+ * In postcopy we're using PROT_NONE here to catch anyone
+ * accessing it before we userfault.
+ */
+mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
+ PROT_NONE, MAP_SHARED,
+ vmsg->fds[0], 0);
+} else {
+mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
+ PROT_READ | PROT_WRITE, MAP_SHARED, vmsg->fds[0],
+ 0);
+}
+
+if (mmap_addr == MAP_FAILED) {
+vu_panic(dev, "region mmap error: %s", strerror(errno));
+} else {
+dev_region->mmap_addr = (uint64_t)(uintptr_t)mmap_addr;
+DPRINT("mmap_addr:   0x%016"PRIx64"\n",
+   dev_region->mmap_addr);
+}
+
+close(vmsg->fds[0]);
+
+if (track_ramblocks) {
+/*
+ * Return the address to QEMU so that it can translate the ufd
+ * fault addresses back.
+ */
+msg_region->userspace_addr = (uintptr_t)(mmap_addr +
+ dev_region->mmap_offset);
+
+/* Send the message back to qemu with the addresses filled in. */
+vmsg->fd_num = 0;
+if (!vu_send_reply(dev, dev->sock, vmsg)) {
+vu_panic(dev, "failed to respond to add-mem-region for postcopy");
+return false;
+}
+
+DPRINT("Successfully added new region in postcopy\n");
+dev->nregions++;
+return false;
+
+} else {
+for (i = 0; i < dev->max_queues; i++) {
+if (dev->vq[i].vring.desc) {
+if (map_ring(dev, &dev->vq[i])) {
+vu_panic(dev, "remapping queue %d for new memory region",
+ i);
+}
+}
+}
+
+DPRINT("Successfully added new region\n");
+dev->nregions++;
+vmsg_set_reply_u64(vmsg, 0);
+return true;
+}
+}
+
+static bool
 vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
 {
 int i;
@@ -1668,6 +1769,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
 return vu_handle_vring_kick(dev, vmsg);
 case VHOST_USER_GET_MAX_MEM_SLOTS:
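
A simplified model of the bookkeeping done when a region is added, and of
why mmap_offset is stored next to the mapping address. Because the file is
mapped from offset 0 with length size + mmap_offset, the region's first
byte sits mmap_offset bytes into the mapping. All names and the fixed table
size below are hypothetical, and no real mmap() call is made.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_REGIONS 8 /* hypothetical table size for this sketch */

typedef struct Region {
    uint64_t gpa;         /* guest physical address of the region */
    uint64_t size;        /* region length in bytes */
    uint64_t mmap_offset; /* where the region starts inside the mapping */
    uint64_t mmap_addr;   /* base address returned by mmap() */
} Region;

typedef struct Dev {
    uint32_t nregions;
    Region regions[MAX_REGIONS];
} Dev;

/* Append one region; returns -1 when the table is already full. */
static int dev_add_region(Dev *dev, uint64_t gpa, uint64_t size,
                          uint64_t mmap_offset, uint64_t mmap_addr)
{
    assert(dev->nregions <= MAX_REGIONS);
    if (dev->nregions == MAX_REGIONS) {
        return -1;
    }

    Region *r = &dev->regions[dev->nregions];
    r->gpa = gpa;
    r->size = size;
    r->mmap_offset = mmap_offset;
    r->mmap_addr = mmap_addr;
    dev->nregions++;
    return 0;
}

/* Translate a guest physical address; returns 0 when nothing matches. */
static uint64_t dev_gpa_to_va(const Dev *dev, uint64_t gpa)
{
    for (uint32_t i = 0; i < dev->nregions; i++) {
        const Region *r = &dev->regions[i];

        if (gpa >= r->gpa && gpa - r->gpa < r->size) {
            return gpa - r->gpa + r->mmap_addr + r->mmap_offset;
        }
    }
    return 0;
}
```

The bounds check in dev_add_region() is an addition of this sketch.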

[PATCH v4 10/10] Lift max ram slots limit in libvhost-user

Historically, VMs with vhost-user devices could hot-add memory a maximum
of 8 times. Now that the VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS
protocol feature has been added, VMs with vhost-user backends which
support this new feature can support a configurable number of ram slots
up to the maximum supported by the target platform.

This change adds VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS support for
backends built with libvhost-user, and increases the number of supported
ram slots from 8 to 32.

Memory hot-add, hot-remove and postcopy migration were tested with
the vhost-user-bridge sample.

Signed-off-by: Raphael Norwitz 
---
 contrib/libvhost-user/libvhost-user.c | 17 +
 contrib/libvhost-user/libvhost-user.h | 15 +++
 2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 386449b..b1e6072 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -269,7 +269,7 @@ have_userfault(void)
 static bool
 vu_message_read(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
 {
-char control[CMSG_SPACE(VHOST_MEMORY_MAX_NREGIONS * sizeof(int))] = { };
+char control[CMSG_SPACE(VHOST_MEMORY_BASELINE_NREGIONS * sizeof(int))] = {};
 struct iovec iov = {
 .iov_base = (char *)vmsg,
 .iov_len = VHOST_USER_HDR_SIZE,
@@ -340,7 +340,7 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
 {
 int rc;
 uint8_t *p = (uint8_t *)vmsg;
-char control[CMSG_SPACE(VHOST_MEMORY_MAX_NREGIONS * sizeof(int))] = { };
+char control[CMSG_SPACE(VHOST_MEMORY_BASELINE_NREGIONS * sizeof(int))] = {};
 struct iovec iov = {
 .iov_base = (char *)vmsg,
 .iov_len = VHOST_USER_HDR_SIZE,
@@ -353,7 +353,7 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
 struct cmsghdr *cmsg;
 
 memset(control, 0, sizeof(control));
-assert(vmsg->fd_num <= VHOST_MEMORY_MAX_NREGIONS);
+assert(vmsg->fd_num <= VHOST_MEMORY_BASELINE_NREGIONS);
 if (vmsg->fd_num > 0) {
 size_t fdsize = vmsg->fd_num * sizeof(int);
 msg.msg_controllen = CMSG_SPACE(fdsize);
@@ -780,7 +780,7 @@ static bool
 vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 int i, j;
 bool found = false;
-VuDevRegion shadow_regions[VHOST_MEMORY_MAX_NREGIONS] = {};
+VuDevRegion shadow_regions[VHOST_USER_MAX_RAM_SLOTS] = {};
 VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = &m;
 
 DPRINT("Removing region:\n");
@@ -813,7 +813,7 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 
 if (found) {
 memcpy(dev->regions, shadow_regions,
-   sizeof(VuDevRegion) * VHOST_MEMORY_MAX_NREGIONS);
+   sizeof(VuDevRegion) * VHOST_USER_MAX_RAM_SLOTS);
 DPRINT("Successfully removed a region\n");
 dev->nregions--;
 vmsg_set_reply_u64(vmsg, 0);
@@ -1394,7 +1394,8 @@ vu_get_protocol_features_exec(VuDev *dev, VhostUserMsg *vmsg)
 1ULL << VHOST_USER_PROTOCOL_F_SLAVE_REQ |
 1ULL << VHOST_USER_PROTOCOL_F_HOST_NOTIFIER |
 1ULL << VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD |
-1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK;
+1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK |
+1ULL << VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS;
 
 if (have_userfault()) {
 features |= 1ULL << VHOST_USER_PROTOCOL_F_PAGEFAULT;
@@ -1732,14 +1733,14 @@ static bool vu_handle_get_max_memslots(VuDev *dev, VhostUserMsg *vmsg)
 {
 vmsg->flags = VHOST_USER_REPLY_MASK | VHOST_USER_VERSION;
 vmsg->size  = sizeof(vmsg->payload.u64);
-vmsg->payload.u64 = VHOST_MEMORY_MAX_NREGIONS;
+vmsg->payload.u64 = VHOST_USER_MAX_RAM_SLOTS;
 vmsg->fd_num = 0;
 
 if (!vu_message_write(dev, dev->sock, vmsg)) {
 vu_panic(dev, "Failed to send max ram slots: %s\n", strerror(errno));
 }
 
-DPRINT("u64: 0x%016"PRIx64"\n", (uint64_t) VHOST_MEMORY_MAX_NREGIONS);
+DPRINT("u64: 0x%016"PRIx64"\n", (uint64_t) VHOST_USER_MAX_RAM_SLOTS);
 
 return false;
 }
diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
index f843971..844c37c 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -28,7 +28,13 @@
 
 #define VIRTQUEUE_MAX_SIZE 1024
 
-#define VHOST_MEMORY_MAX_NREGIONS 8
+#define VHOST_MEMORY_BASELINE_NREGIONS 8
+
+/*
+ * Set a reasonable maximum number of ram slots, which will be supported by
+ * any architecture.
+ */
+#define VHOST_USER_MAX_RAM_SLOTS 32
 
 typedef enum VhostSetConfigType {
 VHOST_SET_CONFIG_TYPE_MASTER = 0,
@@ -55,6 +61,7 @@ enum VhostUserProtocolFeature {
 VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11,
 VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
 VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS = 14,
+
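
A hedged sketch of the negotiation this series describes: without the
protocol feature the baseline of 8 regions applies, and with it the
backend's answer is capped at what the target platform supports. The
function name is hypothetical, and the feature bit number 15 is an
assumption, since the enum value is not visible in the diff above.

```c
#include <assert.h>
#include <stdint.h>

#define BASELINE_NREGIONS        8
#define F_CONFIGURE_MEM_SLOTS   15 /* assumed bit number */

/*
 * Hypothetical helper: how many RAM slots can be used with a backend,
 * given the negotiated protocol features, the backend's reply to
 * VHOST_USER_GET_MAX_MEM_SLOTS, and the platform limit.
 */
static uint64_t usable_ram_slots(uint64_t protocol_features,
                                 uint64_t backend_max,
                                 uint64_t platform_max)
{
    assert(platform_max > 0);

    if (!(protocol_features & (1ULL << F_CONFIGURE_MEM_SLOTS))) {
        return BASELINE_NREGIONS; /* single-message limit still applies */
    }
    return backend_max < platform_max ? backend_max : platform_max;
}
```

With the libvhost-user value of 32 and a platform limit of 256, the result
is 32. A backend that advertises more than the platform allows is capped.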

[PATCH v4 07/10] Support ram slot configuration in libvhost-user

The VHOST_USER_GET_MAX_MEM_SLOTS message allows a vhost-user backend to
specify a maximum number of ram slots it is willing to support. This
change adds support for libvhost-user to process this message. For now
the backend will reply with 8 as the maximum number of regions
supported.

libvhost-user does not yet support the vhost-user protocol feature
VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS, so qemu should never
send the VHOST_USER_GET_MAX_MEM_SLOTS message. Therefore this new
functionality is not currently used.

Signed-off-by: Raphael Norwitz 
---
 contrib/libvhost-user/libvhost-user.c | 19 +++
 contrib/libvhost-user/libvhost-user.h |  1 +
 2 files changed, 20 insertions(+)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index cccfa22..9f039b7 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -137,6 +137,7 @@ vu_request_to_string(unsigned int req)
 REQ(VHOST_USER_SET_INFLIGHT_FD),
 REQ(VHOST_USER_GPU_SET_SOCKET),
 REQ(VHOST_USER_VRING_KICK),
+REQ(VHOST_USER_GET_MAX_MEM_SLOTS),
 REQ(VHOST_USER_MAX),
 };
 #undef REQ
@@ -1565,6 +1566,22 @@ vu_handle_vring_kick(VuDev *dev, VhostUserMsg *vmsg)
 return false;
 }
 
+static bool vu_handle_get_max_memslots(VuDev *dev, VhostUserMsg *vmsg)
+{
+vmsg->flags = VHOST_USER_REPLY_MASK | VHOST_USER_VERSION;
+vmsg->size  = sizeof(vmsg->payload.u64);
+vmsg->payload.u64 = VHOST_MEMORY_MAX_NREGIONS;
+vmsg->fd_num = 0;
+
+if (!vu_message_write(dev, dev->sock, vmsg)) {
+vu_panic(dev, "Failed to send max ram slots: %s\n", strerror(errno));
+}
+
+DPRINT("u64: 0x%016"PRIx64"\n", (uint64_t) VHOST_MEMORY_MAX_NREGIONS);
+
+return false;
+}
+
 static bool
 vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
 {
@@ -1649,6 +1666,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
 return vu_set_inflight_fd(dev, vmsg);
 case VHOST_USER_VRING_KICK:
 return vu_handle_vring_kick(dev, vmsg);
+case VHOST_USER_GET_MAX_MEM_SLOTS:
+return vu_handle_get_max_memslots(dev, vmsg);
 default:
 vmsg_close_fds(vmsg);
 vu_panic(dev, "Unhandled request: %d", vmsg->request);
diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
index f30394f..88ef40d 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -97,6 +97,7 @@ typedef enum VhostUserRequest {
 VHOST_USER_SET_INFLIGHT_FD = 32,
 VHOST_USER_GPU_SET_SOCKET = 33,
 VHOST_USER_VRING_KICK = 35,
+VHOST_USER_GET_MAX_MEM_SLOTS = 36,
 VHOST_USER_MAX
 } VhostUserRequest;
 
-- 
1.8.3.1
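
The reply built by the new handler can be modelled on its own. The message
struct below is a cut-down stand-in for VhostUserMsg, and the flag values
(version 1 in the low bits, reply bit at 0x4) are assumptions based on my
reading of the vhost-user header layout. The request number 36 is the one
added by this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flag encoding for this sketch. */
#define VHOST_USER_VERSION           0x1u
#define VHOST_USER_REPLY_MASK        (0x1u << 2)
#define VHOST_USER_GET_MAX_MEM_SLOTS 36

/* Cut-down stand-in for VhostUserMsg. */
typedef struct Msg {
    uint32_t request;
    uint32_t flags;
    uint32_t size;
    union {
        uint64_t u64;
    } payload;
    int fd_num;
} Msg;

/* Build the reply carrying the maximum number of RAM slots. */
static Msg make_max_slots_reply(uint64_t max_slots)
{
    Msg msg = {0};

    msg.request = VHOST_USER_GET_MAX_MEM_SLOTS;
    msg.flags = VHOST_USER_REPLY_MASK | VHOST_USER_VERSION;
    msg.size = sizeof(msg.payload.u64);
    msg.payload.u64 = max_slots;
    msg.fd_num = 0; /* no file descriptors travel with this reply */

    assert(msg.size == 8);
    return msg;
}
```

The handler in the patch returns false after writing the message itself,
presumably so that the caller does not send a second reply.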

[PATCH v4 05/10] Lift max memory slots limit imposed by vhost-user

Historically, sending all memory regions to vhost-user backends in a
single message imposed a limitation on the number of times memory
could be hot-added to a VM with a vhost-user device. Now that memory
regions are sent individually to backends which support
VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS, we no longer need to impose
this limitation on devices which support this feature.

With this change, VMs with a vhost-user device which supports the
VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS can support a configurable
number of memory slots, up to the maximum allowed by the target
platform.

Existing backends which do not support
VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS are unaffected.

Signed-off-by: Raphael Norwitz 
Signed-off-by: Peter Turschmid 
Suggested-by: Mike Cui 
---
 docs/interop/vhost-user.rst |  7 +++---
 hw/virtio/vhost-user.c  | 56 ++---
 2 files changed, 40 insertions(+), 23 deletions(-)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 037eefa..688b7c6 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -1273,10 +1273,9 @@ Master message types
   feature has been successfully negotiated, this message is submitted
   by master to the slave. The slave should return the message with a
   u64 payload containing the maximum number of memory slots for
-  QEMU to expose to the guest. At this point, the value returned
-  by the backend will be capped at the maximum number of ram slots
-  which can be supported by vhost-user. Currently that limit is set
-  at VHOST_USER_MAX_RAM_SLOTS = 8.
+  QEMU to expose to the guest. The value returned by the backend
+  will be capped at the maximum number of ram slots which can be
+  supported by the target platform.
 
 ``VHOST_USER_ADD_MEM_REG``
   :id: 37
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 9358406..48b8081 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -35,11 +35,29 @@
 #include 
 #endif
 
-#define VHOST_MEMORY_MAX_NREGIONS8
+#define VHOST_MEMORY_BASELINE_NREGIONS8
 #define VHOST_USER_F_PROTOCOL_FEATURES 30
 #define VHOST_USER_SLAVE_MAX_FDS 8
 
 /*
+ * Set maximum number of RAM slots supported to
+ * the maximum number supported by the target
+ * hardware platform.
+ */
+#if defined(TARGET_X86) || defined(TARGET_X86_64) || \
+defined(TARGET_ARM) || defined(TARGET_ARM_64)
+#include "hw/acpi/acpi.h"
+#define VHOST_USER_MAX_RAM_SLOTS ACPI_MAX_RAM_SLOTS
+
+#elif defined(TARGET_PPC) || defined(TARGET_PPC_64)
+#include "hw/ppc/spapr.h"
+#define VHOST_USER_MAX_RAM_SLOTS SPAPR_MAX_RAM_SLOTS
+
+#else
+#define VHOST_USER_MAX_RAM_SLOTS 512
+#endif
+
+/*
  * Maximum size of virtio device config space
  */
 #define VHOST_USER_MAX_CONFIG_SIZE 256
@@ -127,7 +145,7 @@ typedef struct VhostUserMemoryRegion {
 typedef struct VhostUserMemory {
 uint32_t nregions;
 uint32_t padding;
-VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];
+VhostUserMemoryRegion regions[VHOST_MEMORY_BASELINE_NREGIONS];
 } VhostUserMemory;
 
 typedef struct VhostUserMemRegMsg {
@@ -222,7 +240,7 @@ struct vhost_user {
 int slave_fd;
 NotifierWithReturn postcopy_notifier;
 struct PostCopyFD  postcopy_fd;
-uint64_t   postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS];
+uint64_t   postcopy_client_bases[VHOST_USER_MAX_RAM_SLOTS];
 /* Length of the region_rb and region_rb_offset arrays */
 size_t region_rb_len;
 /* RAMBlock associated with a given region */
@@ -237,7 +255,7 @@ struct vhost_user {
 
 /* Our current regions */
 int num_shadow_regions;
-struct vhost_memory_region shadow_regions[VHOST_MEMORY_MAX_NREGIONS];
+struct vhost_memory_region shadow_regions[VHOST_USER_MAX_RAM_SLOTS];
 };
 
 struct scrub_regions {
@@ -392,7 +410,7 @@ int vhost_user_gpu_set_socket(struct vhost_dev *dev, int fd)
 static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base,
struct vhost_log *log)
 {
-int fds[VHOST_MEMORY_MAX_NREGIONS];
+int fds[VHOST_USER_MAX_RAM_SLOTS];
 size_t fd_num = 0;
 bool shmfd = virtio_has_feature(dev->protocol_features,
 VHOST_USER_PROTOCOL_F_LOG_SHMFD);
@@ -470,7 +488,7 @@ static int vhost_user_fill_set_mem_table_msg(struct vhost_user *u,
 mr = vhost_user_get_mr_data(reg->userspace_addr, &offset, &fd);
 if (fd > 0) {
 if (track_ramblocks) {
-assert(*fd_num < VHOST_MEMORY_MAX_NREGIONS);
+assert(*fd_num < VHOST_MEMORY_BASELINE_NREGIONS);
 trace_vhost_user_set_mem_table_withfd(*fd_num, mr->name,
   reg->memory_size,
   reg->guest_phys_addr,
@@ -478,7 +496,7 @@ static int vhost_user_fill_set_mem_table_msg(struct 
vhost_user *u,

[PATCH v4 09/10] Support individual region unmap in libvhost-user

When the VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS protocol feature is
enabled, on memory hot-unplug QEMU will transmit the memory regions to
remove individually using the new VHOST_USER_REM_MEM_REG message. With
this change, vhost-user backends built with libvhost-user can unmap
individual memory regions when receiving the VHOST_USER_REM_MEM_REG
message.

QEMU only sends VHOST_USER_REM_MEM_REG messages when the
VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS feature is negotiated. Since
support for that feature has not yet been added to libvhost-user, this
new functionality is not yet used.
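
As a rough illustration of the removal step, the sketch below drops one
region from a backend's table by matching on guest physical address, size
and QEMU virtual address. It is not the code from this patch: the Region
type and both function names are hypothetical, and it shifts the array in
place and stops at the first match, whereas the patch rebuilds the table
through a shadow array. A real backend would also munmap() the mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a backend's region record. */
typedef struct {
    uint64_t gpa;   /* guest physical address */
    uint64_t size;  /* length of the region in bytes */
    uint64_t qva;   /* address of the region in QEMU's address space */
} Region;

/* Two regions are the same if all three identifying fields match. */
static bool region_equal(const Region *a, const Region *b)
{
    return a->gpa == b->gpa && a->size == b->size && a->qva == b->qva;
}

/*
 * Remove the first region equal to *victim from regs[0..*n) by shifting
 * the tail down one slot. Returns true if a region was removed.
 */
static bool remove_region(Region *regs, size_t *n, const Region *victim)
{
    assert(regs != NULL && n != NULL && victim != NULL);
    for (size_t i = 0; i < *n; i++) {
        if (region_equal(&regs[i], victim)) {
            for (size_t j = i + 1; j < *n; j++) {
                regs[j - 1] = regs[j];
            }
            (*n)--;
            return true;
        }
    }
    return false;
}
```

A caller would report an error when remove_region() returns false, which
corresponds to the "Specified region not found" case in the patch.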

Signed-off-by: Raphael Norwitz 
---
 contrib/libvhost-user/libvhost-user.c | 63 +++
 contrib/libvhost-user/libvhost-user.h |  1 +
 2 files changed, 64 insertions(+)

diff --git a/contrib/libvhost-user/libvhost-user.c 
b/contrib/libvhost-user/libvhost-user.c
index d8ee7a2..386449b 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -139,6 +139,7 @@ vu_request_to_string(unsigned int req)
 REQ(VHOST_USER_VRING_KICK),
 REQ(VHOST_USER_GET_MAX_MEM_SLOTS),
 REQ(VHOST_USER_ADD_MEM_REG),
+REQ(VHOST_USER_REM_MEM_REG),
 REQ(VHOST_USER_MAX),
 };
 #undef REQ
@@ -763,6 +764,66 @@ vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 }
 }
 
+static inline bool reg_equal(VuDevRegion *vudev_reg,
+ VhostUserMemoryRegion *msg_reg)
+{
+if (vudev_reg->gpa == msg_reg->guest_phys_addr &&
+vudev_reg->qva == msg_reg->userspace_addr &&
+vudev_reg->size == msg_reg->memory_size) {
+return true;
+}
+
+return false;
+}
+
+static bool
+vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
+int i, j;
+bool found = false;
+VuDevRegion shadow_regions[VHOST_MEMORY_MAX_NREGIONS] = {};
+VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = &m;
+
+DPRINT("Removing region:\n");
+DPRINT("guest_phys_addr: 0x%016"PRIx64"\n",
+   msg_region->guest_phys_addr);
+DPRINT("memory_size: 0x%016"PRIx64"\n",
+   msg_region->memory_size);
+DPRINT("userspace_addr   0x%016"PRIx64"\n",
+   msg_region->userspace_addr);
+DPRINT("mmap_offset  0x%016"PRIx64"\n",
+   msg_region->mmap_offset);
+
+for (i = 0, j = 0; i < dev->nregions; i++) {
+if (!reg_equal(&dev->regions[i], msg_region)) {
+shadow_regions[j].gpa = dev->regions[i].gpa;
+shadow_regions[j].size = dev->regions[i].size;
+shadow_regions[j].qva = dev->regions[i].qva;
+shadow_regions[j].mmap_offset = dev->regions[i].mmap_offset;
+j++;
+} else {
+found = true;
+VuDevRegion *r = &dev->regions[i];
+void *m = (void *) (uintptr_t) r->mmap_addr;
+
+if (m) {
+munmap(m, r->size + r->mmap_offset);
+}
+}
+}
+
+if (found) {
+memcpy(dev->regions, shadow_regions,
+   sizeof(VuDevRegion) * VHOST_MEMORY_MAX_NREGIONS);
+DPRINT("Successfully removed a region\n");
+dev->nregions--;
+vmsg_set_reply_u64(vmsg, 0);
+} else {
+vu_panic(dev, "Specified region not found\n");
+}
+
+return true;
+}
+
 static bool
 vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
 {
@@ -1771,6 +1832,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
 return vu_handle_get_max_memslots(dev, vmsg);
 case VHOST_USER_ADD_MEM_REG:
 return vu_add_mem_reg(dev, vmsg);
+case VHOST_USER_REM_MEM_REG:
+return vu_rem_mem_reg(dev, vmsg);
 default:
 vmsg_close_fds(vmsg);
 vu_panic(dev, "Unhandled request: %d", vmsg->request);
diff --git a/contrib/libvhost-user/libvhost-user.h 
b/contrib/libvhost-user/libvhost-user.h
index 60ef7fd..f843971 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -99,6 +99,7 @@ typedef enum VhostUserRequest {
 VHOST_USER_VRING_KICK = 35,
 VHOST_USER_GET_MAX_MEM_SLOTS = 36,
 VHOST_USER_ADD_MEM_REG = 37,
+VHOST_USER_REM_MEM_REG = 38,
 VHOST_USER_MAX
 } VhostUserRequest;
 
-- 
1.8.3.1

[PATCH v4 06/10] Refactor out libvhost-user fault generation logic

In libvhost-user, the incoming postcopy migration path for setting the
backend's memory tables has become convoluted. In particular, the logic
which starts generating faults after the final ACK has been received
from QEMU can be moved to a separate function. This simplifies the code
substantially.

This logic will also be needed by the postcopy path once the
VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS feature is supported.

Signed-off-by: Raphael Norwitz 
---
 contrib/libvhost-user/libvhost-user.c | 147 ++
 1 file changed, 79 insertions(+), 68 deletions(-)

diff --git a/contrib/libvhost-user/libvhost-user.c 
b/contrib/libvhost-user/libvhost-user.c
index 3bca996..cccfa22 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -584,6 +584,84 @@ map_ring(VuDev *dev, VuVirtq *vq)
 }
 
 static bool
+generate_faults(VuDev *dev) {
+int i;
+for (i = 0; i < dev->nregions; i++) {
+VuDevRegion *dev_region = &dev->regions[i];
+int ret;
+#ifdef UFFDIO_REGISTER
+/*
+ * We should already have an open ufd. Mark each memory
+ * range as ufd.
+ * Discard any mapping we have here; note I can't use MADV_REMOVE
+ * or fallocate to make the hole since I don't want to lose
+ * data that's already arrived in the shared process.
+ * TODO: How to do hugepage
+ */
+ret = madvise((void *)(uintptr_t)dev_region->mmap_addr,
+  dev_region->size + dev_region->mmap_offset,
+  MADV_DONTNEED);
+if (ret) {
+fprintf(stderr,
+"%s: Failed to madvise(DONTNEED) region %d: %s\n",
+__func__, i, strerror(errno));
+}
+/*
+ * Turn off transparent hugepages so we don't get lost wakeups
+ * in neighbouring pages.
+ * TODO: Turn this back on later.
+ */
+ret = madvise((void *)(uintptr_t)dev_region->mmap_addr,
+  dev_region->size + dev_region->mmap_offset,
+  MADV_NOHUGEPAGE);
+if (ret) {
+/*
+ * Note: This can happen legally on kernels that are configured
+ * without madvise'able hugepages
+ */
+fprintf(stderr,
+"%s: Failed to madvise(NOHUGEPAGE) region %d: %s\n",
+__func__, i, strerror(errno));
+}
+struct uffdio_register reg_struct;
+reg_struct.range.start = (uintptr_t)dev_region->mmap_addr;
+reg_struct.range.len = dev_region->size + dev_region->mmap_offset;
+reg_struct.mode = UFFDIO_REGISTER_MODE_MISSING;
+
+if (ioctl(dev->postcopy_ufd, UFFDIO_REGISTER, &reg_struct)) {
+vu_panic(dev, "%s: Failed to userfault region %d "
+  "@%p + size:%zx offset: %zx: (ufd=%d)%s\n",
+ __func__, i,
+ dev_region->mmap_addr,
+ dev_region->size, dev_region->mmap_offset,
+ dev->postcopy_ufd, strerror(errno));
+return false;
+}
+if (!(reg_struct.ioctls & ((__u64)1 << _UFFDIO_COPY))) {
+vu_panic(dev, "%s Region (%d) doesn't support COPY",
+ __func__, i);
+return false;
+}
+DPRINT("%s: region %d: Registered userfault for %"
+   PRIx64 " + %" PRIx64 "\n", __func__, i,
+   (uint64_t)reg_struct.range.start,
+   (uint64_t)reg_struct.range.len);
+/* Now it's registered we can let the client at it */
+if (mprotect((void *)(uintptr_t)dev_region->mmap_addr,
+ dev_region->size + dev_region->mmap_offset,
+ PROT_READ | PROT_WRITE)) {
+vu_panic(dev, "failed to mprotect region %d for postcopy (%s)",
+ i, strerror(errno));
+return false;
+}
+/* TODO: Stash 'zero' support flags somewhere */
+#endif
+}
+
+return true;
+}
+
+static bool
 vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
 {
 int i;
@@ -655,74 +733,7 @@ vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg 
*vmsg)
 }
 
 /* OK, now we can go and register the memory and generate faults */
-for (i = 0; i < dev->nregions; i++) {
-VuDevRegion *dev_region = &dev->regions[i];
-int ret;
-#ifdef UFFDIO_REGISTER
-/* We should already have an open ufd. Mark each memory
- * range as ufd.
- * Discard any mapping we have here; note I can't use MADV_REMOVE
- * or fallocate to make the hole since I don't want to lose
- * data that's already arrived in the shared process.
- * TODO: How to do hugepage
- */
-ret = madvise((void *)(uintptr_t)dev_region->mmap_addr,
-  dev_region->size + dev_region->mmap_offset,
-  MADV_DONTNEED);
-if (ret) {
-

[PATCH v4 00/10] vhost-user: Lift Max Ram Slots Limitation

In QEMU today, a VM with a vhost-user device can hot add memory a
maximum of 8 times. See these threads, among others:

[1] https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg01046.html
https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg01236.html

[2] https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg04656.html

This series introduces a new protocol feature
VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS which, when enabled, lifts the
restriction on the maximum number of RAM slots imposed by vhost-user.

Without vhost-user, a QEMU VM can support 256 ram slots (for ACPI
targets), or potentially more (the KVM max is 512). With each region, a
file descriptor must be sent over the socket. If that many regions are
sent in a single message, there could be upwards of 256 file descriptors
being opened in the backend process at once. Opening that many fds could
easily push the process past the open fd limit, especially considering
that one backend process could have multiple vhost threads exposing
different devices to different QEMU instances. Therefore, to safely lift
the limit, transmitting regions should be split up over multiple
messages.

In addition, the VHOST_USER_SET_MEM_TABLE message was not reused because
as the number of regions grows, the message becomes very large. In
practice, such large messages caused problems (truncated messages), and
in the past it seems the community has opted for smaller fixed-size
messages where possible. VRINGs, for example, are sent to the backend
individually instead of in one massive message.
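
To make the size argument concrete, here is a small sketch of how a
single-message memory table grows with the region count. The layout is an
assumption for illustration: a descriptor with the four 64-bit fields
named in this series, preceded by a count and padding. The type and
function names are hypothetical and the message header is not counted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Region descriptor with the four 64-bit fields named in this series. */
typedef struct {
    uint64_t guest_phys_addr;
    uint64_t memory_size;
    uint64_t userspace_addr;
    uint64_t mmap_offset;
} RegionDesc;

/* Header of a memory table payload: a region count plus padding. */
typedef struct {
    uint32_t nregions;
    uint32_t padding;
} TableHeader;

/* Payload bytes needed to carry nregions descriptors in one message. */
static size_t mem_table_payload_size(size_t nregions)
{
    /* Guard against overflow in the multiplication below. */
    assert(nregions <= SIZE_MAX / sizeof(RegionDesc) - 1);
    return sizeof(TableHeader) + nregions * sizeof(RegionDesc);
}
```

Under this assumed layout, 8 regions need 264 payload bytes and 512
regions need 16392, in addition to one file descriptor per region.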

The implementation details are explained in more detail in the commit
messages, but at a high level the new protocol feature works as follows:
- If the VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS feature is enabled,
  QEMU will send multiple VHOST_USER_ADD_MEM_REG and
  VHOST_USER_REM_MEM_REG messages to map and unmap individual memory
  regions instead of one large VHOST_USER_SET_MEM_TABLE message containing
  all memory regions.
- The vhost-user struct maintains a 'shadow state' of memory regions
  already sent to the guest. Each time vhost_user_set_mem_table is called,
  the shadow state is compared with the new device state. A
  VHOST_USER_REM_MEM_REG will be sent for each region in the shadow state
  not in the device state. Then, a VHOST_USER_ADD_MEM_REG will be sent
  for each region in the device state but not the shadow state. After
  these messages have been sent, the shadow state will be updated to
  reflect the new device state.
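
The shadow-state comparison described above can be sketched as two set
differences. This is a minimal illustration under assumed names (Region,
region_equal, regions_only_in are all hypothetical), not the code from the
series: regions only in the shadow state would be sent as
VHOST_USER_REM_MEM_REG, and regions only in the device state as
VHOST_USER_ADD_MEM_REG.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified memory region descriptor. */
typedef struct {
    uint64_t guest_phys_addr;
    uint64_t memory_size;
    uint64_t userspace_addr;
} Region;

static bool region_equal(const Region *a, const Region *b)
{
    return a->guest_phys_addr == b->guest_phys_addr &&
           a->memory_size == b->memory_size &&
           a->userspace_addr == b->userspace_addr;
}

/* Return true if set[0..n) holds a region equal to *r. */
static bool contains(const Region *set, size_t n, const Region *r)
{
    for (size_t i = 0; i < n; i++) {
        if (region_equal(&set[i], r)) {
            return true;
        }
    }
    return false;
}

/*
 * Copy into out[] every region of a[] that has no equal in b[].
 * out must have room for na entries. Returns the number copied.
 */
static size_t regions_only_in(const Region *a, size_t na,
                              const Region *b, size_t nb, Region *out)
{
    size_t count = 0;

    assert(out != NULL);
    for (size_t i = 0; i < na; i++) {
        if (!contains(b, nb, &a[i])) {
            out[count++] = a[i];
        }
    }
    return count;
}
```

Calling regions_only_in(shadow, ns, device, nd, rem) yields the regions to
remove, and swapping the first two arrays yields the regions to add. After
both sets of messages are sent, the shadow array is overwritten with the
device state.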

The series consists of 10 changes:
1. Add helper to populate vhost-user message regions:
This change adds a helper to populate a VhostUserMemoryRegion from a
struct vhost_memory_region, which needs to be done in multiple places in
this series.

2. Add vhost-user helper to get MemoryRegion data
This change adds a helper to get a pointer to a MemoryRegion struct, along
with its offset address and associated file descriptor. This helper is
used to simplify other vhost-user code paths and will be needed elsewhere
in this series.

3. Add VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS
This change adds the VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS
protocol feature. At this point, if negotiated, the feature only allows the
backend to limit the number of max ram slots to a number less than
VHOST_MEMORY_MAX_NREGIONS = 8.

4. Transmit vhost-user memory regions individually
With this change, if the VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS
protocol feature is enabled, QEMU will send regions to the backend using
individual VHOST_USER_ADD_MEM_REG and VHOST_USER_REM_MEM_REG
messages.
The max number of ram slots supported is still limited to 8.

5. Lift max memory slots imposed by vhost-user
With this change, if the VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS
protocol feature is enabled, the backend can support a configurable number
of ram slots up to the maximum allowed by the target platform.

6. Refactor out libvhost-user fault generation logic
This cleanup moves some logic from vu_set_mem_table_exec_postcopy() to a
separate helper, which will be needed elsewhere.

7. Support ram slot configuration in libvhost-user
This change adds support for processing VHOST_USER_GET_MAX_MEM_SLOTS
messages in libvhost-user.
The VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS protocol feature is not yet
enabled in libvhost-user, so at this point this change is non-functional.

8. Support adding individual regions in libvhost-user
This change adds libvhost-user support for mapping in new memory regions
when receiving VHOST_USER_ADD_MEM_REG messages.
The VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS protocol feature is not yet
enabled in libvhost-user, so at this point this change is non-functional.

9. Support individual region unmap in libvhost-user
This change adds libvhost-user support for unmapping removed memory regions
when receiving VHOST_USER_REM_MEM_REG messages.

[PATCH v4 01/10] Add helper to populate vhost-user message regions

When setting vhost-user memory tables, memory region descriptors must be
copied from the vhost_dev struct to the vhost-user message. To avoid
duplicating code in setting the memory tables, we should use a helper to
populate this field. This change adds this helper.

Signed-off-by: Raphael Norwitz 
---
 hw/virtio/vhost-user.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ec21e8f..2e0552d 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -407,6 +407,15 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, 
uint64_t base,
 return 0;
 }
 
+static void vhost_user_fill_msg_region(VhostUserMemoryRegion *dst,
+   struct vhost_memory_region *src)
+{
+assert(src != NULL && dst != NULL);
+dst->userspace_addr = src->userspace_addr;
+dst->memory_size = src->memory_size;
+dst->guest_phys_addr = src->guest_phys_addr;
+}
+
 static int vhost_user_fill_set_mem_table_msg(struct vhost_user *u,
  struct vhost_dev *dev,
  VhostUserMsg *msg,
@@ -417,6 +426,7 @@ static int vhost_user_fill_set_mem_table_msg(struct 
vhost_user *u,
 ram_addr_t offset;
 MemoryRegion *mr;
 struct vhost_memory_region *reg;
+VhostUserMemoryRegion region_buffer;
 
 msg->hdr.request = VHOST_USER_SET_MEM_TABLE;
 
@@ -441,12 +451,8 @@ static int vhost_user_fill_set_mem_table_msg(struct 
vhost_user *u,
 error_report("Failed preparing vhost-user memory table msg");
 return -1;
 }
-msg->payload.memory.regions[*fd_num].userspace_addr =
-reg->userspace_addr;
-msg->payload.memory.regions[*fd_num].memory_size =
-reg->memory_size;
-msg->payload.memory.regions[*fd_num].guest_phys_addr =
-reg->guest_phys_addr;
+vhost_user_fill_msg_region(&region_buffer, reg);
+msg->payload.memory.regions[*fd_num] = region_buffer;
 msg->payload.memory.regions[*fd_num].mmap_offset = offset;
 fds[(*fd_num)++] = fd;
 } else if (track_ramblocks) {
-- 
1.8.3.1

[PATCH v4 03/10] Add VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS

This change introduces a new feature to the vhost-user protocol allowing
a backend device to specify the maximum number of ram slots it supports.

At this point, the value returned by the backend will be capped at the
maximum number of ram slots which can be supported by vhost-user, which
is currently set to 8 because of underlying protocol limitations.

The returned value will be stored inside the VhostUserState struct so
that on device reconnect we can verify that the ram slot limitation
has not decreased since the last time the device connected.
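
The cap and the reconnect check described above can be sketched as
follows. This is a simplified illustration, not the patch's code: the
function name and the PROTOCOL_MAX_SLOTS macro are hypothetical, and the
previously accepted limit is passed in explicitly rather than read from
VhostUserState.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name for the vhost-user limit of 8 stated above. */
#define PROTOCOL_MAX_SLOTS 8

/*
 * Validate a limit reported by the backend against the previously
 * accepted limit (0 if the device has never connected) and cap it.
 * Returns 0 and stores the capped value in *accepted, or returns -1
 * and leaves *accepted untouched if the reported limit is lower than
 * the previously accepted one.
 */
static int accept_slot_limit(uint64_t reported, uint64_t previous,
                             uint64_t *accepted)
{
    assert(accepted != NULL);
    if (reported < previous) {
        return -1;
    }
    *accepted = reported < PROTOCOL_MAX_SLOTS ? reported
                                              : PROTOCOL_MAX_SLOTS;
    return 0;
}
```

A backend reporting 32 slots is therefore capped to 8, while a backend
that reported 4 slots and then reports 2 after a reconnect is rejected.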

Signed-off-by: Raphael Norwitz 
Signed-off-by: Peter Turschmid 
---
 docs/interop/vhost-user.rst| 16 ++
 hw/virtio/vhost-user.c | 49 --
 include/hw/virtio/vhost-user.h |  1 +
 3 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 3b1b660..b3cf5c3 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -815,6 +815,7 @@ Protocol features
   #define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD   12
   #define VHOST_USER_PROTOCOL_F_RESET_DEVICE 13
   #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14
+  #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
 
 Master message types
 
@@ -1263,6 +1264,21 @@ Master message types
 
   The state.num field is currently reserved and must be set to 0.
 
+``VHOST_USER_GET_MAX_MEM_SLOTS``
+  :id: 36
+  :equivalent ioctl: N/A
+  :slave payload: u64
+
+  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
+  feature has been successfully negotiated, this message is submitted
+  by master to the slave. The slave should return the message with a
+  u64 payload containing the maximum number of memory slots for
+  QEMU to expose to the guest. At this point, the value returned
+  by the backend will be capped at the maximum number of ram slots
+  which can be supported by vhost-user. Currently that limit is set
+  at VHOST_USER_MAX_RAM_SLOTS = 8 because of underlying protocol
+  limitations.
+
 Slave message types
 ---
 
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 442b0d6..0af593f 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -59,6 +59,8 @@ enum VhostUserProtocolFeature {
 VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11,
 VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
 VHOST_USER_PROTOCOL_F_RESET_DEVICE = 13,
+/* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. */
+VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
 VHOST_USER_PROTOCOL_F_MAX
 };
 
@@ -100,6 +102,8 @@ typedef enum VhostUserRequest {
 VHOST_USER_SET_INFLIGHT_FD = 32,
 VHOST_USER_GPU_SET_SOCKET = 33,
 VHOST_USER_RESET_DEVICE = 34,
+/* Message number 35 reserved for VHOST_USER_VRING_KICK. */
+VHOST_USER_GET_MAX_MEM_SLOTS = 36,
 VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -895,6 +899,23 @@ static int vhost_user_set_owner(struct vhost_dev *dev)
 return 0;
 }
 
+static int vhost_user_get_max_memslots(struct vhost_dev *dev,
+   uint64_t *max_memslots)
+{
+uint64_t backend_max_memslots;
+int err;
+
+err = vhost_user_get_u64(dev, VHOST_USER_GET_MAX_MEM_SLOTS,
+ &backend_max_memslots);
+if (err < 0) {
+return err;
+}
+
+*max_memslots = backend_max_memslots;
+
+return 0;
+}
+
 static int vhost_user_reset_device(struct vhost_dev *dev)
 {
 VhostUserMsg msg = {
@@ -1392,7 +1413,7 @@ static int 
vhost_user_postcopy_notifier(NotifierWithReturn *notifier,
 
 static int vhost_user_backend_init(struct vhost_dev *dev, void *opaque)
 {
-uint64_t features, protocol_features;
+uint64_t features, protocol_features, ram_slots;
 struct vhost_user *u;
 int err;
 
@@ -1454,6 +1475,27 @@ static int vhost_user_backend_init(struct vhost_dev 
*dev, void *opaque)
  "slave-req protocol features.");
 return -1;
 }
+
+/* get max memory regions if backend supports configurable RAM slots */
+if (!virtio_has_feature(dev->protocol_features,
+VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS)) {
+u->user->memory_slots = VHOST_MEMORY_MAX_NREGIONS;
+} else {
+err = vhost_user_get_max_memslots(dev, &ram_slots);
+if (err < 0) {
+return err;
+}
+
+if (ram_slots < u->user->memory_slots) {
+error_report("The backend specified a max ram slots limit "
+ "of %lu, when the prior validated limit was %d. "
+ "This limit should never decrease.", ram_slots,
+ u->user->memory_slots);
+return -1;
+}
+
+u->user->memory_slots = MIN(ram_slots, VHOST_MEMORY_MAX_NREGIONS);
+}
 }
 
 if

[PATCH v4 04/10] Transmit vhost-user memory regions individually

With this change, when the VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS
protocol feature has been negotiated, QEMU no longer sends the backend
all the memory regions in a single message. Rather, when the memory
tables are set or updated, a series of VHOST_USER_ADD_MEM_REG and
VHOST_USER_REM_MEM_REG messages are sent to transmit the regions to map
and/or unmap, instead of sending all the regions in one fixed-size
VHOST_USER_SET_MEM_TABLE message.

The vhost_user struct maintains a shadow state of the VM’s memory
regions. When the memory tables are modified, the
vhost_user_set_mem_table() function compares the new device memory state
to the shadow state and only sends regions which need to be unmapped or
mapped in. The regions which must be unmapped are sent first, followed
by the new regions to be mapped in. After all the messages have been
sent, the shadow state is set to the current virtual device state.

Existing backends which do not support
VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS are unaffected.

Signed-off-by: Raphael Norwitz 
Signed-off-by: Swapnil Ingle 
Signed-off-by: Peter Turschmid 
Suggested-by: Mike Cui 
---
 docs/interop/vhost-user.rst |  33 ++-
 hw/virtio/vhost-user.c  | 510 +---
 2 files changed, 469 insertions(+), 74 deletions(-)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index b3cf5c3..037eefa 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -1276,8 +1276,37 @@ Master message types
   QEMU to expose to the guest. At this point, the value returned
   by the backend will be capped at the maximum number of ram slots
   which can be supported by vhost-user. Currently that limit is set
-  at VHOST_USER_MAX_RAM_SLOTS = 8 because of underlying protocol
-  limitations.
+  at VHOST_USER_MAX_RAM_SLOTS = 8.
+
+``VHOST_USER_ADD_MEM_REG``
+  :id: 37
+  :equivalent ioctl: N/A
+  :slave payload: memory region
+
+  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
+  feature has been successfully negotiated, this message is submitted
+  by the master to the slave. The message payload contains a memory
+  region descriptor struct, describing a region of guest memory which
+  the slave device must map in. When the
+  ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol feature has
+  been successfully negotiated, along with the
+  ``VHOST_USER_REM_MEM_REG`` message, this message is used to set and
+  update the memory tables of the slave device.
+
+``VHOST_USER_REM_MEM_REG``
+  :id: 38
+  :equivalent ioctl: N/A
+  :slave payload: memory region
+
+  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
+  feature has been successfully negotiated, this message is submitted
+  by the master to the slave. The message payload contains a memory
+  region descriptor struct, describing a region of guest memory which
+  the slave device must unmap. When the
+  ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol feature has
+  been successfully negotiated, along with the
+  ``VHOST_USER_ADD_MEM_REG`` message, this message is used to set and
+  update the memory tables of the slave device.
 
 Slave message types
 ---
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 0af593f..9358406 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -104,6 +104,8 @@ typedef enum VhostUserRequest {
 VHOST_USER_RESET_DEVICE = 34,
 /* Message number 35 reserved for VHOST_USER_VRING_KICK. */
 VHOST_USER_GET_MAX_MEM_SLOTS = 36,
+VHOST_USER_ADD_MEM_REG = 37,
+VHOST_USER_REM_MEM_REG = 38,
 VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -128,6 +130,11 @@ typedef struct VhostUserMemory {
 VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];
 } VhostUserMemory;
 
+typedef struct VhostUserMemRegMsg {
+uint32_t padding;
+VhostUserMemoryRegion region;
+} VhostUserMemRegMsg;
+
 typedef struct VhostUserLog {
 uint64_t mmap_size;
 uint64_t mmap_offset;
@@ -186,6 +193,7 @@ typedef union {
 struct vhost_vring_state state;
 struct vhost_vring_addr addr;
 VhostUserMemory memory;
+VhostUserMemRegMsg mem_reg;
 VhostUserLog log;
 struct vhost_iotlb_msg iotlb;
 VhostUserConfig config;
@@ -226,6 +234,16 @@ struct vhost_user {
 
 /* True once we've entered postcopy_listen */
 bool   postcopy_listen;
+
+/* Our current regions */
+int num_shadow_regions;
+struct vhost_memory_region shadow_regions[VHOST_MEMORY_MAX_NREGIONS];
+};
+
+struct scrub_regions {
+struct vhost_memory_region *region;
+int reg_idx;
+int fd_idx;
 };
 
 static bool ioeventfd_enabled(void)
@@ -489,8 +507,332 @@ static int vhost_user_fill_set_mem_table_msg(struct 
vhost_user *u,
 return 1;
 }
 
+static inline bool reg_equal(struct vhost_memory_region *shadow_reg,
+ struct vhost_memory_region *vdev_reg)
+{
+return

[PATCH v4 02/10] Add vhost-user helper to get MemoryRegion data

When setting the memory tables, qemu uses a memory region's userspace
address to look up the region's MemoryRegion struct. Among other things,
the MemoryRegion contains the region's offset and associated file
descriptor, all of which need to be sent to the backend.

With VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS, this logic will be
needed in multiple places, so before feature support is added it
should be moved to a helper function.

This helper is also used to simplify the vhost_user_can_merge()
function.

Signed-off-by: Raphael Norwitz 
---
 hw/virtio/vhost-user.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 2e0552d..442b0d6 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -407,6 +407,18 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, 
uint64_t base,
 return 0;
 }
 
+static MemoryRegion *vhost_user_get_mr_data(uint64_t addr, ram_addr_t *offset,
+int *fd)
+{
+MemoryRegion *mr;
+
+assert((uintptr_t)addr == addr);
+mr = memory_region_from_host((void *)(uintptr_t)addr, offset);
+*fd = memory_region_get_fd(mr);
+
+return mr;
+}
+
 static void vhost_user_fill_msg_region(VhostUserMemoryRegion *dst,
struct vhost_memory_region *src)
 {
@@ -433,10 +445,7 @@ static int vhost_user_fill_set_mem_table_msg(struct 
vhost_user *u,
 for (i = 0; i < dev->mem->nregions; ++i) {
 reg = dev->mem->regions + i;
 
-assert((uintptr_t)reg->userspace_addr == reg->userspace_addr);
-mr = memory_region_from_host((void *)(uintptr_t)reg->userspace_addr,
- &offset);
-fd = memory_region_get_fd(mr);
+mr = vhost_user_get_mr_data(reg->userspace_addr, &offset, &fd);
 if (fd > 0) {
 if (track_ramblocks) {
 assert(*fd_num < VHOST_MEMORY_MAX_NREGIONS);
@@ -1551,13 +1560,9 @@ static bool vhost_user_can_merge(struct vhost_dev *dev,
 {
 ram_addr_t offset;
 int mfd, rfd;
-MemoryRegion *mr;
-
-mr = memory_region_from_host((void *)(uintptr_t)start1, &offset);
-mfd = memory_region_get_fd(mr);
 
-mr = memory_region_from_host((void *)(uintptr_t)start2, &offset);
-rfd = memory_region_get_fd(mr);
+(void)vhost_user_get_mr_data(start1, &offset, &mfd);
+(void)vhost_user_get_mr_data(start2, &offset, &rfd);
 
 return mfd == rfd;
 }
-- 
1.8.3.1

Re: [PATCH v1 04/10] linux-user: completely re-write init_guest_space

On 13/05/2020 19.51, Alex Bennée wrote:
> First we ensure all guest space initialisation logic comes through
> probe_guest_base once we understand the nature of the binary we are
> loading. The convoluted init_guest_space routine is removed and
> replaced with a number of pgb_* helpers which are called depending on
> what requirements we have when loading the binary.
> 
> We first try to do what is requested by the host. Failing that we try
> and satisfy the guest requested base address. If all those options
> fail we fall back to finding a space in the memory map using our
> recently written read_self_maps() helper.
> 
> There are some additional complications we try and take into account
> when looking for holes in the address space. We try not to go directly
> after the system brk() space so there is space for a little growth. We
> also don't want to have to use negative offsets which would result in
> slightly less efficient code on x86 when it's unable to use the
> segment offset register.
> 
> Less mind-binding gotos and hopefully clearer logic throughout.
> 
> Signed-off-by: Alex Bennée 
> Acked-by: Laurent Vivier 
[...]
> diff --git a/linux-user/elfload.c b/linux-user/elfload.c
> index 619c054cc48..01a9323a637 100644
> --- a/linux-user/elfload.c
> +++ b/linux-user/elfload.c
> @@ -11,6 +11,7 @@
>  #include "qemu/queue.h"
>  #include "qemu/guest-random.h"
>  #include "qemu/units.h"
> +#include "qemu/selfmap.h"
>  
>  #ifdef _ARCH_PPC64
>  #undef ARCH_DLINFO
> @@ -382,68 +383,30 @@ enum {
>  
>  /* The commpage only exists for 32 bit kernels */
>  
> -/* Return 1 if the proposed guest space is suitable for the guest.
> - * Return 0 if the proposed guest space isn't suitable, but another
> - * address space should be tried.
> - * Return -1 if there is no way the proposed guest space can be
> - * valid regardless of the base.
> - * The guest code may leave a page mapped and populate it if the
> - * address is suitable.
> - */
> -static int init_guest_commpage(unsigned long guest_base,
> -   unsigned long guest_size)
> -{
> -unsigned long real_start, test_page_addr;
> -
> -/* We need to check that we can force a fault on access to the
> - * commpage at 0x0fxx
> - */
> -test_page_addr = guest_base + (0x0f00 & qemu_host_page_mask);
> -
> -/* If the commpage lies within the already allocated guest space,
> - * then there is no way we can allocate it.
> - *
> - * You may be thinking that that this check is redundant because
> - * we already validated the guest size against MAX_RESERVED_VA;
> - * but if qemu_host_page_mask is unusually large, then
> - * test_page_addr may be lower.
> - */
> -if (test_page_addr >= guest_base
> -&& test_page_addr < (guest_base + guest_size)) {
> -return -1;
> -}
> +#define ARM_COMMPAGE (intptr_t)0x0f00u
>  
> -/* Note it needs to be writeable to let us initialise it */
> -real_start = (unsigned long)
> - mmap((void *)test_page_addr, qemu_host_page_size,
> - PROT_READ | PROT_WRITE,
> - MAP_ANONYMOUS | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +static bool init_guest_commpage(void)
> +{
> +void *want = g2h(ARM_COMMPAGE & -qemu_host_page_size);
> +void *addr = mmap(want, qemu_host_page_size, PROT_READ | PROT_WRITE,
> +  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>  
> -/* If we can't map it then try another address */
> -if (real_start == -1ul) {
> -return 0;
> +if (addr == MAP_FAILED) {
> +perror("Allocating guest commpage");
> +exit(EXIT_FAILURE);
>  }
> -
> -if (real_start != test_page_addr) {
> -/* OS didn't put the page where we asked - unmap and reject */
> -munmap((void *)real_start, qemu_host_page_size);
> -return 0;
> +if (addr != want) {
> +return false;
>  }
>  
> -/* Leave the page mapped
> - * Populate it (mmap should have left it all 0'd)
> - */
> -
> -/* Kernel helper versions */
> -__put_user(5, (uint32_t *)g2h(0x0ffcul));
> +/* Set kernel helper versions; rest of page is 0.  */
> +__put_user(5, (uint32_t *)g2h(0x0ffcu));
>  
> -/* Now it's populated make it RO */
> -if (mprotect((void *)test_page_addr, qemu_host_page_size, PROT_READ)) {
> +if (mprotect(addr, qemu_host_page_size, PROT_READ)) {
>  perror("Protecting guest commpage");
> -exit(-1);
> +exit(EXIT_FAILURE);
>  }
> -
> -return 1; /* All good */
> +return true;
>  }
>  
>  #define ELF_HWCAP get_elf_hwcap()
> @@ -2075,239 +2038,267 @@ static abi_ulong create_elf_tables(abi_ulong p, int 
> argc, int envc,
>  return sp;
>  }
>  
> -unsigned long init_guest_space(unsigned long host_start,
> -   unsigned long host_size,
> -   unsigned long guest_start,
> -   bool

Re: [PATCH v8 0/8] pci_expander_brdige:acpi: Support pxb-pcie for ARM

Patchew URL: https://patchew.org/QEMU/20200521033631.1605-1-miaoy...@huawei.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Message-id: 20200521033631.1605-1-miaoy...@huawei.com
Subject: [PATCH v8 0/8] pci_expander_brdige:acpi: Support pxb-pcie for ARM
Type: series

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Switched to a new branch 'test'
b067a12 unit-test: Add the binary file and clear diff.h
93ad0f2 unit-test: Add testcase for pxb
1c7bcb1 unit-test: The files changed.
ad5fafc acpi: Align the size to 128k
a08c69e acpi: Refactor the source of host bridge and build tables for pxb
e1ad9a4 acpi: Extract crs build form acpi_build.c
c96e0de fw_cfg: Write the extra roots into the fw_cfg
887ac48 acpi: Extract two APIs from acpi_dsdt_add_pci

=== OUTPUT BEGIN ===
1/8 Checking commit 887ac4803f07 (acpi: Extract two APIs from acpi_dsdt_add_pci)
2/8 Checking commit c96e0de90d7e (fw_cfg: Write the extra roots into the fw_cfg)
3/8 Checking commit e1ad9a46a3b0 (acpi: Extract crs build form acpi_build.c)
4/8 Checking commit a08c69ea4ea9 (acpi: Refactor the source of host bridge and 
build tables for pxb)
5/8 Checking commit ad5fafcf0ee5 (acpi: Align the size to 128k)
6/8 Checking commit 1c7bcb134bf5 (unit-test: The files changed.)
7/8 Checking commit 93ad0f2148c0 (unit-test: Add testcase for pxb)
8/8 Checking commit b067a1263eb6 (unit-test: Add the binary file and clear 
diff.h)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#13: 
new file mode 100644

ERROR: Do not add expected files together with tests, follow instructions in 
tests/qtest/bios-tables-test.c: both tests/data/acpi/virt/DSDT.pxb and 
tests/qtest/bios-tables-test-allowed-diff.h found

ERROR: Do not add expected files together with tests, follow instructions in 
tests/qtest/bios-tables-test.c: both tests/data/acpi/virt/DSDT.pxb and 
tests/qtest/bios-tables-test-allowed-diff.h found

total: 2 errors, 1 warnings, 1 lines checked

Patch 8/8 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20200521033631.1605-1-miaoy...@huawei.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com

Re: [PATCH v8 0/8] pci_expander_brdige:acpi: Support pxb-pcie for ARM

Patchew URL: https://patchew.org/QEMU/20200521033631.1605-1-miaoy...@huawei.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing 
commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

  TEST    check-qtest-x86_64: tests/qtest/qom-test
socket_accept failed: Resource temporarily unavailable
**
ERROR:/tmp/qemu-test/src/tests/qtest/libqtest.c:301:qtest_init_without_qmp_handshake:
 assertion failed: (s->fd >= 0 && s->qmp_fd >= 0)
/tmp/qemu-test/src/tests/qtest/libqtest.c:175: kill_qemu() detected QEMU death 
from signal 15 (Terminated)
ERROR - Bail out! 
ERROR:/tmp/qemu-test/src/tests/qtest/libqtest.c:301:qtest_init_without_qmp_handshake:
 assertion failed: (s->fd >= 0 && s->qmp_fd >= 0)
make: *** [check-qtest-x86_64] Error 1
make: *** Waiting for unfinished jobs
  TEST    iotest-qcow2: 220
  TEST    iotest-qcow2: 226
---
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', 
'--label', 'com.qemu.instance.uuid=fba0d148ea2d4da7b4a208babcd55f63', '-u', 
'1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', 
'-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 
'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', 
'/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/var/tmp/patchew-tester-tmp-5gx31uix/src/docker-src.2020-05-20-23.47.40.24884:/var/tmp/qemu:z,ro',
 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit 
status 2.
filter=--filter=label=com.qemu.instance.uuid=fba0d148ea2d4da7b4a208babcd55f63
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-5gx31uix/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    23m10.014s
user    0m9.319s


The full log is available at
http://patchew.org/logs/20200521033631.1605-1-miaoy...@huawei.com/testing.docker-quick@centos7/?type=message.

[RFC v2 18/18] guest memory protection: Alter virtio default properties for protected guests

The default behaviour for virtio devices is not to use the platform's normal
DMA paths, but instead to use the fact that it's running in a hypervisor
to directly access guest memory.  That doesn't work if the guest's memory
is protected from hypervisor access, such as with AMD's SEV or POWER's PEF.

So, if a guest memory protection mechanism is enabled, then apply the
iommu_platform=on option so it will go through normal DMA mechanisms.
Those will presumably have some way of marking memory as shared with the
hypervisor or hardware so that DMA will work.

Signed-off-by: David Gibson 
---
 hw/core/machine.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 88d699bceb..cb6580954e 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -28,6 +28,8 @@
 #include "hw/mem/nvdimm.h"
 #include "migration/vmstate.h"
 #include "exec/guest-memory-protection.h"
+#include "hw/virtio/virtio.h"
+#include "hw/virtio/virtio-pci.h"
 
 GlobalProperty hw_compat_5_0[] = {};
 const size_t hw_compat_5_0_len = G_N_ELEMENTS(hw_compat_5_0);
@@ -1159,6 +1161,15 @@ void machine_run_board_init(MachineState *machine)
  * areas.
  */
 machine_set_mem_merge(OBJECT(machine), false, &error_abort);
+
+/*
+ * Virtio devices can't count on directly accessing guest
+ * memory, so they need iommu_platform=on to use normal DMA
+ * mechanisms.  That requires disabling legacy virtio support
+ * for virtio pci devices
+ */
+object_register_sugar_prop(TYPE_VIRTIO_PCI, "disable-legacy", "on");
+object_register_sugar_prop(TYPE_VIRTIO_DEVICE, "iommu_platform", "on");
 }
 
 machine_class->init(machine);
-- 
2.26.2

[RFC v2 15/18] guest memory protection: Decouple kvm_memcrypt_*() helpers from KVM

The kvm_memcrypt_enabled() and kvm_memcrypt_encrypt_data() helper functions
don't conceptually have any connection to KVM (although it's not possible
in practice to use them without it).

They also rely on looking at the global KVMState.  But the same information
is available from the machine, and the only existing callers have natural
access to the machine state.

Therefore, move and rename them to helpers in guest-memory-protection.h,
taking an explicit machine parameter.

Signed-off-by: David Gibson 
---
 accel/kvm/kvm-all.c| 28 ---
 accel/stubs/kvm-stub.c | 10 ---
 hw/i386/pc_sysfw.c |  6 ++--
 include/exec/guest-memory-protection.h | 38 ++
 include/sysemu/kvm.h   | 17 
 5 files changed, 42 insertions(+), 57 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 3588adf1e1..1b10e94222 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -118,9 +118,6 @@ struct KVMState
 KVMMemoryListener memory_listener;
 QLIST_HEAD(, KVMParkedVcpu) kvm_parked_vcpus;
 
-/* memory encryption */
-GuestMemoryProtection *guest_memory_protection;
-
 /* For "info mtree -f" to tell if an MR is registered in KVM */
 int nr_as;
 struct KVMAs {
@@ -169,29 +166,6 @@ int kvm_get_max_memslots(void)
 return s->nr_slots;
 }
 
-bool kvm_memcrypt_enabled(void)
-{
-if (kvm_state && kvm_state->guest_memory_protection) {
-return true;
-}
-
-return false;
-}
-
-int kvm_memcrypt_encrypt_data(uint8_t *ptr, uint64_t len)
-{
-GuestMemoryProtection *gmpo = kvm_state->guest_memory_protection;
-
-if (gmpo) {
-GuestMemoryProtectionClass *gmpc =
-GUEST_MEMORY_PROTECTION_GET_CLASS(gmpo);
-
-return gmpc->encrypt_data(gmpo, ptr, len);
-}
-
-return 1;
-}
-
 /* Called with KVMMemoryListener.slots_lock held */
 static KVMSlot *kvm_get_free_slot(KVMMemoryListener *kml)
 {
@@ -2110,8 +2084,6 @@ static int kvm_init(MachineState *ms)
 if (ret < 0) {
 goto err;
 }
-
-kvm_state->guest_memory_protection = ms->gmpo;
 }
 
 ret = kvm_arch_init(ms, s);
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index 82f118d2df..78b3eef117 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -104,16 +104,6 @@ int kvm_on_sigbus(int code, void *addr)
 return 1;
 }
 
-bool kvm_memcrypt_enabled(void)
-{
-return false;
-}
-
-int kvm_memcrypt_encrypt_data(uint8_t *ptr, uint64_t len)
-{
-  return 1;
-}
-
 #ifndef CONFIG_USER_ONLY
 int kvm_irqchip_add_msi_route(KVMState *s, int vector, PCIDevice *dev)
 {
diff --git a/hw/i386/pc_sysfw.c b/hw/i386/pc_sysfw.c
index b8d8ef59eb..9cef5f7780 100644
--- a/hw/i386/pc_sysfw.c
+++ b/hw/i386/pc_sysfw.c
@@ -38,6 +38,7 @@
 #include "sysemu/sysemu.h"
 #include "hw/block/flash.h"
 #include "sysemu/kvm.h"
+#include "exec/guest-memory-protection.h"
 
 /*
  * We don't have a theoretically justifiable exact lower bound on the base
@@ -196,10 +197,11 @@ static void pc_system_flash_map(PCMachineState *pcms,
 pc_isa_bios_init(rom_memory, flash_mem, size);
 
 /* Encrypt the pflash boot ROM */
-if (kvm_memcrypt_enabled()) {
+if (guest_memory_protection_enabled(MACHINE(pcms))) {
 flash_ptr = memory_region_get_ram_ptr(flash_mem);
 flash_size = memory_region_size(flash_mem);
-ret = kvm_memcrypt_encrypt_data(flash_ptr, flash_size);
+ret = guest_memory_protection_encrypt(MACHINE(pcms),
+  flash_ptr, flash_size);
 if (ret) {
 error_report("failed to encrypt pflash rom");
 exit(1);
diff --git a/include/exec/guest-memory-protection.h 
b/include/exec/guest-memory-protection.h
index 3707b96515..7d959b4910 100644
--- a/include/exec/guest-memory-protection.h
+++ b/include/exec/guest-memory-protection.h
@@ -14,6 +14,7 @@
 #define QEMU_GUEST_MEMORY_PROTECTION_H
 
 #include "qom/object.h"
+#include "hw/boards.h"
 
 typedef struct GuestMemoryProtection GuestMemoryProtection;
 
@@ -35,5 +36,42 @@ typedef struct GuestMemoryProtectionClass {
 int (*encrypt_data)(GuestMemoryProtection *, uint8_t *, uint64_t);
 } GuestMemoryProtectionClass;
 
+/**
+ * guest_memory_protection_enabled - return whether guest memory is
+ *   protected from hypervisor access
+ *   (with memory encryption or
+ *   otherwise)
+ * Returns: true guest memory is not directly accessible to qemu
+ *  false guest memory is directly accessible to qemu
+ */
+static inline bool guest_memory_protection_enabled(MachineState *machine)
+{
+return !!machine->gmpo;
+}
+
+/**
+ * guest_memory_protection_encrypt: encrypt the memory range to make
+ *

[RFC v2 17/18] spapr: Added PEF based guest memory protection

Some upcoming POWER machines have a system called PEF (Protected
Execution Framework) which uses a small ultravisor to allow guests to
run in a way that they can't be eavesdropped by the hypervisor.  The
effect is roughly similar to AMD SEV, although the mechanisms are
quite different.

Most of the work of this is done between the guest, KVM and the
ultravisor, with little need for involvement by qemu.  However qemu
does need to tell KVM to allow secure VMs.

Because the availability of secure mode is a guest visible difference
which depends on having the right hardware and firmware, we don't
enable this by default.  In order to run a secure guest you need to
create a "pef-guest" object and set the guest-memory-protection machine 
property to point to it.

Note that this just *allows* secure guests, the architecture of PEF is
such that the guest still needs to talk to the ultravisor to enter
secure mode, so we can't know if the guest actually is secure until
well after machine creation time.

Signed-off-by: David Gibson 
---
 target/ppc/Makefile.objs |  2 +-
 target/ppc/pef.c | 81 
 2 files changed, 82 insertions(+), 1 deletion(-)
 create mode 100644 target/ppc/pef.c

diff --git a/target/ppc/Makefile.objs b/target/ppc/Makefile.objs
index e8fa18ce13..ac93b9700e 100644
--- a/target/ppc/Makefile.objs
+++ b/target/ppc/Makefile.objs
@@ -6,7 +6,7 @@ obj-y += machine.o mmu_helper.o mmu-hash32.o monitor.o 
arch_dump.o
 obj-$(TARGET_PPC64) += mmu-hash64.o mmu-book3s-v3.o compat.o
 obj-$(TARGET_PPC64) += mmu-radix64.o
 endif
-obj-$(CONFIG_KVM) += kvm.o
+obj-$(CONFIG_KVM) += kvm.o pef.o
 obj-$(call lnot,$(CONFIG_KVM)) += kvm-stub.o
 obj-y += dfp_helper.o
 obj-y += excp_helper.o
diff --git a/target/ppc/pef.c b/target/ppc/pef.c
new file mode 100644
index 00..823daf3e9c
--- /dev/null
+++ b/target/ppc/pef.c
@@ -0,0 +1,81 @@
+/*
+ * PEF (Protected Execution Framework) for POWER support
+ *
+ * Copyright David Gibson, Redhat Inc. 2020
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+
+#define TYPE_PEF_GUEST "pef-guest"
+#define PEF_GUEST(obj)  \
+OBJECT_CHECK(PefGuestState, (obj), TYPE_PEF_GUEST)
+
+typedef struct PefGuestState PefGuestState;
+
+/**
+ * PefGuestState:
+ *
+ * The PefGuestState object is used for creating and managing a PEF
+ * guest.
+ *
+ * # $QEMU \
+ * -object pef-guest,id=pef0 \
+ * -machine ...,guest-memory-protection=pef0
+ */
+struct PefGuestState {
+Object parent_obj;
+};
+
+static Error *pef_mig_blocker;
+
+static int pef_kvm_init(GuestMemoryProtection *gmpo, Error **errp)
+{
+PefGuestState *pef = PEF_GUEST(gmpo);
+
+if (!kvm_check_extension(kvm_state, KVM_CAP_PPC_SECURE_GUEST)) {
+error_setg(errp,
+   "KVM implementation does not support Secure VMs (is an 
ultravisor running?)");
+return -1;
+} else {
+int ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_PPC_SECURE_GUEST, 0, 1);
+
+if (ret < 0) {
+error_setg(errp,
+   "Error enabling PEF with KVM");
+return -1;
+}
+}
+
+return 0;
+}
+
+static void pef_guest_class_init(ObjectClass *oc, void *data)
+{
+GuestMemoryProtectionClass *gmpc = GUEST_MEMORY_PROTECTION_CLASS(oc);
+
+gmpc->kvm_init = pef_kvm_init;
+}
+
+static const TypeInfo pef_guest_info = {
+.parent = TYPE_OBJECT,
+.name = TYPE_PEF_GUEST,
+.instance_size = sizeof(PefGuestState),
+.class_init = pef_guest_class_init,
+.interfaces = (InterfaceInfo[]) {
+{ TYPE_GUEST_MEMORY_PROTECTION },
+{ TYPE_USER_CREATABLE },
+{ }
+}
+};
+
+static void
+pef_register_types(void)
+{
+type_register_static(&pef_guest_info);
+}
+
+type_init(pef_register_types);
-- 
2.26.2

[RFC v2 16/18] guest memory protection: Add Error ** to GuestMemoryProtection::kvm_init

This allows failures to be reported richly and idiomatically.

Signed-off-by: David Gibson 
---
 accel/kvm/kvm-all.c|  4 +++-
 include/exec/guest-memory-protection.h |  2 +-
 target/i386/sev.c  | 31 +-
 3 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 1b10e94222..4011699736 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2079,9 +2079,11 @@ static int kvm_init(MachineState *ms)
 if (ms->gmpo) {
 GuestMemoryProtectionClass *gmpc =
 GUEST_MEMORY_PROTECTION_GET_CLASS(ms->gmpo);
+Error *local_err = NULL;
 
-ret = gmpc->kvm_init(ms->gmpo);
+ret = gmpc->kvm_init(ms->gmpo, &local_err);
 if (ret < 0) {
+error_report_err(local_err);
 goto err;
 }
 }
diff --git a/include/exec/guest-memory-protection.h 
b/include/exec/guest-memory-protection.h
index 7d959b4910..2a88475136 100644
--- a/include/exec/guest-memory-protection.h
+++ b/include/exec/guest-memory-protection.h
@@ -32,7 +32,7 @@ typedef struct GuestMemoryProtection GuestMemoryProtection;
 typedef struct GuestMemoryProtectionClass {
 InterfaceClass parent;
 
-int (*kvm_init)(GuestMemoryProtection *);
+int (*kvm_init)(GuestMemoryProtection *, Error **);
 int (*encrypt_data)(GuestMemoryProtection *, uint8_t *, uint64_t);
 } GuestMemoryProtectionClass;
 
diff --git a/target/i386/sev.c b/target/i386/sev.c
index 60e9d8c735..6a56ec203b 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -617,7 +617,7 @@ sev_vm_state_change(void *opaque, int running, RunState 
state)
 }
 }
 
-static int sev_kvm_init(GuestMemoryProtection *gmpo)
+static int sev_kvm_init(GuestMemoryProtection *gmpo, Error **errp)
 {
 SevGuestState *sev = SEV_GUEST(gmpo);
 char *devname;
@@ -633,14 +633,14 @@ static int sev_kvm_init(GuestMemoryProtection *gmpo)
 host_cbitpos = ebx & 0x3f;
 
 if (host_cbitpos != sev->cbitpos) {
-error_report("%s: cbitpos check failed, host '%d' requested '%d'",
- __func__, host_cbitpos, sev->cbitpos);
+error_setg(errp, "%s: cbitpos check failed, host '%d' requested '%d'",
+   __func__, host_cbitpos, sev->cbitpos);
 goto err;
 }
 
 if (sev->reduced_phys_bits < 1) {
-error_report("%s: reduced_phys_bits check failed, it should be >=1,"
- " requested '%d'", __func__, sev->reduced_phys_bits);
+error_setg(errp, "%s: reduced_phys_bits check failed, it should be 
>=1,"
+   " requested '%d'", __func__, sev->reduced_phys_bits);
 goto err;
 }
 
@@ -649,20 +649,19 @@ static int sev_kvm_init(GuestMemoryProtection *gmpo)
 devname = object_property_get_str(OBJECT(sev), "sev-device", NULL);
 sev->sev_fd = open(devname, O_RDWR);
 if (sev->sev_fd < 0) {
-error_report("%s: Failed to open %s '%s'", __func__,
- devname, strerror(errno));
-}
-g_free(devname);
-if (sev->sev_fd < 0) {
+error_setg(errp, "%s: Failed to open %s '%s'", __func__,
+   devname, strerror(errno));
+g_free(devname);
 goto err;
 }
+g_free(devname);
 
 ret = sev_platform_ioctl(sev->sev_fd, SEV_PLATFORM_STATUS, &status,
  &fw_error);
 if (ret) {
-error_report("%s: failed to get platform status ret=%d "
- "fw_error='%d: %s'", __func__, ret, fw_error,
- fw_error_to_str(fw_error));
+error_setg(errp, "%s: failed to get platform status ret=%d "
+   "fw_error='%d: %s'", __func__, ret, fw_error,
+   fw_error_to_str(fw_error));
 goto err;
 }
 sev->build_id = status.build;
@@ -672,14 +671,14 @@ static int sev_kvm_init(GuestMemoryProtection *gmpo)
 trace_kvm_sev_init();
 ret = sev_ioctl(sev->sev_fd, KVM_SEV_INIT, NULL, &fw_error);
 if (ret) {
-error_report("%s: failed to initialize ret=%d fw_error=%d '%s'",
- __func__, ret, fw_error, fw_error_to_str(fw_error));
+error_setg(errp, "%s: failed to initialize ret=%d fw_error=%d '%s'",
+   __func__, ret, fw_error, fw_error_to_str(fw_error));
 goto err;
 }
 
 ret = sev_launch_start(sev);
 if (ret) {
-error_report("%s: failed to create encryption context", __func__);
+error_setg(errp, "%s: failed to create encryption context", __func__);
 goto err;
 }
 
-- 
2.26.2
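
The conversion in this patch follows the usual "errp" convention: the callee
records a failure in a caller-supplied error pointer instead of printing it,
and the caller decides how to report and free it. The following is only a
minimal sketch of that pattern in plain C. The Err type and the err_set(),
demo_init() and demo_caller() names are hypothetical stand-ins for
illustration; they are not QEMU's Error API.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an error object; NOT QEMU's Error type. */
typedef struct Err {
    char msg[128];
} Err;

/* Store a message in *errp, if the caller asked for one. */
static void err_set(Err **errp, const char *msg)
{
    Err *e;

    if (errp == NULL) {
        return;                 /* caller is not interested in details */
    }
    e = calloc(1, sizeof(*e));
    if (e == NULL) {
        abort();
    }
    snprintf(e->msg, sizeof(e->msg), "%s", msg);
    *errp = e;
}

/* Callee: reports failure through errp instead of printing it. */
static int demo_init(int reduced_phys_bits, Err **errp)
{
    if (reduced_phys_bits < 1) {
        err_set(errp, "reduced_phys_bits check failed, it should be >=1");
        return -1;
    }
    return 0;
}

/* Caller: owns the error, chooses how to report it, then frees it. */
static int demo_caller(int reduced_phys_bits, char *out, size_t outlen)
{
    Err *local_err = NULL;
    int ret = demo_init(reduced_phys_bits, &local_err);

    if (ret < 0) {
        assert(local_err != NULL);
        snprintf(out, outlen, "%s", local_err->msg);
        free(local_err);
    }
    return ret;
}
```

The point of the split is that demo_init() no longer needs to know whether its
caller wants the message on stderr, in a monitor reply, or ignored.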

[RFC v2 13/18] guest memory protection: Move side effect out of machine_set_memory_encryption()

When the "memory-encryption" property is set, we also disable KSM
merging for the guest, since it won't accomplish anything.

We want that, but doing it in the property set function itself is
theoretically incorrect, in the unlikely event of some configuration
environment that set the property then cleared it again before
constructing the guest.

But more important, it makes some other cleanups we want more
difficult.  So, instead move this logic to machine_run_board_init()
conditional on the final value of the property.

Signed-off-by: David Gibson 
---
 hw/core/machine.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index bb3a7b18b1..e75f0b73d0 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -429,14 +429,6 @@ static void machine_set_memory_encryption(Object *obj, 
const char *value,
 
 g_free(ms->memory_encryption);
 ms->memory_encryption = g_strdup(value);
-
-/*
- * With memory encryption, the host can't see the real contents of RAM,
- * so there's no point in it trying to merge areas.
- */
-if (value) {
-machine_set_mem_merge(obj, false, errp);
-}
 }
 
 static bool machine_get_nvdimm(Object *obj, Error **errp)
@@ -1129,6 +1121,15 @@ void machine_run_board_init(MachineState *machine)
 }
 }
 
+if (machine->memory_encryption) {
+/*
+ * With guest memory protection, the host can't see the real
+ * contents of RAM, so there's no point in it trying to merge
+ * areas.
+ */
+machine_set_mem_merge(OBJECT(machine), false, &error_abort);
+}
+
 machine_class->init(machine);
 }
 
-- 
2.26.2
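
The ordering problem described in the commit message can be shown with a
small model. This is only a sketch under stated assumptions: DemoMachine and
the demo_*() functions are hypothetical and much simpler than MachineState
and the real property setters. The setter only records the value, and the
merge setting is derived from the final value at board init time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down machine state; NOT QEMU's MachineState. */
typedef struct DemoMachine {
    const char *memory_encryption; /* NULL when the property is unset */
    bool mem_merge;                /* is page merging still allowed? */
} DemoMachine;

static void demo_machine_init_defaults(DemoMachine *m)
{
    assert(m != NULL);
    m->memory_encryption = NULL;
    m->mem_merge = true;
}

/* The setter only records the value; it has no other side effects. */
static void demo_set_memory_encryption(DemoMachine *m, const char *value)
{
    assert(m != NULL);
    m->memory_encryption = value;
}

/* Board init acts on the final value of the property. */
static void demo_run_board_init(DemoMachine *m)
{
    assert(m != NULL);
    if (m->memory_encryption) {
        /* The host cannot see real RAM contents, so merging is pointless. */
        m->mem_merge = false;
    }
}
```

With this arrangement, setting the property and then clearing it again before
demo_run_board_init() leaves mem_merge enabled, which is the case the old
setter-side logic got wrong.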

[RFC v2 14/18] guest memory protection: Rework the "memory-encryption" property

Currently the "memory-encryption" property is only looked at once we get to
kvm_init().  Although protection of guest memory from the hypervisor isn't
something that could really ever work with TCG, it's not conceptually tied
to the KVM accelerator.

In addition, the way the string property is resolved to an object is
almost identical to how a QOM link property is handled.

So, create a new "guest-memory-protection" link property which sets
this QOM interface link directly in the machine.  For compatibility we
keep the "memory-encryption" property, but now implemented in terms of
the new property.

Signed-off-by: David Gibson 
---
 accel/kvm/kvm-all.c | 23 +++
 hw/core/machine.c   | 41 -
 include/hw/boards.h |  4 +++-
 3 files changed, 46 insertions(+), 22 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 5cf1a397e3..3588adf1e1 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2102,25 +2102,16 @@ static int kvm_init(MachineState *ms)
  * if memory encryption object is specified then initialize the memory
  * encryption context.
  */
-if (ms->memory_encryption) {
-Object *obj = object_resolve_path_component(object_get_objects_root(),
-ms->memory_encryption);
-
-if (object_dynamic_cast(obj, TYPE_GUEST_MEMORY_PROTECTION)) {
-GuestMemoryProtection *gmpo = GUEST_MEMORY_PROTECTION(obj);
-GuestMemoryProtectionClass *gmpc =
-GUEST_MEMORY_PROTECTION_GET_CLASS(gmpo);
-
-ret = gmpc->kvm_init(gmpo);
-if (ret < 0) {
-goto err;
-}
+if (ms->gmpo) {
+GuestMemoryProtectionClass *gmpc =
+GUEST_MEMORY_PROTECTION_GET_CLASS(ms->gmpo);
 
-kvm_state->guest_memory_protection = gmpo;
-} else {
-ret = -1;
+ret = gmpc->kvm_init(ms->gmpo);
+if (ret < 0) {
 goto err;
 }
+
+kvm_state->guest_memory_protection = ms->gmpo;
 }
 
 ret = kvm_arch_init(ms, s);
diff --git a/hw/core/machine.c b/hw/core/machine.c
index e75f0b73d0..88d699bceb 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -27,6 +27,7 @@
 #include "hw/pci/pci.h"
 #include "hw/mem/nvdimm.h"
 #include "migration/vmstate.h"
+#include "exec/guest-memory-protection.h"
 
 GlobalProperty hw_compat_5_0[] = {};
 const size_t hw_compat_5_0_len = G_N_ELEMENTS(hw_compat_5_0);
@@ -419,16 +420,37 @@ static char *machine_get_memory_encryption(Object *obj, 
Error **errp)
 {
 MachineState *ms = MACHINE(obj);
 
-return g_strdup(ms->memory_encryption);
+if (ms->gmpo) {
+return object_get_canonical_path_component(OBJECT(ms->gmpo));
+}
+
+return NULL;
 }
 
 static void machine_set_memory_encryption(Object *obj, const char *value,
 Error **errp)
 {
-MachineState *ms = MACHINE(obj);
+Object *gmpo =
+object_resolve_path_component(object_get_objects_root(), value);
+
+if (!gmpo) {
+error_setg(errp, "No such memory encryption object '%s'", value);
+return;
+}
 
-g_free(ms->memory_encryption);
-ms->memory_encryption = g_strdup(value);
+object_property_set_link(obj, gmpo, "guest-memory-protection", errp);
+}
+
+static void machine_check_guest_memory_protection(const Object *obj,
+  const char *name,
+  Object *new_target,
+  Error **errp)
+{
+/*
+ * So far the only constraint is that the target has the
+ * TYPE_GUEST_MEMORY_PROTECTION interface, and that's checked by
+ * the QOM core
+ */
 }
 
 static bool machine_get_nvdimm(Object *obj, Error **errp)
@@ -849,6 +871,15 @@ static void machine_class_init(ObjectClass *oc, void *data)
 object_class_property_set_description(oc, "enforce-config-section",
 "Set on to enforce configuration section migration");
 
+object_class_property_add_link(oc, "guest-memory-protection",
+   TYPE_GUEST_MEMORY_PROTECTION,
+   offsetof(MachineState, gmpo),
+   machine_check_guest_memory_protection,
+   OBJ_PROP_LINK_STRONG);
+object_class_property_set_description(oc, "guest-memory-protection",
+"Set guest memory protection object to use");
+
+/* For compatibility */
 object_class_property_add_str(oc, "memory-encryption",
 machine_get_memory_encryption, machine_set_memory_encryption);
 object_class_property_set_description(oc, "memory-encryption",
@@ -1121,7 +1152,7 @@ void machine_run_board_init(MachineState *machine)
 }
 }
 
-if (machine->memory_encryption) {
+if (machine->gmpo) {
 /*
  * With guest memory

[RFC v2 09/18] target/i386: sev: Unify SEVState and SevGuestState

SEVState is contained with SevGuestState.  We've now fixed redundancies
and name conflicts, so there's no real point to the nested structure.  Just
move all the fields of SEVState into SevGuestState.

This eliminates the SEVState structure, which as a bonus removes the
confusion with the SevState enum.

Signed-off-by: David Gibson 
---
 target/i386/sev.c | 79 ---
 1 file changed, 34 insertions(+), 45 deletions(-)

diff --git a/target/i386/sev.c b/target/i386/sev.c
index 24e2dea9b8..d273174ad3 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -35,18 +35,6 @@
 
 typedef struct SevGuestState SevGuestState;
 
-struct SEVState {
-uint8_t api_major;
-uint8_t api_minor;
-uint8_t build_id;
-uint64_t me_mask;
-int sev_fd;
-SevState state;
-gchar *measurement;
-};
-
-typedef struct SEVState SEVState;
-
 /**
  * SevGuestState:
  *
@@ -70,7 +58,13 @@ struct SevGuestState {
 
 /* runtime state */
 uint32_t handle;
-SEVState state;
+uint8_t api_major;
+uint8_t api_minor;
+uint8_t build_id;
+uint64_t me_mask;
+int sev_fd;
+SevState state;
+gchar *measurement;
 };
 
 #define DEFAULT_GUEST_POLICY0x1 /* disable debug */
@@ -158,7 +152,7 @@ static bool
 sev_check_state(const SevGuestState *sev, SevState state)
 {
 assert(sev);
-return sev->state.state == state ? true : false;
+return sev->state == state ? true : false;
 }
 
 static void
@@ -167,9 +161,9 @@ sev_set_guest_state(SevGuestState *sev, SevState new_state)
 assert(new_state < SEV_STATE__MAX);
 assert(sev);
 
-trace_kvm_sev_change_state(SevState_str(sev->state.state),
+trace_kvm_sev_change_state(SevState_str(sev->state),
SevState_str(new_state));
-sev->state.state = new_state;
+sev->state = new_state;
 }
 
 static void
@@ -368,7 +362,7 @@ sev_enabled(void)
 uint64_t
 sev_get_me_mask(void)
 {
-return sev_guest ? sev_guest->state.me_mask : ~0;
+return sev_guest ? sev_guest->me_mask : ~0;
 }
 
 uint32_t
@@ -392,11 +386,11 @@ sev_get_info(void)
 info->enabled = sev_enabled();
 
 if (info->enabled) {
-info->api_major = sev_guest->state.api_major;
-info->api_minor = sev_guest->state.api_minor;
-info->build_id = sev_guest->state.build_id;
+info->api_major = sev_guest->api_major;
+info->api_minor = sev_guest->api_minor;
+info->build_id = sev_guest->build_id;
 info->policy = sev_guest->policy;
-info->state = sev_guest->state.state;
+info->state = sev_guest->state;
 info->handle = sev_guest->handle;
 }
 
@@ -507,7 +501,6 @@ sev_read_file_base64(const char *filename, guchar **data, 
gsize *len)
 static int
 sev_launch_start(SevGuestState *sev)
 {
-SEVState *s = &sev->state;
 gsize sz;
 int ret = 1;
 int fw_error, rc;
@@ -535,7 +528,7 @@ sev_launch_start(SevGuestState *sev)
 }
 
 trace_kvm_sev_launch_start(start->policy, session, dh_cert);
-rc = sev_ioctl(s->sev_fd, KVM_SEV_LAUNCH_START, start, &fw_error);
+rc = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_START, start, &fw_error);
 if (rc < 0) {
 error_report("%s: LAUNCH_START ret=%d fw_error=%d '%s'",
 __func__, ret, fw_error, fw_error_to_str(fw_error));
@@ -566,7 +559,7 @@ sev_launch_update_data(SevGuestState *sev, uint8_t *addr, 
uint64_t len)
 update.uaddr = (__u64)(unsigned long)addr;
 update.len = len;
 trace_kvm_sev_launch_update_data(addr, len);
-ret = sev_ioctl(sev->state.sev_fd, KVM_SEV_LAUNCH_UPDATE_DATA,
+ret = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_UPDATE_DATA,
  &update, &fw_error);
 if (ret) {
 error_report("%s: LAUNCH_UPDATE ret=%d fw_error=%d '%s'",
@@ -582,7 +575,6 @@ sev_launch_get_measure(Notifier *notifier, void *unused)
 SevGuestState *sev = sev_guest;
 int ret, error;
 guchar *data;
-SEVState *s = &sev->state;
 struct kvm_sev_launch_measure *measurement;
 
 if (!sev_check_state(sev, SEV_STATE_LAUNCH_UPDATE)) {
@@ -592,7 +584,7 @@ sev_launch_get_measure(Notifier *notifier, void *unused)
 measurement = g_new0(struct kvm_sev_launch_measure, 1);
 
 /* query the measurement blob length */
-ret = sev_ioctl(sev->state.sev_fd, KVM_SEV_LAUNCH_MEASURE,
+ret = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_MEASURE,
  measurement, &error);
 if (!measurement->len) {
 error_report("%s: LAUNCH_MEASURE ret=%d fw_error=%d '%s'",
@@ -604,7 +596,7 @@ sev_launch_get_measure(Notifier *notifier, void *unused)
 measurement->uaddr = (unsigned long)data;
 
 /* get the measurement blob */
-ret = sev_ioctl(sev->state.sev_fd, KVM_SEV_LAUNCH_MEASURE,
+ret = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_MEASURE,
  measurement, &error);
 if (ret) {
 error_report("%s: LAUNCH_MEASURE ret=%d fw_error=%d '%s'",
@@ -615,8 +607,8 @@ sev_launch_get_measure(Notifier *notifier, void *unused)

[RFC v2 10/18] guest memory protection: Add guest memory protection interface

Several architectures have mechanisms which are designed to protect guest
memory from interference or eavesdropping by a compromised hypervisor.  AMD
SEV does this with in-chip memory encryption and Intel has a similar
mechanism.  POWER's Protected Execution Framework (PEF) accomplishes a
similar goal using an ultravisor and new memory protection features,
instead of encryption.

This introduces a new GuestMemoryProtection QOM interface which we'll use
to (partially) unify handling of these various mechanisms.

Signed-off-by: David Gibson 
---
 backends/Makefile.objs |  2 ++
 backends/guest-memory-protection.c | 29 +
 include/exec/guest-memory-protection.h | 36 ++
 3 files changed, 67 insertions(+)
 create mode 100644 backends/guest-memory-protection.c
 create mode 100644 include/exec/guest-memory-protection.h

diff --git a/backends/Makefile.objs b/backends/Makefile.objs
index 28a847cd57..e4fb4f5280 100644
--- a/backends/Makefile.objs
+++ b/backends/Makefile.objs
@@ -21,3 +21,5 @@ common-obj-$(CONFIG_LINUX) += hostmem-memfd.o
 common-obj-$(CONFIG_GIO) += dbus-vmstate.o
 dbus-vmstate.o-cflags = $(GIO_CFLAGS)
 dbus-vmstate.o-libs = $(GIO_LIBS)
+
+common-obj-y += guest-memory-protection.o
diff --git a/backends/guest-memory-protection.c 
b/backends/guest-memory-protection.c
new file mode 100644
index 00..7e538214f7
--- /dev/null
+++ b/backends/guest-memory-protection.c
@@ -0,0 +1,29 @@
+#/*
+ * QEMU Guest Memory Protection interface
+ *
+ * Copyright: David Gibson, Red Hat Inc. 2020
+ *
+ * Authors:
+ *  David Gibson 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+
+#include "exec/guest-memory-protection.h"
+
+static const TypeInfo guest_memory_protection_info = {
+.name = TYPE_GUEST_MEMORY_PROTECTION,
+.parent = TYPE_INTERFACE,
+.class_size = sizeof(GuestMemoryProtectionClass),
+};
+
+static void guest_memory_protection_register_types(void)
+{
+type_register_static(&guest_memory_protection_info);
+}
+
+type_init(guest_memory_protection_register_types)
diff --git a/include/exec/guest-memory-protection.h b/include/exec/guest-memory-protection.h
new file mode 100644
index 00..38e9b01667
--- /dev/null
+++ b/include/exec/guest-memory-protection.h
@@ -0,0 +1,36 @@
+#/*
+ * QEMU Guest Memory Protection interface
+ *
+ * Copyright: David Gibson, Red Hat Inc. 2020
+ *
+ * Authors:
+ *  David Gibson 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ *
+ */
+#ifndef QEMU_GUEST_MEMORY_PROTECTION_H
+#define QEMU_GUEST_MEMORY_PROTECTION_H
+
+#include "qom/object.h"
+
+typedef struct GuestMemoryProtection GuestMemoryProtection;
+
+#define TYPE_GUEST_MEMORY_PROTECTION "guest-memory-protection"
+#define GUEST_MEMORY_PROTECTION(obj)\
+INTERFACE_CHECK(GuestMemoryProtection, (obj),   \
+TYPE_GUEST_MEMORY_PROTECTION)
+#define GUEST_MEMORY_PROTECTION_CLASS(klass)\
+OBJECT_CLASS_CHECK(GuestMemoryProtectionClass, (klass), \
+   TYPE_GUEST_MEMORY_PROTECTION)
+#define GUEST_MEMORY_PROTECTION_GET_CLASS(obj)  \
+OBJECT_GET_CLASS(GuestMemoryProtectionClass, (obj), \
+ TYPE_GUEST_MEMORY_PROTECTION)
+
+typedef struct GuestMemoryProtectionClass {
+InterfaceClass parent;
+} GuestMemoryProtectionClass;
+
+#endif /* QEMU_GUEST_MEMORY_PROTECTION_H */
+
-- 
2.26.2

[RFC v2 08/18] target/i386: sev: Remove redundant handle field

The user can explicitly specify a handle via the "handle" property wired
to SevGuestState::handle.  That gets passed to the KVM_SEV_LAUNCH_START
ioctl() which may update it, the final value being copied back to both
SevGuestState::handle and SEVState::handle.

AFAICT, nothing will be looking at SEVState::handle before it and
SevGuestState::handle have been updated from the ioctl().  So, remove the
field and just use SevGuestState::handle directly.
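
As a rough illustration of the single-copy approach (a sketch only, not the
QEMU code: Guest, LaunchStart and toy_launch_start() are hypothetical
stand-ins for SevGuestState, struct kvm_sev_launch_start and the
KVM_SEV_LAUNCH_START ioctl(), and the toy's rule of assigning handle 1 when
given 0 is an assumption made purely for the example):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct kvm_sev_launch_start. */
typedef struct LaunchStart {
    uint32_t handle;
    uint32_t policy;
} LaunchStart;

/* Hypothetical stand-in for SevGuestState: it holds the only handle copy. */
typedef struct Guest {
    uint32_t policy;
    uint32_t handle;
} Guest;

/* Toy "firmware": picks handle 1 if the caller passed 0 (assumed rule). */
static int toy_launch_start(LaunchStart *start)
{
    if (start->handle == 0) {
        start->handle = 1;
    }
    return 0;
}

/* Pass the configured handle in, then store the final value back in place. */
static int launch_start(Guest *g)
{
    assert(g);

    LaunchStart start = { .handle = g->handle, .policy = g->policy };

    if (toy_launch_start(&start) != 0) {
        return 1;
    }

    g->handle = start.handle; /* no second copy to keep in sync */
    return 0;
}
```

With only one field, a reader such as sev_get_info() cannot observe a stale
copy, which is the point of the removal.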

Signed-off-by: David Gibson 
---
 target/i386/sev.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/target/i386/sev.c b/target/i386/sev.c
index 4b261beaa7..24e2dea9b8 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -40,7 +40,6 @@ struct SEVState {
 uint8_t api_minor;
 uint8_t build_id;
 uint64_t me_mask;
-uint32_t handle;
 int sev_fd;
 SevState state;
 gchar *measurement;
@@ -64,13 +63,13 @@ struct SevGuestState {
 /* configuration parameters */
 char *sev_device;
 uint32_t policy;
-uint32_t handle;
 char *dh_cert_file;
 char *session_file;
 uint32_t cbitpos;
 uint32_t reduced_phys_bits;
 
 /* runtime state */
+uint32_t handle;
 SEVState state;
 };
 
@@ -398,7 +397,7 @@ sev_get_info(void)
 info->build_id = sev_guest->state.build_id;
 info->policy = sev_guest->policy;
 info->state = sev_guest->state.state;
-info->handle = sev_guest->state.handle;
+info->handle = sev_guest->handle;
 }
 
 return info;
@@ -517,8 +516,7 @@ sev_launch_start(SevGuestState *sev)
 
 start = g_new0(struct kvm_sev_launch_start, 1);
 
-start->handle = object_property_get_int(OBJECT(sev), "handle",
-&error_abort);
+start->handle = sev->handle;
 start->policy = sev->policy;
 if (sev->session_file) {
 if (sev_read_file_base64(sev->session_file, &session, &sz) < 0) {
@@ -544,10 +542,8 @@ sev_launch_start(SevGuestState *sev)
 goto out;
 }
 
-object_property_set_int(OBJECT(sev), start->handle, "handle",
-&error_abort);
 sev_set_guest_state(sev, SEV_STATE_LAUNCH_UPDATE);
-s->handle = start->handle;
+sev->handle = start->handle;
 ret = 0;
 
 out:
-- 
2.26.2

[RFC v2 00/18] Refactor configuration of guest memory protection

A number of hardware platforms are implementing mechanisms whereby the
hypervisor does not have unfettered access to guest memory, in order
to mitigate the security impact of a compromised hypervisor.

AMD's SEV implements this with in-cpu memory encryption, and Intel has
its own memory encryption mechanism.  POWER has an upcoming mechanism
to accomplish this in a different way, using a new memory protection
level plus a small trusted ultravisor.  s390 also has a protected
execution environment.

The current code (committed or draft) for these features has each
platform's version configured entirely differently.  That doesn't seem
ideal for users, or particularly for management layers.

AMD SEV introduces a notionally generic machine option
"memory-encryption", but it doesn't actually cover any cases other
than SEV.

This series is a proposal to at least partially unify configuration
for these mechanisms, by renaming and generalizing AMD's
"memory-encryption" property.  It is replaced by a
"guest-memory-protection" property pointing to a platform specific
object which configures and manages the specific details.

For now this series covers just AMD SEV and POWER PEF.  I'm hoping it
can be extended to cover the Intel and s390 mechanisms as well,
though.

Note: I'm using the term "guest memory protection" throughout to refer
to mechanisms like this.  I don't particularly like the term, it's both
long and not really precise.  If someone can think of a succinct way
of saying "a means of protecting guest memory from a possibly
compromised hypervisor", I'd be grateful for the suggestion.

Changes since v1:
 * Rebased
 * Fixed some errors pointed out by Dave Gilbert

David Gibson (18):
  target/i386: sev: Remove unused QSevGuestInfoClass
  target/i386: sev: Move local structure definitions into .c file
  target/i386: sev: Rename QSevGuestInfo
  target/i386: sev: Embed SEVState in SevGuestState
  target/i386: sev: Partial cleanup to sev_state global
  target/i386: sev: Remove redundant cbitpos and reduced_phys_bits
fields
  target/i386: sev: Remove redundant policy field
  target/i386: sev: Remove redundant handle field
  target/i386: sev: Unify SEVState and SevGuestState
  guest memory protection: Add guest memory protection interface
  guest memory protection: Handle memory encryption via interface
  guest memory protection: Perform KVM init via interface
  guest memory protection: Move side effect out of
machine_set_memory_encryption()
  guest memory protection: Rework the "memory-encryption" property
  guest memory protection: Decouple kvm_memcrypt_*() helpers from KVM
  guest memory protection: Add Error ** to
GuestMemoryProtection::kvm_init
  spapr: Added PEF based guest memory protection
  guest memory protection: Alter virtio default properties for protected
guests

 accel/kvm/kvm-all.c|  40 +--
 accel/kvm/sev-stub.c   |   5 -
 accel/stubs/kvm-stub.c |  10 -
 backends/Makefile.objs |   2 +
 backends/guest-memory-protection.c |  29 ++
 hw/core/machine.c  |  61 -
 hw/i386/pc_sysfw.c |   6 +-
 include/exec/guest-memory-protection.h |  77 ++
 include/hw/boards.h|   4 +-
 include/sysemu/kvm.h   |  17 --
 include/sysemu/sev.h   |   6 +-
 target/i386/sev.c  | 351 +
 target/i386/sev_i386.h |  49 
 target/ppc/Makefile.objs   |   2 +-
 target/ppc/pef.c   |  81 ++
 15 files changed, 441 insertions(+), 299 deletions(-)
 create mode 100644 backends/guest-memory-protection.c
 create mode 100644 include/exec/guest-memory-protection.h
 create mode 100644 target/ppc/pef.c

-- 
2.26.2

[RFC v2 12/18] guest memory protection: Perform KVM init via interface

Currently the "memory-encryption" machine option is notionally generic,
but in fact is only used for AMD SEV setups.  Take another step towards
making it actually generic by using the GuestMemoryProtection QOM
interface to dispatch the initial setup, rather than directly calling
sev_guest_init() from kvm_init().
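
The resolve-then-dispatch flow can be sketched in plain C as follows.  This is
only an illustration under stated assumptions: Obj, ObjClass, resolve() and
init_protection() are hypothetical names that loosely model
object_resolve_path_component(), object_dynamic_cast() and the kvm_init
method, and they are not QEMU APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct Obj Obj;

/* Hypothetical class record: the interface name plus its init method. */
typedef struct ObjClass {
    const char *iface;          /* interface implemented, or NULL */
    int (*kvm_init)(Obj *obj);  /* returns 0 on success, < 0 on error */
} ObjClass;

struct Obj {
    const char *id;
    const ObjClass *klass;
    int initialized;
};

/* Toy backend method used to exercise the dispatch. */
static int toy_kvm_init(Obj *obj)
{
    obj->initialized = 1;
    return 0;
}

/* Look an object up by id, loosely like resolving a path component. */
static Obj *resolve(Obj **objs, size_t n, const char *id)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(objs[i]->id, id) == 0) {
            return objs[i];
        }
    }
    return NULL;
}

/* Fail unless the object exists and implements the interface. */
static int init_protection(Obj **objs, size_t n, const char *id, Obj **out)
{
    Obj *obj = resolve(objs, n, id);
    int ret;

    assert(out);
    if (!obj || !obj->klass->iface ||
        strcmp(obj->klass->iface, "guest-memory-protection") != 0) {
        return -1;
    }

    ret = obj->klass->kvm_init(obj);
    if (ret < 0) {
        return ret;
    }

    *out = obj; /* remembered by the caller, like kvm_state's pointer */
    return 0;
}
```

The caller no longer needs to know which platform object it is holding; a
missing object and an object of the wrong type both take the same error path.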

Signed-off-by: David Gibson 
---
 accel/kvm/kvm-all.c| 18 ++---
 include/exec/guest-memory-protection.h |  1 +
 target/i386/sev.c  | 37 --
 3 files changed, 21 insertions(+), 35 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 40997de38c..5cf1a397e3 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -39,7 +39,6 @@
 #include "qemu/main-loop.h"
 #include "trace.h"
 #include "hw/irq.h"
-#include "sysemu/sev.h"
 #include "sysemu/balloon.h"
 #include "qapi/visitor.h"
 #include "qapi/qapi-types-common.h"
@@ -2104,8 +2103,21 @@ static int kvm_init(MachineState *ms)
  * encryption context.
  */
 if (ms->memory_encryption) {
-kvm_state->guest_memory_protection = sev_guest_init(ms->memory_encryption);
-if (!kvm_state->guest_memory_protection) {
+Object *obj = object_resolve_path_component(object_get_objects_root(),
+ms->memory_encryption);
+
+if (object_dynamic_cast(obj, TYPE_GUEST_MEMORY_PROTECTION)) {
+GuestMemoryProtection *gmpo = GUEST_MEMORY_PROTECTION(obj);
+GuestMemoryProtectionClass *gmpc =
+GUEST_MEMORY_PROTECTION_GET_CLASS(gmpo);
+
+ret = gmpc->kvm_init(gmpo);
+if (ret < 0) {
+goto err;
+}
+
+kvm_state->guest_memory_protection = gmpo;
+} else {
 ret = -1;
 goto err;
 }
diff --git a/include/exec/guest-memory-protection.h b/include/exec/guest-memory-protection.h
index eb712a5804..3707b96515 100644
--- a/include/exec/guest-memory-protection.h
+++ b/include/exec/guest-memory-protection.h
@@ -31,6 +31,7 @@ typedef struct GuestMemoryProtection GuestMemoryProtection;
 typedef struct GuestMemoryProtectionClass {
 InterfaceClass parent;
 
+int (*kvm_init)(GuestMemoryProtection *);
 int (*encrypt_data)(GuestMemoryProtection *, uint8_t *, uint64_t);
 } GuestMemoryProtectionClass;
 
diff --git a/target/i386/sev.c b/target/i386/sev.c
index 986c2fee51..60e9d8c735 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -300,26 +300,6 @@ sev_guest_instance_init(Object *obj)
OBJ_PROP_FLAG_READWRITE);
 }
 
-static SevGuestState *
-lookup_sev_guest_info(const char *id)
-{
-Object *obj;
-SevGuestState *info;
-
-obj = object_resolve_path_component(object_get_objects_root(), id);
-if (!obj) {
-return NULL;
-}
-
-info = (SevGuestState *)
-object_dynamic_cast(obj, TYPE_SEV_GUEST);
-if (!info) {
-return NULL;
-}
-
-return info;
-}
-
 bool
 sev_enabled(void)
 {
@@ -637,23 +617,15 @@ sev_vm_state_change(void *opaque, int running, RunState state)
 }
 }
 
-GuestMemoryProtection *
-sev_guest_init(const char *id)
+static int sev_kvm_init(GuestMemoryProtection *gmpo)
 {
-SevGuestState *sev;
+SevGuestState *sev = SEV_GUEST(gmpo);
 char *devname;
 int ret, fw_error;
 uint32_t ebx;
 uint32_t host_cbitpos;
 struct sev_user_data_status status = {};
 
-sev = lookup_sev_guest_info(id);
-if (!sev) {
-error_report("%s: '%s' is not a valid '%s' object",
- __func__, id, TYPE_SEV_GUEST);
-goto err;
-}
-
 sev_guest = sev;
 sev->state = SEV_STATE_UNINIT;
 
@@ -715,10 +687,10 @@ sev_guest_init(const char *id)
 qemu_add_machine_init_done_notifier(&sev_machine_done_notify);
 qemu_add_vm_change_state_handler(sev_vm_state_change, sev);
 
-return GUEST_MEMORY_PROTECTION(sev);
+return 0;
 err:
 sev_guest = NULL;
-return NULL;
+return -1;
 }
 
 static int
@@ -757,6 +729,7 @@ sev_guest_class_init(ObjectClass *oc, void *data)
 object_class_property_set_description(oc, "session-file",
 "guest owners session parameters (encoded with base64)");
 
+gmpc->kvm_init = sev_kvm_init;
 gmpc->encrypt_data = sev_encrypt_data;
 }
 
-- 
2.26.2

[RFC v2 06/18] target/i386: sev: Remove redundant cbitpos and reduced_phys_bits fields

The SEVState structure has cbitpos and reduced_phys_bits fields which are
simply copied from the SevGuestState structure and never changed.  Now that
SEVState is embedded in SevGuestState we can just access the original copy
directly.
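
For reference, the arithmetic that consumes cbitpos in sev_guest_init() can be
sketched as below.  The helper names are hypothetical, and the sketch uses
UINT64_C(1) instead of the 1UL in the code so that the shift is 64 bits wide
on every platform; treat it as an illustration rather than the actual
implementation.

```c
#include <assert.h>
#include <stdint.h>

/* The host C-bit position is the low six bits of EBX, as in the code. */
static uint32_t cbitpos_from_ebx(uint32_t ebx)
{
    return ebx & 0x3f;
}

/* Mask with every bit set except the C-bit, mirroring ~(1UL << cbitpos). */
static uint64_t me_mask_for(uint32_t cbitpos)
{
    assert(cbitpos < 64); /* a shift by 64 or more would be undefined */
    return ~(UINT64_C(1) << cbitpos);
}
```

For a C-bit position of 47, me_mask_for() clears only bit 47 and leaves every
other bit set.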

Signed-off-by: David Gibson 
---
 target/i386/sev.c | 19 +++
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/target/i386/sev.c b/target/i386/sev.c
index 9e8ab7b056..d25af37136 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -41,8 +41,6 @@ struct SEVState {
 uint8_t build_id;
 uint32_t policy;
 uint64_t me_mask;
-uint32_t cbitpos;
-uint32_t reduced_phys_bits;
 uint32_t handle;
 int sev_fd;
 SevState state;
@@ -378,13 +376,13 @@ sev_get_me_mask(void)
 uint32_t
 sev_get_cbit_position(void)
 {
-return sev_guest ? sev_guest->state.cbitpos : 0;
+return sev_guest ? sev_guest->cbitpos : 0;
 }
 
 uint32_t
 sev_get_reduced_phys_bits(void)
 {
-return sev_guest ? sev_guest->state.reduced_phys_bits : 0;
+return sev_guest ? sev_guest->reduced_phys_bits : 0;
 }
 
 SevInfo *
@@ -713,22 +711,19 @@ sev_guest_init(const char *id)
 host_cpuid(0x8000001F, 0, NULL, &ebx, NULL, NULL);
 host_cbitpos = ebx & 0x3f;
 
-s->cbitpos = object_property_get_int(OBJECT(sev), "cbitpos", NULL);
-if (host_cbitpos != s->cbitpos) {
+if (host_cbitpos != sev->cbitpos) {
 error_report("%s: cbitpos check failed, host '%d' requested '%d'",
- __func__, host_cbitpos, s->cbitpos);
+ __func__, host_cbitpos, sev->cbitpos);
 goto err;
 }
 
-s->reduced_phys_bits = object_property_get_int(OBJECT(sev),
-"reduced-phys-bits", NULL);
-if (s->reduced_phys_bits < 1) {
+if (sev->reduced_phys_bits < 1) {
 error_report("%s: reduced_phys_bits check failed, it should be >=1,"
- " requested '%d'", __func__, s->reduced_phys_bits);
+ " requested '%d'", __func__, sev->reduced_phys_bits);
 goto err;
 }
 
-s->me_mask = ~(1UL << s->cbitpos);
+s->me_mask = ~(1UL << sev->cbitpos);
 
 devname = object_property_get_str(OBJECT(sev), "sev-device", NULL);
 s->sev_fd = open(devname, O_RDWR);
-- 
2.26.2

[RFC v2 02/18] target/i386: sev: Move local structure definitions into .c file

Neither QSevGuestInfo nor SEVState (not to be confused with SevState) is
used anywhere outside target/i386/sev.c, so they might as well live in
there rather than in a (somewhat) exposed header.

Signed-off-by: David Gibson 
---
 target/i386/sev.c  | 44 ++
 target/i386/sev_i386.h | 44 --
 2 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/target/i386/sev.c b/target/i386/sev.c
index 2312510cf2..53def5f41a 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -29,6 +29,50 @@
 #include "trace.h"
 #include "migration/blocker.h"
 
+#define TYPE_QSEV_GUEST_INFO "sev-guest"
+#define QSEV_GUEST_INFO(obj)  \
+OBJECT_CHECK(QSevGuestInfo, (obj), TYPE_QSEV_GUEST_INFO)
+
+typedef struct QSevGuestInfo QSevGuestInfo;
+
+/**
+ * QSevGuestInfo:
+ *
+ * The QSevGuestInfo object is used for creating a SEV guest.
+ *
+ * # $QEMU \
+ * -object sev-guest,id=sev0 \
+ * -machine ...,memory-encryption=sev0
+ */
+struct QSevGuestInfo {
+Object parent_obj;
+
+char *sev_device;
+uint32_t policy;
+uint32_t handle;
+char *dh_cert_file;
+char *session_file;
+uint32_t cbitpos;
+uint32_t reduced_phys_bits;
+};
+
+struct SEVState {
+QSevGuestInfo *sev_info;
+uint8_t api_major;
+uint8_t api_minor;
+uint8_t build_id;
+uint32_t policy;
+uint64_t me_mask;
+uint32_t cbitpos;
+uint32_t reduced_phys_bits;
+uint32_t handle;
+int sev_fd;
+SevState state;
+gchar *measurement;
+};
+
+typedef struct SEVState SEVState;
+
 #define DEFAULT_GUEST_POLICY    0x1 /* disable debug */
 #define DEFAULT_SEV_DEVICE  "/dev/sev"
 
diff --git a/target/i386/sev_i386.h b/target/i386/sev_i386.h
index 4f193642ac..8eb7de1bef 100644
--- a/target/i386/sev_i386.h
+++ b/target/i386/sev_i386.h
@@ -28,10 +28,6 @@
 #define SEV_POLICY_DOMAIN   0x10
 #define SEV_POLICY_SEV  0x20
 
-#define TYPE_QSEV_GUEST_INFO "sev-guest"
-#define QSEV_GUEST_INFO(obj)  \
-OBJECT_CHECK(QSevGuestInfo, (obj), TYPE_QSEV_GUEST_INFO)
-
 extern bool sev_enabled(void);
 extern uint64_t sev_get_me_mask(void);
 extern SevInfo *sev_get_info(void);
@@ -40,44 +36,4 @@ extern uint32_t sev_get_reduced_phys_bits(void);
 extern char *sev_get_launch_measurement(void);
 extern SevCapability *sev_get_capabilities(void);
 
-typedef struct QSevGuestInfo QSevGuestInfo;
-
-/**
- * QSevGuestInfo:
- *
- * The QSevGuestInfo object is used for creating a SEV guest.
- *
- * # $QEMU \
- * -object sev-guest,id=sev0 \
- * -machine ...,memory-encryption=sev0
- */
-struct QSevGuestInfo {
-Object parent_obj;
-
-char *sev_device;
-uint32_t policy;
-uint32_t handle;
-char *dh_cert_file;
-char *session_file;
-uint32_t cbitpos;
-uint32_t reduced_phys_bits;
-};
-
-struct SEVState {
-QSevGuestInfo *sev_info;
-uint8_t api_major;
-uint8_t api_minor;
-uint8_t build_id;
-uint32_t policy;
-uint64_t me_mask;
-uint32_t cbitpos;
-uint32_t reduced_phys_bits;
-uint32_t handle;
-int sev_fd;
-SevState state;
-gchar *measurement;
-};
-
-typedef struct SEVState SEVState;
-
 #endif
-- 
2.26.2

[RFC v2 11/18] guest memory protection: Handle memory encryption via interface

At the moment AMD SEV sets a special function pointer, plus an opaque
handle in KVMState to let things know how to encrypt guest memory.

Now that we have a QOM interface for handling things related to guest
memory protection, use a QOM method on that interface, rather than a bare
function pointer for this.
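
A minimal sketch of the method-based dispatch, in plain C rather than real
QOM: Protection, ProtectionClass and the XOR "backend" are hypothetical and
exist only to make the example self-contained.  Only the shape of the
encrypt_data method and the "return 1 when nothing is configured" fallback
follow the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Protection Protection;

/* Hypothetical stand-in for an interface class: a table of methods. */
typedef struct ProtectionClass {
    int (*encrypt_data)(Protection *p, uint8_t *ptr, uint64_t len);
} ProtectionClass;

struct Protection {
    const ProtectionClass *klass;
};

/* Toy backend; the parent must be the first member so the cast is valid. */
typedef struct XorProtection {
    Protection parent;
    uint8_t key;
} XorProtection;

static int xor_encrypt_data(Protection *p, uint8_t *ptr, uint64_t len)
{
    XorProtection *x = (XorProtection *)p;

    for (uint64_t i = 0; i < len; i++) {
        ptr[i] ^= x->key;
    }
    return 0;
}

static const ProtectionClass xor_class = {
    .encrypt_data = xor_encrypt_data,
};

/* Dispatch through the object's class; 1 means no protection configured. */
static int memcrypt_encrypt_data(Protection *p, uint8_t *ptr, uint64_t len)
{
    if (p) {
        assert(p->klass && p->klass->encrypt_data);
        return p->klass->encrypt_data(p, ptr, len);
    }
    return 1;
}
```

The object pointer now carries both the state and the way to reach the
method, so the caller stores one typed pointer instead of a void * handle
plus a separate function pointer.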

Signed-off-by: David Gibson 
---
 accel/kvm/kvm-all.c| 23 +++
 accel/kvm/sev-stub.c   |  5 --
 include/exec/guest-memory-protection.h |  2 +
 include/sysemu/sev.h   |  6 +-
 target/i386/sev.c  | 84 ++
 5 files changed, 63 insertions(+), 57 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index d06cc04079..40997de38c 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -45,6 +45,7 @@
 #include "qapi/qapi-types-common.h"
 #include "qapi/qapi-visit-common.h"
 #include "sysemu/reset.h"
+#include "exec/guest-memory-protection.h"
 
 #include "hw/boards.h"
 
@@ -119,8 +120,7 @@ struct KVMState
 QLIST_HEAD(, KVMParkedVcpu) kvm_parked_vcpus;
 
 /* memory encryption */
-void *memcrypt_handle;
-int (*memcrypt_encrypt_data)(void *handle, uint8_t *ptr, uint64_t len);
+GuestMemoryProtection *guest_memory_protection;
 
 /* For "info mtree -f" to tell if an MR is registered in KVM */
 int nr_as;
@@ -172,7 +172,7 @@ int kvm_get_max_memslots(void)
 
 bool kvm_memcrypt_enabled(void)
 {
-if (kvm_state && kvm_state->memcrypt_handle) {
+if (kvm_state && kvm_state->guest_memory_protection) {
 return true;
 }
 
@@ -181,10 +181,13 @@ bool kvm_memcrypt_enabled(void)
 
 int kvm_memcrypt_encrypt_data(uint8_t *ptr, uint64_t len)
 {
-if (kvm_state->memcrypt_handle &&
-kvm_state->memcrypt_encrypt_data) {
-return kvm_state->memcrypt_encrypt_data(kvm_state->memcrypt_handle,
-  ptr, len);
+GuestMemoryProtection *gmpo = kvm_state->guest_memory_protection;
+
+if (gmpo) {
+GuestMemoryProtectionClass *gmpc =
+GUEST_MEMORY_PROTECTION_GET_CLASS(gmpo);
+
+return gmpc->encrypt_data(gmpo, ptr, len);
 }
 
 return 1;
@@ -2101,13 +2104,11 @@ static int kvm_init(MachineState *ms)
  * encryption context.
  */
 if (ms->memory_encryption) {
-kvm_state->memcrypt_handle = sev_guest_init(ms->memory_encryption);
-if (!kvm_state->memcrypt_handle) {
+kvm_state->guest_memory_protection = sev_guest_init(ms->memory_encryption);
+if (!kvm_state->guest_memory_protection) {
 ret = -1;
 goto err;
 }
-
-kvm_state->memcrypt_encrypt_data = sev_encrypt_data;
 }
 
 ret = kvm_arch_init(ms, s);
diff --git a/accel/kvm/sev-stub.c b/accel/kvm/sev-stub.c
index 4f97452585..4a5cc5569e 100644
--- a/accel/kvm/sev-stub.c
+++ b/accel/kvm/sev-stub.c
@@ -15,11 +15,6 @@
 #include "qemu-common.h"
 #include "sysemu/sev.h"
 
-int sev_encrypt_data(void *handle, uint8_t *ptr, uint64_t len)
-{
-abort();
-}
-
 void *sev_guest_init(const char *id)
 {
 return NULL;
diff --git a/include/exec/guest-memory-protection.h b/include/exec/guest-memory-protection.h
index 38e9b01667..eb712a5804 100644
--- a/include/exec/guest-memory-protection.h
+++ b/include/exec/guest-memory-protection.h
@@ -30,6 +30,8 @@ typedef struct GuestMemoryProtection GuestMemoryProtection;
 
 typedef struct GuestMemoryProtectionClass {
 InterfaceClass parent;
+
+int (*encrypt_data)(GuestMemoryProtection *, uint8_t *, uint64_t);
 } GuestMemoryProtectionClass;
 
 #endif /* QEMU_GUEST_MEMORY_PROTECTION_H */
diff --git a/include/sysemu/sev.h b/include/sysemu/sev.h
index 98c1ec8d38..7735a7942e 100644
--- a/include/sysemu/sev.h
+++ b/include/sysemu/sev.h
@@ -16,6 +16,8 @@
 
 #include "sysemu/kvm.h"
 
-void *sev_guest_init(const char *id);
-int sev_encrypt_data(void *handle, uint8_t *ptr, uint64_t len);
+typedef struct GuestMemoryProtection GuestMemoryProtection;
+
+GuestMemoryProtection *sev_guest_init(const char *id);
+
 #endif
diff --git a/target/i386/sev.c b/target/i386/sev.c
index d273174ad3..986c2fee51 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -28,6 +28,7 @@
 #include "sysemu/runstate.h"
 #include "trace.h"
 #include "migration/blocker.h"
+#include "exec/guest-memory-protection.h"
 
 #define TYPE_SEV_GUEST "sev-guest"
 #define SEV_GUEST(obj)  \
@@ -281,26 +282,6 @@ sev_guest_set_sev_device(Object *obj, const char *value, Error **errp)
 sev->sev_device = g_strdup(value);
 }
 
-static void
-sev_guest_class_init(ObjectClass *oc, void *data)
-{
-object_class_property_add_str(oc, "sev-device",
-  sev_guest_get_sev_device,
-  sev_guest_set_sev_device);
-object_class_property_set_description(oc, "sev-device",
-"SEV device to use");
-object_class_property_add_str(oc, "dh-cert-file",
-

[RFC v2 05/18] target/i386: sev: Partial cleanup to sev_state global

The SEV code uses a pretty ugly global to access its internal state.  Now
that SEVState is embedded in SevGuestState, we can avoid accessing it via
the global in some cases.  In the remaining cases use a new global
referencing the containing SevGuestState which will simplify some future
transformations.

Signed-off-by: David Gibson 
---
 target/i386/sev.c | 92 ---
 1 file changed, 48 insertions(+), 44 deletions(-)

diff --git a/target/i386/sev.c b/target/i386/sev.c
index b4ab9720d6..9e8ab7b056 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -80,7 +80,7 @@ struct SevGuestState {
 #define DEFAULT_GUEST_POLICY    0x1 /* disable debug */
 #define DEFAULT_SEV_DEVICE  "/dev/sev"
 
-static SEVState *sev_state;
+static SevGuestState *sev_guest;
 static Error *sev_mig_blocker;
 
 static const char *const sev_fw_errlist[] = {
@@ -159,21 +159,21 @@ fw_error_to_str(int code)
 }
 
 static bool
-sev_check_state(SevState state)
+sev_check_state(const SevGuestState *sev, SevState state)
 {
-assert(sev_state);
-return sev_state->state == state ? true : false;
+assert(sev);
+return sev->state.state == state ? true : false;
 }
 
 static void
-sev_set_guest_state(SevState new_state)
+sev_set_guest_state(SevGuestState *sev, SevState new_state)
 {
 assert(new_state < SEV_STATE__MAX);
-assert(sev_state);
+assert(sev);
 
-trace_kvm_sev_change_state(SevState_str(sev_state->state),
+trace_kvm_sev_change_state(SevState_str(sev->state.state),
SevState_str(new_state));
-sev_state->state = new_state;
+sev->state.state = new_state;
 }
 
 static void
@@ -366,25 +366,25 @@ lookup_sev_guest_info(const char *id)
 bool
 sev_enabled(void)
 {
-return sev_state ? true : false;
+return !!sev_guest;
 }
 
 uint64_t
 sev_get_me_mask(void)
 {
-return sev_state ? sev_state->me_mask : ~0;
+return sev_guest ? sev_guest->state.me_mask : ~0;
 }
 
 uint32_t
 sev_get_cbit_position(void)
 {
-return sev_state ? sev_state->cbitpos : 0;
+return sev_guest ? sev_guest->state.cbitpos : 0;
 }
 
 uint32_t
 sev_get_reduced_phys_bits(void)
 {
-return sev_state ? sev_state->reduced_phys_bits : 0;
+return sev_guest ? sev_guest->state.reduced_phys_bits : 0;
 }
 
 SevInfo *
@@ -393,15 +393,15 @@ sev_get_info(void)
 SevInfo *info;
 
 info = g_new0(SevInfo, 1);
-info->enabled = sev_state ? true : false;
+info->enabled = sev_enabled();
 
 if (info->enabled) {
-info->api_major = sev_state->api_major;
-info->api_minor = sev_state->api_minor;
-info->build_id = sev_state->build_id;
-info->policy = sev_state->policy;
-info->state = sev_state->state;
-info->handle = sev_state->handle;
+info->api_major = sev_guest->state.api_major;
+info->api_minor = sev_guest->state.api_minor;
+info->build_id = sev_guest->state.build_id;
+info->policy = sev_guest->state.policy;
+info->state = sev_guest->state.state;
+info->handle = sev_guest->state.handle;
 }
 
 return info;
@@ -550,7 +550,7 @@ sev_launch_start(SevGuestState *sev)
 
 object_property_set_int(OBJECT(sev), start->handle, "handle",
 &error_abort);
-sev_set_guest_state(SEV_STATE_LAUNCH_UPDATE);
+sev_set_guest_state(sev, SEV_STATE_LAUNCH_UPDATE);
 s->handle = start->handle;
 s->policy = start->policy;
 ret = 0;
@@ -563,7 +563,7 @@ out:
 }
 
 static int
-sev_launch_update_data(uint8_t *addr, uint64_t len)
+sev_launch_update_data(SevGuestState *sev, uint8_t *addr, uint64_t len)
 {
 int ret, fw_error;
 struct kvm_sev_launch_update_data update;
@@ -575,7 +575,7 @@ sev_launch_update_data(uint8_t *addr, uint64_t len)
 update.uaddr = (__u64)(unsigned long)addr;
 update.len = len;
 trace_kvm_sev_launch_update_data(addr, len);
-ret = sev_ioctl(sev_state->sev_fd, KVM_SEV_LAUNCH_UPDATE_DATA,
+ret = sev_ioctl(sev->state.sev_fd, KVM_SEV_LAUNCH_UPDATE_DATA,
 &update, &fw_error);
 if (ret) {
 error_report("%s: LAUNCH_UPDATE ret=%d fw_error=%d '%s'",
@@ -588,19 +588,20 @@ sev_launch_update_data(uint8_t *addr, uint64_t len)
 static void
 sev_launch_get_measure(Notifier *notifier, void *unused)
 {
+SevGuestState *sev = sev_guest;
 int ret, error;
 guchar *data;
-SEVState *s = sev_state;
+SEVState *s = &sev->state;
 struct kvm_sev_launch_measure *measurement;
 
-if (!sev_check_state(SEV_STATE_LAUNCH_UPDATE)) {
+if (!sev_check_state(sev, SEV_STATE_LAUNCH_UPDATE)) {
 return;
 }
 
 measurement = g_new0(struct kvm_sev_launch_measure, 1);
 
 /* query the measurement blob length */
-ret = sev_ioctl(sev_state->sev_fd, KVM_SEV_LAUNCH_MEASURE,
+ret = sev_ioctl(sev->state.sev_fd, KVM_SEV_LAUNCH_MEASURE,
 measurement, &error);
 if (!measurement->len) {
 error_report("%s: LAUNCH_MEASURE ret=%d fw_error=%d

[RFC v2 03/18] target/i386: sev: Rename QSevGuestInfo

At the moment this is a purely passive object which is just a container for
information used elsewhere, hence the name.  I'm going to change that
though, so as a preliminary rename it to SevGuestState.

That name risks confusion with both SEVState and SevState, but I'll be
working on that in following patches.

Signed-off-by: David Gibson 
---
 target/i386/sev.c | 87 ---
 1 file changed, 44 insertions(+), 43 deletions(-)

diff --git a/target/i386/sev.c b/target/i386/sev.c
index 53def5f41a..b6ed719fb5 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -29,22 +29,23 @@
 #include "trace.h"
 #include "migration/blocker.h"
 
-#define TYPE_QSEV_GUEST_INFO "sev-guest"
-#define QSEV_GUEST_INFO(obj)  \
-OBJECT_CHECK(QSevGuestInfo, (obj), TYPE_QSEV_GUEST_INFO)
+#define TYPE_SEV_GUEST "sev-guest"
+#define SEV_GUEST(obj)  \
+OBJECT_CHECK(SevGuestState, (obj), TYPE_SEV_GUEST)
 
-typedef struct QSevGuestInfo QSevGuestInfo;
+typedef struct SevGuestState SevGuestState;
 
 /**
- * QSevGuestInfo:
+ * SevGuestState:
  *
- * The QSevGuestInfo object is used for creating a SEV guest.
+ * The SevGuestState object is used for creating and managing a SEV
+ * guest.
  *
  * # $QEMU \
  * -object sev-guest,id=sev0 \
  * -machine ...,memory-encryption=sev0
  */
-struct QSevGuestInfo {
+struct SevGuestState {
 Object parent_obj;
 
 char *sev_device;
@@ -57,7 +58,7 @@ struct QSevGuestInfo {
 };
 
 struct SEVState {
-QSevGuestInfo *sev_info;
+SevGuestState *sev_info;
 uint8_t api_major;
 uint8_t api_minor;
 uint8_t build_id;
@@ -235,82 +236,82 @@ static struct RAMBlockNotifier sev_ram_notifier = {
 };
 
 static void
-qsev_guest_finalize(Object *obj)
+sev_guest_finalize(Object *obj)
 {
 }
 
 static char *
-qsev_guest_get_session_file(Object *obj, Error **errp)
+sev_guest_get_session_file(Object *obj, Error **errp)
 {
-QSevGuestInfo *s = QSEV_GUEST_INFO(obj);
+SevGuestState *s = SEV_GUEST(obj);
 
 return s->session_file ? g_strdup(s->session_file) : NULL;
 }
 
 static void
-qsev_guest_set_session_file(Object *obj, const char *value, Error **errp)
+sev_guest_set_session_file(Object *obj, const char *value, Error **errp)
 {
-QSevGuestInfo *s = QSEV_GUEST_INFO(obj);
+SevGuestState *s = SEV_GUEST(obj);
 
 s->session_file = g_strdup(value);
 }
 
 static char *
-qsev_guest_get_dh_cert_file(Object *obj, Error **errp)
+sev_guest_get_dh_cert_file(Object *obj, Error **errp)
 {
-QSevGuestInfo *s = QSEV_GUEST_INFO(obj);
+SevGuestState *s = SEV_GUEST(obj);
 
 return g_strdup(s->dh_cert_file);
 }
 
 static void
-qsev_guest_set_dh_cert_file(Object *obj, const char *value, Error **errp)
+sev_guest_set_dh_cert_file(Object *obj, const char *value, Error **errp)
 {
-QSevGuestInfo *s = QSEV_GUEST_INFO(obj);
+SevGuestState *s = SEV_GUEST(obj);
 
 s->dh_cert_file = g_strdup(value);
 }
 
 static char *
-qsev_guest_get_sev_device(Object *obj, Error **errp)
+sev_guest_get_sev_device(Object *obj, Error **errp)
 {
-QSevGuestInfo *sev = QSEV_GUEST_INFO(obj);
+SevGuestState *sev = SEV_GUEST(obj);
 
 return g_strdup(sev->sev_device);
 }
 
 static void
-qsev_guest_set_sev_device(Object *obj, const char *value, Error **errp)
+sev_guest_set_sev_device(Object *obj, const char *value, Error **errp)
 {
-QSevGuestInfo *sev = QSEV_GUEST_INFO(obj);
+SevGuestState *sev = SEV_GUEST(obj);
 
 sev->sev_device = g_strdup(value);
 }
 
 static void
-qsev_guest_class_init(ObjectClass *oc, void *data)
+sev_guest_class_init(ObjectClass *oc, void *data)
 {
 object_class_property_add_str(oc, "sev-device",
-  qsev_guest_get_sev_device,
-  qsev_guest_set_sev_device);
+  sev_guest_get_sev_device,
+  sev_guest_set_sev_device);
 object_class_property_set_description(oc, "sev-device",
 "SEV device to use");
 object_class_property_add_str(oc, "dh-cert-file",
-  qsev_guest_get_dh_cert_file,
-  qsev_guest_set_dh_cert_file);
+  sev_guest_get_dh_cert_file,
+  sev_guest_set_dh_cert_file);
 object_class_property_set_description(oc, "dh-cert-file",
 "guest owners DH certificate (encoded with base64)");
 object_class_property_add_str(oc, "session-file",
-  qsev_guest_get_session_file,
-  qsev_guest_set_session_file);
+  sev_guest_get_session_file,
+  sev_guest_set_session_file);
 object_class_property_set_description(oc, "session-file",
 "guest owners session parameters (encoded with base64)");
 }
 
 static void
-qsev_guest_init(Object

[RFC v2 04/18] target/i386: sev: Embed SEVState in SevGuestState

Currently SevGuestState contains only configuration information.  For
runtime state another non-QOM struct SEVState is allocated separately.

Simplify things by instead embedding the SEVState structure in
SevGuestState.

Signed-off-by: David Gibson 
---
 target/i386/sev.c | 54 +--
 1 file changed, 29 insertions(+), 25 deletions(-)

diff --git a/target/i386/sev.c b/target/i386/sev.c
index b6ed719fb5..b4ab9720d6 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -35,6 +35,22 @@
 
 typedef struct SevGuestState SevGuestState;
 
+struct SEVState {
+uint8_t api_major;
+uint8_t api_minor;
+uint8_t build_id;
+uint32_t policy;
+uint64_t me_mask;
+uint32_t cbitpos;
+uint32_t reduced_phys_bits;
+uint32_t handle;
+int sev_fd;
+SevState state;
+gchar *measurement;
+};
+
+typedef struct SEVState SEVState;
+
 /**
  * SevGuestState:
  *
@@ -48,6 +64,7 @@ typedef struct SevGuestState SevGuestState;
 struct SevGuestState {
 Object parent_obj;
 
+/* configuration parameters */
 char *sev_device;
 uint32_t policy;
 uint32_t handle;
@@ -55,25 +72,11 @@ struct SevGuestState {
 char *session_file;
 uint32_t cbitpos;
 uint32_t reduced_phys_bits;
-};
 
-struct SEVState {
-SevGuestState *sev_info;
-uint8_t api_major;
-uint8_t api_minor;
-uint8_t build_id;
-uint32_t policy;
-uint64_t me_mask;
-uint32_t cbitpos;
-uint32_t reduced_phys_bits;
-uint32_t handle;
-int sev_fd;
-SevState state;
-gchar *measurement;
+/* runtime state */
+SEVState state;
 };
 
-typedef struct SEVState SEVState;
-
 #define DEFAULT_GUEST_POLICY    0x1 /* disable debug */
 #define DEFAULT_SEV_DEVICE  "/dev/sev"
 
@@ -506,12 +509,12 @@ sev_read_file_base64(const char *filename, guchar **data, gsize *len)
 }
 
 static int
-sev_launch_start(SEVState *s)
+sev_launch_start(SevGuestState *sev)
 {
+SEVState *s = &sev->state;
 gsize sz;
 int ret = 1;
 int fw_error, rc;
-SevGuestState *sev = s->sev_info;
 struct kvm_sev_launch_start *start;
 guchar *session = NULL, *dh_cert = NULL;
 
@@ -686,6 +689,7 @@ sev_vm_state_change(void *opaque, int running, RunState 
state)
 void *
 sev_guest_init(const char *id)
 {
+SevGuestState *sev;
 SEVState *s;
 char *devname;
 int ret, fw_error;
@@ -693,27 +697,27 @@ sev_guest_init(const char *id)
 uint32_t host_cbitpos;
 struct sev_user_data_status status = {};
 
-sev_state = s = g_new0(SEVState, 1);
-s->sev_info = lookup_sev_guest_info(id);
-if (!s->sev_info) {
+sev = lookup_sev_guest_info(id);
+if (!sev) {
 error_report("%s: '%s' is not a valid '%s' object",
  __func__, id, TYPE_SEV_GUEST);
 goto err;
 }
 
+sev_state = s = &sev->state;
 s->state = SEV_STATE_UNINIT;
 
 host_cpuid(0x8000001F, 0, NULL, &ebx, NULL, NULL);
 host_cbitpos = ebx & 0x3f;
 
-s->cbitpos = object_property_get_int(OBJECT(s->sev_info), "cbitpos", NULL);
+s->cbitpos = object_property_get_int(OBJECT(sev), "cbitpos", NULL);
 if (host_cbitpos != s->cbitpos) {
 error_report("%s: cbitpos check failed, host '%d' requested '%d'",
  __func__, host_cbitpos, s->cbitpos);
 goto err;
 }
 
-s->reduced_phys_bits = object_property_get_int(OBJECT(s->sev_info),
+s->reduced_phys_bits = object_property_get_int(OBJECT(sev),
 "reduced-phys-bits", NULL);
 if (s->reduced_phys_bits < 1) {
 error_report("%s: reduced_phys_bits check failed, it should be >=1,"
@@ -723,7 +727,7 @@ sev_guest_init(const char *id)
 
 s->me_mask = ~(1UL << s->cbitpos);
 
-devname = object_property_get_str(OBJECT(s->sev_info), "sev-device", NULL);
+devname = object_property_get_str(OBJECT(sev), "sev-device", NULL);
 s->sev_fd = open(devname, O_RDWR);
 if (s->sev_fd < 0) {
 error_report("%s: Failed to open %s '%s'", __func__,
@@ -754,7 +758,7 @@ sev_guest_init(const char *id)
 goto err;
 }
 
-ret = sev_launch_start(s);
+ret = sev_launch_start(sev);
 if (ret) {
 error_report("%s: failed to create encryption context", __func__);
 goto err;
-- 
2.26.2

[RFC v2 07/18] target/i386: sev: Remove redundant policy field

SEVState::policy is set from the final value of the policy field in the
parameter structure for the KVM_SEV_LAUNCH_START ioctl().  But, AFAICT
that ioctl() won't ever change it from the original supplied value which
comes from SevGuestState::policy.

So, remove this field and just use SevGuestState::policy directly.

Signed-off-by: David Gibson 
---
 target/i386/sev.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/target/i386/sev.c b/target/i386/sev.c
index d25af37136..4b261beaa7 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -39,7 +39,6 @@ struct SEVState {
 uint8_t api_major;
 uint8_t api_minor;
 uint8_t build_id;
-uint32_t policy;
 uint64_t me_mask;
 uint32_t handle;
 int sev_fd;
@@ -397,7 +396,7 @@ sev_get_info(void)
 info->api_major = sev_guest->state.api_major;
 info->api_minor = sev_guest->state.api_minor;
 info->build_id = sev_guest->state.build_id;
-info->policy = sev_guest->state.policy;
+info->policy = sev_guest->policy;
 info->state = sev_guest->state.state;
 info->handle = sev_guest->state.handle;
 }
@@ -520,8 +519,7 @@ sev_launch_start(SevGuestState *sev)
 
 start->handle = object_property_get_int(OBJECT(sev), "handle",
 &error_abort);
-start->policy = object_property_get_int(OBJECT(sev), "policy",
-&error_abort);
+start->policy = sev->policy;
 if (sev->session_file) {
 if (sev_read_file_base64(sev->session_file, &session, &sz) < 0) {
 goto out;
@@ -550,7 +548,6 @@ sev_launch_start(SevGuestState *sev)
 &error_abort);
 sev_set_guest_state(sev, SEV_STATE_LAUNCH_UPDATE);
 s->handle = start->handle;
-s->policy = start->policy;
 ret = 0;
 
 out:
-- 
2.26.2

[RFC v2 01/18] target/i386: sev: Remove unused QSevGuestInfoClass

This structure is nothing but an empty wrapper around the parent class,
which by QOM conventions means we don't need it at all.

Signed-off-by: David Gibson 
---
 target/i386/sev.c  | 1 -
 target/i386/sev_i386.h | 5 -
 2 files changed, 6 deletions(-)

diff --git a/target/i386/sev.c b/target/i386/sev.c
index 51cdbe5496..2312510cf2 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -287,7 +287,6 @@ static const TypeInfo qsev_guest_info = {
 .name = TYPE_QSEV_GUEST_INFO,
 .instance_size = sizeof(QSevGuestInfo),
 .instance_finalize = qsev_guest_finalize,
-.class_size = sizeof(QSevGuestInfoClass),
 .class_init = qsev_guest_class_init,
 .instance_init = qsev_guest_init,
 .interfaces = (InterfaceInfo[]) {
diff --git a/target/i386/sev_i386.h b/target/i386/sev_i386.h
index 8ada9d385d..4f193642ac 100644
--- a/target/i386/sev_i386.h
+++ b/target/i386/sev_i386.h
@@ -41,7 +41,6 @@ extern char *sev_get_launch_measurement(void);
 extern SevCapability *sev_get_capabilities(void);
 
 typedef struct QSevGuestInfo QSevGuestInfo;
-typedef struct QSevGuestInfoClass QSevGuestInfoClass;
 
 /**
  * QSevGuestInfo:
@@ -64,10 +63,6 @@ struct QSevGuestInfo {
 uint32_t reduced_phys_bits;
 };
 
-struct QSevGuestInfoClass {
-ObjectClass parent_class;
-};
-
 struct SEVState {
 QSevGuestInfo *sev_info;
 uint8_t api_major;
-- 
2.26.2

[PATCH v8 5/8] acpi: Align the size to 128k

If the table size changes between virt_acpi_build and
virt_acpi_build_update, the new size is not propagated to
UEFI. Therefore, align the size to 128k, which is large enough
and matches x86. A warning is printed if the tables exceed 64k,
which signals that the alignment size should be increased.
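
The rounding and the warning threshold described above can be sketched
in isolation as follows. This is a minimal sketch, not the QEMU code:
TABLE_ALIGN, round_up_size and table_needs_warning are hypothetical
names, and the 128k value is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constant mirroring the 128 KiB alignment described above. */
#define TABLE_ALIGN 0x20000u

/* Round len up to the next multiple of align (align must be non-zero). */
static size_t round_up_size(size_t len, size_t align)
{
    assert(align != 0);
    return (len + align - 1) / align * align;
}

/* Return 1 when the table has outgrown half of the aligned size (64 KiB). */
static int table_needs_warning(size_t len)
{
    return len > TABLE_ALIGN / 2;
}
```

With these helpers, a 7802-byte table is padded to 0x20000 bytes and
does not trigger the warning, while anything above 64 KiB does.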

Signed-off-by: Yubo Miao 
---
 hw/arm/virt-acpi-build.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 14fcabd197..d0616738e5 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -57,6 +57,8 @@
 #include "hw/pci/pcie_host.h"
 #define ARM_SPI_BASE 32
 
+#define ACPI_BUILD_TABLE_SIZE 0x20000
+
 static void acpi_dsdt_add_cpus(Aml *scope, int smp_cpus)
 {
 uint16_t i;
@@ -885,6 +887,15 @@ struct AcpiBuildState {
 bool patched;
 } AcpiBuildState;
 
+static void acpi_align_size(GArray *blob, unsigned align)
+{
+/*
+ * Align size to multiple of given size. This reduces the chance
+ * we need to change size in the future (breaking cross version migration).
+ */
+g_array_set_size(blob, ROUND_UP(acpi_data_len(blob), align));
+}
+
 static
 void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
 {
@@ -967,6 +978,20 @@ void virt_acpi_build(VirtMachineState *vms, 
AcpiBuildTables *tables)
 build_rsdp(tables->rsdp, tables->linker, _data);
 }
 
+/*
+ * The align size is 128k; warn if 64k is not enough, so that
+ * the align size can be increased.
+ */
+if (tables_blob->len > ACPI_BUILD_TABLE_SIZE / 2) {
+warn_report("ACPI table size %u exceeds %d bytes,"
+" migration may not work",
+tables_blob->len, ACPI_BUILD_TABLE_SIZE / 2);
+error_printf("Try removing CPUs, NUMA nodes, memory slots"
+ " or PCI bridges.");
+}
+acpi_align_size(tables_blob, ACPI_BUILD_TABLE_SIZE);
+
+
 /* Cleanup memory that's no longer used. */
 g_array_free(table_offsets, true);
 }
-- 
2.19.1

[PATCH v8 4/8] acpi: Refactor the source of host bridge and build tables for pxb

The resources of pxbs are obtained by crs_build, and the resources
used by pxbs are moved out of the resources defined for the host bridge.

The resources for a pxb are composed of the following two parts:
1. The bar space of the pci-bridge/pcie-root-port behind it
2. The config space of devices behind it.

Signed-off-by: Yubo Miao 
---
 hw/arm/virt-acpi-build.c | 127 +--
 1 file changed, 110 insertions(+), 17 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 24ebc06a9f..14fcabd197 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -51,6 +51,10 @@
 #include "migration/vmstate.h"
 #include "hw/acpi/ghes.h"
 
+#include "hw/arm/virt.h"
+#include "hw/pci/pci_bus.h"
+#include "hw/pci/pci_bridge.h"
+#include "hw/pci/pcie_host.h"
 #define ARM_SPI_BASE 32
 
 static void acpi_dsdt_add_cpus(Aml *scope, int smp_cpus)
@@ -268,19 +272,80 @@ static void acpi_dsdt_add_pci_osc(Aml *dev, Aml *scope)
 }
 
 static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
-  uint32_t irq, bool use_highmem, bool 
highmem_ecam)
+  uint32_t irq, bool use_highmem, bool 
highmem_ecam,
+  VirtMachineState *vms)
 {
 int ecam_id = VIRT_ECAM_ID(highmem_ecam);
-Aml *method, *crs;
+int i;
+Aml *method, *crs, *dev_pxb;
 hwaddr base_mmio = memmap[VIRT_PCIE_MMIO].base;
 hwaddr size_mmio = memmap[VIRT_PCIE_MMIO].size;
 hwaddr base_pio = memmap[VIRT_PCIE_PIO].base;
 hwaddr size_pio = memmap[VIRT_PCIE_PIO].size;
 hwaddr base_ecam = memmap[ecam_id].base;
 hwaddr size_ecam = memmap[ecam_id].size;
+CrsRangeEntry *entry;
+CrsRangeSet crs_range_set;
+
+crs_range_set_init(&crs_range_set);
 int nr_pcie_buses = size_ecam / PCIE_MMCFG_SIZE_MIN;
 
 Aml *dev = aml_device("%s", "PCI0");
+PCIHostState *s = PCI_GET_PCIE_HOST_STATE;
+
+PCIBus *bus = s->bus;
+/* start to construct the tables for pxb */
+if (bus) {
+QLIST_FOREACH(bus, &bus->child, sibling) {
+uint8_t bus_num = pci_bus_num(bus);
+uint8_t numa_node = pci_bus_numa_node(bus);
+
+if (!pci_bus_is_root(bus)) {
+continue;
+}
+/*
+ * 0 - (nr_pcie_buses - 1) is the bus range for the main
+ * host-bridge and it equals the MIN of the
+ * busNr defined for pxb-pcie.
+ */
+if (bus_num < nr_pcie_buses) {
+nr_pcie_buses = bus_num;
+}
+
+dev_pxb = aml_device("PC%.02X", bus_num);
+aml_append(dev_pxb, aml_name_decl("_HID", aml_string("PNP0A08")));
+aml_append(dev_pxb, aml_name_decl("_CID", aml_string("PNP0A03")));
+aml_append(dev_pxb, aml_name_decl("_ADR", aml_int(0)));
+aml_append(dev_pxb, aml_name_decl("_CCA", aml_int(1)));
+aml_append(dev_pxb, aml_name_decl("_SEG", aml_int(0)));
+aml_append(dev_pxb, aml_name_decl("_BBN", aml_int(bus_num)));
+aml_append(dev_pxb, aml_name_decl("_UID", aml_int(bus_num)));
+aml_append(dev_pxb,
+   aml_name_decl("_STR", aml_unicode("pxb Device")));
+if (numa_node != NUMA_NODE_UNASSIGNED) {
+method = aml_method("_PXM", 0, AML_NOTSERIALIZED);
+aml_append(method, aml_return(aml_int(numa_node)));
+aml_append(dev_pxb, method);
+}
+
+acpi_dsdt_add_pci_route_table(dev_pxb, scope, irq);
+
+/*
+ * Resources defined for PXBs are composed of the following parts:
+ * 1. The resources the pci-bridge/pcie-root-port need.
+ * 2. The resources the devices behind pxb need.
+ */
+crs = build_crs(PCI_HOST_BRIDGE(BUS(bus)->parent), &crs_range_set);
+aml_append(dev_pxb, aml_name_decl("_CRS", crs));
+
+acpi_dsdt_add_pci_osc(dev_pxb, scope);
+
+aml_append(scope, dev_pxb);
+
+}
+}
+
+/* tables for the main */
 aml_append(dev, aml_name_decl("_HID", aml_string("PNP0A08")));
 aml_append(dev, aml_name_decl("_CID", aml_string("PNP0A03")));
 aml_append(dev, aml_name_decl("_SEG", aml_int(0)));
@@ -301,25 +366,51 @@ static void acpi_dsdt_add_pci(Aml *scope, const 
MemMapEntry *memmap,
 aml_word_bus_number(AML_MIN_FIXED, AML_MAX_FIXED, AML_POS_DECODE,
 0x, 0x, nr_pcie_buses - 1, 0x,
 nr_pcie_buses));
-aml_append(rbuf,
-aml_dword_memory(AML_POS_DECODE, AML_MIN_FIXED, AML_MAX_FIXED,
- AML_NON_CACHEABLE, AML_READ_WRITE, 0x, base_mmio,
- base_mmio + size_mmio - 1, 0x, size_mmio));
-aml_append(rbuf,
-aml_dword_io(AML_MIN_FIXED, AML_MAX_FIXED, AML_POS_DECODE,
- AML_ENTIRE_RANGE, 0x, 0x, size_pio -

[PATCH v8 6/8] unit-test: The files changed.

The unit-test is separated into three patches:
1. The files changed, listed in bios-tables-test-allowed-diff.h
2. The unit-test
3. The binary file, and clearing bios-tables-test-allowed-diff.h

The ASL diff is also listed below.
Since the diff is 1000+ lines, some changes are omitted.

  * Original Table Header:
  * Signature"DSDT"
- * Length   0x14BB (5307)
+ * Length   0x1E7A (7802)
  * Revision 0x02
- * Checksum 0xD1
+ * Checksum 0x57
  * OEM ID   "BOCHS "
  * OEM Table ID "BXPCDSDT"
  * OEM Revision 0x0001 (1)

+Device (PC80)
+{
+Name (_HID, "PNP0A08" /* PCI Express Bus */)  // _HID: Hardware ID
+Name (_CID, "PNP0A03" /* PCI Bus */)  // _CID: Compatible ID
+Name (_ADR, Zero)  // _ADR: Address
+Name (_CCA, One)  // _CCA: Cache Coherency Attribute
+Name (_SEG, Zero)  // _SEG: PCI Segment
+Name (_BBN, 0x80)  // _BBN: BIOS Bus Number
+Name (_UID, 0x80)  // _UID: Unique ID
+Name (_STR, Unicode ("pxb Device"))  // _STR: Description String
+Name (_PRT, Package (0x80)  // _PRT: PCI Routing Table
+{
+Package (0x04)
+{
+0x,
+Zero,
+GSI0,
+Zero
+},
+

Packages are omitted.

+Package (0x04)
+{
+0x001F,
+0x03,
+GSI2,
+Zero
+}
+})
+Device (GSI0)
+{
+Name (_HID, "PNP0C0F" /* PCI Interrupt Link Device */)  // 
_HID: Hardware ID
+Name (_UID, Zero)  // _UID: Unique ID
+Name (_PRS, ResourceTemplate ()  // _PRS: Possible Resource 
Settings
+{
+Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive, 
,, )
+{
+0x0023,
+}
+})
+Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource 
Settings
+{
+Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive, 
,, )
+{
+0x0023,
+}
+})
+Method (_SRS, 1, NotSerialized)  // _SRS: Set Resource Settings
+{
+}
+}

GSI1,2,3 are omitted.

+Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
+{
+WordBusNumber (ResourceProducer, MinFixed, MaxFixed, PosDecode,
+0x, // Granularity
+0x0080, // Range Minimum
+0x0080, // Range Maximum
+0x, // Translation Offset
+0x0001, // Length
+,, )
+})
+Name (SUPP, Zero)
+Name (CTRL, Zero)
+Method (_OSC, 4, NotSerialized)  // _OSC: Operating System 
Capabilities
+{
+CreateDWordField (Arg3, Zero, CDW1)
+If ((Arg0 == ToUUID ("33db4d5b-1ff7-401c-9657-7441c03dd766") 
/* PCI Host Bridge Device */))
+{
+CreateDWordField (Arg3, 0x04, CDW2)
+CreateDWordField (Arg3, 0x08, CDW3)
+SUPP = CDW2 /* \_SB_.PC80._OSC.CDW2 */
+CTRL = CDW3 /* \_SB_.PC80._OSC.CDW3 */
+CTRL &= 0x1F
+If ((Arg1 != One))
+{
+CDW1 |= 0x08
+}
+
+If ((CDW3 != CTRL))
+{
+CDW1 |= 0x10
+}
+
+CDW3 = CTRL /* \_SB_.PC80.CTRL */
+Return (Arg3)
+}
+Else
+{
+CDW1 |= 0x04
+Return (Arg3)
+}
+}

DSM is omitted.

 Device (PCI0)
 {
 Name (_HID, "PNP0A08" /* PCI Express Bus */)  // _HID: Hardware ID
 WordBusNumber (ResourceProducer, MinFixed, MaxFixed, 
PosDecode,
 0x, // Granularity
 0x, // Range Minimum
-0x00FF, // Range Maximum
+0x007F, // Range Maximum
 0x, // Translation Offset
-0x0100, // Length
+0x0080, // Length

Signed-off-by: Yubo Miao 
---
 tests/qtest/bios-tables-test-allowed-diff.h | 1 +
 1 file changed, 1 insertion(+)

diff --git

[PATCH v8 3/8] acpi: Extract crs build from acpi_build.c

Extract crs build from acpi_build.c, so that the function can also be used
to build the crs for pxbs on arm. The resources are composed of two parts:
1. The bar space of pci-bridge/pcie-root-ports
2. The resources needed by devices behind PXBs.
The base and limit of memory/io are obtained from the config via two APIs:
pci_bridge_get_base and pci_bridge_get_limit
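
One of the helpers moved by this patch, crs_replace_with_free_ranges,
turns a list of used ranges into the free ranges of an enclosing
interval. A minimal standalone sketch of that computation, using plain
arrays instead of GPtrArray, could look like this; Range, range_cmp and
free_ranges are hypothetical names, and the used ranges are assumed not
to overlap:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical inclusive range [base, limit]. */
typedef struct {
    uint64_t base;
    uint64_t limit;
} Range;

/* Order ranges by ascending base address. */
static int range_cmp(const void *a, const void *b)
{
    const Range *ra = a, *rb = b;
    return (ra->base > rb->base) - (ra->base < rb->base);
}

/*
 * Compute the free ranges inside [start, end] given n used ranges.
 * 'used' is sorted in place; 'out' must have room for n + 1 entries.
 * Returns the number of free ranges written to 'out'.
 */
static size_t free_ranges(Range *used, size_t n,
                          uint64_t start, uint64_t end, Range *out)
{
    size_t count = 0;
    uint64_t free_base = start;

    assert(start <= end);
    qsort(used, n, sizeof(*used), range_cmp);
    for (size_t i = 0; i < n; i++) {
        if (free_base < used[i].base) {
            out[count].base = free_base;
            out[count].limit = used[i].base - 1;
            count++;
        }
        free_base = used[i].limit + 1;
    }
    if (free_base < end) {
        out[count].base = free_base;
        out[count].limit = end;
        count++;
    }
    return count;
}
```

For used ranges [0x10, 0x1f] and [0x30, 0x3f] inside [0x00, 0xff], the
free ranges are [0x00, 0x0f], [0x20, 0x2f] and [0x40, 0xff].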

Signed-off-by: Yubo Miao 
---
 hw/acpi/aml-build.c | 275 ++
 hw/i386/acpi-build.c| 285 
 include/hw/acpi/aml-build.h |  25 
 3 files changed, 300 insertions(+), 285 deletions(-)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 3681ec6e3d..5802597f8a 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -26,6 +26,9 @@
 #include "qemu/bitops.h"
 #include "sysemu/numa.h"
 #include "hw/boards.h"
+#include "hw/pci/pci_host.h"
+#include "hw/pci/pci_bus.h"
+#include "hw/pci/pci_bridge.h"
 
 static GArray *build_alloc_array(void)
 {
@@ -54,6 +57,125 @@ static void build_append_array(GArray *array, GArray *val)
 
 #define ACPI_NAMESEG_LEN 4
 
+void crs_range_insert(GPtrArray *ranges, uint64_t base, uint64_t limit)
+{
+CrsRangeEntry *entry;
+
+entry = g_malloc(sizeof(*entry));
+entry->base = base;
+entry->limit = limit;
+
+g_ptr_array_add(ranges, entry);
+}
+
+static void crs_range_free(gpointer data)
+{
+CrsRangeEntry *entry = (CrsRangeEntry *)data;
+g_free(entry);
+}
+
+void crs_range_set_init(CrsRangeSet *range_set)
+{
+range_set->io_ranges = g_ptr_array_new_with_free_func(crs_range_free);
+range_set->mem_ranges = g_ptr_array_new_with_free_func(crs_range_free);
+range_set->mem_64bit_ranges =
+g_ptr_array_new_with_free_func(crs_range_free);
+}
+
+void crs_range_set_free(CrsRangeSet *range_set)
+{
+g_ptr_array_free(range_set->io_ranges, true);
+g_ptr_array_free(range_set->mem_ranges, true);
+g_ptr_array_free(range_set->mem_64bit_ranges, true);
+}
+
+static gint crs_range_compare(gconstpointer a, gconstpointer b)
+{
+CrsRangeEntry *entry_a = *(CrsRangeEntry **)a;
+CrsRangeEntry *entry_b = *(CrsRangeEntry **)b;
+
+if (entry_a->base < entry_b->base) {
+return -1;
+} else if (entry_a->base > entry_b->base) {
+return 1;
+} else {
+return 0;
+}
+}
+
+/*
+ * crs_replace_with_free_ranges - given the 'used' ranges within [start - end]
+ * interval, computes the 'free' ranges from the same interval.
+ * Example: If the input array is { [a1 - a2],[b1 - b2] }, the function
+ * will return { [base - a1], [a2 - b1], [b2 - limit] }.
+ */
+void crs_replace_with_free_ranges(GPtrArray *ranges,
+ uint64_t start, uint64_t end)
+{
+GPtrArray *free_ranges = g_ptr_array_new();
+uint64_t free_base = start;
+int i;
+
+g_ptr_array_sort(ranges, crs_range_compare);
+for (i = 0; i < ranges->len; i++) {
+CrsRangeEntry *used = g_ptr_array_index(ranges, i);
+
+if (free_base < used->base) {
+crs_range_insert(free_ranges, free_base, used->base - 1);
+}
+
+free_base = used->limit + 1;
+}
+
+if (free_base < end) {
+crs_range_insert(free_ranges, free_base, end);
+}
+
+g_ptr_array_set_size(ranges, 0);
+for (i = 0; i < free_ranges->len; i++) {
+g_ptr_array_add(ranges, g_ptr_array_index(free_ranges, i));
+}
+
+g_ptr_array_free(free_ranges, true);
+}
+
+static void crs_range_merge(GPtrArray *range)
+{
+GPtrArray *tmp =  g_ptr_array_new_with_free_func(crs_range_free);
+CrsRangeEntry *entry;
+uint64_t range_base, range_limit;
+int i;
+
+if (!range->len) {
+return;
+}
+
+g_ptr_array_sort(range, crs_range_compare);
+
+entry = g_ptr_array_index(range, 0);
+range_base = entry->base;
+range_limit = entry->limit;
+for (i = 1; i < range->len; i++) {
+entry = g_ptr_array_index(range, i);
+if (entry->base - 1 == range_limit) {
+range_limit = entry->limit;
+} else {
+crs_range_insert(tmp, range_base, range_limit);
+range_base = entry->base;
+range_limit = entry->limit;
+}
+}
+crs_range_insert(tmp, range_base, range_limit);
+
+g_ptr_array_set_size(range, 0);
+for (i = 0; i < tmp->len; i++) {
+entry = g_ptr_array_index(tmp, i);
+crs_range_insert(range, entry->base, entry->limit);
+}
+g_ptr_array_free(tmp, true);
+}
+
+
 static void
 build_append_nameseg(GArray *array, const char *seg)
 {
@@ -1877,6 +1999,159 @@ build_hdr:
  "FACP", tbl->len - fadt_start, f->rev, oem_id, oem_table_id);
 }
 
+Aml *build_crs(PCIHostState *host, CrsRangeSet *range_set)
+{
+Aml *crs = aml_resource_template();
+CrsRangeSet temp_range_set;
+CrsRangeEntry *entry;
+uint8_t max_bus = pci_bus_num(host->bus);
+uint8_t type;
+int devfn;
+int i;
+
+

[PATCH v8 8/8] unit-test: Add the binary file and clear diff.h

Add the binary file DSDT.pxb and clear bios-tables-test-allowed-diff.h

Signed-off-by: Yubo Miao 
---
 tests/data/acpi/virt/DSDT.pxb   | Bin 0 -> 7802 bytes
 tests/qtest/bios-tables-test-allowed-diff.h |   1 -
 2 files changed, 1 deletion(-)
 create mode 100644 tests/data/acpi/virt/DSDT.pxb

diff --git a/tests/data/acpi/virt/DSDT.pxb b/tests/data/acpi/virt/DSDT.pxb
new file mode 100644
index 
..d5f0533a02d62bc2ae2db9b9de9484e5c06652fe
GIT binary patch
literal 7802
zcmeI1%WoT16o;=LiS6+tw&|kDbINPK+mQkW$GN2t>*4DY
z5GE1@x}%ZUunAHY{255B*s){5x*Prhb`0mvok@O()@MN3!S4-1E)-#wYff>!#D(
zdAOidc(<`_Z#avMce{3z_Jx#EdRxC{zj_wB({~#Ey~7#1TrS7^8|`MgZg<-hEUS3`
zR=cV84zJqVo#0rnvr#TrD*mx}-|jiN8EfisLTO+^WtIANRE0w4D0)D-m9e1W1K-`?I3==%fJX5q(*jFqgf=xI;=+i!j2&*$h#YZ@nAgr*
zjGD-ZNQ_Zn)R1vWWJD!K92l37u_Q7^B!{8KV*-1>~
zMqFZKfw6*G%GO{fw77VxlVHu;{{;Uks;S<
zUSgaFMgtjgosLV43&5~}QI+eoATeGBMiUuwolZ!MSAo$|L(Vw8cgfeg7$ixQ&>j5adlI-QXimw<5-8FHP@N{q|EcpDjVoz6*&6<};4
zL$1?#iE$Me9bnYtI$e+$*MPBw47pBA65|FiwtYtDhpxTi&!fB5E!WE{)VJ8wgqf
zQN7vI`@BBFX|2AGs_uAR4$*c+C9j#H9`7}G}OzaP-oI?ys;54Gnhd{>C9kg#AMP?FOx!@Ni*^?sUtLF
z{m6IphEmhyTLvL|jxf&=@0@|>h{+5lPa%4aGEZuLX$HYiYO>IiLiCI=`-
zGtNBYUS@Dfs3}8F3ehvcJgIFrSI@g73GPWDdRolWVxH8*p(lmtnPi?x=9%Q46ryK}
zd8U{rHGSwwA$q2nXPSAYxhI9_nPHw8=1EN=dQymU3el5po1kv9%#)f*
z^rR3ybIdcxJagQWLiEft*_CKNp>M9*>NInF%CxhI9_Szw+8=1EN}dQym<6U=jh
zc}{Ro3ej_tc}_ARl0q^1}>DMZgA^DHvYBKM>a
zJ!hEb4D+NW8a*jQ<%RFbfCxz%a$2{klCpF#ZNg;a9GtYVEInO;QL{D1OFrQi8
zXZ!;5q$V9bDMZf_^DHsX68EIgcZYER-{LZqUNJz{O9IR6<1D|`%V{aWZ#O?fGWMl>9uyC1I^JJHH~_tpRAI8KF
zx)=JKj#RwSmE*~$N5MF=JF5>K=)rpb$^MTSvtOU2a0Ek
zp0J{OUnX^Ex184IVqw1Dy1kP)(81l~?9rpUmR_}c+}-Uptij%4QEy

[PATCH v8 7/8] unit-test: Add testcase for pxb

Add testcase for pxb to make sure the ACPI table is correct for guest.

Signed-off-by: Yubo Miao 
---
 tests/qtest/bios-tables-test.c | 58 ++
 1 file changed, 52 insertions(+), 6 deletions(-)

diff --git a/tests/qtest/bios-tables-test.c b/tests/qtest/bios-tables-test.c
index c9843829b3..557b7e40ff 100644
--- a/tests/qtest/bios-tables-test.c
+++ b/tests/qtest/bios-tables-test.c
@@ -621,12 +621,21 @@ static void test_acpi_one(const char *params, test_data 
*data)
  * TODO: convert '-drive if=pflash' to new syntax (see e33763be7cd3)
  * when arm/virt boad starts to support it.
  */
-args = g_strdup_printf("-machine %s %s -accel tcg -nodefaults 
-nographic "
-"-drive if=pflash,format=raw,file=%s,readonly "
-"-drive if=pflash,format=raw,file=%s,snapshot=on -cdrom %s %s",
-data->machine, data->tcg_only ? "" : "-accel kvm",
-data->uefi_fl1, data->uefi_fl2, data->cd, params ? params : "");
-
+if (data->cd) {
+args = g_strdup_printf("-machine %s %s -accel tcg "
+"-nodefaults -nographic "
+"-drive if=pflash,format=raw,file=%s,readonly "
+"-drive if=pflash,format=raw,file=%s,snapshot=on -cdrom %s %s",
+data->machine, data->tcg_only ? "" : "-accel kvm",
+data->uefi_fl1, data->uefi_fl2, data->cd, params ? params : 
"");
+} else {
+args = g_strdup_printf("-machine %s %s -accel tcg "
+"-nodefaults -nographic "
+"-drive if=pflash,format=raw,file=%s,readonly "
+"-drive if=pflash,format=raw,file=%s,snapshot=on %s",
+data->machine, data->tcg_only ? "" : "-accel kvm",
+data->uefi_fl1, data->uefi_fl2, params ? params : "");
+}
 } else {
 /* Disable kernel irqchip to be able to override apic irq0. */
 args = g_strdup_printf("-machine %s,kernel-irqchip=off %s -accel tcg "
@@ -966,6 +975,40 @@ static void test_acpi_virt_tcg_numamem(void)
 
 }
 
+#ifdef CONFIG_PXB
+static void test_acpi_virt_tcg_pxb(void)
+{
+test_data data = {
+.machine = "virt",
+.tcg_only = true,
+.uefi_fl1 = "pc-bios/edk2-aarch64-code.fd",
+.uefi_fl2 = "pc-bios/edk2-arm-vars.fd",
+.ram_start = 0x40000000ULL,
+.scan_len = 128ULL * 1024 * 1024,
+};
+/*
+ * While using -cdrom, the cdrom would be auto-plugged into pxb-pcie,
+ * because the bus of pxb-pcie is also a root bus. That leads to
+ * the error that only a PCI/PCIE bridge can be plugged onto pxb.
+ * Therefore, the cdrom is defined and plugged onto the scsi controller
+ * to resolve the conflict.
+ */
+data.variant = ".pxb";
+test_acpi_one(" -device pcie-root-port,chassis=1,id=pci.1"
+  " -device virtio-scsi-pci,id=scsi0,bus=pci.1"
+  " -drive file="
+  
"tests/data/uefi-boot-images/bios-tables-test.aarch64.iso.qcow2,"
+  "if=none,media=cdrom,id=drive-scsi0-0-0-1,readonly=on"
+  " -device scsi-cd,bus=scsi0.0,scsi-id=0,"
+  "drive=drive-scsi0-0-0-1,id=scsi0-0-0-1,bootindex=1"
+  " -cpu cortex-a57"
+  " -device pxb-pcie,bus_nr=128",
+  &data);
+
+free_test_data(&data);
+}
+#endif
+
 static void test_acpi_tcg_acpi_hmat(const char *machine)
 {
 test_data data;
@@ -1058,6 +1101,9 @@ int main(int argc, char *argv[])
 qtest_add_func("acpi/virt", test_acpi_virt_tcg);
 qtest_add_func("acpi/virt/numamem", test_acpi_virt_tcg_numamem);
 qtest_add_func("acpi/virt/memhp", test_acpi_virt_tcg_memhp);
+#ifdef CONFIG_PXB
+qtest_add_func("acpi/virt/pxb", test_acpi_virt_tcg_pxb);
+#endif
 }
 ret = g_test_run();
 boot_sector_cleanup(disk);
-- 
2.19.1

[PATCH v8 2/8] fw_cfg: Write the extra roots into the fw_cfg

Write the extra roots into the fw_cfg so that the uefi can
discover them. Only if the uefi knows that there are extra roots
can the config space of devices behind those roots be obtained.
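
The value handed to the firmware is simply the number of extra root
buses, stored as a little-endian 64-bit integer. A minimal sketch of
that counting and encoding step, independent of the QEMU bus and fw_cfg
types (count_extra_roots and store_le64 are hypothetical helpers, and
the flag array stands in for walking the child buses), might be:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the entries flagged as root buses among n child buses. */
static uint64_t count_extra_roots(const int *is_root, size_t n)
{
    uint64_t extra = 0;

    for (size_t i = 0; i < n; i++) {
        if (is_root[i]) {
            extra++;
        }
    }
    return extra;
}

/* Store value into out[] in little-endian byte order. */
static void store_le64(uint8_t out[8], uint64_t value)
{
    assert(out != NULL);
    for (int i = 0; i < 8; i++) {
        out[i] = (uint8_t)(value >> (8 * i));
    }
}
```

Encoding the count with a fixed byte order keeps the file contents
identical regardless of the host's endianness.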

Signed-off-by: Yubo Miao 
---
 hw/arm/virt.c  |  8 
 hw/i386/pc.c   | 18 ++
 hw/nvram/fw_cfg.c  | 20 
 include/hw/nvram/fw_cfg.h  |  2 ++
 include/hw/pci/pcie_host.h |  4 
 5 files changed, 36 insertions(+), 16 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index c41d5f9778..f64ff42ab5 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -78,6 +78,8 @@
 #include "hw/virtio/virtio-iommu.h"
 #include "hw/char/pl011.h"
 #include "qemu/guest-random.h"
+#include "hw/pci/pci_bus.h"
+#include "hw/pci/pcie_host.h"
 
 #define DEFINE_VIRT_MACHINE_LATEST(major, minor, latest) \
 static void virt_##major##_##minor##_class_init(ObjectClass *oc, \
@@ -1457,6 +1459,10 @@ void virt_machine_done(Notifier *notifier, void *data)
 ARMCPU *cpu = ARM_CPU(first_cpu);
 struct arm_boot_info *info = &vms->bootinfo;
 AddressSpace *as = arm_boot_address_space(cpu, info);
+PCIHostState *s = PCI_GET_PCIE_HOST_STATE;
+
+PCIBus *bus = s->bus;
+FWCfgState *fw_cfg = vms->fw_cfg;
 
 /*
  * If the user provided a dtb, we assume the dynamic sysbus nodes
@@ -1475,6 +1481,8 @@ void virt_machine_done(Notifier *notifier, void *data)
 exit(1);
 }
 
+fw_cfg_write_extra_pci_roots(bus, fw_cfg);
+
 virt_acpi_setup(vms);
 virt_build_smbios(vms);
 }
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 2128f3d6fe..94b1d3df14 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -842,26 +842,12 @@ void pc_machine_done(Notifier *notifier, void *data)
 PCMachineState, machine_done);
 X86MachineState *x86ms = X86_MACHINE(pcms);
 PCIBus *bus = pcms->bus;
+FWCfgState *fw_cfg = x86ms->fw_cfg;
 
 /* set the number of CPUs */
 rtc_set_cpus_count(x86ms->rtc, x86ms->boot_cpus);
 
-if (bus) {
-int extra_hosts = 0;
-
-QLIST_FOREACH(bus, &bus->child, sibling) {
-/* look for expander root buses */
-if (pci_bus_is_root(bus)) {
-extra_hosts++;
-}
-}
-if (extra_hosts && x86ms->fw_cfg) {
-uint64_t *val = g_malloc(sizeof(*val));
-*val = cpu_to_le64(extra_hosts);
-fw_cfg_add_file(x86ms->fw_cfg,
-"etc/extra-pci-roots", val, sizeof(*val));
-}
-}
+fw_cfg_write_extra_pci_roots(bus, fw_cfg);
 
 acpi_setup();
 if (x86ms->fw_cfg) {
diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
index 8dd50c2c72..824cfcf054 100644
--- a/hw/nvram/fw_cfg.c
+++ b/hw/nvram/fw_cfg.c
@@ -40,6 +40,7 @@
 #include "qemu/cutils.h"
 #include "qapi/error.h"
 #include "hw/acpi/aml-build.h"
+#include "hw/pci/pci_bus.h"
 
 #define FW_CFG_FILE_SLOTS_DFLT 0x20
 
@@ -742,6 +743,25 @@ static void *fw_cfg_modify_bytes_read(FWCfgState *s, 
uint16_t key,
 return ptr;
 }
 
+void fw_cfg_write_extra_pci_roots(PCIBus *bus, FWCfgState *s)
+{
+if (bus) {
+int extra_hosts = 0;
+QLIST_FOREACH(bus, &bus->child, sibling) {
+/* look for expander root buses */
+if (pci_bus_is_root(bus)) {
+extra_hosts++;
+}
+}
+if (extra_hosts && s) {
+uint64_t *val = g_malloc(sizeof(*val));
+*val = cpu_to_le64(extra_hosts);
+fw_cfg_add_file(s,
+   "etc/extra-pci-roots", val, sizeof(*val));
+}
+}
+}
+
 void fw_cfg_add_bytes(FWCfgState *s, uint16_t key, void *data, size_t len)
 {
 trace_fw_cfg_add_bytes(key, trace_key_name(key), len);
diff --git a/include/hw/nvram/fw_cfg.h b/include/hw/nvram/fw_cfg.h
index 25d9307018..eb86ee5ae6 100644
--- a/include/hw/nvram/fw_cfg.h
+++ b/include/hw/nvram/fw_cfg.h
@@ -79,6 +79,8 @@ struct FWCfgMemState {
 MemoryRegionOps wide_data_ops;
 };
 
+void fw_cfg_write_extra_pci_roots(PCIBus *bus, FWCfgState *s);
+
 /**
  * fw_cfg_add_bytes:
  * @s: fw_cfg device being modified
diff --git a/include/hw/pci/pcie_host.h b/include/hw/pci/pcie_host.h
index 3f7b9886d1..c93f2d7011 100644
--- a/include/hw/pci/pcie_host.h
+++ b/include/hw/pci/pcie_host.h
@@ -27,6 +27,10 @@
 #define TYPE_PCIE_HOST_BRIDGE "pcie-host-bridge"
 #define PCIE_HOST_BRIDGE(obj) \
 OBJECT_CHECK(PCIExpressHost, (obj), TYPE_PCIE_HOST_BRIDGE)
+#define PCI_GET_PCIE_HOST_STATE \
+OBJECT_CHECK(PCIHostState, \
+ object_resolve_path_type("", "pcie-host-bridge", NULL), \
+ TYPE_PCIE_HOST_BRIDGE)
 
 #define PCIE_HOST_MCFG_BASE "MCFG"
 #define PCIE_HOST_MCFG_SIZE "mcfg_size"
-- 
2.19.1

[PATCH v8 1/8] acpi: Extract two APIs from acpi_dsdt_add_pci

Extract two APIs, acpi_dsdt_add_pci_route_table and
acpi_dsdt_add_pci_osc, from acpi_dsdt_add_pci. The first
API specifies the pci route table and the second
API declares the operating system capabilities.
These two APIs will be used to describe the pxb-pcie in the DSDT.

Signed-off-by: Yubo Miao 
---
 hw/arm/virt-acpi-build.c | 129 ++-
 1 file changed, 72 insertions(+), 57 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 1b0a584c7b..24ebc06a9f 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -148,29 +148,11 @@ static void acpi_dsdt_add_virtio(Aml *scope,
 }
 }
 
-static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
-  uint32_t irq, bool use_highmem, bool 
highmem_ecam)
+static void acpi_dsdt_add_pci_route_table(Aml *dev, Aml *scope,
+  uint32_t irq)
 {
-int ecam_id = VIRT_ECAM_ID(highmem_ecam);
-Aml *method, *crs, *ifctx, *UUID, *ifctx1, *elsectx, *buf;
 int i, slot_no;
-hwaddr base_mmio = memmap[VIRT_PCIE_MMIO].base;
-hwaddr size_mmio = memmap[VIRT_PCIE_MMIO].size;
-hwaddr base_pio = memmap[VIRT_PCIE_PIO].base;
-hwaddr size_pio = memmap[VIRT_PCIE_PIO].size;
-hwaddr base_ecam = memmap[ecam_id].base;
-hwaddr size_ecam = memmap[ecam_id].size;
-int nr_pcie_buses = size_ecam / PCIE_MMCFG_SIZE_MIN;
-
-Aml *dev = aml_device("%s", "PCI0");
-aml_append(dev, aml_name_decl("_HID", aml_string("PNP0A08")));
-aml_append(dev, aml_name_decl("_CID", aml_string("PNP0A03")));
-aml_append(dev, aml_name_decl("_SEG", aml_int(0)));
-aml_append(dev, aml_name_decl("_BBN", aml_int(0)));
-aml_append(dev, aml_name_decl("_UID", aml_string("PCI0")));
-aml_append(dev, aml_name_decl("_STR", aml_unicode("PCIe 0 Device")));
-aml_append(dev, aml_name_decl("_CCA", aml_int(1)));
-
+Aml *method, *crs;
 /* Declare the PCI Routing Table. */
 Aml *rt_pkg = aml_varpackage(PCI_SLOT_MAX * PCI_NUM_PINS);
 for (slot_no = 0; slot_no < PCI_SLOT_MAX; slot_no++) {
@@ -206,41 +188,11 @@ static void acpi_dsdt_add_pci(Aml *scope, const 
MemMapEntry *memmap,
 aml_append(dev_gsi, method);
 aml_append(dev, dev_gsi);
 }
+}
 
-method = aml_method("_CBA", 0, AML_NOTSERIALIZED);
-aml_append(method, aml_return(aml_int(base_ecam)));
-aml_append(dev, method);
-
-method = aml_method("_CRS", 0, AML_NOTSERIALIZED);
-Aml *rbuf = aml_resource_template();
-aml_append(rbuf,
-aml_word_bus_number(AML_MIN_FIXED, AML_MAX_FIXED, AML_POS_DECODE,
-0x, 0x, nr_pcie_buses - 1, 0x,
-nr_pcie_buses));
-aml_append(rbuf,
-aml_dword_memory(AML_POS_DECODE, AML_MIN_FIXED, AML_MAX_FIXED,
- AML_NON_CACHEABLE, AML_READ_WRITE, 0x, base_mmio,
- base_mmio + size_mmio - 1, 0x, size_mmio));
-aml_append(rbuf,
-aml_dword_io(AML_MIN_FIXED, AML_MAX_FIXED, AML_POS_DECODE,
- AML_ENTIRE_RANGE, 0x, 0x, size_pio - 1, base_pio,
- size_pio));
-
-if (use_highmem) {
-hwaddr base_mmio_high = memmap[VIRT_HIGH_PCIE_MMIO].base;
-hwaddr size_mmio_high = memmap[VIRT_HIGH_PCIE_MMIO].size;
-
-aml_append(rbuf,
-aml_qword_memory(AML_POS_DECODE, AML_MIN_FIXED, AML_MAX_FIXED,
- AML_NON_CACHEABLE, AML_READ_WRITE, 0x,
- base_mmio_high,
- base_mmio_high + size_mmio_high - 1, 0x,
- size_mmio_high));
-}
-
-aml_append(method, aml_return(rbuf));
-aml_append(dev, method);
-
+static void acpi_dsdt_add_pci_osc(Aml *dev, Aml *scope)
+{
+Aml *method, *UUID, *ifctx, *ifctx1, *elsectx, *buf;
 /* Declare an _OSC (OS Control Handoff) method */
 aml_append(dev, aml_name_decl("SUPP", aml_int(0)));
 aml_append(dev, aml_name_decl("CTRL", aml_int(0)));
@@ -248,7 +200,8 @@ static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry 
*memmap,
 aml_append(method,
 aml_create_dword_field(aml_arg(3), aml_int(0), "CDW1"));
 
-/* PCI Firmware Specification 3.0
+/*
+ * PCI Firmware Specification 3.0
  * 4.5.1. _OSC Interface for PCI Host Bridge Devices
  * The _OSC interface for a PCI/PCI-X/PCI Express hierarchy is
  * identified by the Universal Unique IDentifier (UUID)
@@ -293,7 +246,8 @@ static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry 
*memmap,
 
 method = aml_method("_DSM", 4, AML_NOTSERIALIZED);
 
-/* PCI Firmware Specification 3.0
+/*
+ * PCI Firmware Specification 3.0
  * 4.6.1. _DSM for PCI Express Slot Information
  * The UUID in _DSM in this context is
  * {E5C937D0-3553-4D7A-9117-EA4D19C3434D}
@@ -311,6 +265,67 @@ static void acpi_dsdt_add_pci(Aml *scope,

[PATCH v8 0/8] pci_expander_brdige:acpi: Support pxb-pcie for ARM

Changes with v7
v7->v8:
Fix the error: no member named 'fw_cfg' in 'struct PCMachineState'

I have one question about patch
[PATCH v8 8/8] unit-test: Add the binary file and clear diff.

I followed the instructions in tests/qtest/bios-tables-test.c
to update the golden master binaries and empty
tests/qtest/bios-tables-test-allowed-diff.h.

However, checkpatch.pl reports the error
ERROR: Do not add expected files together with tests.

Does this error matter?

Changes with v6
v6->v7:
Refactor fw_cfg_write_extra_pci_roots
Add API PCI_GET_PCIE_HOST_STATE
Fix typos

Changes with v5
v5->v6: stat crs_range_insert in aml_build.h

Changes with v4
v4->v5: Do not use specific resources for PXB.
Instead, the resources for a pxb are composed of the BAR space of the
pci-bridge/pcie-root-port behind it and the config space of the devices
behind it.

The config space of devices behind pxbs can be obtained only if the
BIOS (UEFI for Arm) supports multiple roots.
The uefi work is updated for discussion by the following link:
https://edk2.groups.io/g/devel/message/56901?p=,,,20,0,0,0::Created,,add+extra+roots+for+Arm,20,2,0,72723351
[PATCH] ArmVirtPkg/FdtPciHostBridgeLib: add extra roots for Arm.

Currently pxb-pcie is not supported on Arm
because pxb-pcie is not described in the DSDT table
and only one main host bridge is described in the ACPI tables,
which means it is impossible to present different I/O NUMA nodes
for different devices.

This patch series makes Arm support pxb-pcie.

Users can attach a pxb-pcie to a given NUMA node. An example command
is:

   -device pxb-pcie,id=pci.7,bus_nr=128,numa_node=0,bus=pcie.0,addr=0x9

Yubo Miao (8):
  acpi: Extract two APIs from acpi_dsdt_add_pci
  fw_cfg: Write the extra roots into the fw_cfg
  acpi: Extract crs build form acpi_build.c
  acpi: Refactor the source of host bridge and build tables for pxb
  acpi: Align the size to 128k
  unit-test: The files changed.
  unit-test: Add testcase for pxb
  unit-test: Add the binary file and clear diff.h

 hw/acpi/aml-build.c            | 275 +++
 hw/arm/virt-acpi-build.c       | 249 +---
 hw/arm/virt.c                  |   8 +
 hw/i386/acpi-build.c           | 285 -
 hw/i386/pc.c                   |  18 +--
 hw/nvram/fw_cfg.c              |  20 +++
 include/hw/acpi/aml-build.h    |  25 +++
 include/hw/nvram/fw_cfg.h      |   2 +
 include/hw/pci/pcie_host.h     |   4 +
 tests/data/acpi/virt/DSDT.pxb  | Bin 0 -> 7802 bytes
 tests/qtest/bios-tables-test.c |  58 ++-
 11 files changed, 579 insertions(+), 365 deletions(-)
 create mode 100644 tests/data/acpi/virt/DSDT.pxb

-- 
2.19.1

Re: [PATCH v3 9/9] target/riscv: Use a smaller guess size for no-MMU PMP

On Wed, May 20, 2020 at 5:40 AM Alistair Francis
 wrote:
>
> Signed-off-by: Alistair Francis 
> ---
>  target/riscv/pmp.c | 14 +-
>  1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/target/riscv/pmp.c b/target/riscv/pmp.c
> index 0e6b640fbd..607a991260 100644
> --- a/target/riscv/pmp.c
> +++ b/target/riscv/pmp.c
> @@ -233,12 +233,16 @@ bool pmp_hart_has_privs(CPURISCVState *env, 
> target_ulong addr,
>  return true;
>  }
>
> -/*
> - * if size is unknown (0), assume that all bytes
> - * from addr to the end of the page will be accessed.
> - */
>  if (size == 0) {
> -pmp_size = -(addr | TARGET_PAGE_MASK);
> +if (!riscv_feature(env, RISCV_FEATURE_MMU)) {

My previous comments were not fully addressed. I think the logic should be:

if (riscv_feature(env, RISCV_FEATURE_MMU))

Otherwise it does not match your comment and the commit title.

> +/*
> + * If size is unknown (0), assume that all bytes
> + * from addr to the end of the page will be accessed.
> + */
> +pmp_size = -(addr | TARGET_PAGE_MASK);
> +} else {
> +pmp_size = sizeof(target_ulong);
> +}
>  } else {
>  pmp_size = size;
>  }
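
A minimal sketch of the size guess as described by the commit title and the
comment, for illustration only. The helper name, the fixed 4 KiB page and the
64-bit address type are assumptions made here, not the QEMU code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumptions for this sketch: 4 KiB pages and a 64-bit target_ulong. */
#define SKETCH_PAGE_BITS 12
#define SKETCH_PAGE_MASK (~((UINT64_C(1) << SKETCH_PAGE_BITS) - 1))

/* Hypothetical helper, not a QEMU function. */
static uint64_t sketch_pmp_guess_size(uint64_t addr, uint64_t size,
                                      bool has_mmu)
{
    if (size != 0) {
        /* The access size is known, so use it as is. */
        return size;
    }
    if (has_mmu) {
        /*
         * OR-ing with the page mask sets every bit above the page offset,
         * so the two's complement negation yields page_size - offset,
         * i.e. the number of bytes from addr to the end of the page.
         */
        return 0 - (addr | SKETCH_PAGE_MASK);
    }
    /* No MMU: fall back to the smaller, word-sized guess. */
    return sizeof(uint64_t);
}
```

With this ordering the page-sized guess is taken only when an MMU is
present, which is what the comment and the commit title describe.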

Regards,
Bin

Re: [PATCH v3 2/9] target/riscv: Don't overwrite the reset vector

On Wed, May 20, 2020 at 5:39 AM Alistair Francis
 wrote:
>
> The reset vector is set in the init function, so don't set it again in
> realize.
>
> Signed-off-by: Alistair Francis 
> ---
>  target/riscv/cpu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>

Reviewed-by: Bin Meng

Re: [PATCH v4 5/5] target/i386: remove Icelake-Client CPU model

2020-05-20 Thread Robert Hoo

On Wed, 2020-05-20 at 10:17 +0100, Daniel P. Berrangé wrote:
> On Wed, May 20, 2020 at 10:10:07AM +0800, Chenyi Qiang wrote:
> > There are no Icelake Desktop products in the market. Remove the
> > Icelake-Client CPU model.
> 
> QEMU has been shipping this CPU model for 2 years now. Regardless
> of what CPUs Intel are selling, it is possible for users to be
> running VMs with Icelake-Client CPU if their host satisfies the
> listed features. So I don't think it is valid to remove this.
> 
This 'Icelake-Client' doesn't actually exist. How would we define its
feature list, and who would be using it? If a specially tailored feature
set is required, it can simply be achieved with '-cpu Icelake,+/-'
feature flags; that is the correct way.

I think we should remove it. When we realize something is not correct,
we should fix it ASAP. Leaving it there will only cause more serious
issues in the future.

> Regards,
> Daniel

Re: [PATCH] riscv: Change the default behavior if no -bios option is specified

Hi Alistair,

On Thu, May 7, 2020 at 5:02 AM Alistair Francis  wrote:
>
> On Tue, May 5, 2020 at 6:34 PM Bin Meng  wrote:
> >
> > Hi Alistair,
> >
> > On Wed, May 6, 2020 at 6:37 AM Alistair Francis  
> > wrote:
> > >
> > > On Tue, May 5, 2020 at 1:34 PM Alistair Francis  
> > > wrote:
> > > >
> > > > On Fri, May 1, 2020 at 5:21 AM Bin Meng  wrote:
> > > > >
> > > > > From: Bin Meng 
> > > > >
> > > > > Per QEMU deprecated doc, QEMU 4.1 introduced support for the -bios
> > > > > option in QEMU for RISC-V for the virt machine and sifive_u machine.
> > > > > The default behavior has been that QEMU does not automatically load
> > > > > any firmware if no -bios option is included.
> > > > >
> > > > > Now 2 releases passed, it's time to change the default behavior to
> > > > > load the default OpenSBI firmware automatically. The firmware is
> > > > > included with the QEMU release and no user interaction is required.
> > > > > All a user needs to do is specify the kernel they want to boot with
> > > > > the -kernel option.
> > > > >
> > > > > Signed-off-by: Bin Meng 
> > > >
> > > > Thanks!
> > > >
> > > > Reviewed-by: Alistair Francis 
> > > >
> > > > Applied to the RISC-V tree.
> > >
> > > This fails `make check`
> > >
> > > qemu-system-riscv64: Unable to load the RISC-V firmware
> > > "opensbi-riscv64-spike-fw_jump.elf"
> > > Broken pipe
> > > /scratch/alistair/software/qemu/tests/qtest/libqtest.c:166:
> > > kill_qemu() tried to terminate QEMU process but encountered exit
> > > status 1 (expected 0)
> > > ERROR - too few tests run (expected 7, got 2)
> > > make: *** [/scratch/alistair/software/qemu/tests/Makefile.include:637:
> > > check-qtest-riscv64] Error 1
> > >
> >
> > Please apply this patch to fix the "make check" as well.
> >
> > [5/5] riscv: Suppress the error report for QEMU testing with
> > riscv_find_firmware()
> > http://patchwork.ozlabs.org/project/qemu-devel/patch/1588348254-7241-6-git-send-email-bmeng...@gmail.com/
>
> In future please send all related patches in a single series.
>
> I have applied those two patches.

I checked https://github.com/alistair23/qemu/ but could not find where
these 2 patches were applied. Could you confirm that I am not looking in
the wrong place?

Regards,
Bin

RE: [PATCH 0/7] Latest COLO tree queued patches

2020-05-20 Thread Zhang, Chen



> -Original Message-
> From: Jason Wang 
> Sent: Wednesday, May 20, 2020 8:23 PM
> To: Zhang, Chen ; qemu-devel@nongnu.org; Lukas
> Straub 
> Cc: zhangc...@gmail.com
> Subject: Re: [PATCH 0/7] Latest COLO tree queued patches
> 
> 
> On 2020/5/20 5:07 PM, Zhang, Chen wrote:
> > It looks like ASan doesn't fully support makecontext/swapcontext functions
> > and may produce false positives in some cases.
> > And Lukas's patch may touch it.
> > What do we need to do?
> 
> 
> We first need to identify whether those are false positives. (Which I believe
> they are, since I don't think this series has any effect on those qtests.)
> 
> And maybe we can consider avoiding coroutines.
> 

Hi Lukas, can you double-check whether the errors reported against your patches are false positives?
If so, maybe we can ignore this error.

Thanks
Zhang Chen

> Thanks
> 
> 
> >
> > Thanks
> > Zhang Chen
> >
> >
> >> -Original Message-
> >> From: no-re...@patchew.org 
> >> Sent: Wednesday, May 20, 2020 12:41 PM
> >> To: Zhang, Chen 
> >> Cc: jasow...@redhat.com; Zhang, Chen ;
> qemu-
> >> de...@nongnu.org; zhangc...@gmail.com
> >> Subject: Re: [PATCH 0/7] Latest COLO tree queued patches
> >>
> >> Patchew URL: https://patchew.org/QEMU/20200519200207.17773-1-
> >> chen.zh...@intel.com/
> >>
> >>
> >>
> >> Hi,
> >>
> >> This series failed the asan build test. Please find the testing commands and
> >> their output below. If you have Docker installed, you can probably reproduce it
> >> locally.
> >>
> >> === TEST SCRIPT BEGIN ===
> >> #!/bin/bash
> >> export ARCH=x86_64
> >> make docker-image-fedora V=1 NETWORK=1
> >> time make docker-test-debug@fedora TARGET_LIST=x86_64-softmmu J=14 NETWORK=1
> >> === TEST SCRIPT END ===
> >>
> >> PASS 1 fdc-test /x86_64/fdc/cmos
> >> PASS 2 fdc-test /x86_64/fdc/no_media_on_start
> >> PASS 3 fdc-test /x86_64/fdc/read_without_media
> >> ==6214==WARNING: ASan doesn't fully support makecontext/swapcontext
> >> functions and may produce false positives in some cases!
> >> PASS 4 fdc-test /x86_64/fdc/media_change
> >> PASS 5 fdc-test /x86_64/fdc/sense_interrupt
> >> PASS 6 fdc-test /x86_64/fdc/relative_seek
> >> ---
> >> PASS 32 test-opts-visitor /visitor/opts/range/beyond
> >> PASS 33 test-opts-visitor /visitor/opts/dict/unvisited
> >> MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}
> >> tests/test-coroutine -m=quick -k --tap < /dev/null |
> >> ./scripts/tap-driver.pl --test-name="test-coroutine"
> >> ==6253==WARNING: ASan doesn't fully support makecontext/swapcontext
> >> functions and may produce false positives in some cases!
> >> ==6253==WARNING: ASan is ignoring requested __asan_handle_no_return:
> >> stack top: 0x7ffcb42bb000; bottom 0x7f9c45e2; size: 0x00606e49b000
> >> (414167183360)
> >> False positive error reports may follow
> >> For details see https://github.com/google/sanitizers/issues/189
> >> PASS 1 test-coroutine /basic/no-dangling-access
> >> ---
> >> PASS 13 test-aio /aio/event/wait/no-flush-cb
> >> PASS 11 fdc-test /x86_64/fdc/read_no_dma_18
> >> PASS 14 test-aio /aio/timer/schedule
> >> ==6268==WARNING: ASan doesn't fully support makecontext/swapcontext
> >> functions and may produce false positives in some cases!
> >> PASS 15 test-aio /aio/coroutine/queue-chaining
> >> PASS 16 test-aio /aio-gsource/flush
> >> PASS 17 test-aio /aio-gsource/bh/schedule
> >> ---
> >> PASS 27 test-aio /aio-gsource/event/wait/no-flush-cb
> >> PASS 28 test-aio /aio-gsource/timer/schedule
> >> MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}
> >> tests/test-aio-multithread -m=quick -k --tap < /dev/null |
> >> ./scripts/tap-driver.pl --test-name="test-aio-multithread"
> >> ==6273==WARNING: ASan doesn't fully support makecontext/swapcontext
> >> functions and may produce false positives in some cases!
> >> PASS 1 test-aio-multithread /aio/multi/lifecycle
> >> PASS 2 test-aio-multithread /aio/multi/schedule
> >> PASS 12 fdc-test /x86_64/fdc/read_no_dma_19
> >> PASS 13 fdc-test /x86_64/fdc/fuzz-registers
> >> MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}
> >> QTEST_QEMU_BINARY=x86_64-softmmu/qemu-system-x86_64
> >> QTEST_QEMU_IMG=qemu-img tests/qtest/ide-test -m=quick -k --tap <
> >> /dev/null | ./scripts/tap-driver.pl --test-name="ide-test"
> >> PASS 3 test-aio-multithread /aio/multi/mutex/contended
> >> ==6295==WARNING: ASan doesn't fully support makecontext/swapcontext
> >> functions and may produce false positives in some cases!
> >> PASS 1 ide-test /x86_64/ide/identify
> >> ==6306==WARNING: ASan doesn't fully support makecontext/swapcontext
> >> functions and may produce false positives in some cases!
> >> PASS 2 ide-test /x86_64/ide/flush
> >> ==6312==WARNING: ASan doesn't fully support makecontext/swapcontext
> >> functions and may produce false positives in some cases!
> >> PASS 3 ide-test /x86_64/ide/bmdma/simple_rw
> >> ==6318==WARNING: ASan doesn't fully support makecontext/swapcontext
> >> functions and may

Re: [PATCH v2 3/3] target/riscv: Drop support for ISA spec version 1.09.1

On Fri, May 8, 2020 at 3:22 AM Alistair Francis
 wrote:
>
> The RISC-V ISA spec version 1.09.1 has been deprecated in QEMU since
> 4.1. It's not commonly used so let's remove support for it.
>
> Signed-off-by: Alistair Francis 
> ---
>  target/riscv/cpu.c                            |  2 -
>  target/riscv/cpu.h                            |  1 -
>  target/riscv/csr.c                            | 82 ---
>  .../riscv/insn_trans/trans_privileged.inc.c   |  6 --
>  4 files changed, 17 insertions(+), 74 deletions(-)
>
> diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
> index 112f2e3a2f..eeb91f8513 100644
> --- a/target/riscv/cpu.c
> +++ b/target/riscv/cpu.c
> @@ -368,8 +368,6 @@ static void riscv_cpu_realize(DeviceState *dev, Error 
> **errp)
>  priv_version = PRIV_VERSION_1_11_0;
>  } else if (!g_strcmp0(cpu->cfg.priv_spec, "v1.10.0")) {
>  priv_version = PRIV_VERSION_1_10_0;
> -} else if (!g_strcmp0(cpu->cfg.priv_spec, "v1.9.1")) {
> -priv_version = PRIV_VERSION_1_09_1;
>  } else {
>  error_setg(errp,
> "Unsupported privilege spec version '%s'",
> diff --git a/target/riscv/cpu.h b/target/riscv/cpu.h
> index 76b98d7a33..c022539012 100644
> --- a/target/riscv/cpu.h
> +++ b/target/riscv/cpu.h
> @@ -73,7 +73,6 @@ enum {
>  RISCV_FEATURE_MISA
>  };
>
> -#define PRIV_VERSION_1_09_1 0x00010901
>  #define PRIV_VERSION_1_10_0 0x00011000
>  #define PRIV_VERSION_1_11_0 0x00011100
>
> diff --git a/target/riscv/csr.c b/target/riscv/csr.c
> index 11d184cd16..df3498b24f 100644
> --- a/target/riscv/csr.c
> +++ b/target/riscv/csr.c
> @@ -58,31 +58,11 @@ static int ctr(CPURISCVState *env, int csrno)
>  #if !defined(CONFIG_USER_ONLY)
>  CPUState *cs = env_cpu(env);
>  RISCVCPU *cpu = RISCV_CPU(cs);
> -uint32_t ctr_en = ~0u;
>
>  if (!cpu->cfg.ext_counters) {
>  /* The Counters extensions is not enabled */
>  return -1;
>  }
> -
> -/*
> - * The counters are always enabled at run time on newer priv specs, as 
> the
> - * CSR has changed from controlling that the counters can be read to
> - * controlling that the counters increment.
> - */
> -if (env->priv_ver > PRIV_VERSION_1_09_1) {
> -return 0;
> -}
> -
> -if (env->priv < PRV_M) {
> -ctr_en &= env->mcounteren;
> -}
> -if (env->priv < PRV_S) {
> -ctr_en &= env->scounteren;
> -}
> -if (!(ctr_en & (1u << (csrno & 31)))) {
> -return -1;
> -}
>  #endif
>  return 0;
>  }
> @@ -358,34 +338,21 @@ static int write_mstatus(CPURISCVState *env, int csrno, 
> target_ulong val)
>  int dirty;
>
>  /* flush tlb on mstatus fields that affect VM */
> -if (env->priv_ver <= PRIV_VERSION_1_09_1) {
> -if ((val ^ mstatus) & (MSTATUS_MXR | MSTATUS_MPP |
> -MSTATUS_MPRV | MSTATUS_SUM | MSTATUS_VM)) {
> -tlb_flush(env_cpu(env));
> -}
> -mask = MSTATUS_SIE | MSTATUS_SPIE | MSTATUS_MIE | MSTATUS_MPIE |
> -MSTATUS_SPP | MSTATUS_FS | MSTATUS_MPRV | MSTATUS_SUM |
> -MSTATUS_MPP | MSTATUS_MXR |
> -(validate_vm(env, get_field(val, MSTATUS_VM)) ?
> -MSTATUS_VM : 0);
> +if ((val ^ mstatus) & (MSTATUS_MXR | MSTATUS_MPP | MSTATUS_MPV |
> +MSTATUS_MPRV | MSTATUS_SUM)) {
> +tlb_flush(env_cpu(env));
>  }
> -if (env->priv_ver >= PRIV_VERSION_1_10_0) {
> -if ((val ^ mstatus) & (MSTATUS_MXR | MSTATUS_MPP | MSTATUS_MPV |
> -MSTATUS_MPRV | MSTATUS_SUM)) {
> -tlb_flush(env_cpu(env));
> -}
> -mask = MSTATUS_SIE | MSTATUS_SPIE | MSTATUS_MIE | MSTATUS_MPIE |
> -MSTATUS_SPP | MSTATUS_FS | MSTATUS_MPRV | MSTATUS_SUM |
> -MSTATUS_MPP | MSTATUS_MXR | MSTATUS_TVM | MSTATUS_TSR |
> -MSTATUS_TW;
> +mask = MSTATUS_SIE | MSTATUS_SPIE | MSTATUS_MIE | MSTATUS_MPIE |
> +MSTATUS_SPP | MSTATUS_FS | MSTATUS_MPRV | MSTATUS_SUM |
> +MSTATUS_MPP | MSTATUS_MXR | MSTATUS_TVM | MSTATUS_TSR |
> +MSTATUS_TW;
>  #if defined(TARGET_RISCV64)
> -/*
> - * RV32: MPV and MTL are not in mstatus. The current plan is to
> - * add them to mstatush. For now, we just don't support it.
> - */
> -mask |= MSTATUS_MTL | MSTATUS_MPV;
> +/*
> + * RV32: MPV and MTL are not in mstatus. The current plan is to
> + * add them to mstatush. For now, we just don't support it.
> + */
> +mask |= MSTATUS_MTL | MSTATUS_MPV;

The indentation level is wrong

>  #endif
> -}
>
>  mstatus = (mstatus & ~mask) | (val & mask);
>
> @@ -553,8 +520,7 @@ static int write_mcounteren(CPURISCVState *env, int 
> csrno, target_ulong val)
>  /* This regiser is replaced with CSR_MCOUNTINHIBIT in 1.11.0 */
>  static int read_mscounteren(CPURISCVState *env, int csrno, target_ulong *val)
>  {
> -

Re: [PATCH v2 2/3] target/riscv: Remove the deprecated CPUs

On Fri, May 8, 2020 at 3:19 AM Alistair Francis
 wrote:
>
> Signed-off-by: Alistair Francis 
> ---
>  target/riscv/cpu.c              | 28 
>  target/riscv/cpu.h              |  7 ---
>  tests/qtest/machine-none-test.c |  4 ++--
>  3 files changed, 2 insertions(+), 37 deletions(-)
>

Reviewed-by: Bin Meng

Re: [PATCH v2 1/3] hw/riscv: spike: Remove deprecated ISA specific machines

On Fri, May 8, 2020 at 3:21 AM Alistair Francis
 wrote:
>
> The ISA specific Spike machines have  been deprecated in QEMU since 4.1,

nits: there are 2 spaces between 'have' and 'been'

> let's finally remove them.
>
> Signed-off-by: Alistair Francis 
> Reviewed-by: Philippe Mathieu-Daudé 
> ---
>  hw/riscv/spike.c         | 217 ---
>  include/hw/riscv/spike.h |   6 +-
>  2 files changed, 2 insertions(+), 221 deletions(-)
>

Reviewed-by: Bin Meng

Re: [PATCH v6 0/7] dwc-hsotg (aka dwc2) USB host controller emulation

Patchew URL: https://patchew.org/QEMU/20200520235349.21215-1-pauld...@gmail.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Message-id: 20200520235349.21215-1-pauld...@gmail.com
Subject: [PATCH v6 0/7] dwc-hsotg (aka dwc2) USB host controller emulation
Type: series

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Switched to a new branch 'test'
343a8bc raspi2 acceptance test: add test for dwc-hsotg (dwc2) USB host
fc6529d wire in the dwc-hsotg (dwc2) USB host controller emulation
e37eb58 usb: add short-packet handling to usb-storage driver
e811428 dwc-hsotg (dwc2) USB host controller emulation
5830514 dwc-hsotg (dwc2) USB host controller state definitions
f2ab518 dwc-hsotg (dwc2) USB host controller register definitions
b919e2a raspi: add BCM2835 SOC MPHI emulation

=== OUTPUT BEGIN ===
1/7 Checking commit b919e2a6e4c5 (raspi: add BCM2835 SOC MPHI emulation)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#65: 
new file mode 100644

WARNING: line over 80 characters
#220: FILE: hw/misc/bcm2835_mphi.c:151:
+memory_region_init_io(>iomem, obj, _mmio_ops, s, "mphi", 
MPHI_MMIO_SIZE);

total: 0 errors, 2 warnings, 285 lines checked

Patch 1/7 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
2/7 Checking commit f2ab51848698 (dwc-hsotg (dwc2) USB host controller register 
definitions)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#18: 
new file mode 100644

WARNING: architecture specific defines should be avoided
#64: FILE: include/hw/usb/dwc2-regs.h:42:
+#ifndef __DWC2_HW_H__

ERROR: code indent should never use tabs
#67: FILE: include/hw/usb/dwc2-regs.h:45:
+#define HSOTG_REG(x)^I(x)$

ERROR: code indent should never use tabs
#69: FILE: include/hw/usb/dwc2-regs.h:47:
+#define GOTGCTL^I^I^I^IHSOTG_REG(0x000)$

ERROR: code indent should never use tabs
#70: FILE: include/hw/usb/dwc2-regs.h:48:
+#define GOTGCTL_CHIRPEN^I^I^IBIT(27)$

ERROR: code indent should never use tabs
#71: FILE: include/hw/usb/dwc2-regs.h:49:
+#define GOTGCTL_MULT_VALID_BC_MASK^I(0x1f << 22)$

ERROR: code indent should never use tabs
#72: FILE: include/hw/usb/dwc2-regs.h:50:
+#define GOTGCTL_MULT_VALID_BC_SHIFT^I22$

ERROR: code indent should never use tabs
#73: FILE: include/hw/usb/dwc2-regs.h:51:
+#define GOTGCTL_OTGVER^I^I^IBIT(20)$

ERROR: code indent should never use tabs
#74: FILE: include/hw/usb/dwc2-regs.h:52:
+#define GOTGCTL_BSESVLD^I^I^IBIT(19)$

ERROR: code indent should never use tabs
#75: FILE: include/hw/usb/dwc2-regs.h:53:
+#define GOTGCTL_ASESVLD^I^I^IBIT(18)$

ERROR: code indent should never use tabs
#76: FILE: include/hw/usb/dwc2-regs.h:54:
+#define GOTGCTL_DBNC_SHORT^I^IBIT(17)$

ERROR: code indent should never use tabs
#77: FILE: include/hw/usb/dwc2-regs.h:55:
+#define GOTGCTL_CONID_B^I^I^IBIT(16)$

ERROR: code indent should never use tabs
#78: FILE: include/hw/usb/dwc2-regs.h:56:
+#define GOTGCTL_DBNCE_FLTR_BYPASS^IBIT(15)$

ERROR: code indent should never use tabs
#79: FILE: include/hw/usb/dwc2-regs.h:57:
+#define GOTGCTL_DEVHNPEN^I^IBIT(11)$

ERROR: code indent should never use tabs
#80: FILE: include/hw/usb/dwc2-regs.h:58:
+#define GOTGCTL_HSTSETHNPEN^I^IBIT(10)$

ERROR: code indent should never use tabs
#81: FILE: include/hw/usb/dwc2-regs.h:59:
+#define GOTGCTL_HNPREQ^I^I^IBIT(9)$

ERROR: code indent should never use tabs
#82: FILE: include/hw/usb/dwc2-regs.h:60:
+#define GOTGCTL_HSTNEGSCS^I^IBIT(8)$

ERROR: code indent should never use tabs
#83: FILE: include/hw/usb/dwc2-regs.h:61:
+#define GOTGCTL_SESREQ^I^I^IBIT(1)$

ERROR: code indent should never use tabs
#84: FILE: include/hw/usb/dwc2-regs.h:62:
+#define GOTGCTL_SESREQSCS^I^IBIT(0)$

ERROR: code indent should never use tabs
#86: FILE: include/hw/usb/dwc2-regs.h:64:
+#define GOTGINT^I^I^I^IHSOTG_REG(0x004)$

ERROR: code indent should never use tabs
#87: FILE: include/hw/usb/dwc2-regs.h:65:
+#define GOTGINT_DBNCE_DONE^I^IBIT(19)$

ERROR: code indent should never use tabs
#88: FILE: include/hw/usb/dwc2-regs.h:66:
+#define GOTGINT_A_DEV_TOUT_CHG^I^IBIT(18)$

ERROR: code indent should never use tabs
#89: FILE: include/hw/usb/dwc2-regs.h:67:
+#define GOTGINT_HST_NEG_DET^I^IBIT(17)$

ERROR: code indent should never use tabs
#90: FILE: include/hw/usb/dwc2-regs.h:68:
+#define GOTGINT_HST_NEG_SUC_STS_CHNG^IBIT(9)$

ERROR: code indent should never use tabs
#91: FILE: include/hw/usb/dwc2-regs.h:69:
+#define GOTGINT_SES_REQ_SUC_STS_CHNG^IBIT(8)$

ERROR: code indent should never use tabs
#92: FILE: include/hw/usb/dwc2-regs.h:70:
+#define GOTGINT_SES_END_DET^I^IBIT(2)$

ERROR: code indent should never use tabs
#94: FILE: include/hw/usb/dwc2-regs.h:72:

Re: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-05-20 Thread John G Johnson




> I'm confused by VFIO_USER_ADD_MEMORY_REGION vs VFIO_USER_IOMMU_MAP_DMA.
> The former seems intended to provide the server with access to the
> entire GPA space, while the latter indicates an IOVA to GPA mapping of
> those regions.  Doesn't this break the basic isolation of a vIOMMU?
> This essentially says to me "here's all the guest memory, but please
> only access these regions for which we're providing DMA mappings".
> That invites abuse.
> 

The purpose behind separating QEMU into multiple processes is
to provide an additional layer of protection for the infrastructure against
a malicious guest, not for the guest against itself, so preventing a server
that has been compromised by a guest from accessing all of guest memory
adds no additional benefit.  We don’t even have an IOMMU in our current
guest model for this reason.

The implementation was stolen from vhost-user, with the exception
that we push IOTLB translations from client to server like VFIO does, as
opposed to pulling them from server to client like vhost-user does.

That said, neither the qemu-mp nor MUSER implementation uses an
IOMMU, so if you prefer another IOMMU model, we can consider it.  We
could send the guest memory file descriptors only with IOMMU_MAP_DMA
requests, although that would cost performance since each request would
require the server to execute an mmap() system call.


> Also regarding VFIO_USER_ADD_MEMORY_REGION, it's not clear to me how
> "an array of file descriptors will be sent as part of the message
> meta-data" works.  Also consider s/SUB/DEL/.  Why is the Device ID in
> the table specified as 0?  How does a client learn their Device ID?
> 

SCM_RIGHTS message controls allow sendmsg() to send an array of
file descriptors over a UNIX domain socket.
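
As a rough illustration of the mechanism (a sketch only; sketch_send_fds()
and SKETCH_MAX_FDS are made-up names and not part of the proposed protocol),
passing descriptors alongside a message payload looks roughly like this:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Arbitrary limit chosen for this sketch. */
#define SKETCH_MAX_FDS 8

/*
 * Send `len` bytes of payload together with `nfds` file descriptors over a
 * connected UNIX domain socket. Returns the sendmsg() result, or -1 if the
 * descriptor count is out of range.
 */
static ssize_t sketch_send_fds(int sock, const void *payload, size_t len,
                               const int *fds, size_t nfds)
{
    /* The union keeps the control buffer aligned for struct cmsghdr. */
    union {
        char buf[CMSG_SPACE(sizeof(int) * SKETCH_MAX_FDS)];
        struct cmsghdr align;
    } control;
    struct iovec iov = { .iov_base = (void *)payload, .iov_len = len };
    struct msghdr msg;
    struct cmsghdr *cmsg;

    if (nfds == 0 || nfds > SKETCH_MAX_FDS) {
        return -1;
    }

    memset(&msg, 0, sizeof(msg));
    memset(&control, 0, sizeof(control));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = CMSG_SPACE(sizeof(int) * nfds);

    /* One SCM_RIGHTS control message carries the whole descriptor array. */
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int) * nfds);
    memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * nfds);

    return sendmsg(sock, &msg, 0);
}
```

The receiver gets its own duplicates of the descriptors from recvmsg() by
walking the control messages in the same way.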

We’re only supporting one device per socket in this protocol
version, so the device ID will always be 0.  This may change in a future
revision, so we included the field in the header to avoid a major version
change if device multiplexing is added later.


> VFIO_USER_DEVICE_GET_REGION_INFO (or anything else making use of a
> capability chain), the cap_offset and next pointers within the chain
> need to specify what their offset is relative to (ie. the start of the
> packet, the start of the vfio compatible data structure, etc).  I
> assume the latter for client compatibility.
> 

Yes.  We will attempt to make the language clearer.


> Also on REGION_INFO, offset is specified as "the base offset to be
> given to the mmap() call for regions with the MMAP attribute".  Base
> offset from what?  Is the mmap performed on the socket fd?  Do we not
> allow read/write, we need to use VFIO_USER_MMIO_READ/WRITE instead?
> Why do we specify "MMIO" in those operations versus simply "REGION"?
> Are we arbitrarily excluding support for I/O port regions or device
> specific regions?  If these commands replace direct read and write to
> an fd offset, how is PCI config space handled?
> 

The base offset refers to the sparse areas, where the sparse area
offset is added to the base region offset.  We will try to make the text
clearer here as well.

MMIO was added to distinguish these operations from DMA operations.
I can see how this can cause confusion when the region refers to a port range,
so we can change the name to REGION_READ/WRITE. 


> VFIO_USER_MMIO_READ specifies the count field is zero and the reply
> will include the count specifying the amount of data read.  How does
> the client specify how much data to read?  Via message size?
> 

This is a bug in the doc.  As you said, the count field should
be the amount of data to be read.


> VFIO_USER_DMA_READ/WRITE, is the address a GPA or IOVA?  IMO the device
> should only ever have access via IOVA, which implies a DMA mapping
> exists for the device.  Can you provide an example of why we need these
> commands since there seems little point to this interface if a device
> cannot directly interact with VM memory.
> 

It is a GPA.  The device emulation code would only handle the DMA
addresses the guest programmed it with; the server infrastructure knows
whether an IOMMU exists, and whether the DMA address needs translation to
GPA or not.
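
A minimal sketch of that split, assuming a flat table of mappings pushed by
the client. The structure and function names are hypothetical and do not
come from the qemu-mp or MUSER code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping entry: one IOVA range and the GPA it maps to. */
struct sketch_dma_map {
    uint64_t iova;
    uint64_t gpa;
    uint64_t len;
};

/*
 * Resolve a DMA address programmed by the guest to a GPA. Without an IOMMU
 * the DMA address already is a GPA; with one, it is looked up in the
 * mappings. Returns false if no mapping covers the address.
 */
static bool sketch_dma_to_gpa(const struct sketch_dma_map *maps, size_t nmaps,
                              bool has_iommu, uint64_t dma_addr,
                              uint64_t *gpa)
{
    if (!has_iommu) {
        *gpa = dma_addr;
        return true;
    }
    for (size_t i = 0; i < nmaps; i++) {
        /* Subtract before comparing so the range check cannot overflow. */
        if (dma_addr >= maps[i].iova &&
            dma_addr - maps[i].iova < maps[i].len) {
            *gpa = maps[i].gpa + (dma_addr - maps[i].iova);
            return true;
        }
    }
    return false;
}
```

The device emulation code would call only the lookup and never needs to know
which of the two cases applies.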


> The IOMMU commands should be unnecessary, a vIOMMU should be
> transparent to the server by virtue that the device only knows about
> IOVA mappings accessible to the device.  Requiring the client to expose
> all memory to the server implies that the server must always be trusted.
> 

The client and server are equally trusted; the guest is the untrusted
entity.


> Interrupt info format, s/type/index/, s/vector/subindex/
> 

ok


> In addition to the unused ioctls, the entire concept of groups and
> containers are not found in this specification.  To some degree that
> makes sense and even mdevs and typically SR-IOV VFs have a 1:1 device
> to group relationship.  However, the container is very much

[PATCH v6 5/7] usb: add short-packet handling to usb-storage driver

The dwc-hsotg (dwc2) USB host depends on a short packet to
indicate the end of an IN transfer. The usb-storage driver
currently doesn't provide this, so fix it.

I have tested this change rather extensively using a PC
emulation with xhci, ehci, and uhci controllers, and have
not observed any regressions.
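
To illustrate the idea, here is a sketch with made-up names. It mirrors the
deferral condition changed at the end of this patch and is not an existing
QEMU helper:

```c
#include <stdbool.h>
#include <stddef.h>

/* A packet is "short" when it carries less data than the endpoint maximum. */
static bool sketch_is_short_packet(size_t actual_len, size_t max_packet_size)
{
    return actual_len < max_packet_size;
}

/*
 * Decide whether a partially filled IN packet should be held back until
 * more data arrives. It is completed early (as a short packet) only when
 * the host accepts short packets and less than one full packet of data
 * remains to be transferred.
 */
static bool sketch_defer_in_packet(size_t actual_len, size_t requested_len,
                                   size_t remaining_len,
                                   size_t max_packet_size, bool short_not_ok)
{
    if (actual_len >= requested_len) {
        /* The packet is already full, so there is nothing to wait for. */
        return false;
    }
    return short_not_ok || remaining_len >= max_packet_size;
}
```

Here remaining_len plays the role of s->scsi_len and requested_len the role
of p->iov.size in the hunk below.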

Signed-off-by: Paul Zimmerman 
---
 hw/usb/dev-storage.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/hw/usb/dev-storage.c b/hw/usb/dev-storage.c
index 5c4b57b06b..ae3c550042 100644
--- a/hw/usb/dev-storage.c
+++ b/hw/usb/dev-storage.c
@@ -229,6 +229,9 @@ static void usb_msd_copy_data(MSDState *s, USBPacket *p)
 usb_packet_copy(p, scsi_req_get_buf(s->req) + s->scsi_off, len);
 s->scsi_len -= len;
 s->scsi_off += len;
+if (len > s->data_len) {
+len = s->data_len;
+}
 s->data_len -= len;
 if (s->scsi_len == 0 || s->data_len == 0) {
 scsi_req_continue(s->req);
@@ -303,6 +306,9 @@ static void usb_msd_command_complete(SCSIRequest *req, 
uint32_t status, size_t r
 if (s->data_len) {
 int len = (p->iov.size - p->actual_length);
 usb_packet_skip(p, len);
+if (len > s->data_len) {
+len = s->data_len;
+}
 s->data_len -= len;
 }
 if (s->data_len == 0) {
@@ -469,6 +475,9 @@ static void usb_msd_handle_data(USBDevice *dev, USBPacket 
*p)
 int len = p->iov.size - p->actual_length;
 if (len) {
 usb_packet_skip(p, len);
+if (len > s->data_len) {
+len = s->data_len;
+}
 s->data_len -= len;
 if (s->data_len == 0) {
 s->mode = USB_MSDM_CSW;
@@ -528,13 +537,17 @@ static void usb_msd_handle_data(USBDevice *dev, USBPacket 
*p)
 int len = p->iov.size - p->actual_length;
 if (len) {
 usb_packet_skip(p, len);
+if (len > s->data_len) {
+len = s->data_len;
+}
 s->data_len -= len;
 if (s->data_len == 0) {
 s->mode = USB_MSDM_CSW;
 }
 }
 }
-if (p->actual_length < p->iov.size) {
+if (p->actual_length < p->iov.size && (p->short_not_ok ||
+s->scsi_len >= p->ep->max_packet_size)) {
 DPRINTF("Deferring packet %p [wait data-in]\n", p);
 s->packet = p;
 p->status = USB_RET_ASYNC;
-- 
2.17.1

[PATCH v6 6/7] wire in the dwc-hsotg (dwc2) USB host controller emulation

Wire the dwc-hsotg (dwc2) emulation into QEMU.

Signed-off-by: Paul Zimmerman 
Reviewed-by: Philippe Mathieu-Daude 
---
 hw/arm/bcm2835_peripherals.c         | 21 -
 include/hw/arm/bcm2835_peripherals.h |  3 ++-
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/hw/arm/bcm2835_peripherals.c b/hw/arm/bcm2835_peripherals.c
index 5e2c832d95..3b554cfac0 100644
--- a/hw/arm/bcm2835_peripherals.c
+++ b/hw/arm/bcm2835_peripherals.c
@@ -129,6 +129,13 @@ static void bcm2835_peripherals_init(Object *obj)
 /* Mphi */
 sysbus_init_child_obj(obj, "mphi", &s->mphi, sizeof(s->mphi),
   TYPE_BCM2835_MPHI);
+
+/* DWC2 */
+sysbus_init_child_obj(obj, "dwc2", &s->dwc2, sizeof(s->dwc2),
+  TYPE_DWC2_USB);
+
+object_property_add_const_link(OBJECT(&s->dwc2), "dma-mr",
+   OBJECT(&s->gpu_bus_mr));
 }
 
 static void bcm2835_peripherals_realize(DeviceState *dev, Error **errp)
@@ -386,6 +393,19 @@ static void bcm2835_peripherals_realize(DeviceState *dev, 
Error **errp)
 qdev_get_gpio_in_named(DEVICE(&s->ic), BCM2835_IC_GPU_IRQ,
INTERRUPT_HOSTPORT));
 
+/* DWC2 */
+object_property_set_bool(OBJECT(&s->dwc2), true, "realized", &err);
+if (err) {
+error_propagate(errp, err);
+return;
+}
+
+memory_region_add_subregion(&s->peri_mr, USB_OTG_OFFSET,
+sysbus_mmio_get_region(SYS_BUS_DEVICE(&s->dwc2), 0));
+sysbus_connect_irq(SYS_BUS_DEVICE(&s->dwc2), 0,
+qdev_get_gpio_in_named(DEVICE(&s->ic), BCM2835_IC_GPU_IRQ,
+   INTERRUPT_USB));
+
 create_unimp(s, &s->armtmr, "bcm2835-sp804", ARMCTRL_TIMER0_1_OFFSET, 0x40);
 create_unimp(s, &s->cprman, "bcm2835-cprman", CPRMAN_OFFSET, 0x1000);
 create_unimp(s, &s->a2w, "bcm2835-a2w", A2W_OFFSET, 0x1000);
@@ -399,7 +419,6 @@ static void bcm2835_peripherals_realize(DeviceState *dev, 
Error **errp)
 create_unimp(s, &s->otp, "bcm2835-otp", OTP_OFFSET, 0x80);
 create_unimp(s, &s->dbus, "bcm2835-dbus", DBUS_OFFSET, 0x8000);
 create_unimp(s, &s->ave0, "bcm2835-ave0", AVE0_OFFSET, 0x8000);
-create_unimp(s, &s->dwc2, "dwc-usb2", USB_OTG_OFFSET, 0x1000);
 create_unimp(s, &s->sdramc, "bcm2835-sdramc", SDRAMC_OFFSET, 0x100);
 }
 
diff --git a/include/hw/arm/bcm2835_peripherals.h b/include/hw/arm/bcm2835_peripherals.h
index 7a7a8f6141..48a0ad1633 100644
--- a/include/hw/arm/bcm2835_peripherals.h
+++ b/include/hw/arm/bcm2835_peripherals.h
@@ -27,6 +27,7 @@
 #include "hw/sd/bcm2835_sdhost.h"
 #include "hw/gpio/bcm2835_gpio.h"
 #include "hw/timer/bcm2835_systmr.h"
+#include "hw/usb/hcd-dwc2.h"
 #include "hw/misc/unimp.h"
 
 #define TYPE_BCM2835_PERIPHERALS "bcm2835-peripherals"
@@ -67,7 +68,7 @@ typedef struct BCM2835PeripheralState {
 UnimplementedDeviceState ave0;
 UnimplementedDeviceState bscsl;
 UnimplementedDeviceState smi;
-UnimplementedDeviceState dwc2;
+DWC2State dwc2;
 UnimplementedDeviceState sdramc;
 } BCM2835PeripheralState;
 
-- 
2.17.1

[PATCH v6 7/7] raspi2 acceptance test: add test for dwc-hsotg (dwc2) USB host

Add a check for functional dwc-hsotg (dwc2) USB host emulation to
the Raspi 2 acceptance test

Signed-off-by: Paul Zimmerman 
Reviewed-by: Philippe Mathieu-Daude 
---
 tests/acceptance/boot_linux_console.py | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/tests/acceptance/boot_linux_console.py b/tests/acceptance/boot_linux_console.py
index c6b06a1a13..abb5ca3dd4 100644
--- a/tests/acceptance/boot_linux_console.py
+++ b/tests/acceptance/boot_linux_console.py
@@ -378,13 +378,18 @@ class BootLinuxConsole(Test):
 
 self.vm.set_console()
 kernel_command_line = (self.KERNEL_COMMON_COMMAND_LINE +
-   serial_kernel_cmdline[uart_id])
+   serial_kernel_cmdline[uart_id] +
+   ' root=/dev/mmcblk0p2 rootwait ' +
+   'dwc_otg.fiq_fsm_enable=0')
 self.vm.add_args('-kernel', kernel_path,
  '-dtb', dtb_path,
- '-append', kernel_command_line)
+ '-append', kernel_command_line,
+ '-device', 'usb-kbd')
 self.vm.launch()
 console_pattern = 'Kernel command line: %s' % kernel_command_line
 self.wait_for_console_pattern(console_pattern)
+console_pattern = 'Product: QEMU USB Keyboard'
+self.wait_for_console_pattern(console_pattern)
 
 def test_arm_raspi2_uart0(self):
 """
-- 
2.17.1

[PATCH v6 2/7] dwc-hsotg (dwc2) USB host controller register definitions

Import the dwc-hsotg (dwc2) register definitions file from the
Linux kernel. This is a copy of drivers/usb/dwc2/hw.h from the
mainline Linux kernel, the only changes being to the header, and
two instances of 'u32' changed to 'uint32_t' to allow it to
compile. Checkpatch throws a boatload of errors due to the tab
indentation, but I would rather import it as-is than reformat it.

Signed-off-by: Paul Zimmerman 
---
 include/hw/usb/dwc2-regs.h | 895 +
 1 file changed, 895 insertions(+)
 create mode 100644 include/hw/usb/dwc2-regs.h

diff --git a/include/hw/usb/dwc2-regs.h b/include/hw/usb/dwc2-regs.h
new file mode 100644
index 0000000000..96dc07fb6f
--- /dev/null
+++ b/include/hw/usb/dwc2-regs.h
@@ -0,0 +1,899 @@
+/* SPDX-License-Identifier: (GPL-2.0+ OR BSD-3-Clause) */
+/*
+ * Imported from the Linux kernel file drivers/usb/dwc2/hw.h, commit
+ * a89bae709b3492b478480a2c9734e7e9393b279c ("usb: dwc2: Move
+ * UTMI_PHY_DATA defines closer")
+ *
+ * hw.h - DesignWare HS OTG Controller hardware definitions
+ *
+ * Copyright 2004-2013 Synopsys, Inc.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions, and the following disclaimer,
+ *    without modification.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. The names of the above-listed copyright holders may not be used
+ *    to endorse or promote products derived from this software without
+ *    specific prior written permission.
+ *
+ * ALTERNATIVELY, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") as published by the Free Software
+ * Foundation; either version 2 of the License, or (at your option) any
+ * later version.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
+ * IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef __DWC2_HW_H__
+#define __DWC2_HW_H__
+
+#define HSOTG_REG(x)   (x)
+
+#define GOTGCTL                         HSOTG_REG(0x000)
+#define GOTGCTL_CHIRPEN                 BIT(27)
+#define GOTGCTL_MULT_VALID_BC_MASK      (0x1f << 22)
+#define GOTGCTL_MULT_VALID_BC_SHIFT     22
+#define GOTGCTL_OTGVER                  BIT(20)
+#define GOTGCTL_BSESVLD                 BIT(19)
+#define GOTGCTL_ASESVLD                 BIT(18)
+#define GOTGCTL_DBNC_SHORT              BIT(17)
+#define GOTGCTL_CONID_B                 BIT(16)
+#define GOTGCTL_DBNCE_FLTR_BYPASS       BIT(15)
+#define GOTGCTL_DEVHNPEN                BIT(11)
+#define GOTGCTL_HSTSETHNPEN             BIT(10)
+#define GOTGCTL_HNPREQ                  BIT(9)
+#define GOTGCTL_HSTNEGSCS               BIT(8)
+#define GOTGCTL_SESREQ                  BIT(1)
+#define GOTGCTL_SESREQSCS               BIT(0)
+
+#define GOTGINT                         HSOTG_REG(0x004)
+#define GOTGINT_DBNCE_DONE              BIT(19)
+#define GOTGINT_A_DEV_TOUT_CHG          BIT(18)
+#define GOTGINT_HST_NEG_DET             BIT(17)
+#define GOTGINT_HST_NEG_SUC_STS_CHNG    BIT(9)
+#define GOTGINT_SES_REQ_SUC_STS_CHNG    BIT(8)
+#define GOTGINT_SES_END_DET             BIT(2)
+
+#define GAHBCFG                         HSOTG_REG(0x008)
+#define GAHBCFG_AHB_SINGLE              BIT(23)
+#define GAHBCFG_NOTI_ALL_DMA_WRIT       BIT(22)
+#define GAHBCFG_REM_MEM_SUPP            BIT(21)
+#define GAHBCFG_P_TXF_EMP_LVL           BIT(8)
+#define GAHBCFG_NP_TXF_EMP_LVL          BIT(7)
+#define GAHBCFG_DMA_EN                  BIT(5)
+#define GAHBCFG_HBSTLEN_MASK            (0xf << 1)
+#define GAHBCFG_HBSTLEN_SHIFT           1
+#define GAHBCFG_HBSTLEN_SINGLE          0
+#define GAHBCFG_HBSTLEN_INCR            1
+#define GAHBCFG_HBSTLEN_INCR4           3
+#define GAHBCFG_HBSTLEN_INCR8           5
+#define GAHBCFG_HBSTLEN_INCR16          7
+#define GAHBCFG_GLBL_INTR_EN            BIT(0)
+#define GAHBCFG_CTRL_MASK  (GAHBCFG_P_TXF_EMP_LVL | \
+GAHBCFG_NP_TXF_EMP_LVL | \
+
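
For readers unfamiliar with the kernel's register-definition style, the following is a minimal sketch of how a _MASK/_SHIFT pair and the BIT() flags from this header might be combined when composing or decoding a GAHBCFG value. BIT() is defined locally as a stand-in (an assumption; the imported header expects the including project to provide it), and gahbcfg_compose() and gahbcfg_hbstlen() are hypothetical helper names that do not appear in the patch.

```c
#include <stdint.h>

/* Stand-in for the project's BIT() macro (assumption for this sketch). */
#define BIT(nr)                 (1u << (nr))

/* Copied from the GAHBCFG definitions in the header. */
#define GAHBCFG_DMA_EN          BIT(5)
#define GAHBCFG_HBSTLEN_MASK    (0xf << 1)
#define GAHBCFG_HBSTLEN_SHIFT   1
#define GAHBCFG_HBSTLEN_INCR4   3
#define GAHBCFG_GLBL_INTR_EN    BIT(0)

/* Hypothetical helper: build a GAHBCFG value from its parts. */
uint32_t gahbcfg_compose(uint32_t hbstlen, int dma_en, int glbl_intr_en)
{
    /* Shift the burst length into place and clip it to the field width. */
    uint32_t val = (hbstlen << GAHBCFG_HBSTLEN_SHIFT) & GAHBCFG_HBSTLEN_MASK;

    if (dma_en) {
        val |= GAHBCFG_DMA_EN;
    }
    if (glbl_intr_en) {
        val |= GAHBCFG_GLBL_INTR_EN;
    }
    return val;
}

/* Hypothetical helper: extract the burst-length field again. */
uint32_t gahbcfg_hbstlen(uint32_t gahbcfg)
{
    return (gahbcfg & GAHBCFG_HBSTLEN_MASK) >> GAHBCFG_HBSTLEN_SHIFT;
}
```

Keeping the mask and shift as separate macros is what lets generic get/set helpers be written once and reused for every field.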

[PATCH v6 4/7] dwc-hsotg (dwc2) USB host controller emulation

Add the dwc-hsotg (dwc2) USB host controller emulation code.
Based on hw/usb/hcd-ehci.c and hw/usb/hcd-ohci.c.

Note that to use this with the dwc-otg driver in the Raspbian
kernel, you must pass the option "dwc_otg.fiq_fsm_enable=0" on
the kernel command line.

Emulation of slave mode and of descriptor-DMA mode has not been
implemented yet. These modes are seldom used.

I have used some on-line sources of information while developing
this emulation, including:

http://www.capital-micro.com/PDF/CME-M7_Family_User_Guide_EN.pdf
which has a pretty complete description of the controller starting
on page 370.

https://sourceforge.net/p/wive-ng/wive-ng-mt/ci/master/tree/docs/DataSheets/RT3050_5x_V2.0_081408_0902.pdf
which has a description of the controller registers starting on
page 130.

Thanks to Philippe Mathieu-Daude for providing a cleaner method
of implementing the memory regions for the controller registers.

Signed-off-by: Paul Zimmerman 
---
 hw/usb/Kconfig   |5 +
 hw/usb/Makefile.objs |1 +
 hw/usb/hcd-dwc2.c| 1417 ++
 hw/usb/trace-events  |   50 ++
 4 files changed, 1473 insertions(+)
 create mode 100644 hw/usb/hcd-dwc2.c

diff --git a/hw/usb/Kconfig b/hw/usb/Kconfig
index 464348ba14..d4d8c37c28 100644
--- a/hw/usb/Kconfig
+++ b/hw/usb/Kconfig
@@ -46,6 +46,11 @@ config USB_MUSB
 bool
 select USB
 
+config USB_DWC2
+bool
+default y
+select USB
+
 config TUSB6010
 bool
 select USB_MUSB
diff --git a/hw/usb/Makefile.objs b/hw/usb/Makefile.objs
index 66835e5bf7..fa5c3fa1b8 100644
--- a/hw/usb/Makefile.objs
+++ b/hw/usb/Makefile.objs
@@ -12,6 +12,7 @@ common-obj-$(CONFIG_USB_EHCI_SYSBUS) += hcd-ehci-sysbus.o
 common-obj-$(CONFIG_USB_XHCI) += hcd-xhci.o
 common-obj-$(CONFIG_USB_XHCI_NEC) += hcd-xhci-nec.o
 common-obj-$(CONFIG_USB_MUSB) += hcd-musb.o
+common-obj-$(CONFIG_USB_DWC2) += hcd-dwc2.o
 
 common-obj-$(CONFIG_TUSB6010) += tusb6010.o
 common-obj-$(CONFIG_IMX)  += chipidea.o
diff --git a/hw/usb/hcd-dwc2.c b/hw/usb/hcd-dwc2.c
new file mode 100644
index 0000000000..72cbd051f3
--- /dev/null
+++ b/hw/usb/hcd-dwc2.c
@@ -0,0 +1,1417 @@
+/*
+ * dwc-hsotg (dwc2) USB host controller emulation
+ *
+ * Based on hw/usb/hcd-ehci.c and hw/usb/hcd-ohci.c
+ *
+ * Note that to use this emulation with the dwc-otg driver in the
+ * Raspbian kernel, you must pass the option "dwc_otg.fiq_fsm_enable=0"
+ * on the kernel command line.
+ *
+ * Some useful documentation used to develop this emulation can be
+ * found online (as of April 2020) at:
+ *
+ * http://www.capital-micro.com/PDF/CME-M7_Family_User_Guide_EN.pdf
+ * which has a pretty complete description of the controller starting
+ * on page 370.
+ *
+ * https://sourceforge.net/p/wive-ng/wive-ng-mt/ci/master/tree/docs/DataSheets/RT3050_5x_V2.0_081408_0902.pdf
+ * which has a description of the controller registers starting on
+ * page 130.
+ *
+ * Copyright (c) 2020 Paul Zimmerman 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/units.h"
+#include "qapi/error.h"
+#include "hw/usb/dwc2-regs.h"
+#include "hw/usb/hcd-dwc2.h"
+#include "migration/vmstate.h"
+#include "trace.h"
+#include "qemu/log.h"
+#include "qemu/error-report.h"
+#include "qemu/main-loop.h"
+#include "hw/qdev-properties.h"
+
+#define USB_HZ_FS       12000000
+#define USB_HZ_HS       96000000
+#define USB_FRMINTVL    12000
+
+/* nifty macros from Arnon's EHCI version  */
+#define get_field(data, field) \
+(((data) & field##_MASK) >> field##_SHIFT)
+
+#define set_field(data, newval, field) do { \
+uint32_t val = *(data); \
+val &= ~field##_MASK; \
+val |= ((newval) << field##_SHIFT) & field##_MASK; \
+*(data) = val; \
+} while (0)
+
+#define get_bit(data, bitmask) \
+(!!((data) & (bitmask)))
+
+/* update irq line */
+static inline void dwc2_update_irq(DWC2State *s)
+{
+static int oldlevel;
+int level = 0;
+
+if ((s->gintsts & s->gintmsk) && (s->gahbcfg & GAHBCFG_GLBL_INTR_EN)) {
+level = 1;
+}
+if (level != oldlevel) {
+oldlevel = level;
+trace_usb_dwc2_update_irq(level);
+qemu_set_irq(s->irq, level);
+}
+}
+
+/* flag interrupt condition */
+static inline void dwc2_raise_global_irq(DWC2State *s, uint32_t intr)
+{
+if (!(s->gintsts & intr)) {
+s->gintsts |= intr;
+trace_usb_dwc2_raise_global_irq(intr);
+dwc2_update_irq(s);
+}
+}
+
+static inline void dwc2_lower_global_irq(DWC2State *s, uint32_t intr)
+{
+
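
The interrupt logic above can be illustrated in isolation. The sketch below models only the level computation (a pending bit that is also unmasked, gated by the global enable bit) and the "notify only on change" behaviour. All Demo* names are hypothetical. Note one deliberate difference: the sketch keeps the last driven level in the state struct rather than in a function-local static, so that two instances would not share it; that is a choice made for this sketch, not what the patch does.

```c
#include <stdint.h>

/* Mirrors GAHBCFG_GLBL_INTR_EN, i.e. BIT(0), from dwc2-regs.h. */
#define DEMO_GLBL_INTR_EN   (1u << 0)

/* Hypothetical, cut-down stand-in for the controller state. */
typedef struct DemoIrqState {
    uint32_t gintsts;       /* pending interrupt bits */
    uint32_t gintmsk;       /* enabled interrupt bits */
    uint32_t gahbcfg;       /* holds the global interrupt enable */
    int level;              /* last level driven on the line */
    unsigned transitions;   /* how often the line changed level */
} DemoIrqState;

/* Line is high only if an unmasked bit is pending and irqs are enabled. */
int demo_irq_level(const DemoIrqState *s)
{
    return (s->gintsts & s->gintmsk) && (s->gahbcfg & DEMO_GLBL_INTR_EN);
}

/* Drive the line only when the computed level differs from the last one. */
void demo_update_irq(DemoIrqState *s)
{
    int level = demo_irq_level(s);

    if (level != s->level) {
        s->level = level;
        s->transitions++;   /* a real device would call qemu_set_irq() here */
    }
}

/* Setting an already-pending bit is a no-op, as in the patch. */
void demo_raise(DemoIrqState *s, uint32_t intr)
{
    if (!(s->gintsts & intr)) {
        s->gintsts |= intr;
        demo_update_irq(s);
    }
}

/* Scenario helper: raise the same interrupt twice, count line changes. */
unsigned demo_transitions_after_double_raise(uint32_t gintmsk,
                                             uint32_t gahbcfg)
{
    DemoIrqState s = { 0, gintmsk, gahbcfg, 0, 0 };

    demo_raise(&s, 1u << 3);
    demo_raise(&s, 1u << 3);
    return s.transitions;
}
```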

[PATCH v6 3/7] dwc-hsotg (dwc2) USB host controller state definitions

Add the dwc-hsotg (dwc2) USB host controller state definitions.
Mostly based on hw/usb/hcd-ehci.h.

Signed-off-by: Paul Zimmerman 
---
 hw/usb/hcd-dwc2.h | 190 ++
 1 file changed, 190 insertions(+)
 create mode 100644 hw/usb/hcd-dwc2.h

diff --git a/hw/usb/hcd-dwc2.h b/hw/usb/hcd-dwc2.h
new file mode 100644
index 0000000000..4ba809a07b
--- /dev/null
+++ b/hw/usb/hcd-dwc2.h
@@ -0,0 +1,190 @@
+/*
+ * dwc-hsotg (dwc2) USB host controller state definitions
+ *
+ * Based on hw/usb/hcd-ehci.h
+ *
+ * Copyright (c) 2020 Paul Zimmerman 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef HW_USB_DWC2_H
+#define HW_USB_DWC2_H
+
+#include "qemu/timer.h"
+#include "hw/irq.h"
+#include "hw/sysbus.h"
+#include "hw/usb.h"
+#include "sysemu/dma.h"
+
+#define DWC2_MMIO_SIZE      0x11000
+
+#define DWC2_NB_CHAN        8       /* Number of host channels */
+#define DWC2_MAX_XFER_SIZE  65536   /* Max transfer size expected in HCTSIZ */
+
+typedef struct DWC2Packet DWC2Packet;
+typedef struct DWC2State DWC2State;
+typedef struct DWC2Class DWC2Class;
+
+enum async_state {
+DWC2_ASYNC_NONE = 0,
+DWC2_ASYNC_INITIALIZED,
+DWC2_ASYNC_INFLIGHT,
+DWC2_ASYNC_FINISHED,
+};
+
+struct DWC2Packet {
+USBPacket packet;
+uint32_t devadr;
+uint32_t epnum;
+uint32_t epdir;
+uint32_t mps;
+uint32_t pid;
+uint32_t index;
+uint32_t pcnt;
+uint32_t len;
+int32_t async;
+bool small;
+bool needs_service;
+};
+
+struct DWC2State {
+/*< private >*/
+SysBusDevice parent_obj;
+
+/*< public >*/
+USBBus bus;
+qemu_irq irq;
+MemoryRegion *dma_mr;
+AddressSpace dma_as;
+MemoryRegion container;
+MemoryRegion hsotg;
+MemoryRegion fifos;
+
+union {
+#define DWC2_GLBREG_SIZE    0x70
+uint32_t glbreg[DWC2_GLBREG_SIZE / sizeof(uint32_t)];
+struct {
+uint32_t gotgctl;   /* 00 */
+uint32_t gotgint;   /* 04 */
+uint32_t gahbcfg;   /* 08 */
+uint32_t gusbcfg;   /* 0c */
+uint32_t grstctl;   /* 10 */
+uint32_t gintsts;   /* 14 */
+uint32_t gintmsk;   /* 18 */
+uint32_t grxstsr;   /* 1c */
+uint32_t grxstsp;   /* 20 */
+uint32_t grxfsiz;   /* 24 */
+uint32_t gnptxfsiz; /* 28 */
+uint32_t gnptxsts;  /* 2c */
+uint32_t gi2cctl;   /* 30 */
+uint32_t gpvndctl;  /* 34 */
+uint32_t ggpio; /* 38 */
+uint32_t guid;  /* 3c */
+uint32_t gsnpsid;   /* 40 */
+uint32_t ghwcfg1;   /* 44 */
+uint32_t ghwcfg2;   /* 48 */
+uint32_t ghwcfg3;   /* 4c */
+uint32_t ghwcfg4;   /* 50 */
+uint32_t glpmcfg;   /* 54 */
+uint32_t gpwrdn;    /* 58 */
+uint32_t gdfifocfg; /* 5c */
+uint32_t gadpctl;   /* 60 */
+uint32_t grefclk;   /* 64 */
+uint32_t gintmsk2;  /* 68 */
+uint32_t gintsts2;  /* 6c */
+};
+};
+
+union {
+#define DWC2_FSZREG_SIZE    0x04
+uint32_t fszreg[DWC2_FSZREG_SIZE / sizeof(uint32_t)];
+struct {
+uint32_t hptxfsiz;  /* 100 */
+};
+};
+
+union {
+#define DWC2_HREG0_SIZE 0x44
+uint32_t hreg0[DWC2_HREG0_SIZE / sizeof(uint32_t)];
+struct {
+uint32_t hcfg;  /* 400 */
+uint32_t hfir;  /* 404 */
+uint32_t hfnum; /* 408 */
+uint32_t rsvd0; /* 40c */
+uint32_t hptxsts;   /* 410 */
+uint32_t haint; /* 414 */
+uint32_t haintmsk;  /* 418 */
+uint32_t hflbaddr;  /* 41c */
+uint32_t rsvd1[8];  /* 420-43c */
+uint32_t hprt0; /* 440 */
+};
+};
+
+#define DWC2_HREG1_SIZE (0x20 * DWC2_NB_CHAN)
+uint32_t hreg1[DWC2_HREG1_SIZE / sizeof(uint32_t)];
+
+#define hcchar(_ch) hreg1[((_ch) << 3) + 0] /* 500, 520, ... */
+#define hcsplt(_ch) hreg1[((_ch) << 3) + 1] /* 504, 524, ... */
+#define hcint(_ch)  hreg1[((_ch) << 3) + 2] /* 508, 528, ... */
+#define hcintmsk(_ch)   hreg1[((_ch) << 3) + 3] /* 50c, 52c, ... */
+#define hctsiz(_ch) hreg1[((_ch) << 3) + 4] /* 510, 530, ... */
+#define hcdma(_ch)  hreg1[((_ch) << 3) + 5]
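
The per-channel accessor macros above pack eight 32-bit registers per host channel into one flat array. A minimal sketch of the index arithmetic follows, assuming (from the "500, 520, ..." comments) that the channel block starts at MMIO offset 0x500; the DEMO_* names and both helper functions are hypothetical and only restate what the macros compute.

```c
#include <stddef.h>
#include <stdint.h>

#define DWC2_NB_CHAN        8
#define DWC2_HREG1_SIZE     (0x20 * DWC2_NB_CHAN)

/* Assumed from the "500, 520, ..." comments in the header. */
#define DEMO_HREG1_BASE     0x500

/* Register slots within one channel, in the order the macros use. */
enum {
    DEMO_HCCHAR = 0,
    DEMO_HCSPLT = 1,
    DEMO_HCINT = 2,
    DEMO_HCINTMSK = 3,
    DEMO_HCTSIZ = 4,
    DEMO_HCDMA = 5,
};

/* Index into hreg1[]: eight words per channel, hence the shift by 3. */
size_t demo_hreg1_index(unsigned ch, unsigned reg)
{
    return ((size_t)ch << 3) + reg;
}

/* Corresponding MMIO offset: each array element is four bytes wide. */
uint32_t demo_hreg1_offset(unsigned ch, unsigned reg)
{
    return DEMO_HREG1_BASE
           + (uint32_t)(demo_hreg1_index(ch, reg) * sizeof(uint32_t));
}
```

With 8 channels the array holds DWC2_HREG1_SIZE / 4 = 64 words, so the highest index used by these slots, channel 7 HCDMA at index 61, stays in bounds.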

[PATCH v6 0/7] dwc-hsotg (aka dwc2) USB host controller emulation

This version fixes a few things pointed out by Peter, and one by
Philippe.

This patch series adds emulation for the dwc-hsotg USB controller,
which is used on the Raspberry Pi 3 and earlier, as well as a number
of other development boards. The main benefit for Raspberry Pi is that
this enables networking on these boards, since the network adapter is
attached via USB.

The emulation is working quite well, I have tested with USB network,
mass storage, mouse, keyboard, and tablet. I have tested with the dwc2
driver in the upstream Linux kernel, and with the dwc-otg driver in the
Raspbian kernel.

One remaining issue is that USB host passthrough does not work. I tried
connecting to a USB stick on the host, but the device generates babble
errors and does not work. This is because the dwc-hsotg controller only
has one root port, so a full-speed dev-hub device is always connected
to it, and high-speed USB devices on the host do not work at full-speed
on the guest. (I have WIP code to add high-speed support to dev-hub to
fix this.)

The patch series also includes a very basic emulation of the MPHI
device on the Raspberry Pi SOC, which provides the FIQ interrupt that
is used by the dwc-otg driver in the Raspbian kernel. But that driver
still does not work in full FIQ mode, so it is necessary to add a
parameter to the kernel command line ("dwc_otg.fiq_fsm_enable=0") to
make it work.

I have used some online sources of information while developing this
emulation, including:

http://www.capital-micro.com/PDF/CME-M7_Family_User_Guide_EN.pdf
which has a pretty complete description of the controller starting
on page 370.

https://sourceforge.net/p/wive-ng/wive-ng-mt/ci/master/tree/docs/DataSheets/RT3050_5x_V2.0_081408_0902.pdf
which has a description of the controller registers starting on
page 130.

Changes v5-v6:
- In bcm2835_mphi.c, make mphi_reset() do initialization of the device
state, per Peter M.

- In hcd-dwc2.c, replace fprintf() with qemu_log_mask(LOG_GUEST_ERROR),
and add qemu_log_mask(LOG_UNIMP) for the TODO functionality, per
Peter M.

- In hcd-dwc2.c, switch to using 3-phase reset, per Peter M.

- In dwc2-regs.h, change comment style of first line to QEMU style,
and add a note about which Linux commit the file is from, per
Philippe M.

Changes v4-v5:
- Changed MemoryRegionOps to use '.impl.[min/max]_access_size' and
removed ANDing of memory values with 0x, per Felippe M.

- hcd-dwc2.c: Changed NULL check of return from
object_property_get_link() call to an assertion, per Philippe.

- bcm2835_mphi.c/h:
* Changed swirq_set/swirq_clr registers into a single register,
per Philippe.
* Simplified memory region code, per Philippe.

Changes v3-v4:
- Reworked the memory region / register access code according to
an example patch from Philippe Mathieu-Daudé.

- Moved the Makefile/Kconfig changes for this file into this
patch, per Philippe.

- Fixed a missing DEFINE_PROP_END_OF_LIST() in dwc2_usb_properties.

Changes v2-v3:
- Fixed the high-speed frame time emulation so that high-speed
mouse/tablet will work correctly once we have high-speed hub
support.

- Added a "usb_version" property to the dwc-hsotg controller, to
allow choosing whether the controller emulates a USB 1 full-speed
host or a USB 2 high-speed host.

- Added a test for a working dwc-hsotg controller to the raspi2
acceptance test, requested by Philippe M.

- Added #defines for the register array sizes, instead of hard-
coding them in multiple places.

- Removed the NB_PORTS #define and the associated iteration code,
since the controller only supports a single root port.

- Removed some unused fields from the controller state struct.

- Added pointers to some online documentation to the top of
hcd-dwc2.c, requested by Peter M.

- Reworked the init/realize code to remove some confusing function
names, requested by Peter M.

- Added VMStateDescription structs for the controller and MPHI
state, requested by Peter M (untested).

Changes v1-v2:
- Fixed checkpatch errors/warnings, except for dwc2-regs.h since
that is a direct import from the Linux kernel.

- Switched from debug printfs to tracepoints in hcd-dwc2.c, on the
advice of Gerd. I just dropped the debug prints in bcm2835_mphi.c,
since I didn't consider them very useful.

- Updated a couple of the commit messages with more info.

Thanks for your time,
Paul

---

Paul Zimmerman (7):
raspi: add BCM2835 SOC MPHI emulation
dwc-hsotg (dwc2) USB host controller register definitions
dwc-hsotg (dwc2) USB host controller state definitions
dwc-hsotg (dwc2) USB host controller emulation
usb: add short-packet handling to usb-storage driver
wire in the dwc-hsotg (dwc2) USB host controller emulation
raspi2 acceptance test: add test for dwc-hsotg (dwc2) USB host

hw/arm/bcm2835_peripherals.c | 38 +-
hw/misc/Makefile.objs |

[PATCH v6 1/7] raspi: add BCM2835 SOC MPHI emulation

Add BCM2835 SOC MPHI (Message-based Parallel Host Interface)
emulation. It is very basic, only providing the FIQ interrupt
needed to allow the dwc-otg USB host controller driver in the
Raspbian kernel to function.

Signed-off-by: Paul Zimmerman 
Acked-by: Philippe Mathieu-Daude 
Reviewed-by: Peter Maydell 
---
 hw/arm/bcm2835_peripherals.c |  17 +++
 hw/misc/Makefile.objs|   1 +
 hw/misc/bcm2835_mphi.c   | 191 +++
 include/hw/arm/bcm2835_peripherals.h |   2 +
 include/hw/misc/bcm2835_mphi.h   |  44 ++
 5 files changed, 255 insertions(+)
 create mode 100644 hw/misc/bcm2835_mphi.c
 create mode 100644 include/hw/misc/bcm2835_mphi.h

diff --git a/hw/arm/bcm2835_peripherals.c b/hw/arm/bcm2835_peripherals.c
index f1bcc14f55..b3e0495040 100644
--- a/hw/arm/bcm2835_peripherals.c
+++ b/hw/arm/bcm2835_peripherals.c
@@ -125,6 +125,10 @@ static void bcm2835_peripherals_init(Object *obj)
OBJECT(&s->sdhci.sdbus));
 object_property_add_const_link(OBJECT(&s->gpio), "sdbus-sdhost",
OBJECT(&s->sdhost.sdbus));
+
+/* Mphi */
+sysbus_init_child_obj(obj, "mphi", &s->mphi, sizeof(s->mphi),
+  TYPE_BCM2835_MPHI);
 }
 
 static void bcm2835_peripherals_realize(DeviceState *dev, Error **errp)
@@ -360,6 +364,19 @@ static void bcm2835_peripherals_realize(DeviceState *dev, Error **errp)
 
 object_property_add_alias(OBJECT(s), "sd-bus", OBJECT(&s->gpio), "sd-bus");
 
+/* Mphi */
+object_property_set_bool(OBJECT(&s->mphi), true, "realized", &err);
+if (err) {
+error_propagate(errp, err);
+return;
+}
+
+memory_region_add_subregion(&s->peri_mr, MPHI_OFFSET,
+sysbus_mmio_get_region(SYS_BUS_DEVICE(&s->mphi), 0));
+sysbus_connect_irq(SYS_BUS_DEVICE(&s->mphi), 0,
+qdev_get_gpio_in_named(DEVICE(&s->ic), BCM2835_IC_GPU_IRQ,
+   INTERRUPT_HOSTPORT));
+
 create_unimp(s, &s->armtmr, "bcm2835-sp804", ARMCTRL_TIMER0_1_OFFSET, 0x40);
 create_unimp(s, &s->cprman, "bcm2835-cprman", CPRMAN_OFFSET, 0x1000);
 create_unimp(s, &s->a2w, "bcm2835-a2w", A2W_OFFSET, 0x1000);
diff --git a/hw/misc/Makefile.objs b/hw/misc/Makefile.objs
index 68aae2eabb..91085cc21b 100644
--- a/hw/misc/Makefile.objs
+++ b/hw/misc/Makefile.objs
@@ -57,6 +57,7 @@ common-obj-$(CONFIG_OMAP) += omap_l4.o
 common-obj-$(CONFIG_OMAP) += omap_sdrc.o
 common-obj-$(CONFIG_OMAP) += omap_tap.o
 common-obj-$(CONFIG_RASPI) += bcm2835_mbox.o
+common-obj-$(CONFIG_RASPI) += bcm2835_mphi.o
 common-obj-$(CONFIG_RASPI) += bcm2835_property.o
 common-obj-$(CONFIG_RASPI) += bcm2835_rng.o
 common-obj-$(CONFIG_RASPI) += bcm2835_thermal.o
diff --git a/hw/misc/bcm2835_mphi.c b/hw/misc/bcm2835_mphi.c
new file mode 100644
index 0000000000..0428e10ba5
--- /dev/null
+++ b/hw/misc/bcm2835_mphi.c
@@ -0,0 +1,191 @@
+/*
+ * BCM2835 SOC MPHI emulation
+ *
+ * Very basic emulation, only providing the FIQ interrupt needed to
+ * allow the dwc-otg USB host controller driver in the Raspbian kernel
+ * to function.
+ *
+ * Copyright (c) 2020 Paul Zimmerman 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "hw/misc/bcm2835_mphi.h"
+#include "migration/vmstate.h"
+#include "qemu/error-report.h"
+#include "qemu/log.h"
+#include "qemu/main-loop.h"
+
+static inline void mphi_raise_irq(BCM2835MphiState *s)
+{
+qemu_set_irq(s->irq, 1);
+}
+
+static inline void mphi_lower_irq(BCM2835MphiState *s)
+{
+qemu_set_irq(s->irq, 0);
+}
+
+static uint64_t mphi_reg_read(void *ptr, hwaddr addr, unsigned size)
+{
+BCM2835MphiState *s = ptr;
+uint32_t val = 0;
+
+switch (addr) {
+case 0x28:  /* outdda */
+val = s->outdda;
+break;
+case 0x2c:  /* outddb */
+val = s->outddb;
+break;
+case 0x4c:  /* ctrl */
+val = s->ctrl;
+val |= 1 << 17;
+break;
+case 0x50:  /* intstat */
+val = s->intstat;
+break;
+case 0x1f0: /* swirq_set */
+val = s->swirq;
+break;
+case 0x1f4: /* swirq_clr */
+val = s->swirq;
+break;
+default:
+qemu_log_mask(LOG_UNIMP, "read from unknown register");
+break;
+}
+
+return val;
+}
+
+static void mphi_reg_write(void *ptr, hwaddr addr, uint64_t val, unsigned size)
+{
+BCM2835MphiState *s = ptr;
+int do_irq = 0;
+
+switch (addr) {
+case 0x28:  /* outdda */
+s->outdda = val;

Re: [PATCH v2 2/2] spapr: Add a new level of NUMA for GPUs

2020-05-20 Thread Greg Kurz

On Mon, 18 May 2020 16:44:18 -0500
Reza Arbab  wrote:

> NUMA nodes corresponding to GPU memory currently have the same
> affinity/distance as normal memory nodes. Add a third NUMA associativity
> reference point enabling us to give GPU nodes more distance.
> 
> This is guest visible information, which shouldn't change under a
> running guest across migration between different qemu versions, so make
> the change effective only in new (pseries > 5.0) machine types.
> 
> Before, `numactl -H` output in a guest with 4 GPUs (nodes 2-5):
> 
> node distances:
> node   0   1   2   3   4   5
>   0:  10  40  40  40  40  40
>   1:  40  10  40  40  40  40
>   2:  40  40  10  40  40  40
>   3:  40  40  40  10  40  40
>   4:  40  40  40  40  10  40
>   5:  40  40  40  40  40  10
> 
> After:
> 
> node distances:
> node   0   1   2   3   4   5
>   0:  10  40  80  80  80  80
>   1:  40  10  80  80  80  80
>   2:  80  80  10  80  80  80
>   3:  80  80  80  10  80  80
>   4:  80  80  80  80  10  80
>   5:  80  80  80  80  80  10
> 
> These are the same distances as on the host, mirroring the change made
> to host firmware in skiboot commit f845a648b8cb ("numa/associativity:
> Add a new level of NUMA for GPU's").
> 
> Signed-off-by: Reza Arbab 
> ---
>  hw/ppc/spapr.c | 11 +--
>  hw/ppc/spapr_pci_nvlink2.c |  2 +-
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 88b4a1f17716..1d9193d5ee49 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -893,7 +893,11 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void 
> *fdt)
>  int rtas;
>  GString *hypertas = g_string_sized_new(256);
>  GString *qemu_hypertas = g_string_sized_new(256);
> -uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
> +uint32_t refpoints[] = {
> +cpu_to_be32(0x4),
> +cpu_to_be32(0x4),
> +cpu_to_be32(0x2),
> +};
>  uint32_t nr_refpoints;
>  uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
>  memory_region_size((spapr)->device_memory->mr);
> @@ -4544,7 +4548,7 @@ static void spapr_machine_class_init(ObjectClass *oc, 
> void *data)
>  smc->linux_pci_probe = true;
>  smc->smp_threads_vsmt = true;
>  smc->nr_xirqs = SPAPR_NR_XIRQS;
> -smc->nr_assoc_refpoints = 2;
> +smc->nr_assoc_refpoints = 3;
>  xfc->match_nvt = spapr_match_nvt;
>  }
>  
> @@ -4611,8 +4615,11 @@ DEFINE_SPAPR_MACHINE(5_1, "5.1", true);
>   */
>  static void spapr_machine_5_0_class_options(MachineClass *mc)
>  {
> +SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> +
>  spapr_machine_5_1_class_options(mc);
>  compat_props_add(mc->compat_props, hw_compat_5_0, hw_compat_5_0_len);
> +smc->nr_assoc_refpoints = 2;
>  }
>  
>  DEFINE_SPAPR_MACHINE(5_0, "5.0", false);
> diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
> index 8332d5694e46..247fd48731e2 100644
> --- a/hw/ppc/spapr_pci_nvlink2.c
> +++ b/hw/ppc/spapr_pci_nvlink2.c
> @@ -362,7 +362,7 @@ void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, 
> void *fdt)
>  uint32_t associativity[] = {
>  cpu_to_be32(0x4),
>  SPAPR_GPU_NUMA_ID,
> -SPAPR_GPU_NUMA_ID,
> +cpu_to_be32(nvslot->numa_id),

This is a guest visible change. It should theoretically be controlled
with a compat property of the PHB (look for "static GlobalProperty" in
spapr.c). But since this code is only used for GPU passthrough and we
don't support migration of such devices, I guess it's okay. Maybe just
mention it in the changelog.
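
To illustrate why this line changes what the guest sees, here is a rough sketch of how a guest could turn associativity arrays and reference points into the distances quoted in the commit message. It rests on several assumptions that are not stated in the patch: that the guest starts at a local distance of 10 and doubles it for every reference point at which two nodes differ, stopping at the first match (as Linux does for this form of affinity, to the best of my knowledge); that normal memory nodes use { 4, 0, 0, 0, node_id }; and that SPAPR_GPU_NUMA_ID corresponds to the value 1. Values are in host byte order and all demo_* names are hypothetical.

```c
#include <stdint.h>

#define DEMO_LOCAL_DISTANCE 10

/*
 * Assumed associativity arrays; element [0] is the entry count, so a
 * reference point of N selects element [N].
 */
static const uint32_t demo_assoc[4][5] = {
    { 4, 0, 0, 0, 0 },  /* node 0: normal memory (assumed layout) */
    { 4, 0, 0, 0, 1 },  /* node 1: normal memory (assumed layout) */
    { 4, 1, 2, 1, 2 },  /* node 2: GPU, with numa_id in slot 2 */
    { 4, 1, 3, 1, 3 },  /* node 3: GPU, with numa_id in slot 2 */
};

/* The three reference points from the patched spapr_dt_rtas(). */
static const uint32_t demo_refpoints[] = { 4, 4, 2 };

/* Double the distance for each reference point at which the nodes differ. */
int demo_node_distance(int a, int b, int nr_refpoints)
{
    int distance = DEMO_LOCAL_DISTANCE;

    for (int i = 0; i < nr_refpoints; i++) {
        uint32_t idx = demo_refpoints[i];

        if (demo_assoc[a][idx] == demo_assoc[b][idx]) {
            break;
        }
        distance *= 2;
    }
    return distance;
}
```

Under these assumptions, three reference points give 40 between the two memory nodes and 80 for any pair involving a GPU, while two reference points give 40 everywhere, which matches the "After" and "Before" tables. Had slot 2 kept the shared GPU id, two GPU nodes would match at the third reference point and stay at 40.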

>  SPAPR_GPU_NUMA_ID,
>  cpu_to_be32(nvslot->numa_id)
>  };

[Bug 1856335] Re: Cache Layout wrong on many Zen Arch CPUs

2020-05-20 Thread Heiko Sieger

This is the CPU cache layout as shown by lscpu -a -e

CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ    MINMHZ
  0    0      0    0 0:0:0:0          yes 3800.0000 2200.0000
  1    0      0    1 1:1:1:0          yes 3800.0000 2200.0000
  2    0      0    2 2:2:2:0          yes 3800.0000 2200.0000
  3    0      0    3 3:3:3:1          yes 3800.0000 2200.0000
  4    0      0    4 4:4:4:1          yes 3800.0000 2200.0000
  5    0      0    5 5:5:5:1          yes 3800.0000 2200.0000
  6    0      0    6 6:6:6:2          yes 3800.0000 2200.0000
  7    0      0    7 7:7:7:2          yes 3800.0000 2200.0000
  8    0      0    8 8:8:8:2          yes 3800.0000 2200.0000
  9    0      0    9 9:9:9:3          yes 3800.0000 2200.0000
 10    0      0   10 10:10:10:3       yes 3800.0000 2200.0000
 11    0      0   11 11:11:11:3       yes 3800.0000 2200.0000
 12    0      0    0 0:0:0:0          yes 3800.0000 2200.0000
 13    0      0    1 1:1:1:0          yes 3800.0000 2200.0000
 14    0      0    2 2:2:2:0          yes 3800.0000 2200.0000
 15    0      0    3 3:3:3:1          yes 3800.0000 2200.0000
 16    0      0    4 4:4:4:1          yes 3800.0000 2200.0000
 17    0      0    5 5:5:5:1          yes 3800.0000 2200.0000
 18    0      0    6 6:6:6:2          yes 3800.0000 2200.0000
 19    0      0    7 7:7:7:2          yes 3800.0000 2200.0000
 20    0      0    8 8:8:8:2          yes 3800.0000 2200.0000
 21    0      0    9 9:9:9:3          yes 3800.0000 2200.0000
 22    0      0   10 10:10:10:3       yes 3800.0000 2200.0000
 23    0      0   11 11:11:11:3       yes 3800.0000 2200.0000

I was trying to allocate cache using the cachetune feature in libvirt,
but it turns out to be either misleading or much too complicated to be
usable. Here is what I tried:

  24
  [libvirt domain XML: markup lost in archiving; only the vCPU count survives]

Unfortunately it gives the following error when I try to start the VM:

Error starting domain: internal error: Missing or inconsistent resctrl
info for memory bandwidth allocation

I have resctrl mounted like this:

mount -t resctrl resctrl /sys/fs/resctrl

This error leads to the following description on how to allocate memory
bandwidth:
https://software.intel.com/content/www/us/en/develop/articles/use-intel-resource-director-technology-to-allocate-memory-bandwidth.html

I think this is over the top and perhaps I'm trying the wrong approach.
All I can say is that every suggestion I've seen and tried so far has
led me to one conclusion: QEMU does NOT support the L3 cache layout of
the new ZEN 2 arch CPUs such as the Ryzen 9 3900X.

https://bugs.launchpad.net/bugs/1856335

Title:
  Cache Layout wrong on many Zen Arch CPUs

Status in QEMU:
  New

Bug description:
  AMD CPUs have L3 cache per 2, 3 or 4 cores. Currently, TOPOEXT seems
  to always map the cache as if it were a 4-core-per-CCX CPU, which is
  incorrect, and costs upwards of 30% performance (more realistically 10%)
  in L3 cache layout aware applications.

  Example on a 4-CCX CPU (1950X /w 8 Cores and no SMT):

    
  EPYC-IBPB
  AMD
  

  In windows, coreinfo reports correctly:

  ****----  Unified Cache 1, Level 3,    8 MB, Assoc  16, LineSize  64
  ----****  Unified Cache 6, Level 3,    8 MB, Assoc  16, LineSize  64

  On a 3-CCX CPU (3960X /w 6 cores and no SMT):

   
  EPYC-IBPB
  AMD
  

  in windows, coreinfo reports incorrectly:

  ****--  Unified Cache  1, Level 3,    8 MB, Assoc  16, LineSize  64
  ----**  Unified Cache  6, Level 3,    8 MB, Assoc  16, LineSize  64

  Validated against 3.0, 3.1, 4.1 and 4.2 versions of qemu-kvm.

  With newer QEMU there is a fix (that does behave correctly) using the dies
  parameter:
   

  The problem is that the dies are exposed differently from how AMD does
  it natively: they are exposed to Windows as sockets, which means that
  if you are not a business user, you can't ever have a machine with
  more than two CCX (6 cores), as consumer versions of Windows only
  support two sockets. (Should this be reported as a separate bug?)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1856335/+subscriptions

Re: [PATCH v2 1/2] spapr: Add associativity reference point count to machine info

2020-05-20 Thread Greg Kurz

On Mon, 18 May 2020 16:44:17 -0500
Reza Arbab  wrote:

> Make the number of NUMA associativity reference points a
> machine-specific value, using the currently assumed default (two
> reference points). This preps the next patch to conditionally change it.
> 
> Signed-off-by: Reza Arbab 
> ---
>  hw/ppc/spapr.c | 6 +-
>  include/hw/ppc/spapr.h | 1 +
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index c18eab0a2305..88b4a1f17716 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -889,10 +889,12 @@ static int spapr_dt_rng(void *fdt)
>  static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>  {
>  MachineState *ms = MACHINE(spapr);
> +SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(ms);
>  int rtas;
>  GString *hypertas = g_string_sized_new(256);
>  GString *qemu_hypertas = g_string_sized_new(256);
>  uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
> +uint32_t nr_refpoints;
>  uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
>  memory_region_size((spapr)->device_memory->mr);
>  uint32_t lrdr_capacity[] = {
> @@ -944,8 +946,9 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void 
> *fdt)
>   qemu_hypertas->str, qemu_hypertas->len));
>  g_string_free(qemu_hypertas, TRUE);
>  
> +nr_refpoints = MIN(smc->nr_assoc_refpoints, ARRAY_SIZE(refpoints));

Having the machine request more reference points than are available
would clearly be a bug. I'd rather add an assert() than silently
clip to the size of refpoints[].
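
A minimal sketch of what the assert-based variant could look like, written
outside of QEMU so it stands alone. The helper name refpoints_prop_len() is
hypothetical, ARRAY_SIZE is redefined locally as a stand-in for QEMU's macro,
and the cpu_to_be32() byte swapping from the patch is left out:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for QEMU's ARRAY_SIZE macro (assumption for this sketch). */
#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

/* Same two entries as in the patch; byte swapping omitted for brevity. */
static const uint32_t refpoints[] = { 0x4, 0x4 };

/*
 * Hypothetical helper: length in bytes of the property value for a machine
 * that asks for nr_assoc_refpoints reference points. Asking for more than
 * the array holds is treated as a programming error instead of being
 * clipped silently.
 */
static size_t refpoints_prop_len(uint32_t nr_assoc_refpoints)
{
    assert(nr_assoc_refpoints <= ARRAY_SIZE(refpoints));
    return nr_assoc_refpoints * sizeof(refpoints[0]);
}
```

With the default of two reference points this yields the same length as the
current sizeof(refpoints).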

>  _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
> - refpoints, sizeof(refpoints)));
> + refpoints, nr_refpoints * sizeof(uint32_t)));
>  

Size can be expressed without yet another explicit reference to the
uint32_t type:

nr_refpoints * sizeof(refpoints[0])

>  _FDT(fdt_setprop(fdt, rtas, "ibm,max-associativity-domains",
>   maxdomains, sizeof(maxdomains)));
> @@ -4541,6 +4544,7 @@ static void spapr_machine_class_init(ObjectClass *oc, 
> void *data)
>  smc->linux_pci_probe = true;
>  smc->smp_threads_vsmt = true;
>  smc->nr_xirqs = SPAPR_NR_XIRQS;
> +smc->nr_assoc_refpoints = 2;

When adding a new setting for the default machine type, we usually
take care of older machine types at the same time, i.e. by folding this
patch into the next one. Both patches are simple enough that it should
be okay, and this would avoid touching this line again.

>  xfc->match_nvt = spapr_match_nvt;
>  }
>  
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index e579eaf28c05..abaf9a92adc0 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -129,6 +129,7 @@ struct SpaprMachineClass {
>  bool linux_pci_probe;
>  bool smp_threads_vsmt; /* set VSMT to smp_threads by default */
>  hwaddr rma_limit;  /* clamp the RMA to this size */
> +uint32_t nr_assoc_refpoints;
>  
>  void (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
>uint64_t *buid, hwaddr *pio,

Re: [PATCH v2 0/4] Introduce 'yank' oob qmp command to recover from hanging qemu

Patchew URL: https://patchew.org/QEMU/cover.1590008051.git.lukasstra...@web.de/



Hi,

This series failed the asan build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-debug@fedora TARGET_LIST=x86_64-softmmu J=14 NETWORK=1
=== TEST SCRIPT END ===

/usr/bin/ld: /tmp/qemu-test/src/chardev/char-socket.c:1101: undefined reference 
to `yank_unregister_function'
clang++ -g  -Wl,--warn-common -fsanitize=undefined -fsanitize=address 
-Wl,-z,relro -Wl,-z,now -pie -m64  -fstack-protector-strong -o 
tests/check-qnull tests/check-qnull.o  libqemuutil.a   -lm -lz  -lgthread-2.0 
-pthread -lglib-2.0  -lnettle  -lgnutls  -L/usr/lib -lzstd   -lrt -lz -lutil 
-lcap-ng 
clang -iquote /tmp/qemu-test/build/. -iquote . -iquote 
/tmp/qemu-test/src/tcg/i386 -isystem /tmp/qemu-test/src/linux-headers -isystem 
/tmp/qemu-test/build/linux-headers -iquote . -iquote /tmp/qemu-test/src -iquote 
/tmp/qemu-test/src/accel/tcg -iquote /tmp/qemu-test/src/include -iquote 
/tmp/qemu-test/src/disas/libvixl -I/tmp/qemu-test/src/tests/fp 
-I/tmp/qemu-test/src/tests/fp/berkeley-softfloat-3/source/include 
-I/tmp/qemu-test/src/tests/fp/berkeley-softfloat-3/source/8086-SSE 
-I/tmp/qemu-test/src/tests/fp/berkeley-testfloat-3/source 
-I/usr/include/pixman-1 -Werror -fsanitize=undefined -fsanitize=address 
-pthread -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -fPIE -DPIE -m64 
-mcx16 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE 
-Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wwrite-strings 
-Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv -std=gnu99 
-Wno-string-plus-int -Wno-typedef-redefinition -Wno-initializer-overrides 
-Wexpansion-to-defined -Wendif-labels -Wno-shift-negative-value 
-Wno-missing-include-dirs -Wempty-body -Wnested-externs -Wformat-security 
-Wformat-y2k -Winit-self -Wignored-qualifiers -Wold-style-definition 
-Wtype-limits -fstack-protector-strong -I/usr/include/p11-kit-1 
-DLEGACY_RDMA_REG_MR -DSTRUCT_IOVEC_DEFINED -I/usr/include/libpng16 
-I/usr/include/spice-1 -I/usr/include/spice-server -I/usr/include/cacard 
-I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/nss3 
-I/usr/include/nspr4 -pthread -I/usr/include/libmount -I/usr/include/blkid 
-I/usr/include/uuid -I/usr/include/pixman-1 -DHW_POISON_H -DTARGET_ARM  
-DSOFTFLOAT_ROUND_ODD -DINLINE_LEVEL=5 -DSOFTFLOAT_FAST_DIV32TO16 
-DSOFTFLOAT_FAST_DIV64TO32 -DSOFTFLOAT_FAST_INT64  -DFLOAT16 -DFLOAT64 
-DEXTFLOAT80 -DFLOAT128 -DFLOAT_ROUND_ODD -DLONG_DOUBLE_IS_EXTFLOAT80 
-Wno-missing-prototypes -Wno-redundant-decls -Wno-return-type -Wno-error -MMD 
-MP -MT s_addMagsF16.o -MF ./s_addMagsF16.d -g   -c -o s_addMagsF16.o 
/tmp/qemu-test/src/tests/fp/berkeley-softfloat-3/source/s_addMagsF16.c
clang-8: error: linker command failed with exit code 1 (use -v to see 
invocation)
clang -iquote /tmp/qemu-test/build/. -iquote . -iquote 
/tmp/qemu-test/src/tcg/i386 -isystem /tmp/qemu-test/src/linux-headers -isystem 
/tmp/qemu-test/build/linux-headers -iquote . -iquote /tmp/qemu-test/src -iquote 
/tmp/qemu-test/src/accel/tcg -iquote /tmp/qemu-test/src/include -iquote 
/tmp/qemu-test/src/disas/libvixl -I/tmp/qemu-test/src/tests/fp 
-I/tmp/qemu-test/src/tests/fp/berkeley-softfloat-3/source/include 
-I/tmp/qemu-test/src/tests/fp/berkeley-softfloat-3/source/8086-SSE 
-I/tmp/qemu-test/src/tests/fp/berkeley-testfloat-3/source 
-I/usr/include/pixman-1 -Werror -fsanitize=undefined -fsanitize=address 
-pthread -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -fPIE -DPIE -m64 
-mcx16 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE 
-Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wwrite-strings 
-Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv -std=gnu99 
-Wno-string-plus-int -Wno-typedef-redefinition -Wno-initializer-overrides 
-Wexpansion-to-defined -Wendif-labels -Wno-shift-negative-value 
-Wno-missing-include-dirs -Wempty-body -Wnested-externs -Wformat-security 
-Wformat-y2k -Winit-self -Wignored-qualifiers -Wold-style-definition 
-Wtype-limits -fstack-protector-strong -I/usr/include/p11-kit-1 
-DLEGACY_RDMA_REG_MR -DSTRUCT_IOVEC_DEFINED -I/usr/include/libpng16 
-I/usr/include/spice-1 -I/usr/include/spice-server -I/usr/include/cacard 
-I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/nss3 
-I/usr/include/nspr4 -pthread -I/usr/include/libmount -I/usr/include/blkid 
-I/usr/include/uuid -I/usr/include/pixman-1 -DHW_POISON_H -DTARGET_ARM  
-DSOFTFLOAT_ROUND_ODD -DINLINE_LEVEL=5 -DSOFTFLOAT_FAST_DIV32TO16 
-DSOFTFLOAT_FAST_DIV64TO32 -DSOFTFLOAT_FAST_INT64  -DFLOAT16 -DFLOAT64 
-DEXTFLOAT80 -DFLOAT128 -DFLOAT_ROUND_ODD -DLONG_DOUBLE_IS_EXTFLOAT80 
-Wno-missing-prototypes -Wno-redundant-decls -Wno-return-type -Wno-error -MMD 
-MP -MT s_subMagsF16.o -MF ./s_subMagsF16.d -g   -c -o s_subMagsF16.o

Re: [PATCH v2 0/4] Introduce 'yank' oob qmp command to recover from hanging qemu

Patchew URL: https://patchew.org/QEMU/cover.1590008051.git.lukasstra...@web.de/



Hi,

This series failed the docker-quick@centos7 build test. Please find the
testing commands and their output below. If you have Docker installed, you
can probably reproduce it locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

/tmp/qemu-test/src/chardev/char-socket.c:1381: undefined reference to 
`yank_register_instance'
chardev/char-socket.o: In function `char_socket_finalize':
/tmp/qemu-test/src/chardev/char-socket.c:1084: undefined reference to 
`yank_unregister_instance'
collect2: error: ld returned 1 exit status
  CC  s_normSubnormalF64Sig.o
  CC  s_roundPackToF64.o
  CC  s_normRoundPackToF64.o
make: *** [tests/test-char] Error 1
make: *** Waiting for unfinished jobs
  CC  s_addMagsF64.o
  CC  s_subMagsF64.o
---
Not run: 259
Failures: 140 143 267
Failed 3 of 119 iotests
make: *** [check-tests/check-block.sh] Error 1
make: *** wait: No child processes.  Stop.
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 664, in 
---
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', 
'--label', 'com.qemu.instance.uuid=01104a80117541dfabf73e596a8c2328', '-u', 
'1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', 
'-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 
'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', 
'/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/var/tmp/patchew-tester-tmp-1_i_19yu/src/docker-src.2020-05-20-19.03.34.15520:/var/tmp/qemu:z,ro',
 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit 
status 2.
filter=--filter=label=com.qemu.instance.uuid=01104a80117541dfabf73e596a8c2328
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-1_i_19yu/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    12m8.434s
user    0m8.504s


The full log is available at
http://patchew.org/logs/cover.1590008051.git.lukasstra...@web.de/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com

Re: [PATCH v1 1/2] riscv: sifive_e: Manually define the machine

2020-05-20 Thread Palmer Dabbelt


On Thu, 14 May 2020 13:47:07 PDT (-0700), Alistair Francis wrote:

Signed-off-by: Alistair Francis 
---
 hw/riscv/sifive_e.c | 41 +++--
 include/hw/riscv/sifive_e.h |  4 
 2 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/hw/riscv/sifive_e.c b/hw/riscv/sifive_e.c
index b53109521e..472a98970b 100644
--- a/hw/riscv/sifive_e.c
+++ b/hw/riscv/sifive_e.c
@@ -79,7 +79,7 @@ static void riscv_sifive_e_init(MachineState *machine)
 {
 const struct MemmapEntry *memmap = sifive_e_memmap;

-SiFiveEState *s = g_new0(SiFiveEState, 1);
+SiFiveEState *s = RISCV_E_MACHINE(machine);
 MemoryRegion *sys_mem = get_system_memory();
 MemoryRegion *main_mem = g_new(MemoryRegion, 1);
 int i;
@@ -115,6 +115,35 @@ static void riscv_sifive_e_init(MachineState *machine)
 }
 }

+static void sifive_e_machine_instance_init(Object *obj)
+{
+}
+
+static void sifive_e_machine_class_init(ObjectClass *oc, void *data)
+{
+MachineClass *mc = MACHINE_CLASS(oc);
+
+mc->desc = "RISC-V Board compatible with SiFive E SDK";
+mc->init = riscv_sifive_e_init;
+mc->max_cpus = 1;
+mc->default_cpu_type = SIFIVE_E_CPU;
+}
+
+static const TypeInfo sifive_e_machine_typeinfo = {
+.name   = MACHINE_TYPE_NAME("sifive_e"),
+.parent = TYPE_MACHINE,
+.class_init = sifive_e_machine_class_init,
+.instance_init = sifive_e_machine_instance_init,
+.instance_size = sizeof(SiFiveEState),
+};
+
+static void sifive_e_machine_init_register_types(void)
+{
+type_register_static(&sifive_e_machine_typeinfo);
+}
+
+type_init(sifive_e_machine_init_register_types)
+
 static void riscv_sifive_e_soc_init(Object *obj)
 {
 MachineState *ms = MACHINE(qdev_get_machine());
@@ -214,16 +243,6 @@ static void riscv_sifive_e_soc_realize(DeviceState *dev, 
Error **errp)
 >xip_mem);
 }

-static void riscv_sifive_e_machine_init(MachineClass *mc)
-{
-mc->desc = "RISC-V Board compatible with SiFive E SDK";
-mc->init = riscv_sifive_e_init;
-mc->max_cpus = 1;
-mc->default_cpu_type = SIFIVE_E_CPU;
-}
-
-DEFINE_MACHINE("sifive_e", riscv_sifive_e_machine_init)
-
 static void riscv_sifive_e_soc_class_init(ObjectClass *oc, void *data)
 {
 DeviceClass *dc = DEVICE_CLASS(oc);
diff --git a/include/hw/riscv/sifive_e.h b/include/hw/riscv/sifive_e.h
index 25ce7aa9d5..414992119e 100644
--- a/include/hw/riscv/sifive_e.h
+++ b/include/hw/riscv/sifive_e.h
@@ -47,6 +47,10 @@ typedef struct SiFiveEState {
 SiFiveESoCState soc;
 } SiFiveEState;

+#define TYPE_RISCV_E_MACHINE MACHINE_TYPE_NAME("sifive_e")
+#define RISCV_E_MACHINE(obj) \
+OBJECT_CHECK(SiFiveEState, (obj), TYPE_RISCV_E_MACHINE)
+
 enum {
 SIFIVE_E_DEBUG,
 SIFIVE_E_MROM,


Reviewed-by: Palmer Dabbelt

Re: [PATCH v1 2/2] sifive_e: Support the revB machine

2020-05-20 Thread Palmer Dabbelt


On Thu, 14 May 2020 13:47:10 PDT (-0700), Alistair Francis wrote:

Signed-off-by: Alistair Francis 
---
 hw/riscv/sifive_e.c | 35 +++
 include/hw/riscv/sifive_e.h |  1 +
 2 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/hw/riscv/sifive_e.c b/hw/riscv/sifive_e.c
index 472a98970b..cb7818341b 100644
--- a/hw/riscv/sifive_e.c
+++ b/hw/riscv/sifive_e.c
@@ -98,10 +98,14 @@ static void riscv_sifive_e_init(MachineState *machine)
 memmap[SIFIVE_E_DTIM].base, main_mem);

 /* Mask ROM reset vector */
-uint32_t reset_vec[2] = {
-0x204002b7,/* 0x1000: lui t0,0x20400 */
-0x00028067,/* 0x1004: jr  t0 */
-};
+uint32_t reset_vec[2];
+
+if (s->revb) {
+reset_vec[0] = 0x200102b7;/* 0x1000: lui t0,0x20010 */
+} else {
+reset_vec[0] = 0x204002b7;/* 0x1000: lui t0,0x20400 */
+}
+reset_vec[1] = 0x00028067;/* 0x1004: jr  t0 */

 /* copy in the reset vector in little_endian byte order */
 for (i = 0; i < sizeof(reset_vec) >> 2; i++) {
@@ -115,8 +119,31 @@ static void riscv_sifive_e_init(MachineState *machine)
 }
 }

+static bool sifive_e_machine_get_revb(Object *obj, Error **errp)
+{
+SiFiveEState *s = RISCV_E_MACHINE(obj);
+
+return s->revb;
+}
+
+static void sifive_e_machine_set_revb(Object *obj, bool value, Error **errp)
+{
+SiFiveEState *s = RISCV_E_MACHINE(obj);
+
+s->revb = value;
+}
+
 static void sifive_e_machine_instance_init(Object *obj)
 {
+SiFiveEState *s = RISCV_E_MACHINE(obj);
+
+s->revb = false;
+object_property_add_bool(obj, "revb", sifive_e_machine_get_revb,
+ sifive_e_machine_set_revb, NULL);
+object_property_set_description(obj, "revb",
+"Set on to tell QEMU that it should model "
+"the revB HiFive1 board",
+NULL);
 }

 static void sifive_e_machine_class_init(ObjectClass *oc, void *data)
diff --git a/include/hw/riscv/sifive_e.h b/include/hw/riscv/sifive_e.h
index 414992119e..0d3cd07fcc 100644
--- a/include/hw/riscv/sifive_e.h
+++ b/include/hw/riscv/sifive_e.h
@@ -45,6 +45,7 @@ typedef struct SiFiveEState {

 /*< public >*/
 SiFiveESoCState soc;
+bool revb;
 } SiFiveEState;

 #define TYPE_RISCV_E_MACHINE MACHINE_TYPE_NAME("sifive_e")


IIRC there are way more differences between the un-suffixed FE310 and the Rev
B; specifically, the interrupt map is all different.
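
As a side note on the reset vector in the patch, the three instruction words
can be re-derived from the base RISC-V encodings. This is only a sketch for
cross-checking the constants; the helper names rv_lui() and rv_jalr() and the
register enum are made up for illustration and are not QEMU functions:

```c
#include <assert.h>
#include <stdint.h>

/* Integer register numbers used below: zero is x0, t0 is x5. */
enum { RV_ZERO = 0, RV_T0 = 5 };

/* U-type LUI: imm[31:12] | rd | opcode 0x37 */
static uint32_t rv_lui(unsigned rd, uint32_t imm20)
{
    return (imm20 << 12) | ((uint32_t)rd << 7) | 0x37;
}

/* I-type JALR: imm[11:0] | rs1 | funct3 = 000 | rd | opcode 0x67 */
static uint32_t rv_jalr(unsigned rd, unsigned rs1, uint32_t imm12)
{
    return ((imm12 & 0xfff) << 20) | ((uint32_t)rs1 << 15) |
           ((uint32_t)rd << 7) | 0x67;
}
```

Here "jr t0" is the usual assembler shorthand for "jalr zero, 0(t0)", so only
the upper immediate of the lui differs between the two board revisions.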

Re: [PATCH v5 2/5] qcow2: Expose bitmaps' size during measure

2020-05-20 Thread Nir Soffer

On Thu, May 21, 2020 at 1:01 AM Eric Blake  wrote:
>
> It's useful to know how much space can be occupied by qcow2 persistent
> bitmaps, even though such metadata is unrelated to the guest-visible
> data.  Report this value as an additional QMP field, present when
> measuring an existing image and output format that both support
> bitmaps.  Update iotest 178 and 190 to updated output, as well as new
> coverage in 190 demonstrating non-zero values made possible with the
> recently-added qemu-img bitmap command.
>
> On the command-line side, 'qemu-img measure' gains a new --bitmaps
> flag.  When present, the bitmap size is rolled into the two existing
> measures (or errors if either the source image or destination format
> lacks bitmaps); when absent, there is never an error (for
> back-compat), but the output will instead include a new line item for
> bitmaps (which you would have to manually add), with that line being
> omitted in the same cases where passing --bitmaps would error.

Supporting 2 ways to measure, one by specifying --bitmaps, and another
by adding bitmaps key is not a good idea. We really need one way.

Each one has advantages. Adding a --bitmaps flag is consistent with
"qemu-img convert" and future extensions that may require a new flag,
and adding a "bitmaps" key is consistent with "qemu-img info", showing
bitmaps when they exist.

Adding a "bitmaps" key has an advantage that we can use it to test if qemu-img
supports measuring and copying bitmaps (since both features are expected to
be delivered at the same time). So we can avoid checking --help to learn about
the capabilities.

I'm ok with both options, can we have only one?

> The behavior chosen here is symmetrical with the upcoming 'qemu-img
> convert --bitmaps' being added in the next patch: that is, either both
> commands will succeed (your qemu-img was new enough to do bitmap
> manipulations, AND you correctly measured and copied the bitmaps, even
> if that measurement was 0 because there was nothing to copy) or both
> fail (either your qemu-img is too old to understand --bitmaps, or it
> understands it but your choice of images do not support seamless
> transition of bitmaps because either source, destination, or both lack
> bitmap support).
>
> The addition of a new field demonstrates why we should always
> zero-initialize qapi C structs; while the qcow2 driver still fully
> populates all fields, the raw and crypto drivers had to be tweaked to
> avoid uninitialized data.
>
> See also: https://bugzilla.redhat.com/1779904
>
> Reported-by: Nir Soffer 
> Signed-off-by: Eric Blake 
> ---
>  docs/tools/qemu-img.rst  | 12 ++-
>  qapi/block-core.json | 15 ++---
>  block/qcow2.h|  2 ++
>  block/crypto.c   |  2 +-
>  block/qcow2-bitmap.c | 36 
>  block/qcow2.c| 14 ++--
>  block/raw-format.c   |  2 +-
>  qemu-img.c   | 25 ++
>  qemu-img-cmds.hx |  4 +--
>  tests/qemu-iotests/178.out.qcow2 | 16 +
>  tests/qemu-iotests/190   | 58 ++--
>  tests/qemu-iotests/190.out   | 35 ++-
>  12 files changed, 205 insertions(+), 16 deletions(-)
>
> diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
> index 38d464ea3f23..9a8112fc9f58 100644
> --- a/docs/tools/qemu-img.rst
> +++ b/docs/tools/qemu-img.rst
> @@ -593,7 +593,7 @@ Command description:
>For more information, consult ``include/block/block.h`` in QEMU's
>source code.
>
> -.. option:: measure [--output=OFMT] [-O OUTPUT_FMT] [-o OPTIONS] [--size N | 
> [--object OBJECTDEF] [--image-opts] [-f FMT] [-l SNAPSHOT_PARAM] FILENAME]
> +.. option:: measure [--output=OFMT] [-O OUTPUT_FMT] [-o OPTIONS] [--size N | 
> [--object OBJECTDEF] [--image-opts] [-f FMT] [--bitmaps] [-l SNAPSHOT_PARAM] 
> FILENAME]
>
>Calculate the file size required for a new image.  This information
>can be used to size logical volumes or SAN LUNs appropriately for
> @@ -616,6 +616,7 @@ Command description:
>
>  required size: 524288
>  fully allocated size: 1074069504
> +bitmaps size: 0
>
>The ``required size`` is the file size of the new image.  It may be smaller
>than the virtual disk size if the image format supports compact 
> representation.
> @@ -625,6 +626,15 @@ Command description:
>occupy with the exception of internal snapshots, dirty bitmaps, vmstate 
> data,
>and other advanced image format features.
>
> +  The ``bitmaps size`` is the additional size required in order to
> +  copy bitmaps from a source image in addition to the guest-visible
> +  data; the line is omitted if either source or destination lacks
> +  bitmap support, or 0 if bitmaps are supported but there is nothing
> +  to copy.  If the ``--bitmaps`` option is in use, the bitmap size is
> +  instead folded into the required and fully-allocated size for
> +  convenience,

Re: [PATCH v2 1/4] Introduce yank feature

2020-05-20 Thread Paolo Bonzini

On 20/05/20 23:05, Lukas Straub wrote:
> +
> +void yank_init(void)
> +{
> +qemu_mutex_init();
> +}

You can use __constructor__ for this to avoid the call in vl.c.  See
job.c for an example.
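
For readers without job.c at hand, a rough standalone sketch of the idea. It
uses a plain pthread mutex instead of QemuMutex, and the names yank_lock,
yank_lock_ready and yank_is_initialized() are assumptions made for this
example (the quoted hunk does not show what the mutex is called):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical lock name; stands in for whatever the yank code protects. */
static pthread_mutex_t yank_lock;
static bool yank_lock_ready;

/*
 * GCC/Clang extension: a constructor function runs before main(), so no
 * explicit yank_init() call from the startup code is needed.
 */
static void __attribute__((constructor)) yank_init(void)
{
    pthread_mutex_init(&yank_lock, NULL);
    yank_lock_ready = true;
}

/* Small accessor so callers (and tests) can check the setup happened. */
static bool yank_is_initialized(void)
{
    return yank_lock_ready;
}
```

The trade-off is that initialization order between several constructors is
not something to rely on, so this fits best for setup with no dependencies,
such as a mutex.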

Thanks,

Paolo

[PATCH v5 3/5] qemu-img: Factor out code for merging bitmaps

The next patch will add another client that wants to merge dirty
bitmaps; it will be easier to refactor the code to construct the QAPI
struct correctly into a helper function.

Signed-off-by: Eric Blake 
---
 qemu-img.c | 33 -
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index d719b9d35468..c1bafb57023a 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1639,6 +1639,23 @@ out4:
 return ret;
 }

+static void do_dirty_bitmap_merge(const char *dst_node, const char *dst_name,
+  const char *src_node, const char *src_name,
+  Error **errp)
+{
+BlockDirtyBitmapMergeSource *merge_src;
+BlockDirtyBitmapMergeSourceList *list;
+
+merge_src = g_new0(BlockDirtyBitmapMergeSource, 1);
+merge_src->type = QTYPE_QDICT;
+merge_src->u.external.node = g_strdup(src_node);
+merge_src->u.external.name = g_strdup(src_name);
+list = g_new0(BlockDirtyBitmapMergeSourceList, 1);
+list->value = merge_src;
+qmp_block_dirty_bitmap_merge(dst_node, dst_name, list, errp);
+qapi_free_BlockDirtyBitmapMergeSourceList(list);
+}
+
 enum ImgConvertBlockStatus {
 BLK_DATA,
 BLK_ZERO,
@@ -4715,21 +4732,11 @@ static int img_bitmap(int argc, char **argv)
 qmp_block_dirty_bitmap_disable(bs->node_name, bitmap, &err);
 op = "disable";
 break;
-case BITMAP_MERGE: {
-BlockDirtyBitmapMergeSource *merge_src;
-BlockDirtyBitmapMergeSourceList *list;
-
-merge_src = g_new0(BlockDirtyBitmapMergeSource, 1);
-merge_src->type = QTYPE_QDICT;
-merge_src->u.external.node = g_strdup(src_bs->node_name);
-merge_src->u.external.name = g_strdup(act->src);
-list = g_new0(BlockDirtyBitmapMergeSourceList, 1);
-list->value = merge_src;
-qmp_block_dirty_bitmap_merge(bs->node_name, bitmap, list, &err);
-qapi_free_BlockDirtyBitmapMergeSourceList(list);
+case BITMAP_MERGE:
+do_dirty_bitmap_merge(bs->node_name, bitmap, src_bs->node_name,
+  act->src, &err);
 op = "merge";
 break;
-}
 default:
 g_assert_not_reached();
 }
-- 
2.26.2

[PATCH v5 4/5] qemu-img: Add convert --bitmaps option

Make it easier to copy all the persistent bitmaps of (the top layer
of) a source image along with its guest-visible contents, by adding a
boolean flag for use with qemu-img convert.  This is basically
shorthand, as the same effect could be accomplished with a series of
'qemu-img bitmap --add' and 'qemu-img bitmap --merge -b source'
commands, or by QMP commands.

Note that this command will fail in the same scenarios where 'qemu-img
measure --bitmaps' fails, when either the source or the destination
lacks persistent bitmap support altogether.

See also https://bugzilla.redhat.com/show_bug.cgi?id=1779893

While touching this, clean up a couple coding issues spotted in the
same function: an extra blank line, and merging back-to-back 'if
(!skip_create)' blocks.

Signed-off-by: Eric Blake 
Message-Id: <20200513011648.166876-9-ebl...@redhat.com>
---
 docs/tools/qemu-img.rst |  6 +++-
 qemu-img.c  | 77 +++--
 qemu-img-cmds.hx|  4 +--
 3 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
index 9a8112fc9f58..35050fc51070 100644
--- a/docs/tools/qemu-img.rst
+++ b/docs/tools/qemu-img.rst
@@ -162,6 +162,10 @@ Parameters to convert subcommand:

 .. program:: qemu-img-convert

+.. option:: --bitmaps
+
+  Additionally copy all persistent bitmaps from the top layer of the source
+
 .. option:: -n

   Skip the creation of the target volume
@@ -397,7 +401,7 @@ Command description:
   4
 Error on reading data

-.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] 
[--target-is-zero] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T 
SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l SNAPSHOT_PARAM] 
[-S SPARSE_SIZE] [-m NUM_COROUTINES] [-W] FILENAME [FILENAME2 [...]] 
OUTPUT_FILENAME
+.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] 
[--target-is-zero] [--bitmaps] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t 
CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l 
SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-m NUM_COROUTINES] [-W] FILENAME [FILENAME2 
[...]] OUTPUT_FILENAME

   Convert the disk image *FILENAME* or a snapshot *SNAPSHOT_PARAM*
   to disk image *OUTPUT_FILENAME* using format *OUTPUT_FMT*. It can
diff --git a/qemu-img.c b/qemu-img.c
index c1bafb57023a..1494d8f5c409 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -192,6 +192,7 @@ static void QEMU_NORETURN help(void)
"   hiding corruption that has already occurred.\n"
"\n"
"Parameters to convert subcommand:\n"
+   "  '--bitmaps' copies all top-level persistent bitmaps to 
destination\n"
"  '-m' specifies how many coroutines work in parallel during the 
convert\n"
"   process (defaults to 8)\n"
"  '-W' allow to write to the target out of order rather than 
sequential\n"
@@ -2139,6 +2140,39 @@ static int convert_do_copy(ImgConvertState *s)
 return s->ret;
 }

+static int convert_copy_bitmaps(BlockDriverState *src, BlockDriverState *dst)
+{
+BdrvDirtyBitmap *bm;
+Error *err = NULL;
+
+FOR_EACH_DIRTY_BITMAP(src, bm) {
+const char *name;
+
+if (!bdrv_dirty_bitmap_get_persistence(bm)) {
+continue;
+}
+name = bdrv_dirty_bitmap_name(bm);
+qmp_block_dirty_bitmap_add(dst->node_name, name,
+   true, bdrv_dirty_bitmap_granularity(bm),
+   true, true,
+   true, !bdrv_dirty_bitmap_enabled(bm),
+   &err);
+if (err) {
+error_reportf_err(err, "Failed to create bitmap %s: ", name);
+return -1;
+}
+
+do_dirty_bitmap_merge(dst->node_name, name, src->node_name, name,
+  &err);
+if (err) {
+error_reportf_err(err, "Failed to populate bitmap %s: ", name);
+return -1;
+}
+}
+
+return 0;
+}
+
 #define MAX_BUF_SECTORS 32768

 static int img_convert(int argc, char **argv)
@@ -2160,6 +2194,8 @@ static int img_convert(int argc, char **argv)
 int64_t ret = -EINVAL;
 bool force_share = false;
 bool explict_min_sparse = false;
+bool bitmaps = false;
+size_t nbitmaps = 0;

 ImgConvertState s = (ImgConvertState) {
 /* Need at least 4k of zeros for sparse detection */
@@ -2179,6 +2215,7 @@ static int img_convert(int argc, char **argv)
 {"target-image-opts", no_argument, 0, OPTION_TARGET_IMAGE_OPTS},
 {"salvage", no_argument, 0, OPTION_SALVAGE},
 {"target-is-zero", no_argument, 0, OPTION_TARGET_IS_ZERO},
+{"bitmaps", no_argument, 0, OPTION_BITMAPS},
 {0, 0, 0, 0}
 };
 c = getopt_long(argc, argv, ":hf:O:B:Cco:l:S:pt:T:qnm:WU",
@@ -2304,6 +2341,9 @@ static int img_convert(int argc, char **argv)
  */

[PATCH v5 0/5] qemu-img: Add convert --bitmaps

v4 was here:
https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg03182.html
original cover letter here:
https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg03464.html

Based-on: <20200519175707.815782-1-ebl...@redhat.com>
[pull v3 bitmaps patches for 2020-05-18]

Since then:
- patch 1 is new (fixes regression from recent NBD pull)
- patch 2, 4: include fixes suggested by Vladimir; biggest is that
bitmaps computation is now in qcow2-bitmaps.c instead of qcow2.c
- patch 3: split out from patch 4 (was v4 8/9)
- patch 5: rebase to master

001/5:[down] 'iotests: Fix test 178'
002/5:[0106] [FC] 'qcow2: Expose bitmaps' size during measure'
003/5:[down] 'qemu-img: Factor out code for merging bitmaps'
004/5:[0012] [FC] 'qemu-img: Add convert --bitmaps option'
005/5:[0002] [FC] 'iotests: Add test 291 to for qemu-img bitmap coverage'

Series can also be downloaded at:
https://repo.or.cz/qemu/ericb.git/shortlog/refs/tags/qemu-img-bitmaps-v5

Eric Blake (5):
  iotests: Fix test 178
  qcow2: Expose bitmaps' size during measure
  qemu-img: Factor out code for merging bitmaps
  qemu-img: Add convert --bitmaps option
  iotests: Add test 291 to for qemu-img bitmap coverage

 docs/tools/qemu-img.rst  |  18 -
 qapi/block-core.json |  15 ++--
 block/qcow2.h|   2 +
 block/crypto.c   |   2 +-
 block/qcow2-bitmap.c |  36 +
 block/qcow2.c|  14 +++-
 block/raw-format.c   |   2 +-
 qemu-img.c   | 135 +++
 qemu-img-cmds.hx |   8 +-
 tests/qemu-iotests/178.out.qcow2 |  18 -
 tests/qemu-iotests/178.out.raw   |   2 +-
 tests/qemu-iotests/190   |  58 -
 tests/qemu-iotests/190.out   |  35 +++-
 tests/qemu-iotests/291   | 112 +
 tests/qemu-iotests/291.out   |  80 ++
 tests/qemu-iotests/group |   1 +
 16 files changed, 501 insertions(+), 37 deletions(-)
 create mode 100755 tests/qemu-iotests/291
 create mode 100644 tests/qemu-iotests/291.out

-- 
2.26.2

[PATCH v5 2/5] qcow2: Expose bitmaps' size during measure

It's useful to know how much space can be occupied by qcow2 persistent
bitmaps, even though such metadata is unrelated to the guest-visible
data.  Report this value as an additional QMP field, present when
measuring an existing image and output format that both support
bitmaps.  Update iotest 178 and 190 to updated output, as well as new
coverage in 190 demonstrating non-zero values made possible with the
recently-added qemu-img bitmap command.

On the command-line side, 'qemu-img measure' gains a new --bitmaps
flag.  When present, the bitmap size is rolled into the two existing
measures (or errors if either the source image or destination format
lacks bitmaps); when absent, there is never an error (for
back-compat), but the output will instead include a new line item for
bitmaps (which you would have to manually add), with that line being
omitted in the same cases where passing --bitmaps would error.
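
The flag semantics described above can be modelled roughly as follows. This
is a simplified sketch, not the code from the patch: the MeasureResult struct
and apply_bitmaps_flag() are hypothetical names, and error reporting is
reduced to a return value:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the measure result (assumption for this sketch). */
typedef struct MeasureResult {
    uint64_t required;
    uint64_t fully_allocated;
    bool has_bitmaps;   /* false if source or destination lacks bitmaps */
    uint64_t bitmaps;   /* only meaningful when has_bitmaps is true */
} MeasureResult;

/*
 * Returns 0 on success, -1 if --bitmaps was requested but bitmaps are not
 * supported. Without the flag nothing changes and the caller prints a
 * separate "bitmaps size" line only when has_bitmaps is set.
 */
static int apply_bitmaps_flag(MeasureResult *m, bool bitmaps_flag)
{
    if (!bitmaps_flag) {
        return 0;
    }
    if (!m->has_bitmaps) {
        return -1;
    }
    /* Fold the bitmap size into both existing measures. */
    m->required += m->bitmaps;
    m->fully_allocated += m->bitmaps;
    m->has_bitmaps = false;   /* no separate line item any more */
    return 0;
}
```

So a caller that does not pass the flag has to add the separate line to the
other two values itself, which is the manual step mentioned above.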

The behavior chosen here is symmetrical with the upcoming 'qemu-img
convert --bitmaps' being added in the next patch: that is, either both
commands will succeed (your qemu-img was new enough to do bitmap
manipulations, AND you correctly measured and copied the bitmaps, even
if that measurement was 0 because there was nothing to copy) or both
fail (either your qemu-img is too old to understand --bitmaps, or it
understands it but your choice of images do not support seamless
transition of bitmaps because either source, destination, or both lack
bitmap support).

The addition of a new field demonstrates why we should always
zero-initialize qapi C structs; while the qcow2 driver still fully
populates all fields, the raw and crypto drivers had to be tweaked to
avoid uninitialized data.
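
To illustrate the zero-initialization point with a standalone sketch: the
struct below is a cut-down stand-in for the QAPI-generated type, calloc()
plays the role of g_new0(), and old_driver_measure() is a hypothetical driver
callback that predates the new field:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for the generated struct (assumption for this sketch). */
typedef struct MeasureInfo {
    int64_t required;
    int64_t fully_allocated;
    bool has_bitmaps;   /* marker for the newly added optional field */
    int64_t bitmaps;
} MeasureInfo;

/*
 * A driver written before the new field existed only fills in the two
 * members it knows about. Because the allocation is zeroed, has_bitmaps
 * reliably reads as false instead of containing leftover heap data.
 */
static MeasureInfo *old_driver_measure(int64_t size)
{
    MeasureInfo *info = calloc(1, sizeof(*info));

    if (!info) {
        return NULL;
    }
    info->required = size;
    info->fully_allocated = size;
    return info;
}
```

With a non-zeroing allocation the same driver would leave has_bitmaps
undefined, and a caller could end up printing a garbage bitmaps value.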

See also: https://bugzilla.redhat.com/1779904

Reported-by: Nir Soffer 
Signed-off-by: Eric Blake 
---
 docs/tools/qemu-img.rst  | 12 ++-
 qapi/block-core.json | 15 ++---
 block/qcow2.h|  2 ++
 block/crypto.c   |  2 +-
 block/qcow2-bitmap.c | 36 
 block/qcow2.c| 14 ++--
 block/raw-format.c   |  2 +-
 qemu-img.c   | 25 ++
 qemu-img-cmds.hx |  4 +--
 tests/qemu-iotests/178.out.qcow2 | 16 +
 tests/qemu-iotests/190   | 58 ++--
 tests/qemu-iotests/190.out   | 35 ++-
 12 files changed, 205 insertions(+), 16 deletions(-)

diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
index 38d464ea3f23..9a8112fc9f58 100644
--- a/docs/tools/qemu-img.rst
+++ b/docs/tools/qemu-img.rst
@@ -593,7 +593,7 @@ Command description:
   For more information, consult ``include/block/block.h`` in QEMU's
   source code.

-.. option:: measure [--output=OFMT] [-O OUTPUT_FMT] [-o OPTIONS] [--size N | [--object OBJECTDEF] [--image-opts] [-f FMT] [-l SNAPSHOT_PARAM] FILENAME]
+.. option:: measure [--output=OFMT] [-O OUTPUT_FMT] [-o OPTIONS] [--size N | [--object OBJECTDEF] [--image-opts] [-f FMT] [--bitmaps] [-l SNAPSHOT_PARAM] FILENAME]

   Calculate the file size required for a new image.  This information
   can be used to size logical volumes or SAN LUNs appropriately for
@@ -616,6 +616,7 @@ Command description:

 required size: 524288
 fully allocated size: 1074069504
+bitmaps size: 0

   The ``required size`` is the file size of the new image.  It may be smaller
   than the virtual disk size if the image format supports compact representation.
@@ -625,6 +626,15 @@ Command description:
   occupy with the exception of internal snapshots, dirty bitmaps, vmstate data,
   and other advanced image format features.

+  The ``bitmaps size`` is the additional size required in order to
+  copy bitmaps from a source image in addition to the guest-visible
+  data; the line is omitted if either source or destination lacks
+  bitmap support, or 0 if bitmaps are supported but there is nothing
+  to copy.  If the ``--bitmaps`` option is in use, the bitmap size is
+  instead folded into the required and fully-allocated size for
+  convenience, rather than being a separate line item; using the
+  option will raise an error if bitmaps are not supported.
+
 .. option:: snapshot [--object OBJECTDEF] [--image-opts] [-U] [-q] [-l | -a SNAPSHOT | -c SNAPSHOT | -d SNAPSHOT] FILENAME

   List, apply, create or delete snapshots in image *FILENAME*.
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 6fbacddab2cc..d5049c309380 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -636,18 +636,23 @@
 # efficiently so file size may be smaller than virtual disk size.
 #
 # The values are upper bounds that are guaranteed to fit the new image file.
-# Subsequent modification, such as internal snapshot or bitmap creation, may
-# require additional space and is not covered here.
+# Subsequent modification, such as internal snapshot or further bitmap
+# creation, may require additional space and is not

[PATCH v5 1/5] iotests: Fix test 178

A recent change to qemu-img changed expected error message output, but
178 takes long enough to execute that it does not get run by 'make
check' or './check -g quick'.

Fixes: 43d589b074
Signed-off-by: Eric Blake 
---
 tests/qemu-iotests/178.out.qcow2 | 2 +-
 tests/qemu-iotests/178.out.raw   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/178.out.qcow2 b/tests/qemu-iotests/178.out.qcow2
index f59bf4b2fbc4..4b69524c80ee 100644
--- a/tests/qemu-iotests/178.out.qcow2
+++ b/tests/qemu-iotests/178.out.qcow2
@@ -13,7 +13,7 @@ qemu-img: Invalid option list: ,
 qemu-img: Invalid parameter 'snapshot.foo'
 qemu-img: Failed in parsing snapshot param 'snapshot.foo'
 qemu-img: --output must be used with human or json as argument.
-qemu-img: Image size must be less than 8 EiB!
+qemu-img: Invalid image size specified. Must be between 0 and 9223372036854775807.
 qemu-img: Unknown file format 'foo'

 == Size calculation for a new file (human) ==
diff --git a/tests/qemu-iotests/178.out.raw b/tests/qemu-iotests/178.out.raw
index 404ca908d8c2..20e17da115cb 100644
--- a/tests/qemu-iotests/178.out.raw
+++ b/tests/qemu-iotests/178.out.raw
@@ -13,7 +13,7 @@ qemu-img: Invalid option list: ,
 qemu-img: Invalid parameter 'snapshot.foo'
 qemu-img: Failed in parsing snapshot param 'snapshot.foo'
 qemu-img: --output must be used with human or json as argument.
-qemu-img: Image size must be less than 8 EiB!
+qemu-img: Invalid image size specified. Must be between 0 and 9223372036854775807.
 qemu-img: Unknown file format 'foo'

 == Size calculation for a new file (human) ==
-- 
2.26.2

[PATCH v5 5/5] iotests: Add test 291 for qemu-img bitmap coverage

Add a new test covering the 'qemu-img bitmap' subcommand, as well as
'qemu-img convert --bitmaps', both added in recent patches.

Signed-off-by: Eric Blake 
Reviewed-by: Max Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/291 | 112 +
 tests/qemu-iotests/291.out |  80 ++
 tests/qemu-iotests/group   |   1 +
 3 files changed, 193 insertions(+)
 create mode 100755 tests/qemu-iotests/291
 create mode 100644 tests/qemu-iotests/291.out

diff --git a/tests/qemu-iotests/291 b/tests/qemu-iotests/291
new file mode 100755
index ..3ca83b9cd1f7
--- /dev/null
+++ b/tests/qemu-iotests/291
@@ -0,0 +1,112 @@
+#!/usr/bin/env bash
+#
+# Test qemu-img bitmap handling
+#
+# Copyright (C) 2018-2020 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+
+status=1 # failure is the default!
+
+_cleanup()
+{
+_cleanup_test_img
+nbd_server_stop
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+. ./common.nbd
+
+_supported_fmt qcow2
+_supported_proto file
+_supported_os Linux
+_require_command QEMU_NBD
+
+echo
+echo "=== Initial image setup ==="
+echo
+
+# Create backing image with one bitmap
+TEST_IMG="$TEST_IMG.base" _make_test_img 10M
+$QEMU_IMG bitmap --add -f $IMGFMT "$TEST_IMG.base" b0
+$QEMU_IO -c 'w 3M 1M' -f $IMGFMT "$TEST_IMG.base" | _filter_qemu_io
+
+# Create initial image and populate two bitmaps: one active, one inactive.
+ORIG_IMG=$TEST_IMG
+TEST_IMG=$TEST_IMG.orig
+_make_test_img -b "$ORIG_IMG.base" -F $IMGFMT 10M
+$QEMU_IO -c 'w 0 1M' -f $IMGFMT "$TEST_IMG" | _filter_qemu_io
+$QEMU_IMG bitmap --add -g 512k -f $IMGFMT "$TEST_IMG" b1
+$QEMU_IMG bitmap --add --disable -f $IMGFMT "$TEST_IMG" b2
+$QEMU_IO -c 'w 3M 1M' -f $IMGFMT "$TEST_IMG" | _filter_qemu_io
+$QEMU_IMG bitmap --clear -f $IMGFMT "$TEST_IMG" b1
+$QEMU_IO -c 'w 1M 1M' -f $IMGFMT "$TEST_IMG" | _filter_qemu_io
+$QEMU_IMG bitmap --disable -f $IMGFMT "$TEST_IMG" b1
+$QEMU_IMG bitmap --enable -f $IMGFMT "$TEST_IMG" b2
+$QEMU_IO -c 'w 2M 1M' -f $IMGFMT "$TEST_IMG" | _filter_qemu_io
+
+echo
+echo "=== Bitmap preservation not possible to non-qcow2 ==="
+echo
+
+TEST_IMG=$ORIG_IMG
+$QEMU_IMG convert --bitmaps -O raw "$TEST_IMG.orig" "$TEST_IMG" &&
+echo "unexpected success"
+
+echo
+echo "=== Convert with bitmap preservation ==="
+echo
+
+# Only bitmaps from the active layer are copied
+$QEMU_IMG convert --bitmaps -O qcow2 "$TEST_IMG.orig" "$TEST_IMG"
+$QEMU_IMG info "$TEST_IMG" | _filter_img_info --format-specific
+# But we can also merge in bitmaps from other layers.  This test is a bit
+# contrived to cover more code paths, in reality, you could merge directly
+# into b0 without going through tmp
+$QEMU_IMG bitmap --add --disable -f $IMGFMT "$TEST_IMG" b0
+$QEMU_IMG bitmap --add --merge b0 -b "$TEST_IMG.base" -F $IMGFMT \
+ -f $IMGFMT "$TEST_IMG" tmp
+$QEMU_IMG bitmap --merge tmp -f $IMGFMT "$TEST_IMG" b0
+$QEMU_IMG bitmap --remove --image-opts \
+driver=$IMGFMT,file.driver=file,file.filename="$TEST_IMG" tmp
+$QEMU_IMG info "$TEST_IMG" | _filter_img_info --format-specific
+
+echo
+echo "=== Check bitmap contents ==="
+echo
+
+# x-dirty-bitmap is a hack for reading bitmaps; it abuses block status to
+# report "data":false for portions of the bitmap which are set
+IMG="driver=nbd,server.type=unix,server.path=$nbd_unix_socket"
+nbd_server_start_unix_socket -r -f qcow2 -B b0 "$TEST_IMG"
+$QEMU_IMG map --output=json --image-opts \
+"$IMG,x-dirty-bitmap=qemu:dirty-bitmap:b0" | _filter_qemu_img_map
+nbd_server_start_unix_socket -r -f qcow2 -B b1 "$TEST_IMG"
+$QEMU_IMG map --output=json --image-opts \
+"$IMG,x-dirty-bitmap=qemu:dirty-bitmap:b1" | _filter_qemu_img_map
+nbd_server_start_unix_socket -r -f qcow2 -B b2 "$TEST_IMG"
+$QEMU_IMG map --output=json --image-opts \
+"$IMG,x-dirty-bitmap=qemu:dirty-bitmap:b2" | _filter_qemu_img_map
+
+# success, all done
+echo '*** done'
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/291.out b/tests/qemu-iotests/291.out
new file mode 100644
index ..8c62017567e9
--- /dev/null
+++ b/tests/qemu-iotests/291.out
@@ -0,0 +1,80 @@
+QA output created by 291
+
+=== Initial image setup ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=10485760

[Bug 1856335] Re: Cache Layout wrong on many Zen Arch CPUs

2020-05-20 Thread Heiko Sieger

Jan, I tried your suggestion but it didn't make a difference. Here is my
current setup:

h/w: AMD Ryzen 9 3900X
kernel: 5.4
QEMU: 5.0.0-6
Chipset selection: Q35-5.0

Configuration: host-passthrough, cache enabled

Use CoreInfo.exe inside Windows. The problem is this:

Logical Processor to Cache Map:
**-- Data Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
**-- Instruction Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
**-- Unified Cache 0, Level 2, 512 KB, Assoc 8, LineSize 64
 Unified Cache 1, Level 3, 16 MB, Assoc 16, LineSize 64

The last line above should be as follows:

**-- Unified Cache 0, Level 3, 16 MB, Assoc 16, LineSize 64

The cache is supposed to be associated with 3 cores with 2 threads each in
group 0. Yet it shows 8 (2x4) vcpus inside a cache that is associated with
the next group.

In total, I always get 3 L3 caches instead of 4 L3 caches for my 12
cores / 24 threads. Also see my next post.
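
The arithmetic behind the expected and observed layouts can be sketched in Python. The function name `l3_layout` is hypothetical; the inputs are the figures from this report (12 cores, 2 threads per core, 3 cores per CCX on the host versus the 4 cores per CCX the guest appears to assume).

```python
def l3_layout(cores, threads_per_core, cores_per_ccx):
    """Return (number of L3 caches, vcpus sharing each L3 cache)."""
    if cores % cores_per_ccx:
        raise ValueError("cores must be a multiple of cores_per_ccx")
    return cores // cores_per_ccx, cores_per_ccx * threads_per_core


# Host topology as described in the report: 3 cores per CCX.
print("expected:", l3_layout(12, 2, 3))
# Layout the guest seems to see: 4 cores per CCX.
print("observed:", l3_layout(12, 2, 4))
```

With 3 cores per CCX this gives 4 caches of 6 vcpus each, while the 4-cores-per-CCX assumption gives the 3 caches of 8 vcpus reported by CoreInfo.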

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1856335

Title:
  Cache Layout wrong on many Zen Arch CPUs

Status in QEMU:
  New

Bug description:
  AMD CPUs have L3 cache per 2, 3 or 4 cores. Currently, TOPOEXT seems
  to always map cache as if it were a 4-core-per-CCX CPU, which is
  incorrect, and costs upwards of 30% performance (more realistically 10%)
  in L3 cache layout aware applications.

  Example on a 4-CCX CPU (1950X /w 8 Cores and no SMT):

    
  EPYC-IBPB
  AMD
  

  In windows, coreinfo reports correctly:

    Unified Cache 1, Level 3,8 MB, Assoc  16, LineSize  64
    Unified Cache 6, Level 3,8 MB, Assoc  16, LineSize  64

  On a 3-CCX CPU (3960X /w 6 cores and no SMT):

   
  EPYC-IBPB
  AMD
  

  in windows, coreinfo reports incorrectly:

  --  Unified Cache  1, Level 3,8 MB, Assoc  16, LineSize  64
  **  Unified Cache  6, Level 3,8 MB, Assoc  16, LineSize  64

  Validated against 3.0, 3.1, 4.1 and 4.2 versions of qemu-kvm.

  With newer QEMU there is a fix (which does behave correctly) using the dies parameter:
   

  The problem is that the dies are exposed differently than how AMD does
  it natively: they are exposed to Windows as sockets, which means that
  if you are not a business user, you can't ever have a machine with
  more than two CCX (6 cores), as consumer versions of Windows only
  support two sockets. (Should this be reported as a separate bug?)


Re: [PATCH v4 3/3] block: make BlockConf.*_size properties 32-bit

On Wed, May 20, 2020 at 05:54:44PM +0200, Kevin Wolf wrote:
> Am 20.05.2020 um 10:06 hat Roman Kagan geschrieben:
> > Devices (virtio-blk, scsi, etc.) and the block layer are happy to use
> > 32-bit for logical_block_size, physical_block_size, and min_io_size.
> > However, the properties in BlockConf are defined as uint16_t limiting
> > the values to 32768.
> > 
> > This appears unnecessarily tight, and we've seen bigger block sizes handy
> > at times.
> > 
> > Make them 32 bit instead and lift the limitation up to 2 MiB which
> > appears to be good enough for everybody, and matches the qcow2 cluster
> > size limit.
> > 
> > As the values can now be fairly big and awkward to type, make the
> > property setter accept common size suffixes (k, m).
> > 
> > Also as the devices which use min_io_size (virtio-blk and scsi) pass its
> > value to the guest in units of logical blocks in a 16bit field, to
> > prevent its silent truncation add a corresponding check to
> > blkconf_blocksizes.
> > 
> > Signed-off-by: Roman Kagan 
> > ---
> > v3 -> v4:
> > - check min_io_size against truncation [Kevin]
> > 
> > v2 -> v3:
> > - mention qcow2 cluster size limit in the log and comment [Eric]
> > 
> > v1 -> v2:
> > - cap the property at 2 MiB [Eric]
> > - accept size suffixes
> > 
> >  include/hw/block/block.h |  8 
> >  include/hw/qdev-properties.h |  2 +-
> >  hw/block/block.c | 11 +++
> >  hw/core/qdev-properties.c| 34 --
> >  4 files changed, 40 insertions(+), 15 deletions(-)
> > 
> > diff --git a/include/hw/block/block.h b/include/hw/block/block.h
> > index 784953a237..2fa09aa0b1 100644
> > --- a/include/hw/block/block.h
> > +++ b/include/hw/block/block.h
> > @@ -18,9 +18,9 @@
> >  
> >  typedef struct BlockConf {
> >  BlockBackend *blk;
> > -uint16_t physical_block_size;
> > -uint16_t logical_block_size;
> > -uint16_t min_io_size;
> > +uint32_t physical_block_size;
> > +uint32_t logical_block_size;
> > +uint32_t min_io_size;
> >  uint32_t opt_io_size;
> >  int32_t bootindex;
> >  uint32_t discard_granularity;
> > @@ -51,7 +51,7 @@ static inline unsigned int get_physical_block_exp(BlockConf *conf)
> >_conf.logical_block_size),\
> >  DEFINE_PROP_BLOCKSIZE("physical_block_size", _state,\
> >_conf.physical_block_size),   \
> > -DEFINE_PROP_UINT16("min_io_size", _state, _conf.min_io_size, 0),\
> > +DEFINE_PROP_UINT32("min_io_size", _state, _conf.min_io_size, 0),\
> >  DEFINE_PROP_UINT32("opt_io_size", _state, _conf.opt_io_size, 0),\
> >  DEFINE_PROP_UINT32("discard_granularity", _state,   \
> > _conf.discard_granularity, -1),  \
> > diff --git a/include/hw/qdev-properties.h b/include/hw/qdev-properties.h
> > index f161604fb6..f9e0f8c041 100644
> > --- a/include/hw/qdev-properties.h
> > +++ b/include/hw/qdev-properties.h
> > @@ -197,7 +197,7 @@ extern const PropertyInfo qdev_prop_pcie_link_width;
> >  #define DEFINE_PROP_BIOS_CHS_TRANS(_n, _s, _f, _d) \
> >  DEFINE_PROP_SIGNED(_n, _s, _f, _d, qdev_prop_bios_chs_trans, int)
> >  #define DEFINE_PROP_BLOCKSIZE(_n, _s, _f) \
> > -DEFINE_PROP_UNSIGNED(_n, _s, _f, 0, qdev_prop_blocksize, uint16_t)
> > +DEFINE_PROP_UNSIGNED(_n, _s, _f, 0, qdev_prop_blocksize, uint32_t)
> >  #define DEFINE_PROP_PCI_HOST_DEVADDR(_n, _s, _f) \
> >  DEFINE_PROP(_n, _s, _f, qdev_prop_pci_host_devaddr, PCIHostDeviceAddress)
> >  #define DEFINE_PROP_OFF_AUTO_PCIBAR(_n, _s, _f, _d) \
> > diff --git a/hw/block/block.c b/hw/block/block.c
> > index 5f8ebff59c..cd95e7e38f 100644
> > --- a/hw/block/block.c
> > +++ b/hw/block/block.c
> > @@ -96,6 +96,17 @@ bool blkconf_blocksizes(BlockConf *conf, Error **errp)
> >  return false;
> >  }
> >  
> > +/*
> > + * all devices which support min_io_size (scsi and virtio-blk) expose it to
> > + * the guest as a uint16_t in units of logical blocks
> > + */
> > +if ((conf->min_io_size / conf->logical_block_size) > UINT16_MAX) {
> > +error_setg(errp,
> > +   "min_io_size must be no more than " stringify(UINT16_MAX)
> > +   " of logical_block_size");
> 
> I'm not a native speaker, but "no more than 65536 of
> logical_block_size" sounds odd to me.

Neither am I but I agree with the feeling.

> Maybe "65536 times the logical_block_size"?

Sounds better indeed, will do in the respin.
Or perhaps "no more than 65536 logical blocks"?

Thanks,
Roman.
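
For reference, the truncation check under discussion can be modelled in a few lines of Python. This is only a sketch of the logic in the quoted hunk, not QEMU code: the function name `check_min_io_size` is hypothetical, and the message uses one of the wordings proposed above. Note that UINT16_MAX is 65535, so that is the largest number of logical blocks the 16-bit guest-visible field can carry.

```python
UINT16_MAX = 0xFFFF  # 65535, the largest value a 16-bit field can hold


def check_min_io_size(min_io_size, logical_block_size):
    """Return an error string if min_io_size, expressed in logical blocks,
    would not fit in a 16-bit field; return None if it fits."""
    if min_io_size // logical_block_size > UINT16_MAX:
        return "min_io_size must be no more than %d logical blocks" % UINT16_MAX
    return None


print(check_min_io_size(4096, 512))
print(check_min_io_size(512 * 65536, 512))
```

The first call passes (8 logical blocks); the second is rejected because 65536 blocks would silently wrap to 0 in a 16-bit field.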

> 
> > +return false;
> > +}
> > +
> >  if (conf->opt_io_size % conf->logical_block_size) {
> >  error_setg(errp,
> > "opt_io_size must be a multple of logical_block_size");
> 
> Kevin
>

Re: [PATCH v7 03/12] tests/vm: pass args through to BaseVM's init

2020-05-20 Thread Alex Bennée



Robert Foley  writes:

A brief rationale wouldn't go amiss in the commit message, e.g. "We will
shortly need to pass more parameters to the class so let's just pass args
rather than growing the parameter list."

Otherwise:

Reviewed-by: Alex Bennée 
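
A stripped-down sketch of the pattern being reviewed, for readers skimming the thread: the constructor takes the parsed argparse namespace instead of one keyword per option. The class below is a toy stand-in, not the real BaseVM, and the three options are the only ones modelled; their argparse definitions here are assumptions made for the example.

```python
import argparse


class BaseVM:
    """Toy stand-in for the real class: it reads what it needs from args."""

    def __init__(self, args):
        self.debug = args.debug
        self._genisoimage = args.genisoimage
        self._args = []
        # Same condition as in the patch: only pass -smp for more than one job.
        if args.jobs and args.jobs > 1:
            self._args += ["-smp", "%d" % args.jobs]


parser = argparse.ArgumentParser()
parser.add_argument("--debug", action="store_true")
parser.add_argument("--jobs", type=int, default=1)
parser.add_argument("--genisoimage", default=None)

args = parser.parse_args(["--jobs", "4"])
vm = BaseVM(args)
print(vm._args)
```

Adding another option later then only touches the parser and the one place that reads it, not every call site of the constructor.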


> Signed-off-by: Robert Foley 
> ---
>  tests/vm/basevm.py | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/tests/vm/basevm.py b/tests/vm/basevm.py
> index a2d4054d72..fbefda0595 100644
> --- a/tests/vm/basevm.py
> +++ b/tests/vm/basevm.py
> @@ -61,9 +61,9 @@ class BaseVM(object):
>  # 4 is arbitrary, but greater than 2,
>  # since we found we need to wait more than twice as long.
>  tcg_ssh_timeout_multiplier = 4
> -def __init__(self, debug=False, vcpus=None, genisoimage=None):
> +def __init__(self, args):
>  self._guest = None
> -self._genisoimage = genisoimage
> +self._genisoimage = args.genisoimage
>  self._tmpdir = os.path.realpath(tempfile.mkdtemp(prefix="vm-test-",
>   suffix=".tmp",
>   dir="."))
> @@ -76,7 +76,7 @@ class BaseVM(object):
>  self._ssh_pub_key_file = os.path.join(self._tmpdir, "id_rsa.pub")
>  open(self._ssh_pub_key_file, "w").write(SSH_PUB_KEY)
>  
> -self.debug = debug
> +self.debug = args.debug
>  self._stderr = sys.stderr
>  self._devnull = open(os.devnull, "w")
>  if self.debug:
> @@ -90,8 +90,8 @@ class BaseVM(object):
> (",ipv6=no" if not self.ipv6 else ""),
>  "-device", "virtio-net-pci,netdev=vnet",
>  "-vnc", "127.0.0.1:0,to=20"]
> -if vcpus and vcpus > 1:
> -self._args += ["-smp", "%d" % vcpus]
> +if args.jobs and args.jobs > 1:
> +self._args += ["-smp", "%d" % args.jobs]
>  if kvm_available(self.arch):
>  self._args += ["-enable-kvm"]
>  else:
> @@ -438,8 +438,7 @@ def main(vmcls):
>  return 1
>  logging.basicConfig(level=(logging.DEBUG if args.debug
> else logging.WARN))
> -vm = vmcls(debug=args.debug, vcpus=args.jobs,
> -   genisoimage=args.genisoimage)
> +vm = vmcls(args)
>  if args.build_image:
>  if os.path.exists(args.image) and not args.force:
>  sys.stderr.writelines(["Image file exists: %s\n" % args.image,


-- 
Alex Bennée

Re: [PATCH v4 3/3] block: make BlockConf.*_size properties 32-bit

On Wed, May 20, 2020 at 11:04:44AM +0200, Philippe Mathieu-Daudé wrote:
> On 5/20/20 10:06 AM, Roman Kagan wrote:
> > Devices (virtio-blk, scsi, etc.) and the block layer are happy to use
> > 32-bit for logical_block_size, physical_block_size, and min_io_size.
> > However, the properties in BlockConf are defined as uint16_t limiting
> > the values to 32768.
> > 
> > This appears unnecessarily tight, and we've seen bigger block sizes handy
> > at times.
> > 
> > Make them 32 bit instead and lift the limitation up to 2 MiB which
> > appears to be good enough for everybody, and matches the qcow2 cluster
> > size limit.
> > 
> > As the values can now be fairly big and awkward to type, make the
> > property setter accept common size suffixes (k, m).
> > 
> > Also as the devices which use min_io_size (virtio-blk and scsi) pass its
> > value to the guest in units of logical blocks in a 16bit field, to
> > prevent its silent truncation add a corresponding check to
> > blkconf_blocksizes.
> > 
> > Signed-off-by: Roman Kagan 
> > ---
> > v3 -> v4:
> > - check min_io_size against truncation [Kevin]
> > 
> > v2 -> v3:
> > - mention qcow2 cluster size limit in the log and comment [Eric]
> > 
> > v1 -> v2:
> > - cap the property at 2 MiB [Eric]
> > - accept size suffixes
> > 
> >   include/hw/block/block.h |  8 
> >   include/hw/qdev-properties.h |  2 +-
> >   hw/block/block.c | 11 +++
> >   hw/core/qdev-properties.c| 34 --
> >   4 files changed, 40 insertions(+), 15 deletions(-)
> > 
> > diff --git a/include/hw/block/block.h b/include/hw/block/block.h
> > index 784953a237..2fa09aa0b1 100644
> > --- a/include/hw/block/block.h
> > +++ b/include/hw/block/block.h
> > @@ -18,9 +18,9 @@
> >   typedef struct BlockConf {
> >   BlockBackend *blk;
> > -uint16_t physical_block_size;
> > -uint16_t logical_block_size;
> > -uint16_t min_io_size;
> > +uint32_t physical_block_size;
> > +uint32_t logical_block_size;
> > +uint32_t min_io_size;
> >   uint32_t opt_io_size;
> >   int32_t bootindex;
> >   uint32_t discard_granularity;
> > @@ -51,7 +51,7 @@ static inline unsigned int get_physical_block_exp(BlockConf *conf)
> > _conf.logical_block_size),\
> >   DEFINE_PROP_BLOCKSIZE("physical_block_size", _state,\
> > _conf.physical_block_size),   \
> > -DEFINE_PROP_UINT16("min_io_size", _state, _conf.min_io_size, 0),\
> > +DEFINE_PROP_UINT32("min_io_size", _state, _conf.min_io_size, 0),\
> >   DEFINE_PROP_UINT32("opt_io_size", _state, _conf.opt_io_size, 0),\
> >   DEFINE_PROP_UINT32("discard_granularity", _state,   \
> >  _conf.discard_granularity, -1),  \
> > diff --git a/include/hw/qdev-properties.h b/include/hw/qdev-properties.h
> > index f161604fb6..f9e0f8c041 100644
> > --- a/include/hw/qdev-properties.h
> > +++ b/include/hw/qdev-properties.h
> > @@ -197,7 +197,7 @@ extern const PropertyInfo qdev_prop_pcie_link_width;
> >   #define DEFINE_PROP_BIOS_CHS_TRANS(_n, _s, _f, _d) \
> >   DEFINE_PROP_SIGNED(_n, _s, _f, _d, qdev_prop_bios_chs_trans, int)
> >   #define DEFINE_PROP_BLOCKSIZE(_n, _s, _f) \
> > -DEFINE_PROP_UNSIGNED(_n, _s, _f, 0, qdev_prop_blocksize, uint16_t)
> > +DEFINE_PROP_UNSIGNED(_n, _s, _f, 0, qdev_prop_blocksize, uint32_t)
> >   #define DEFINE_PROP_PCI_HOST_DEVADDR(_n, _s, _f) \
> >   DEFINE_PROP(_n, _s, _f, qdev_prop_pci_host_devaddr, PCIHostDeviceAddress)
> >   #define DEFINE_PROP_OFF_AUTO_PCIBAR(_n, _s, _f, _d) \
> > diff --git a/hw/block/block.c b/hw/block/block.c
> > index 5f8ebff59c..cd95e7e38f 100644
> > --- a/hw/block/block.c
> > +++ b/hw/block/block.c
> > @@ -96,6 +96,17 @@ bool blkconf_blocksizes(BlockConf *conf, Error **errp)
> >   return false;
> >   }
> > +/*
> > + * all devices which support min_io_size (scsi and virtio-blk) expose it to
> > + * the guest as a uint16_t in units of logical blocks
> > + */
> > +if ((conf->min_io_size / conf->logical_block_size) > UINT16_MAX) {
> > +error_setg(errp,
> > +   "min_io_size must be no more than " stringify(UINT16_MAX)
> > +   " of logical_block_size");
> > +return false;
> > +}
> > +
> >   if (conf->opt_io_size % conf->logical_block_size) {
> >   error_setg(errp,
> >  "opt_io_size must be a multple of logical_block_size");
> > diff --git a/hw/core/qdev-properties.c b/hw/core/qdev-properties.c
> > index cc924815da..fd03cc7597 100644
> > --- a/hw/core/qdev-properties.c
> > +++ b/hw/core/qdev-properties.c
> > @@ -14,6 +14,7 @@
> >   #include "qapi/visitor.h"
> >   #include "chardev/char.h"
> >   #include "qemu/uuid.h"
> > +#include "qemu/units.h"
> >   void qdev_prop_set_after_realize(DeviceState *dev, const char *name,
>

Re: [PULL 2/6] qemu_img: add cvtnum_full to print error reports


On 5/18/20 11:32 AM, Eric Blake wrote:

From: Eyal Moscovici 

All calls to cvtnum check the return value and print the same error
message more or less. And so error reporting moved to cvtnum_full to
reduce code duplication and provide a single error
message. Additionally, cvtnum now wraps cvtnum_full with the existing
default range of 0 to MAX_INT64.

Acked-by: Mark Kanda 
Signed-off-by: Eyal Moscovici 
Message-Id: <20200513133629.18508-2-eyal.moscov...@oracle.com>
Reviewed-by: Eric Blake 
[eblake: fix printf formatting, avoid trailing space, change error wording,
reformat commit message]
Signed-off-by: Eric Blake 
---



@@ -572,16 +584,8 @@ static int img_create(int argc, char **argv)
  if (optind < argc) {
  int64_t sval;

-sval = cvtnum(argv[optind++]);
+sval = cvtnum("image size", argv[optind++]);
  if (sval < 0) {
-if (sval == -ERANGE) {
-error_report("Image size must be less than 8 EiB!");


This change broke iotest 178:

--- /home/eblake/qemu/tests/qemu-iotests/178.out.qcow2	2020-05-20 16:33:20.065710365 -0500
+++ /home/eblake/qemu/build/tests/qemu-iotests/178.out.bad	2020-05-20 16:35:28.924512423 -0500

@@ -13,7 +13,7 @@
 qemu-img: Invalid parameter 'snapshot.foo'
 qemu-img: Failed in parsing snapshot param 'snapshot.foo'
 qemu-img: --output must be used with human or json as argument.
-qemu-img: Image size must be less than 8 EiB!
+qemu-img: Invalid image size specified. Must be between 0 and 9223372036854775807.

 qemu-img: Unknown file format 'foo'

 == Size calculation for a new file (human) ==

I'll post a followup patch shortly.
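
To make the shape of the change easier to follow, here is a simplified Python model of the wrapper arrangement the commit message describes: one range-checking function that produces the single error message, and a thin wrapper using the default range. It is a sketch under assumptions, not the C implementation: the suffix table is a simplification of QEMU's size parsing, and raising ValueError stands in for error_report plus a negative return value. The message text is taken from the 178 output above.

```python
INT64_MAX = 2**63 - 1

# Simplified suffix handling (assumption): binary multiples only.
SUFFIXES = {"k": 1024, "M": 1024**2, "G": 1024**3,
            "T": 1024**4, "P": 1024**5, "E": 1024**6}


def cvtnum_full(name, value, lo, hi):
    """Parse value, enforce lo <= number <= hi, and raise ValueError with
    one uniform message on any failure."""
    message = "Invalid %s specified. Must be between %d and %d." % (name, lo, hi)
    scale, digits = 1, value
    if value and value[-1] in SUFFIXES:
        scale, digits = SUFFIXES[value[-1]], value[:-1]
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(message)
    number = int(digits) * scale
    if number < lo or number > hi:
        raise ValueError(message)
    return number


def cvtnum(name, value):
    """Wrapper with the default range of 0 to INT64_MAX."""
    return cvtnum_full(name, value, 0, INT64_MAX)


print(cvtnum("image size", "1k"))
try:
    cvtnum("image size", "8E")
except ValueError as err:
    print("qemu-img: %s" % err)
```

Because every caller now goes through the same function, a wording change like the one that broke test 178 shows up in exactly one place.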

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: Emulating Solaris 10 on SPARC64 sun4u

2020-05-20 Thread Mike Russo

> Using the proprietary firmware for this would be ideal. It would also
> provide reliable access to the kernel debugger which would be extremely
> helpful for diagnosing what's going wrong with the console. I'm not
> sure how I would go about making progress on this though. I know there
> are binaries of the BIOS for Sun4m machines floating around but I'm not
> aware of any for Sun4u machines.
> 

I haven't been able to find any of this firmware either. Not sure if
this helps but someone says they've got the Ultra 1 firmware (along with
cgsix and cgthree) available here:
https://people.csail.mit.edu/fredette/tme/sun-u1-nbsd.html

Re: [PATCH v4 2/3] block: consolidate blocksize properties consistency checks

On Wed, May 20, 2020 at 10:57:00AM +0200, Philippe Mathieu-Daudé wrote:
> On 5/20/20 10:06 AM, Roman Kagan wrote:
> > Several block device properties related to blocksize configuration must
> > be in certain relationship WRT each other: physical block must be no
> > smaller than logical block; min_io_size, opt_io_size, and
> > discard_granularity must be a multiple of a logical block.
> > 
> > To ensure these requirements are met, add corresponding consistency
> > checks to blkconf_blocksizes, adjusting its signature to communicate
> > possible error to the caller.  Also remove the now redundant consistency
> > checks from the specific devices.
> > 
> > Signed-off-by: Roman Kagan 
> > ---
> > v4: new patch
> > 
> >   include/hw/block/block.h   |  2 +-
> >   hw/block/block.c   | 29 -
> >   hw/block/fdc.c |  5 -
> >   hw/block/nvme.c|  5 -
> >   hw/block/virtio-blk.c  |  7 +--
> >   hw/ide/qdev.c  |  5 -
> >   hw/scsi/scsi-disk.c| 10 +++---
> >   hw/usb/dev-storage.c   |  5 -
> >   tests/qemu-iotests/172.out |  2 +-
> >   9 files changed, 50 insertions(+), 20 deletions(-)
> > 
> > diff --git a/include/hw/block/block.h b/include/hw/block/block.h
> > index d7246f3862..784953a237 100644
> > --- a/include/hw/block/block.h
> > +++ b/include/hw/block/block.h
> > @@ -87,7 +87,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
> >   bool blkconf_geometry(BlockConf *conf, int *trans,
> > unsigned cyls_max, unsigned heads_max, unsigned secs_max,
> > Error **errp);
> > -void blkconf_blocksizes(BlockConf *conf);
> > +bool blkconf_blocksizes(BlockConf *conf, Error **errp);
> >   bool blkconf_apply_backend_options(BlockConf *conf, bool readonly,
> >  bool resizable, Error **errp);
> > diff --git a/hw/block/block.c b/hw/block/block.c
> > index bf56c7612b..5f8ebff59c 100644
> > --- a/hw/block/block.c
> > +++ b/hw/block/block.c
> > @@ -61,7 +61,7 @@ bool blk_check_size_and_read_all(BlockBackend *blk, void *buf, hwaddr size,
> >   return true;
> >   }
> > -void blkconf_blocksizes(BlockConf *conf)
> > +bool blkconf_blocksizes(BlockConf *conf, Error **errp)
> >   {
> >   BlockBackend *blk = conf->blk;
> >   BlockSizes blocksizes;
> > @@ -83,6 +83,33 @@ void blkconf_blocksizes(BlockConf *conf)
> >   conf->logical_block_size = BDRV_SECTOR_SIZE;
> >   }
> >   }
> > +
> > +if (conf->logical_block_size > conf->physical_block_size) {
> > +error_setg(errp,
> > +   "logical_block_size > physical_block_size not supported");
> 
> "not supported" or "invalid"?
> 
> > +return false;
> > +}
> > +
> > +if (conf->min_io_size % conf->logical_block_size) {
> 
> It seems the block code usually does:
> 
>if (!QEMU_IS_ALIGNED(conf->min_io_size, conf->logical_block_size)) {
> 
> > +error_setg(errp,
> > +   "min_io_size must be a multple of logical_block_size");
> 
> Typo "multple" -> "multiple".
> 
> > +return false;
> > +}
> > +
> > +if (conf->opt_io_size % conf->logical_block_size) {
> > +error_setg(errp,
> > +   "opt_io_size must be a multple of logical_block_size");
> 
> Ditto.
> 
> > +return false;
> > +}
> > +
> > +if (conf->discard_granularity != -1 &&
> > +conf->discard_granularity % conf->logical_block_size) {
> > +error_setg(errp, "discard_granularity must be "
> > +   "a multple of logical_block_size");
> 
> Again.
> 
> > +return false;
> > +}
> > +
> > +return true;
> 
> Usually we return true for error, isn't it?

I just followed the convention of all other functions with error
handling in this file.
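
The consistency rules being consolidated in this patch can be summarised in a short Python sketch. It mirrors the order of the checks in the quoted hunk, with the alignment test written the way the QEMU_IS_ALIGNED suggestion above would read. The function names are hypothetical, -1 models the "unset" discard_granularity default from the hunk, and returning an error string (or None on success) stands in for the bool-plus-errp convention.

```python
def is_aligned(value, alignment):
    """True if value is a multiple of alignment."""
    return value % alignment == 0


def check_blocksizes(logical, physical, min_io, opt_io, discard_granularity=-1):
    """Return None when the sizes are consistent, else an error string."""
    if logical > physical:
        return "logical_block_size > physical_block_size not supported"
    if not is_aligned(min_io, logical):
        return "min_io_size must be a multiple of logical_block_size"
    if not is_aligned(opt_io, logical):
        return "opt_io_size must be a multiple of logical_block_size"
    if discard_granularity != -1 and not is_aligned(discard_granularity, logical):
        return "discard_granularity must be a multiple of logical_block_size"
    return None


print(check_blocksizes(512, 4096, 4096, 0))
print(check_blocksizes(4096, 512, 0, 0))
print(check_blocksizes(512, 512, 768, 0))
```

A value of 0 for min_io_size or opt_io_size passes, since 0 is a multiple of any block size.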

> >   }
> >   bool blkconf_apply_backend_options(BlockConf *conf, bool readonly,
> > diff --git a/hw/block/fdc.c b/hw/block/fdc.c
> > index c5fb9d6ece..8eda572ef4 100644
> > --- a/hw/block/fdc.c
> > +++ b/hw/block/fdc.c
> > @@ -554,7 +554,10 @@ static void floppy_drive_realize(DeviceState *qdev, Error **errp)
> >   read_only = !blk_bs(dev->conf.blk) || blk_is_read_only(dev->conf.blk);
> >   }
> > -blkconf_blocksizes(&dev->conf);
> > +if (!blkconf_blocksizes(&dev->conf, errp)) {
> > +return;
> > +}
> > +
> >   if (dev->conf.logical_block_size != 512 ||
> >   dev->conf.physical_block_size != 512)
> >   {
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 2f3100e56c..672650e162 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1390,7 +1390,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> >   host_memory_backend_set_mapped(n->pmrdev, true);
> >   }
> > -blkconf_blocksizes(&n->conf);
> > +if (!blkconf_blocksizes(&n->conf, errp)) {
> > +return;
> > +}
> > +
> >   if

Re: [PATCH v5 4/7] dwc-hsotg (dwc2) USB host controller emulation

On Wed, May 20, 2020 at 6:18 AM Peter Maydell wrote:

> On Wed, 20 May 2020 at 06:49, Paul Zimmerman  wrote:
> > Is there a tree somewhere that has a working example of a
> > three-phase reset? I did a 'git grep' on the master branch and didn't
> > find any code that is actually using it. I tried to implement it from
> > the example in reset.rst, but I'm getting a segfault on the first line in
> > resettable_class_set_parent_phases() that I'm having trouble figuring
> > out.
>
> Hmm, I thought we'd committed a change of a device to use the new
> mechanism along with the actual implementation but I can't see it
> now. Damien, what's the status with getting Xilinx devices to use the
> 3-phase reset API?
>
>
Never mind, I found the problem, I wasn't initializing my class properly.
It's working now, I'll send along a new patch series shortly.

Thanks,
Paul

thanks
> -- PMM
>

Re: [PATCH v4 1/3] virtio-blk: store opt_io_size with correct size

On Wed, May 20, 2020 at 06:44:44AM -0400, Michael S. Tsirkin wrote:
> On Wed, May 20, 2020 at 11:06:55AM +0300, Roman Kagan wrote:
> > The width of opt_io_size in virtio_blk_topology is 32bit.
> > 
> > Use the appropriate accessor to store it.
> > 
> > Signed-off-by: Roman Kagan 
> 
> 
> Thanks for the patch!
> Could you add a bit of analysis - when does this cause
> bugs? I'm guessing on BE systems with legacy virtio, right?

I guess so too.  It was found just by eye inspection, trying to figure
out the potential truncation of opt_io_size in virtio-blk and why it's
different from scsi.  I don't have any analysis to add :(

> Also, should we convert virtio_stw_p and friends to get the
> pointer to the correct value type, as opposed to void *?

I dunno.  I guess they were designed to be used with untyped buffers and
modeled after virtio_{st,ld}*_phys.  The same question applies to the
underlying {st,ld}_{b,l}e_p.

> This will catch bugs like this ...

I'll try and see if this change doesn't cause too much churn / pain.
But I suggest to decouple it from the simple patch at hand.

Thanks,
Roman.
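
The guess above about when the 16-bit store misbehaves can be illustrated with a small Python model. This is a sketch of the byte-level effect only, assuming a zero-initialised 32-bit field and a 16-bit store at the start of that field; the function name is hypothetical and nothing here is taken from the QEMU accessors themselves.

```python
import struct


def store16_into_32bit_field(value, byteorder):
    """Write the low 16 bits of value at the start of a zeroed 4-byte field,
    then read the whole field back as a 32-bit integer."""
    fmt16 = "<H" if byteorder == "little" else ">H"
    fmt32 = "<I" if byteorder == "little" else ">I"
    field = bytearray(4)
    field[0:2] = struct.pack(fmt16, value & 0xFFFF)
    return struct.unpack(fmt32, bytes(field))[0]


# Small value: looks fine on little-endian, wrong on big-endian.
print(store16_into_32bit_field(8, "little"))
print(store16_into_32bit_field(8, "big"))
# Value wider than 16 bits: truncated even on little-endian.
print(hex(store16_into_32bit_field(0x12345, "little")))
```

On a little-endian layout small values happen to read back correctly, which is consistent with the problem going unnoticed; on a big-endian layout the value lands in the upper half of the field, and values above 65535 are truncated in either case.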

> > ---
> > v4: new patch
> > 
> >  hw/block/virtio-blk.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> > index f5f6fc925e..413083e62f 100644
> > --- a/hw/block/virtio-blk.c
> > +++ b/hw/block/virtio-blk.c
> > @@ -918,7 +918,7 @@ static void virtio_blk_update_config(VirtIODevice 
> > *vdev, uint8_t *config)
> >  virtio_stw_p(vdev, &blkcfg.geometry.cylinders, conf->cyls);
> >  virtio_stl_p(vdev, &blkcfg.blk_size, blk_size);
> >  virtio_stw_p(vdev, &blkcfg.min_io_size, conf->min_io_size / blk_size);
> > -virtio_stw_p(vdev, &blkcfg.opt_io_size, conf->opt_io_size / blk_size);
> > +virtio_stl_p(vdev, &blkcfg.opt_io_size, conf->opt_io_size / blk_size);
> >  blkcfg.geometry.heads = conf->heads;
> >  /*
> >   * We must ensure that the block device capacity is a multiple of
> > -- 
> > 2.26.2
>

[PATCH v2 4/4] migration: Add yank feature

Register yank functions on sockets to shut them down.

Signed-off-by: Lukas Straub 
---
 migration/migration.c |  9 +
 migration/qemu-file-channel.c |  6 ++
 migration/socket.c| 11 +++
 3 files changed, 26 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 187ac0410c..f89fcba198 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -54,6 +54,7 @@
 #include "net/announce.h"
 #include "qemu/queue.h"
 #include "multifd.h"
+#include "yank.h"

 #define MAX_THROTTLE  (32 << 20)  /* Migration transfer speed throttling */

@@ -231,6 +232,8 @@ void migration_incoming_state_destroy(void)
 qapi_free_SocketAddressList(mis->socket_address_list);
 mis->socket_address_list = NULL;
 }
+
+yank_unregister_instance((char *) "migration");
 }

 static void migrate_generate_event(int new_state)
@@ -362,6 +365,7 @@ void qemu_start_incoming_migration(const char *uri, Error 
**errp)
 const char *p;

 qapi_event_send_migration(MIGRATION_STATUS_SETUP);
+yank_register_instance((char *) "migration");
 if (!strcmp(uri, "defer")) {
 deferred_incoming_migration(errp);
 } else if (strstart(uri, "tcp:", &p)) {
@@ -377,6 +381,7 @@ void qemu_start_incoming_migration(const char *uri, Error 
**errp)
 } else if (strstart(uri, "fd:", &p)) {
 fd_start_incoming_migration(p, errp);
 } else {
+yank_unregister_instance((char *) "migration");
 error_setg(errp, "unknown migration protocol: %s", uri);
 }
 }
@@ -1632,6 +1637,7 @@ static void migrate_fd_cleanup(MigrationState *s)
 }
 notifier_list_notify(&migration_state_notifiers, s);
 block_cleanup_parameters(s);
+yank_unregister_instance((char *) "migration");
 }

 static void migrate_fd_cleanup_schedule(MigrationState *s)
@@ -2036,6 +2042,7 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
 return;
 }

+yank_register_instance((char *) "migration");
 if (strstart(uri, "tcp:", &p)) {
 tcp_start_outgoing_migration(s, p, &local_err);
 #ifdef CONFIG_RDMA
@@ -2049,6 +2056,7 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
 } else if (strstart(uri, "fd:", &p)) {
 fd_start_outgoing_migration(s, p, &local_err);
 } else {
+yank_unregister_instance((char *) "migration");
 error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "uri",
"a valid migration protocol");
 migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
@@ -2058,6 +2066,7 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
 }

 if (local_err) {
+yank_unregister_instance((char *) "migration");
 migrate_fd_error(s, local_err);
 error_propagate(errp, local_err);
 return;
diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
index d2ce32f4b9..6224bda029 100644
--- a/migration/qemu-file-channel.c
+++ b/migration/qemu-file-channel.c
@@ -27,6 +27,7 @@
 #include "qemu-file.h"
 #include "io/channel-socket.h"
 #include "qemu/iov.h"
+#include "yank.h"


 static ssize_t channel_writev_buffer(void *opaque,
@@ -104,6 +105,11 @@ static int channel_close(void *opaque, Error **errp)
 int ret;
 QIOChannel *ioc = QIO_CHANNEL(opaque);
 ret = qio_channel_close(ioc, errp);
+if (object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_SOCKET)
+&& OBJECT(ioc)->ref == 2) {
+yank_unregister_function((char *) "migration", yank_generic_iochannel,
+ QIO_CHANNEL(ioc));
+}
 object_unref(OBJECT(ioc));
 return ret;
 }
diff --git a/migration/socket.c b/migration/socket.c
index 97c9efde59..bbca53cc49 100644
--- a/migration/socket.c
+++ b/migration/socket.c
@@ -26,6 +26,7 @@
 #include "io/channel-socket.h"
 #include "io/net-listener.h"
 #include "trace.h"
+#include "yank.h"


 struct SocketOutgoingArgs {
@@ -35,6 +36,8 @@ struct SocketOutgoingArgs {
 void socket_send_channel_create(QIOTaskFunc f, void *data)
 {
 QIOChannelSocket *sioc = qio_channel_socket_new();
+yank_register_function((char *) "migration", yank_generic_iochannel,
+   QIO_CHANNEL(sioc));
 qio_channel_socket_connect_async(sioc, outgoing_args.saddr,
  f, data, NULL, NULL);
 }
@@ -42,6 +45,8 @@ void socket_send_channel_create(QIOTaskFunc f, void *data)
 int socket_send_channel_destroy(QIOChannel *send)
 {
 /* Remove channel */
+yank_unregister_function((char *) "migration", yank_generic_iochannel,
+ QIO_CHANNEL(send));
 object_unref(OBJECT(send));
 if (outgoing_args.saddr) {
 qapi_free_SocketAddress(outgoing_args.saddr);
@@ -101,6 +106,8 @@ static void socket_outgoing_migration(QIOTask *task,
 Error *err = NULL;

 if (qio_task_propagate_error(task, &err)) {
+yank_unregister_function((char *) "migration", yank_generic_iochannel,
+ QIO_CHANNEL(sioc));

[PATCH v2 2/4] block/nbd.c: Add yank feature

Register a yank function which shuts down the socket and sets
s->state = NBD_CLIENT_QUIT. This is the same behaviour as if an
error occurred.

Signed-off-by: Lukas Straub 
---
 Makefile.objs |   1 +
 block/nbd.c   | 101 --
 2 files changed, 65 insertions(+), 37 deletions(-)

diff --git a/Makefile.objs b/Makefile.objs
index a7c967633a..8e403b81f3 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -18,6 +18,7 @@ block-obj-y += block.o blockjob.o job.o
 block-obj-y += block/ scsi/
 block-obj-y += qemu-io-cmds.o
 block-obj-$(CONFIG_REPLICATION) += replication.o
+block-obj-y += yank.o

 block-obj-m = block/

diff --git a/block/nbd.c b/block/nbd.c
index 2160859f64..3a41749f1b 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -35,6 +35,7 @@
 #include "qemu/option.h"
 #include "qemu/cutils.h"
 #include "qemu/main-loop.h"
+#include "qemu/atomic.h"

 #include "qapi/qapi-visit-sockets.h"
 #include "qapi/qmp/qstring.h"
@@ -43,6 +44,8 @@
 #include "block/nbd.h"
 #include "block/block_int.h"

+#include "yank.h"
+
 #define EN_OPTSTR ":exportname="
 #define MAX_NBD_REQUESTS16

@@ -84,6 +87,8 @@ typedef struct BDRVNBDState {
 NBDReply reply;
 BlockDriverState *bs;

+char *yank_name;
+
 /* Connection parameters */
 uint32_t reconnect_delay;
 SocketAddress *saddr;
@@ -94,6 +99,7 @@ typedef struct BDRVNBDState {
 } BDRVNBDState;

 static int nbd_client_connect(BlockDriverState *bs, Error **errp);
+static void nbd_yank(void *opaque);

 static void nbd_clear_bdrvstate(BDRVNBDState *s)
 {
@@ -106,17 +112,19 @@ static void nbd_clear_bdrvstate(BDRVNBDState *s)
 s->tlscredsid = NULL;
 g_free(s->x_dirty_bitmap);
 s->x_dirty_bitmap = NULL;
+g_free(s->yank_name);
+s->yank_name = NULL;
 }

 static void nbd_channel_error(BDRVNBDState *s, int ret)
 {
 if (ret == -EIO) {
-if (s->state == NBD_CLIENT_CONNECTED) {
+if (atomic_read(&s->state) == NBD_CLIENT_CONNECTED) {
 s->state = s->reconnect_delay ? NBD_CLIENT_CONNECTING_WAIT :
 NBD_CLIENT_CONNECTING_NOWAIT;
 }
 } else {
-if (s->state == NBD_CLIENT_CONNECTED) {
+if (atomic_read(&s->state) == NBD_CLIENT_CONNECTED) {
 qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
 }
 s->state = NBD_CLIENT_QUIT;
@@ -167,7 +175,7 @@ static void nbd_client_attach_aio_context(BlockDriverState 
*bs,
  * s->connection_co is either yielded from nbd_receive_reply or from
  * nbd_co_reconnect_loop()
  */
-if (s->state == NBD_CLIENT_CONNECTED) {
+if (atomic_read(&s->state) == NBD_CLIENT_CONNECTED) {
 qio_channel_attach_aio_context(QIO_CHANNEL(s->ioc), new_context);
 }

@@ -206,7 +214,7 @@ static void nbd_teardown_connection(BlockDriverState *bs)
 {
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;

-if (s->state == NBD_CLIENT_CONNECTED) {
+if (atomic_read(&s->state) == NBD_CLIENT_CONNECTED) {
 /* finish any pending coroutines */
 assert(s->ioc);
 qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
@@ -230,13 +238,14 @@ static void nbd_teardown_connection(BlockDriverState *bs)

 static bool nbd_client_connecting(BDRVNBDState *s)
 {
-return s->state == NBD_CLIENT_CONNECTING_WAIT ||
-s->state == NBD_CLIENT_CONNECTING_NOWAIT;
+NBDClientState state = atomic_read(&s->state);
+return state == NBD_CLIENT_CONNECTING_WAIT ||
+state == NBD_CLIENT_CONNECTING_NOWAIT;
 }

 static bool nbd_client_connecting_wait(BDRVNBDState *s)
 {
-return s->state == NBD_CLIENT_CONNECTING_WAIT;
+return atomic_read(&s->state) == NBD_CLIENT_CONNECTING_WAIT;
 }

 static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
@@ -274,6 +283,7 @@ static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState 
*s)
 /* Finalize previous connection if any */
 if (s->ioc) {
 nbd_client_detach_aio_context(s->bs);
+yank_unregister_function(s->yank_name, nbd_yank, s->bs);
 object_unref(OBJECT(s->sioc));
 s->sioc = NULL;
 object_unref(OBJECT(s->ioc));
@@ -305,7 +315,7 @@ static coroutine_fn void nbd_co_reconnect_loop(BDRVNBDState 
*s)
 nbd_reconnect_attempt(s);

 while (nbd_client_connecting(s)) {
-if (s->state == NBD_CLIENT_CONNECTING_WAIT &&
+if (atomic_read(&s->state) == NBD_CLIENT_CONNECTING_WAIT &&
 qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start_time_ns > delay_ns)
 {
 s->state = NBD_CLIENT_CONNECTING_NOWAIT;
@@ -341,7 +351,7 @@ static coroutine_fn void nbd_connection_entry(void *opaque)
 int ret = 0;
 Error *local_err = NULL;

-while (s->state != NBD_CLIENT_QUIT) {
+while (atomic_read(&s->state) != NBD_CLIENT_QUIT) {
 /*
  * The NBD client can only really be considered idle when it has
  * yielded from qio_channel_readv_all_eof(), waiting for data. This is
@@ -356,7 +366,7 @@ static coroutine_fn

[PATCH v2 3/4] chardev/char-socket.c: Add yank feature

Register a yank function to shut down the socket on yank.

Signed-off-by: Lukas Straub 
---
 chardev/char-socket.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/chardev/char-socket.c b/chardev/char-socket.c
index 185fe38dda..d5c6cd2153 100644
--- a/chardev/char-socket.c
+++ b/chardev/char-socket.c
@@ -34,6 +34,7 @@
 #include "qapi/error.h"
 #include "qapi/clone-visitor.h"
 #include "qapi/qapi-visit-sockets.h"
+#include "yank.h"

 #include "chardev/char-io.h"

@@ -69,6 +70,7 @@ typedef struct {
 size_t read_msgfds_num;
 int *write_msgfds;
 size_t write_msgfds_num;
+char *yank_name;

 SocketAddress *addr;
 bool is_listen;
@@ -409,6 +411,11 @@ static void tcp_chr_free_connection(Chardev *chr)

 tcp_set_msgfds(chr, NULL, 0);
 remove_fd_in_watch(chr);
+if (s->state == TCP_CHARDEV_STATE_CONNECTING
+|| s->state == TCP_CHARDEV_STATE_CONNECTED) {
+yank_unregister_function(s->yank_name, yank_generic_iochannel,
+ QIO_CHANNEL(s->sioc));
+}
 object_unref(OBJECT(s->sioc));
 s->sioc = NULL;
 object_unref(OBJECT(s->ioc));
@@ -912,6 +919,8 @@ static int tcp_chr_add_client(Chardev *chr, int fd)
 }
 tcp_chr_change_state(s, TCP_CHARDEV_STATE_CONNECTING);
 tcp_chr_set_client_ioc_name(chr, sioc);
+yank_register_function(s->yank_name, yank_generic_iochannel,
+   QIO_CHANNEL(sioc));
 ret = tcp_chr_new_client(chr, sioc);
 object_unref(OBJECT(sioc));
 return ret;
@@ -926,6 +935,8 @@ static void tcp_chr_accept(QIONetListener *listener,

 tcp_chr_change_state(s, TCP_CHARDEV_STATE_CONNECTING);
 tcp_chr_set_client_ioc_name(chr, cioc);
+yank_register_function(s->yank_name, yank_generic_iochannel,
+   QIO_CHANNEL(cioc));
 tcp_chr_new_client(chr, cioc);
 }

@@ -941,6 +952,8 @@ static int tcp_chr_connect_client_sync(Chardev *chr, Error 
**errp)
 object_unref(OBJECT(sioc));
 return -1;
 }
+yank_register_function(s->yank_name, yank_generic_iochannel,
+   QIO_CHANNEL(sioc));
 tcp_chr_new_client(chr, sioc);
 object_unref(OBJECT(sioc));
 return 0;
@@ -956,6 +969,8 @@ static void tcp_chr_accept_server_sync(Chardev *chr)
 tcp_chr_change_state(s, TCP_CHARDEV_STATE_CONNECTING);
 sioc = qio_net_listener_wait_client(s->listener);
 tcp_chr_set_client_ioc_name(chr, sioc);
+yank_register_function(s->yank_name, yank_generic_iochannel,
+   QIO_CHANNEL(sioc));
 tcp_chr_new_client(chr, sioc);
 object_unref(OBJECT(sioc));
 }
@@ -1066,6 +1081,8 @@ static void char_socket_finalize(Object *obj)
 object_unref(OBJECT(s->tls_creds));
 }
 g_free(s->tls_authz);
+yank_unregister_instance(s->yank_name);
+g_free(s->yank_name);

 qemu_chr_be_event(chr, CHR_EVENT_CLOSED);
 }
@@ -1081,6 +1098,8 @@ static void qemu_chr_socket_connected(QIOTask *task, void 
*opaque)

 if (qio_task_propagate_error(task, &err)) {
 tcp_chr_change_state(s, TCP_CHARDEV_STATE_DISCONNECTED);
+yank_unregister_function(s->yank_name, yank_generic_iochannel,
+ QIO_CHANNEL(sioc));
 check_report_connect_error(chr, err);
 error_free(err);
 goto cleanup;
@@ -1115,6 +1134,8 @@ static void tcp_chr_connect_client_async(Chardev *chr)
 tcp_chr_change_state(s, TCP_CHARDEV_STATE_CONNECTING);
 sioc = qio_channel_socket_new();
 tcp_chr_set_client_ioc_name(chr, sioc);
+yank_register_function(s->yank_name, yank_generic_iochannel,
+   QIO_CHANNEL(sioc));
 /*
  * Normally code would use the qio_channel_socket_connect_async
  * method which uses a QIOTask + qio_task_set_error internally
@@ -1356,6 +1377,9 @@ static void qmp_chardev_open_socket(Chardev *chr,
 qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_FD_PASS);
 }

+s->yank_name = g_strconcat("chardev:", chr->label, NULL);
+yank_register_instance(s->yank_name);
+
 /* be isn't opened until we get a connection */
 *be_opened = false;

--
2.20.1




[PATCH v2 1/4] Introduce yank feature

The yank feature makes it possible to recover from a hanging qemu by
"yanking" at various parts. Other qemu subsystems can register an
instance for themselves along with multiple yank functions. All yank
functions of the selected instances can then be called with the 'yank'
out-of-band qmp command. Available instances can be queried with the
'query-yank' oob command.
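
The registration model described above can be sketched roughly as follows. This is a simplified illustration, not the code in yank.c: it uses a fixed-size array instead of QLIST, omits the mutex, and all names (SketchYankFn, sketch_register, sketch_yank, count_call) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: one opaque argument, no return value. */
typedef void (SketchYankFn)(void *opaque);

#define SKETCH_MAX_ENTRIES 16

struct sketch_entry {
    const char *instance;  /* e.g. "migration" or "chardev:<label>" */
    SketchYankFn *func;
    void *opaque;
};

static struct sketch_entry sketch_entries[SKETCH_MAX_ENTRIES];
static size_t sketch_n_entries;

/* Attach one callback to a named instance. No locking in this sketch. */
static void sketch_register(const char *instance, SketchYankFn *func,
                            void *opaque)
{
    assert(instance != NULL && func != NULL);
    assert(sketch_n_entries < SKETCH_MAX_ENTRIES);
    sketch_entries[sketch_n_entries++] =
        (struct sketch_entry){ instance, func, opaque };
}

/* Call every callback of the named instance; return how many ran. */
static int sketch_yank(const char *instance)
{
    int called = 0;

    assert(instance != NULL);
    for (size_t i = 0; i < sketch_n_entries; i++) {
        if (strcmp(sketch_entries[i].instance, instance) == 0) {
            sketch_entries[i].func(sketch_entries[i].opaque);
            called++;
        }
    }
    return called;
}

/* Example callback: count invocations through the opaque pointer. */
static void count_call(void *opaque)
{
    (*(int *)opaque)++;
}
```

An instance can carry several callbacks, which matches the case of a subsystem holding more than one socket; yanking an unknown name simply runs nothing in this sketch.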

Signed-off-by: Lukas Straub 
---
 qapi/misc.json |  45 +
 softmmu/vl.c   |   2 +
 yank.c | 174 +
 yank.h |  69 
 4 files changed, 290 insertions(+)
 create mode 100644 yank.c
 create mode 100644 yank.h

diff --git a/qapi/misc.json b/qapi/misc.json
index 99b90ac80b..f5228b2502 100644
--- a/qapi/misc.json
+++ b/qapi/misc.json
@@ -1550,3 +1550,48 @@
 ##
 { 'command': 'query-vm-generation-id', 'returns': 'GuidInfo' }

+##
+# @YankInstances:
+#
+# @instances: List of yank instances.
+#
+# Yank instances are named after the following schema:
+# "blockdev:", "chardev:" and "migration"
+#
+# Since: 5.1
+##
+{ 'struct': 'YankInstances', 'data': {'instances': ['str'] } }
+
+##
+# @yank:
+#
+# Recover from hanging qemu by yanking the specified instances.
+#
+# Takes @YankInstances as argument.
+#
+# Returns: nothing.
+#
+# Example:
+#
+# -> { "execute": "yank", "arguments": { "instances": ["blockdev:nbd0"] } }
+# <- { "return": {} }
+#
+# Since: 5.1
+##
+{ 'command': 'yank', 'data': 'YankInstances', 'allow-oob': true }
+
+##
+# @query-yank:
+#
+# Query yank instances.
+#
+# Returns: @YankInstances
+#
+# Example:
+#
+# -> { "execute": "query-yank" }
+# <- { "return": { "instances": ["blockdev:nbd0"] } }
+#
+# Since: 5.1
+##
+{ 'command': 'query-yank', 'returns': 'YankInstances', 'allow-oob': true }
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 32c0047889..5d99749d29 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -112,6 +112,7 @@
 #include "qapi/qmp/qerror.h"
 #include "sysemu/iothread.h"
 #include "qemu/guest-random.h"
+#include "yank.h"

 #define MAX_VIRTIO_CONSOLES 1

@@ -2906,6 +2907,7 @@ void qemu_init(int argc, char **argv, char **envp)
 precopy_infrastructure_init();
 postcopy_infrastructure_init();
 monitor_init_globals();
+yank_init();

 if (qcrypto_init() < 0) {
 error_reportf_err(err, "cannot initialize crypto: ");
diff --git a/yank.c b/yank.c
new file mode 100644
index 00..bfce19185e
--- /dev/null
+++ b/yank.c
@@ -0,0 +1,174 @@
+/*
+ * QEMU yank feature
+ *
+ * Copyright (c) Lukas Straub 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu/thread.h"
+#include "qemu/queue.h"
+#include "qapi/qapi-commands-misc.h"
+#include "io/channel.h"
+#include "yank.h"
+
+struct YankFuncAndParam {
+YankFn *func;
+void *opaque;
+QLIST_ENTRY(YankFuncAndParam) next;
+};
+
+struct YankInstance {
+char *name;
+QLIST_HEAD(, YankFuncAndParam) yankfns;
+QLIST_ENTRY(YankInstance) next;
+};
+
+static QemuMutex lock;
+static QLIST_HEAD(yankinst_list, YankInstance) head
+= QLIST_HEAD_INITIALIZER(head);
+
+static struct YankInstance *yank_find_instance(char *name)
+{
+struct YankInstance *tmp, *instance;
+instance = NULL;
+QLIST_FOREACH(tmp, &head, next) {
+if (!strcmp(tmp->name, name)) {
+instance = tmp;
+}
+}
+return instance;
+}
+
+void yank_register_instance(char *instance_name)
+{
+struct YankInstance *instance;
+
+qemu_mutex_lock(&lock);
+assert(!yank_find_instance(instance_name));
+
+instance = g_slice_new(struct YankInstance);
+instance->name = g_strdup(instance_name);
+QLIST_INIT(&instance->yankfns);
+QLIST_INSERT_HEAD(&head, instance, next);
+
+qemu_mutex_unlock(&lock);
+}
+
+void yank_unregister_instance(char *instance_name)
+{
+struct YankInstance *instance;
+
+qemu_mutex_lock(&lock);
+instance = yank_find_instance(instance_name);
+assert(instance);
+
+assert(QLIST_EMPTY(&instance->yankfns));
+QLIST_REMOVE(instance, next);
+g_free(instance->name);
+g_slice_free(struct YankInstance, instance);
+
+qemu_mutex_unlock(&lock);
+}
+
+void yank_register_function(char *instance_name, YankFn *func, void *opaque)
+{
+struct YankInstance *instance;
+struct YankFuncAndParam *entry;
+
+qemu_mutex_lock(&lock);
+instance = yank_find_instance(instance_name);
+assert(instance);
+
+entry = g_slice_new(struct YankFuncAndParam);
+entry->func = func;
+entry->opaque = opaque;
+
+QLIST_INSERT_HEAD(&instance->yankfns, entry, next);
+qemu_mutex_unlock(&lock);
+}
+
+void yank_unregister_function(char *instance_name, YankFn *func, void *opaque)
+{
+struct YankInstance *instance;
+struct YankFuncAndParam *entry;
+
+qemu_mutex_lock(&lock);
+instance = yank_find_instance(instance_name);
+assert(instance);
+
+QLIST_FOREACH(entry, &instance->yankfns, next) {
+if (entry->func == func && entry->opaque == opaque) {
+

[PATCH v2 0/4] Introduce 'yank' oob qmp command to recover from hanging qemu

Hello Everyone,
In many cases, if qemu has a network connection (qmp, migration, chardev, etc.)
to some other server and that server dies or hangs, qemu hangs too.
These patches introduce the new 'yank' out-of-band qmp command to recover from
these kinds of hangs. The different subsystems register callbacks which get
executed with the yank command. For example the callback can shutdown() a
socket. This is intended for the colo use-case, but it can be used for other
things too of course.
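
To illustrate why shutting down a socket is enough to get a stuck reader moving again, here is a small sketch that uses only POSIX calls on a local socket pair. The function name read_after_shutdown is hypothetical and nothing here is QEMU code. Without the shutdown() call, the read() below would block forever because no data ever arrives.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a connected socket pair, shut one end down and try to read
 * from it. Returns the result of read(), or -1 if the pair could not
 * be created.
 */
static ssize_t read_after_shutdown(void)
{
    int sv[2];
    char byte;
    ssize_t n;
    int rc;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }

    rc = shutdown(sv[0], SHUT_RDWR); /* what a yank callback would do */
    assert(rc == 0);

    n = read(sv[0], &byte, 1);       /* returns at once with end-of-file */

    close(sv[0]);
    close(sv[1]);
    return n;
}
```

The descriptor stays open after shutdown(), so the owning code can still notice the end-of-file or error and run its normal cleanup path.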

Regards,
Lukas Straub

v2:
 -don't touch io/ code anymore
 -always register yank functions
 -'yank' now takes a list of instances to yank
 -'query-yank' returns a list of yankable instances

Lukas Straub (4):
  Introduce yank feature
  block/nbd.c: Add yank feature
  chardev/char-socket.c: Add yank feature
  migration: Add yank feature

 Makefile.objs |   1 +
 block/nbd.c   | 101 
 chardev/char-socket.c |  24 +
 migration/migration.c |   9 ++
 migration/qemu-file-channel.c |   6 ++
 migration/socket.c|  11 +++
 qapi/misc.json|  45 +
 softmmu/vl.c  |   2 +
 yank.c| 174 ++
 yank.h|  69 ++
 10 files changed, 405 insertions(+), 37 deletions(-)
 create mode 100644 yank.c
 create mode 100644 yank.h

--
2.20.1



[PATCH v2 6/6] migration/migration.c: Fix hang in ram_save_host_page

migration_rate_limit will erroneously rate-limit a socket that has been
shut down, which causes the migration thread to hang in
ram_save_host_page.

Fix this by explicitly testing if the socket has errors or was
shutdown in migration_rate_limit.

Signed-off-by: Lukas Straub 
---
 migration/migration.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 187ac0410c..e8bd32d48c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3347,6 +3347,10 @@ bool migration_rate_limit(void)
 bool urgent = false;
 migration_update_counters(s, now);
 if (qemu_file_rate_limit(s->to_dst_file)) {
+
+if (qemu_file_get_error(s->to_dst_file)) {
+return false;
+}
 /*
  * Wait for a delay to do rate limiting OR
  * something urgent to post the semaphore.
--
2.20.1



[PATCH v2 5/6] migration/colo.c: Move colo_notify_compares_event to the right place

If the secondary has to failover during checkpointing, it still is
in the old state (i.e. different state than primary). Thus we can't
expose the primary state until after the checkpoint is sent.

This fixes sporadic connection reset of client connections during
failover.

Signed-off-by: Lukas Straub 
Reviewed-by: zhanghailiang 
---
 migration/colo.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/migration/colo.c b/migration/colo.c
index a69782efc5..a3fc21e86e 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -430,12 +430,6 @@ static int colo_do_checkpoint_transaction(MigrationState 
*s,
 goto out;
 }

-qemu_event_reset(&s->colo_checkpoint_event);
-colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT, &local_err);
-if (local_err) {
-goto out;
-}
-
 /* Disable block migration */
 migrate_set_block_enabled(false, &local_err);
 qemu_mutex_lock_iothread();
@@ -494,6 +488,12 @@ static int colo_do_checkpoint_transaction(MigrationState 
*s,
 goto out;
 }

+qemu_event_reset(&s->colo_checkpoint_event);
+colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT, &local_err);
+if (local_err) {
+goto out;
+}
+
 colo_receive_check_message(s->rp_state.from_dst_file,
 COLO_MESSAGE_VMSTATE_LOADED, &local_err);
 if (local_err) {
--
2.20.1




[PATCH v2 4/6] migration/colo.c: Relaunch failover even if there was an error