Re: [RFC v2 0/5] ARM Nested Virt Support

2024-02-12 Thread Marc Zyngier

On 2024-02-09 18:57, Peter Maydell wrote:

On Fri, 9 Feb 2024 at 16:00, Eric Auger  wrote:


This series adds ARM Nested Virtualization support in KVM mode.
This is a respin of previous contributions from Miguel [1] and Haibo [2].

This was tested with Marc's v11 [3] on Ampere HW with a Fedora L1 guest
and L2 guests booted without EDK2. However it does not work yet with
EDK2, but that looks unrelated to this qemu integration (host hard
lockups).


The host needs to be booted with the "kvm-arm.mode=nested" option and
qemu needs to be invoked with:

-machine virt,virtualization=on

There is a known issue with hosts supporting SVE. The kernel does not
support both SVE and NV2, and the current qemu integration has an issue
with the scratch_host_vcpu startup because both are enabled if exposed
by the kernel. This is independent of whether SVE is disabled on the
command line. Unfortunately I lost access to the HW that exposed the
issue, so I couldn't fix it in this version.


You can probably repro that by running the whole setup under
QEMU's FEAT_NV emulation, which will be able to give you a CPU
with both FEAT_NV and SVE.
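
Untested, but something along these lines should do it (the flags and
file names here are assumptions, not a verified recipe; 'max' under TCG
provides SVE, and virtualization=on exposes EL2):

```sh
# Hypothetical TCG repro: L0 is emulated, exposing a CPU with both
# FEAT_NV and SVE; the L1 kernel then runs KVM in nested mode.
qemu-system-aarch64 \
    -M virt,virtualization=on,gic-version=3 \
    -cpu max -smp 4 -m 4G -nographic \
    -kernel Image \
    -append 'kvm-arm.mode=nested console=ttyAMA0 root=/dev/vda' \
    -drive if=virtio,file=l1-rootfs.img,format=raw
```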

Personally I think that this is a kernel missing-feature that
should really be fixed as part of getting the kernel patches
upstreamed. There's no cause to force every userspace VMM to
develop extra complications for this.


I don't plan to make NV visible to userspace before this is fixed.
Which may delay KVM NV by another year or five, but I don't think
anyone is really waiting for it anyway.

M.
--
Jazz is not dead. It just smells funny...



Re: [PATCH v3 13/14] hw/arm: Prefer arm_feature(AARCH64) over object_property_find(aarch64)

2024-01-11 Thread Marc Zyngier
On Thu, 11 Jan 2024 09:39:18 +,
Philippe Mathieu-Daudé  wrote:
> 
> On 10/1/24 20:53, Philippe Mathieu-Daudé wrote:
> > The "aarch64" property is added to ARMCPU when the
> > ARM_FEATURE_AARCH64 feature is available. Rather than
> > checking whether the QOM property is present, directly
> > check the feature.
> > 
> > Suggested-by: Markus Armbruster 
> > Signed-off-by: Philippe Mathieu-Daudé 
> > ---
> >   hw/arm/virt.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > index 49ed5309ff..a43e87874c 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -2140,7 +2140,7 @@ static void machvirt_init(MachineState *machine)
> >           numa_cpu_pre_plug(&possible_cpus->cpus[cs->cpu_index],
> >                             DEVICE(cpuobj), &error_fatal);
> > 
> > -        aarch64 &= object_property_get_bool(cpuobj, "aarch64", NULL);
> > +        aarch64 &= arm_feature(cpu_env(cs), ARM_FEATURE_AARCH64);
> 
> So after this patch there are no more uses of the ARMCPU "aarch64"
> property from code. Still, it is exposed via the qom-tree, and thus
> it can be set (see aarch64_cpu_set_aarch64). I could understand
> flipping this feature to create a custom CPU (such as a big.LITTLE
> setup, as Marc mentioned on IRC), but I don't understand what the
> expected behavior is when it is flipped at runtime. Can that
> happen on real hardware (and how could the guest react to it...)?

I don't think it makes any sense to do that while a guest is running
(and no HW I'm aware of would do this). However, it all depends what
you consider "run time". You could imagine creating a skeletal VM with
all features, and then apply a bunch of changes before the guest
actually runs.

I don't know enough about the qom-tree and dynamic manipulation of
these properties though, and I'm likely to be wrong about the expected
usage model.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH V9 32/46] vfio-pci: cpr part 2 (msi)

2023-07-13 Thread Marc Zyngier
On Thu, 13 Jul 2023 13:35:57 +0100,
Kunkun Jiang  wrote:
> 
> For ARM, it will first send a DISCARD command to ITS and then
> establish the interrupt reporting channel for GICv3. The DISCARD
> will remove the pending interrupt. Interrupts that come before
> channel re-establishment are silently discarded.  Do you guys have
> any good ideas?

I'm missing the context, but if you're worried about interrupts that
are lost between the DISCARD and the MAPTI commands, the only way to
solve the problem is to inject a spurious interrupt after the MAPTI
has taken place.

If it hurts, don't do that.

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [RFC PATCH 2/5] hw/intc/gicv3: add support for setting KVM vGIC maintenance IRQ

2023-03-06 Thread Marc Zyngier
On Mon, 06 Mar 2023 14:02:33 +,
Peter Maydell  wrote:
> 
> On Mon, 27 Feb 2023 at 16:37, Miguel Luis  wrote:
> >
> > From: Haibo Xu 
> >
> > Use the VGIC maintenance IRQ if VHE is requested. As per the ARM GIC
> > Architecture Specification for GICv3 and GICv4, Arm strongly recommends
> > that maintenance interrupts are configured to use INTID 25, matching the
> > Server Base System Architecture (SBSA) recommendation.
> 
> What does this mean for QEMU, though? If the issue is
> "KVM doesn't support the maintenance interrupt being anything
> other than INTID 25" then we should say so (and have our code
> error out if the board tries to use some other value).

No, KVM doesn't give two hoots about the INTID, as long as this is a
PPI that is otherwise unused.

> If the
> issue is "the *host* has to be using the right INTID" then I
> would hope that KVM simply doesn't expose the capability if
> the host h/w won't let it work correctly.

No host maintenance interrupt, no NV. This is especially mandatory, as
the L1 guest is in (almost) complete control of the ICH_*_EL2
registers and expects MIs to be delivered.

> If KVM can happily
> use any maintenance interrupt ID that the board model wants,
> then we should make that work, rather than hardcoding 25 into
> our gicv3 code.

+1.

I'd eliminate any reference to SBSA, as it has no bearing on either
KVM or the QEMU GIC code.

I also question the "if VHE is requested". Not having VHE doesn't
preclude virtualisation. Was that supposed to be "virtualisation
extension" instead?

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: regression: insmod module failed in VM with nvdimm on

2022-11-29 Thread Marc Zyngier
On Wed, 30 Nov 2022 02:52:35 +,
"chenxiang (M)"  wrote:
> 
> Hi,
> 
> We boot the VM using following commands (with nvdimm on)  (qemu
> version 6.1.50, kernel 6.0-r4):

How relevant is the presence of the nvdimm? Do you observe the failure
without this?

> 
> qemu-system-aarch64 -machine
> virt,kernel_irqchip=on,gic-version=3,nvdimm=on  -kernel
> /home/kernel/Image -initrd /home/mini-rootfs/rootfs.cpio.gz -bios
> /root/QEMU_EFI.FD -cpu host -enable-kvm -net none -nographic -m
> 2G,maxmem=64G,slots=3 -smp 4 -append 'rdinit=init console=ttyAMA0
> ealycon=pl0ll,0x9000 pcie_ports=native pciehp.pciehp_debug=1'
> -object memory-backend-ram,id=ram1,size=10G -device
> nvdimm,id=dimm1,memdev=ram1  -device ioh3420,id=root_port1,chassis=1
> -device vfio-pci,host=7d:01.0,id=net0,bus=root_port1
> 
> Then in VM we insmod a module, vmalloc error occurs as follows (kernel
> 5.19-rc4 is normal, and the issue is still on kernel 6.1-rc4):
> 
> estuary:/$ insmod /lib/modules/$(uname -r)/hnae3.ko
> [8.186563] vmap allocation for size 20480 failed: use
> vmalloc=<size> to increase size

Have you tried increasing the vmalloc size to check that this is
indeed the problem?
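
For reference, the knob that message points at is the 'vmalloc=' boot
parameter; something like this, where the size is an arbitrary example
and this assumes the kernel honours 'vmalloc=' on this platform:

```sh
# Same guest kernel command line as in the report, with the vmalloc
# area enlarged (512M is an arbitrary value for the experiment):
-append 'rdinit=init console=ttyAMA0 vmalloc=512M'
```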

[...]

> We git bisect the code, and find the patch c5a89f75d2a ("arm64: kaslr:
> defer initialization to initcall where permitted").

I guess you mean commit fc5a89f75d2a instead, right?

> Do you have any idea about the issue?

I sort of suspect that the nvdimm gets vmap-ed and consumes a large
portion of the vmalloc space, but you give very little information
that could help here...

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v2] vfio/pci: Verify each MSI vector to avoid invalid MSI vectors

2022-11-26 Thread Marc Zyngier
On Sat, 26 Nov 2022 06:33:15 +,
"chenxiang (M)"  wrote:
> 
> 
> 在 2022/11/23 20:08, Marc Zyngier 写道:
> > On Wed, 23 Nov 2022 01:42:36 +,
> > chenxiang  wrote:
> >> From: Xiang Chen 
> >> 
> >> Currently the number of MSI vectors comes from register PCI_MSI_FLAGS,
> >> which should be a power of 2 in qemu. In some scenarios it is not the same
> >> as the number that the driver requires in the guest; for example, a PCI
> >> driver wants to allocate 6 MSI vectors in the guest, but due to the
> >> limitation it will allocate 8 MSI vectors. So it requires 8 MSI vectors in
> >> qemu while the driver in the guest only wants to allocate 6 MSI vectors.
> >> 
> >> When GICv4.1 is enabled, it iterates over all possible MSIs and enables
> >> forwarding while the guest has only created some of the mappings in the
> >> virtual ITS, so some calls fail. The exception print is as follows:
> >> vfio-pci :3a:00.1: irq bypass producer (token 8f08224d)
> >> registration fails:66311
> >> 
> >> To avoid the issue, verify each MSI vector and skip some operations, such
> >> as request_irq() and irq_bypass_register_producer(), for those invalid MSI
> >> vectors.
> >> 
> >> Signed-off-by: Xiang Chen 
> >> ---
> >> I reported the issue at the link:
> >> https://lkml.kernel.org/lkml/87cze9lcut.wl-...@kernel.org/T/
> >> 
> >> Change Log:
> >> v1 -> v2:
> >> Verify each MSI vector in kernel instead of adding a system call,
> >> according to Marc's suggestion
> >> ---
> >>   arch/arm64/kvm/vgic/vgic-irqfd.c  | 13 +
> >>   arch/arm64/kvm/vgic/vgic-its.c| 36 
> >> 
> >>   arch/arm64/kvm/vgic/vgic.h|  1 +
> >>   drivers/vfio/pci/vfio_pci_intrs.c | 33 +
> >>   include/linux/kvm_host.h  |  2 ++
> >>   5 files changed, 85 insertions(+)
> >> 
> >> diff --git a/arch/arm64/kvm/vgic/vgic-irqfd.c 
> >> b/arch/arm64/kvm/vgic/vgic-irqfd.c
> >> index 475059b..71f6af57 100644
> >> --- a/arch/arm64/kvm/vgic/vgic-irqfd.c
> >> +++ b/arch/arm64/kvm/vgic/vgic-irqfd.c
> >> @@ -98,6 +98,19 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
> >>return vgic_its_inject_msi(kvm, &msi);
> >>   }
> >>   +int kvm_verify_msi(struct kvm *kvm,
> >> + struct kvm_kernel_irq_routing_entry *irq_entry)
> >> +{
> >> +  struct kvm_msi msi;
> >> +
> >> +  if (!vgic_has_its(kvm))
> >> +  return -ENODEV;
> >> +
> >> +  kvm_populate_msi(irq_entry, &msi);
> >> +
> >> +  return vgic_its_verify_msi(kvm, &msi);
> >> +}
> >> +
> >>   /**
> >>* kvm_arch_set_irq_inatomic: fast-path for irqfd injection
> >>*/
> >> diff --git a/arch/arm64/kvm/vgic/vgic-its.c 
> >> b/arch/arm64/kvm/vgic/vgic-its.c
> >> index 94a666d..8312a4a 100644
> >> --- a/arch/arm64/kvm/vgic/vgic-its.c
> >> +++ b/arch/arm64/kvm/vgic/vgic-its.c
> >> @@ -767,6 +767,42 @@ int vgic_its_inject_cached_translation(struct kvm 
> >> *kvm, struct kvm_msi *msi)
> >>return 0;
> >>   }
> >>   +int vgic_its_verify_msi(struct kvm *kvm, struct kvm_msi *msi)
> >> +{
> >> +  struct vgic_its *its;
> >> +  struct its_ite *ite;
> >> +  struct kvm_vcpu *vcpu;
> >> +  int ret = 0;
> >> +
> >> +  if (!irqchip_in_kernel(kvm) || (msi->flags & ~KVM_MSI_VALID_DEVID))
> >> +  return -EINVAL;
> >> +
> >> +  if (!vgic_has_its(kvm))
> >> +  return -ENODEV;
> >> +
> >> +  its = vgic_msi_to_its(kvm, msi);
> >> +  if (IS_ERR(its))
> >> +  return PTR_ERR(its);
> >> +
> >> +  mutex_lock(&its->its_lock);
> >> +  if (!its->enabled) {
> >> +  ret = -EBUSY;
> >> +  goto unlock;
> >> +  }
> >> +  ite = find_ite(its, msi->devid, msi->data);
> >> +  if (!ite || !its_is_collection_mapped(ite->collection)) {
> >> +  ret = E_ITS_INT_UNMAPPED_INTERRUPT;
> >> +  goto unlock;
> >> +  }
> >> +
> >> +  vcpu = kvm_get_vcpu(kvm, ite->collection->target_addr);
> >> +  if (!vcpu)
> >> +  ret = E_ITS_INT_UNMAPPED_INTERRUPT;
> > I'm sorry, but what does this mean to the caller? This should never
> > leak outside of the ITS code.
> 
> Actually it already leaks outside of the ITS code; please see the
> exception printk (E_ITS_INT_UNMAPPED_INTERRUPT is 0x10307, which is
> equal to 66311):
> 
> vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) 
> registration fails:66311
> 

But that error code can hardly be interpreted by the caller, which is
the whole point. Only zero is considered a success value.

> > Honestly, the whole things seems really complicated to avoid something
> > that is only a harmless warning .
> 
> It also seems to waste some interrupts: allocating and requesting some
> interrupts that are never used.

What makes you think they are not used? A guest can install a mapping
for those at any point. They won't be directly injected, but they will
be delivered to the guest via the normal SW injection mechanism.

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v2] vfio/pci: Verify each MSI vector to avoid invalid MSI vectors

2022-11-24 Thread Marc Zyngier
On Wed, 23 Nov 2022 19:55:14 +,
Alex Williamson  wrote:
> 
> On Wed, 23 Nov 2022 12:08:05 +0000
> Marc Zyngier  wrote:
> 
> > On Wed, 23 Nov 2022 01:42:36 +,
> > chenxiang  wrote:
> > > 
> > > +static int vfio_pci_verify_msi_entry(struct vfio_pci_core_device *vdev,
> > > + struct eventfd_ctx *trigger)
> > > +{
> > > + struct kvm *kvm = vdev->vdev.kvm;
> > > + struct kvm_kernel_irqfd *tmp;
> > > + struct kvm_kernel_irq_routing_entry irq_entry;
> > > + int ret = -ENODEV;
> > > +
> > > + spin_lock_irq(&kvm->irqfds.lock);
> > > + list_for_each_entry(tmp, &kvm->irqfds.items, list) {
> > > + if (trigger == tmp->eventfd) {
> > > + ret = 0;
> > > + break;
> > > + }
> > > + }
> > > + spin_unlock_irq(&kvm->irqfds.lock);
> > > + if (ret)
> > > + return ret;
> > > + irq_entry = tmp->irq_entry;
> > > + return kvm_verify_msi(kvm, &irq_entry);  
> > 
> > How does this work on !arm64? Why do we need an on-stack version of
> > tmp->irq_entry?
> 
> Not only on !arm64, but in any scenario that doesn't involve KVM.
> There cannot be a hard dependency between vfio and kvm.  Thanks,

Yup, good point.

> 
> Alex
> 
> PS - What driver/device actually cares about more than 1 MSI vector and
> doesn't implement MSI-X?

Unfortunately, there is a metric ton of crap that fits in that
description:

01:00.0 Network controller: Broadcom Inc. and subsidiaries Device 4433 (rev 07)
Subsystem: Apple Inc. Device 4387
Device tree node: 
/sys/firmware/devicetree/base/soc/pcie@69000/pci@0,0/wifi@0,0
Flags: bus master, fast devsel, latency 0, IRQ 97, IOMMU group 4
Memory at 6c140 (64-bit, non-prefetchable) [size=64K]
Memory at 6c000 (64-bit, non-prefetchable) [size=16M]
Capabilities: [48] Power Management version 3
Capabilities: [58] MSI: Enable+ Count=1/32 Maskable- 64bit+

... and no MSI-X in sight. Pass this to a VM, and you'll see exactly
what is described here. And that's not old stuff either. This is brand
new HW.

Do we need to care? I don't think so.

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v2] vfio/pci: Verify each MSI vector to avoid invalid MSI vectors

2022-11-23 Thread Marc Zyngier
On Wed, 23 Nov 2022 01:42:36 +,
chenxiang  wrote:
> 
> From: Xiang Chen 
> 
> Currently the number of MSI vectors comes from register PCI_MSI_FLAGS,
> which should be a power of 2 in qemu. In some scenarios it is not the same as
> the number that the driver requires in the guest; for example, a PCI driver
> wants to allocate 6 MSI vectors in the guest, but due to the limitation it
> will allocate 8 MSI vectors. So it requires 8 MSI vectors in qemu while the
> driver in the guest only wants to allocate 6 MSI vectors.
> 
> When GICv4.1 is enabled, it iterates over all possible MSIs and enables
> forwarding while the guest has only created some of the mappings in the
> virtual ITS, so some calls fail. The exception print is as follows:
> vfio-pci :3a:00.1: irq bypass producer (token 8f08224d)
> registration fails:66311
> 
> To avoid the issue, verify each MSI vector and skip some operations, such as
> request_irq() and irq_bypass_register_producer(), for those invalid MSI
> vectors.
> 
> Signed-off-by: Xiang Chen 
> ---
> I reported the issue at the link:
> https://lkml.kernel.org/lkml/87cze9lcut.wl-...@kernel.org/T/
> 
> Change Log:
> v1 -> v2:
> Verify each MSI vector in kernel instead of adding a system call, according
> to Marc's suggestion
> ---
>  arch/arm64/kvm/vgic/vgic-irqfd.c  | 13 +
>  arch/arm64/kvm/vgic/vgic-its.c| 36 
>  arch/arm64/kvm/vgic/vgic.h|  1 +
>  drivers/vfio/pci/vfio_pci_intrs.c | 33 +
>  include/linux/kvm_host.h  |  2 ++
>  5 files changed, 85 insertions(+)
> 
> diff --git a/arch/arm64/kvm/vgic/vgic-irqfd.c 
> b/arch/arm64/kvm/vgic/vgic-irqfd.c
> index 475059b..71f6af57 100644
> --- a/arch/arm64/kvm/vgic/vgic-irqfd.c
> +++ b/arch/arm64/kvm/vgic/vgic-irqfd.c
> @@ -98,6 +98,19 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
>   return vgic_its_inject_msi(kvm, &msi);
>  }
>  
> +int kvm_verify_msi(struct kvm *kvm,
> +struct kvm_kernel_irq_routing_entry *irq_entry)
> +{
> + struct kvm_msi msi;
> +
> + if (!vgic_has_its(kvm))
> + return -ENODEV;
> +
> + kvm_populate_msi(irq_entry, &msi);
> +
> + return vgic_its_verify_msi(kvm, &msi);
> +}
> +
>  /**
>   * kvm_arch_set_irq_inatomic: fast-path for irqfd injection
>   */
> diff --git a/arch/arm64/kvm/vgic/vgic-its.c b/arch/arm64/kvm/vgic/vgic-its.c
> index 94a666d..8312a4a 100644
> --- a/arch/arm64/kvm/vgic/vgic-its.c
> +++ b/arch/arm64/kvm/vgic/vgic-its.c
> @@ -767,6 +767,42 @@ int vgic_its_inject_cached_translation(struct kvm *kvm, 
> struct kvm_msi *msi)
>   return 0;
>  }
>  
> +int vgic_its_verify_msi(struct kvm *kvm, struct kvm_msi *msi)
> +{
> + struct vgic_its *its;
> + struct its_ite *ite;
> + struct kvm_vcpu *vcpu;
> + int ret = 0;
> +
> + if (!irqchip_in_kernel(kvm) || (msi->flags & ~KVM_MSI_VALID_DEVID))
> + return -EINVAL;
> +
> + if (!vgic_has_its(kvm))
> + return -ENODEV;
> +
> + its = vgic_msi_to_its(kvm, msi);
> + if (IS_ERR(its))
> + return PTR_ERR(its);
> +
> + mutex_lock(&its->its_lock);
> + if (!its->enabled) {
> + ret = -EBUSY;
> + goto unlock;
> + }
> + ite = find_ite(its, msi->devid, msi->data);
> + if (!ite || !its_is_collection_mapped(ite->collection)) {
> + ret = E_ITS_INT_UNMAPPED_INTERRUPT;
> + goto unlock;
> + }
> +
> + vcpu = kvm_get_vcpu(kvm, ite->collection->target_addr);
> + if (!vcpu)
> + ret = E_ITS_INT_UNMAPPED_INTERRUPT;

I'm sorry, but what does this mean to the caller? This should never
leak outside of the ITS code.

> +unlock:
> + mutex_unlock(&its->its_lock);
> + return ret;
> +}
> +
>  /*
>   * Queries the KVM IO bus framework to get the ITS pointer from the given
>   * doorbell address.
> diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
> index 0c8da72..d452150 100644
> --- a/arch/arm64/kvm/vgic/vgic.h
> +++ b/arch/arm64/kvm/vgic/vgic.h
> @@ -240,6 +240,7 @@ int kvm_vgic_register_its_device(void);
>  void vgic_enable_lpis(struct kvm_vcpu *vcpu);
>  void vgic_flush_pending_lpis(struct kvm_vcpu *vcpu);
>  int vgic_its_inject_msi(struct kvm *kvm, struct kvm_msi *msi);
> +int vgic_its_verify_msi(struct kvm *kvm, struct kvm_msi *msi);
>  int vgic_v3_has_attr_regs(struct kvm_device *dev, struct kvm_device_attr 
> *attr);
>  int vgic_v3_dist_uaccess(struct kvm_vcpu *vcpu, bool is_write,
>int offset, u32 *val);
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
> b/drivers/vfio/pci/vfio_pci_intrs.c
> index 40c3d7c..3027805 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "vfio_pci_priv.h"
>  
> @@ -315,6 +316,28 @@ static int vfio_msi_enable(struct vfio_pci_core_device 
> *vdev, int nvec, bool msi
>   return 0;
>  }
> 

Re: [PATCH] KVM: Add system call KVM_VERIFY_MSI to verify MSI vector

2022-11-10 Thread Marc Zyngier
On Wed, 09 Nov 2022 06:21:18 +,
"chenxiang (M)"  wrote:
> 
> Hi Marc,
> 
> 
> 在 2022/11/8 20:47, Marc Zyngier 写道:
> > On Tue, 08 Nov 2022 08:08:57 +,
> > chenxiang  wrote:
> >> From: Xiang Chen 
> >> 
> >> Currently the number of MSI vectors comes from register PCI_MSI_FLAGS,
> >> which should be a power of 2, but in some scenarios it is not the same as
> >> the number that the driver requires in the guest; for example, a PCI driver
> >> wants to allocate 6 MSI vectors in the guest, but due to the limitation it
> >> will allocate 8 MSI vectors. So it requires 8 MSI vectors in qemu while the
> >> driver in the guest only wants to allocate 6 MSI vectors.
> >> 
> >> When GICv4.1 is enabled, we can see some exception prints as follows for
> >> the above scenario:
> >> vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) 
> >> registration fails:66311
> >> 
> >> In order to verify whether an MSI vector is valid, add KVM_VERIFY_MSI to
> >> do that. If there is a mapping, return 0; otherwise return a negative value.
> >> 
> >> This is the kernel part of adding system call KVM_VERIFY_MSI.
> > Exposing something that is an internal implementation detail to
> > userspace feels like the absolute wrong way to solve this issue.
> > 
> > Can you please characterise the issue you're having? Is it that vfio
> > tries to enable an interrupt for which there is no virtual ITS
> > mapping? Shouldn't we instead try and manage this in the kernel?
> 
> Before I reported the issue to the community, you gave a suggestion about
> the issue, but I'm not sure whether I misunderstood your meaning.
> You can refer to the link for more details about the issue:
> https://lkml.kernel.org/lkml/87cze9lcut.wl-...@kernel.org/T/

Right. It would have been helpful to mention this earlier. Anyway, I
would really like this to be done without involving userspace at all.

But first, can you please confirm that the VM works as expected
despite the message? If that's the case, we only need to handle the
case where this is a multi-MSI setup, and I think this can be done in
VFIO, without involving userspace.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH] KVM: Add system call KVM_VERIFY_MSI to verify MSI vector

2022-11-08 Thread Marc Zyngier
On Tue, 08 Nov 2022 08:08:57 +,
chenxiang  wrote:
> 
> From: Xiang Chen 
> 
> Currently the number of MSI vectors comes from register PCI_MSI_FLAGS,
> which should be a power of 2, but in some scenarios it is not the same as
> the number that the driver requires in the guest; for example, a PCI driver
> wants to allocate 6 MSI vectors in the guest, but due to the limitation it
> will allocate 8 MSI vectors. So it requires 8 MSI vectors in qemu while the
> driver in the guest only wants to allocate 6 MSI vectors.
>
> When GICv4.1 is enabled, we can see some exception prints as follows for
> the above scenario:
> vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) 
> registration fails:66311
> 
> In order to verify whether an MSI vector is valid, add KVM_VERIFY_MSI to do
> that. If there is a mapping, return 0; otherwise return a negative value.
> 
> This is the kernel part of adding system call KVM_VERIFY_MSI.

Exposing something that is an internal implementation detail to
userspace feels like the absolute wrong way to solve this issue.

Can you please characterise the issue you're having? Is it that vfio
tries to enable an interrupt for which there is no virtual ITS
mapping? Shouldn't we instead try and manage this in the kernel?

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v6 0/7] hw/arm/virt: Improve address assignment for high memory regions

2022-10-29 Thread Marc Zyngier
On Wed, 26 Oct 2022 01:29:56 +0100,
Gavin Shan  wrote:
> 
> Hi Peter and Marc,
> 
> On 10/24/22 11:54 AM, Gavin Shan wrote:
> > There are three high memory regions, which are VIRT_HIGH_REDIST2,
> > VIRT_HIGH_PCIE_ECAM and VIRT_HIGH_PCIE_MMIO. Their base addresses
> > float above the highest RAM address. However, they can be disabled
> > in several cases.
> > 
> > (1) One specific high memory region is disabled by developer by
> >  toggling vms->highmem_{redists, ecam, mmio}.
> > 
> > (2) VIRT_HIGH_PCIE_ECAM region is disabled on machines that are
> >  'virt-2.12' or earlier.
> > 
> > (3) VIRT_HIGH_PCIE_ECAM region is disabled when firmware is loaded
> >  on a 32-bit system.
> > 
> > (4) One specific high memory region is disabled when it breaks the
> >  PA space limit.
> > 
> > The current implementation of virt_set_memmap() isn't comprehensive
> > because the space for one specific high memory region is always
> > reserved from the PA space for cases (1), (2) and (3). In the code,
> > 'base' and 'vms->highest_gpa' are always increased for those three
> > cases. It's unnecessary since the assigned space of the disabled
> > high memory region won't be used afterwards.
> > 
> > The series intends to improve the address assignment for these
> > high memory regions and introduces new properties for user to
> > selectively disable those 3 high memory regions.
> > 
> > PATCH[1-4] preparatory work for the improvement
> > PATCH[5]   improve high memory region address assignment
> > PATCH[6]   adds 'compact-highmem' to enable or disable the optimization
> > PATCH[7]   adds properties so that high memory regions can be disabled
> > 
> > v5: https://lists.nongnu.org/archive/html/qemu-arm/2022-10/msg00280.html
> > v4: https://lists.nongnu.org/archive/html/qemu-arm/2022-10/msg00067.html
> > v3: https://lists.nongnu.org/archive/html/qemu-arm/2022-09/msg00258.html
> > v2: https://lore.kernel.org/all/20220815062958.100366-1-gs...@redhat.com/T/
> > v1: https://lists.nongnu.org/archive/html/qemu-arm/2022-08/msg00013.html
> > 
> 
> Could you help take a look when you get a chance? I think Connie and
> Eric are close to completing their reviews, but a v7 is still needed to
> address extra comments from them. I hope to make v7 mergeable if possible :)

With the comments from Connie and Eric addressed, this looks good to
me:

Reviewed-by: Marc Zyngier 

Thanks for having gone the extra mile on this one.

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v5 6/6] hw/arm/virt: Add 'compact-highmem' property

2022-10-20 Thread Marc Zyngier
On Thu, 20 Oct 2022 00:57:32 +0100,
Gavin Shan  wrote:
> 
> Following Marc's suggestion to add properties so that these high memory
> regions can be disabled by users, I can add one patch after this one
> to introduce the following 3 properties. Could you please confirm
> the property names are good enough? It would be nice if Marc could
> confirm before I start working on the next revision.
> 
> "highmem-ecam":"on"/"off" on vms->highmem_ecam
> "highmem-mmio":"on"/"off" on vms->highmem_mmio
> "highmem-redists": "on"/"off" on vms->highmem_redists

I think that'd be reasonable, and would give the user some actual
control over what gets exposed in the highmem region.

I guess that the annoying thing with these options is that they allow
the user to request conflicting settings (256 CPUs and
highmem-redists=off, for example). You'll need to make this fail more
or less gracefully.
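
To illustrate the sort of conflicting request I mean (property names as
proposed above; the graceful failure is the hoped-for behaviour, not
something implemented yet):

```sh
# 256 vcpus need the high redistributor region, but it is disabled here;
# such a request should fail cleanly at startup rather than silently
# truncating the GIC redistributors:
qemu-system-aarch64 -accel kvm -cpu host \
    -machine virt,gic-version=3,highmem-redists=off \
    -smp 256 -m 4G -nographic
```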

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v4 6/6] hw/arm/virt: Add 'compact-highmem' property

2022-10-04 Thread Marc Zyngier
On Tue, 04 Oct 2022 01:26:27 +0100,
Gavin Shan  wrote:
> 
> After the improvement to high memory region address assignment is
> applied, the memory layout can be changed, introducing possible
> migration breakage. For example, VIRT_HIGH_PCIE_MMIO memory region
> is disabled or enabled when the optimization is applied or not, with
> the following configuration.
> 
>   pa_bits  = 40;
>   vms->highmem_redists = false;
>   vms->highmem_ecam= false;
>   vms->highmem_mmio= true;

The question is how are these parameters specified by a user? Short of
hacking the code, this isn't really possible.

> 
>   # qemu-system-aarch64 -accel kvm -cpu host\
> -machine virt-7.2,compact-highmem={on, off} \
> -m 4G,maxmem=511G -monitor stdio
> 
>   Regioncompact-highmem=off compact-highmem=on
>   
>   RAM   [1GB 512GB][1GB 512GB]
>   HIGH_GIC_REDISTS  [512GB   512GB+64MB]   [disabled]
>   HIGH_PCIE_ECAM[512GB+256MB 512GB+512MB]  [disabled]
>   HIGH_PCIE_MMIO[disabled] [512GB   1TB]
> 
> In order to keep backwards compatibility, we need to disable the
> optimization on machines that are virt-7.1 or earlier. This
> means the optimization is enabled by default from virt-7.2. Besides,
> a 'compact-highmem' property is added so that the optimization can be
> explicitly enabled or disabled on all machine types by users.

Not directly related to this series, but it seems to me that we should
be aiming at reproducible results across HW implementations (at least
with KVM). Depending on how many PA bits the HW implements, we end up
with one set of devices or another, which is likely to be confusing for
a user.

I think we should consider an additional set of changes to allow a
user to specify the PA bits as well as the devices they want to see
enabled.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v3] target/arm/kvm: Retry KVM_CREATE_VM call if it fails EINTR

2022-09-30 Thread Marc Zyngier
On Fri, 30 Sep 2022 12:38:24 +0100,
Peter Maydell  wrote:
> 
> Occasionally the KVM_CREATE_VM ioctl can return EINTR, even though
> there is no pending signal to be taken. In commit 94ccff13382055
> we added a retry-on-EINTR loop to the KVM_CREATE_VM call in the
> generic KVM code. Adopt the same approach for the use of the
> ioctl in the Arm-specific KVM code (where we use it to create a
> scratch VM for probing for various things).
> 
> For more information, see the mailing list thread:
> https://lore.kernel.org/qemu-devel/8735e0s1zw.wl-...@kernel.org/
> 
> Reported-by: Vitaly Chikunov 
> Signed-off-by: Peter Maydell 
> ---
> The view in the thread seems to be that this is a kernel bug (because
> in QEMU's case there shouldn't be a signal to be delivered at this
> point because of our signal handling strategy); so I've adopted the
> same "just retry-on-EINTR for this specific ioctl" approach that
> commit 94ccff13 did, rather than, for instance, something wider like
> "make kvm_ioctl() and friends always retry on EINTR".
> 
> v2: correctly check for -1 and errno is EINTR...
> v3: really correctly check errno. This time for sure!
> ---
>  target/arm/kvm.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)

Acked-by: Marc Zyngier 

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH] target/arm: Use the max page size in a 2-stage ptw

2022-09-28 Thread Marc Zyngier
On Wed, 28 Sep 2022 05:34:53 +0100,
Zenghui Yu  wrote:
> 
> [ Fix Marc's email address ]

Ah, many thanks Zenghui! I was wondering whether my discussion with
Richard had any result. As it turns out, it had an almost immediate
result!

> 
> On 2022/9/13 21:56, Richard Henderson wrote:
> > We had only been reporting the stage2 page size.  This causes
> > problems if stage1 is using a larger page size (16k, 2M, etc),
> > but stage2 is using a smaller page size, because cputlb does
> > not set large_page_{addr,mask} properly.
> > 
> > Fix by using the max of the two page sizes.
> > 
> > Reported-by: Marc Zyngier 

This is no longer a thing (and hasn't been for over 3 years!  ;-).
m...@kernel.org is the canonical address.

> > Signed-off-by: Richard Henderson 
> > ---
> > 
> > Hi Marc, I think this will fix the issue that you mentioned on Monday.
> > It certainly appears to fit the bill vs the described symptoms.
> > 
> > This is based on my ptw.c rewrite, full tree at
> > 
> > https://gitlab.com/rth7680/qemu/-/tree/tgt-arm-rme
> > 
> > Based-on: 20220822152741.1617527-1-richard.hender...@linaro.org
> > ("[PATCH v2 00/66] target/arm: Implement FEAT_HAFDBS")

Thanks Richard. I'll try and give it a spin shortly.

Cheers,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH] target/arm/kvm: Retry KVM_CREATE_VM call if it fails EINTR

2022-09-26 Thread Marc Zyngier
On Mon, 26 Sep 2022 09:36:44 -0400,
Peter Maydell  wrote:
> 
> Occasionally the KVM_CREATE_VM ioctl can return EINTR, even though
> there is no pending signal to be taken. In commit 94ccff13382055
> we added a retry-on-EINTR loop to the KVM_CREATE_VM call in the
> generic KVM code. Adopt the same approach for the use of the
> ioctl in the Arm-specific KVM code (where we use it to create a
> scratch VM for probing for various things).
> 
> For more information, see the mailing list thread:
> https://lore.kernel.org/qemu-devel/8735e0s1zw.wl-...@kernel.org/
> 
> Reported-by: Vitaly Chikunov 
> Signed-off-by: Peter Maydell 

Acked-by: Marc Zyngier 

M.

-- 
Without deviation from the norm, progress is not possible.



Re: qemu-system-aarch64: Failed to retrieve host CPU features

2022-08-13 Thread Marc Zyngier
On Sat, 13 Aug 2022 12:11:37 +0100,
Vitaly Chikunov  wrote:
> 
> Marc,
> 
> On Fri, Aug 12, 2022 at 04:02:37PM +0100, Marc Zyngier wrote:
> > On Fri, 12 Aug 2022 10:25:55 +0100,
> > Peter Maydell  wrote:
> > > 
> > > I've added some more relevant mailing lists to the cc.
> > > 
> > > On Fri, 12 Aug 2022 at 09:45, Vitaly Chikunov  wrote:
> > > > On Fri, Aug 12, 2022 at 05:14:27AM +0300, Vitaly Chikunov wrote:
> > > > > I noticed that we started to get many errors like this:
> > > > >
> > > > >   qemu-system-aarch64: Failed to retrieve host CPU features
> > > > >
> > > > > "Many" here is 1-2% per run, depending on the host; the host is Kunpeng-920,
> > > > > and the Linux kernel is v5.15.59, but it started to appear months before that.
> > > > >
> > > > > strace shows in erroneous case:
> > > > >
> > > > >   1152244 ioctl(9, KVM_CREATE_VM, 0x30)   = -1 EINTR (Interrupted 
> > > > > system call)
> > > > >
> > > > > And I see in target/arm/kvm.c:kvm_arm_create_scratch_host_vcpu:
> > > > >
> > > > > vmfd = ioctl(kvmfd, KVM_CREATE_VM, max_vm_pa_size);
> > > > > if (vmfd < 0) {
> > > > > goto err;
> > > > > }
> > > > >
> > > > > Maybe it should restart ioctl on EINTR?
> > > > >
> > > > > I don't see EINTR documented in ioctl(2) nor in Linux'
> > > > > Documentation/virt/kvm/api.rst for KVM_CREATE_VM, but for KVM_RUN it
> > > > > says "an unmasked signal is pending".
> > > >
> > > > It was suggested to me that almost any blocking syscall could return EINTR,
> > > > so I checked the strace log, and it does not show evidence of a signal
> > > > arriving; the log ends like this:
> > > >
> > > >   1152244 openat(AT_FDCWD, "/dev/kvm", O_RDWR|O_CLOEXEC) = 9
> > > >   1152244 ioctl(9, KVM_CHECK_EXTENSION, KVM_CAP_ARM_VM_IPA_SIZE) = 48
> > > >   1152244 ioctl(9, KVM_CREATE_VM, 0x30)   = -1 EINTR (Interrupted 
> > > > system call)
> > > >   1152244 close(9)= 0
> > > >   1152244 newfstatat(2, "", {st_dev=makedev(0, 0xd), st_ino=57869925, 
> > > > st_mode=S_IFIFO|0600, st_nlink=1, st_uid=517, st_gid=517, 
> > > > st_blksize=4096, st_blocks=0, st_size=0, st_atime=1660268019 /* 
> > > > 2022-08-12T01:33:39.850436293+ */, st_atime_nsec=850436293, 
> > > > st_mtime=1660268019 /* 2022-08-12T01:33:39.850436293+ */, 
> > > > st_mtime_nsec=850436293, st_ctime=1660268019 /* 
> > > > 2022-08-12T01:33:39.850436293+ */, st_ctime_nsec=850436293}, 
> > > > AT_EMPTY_PATH) = 0
> > > >   1152244 write(2, "qemu-system-aarch64: Failed to r"..., 58) = 58
> > > >   1152244 exit_group(1)   = ?
> > > >   1152245 <... clock_nanosleep resumed> ) = ?
> > > >   1152245 +++ exited with 1 +++
> > > >   1152244 +++ exited with 1 +++
> > > 
> > > KVM folks: should we expect that KVM_CREATE_VM might fail EINTR
> > > and need retrying?
> > 
> > In general, yes. But for this particular one, this is pretty odd.
> > 
> > The only path I can so far see that would match this behaviour is if
> > mm_take_all_locks() (called from __mmu_notifier_register()) was
> > getting interrupted by a signal (I'm looking at a 5.19-ish kernel,
> > which may slightly differ from the 5.15 mentioned above).
> > 
> > But as Vitaly points out, it doesn't seem to be a signal delivered
> > here.
> > 
> > Vitaly: could you please share your exact test case (full qemu command
> > line), and instrument your kernel to see if mm_take_all_locks() is the
> > one failing?
> 
> Full command is `qemu-system-aarch64 -M accel=kvm:tcg -m 4096M -smp
>   cores=8 -nodefaults -nographic -no-reboot -fsdev
>   local,id=root,path=/,security_model=none,multidevs=remap -device
>   virtio-9p-pci,fsdev=root,mount_tag=/dev/root -device virtio-rng-pci
>   -serial mon:stdio -kernel /boot/vmlinuz-5.18.16-un-def-alt1 -initrd
>   /usr/src/tmp/initramfs-5.18.16-un-def-alt1.img -sandbox on,spawn=deny -M
>   virt,gic-version=3 -cpu max -append 'console=ttyAMA0 mitigations=off
>   nokaslr quiet panic=-1 SCRIPT=/usr/src/tmp/tmp.458pkF5r8d'`.
> 
> But a minified reproducer is `qemu-system-aarch64 -M virt,accel=kvm -cpu 

Re: qemu-system-aarch64: Failed to retrieve host CPU features

2022-08-12 Thread Marc Zyngier
On Fri, 12 Aug 2022 10:25:55 +0100,
Peter Maydell  wrote:
> 
> I've added some more relevant mailing lists to the cc.
> 
> On Fri, 12 Aug 2022 at 09:45, Vitaly Chikunov  wrote:
> > On Fri, Aug 12, 2022 at 05:14:27AM +0300, Vitaly Chikunov wrote:
> > > I noticed that we started to get many errors like this:
> > >
> > >   qemu-system-aarch64: Failed to retrieve host CPU features
> > >
> > > "Many" here is 1-2% per run, depending on the host; the host is Kunpeng-920,
> > > and the Linux kernel is v5.15.59, but it started to appear months before that.
> > >
> > > strace shows in erroneous case:
> > >
> > >   1152244 ioctl(9, KVM_CREATE_VM, 0x30)   = -1 EINTR (Interrupted system 
> > > call)
> > >
> > > And I see in target/arm/kvm.c:kvm_arm_create_scratch_host_vcpu:
> > >
> > > vmfd = ioctl(kvmfd, KVM_CREATE_VM, max_vm_pa_size);
> > > if (vmfd < 0) {
> > > goto err;
> > > }
> > >
> > > Maybe it should restart ioctl on EINTR?
> > >
> > > I don't see EINTR documented in ioctl(2) nor in Linux'
> > > Documentation/virt/kvm/api.rst for KVM_CREATE_VM, but for KVM_RUN it
> > > says "an unmasked signal is pending".
> >
> > It was suggested to me that almost any blocking syscall could return EINTR,
> > so I checked the strace log, and it does not show evidence of a signal
> > arriving; the log ends like this:
> >
> >   1152244 openat(AT_FDCWD, "/dev/kvm", O_RDWR|O_CLOEXEC) = 9
> >   1152244 ioctl(9, KVM_CHECK_EXTENSION, KVM_CAP_ARM_VM_IPA_SIZE) = 48
> >   1152244 ioctl(9, KVM_CREATE_VM, 0x30)   = -1 EINTR (Interrupted system 
> > call)
> >   1152244 close(9)= 0
> >   1152244 newfstatat(2, "", {st_dev=makedev(0, 0xd), st_ino=57869925, 
> > st_mode=S_IFIFO|0600, st_nlink=1, st_uid=517, st_gid=517, st_blksize=4096, 
> > st_blocks=0, st_size=0, st_atime=1660268019 /* 
> > 2022-08-12T01:33:39.850436293+ */, st_atime_nsec=850436293, 
> > st_mtime=1660268019 /* 2022-08-12T01:33:39.850436293+ */, 
> > st_mtime_nsec=850436293, st_ctime=1660268019 /* 
> > 2022-08-12T01:33:39.850436293+ */, st_ctime_nsec=850436293}, 
> > AT_EMPTY_PATH) = 0
> >   1152244 write(2, "qemu-system-aarch64: Failed to r"..., 58) = 58
> >   1152244 exit_group(1)   = ?
> >   1152245 <... clock_nanosleep resumed> ) = ?
> >   1152245 +++ exited with 1 +++
> >   1152244 +++ exited with 1 +++
> 
> KVM folks: should we expect that KVM_CREATE_VM might fail EINTR
> and need retrying?
> 
> (I suspect the answer is "yes", given we do this in the generic
> code in kvm-all.c.)

Interestingly, this has cropped up in the (distant) past:

https://lists.gnu.org/archive/html/qemu-devel/2014-01/msg01031.html

and seems to point at the path I was mentioning earlier (the code
hasn't changed too much since, apparently).

I'd still like to understand the underlying reason though.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: qemu-system-aarch64: Failed to retrieve host CPU features

2022-08-12 Thread Marc Zyngier
Hi Peter,

On Fri, 12 Aug 2022 10:25:55 +0100,
Peter Maydell  wrote:
> 
> I've added some more relevant mailing lists to the cc.
> 
> On Fri, 12 Aug 2022 at 09:45, Vitaly Chikunov  wrote:
> > On Fri, Aug 12, 2022 at 05:14:27AM +0300, Vitaly Chikunov wrote:
> > > I noticed that we started to get many errors like this:
> > >
> > >   qemu-system-aarch64: Failed to retrieve host CPU features
> > >
> > > "Many" here is 1-2% per run, depending on the host; the host is Kunpeng-920,
> > > and the Linux kernel is v5.15.59, but it started to appear months before that.
> > >
> > > strace shows in erroneous case:
> > >
> > >   1152244 ioctl(9, KVM_CREATE_VM, 0x30)   = -1 EINTR (Interrupted system 
> > > call)
> > >
> > > And I see in target/arm/kvm.c:kvm_arm_create_scratch_host_vcpu:
> > >
> > > vmfd = ioctl(kvmfd, KVM_CREATE_VM, max_vm_pa_size);
> > > if (vmfd < 0) {
> > > goto err;
> > > }
> > >
> > > Maybe it should restart ioctl on EINTR?
> > >
> > > I don't see EINTR documented in ioctl(2) nor in Linux'
> > > Documentation/virt/kvm/api.rst for KVM_CREATE_VM, but for KVM_RUN it
> > > says "an unmasked signal is pending".
> >
> > It was suggested to me that almost any blocking syscall could return EINTR,
> > so I checked the strace log, and it does not show evidence of a signal
> > arriving; the log ends like this:
> >
> >   1152244 openat(AT_FDCWD, "/dev/kvm", O_RDWR|O_CLOEXEC) = 9
> >   1152244 ioctl(9, KVM_CHECK_EXTENSION, KVM_CAP_ARM_VM_IPA_SIZE) = 48
> >   1152244 ioctl(9, KVM_CREATE_VM, 0x30)   = -1 EINTR (Interrupted system 
> > call)
> >   1152244 close(9)= 0
> >   1152244 newfstatat(2, "", {st_dev=makedev(0, 0xd), st_ino=57869925, 
> > st_mode=S_IFIFO|0600, st_nlink=1, st_uid=517, st_gid=517, st_blksize=4096, 
> > st_blocks=0, st_size=0, st_atime=1660268019 /* 
> > 2022-08-12T01:33:39.850436293+ */, st_atime_nsec=850436293, 
> > st_mtime=1660268019 /* 2022-08-12T01:33:39.850436293+ */, 
> > st_mtime_nsec=850436293, st_ctime=1660268019 /* 
> > 2022-08-12T01:33:39.850436293+ */, st_ctime_nsec=850436293}, 
> > AT_EMPTY_PATH) = 0
> >   1152244 write(2, "qemu-system-aarch64: Failed to r"..., 58) = 58
> >   1152244 exit_group(1)   = ?
> >   1152245 <... clock_nanosleep resumed> ) = ?
> >   1152245 +++ exited with 1 +++
> >   1152244 +++ exited with 1 +++
> 
> KVM folks: should we expect that KVM_CREATE_VM might fail EINTR
> and need retrying?

In general, yes. But for this particular one, this is pretty odd.

The only path I can so far see that would match this behaviour is if
mm_take_all_locks() (called from __mmu_notifier_register()) was
getting interrupted by a signal (I'm looking at a 5.19-ish kernel,
which may slightly differ from the 5.15 mentioned above).

But as Vitaly points out, it doesn't seem to be a signal delivered
here.

Vitaly: could you please share your exact test case (full qemu command
line), and instrument your kernel to see if mm_take_all_locks() is the
one failing?

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH 1/2] hw/arm/virt: Improve address assignment for highmem IO regions

2022-08-11 Thread Marc Zyngier
Hi Gavin,

On Thu, 11 Aug 2022 06:32:36 +0100,
Gavin Shan  wrote:
> 
> Hi Marc,
> 
> On 8/8/22 7:17 PM, Marc Zyngier wrote:
> > On Wed, 03 Aug 2022 14:02:04 +0100,
> > Gavin Shan  wrote:

[...]

> Sorry for the delay. The original changelog is confusing, sorry
> about that. I will improve it if a v2 is needed :)
> 
> Yes, we shouldn't assign address to VIRT_HIGH_PCIE_ECAM region
> and enable it when vms->highmem_ecam is false in virt_set_memmap().
> 
> In the original implementation of virt_set_memmap(), the memory
> region is disabled when vms->highmem_ecam is false. However,
> the address is still assigned to the memory region, even when
> vms->highmem_ecam is false. This leads to waste in the PA space.

Yes, I got this point now. However, I think the approach you've chosen
leads to subtle issues, see below.

[...]

> >> Eric thought that VIRT_HIGH_GIC_REDIST2 and VIRT_HIGH_PCIE_MMIO regions
> >> aren't user selectable. I tended to explain why it's not true. 'maxmem='
> >> can affect the outcome. When 'maxmem=' value is big enough, there will be
> >> no free area in the PA space to hold those two regions.
> > 
> > Right, that's an interesting point. This is a consequence of these
> > upper regions floating above RAM.
> > 
> 
> Yep, it's fine for those high memory regions to float above RAM, and to
> disable them if we run out of PA space. Something that may be off-topic:
> the VIRT_HIGH_PCIE_MMIO region has 512GB, which is a huge one.
> It may be nice to fall back to smaller sizes when the PA space is
> tight. For example, we could try 256GB if 512GB doesn't fit
> into the PA space.
> 
> However, I'm not sure how we had the size (512GB) for the region. Are
> there practical factors why we need a 512GB PCIe 64-bits MMIO space?

I have no idea. But this is something that is now relied upon by
existing VMs, and I'm not sure we can break this.

> >>>>>> Improve address assignment for highmem IO regions to avoid the waste
> >>>>>> in the PA space by putting the logic into virt_memmap_fits().
> >>> 
> >>> I guess that this is what I understand the least. What do you mean by
> >>> "wasted PA space"? Either the regions fit in the PA space, and
> >>> computing their addresses in relevant, or they fall outside of it and
> >>> what we stick in memap[index].base is completely irrelevant.
> >>> 
> >> 
> >> It's possible that we run into the following combination. we should
> >> have enough PA space to enable VIRT_HIGH_PCIE_MMIO region. However,
> >> the region is disabled in the original implementation because
> >> VIRT_HIGH_{GIC_REDIST2, PCIE_ECAM} regions consumed 1GB, which is
> >> unnecessary and a waste of PA space.
> >> 
> >>  static MemMapEntry extended_memmap[] = {
> >>  [VIRT_HIGH_GIC_REDIST2] =   { 0x0, 64 * MiB },
> >>  [VIRT_HIGH_PCIE_ECAM] = { 0x0, 256 * MiB },
> >>  [VIRT_HIGH_PCIE_MMIO] = { 0x0, 512 * GiB },
> >>  };
> >> 
> >>  IPA_LIMIT   = (1UL << 40)
> >>  '-maxmem'   = 511GB  /* Memory starts from 1GB */
> >>  '-slots'= 0
> >>  vms->highmem_rdist2 = false
> >>  vms->highmem_ecam   = false
> >>  vms->highmem_mmio   = true
> >> 
> > 
> > Sure. But that's not how QEMU works today, and these regions are
> > enabled in order (though it could well be that my recent changes have
> > made the situation more complicated).
> > 
> > What you're suggesting is a pretty radical change in the way the
> > memory map is set. My hunch is that allowing the highmem IO regions to
> > be selectively enabled and allowed to float in the high IO space
> > should come as a new virt machine revision, with user-visible options.
> > 
> 
> Yeah, these regions are enabled in order. It also means they have ascending
> priorities. In other words, '-maxmem' has higher priority than those 3 high
> memory regions.
> 
> My suggested code changes just improve the address assignment for these 3
> high memory regions, without changing the mechanism fundamentally. The
> intention of the proposed changes are like below.
> 
> - In virt_set_memmap(), don't assign address for one specific high memory
>   region if it has been disabled.
> 
> - Put the logic into standalone helper, to simplify the code.
> 
> I'

Re: [PATCH 1/2] hw/arm/virt: Improve address assignment for highmem IO regions

2022-08-08 Thread Marc Zyngier
On Wed, 03 Aug 2022 14:02:04 +0100,
Gavin Shan  wrote:
> 
> Hi Marc,
> 
> On 8/3/22 5:01 PM, Marc Zyngier wrote:
> > On Wed, 03 Aug 2022 04:01:04 +0100,
> > Gavin Shan  wrote:
> >> On 8/2/22 7:41 PM, Eric Auger wrote:
> >>> On 8/2/22 08:45, Gavin Shan wrote:
> >>>> There are 3 highmem IO regions as below. They can be disabled in
> >>>> two situations: (a) The specific region is disabled by user. (b)
> >>>> The specific region doesn't fit in the PA space. However, the base
> >>>> address and highest_gpa are still updated no matter if the region
> >>>> is enabled or disabled. It's incorrectly incurring waste in the PA
> >>>> space.
> >>> If I am not wrong highmem_redists and highmem_mmio are not user selectable
> >>> 
> >>> Only highmem ecam depends on machine type & ACPI setup. But I would say
> >>> that in server use case it is always set. So is that optimization really
> >>> needed?
> >> 
> >> There are two other cases you missed.
> >> 
> >> - highmem_ecam is enabled after virt-2.12, meaning it stays disabled
> >>before that.
> > 
> > I don't get this. The current behaviour is to disable highmem_ecam if
> > it doesn't fit in the PA space. I can't see anything that enables it
> > if it was disabled in the first place.
> > 
> 
> There are several places or conditions where vms->highmem_ecam can be
> disabled:
> 
> - virt_instance_init() where vms->highmem_ecam is inherited from
>   !vmc->no_highmem_ecam. The option is set to true after virt-2.12
>   in virt_machine_2_12_options().
> 
> - machvirt_init(), where vms->highmem_ecam can be disabled if we have
>   32-bit vCPUs and a failure on loading firmware.

Right. But at no point do we *enable* something that was disabled
beforehand, which is how I understood your previous comment.

> 
> - Another place is where we're talking about. It's address assignment
>   to fit the PA space.

Alignment? No, the alignment is cast into stone: it is set to the
smallest power-of-two containing the region (natural alignment).

> 
> >> 
> >> - The high memory region can be disabled if the user asks for a large
> >>(normal) memory space through the 'maxmem=' option. When the requested
> >>memory by 'maxmem=' is large enough, the high memory regions are
> >>disabled. It means the normal memory has higher priority than those
> >>high memory regions. This is the case I provided in (b) of the
> >>commit log.
> > 
> > Why is that a problem? It matches the expected behaviour, as the
> > highmem IO region is floating and is pushed up by the memory region.
> > 
> 
> Eric thought that VIRT_HIGH_GIC_REDIST2 and VIRT_HIGH_PCIE_MMIO regions
> aren't user selectable. I tended to explain why it's not true. 'maxmem='
> can affect the outcome. When 'maxmem=' value is big enough, there will be
> no free area in the PA space to hold those two regions.

Right, that's an interesting point. This is a consequence of these
upper regions floating above RAM.

> 
> >> 
> >> In the commit log, I was supposed to say something like below for
> >> (a):
> >> 
> >> - The specific high memory region can be disabled through changing
> >>the code by user or developer. For example, 'vms->highmem_mmio'
> >>is changed from true to false in virt_instance_init().
> > 
> > Huh. By this principle, the user can change anything. Why is it
> > important?
> > 
> 
> Still like above. I was explaining the possible cases where those
> 3 switches can be turned on/off by users or developers. Our code
> needs to be consistent and comprehensive.
> 
>   vms->highmem_redists
>   vms->highmem_ecam
>   vms->highmem_mmio
> 
> >> 
> >>>> 
> >>>> Improve address assignment for highmem IO regions to avoid the waste
> >>>> in the PA space by putting the logic into virt_memmap_fits().
> > 
> > I guess that this is what I understand the least. What do you mean by
> > "wasted PA space"? Either the regions fit in the PA space, and
> > computing their addresses is relevant, or they fall outside of it and
> > what we stick in memmap[index].base is completely irrelevant.
> > 
> 
> It's possible that we run into the following combination. we should
> have enough PA space to enable VIRT_HIGH_PCIE_MMIO region. However,
> the region is disabled in the original implementa

Re: [PATCH 1/2] hw/arm/virt: Improve address assignment for highmem IO regions

2022-08-03 Thread Marc Zyngier
On Wed, 03 Aug 2022 04:01:04 +0100,
Gavin Shan  wrote:
> 
> Hi Eric,
> 
> On 8/2/22 7:41 PM, Eric Auger wrote:
> > On 8/2/22 08:45, Gavin Shan wrote:
> >> There are 3 highmem IO regions as below. They can be disabled in
> >> two situations: (a) The specific region is disabled by user. (b)
> >> The specific region doesn't fit in the PA space. However, the base
> >> address and highest_gpa are still updated no matter if the region
> >> is enabled or disabled. It's incorrectly incurring waste in the PA
> >> space.
> > If I am not wrong highmem_redists and highmem_mmio are not user selectable
> > 
> > Only highmem ecam depends on machine type & ACPI setup. But I would say
> > that in server use case it is always set. So is that optimization really
> > needed?
> 
> There are two other cases you missed.
> 
> - highmem_ecam is enabled after virt-2.12, meaning it stays disabled
>   before that.

I don't get this. The current behaviour is to disable highmem_ecam if
it doesn't fit in the PA space. I can't see anything that enables it
if it was disabled in the first place.

> 
> - The high memory region can be disabled if the user asks for a large
>   (normal) memory space through the 'maxmem=' option. When the requested
>   memory by 'maxmem=' is large enough, the high memory regions are
>   disabled. It means the normal memory has higher priority than those
>   high memory regions. This is the case I provided in (b) of the
>   commit log.

Why is that a problem? It matches the expected behaviour, as the
highmem IO region is floating and is pushed up by the memory region.

> 
> In the commit log, I was supposed to say something like below for
> (a):
> 
> - The specific high memory region can be disabled through changing
>   the code by user or developer. For example, 'vms->highmem_mmio'
>   is changed from true to false in virt_instance_init().

Huh. By this principle, the user can change anything. Why is it
important?

> 
> >> 
> >> Improve address assignment for highmem IO regions to avoid the waste
> >> in the PA space by putting the logic into virt_memmap_fits().

I guess that this is what I understand the least. What do you mean by
"wasted PA space"? Either the regions fit in the PA space, and
computing their addresses is relevant, or they fall outside of it and
what we stick in memmap[index].base is completely irrelevant.

> >> 
> >> Signed-off-by: Gavin Shan 
> >> ---
> >>   hw/arm/virt.c | 54 +--
> >>   1 file changed, 31 insertions(+), 23 deletions(-)
> >> 
> >> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> >> index 9633f822f3..bc0cd218f9 100644
> >> --- a/hw/arm/virt.c
> >> +++ b/hw/arm/virt.c
> >> @@ -1688,6 +1688,34 @@ static uint64_t 
> >> virt_cpu_mp_affinity(VirtMachineState *vms, int idx)
> >>   return arm_cpu_mp_affinity(idx, clustersz);
> >>   }
> >>   +static void virt_memmap_fits(VirtMachineState *vms, int index,
> >> + bool *enabled, hwaddr *base, int pa_bits)
> >> +{
> >> +hwaddr size = extended_memmap[index].size;
> >> +
> >> +/* The region will be disabled if its size isn't given */
> >> +if (!*enabled || !size) {
> > In which case do you have !size?
> 
> Yeah, we don't have !size and the condition should be removed.
> 
> >> +*enabled = false;
> >> +vms->memmap[index].base = 0;
> >> +vms->memmap[index].size = 0;
> > It looks dangerous to me to reset the region's base and size like that.
> > For instance, fdt_add_gic_node() will add dummy data in the dt.
> 
> I would guess you missed that the high memory regions won't be exported
> through device-tree if they have been disabled. We have a check there,
> which is "if (nb_redist_regions == 1)"
> 
> >> +return;
> >> +}
> >> +
> >> +/*
> >> + * Check if the memory region fits in the PA space. The memory map
> >> + * and highest_gpa are updated if it fits. Otherwise, it's disabled.
> >> + */
> >> +*enabled = (ROUND_UP(*base, size) + size <= BIT_ULL(pa_bits));
> > using a 'fits' local variable would make the code more obvious I think
> 
> Let's confirm if you're suggesting something like below?
> 
> bool fits;
> 
> fits = (ROUND_UP(*base, size) + size <= BIT_ULL(pa_bits));
> 
> if (fits) {
>/* update *base, memory mapping, highest_gpa */
> } else {
>*enabled = false;
> }
> 
> I guess we can simply do
> 
> if (ROUND_UP(*base, size) + size <= BIT_ULL(pa_bits)) {
>/* update *base, memory mapping, highest_gpa */
> } else {
>*enabled = false;
> }
> 
> Please let me know which one looks best to you.

Why should the 'enabled' flag be updated by this function, instead of
returning the value and keeping it as an assignment in the caller
function? It is purely stylistic though.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [QUESTION] Exception print when enabling GICv4

2022-07-12 Thread Marc Zyngier
On Wed, 13 Jul 2022 07:02:10 +0100,
"chenxiang (M)"  wrote:
> 
> Hi Marc,
> 
> Thank you for your reply.
> 
> 在 2022/7/12 23:25, Marc Zyngier 写道:
> > Hi Xiang,
> > 
> > On Tue, 12 Jul 2022 13:55:16 +0100,
> > "chenxiang (M)"  wrote:
> >> Hi,
> >> I encountered an issue related to GICv4 enablement on an ARM64 platform
> >> (kernel 5.19-rc4, qemu 6.2.0):
> >> We have an acceleration module whose VF has 3 MSI interrupts, and we
> >> pass it through to a virtual machine with the following steps:
> >> 
> >> echo :79:00.1 > /sys/bus/pci/drivers/hisi_hpre/unbind
> >> echo vfio-pci >
> >> /sys/devices/pci\:78/\:78\:00.0/\:79\:00.1/driver_override
> >> echo :79:00.1 > /sys/bus/pci/drivers_probe
> >> 
> >> Then we boot VM with "-device vfio-pci,host=79:00.1,id=net0 \".
> >> When insmod the driver which registers 3 PCI MSI interrupts in VM,
> >> some exception print occur as following:
> >> 
> >> vfio-pci :3a:00.1: irq bypass producer (token 8f08224d)
> >> registration fails: 66311
> >> 
> >> I find that bit[6:4] of register PCI_MSI_FLAGS is 2 (4 MSI interrupts)
> >> though we only register 3 PCI MSI interrupts,
> >> and only 3 MSI interrupts are activated in the end.
> >> It allocates 4 vectors in function vfio_msi_enable() (qemu) as it
> >> reads the register PCI_MSI_FLAGS.
> >> Later it calls the VFIO_DEVICE_SET_IRQS ioctl to set forwarding
> >> for those interrupts
> >> using function kvm_vgic_v4_set_forwarding() as GICv4 is enabled. For
> >> interrupts 0~2, it succeeds in setting forwarding as they are already
> >> activated, but the 4th interrupt is not activated, so its ITE is not
> >> found in function vgic_its_resolve_lpi(), and the above printk occurs.
> >> 
> >> It seems that we only allocate and activate 3 MSI interrupts in the
> >> guest, while it tries to set forwarding for 4 MSI interrupts in the host.
> >> Do you have any idea about this issue?
> > I have a hunch: QEMU cannot know that the guest is only using 3 MSIs
> > out of the 4 that the device can use, and PCI/Multi-MSI only has a
> > single enable bit for all MSIs. So it probably iterates over all
> > possible MSIs and enables the forwarding. Since the guest has only
> > created 3 mappings in the virtual ITS, the last call fails. I would
> > expect the guest to still work properly though.
> 
> Yes, that's the reason of exception print.
> Is it possible for QEMU to get the exact number of interrupts the guest
> is using? It seems not.

Not really. Or rather, this is a pretty involved process: you'd need
to stop the guest, perform a save operation on the ITS (as if you were
doing a migration), and then introspect the ITS tables to find whether
there is a mapping for each of the possible events generated by the
device. Clearly, that's overkill.

A better approach would be to be able to retrieve an individual
mapping, using a new API that would be similar to KVM_SIGNAL_MSI. It
would take the same kvm_msi structure as input, and retrieving the
{LPI, CPU} pair or an error if there is no mapping.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [QUESTION] Exception print when enabling GICv4

2022-07-12 Thread Marc Zyngier
Hi Xiang,

On Tue, 12 Jul 2022 13:55:16 +0100,
"chenxiang (M)"  wrote:
> 
> Hi,
> I encountered an issue related to GICv4 enablement on an ARM64 platform
> (kernel 5.19-rc4, qemu 6.2.0):
> We have an acceleration module whose VF has 3 MSI interrupts, and we
> pass it through to a virtual machine with the following steps:
> 
> echo :79:00.1 > /sys/bus/pci/drivers/hisi_hpre/unbind
> echo vfio-pci >
> /sys/devices/pci\:78/\:78\:00.0/\:79\:00.1/driver_override
> echo :79:00.1 > /sys/bus/pci/drivers_probe
> 
> Then we boot the VM with "-device vfio-pci,host=79:00.1,id=net0 \".
> When we insmod the driver, which registers 3 PCI MSI interrupts in the VM,
> the following exception message occurs:
> 
> vfio-pci :3a:00.1: irq bypass producer (token 8f08224d)
> registration fails: 66311
> 
> I find that bit[6:4] of register PCI_MSI_FLAGS is 2 (4 MSI interrupts)
> though we only register 3 PCI MSI interrupts,
> and only 3 MSI interrupts are activated in the end.
> It allocates 4 vectors in function vfio_msi_enable() (qemu) as it
> reads the register PCI_MSI_FLAGS.
> Later it calls the VFIO_DEVICE_SET_IRQS ioctl to set forwarding
> for those interrupts
> using function kvm_vgic_v4_set_forwarding() as GICv4 is enabled. For
> interrupts 0~2, it succeeds in setting forwarding as they are already
> activated, but the 4th interrupt is not activated, so its ITE is not
> found in function vgic_its_resolve_lpi(), and the above printk occurs.
> 
> It seems that we only allocate and activate 3 MSI interrupts in the
> guest, while it tries to set forwarding for 4 MSI interrupts in the host.
> Do you have any idea about this issue?

I have a hunch: QEMU cannot know that the guest is only using 3 MSIs
out of the 4 that the device can use, and PCI/Multi-MSI only has a
single enable bit for all MSIs. So it probably iterates over all
possible MSIs and enables the forwarding. Since the guest has only
created 3 mappings in the virtual ITS, the last call fails. I would
expect the guest to still work properly though.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH kvm-unit-tests] MAINTAINERS: Change drew's email address

2022-06-23 Thread Marc Zyngier
On Thu, 23 Jun 2022 14:10:17 +0100,
Andrew Jones  wrote:
> 
> As a side effect of leaving Red Hat I won't be able to use my Red Hat
> email address anymore. I'm also changing the name of my gitlab group.
> 
> Signed-off-by: Andrew Jones 
> Signed-off-by: Andrew Jones 

Acked-by: Marc Zyngier 

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [RFC 1/3] serial: Enable MSI capablity and option

2022-04-12 Thread Marc Zyngier

On 2022-04-12 03:10, Atish Patra wrote:

The serial-pci device doesn't support MSI. Enable the device to provide
MSI so that platforms supporting only MSI can also use this serial
device. MSI can be enabled via the newly introduced device property.
It is disabled by default, preserving the current behavior of the
serial-pci device.


This seems really odd. Switching to MSI implies that you now have
edge signalling. This means that the guest will not be interrupted
again if it acks the MSI and doesn't service the device, as you'd
expect for a level interrupt (which is what the device generates today).

From what I understand of the patch, you signal an MSI on each
transition of the device state, which is equally odd (you get
an interrupt even when the device goes idle?).

While this may work for some guests, this completely changes the
semantics of the device. You may want to at least document the new
behaviour.

Thanks,

M.



Signed-off-by: Atish Patra 
---
 hw/char/serial-pci.c | 36 +---
 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/hw/char/serial-pci.c b/hw/char/serial-pci.c
index 93d6f9924425..ca93c2ce2776 100644
--- a/hw/char/serial-pci.c
+++ b/hw/char/serial-pci.c
@@ -31,6 +31,7 @@
 #include "hw/char/serial.h"
 #include "hw/irq.h"
 #include "hw/pci/pci.h"
+#include "hw/pci/msi.h"
 #include "hw/qdev-properties.h"
 #include "migration/vmstate.h"
 #include "qom/object.h"
@@ -39,26 +40,54 @@ struct PCISerialState {
 PCIDevice dev;
 SerialState state;
 uint8_t prog_if;
+bool msi_enabled;
 };

 #define TYPE_PCI_SERIAL "pci-serial"
 OBJECT_DECLARE_SIMPLE_TYPE(PCISerialState, PCI_SERIAL)

+
+static void msi_irq_handler(void *opaque, int irq_num, int level)
+{
+PCIDevice *pci_dev = opaque;
+
+assert(level == 0 || level == 1);
+
+if (msi_enabled(pci_dev)) {
+msi_notify(pci_dev, 0);
+}
+}
+
 static void serial_pci_realize(PCIDevice *dev, Error **errp)
 {
 PCISerialState *pci = DO_UPCAST(PCISerialState, dev, dev);
 SerialState *s = &pci->state;
+Error *err = NULL;
+int ret;

 if (!qdev_realize(DEVICE(s), NULL, errp)) {
 return;
 }

 pci->dev.config[PCI_CLASS_PROG] = pci->prog_if;
-pci->dev.config[PCI_INTERRUPT_PIN] = 0x01;
-s->irq = pci_allocate_irq(&pci->dev);
-
+if (pci->msi_enabled) {
+pci->dev.config[PCI_INTERRUPT_PIN] = 0x00;
+s->irq = qemu_allocate_irq(msi_irq_handler, &pci->dev, 100);
+} else {
+pci->dev.config[PCI_INTERRUPT_PIN] = 0x01;
+s->irq = pci_allocate_irq(&pci->dev);
+}
 memory_region_init_io(&s->io, OBJECT(pci), &serial_io_ops, s, "serial", 8);

 pci_register_bar(&pci->dev, 0, PCI_BASE_ADDRESS_SPACE_IO, &s->io);
+
+if (!pci->msi_enabled) {
+return;
+}
+
+ret = msi_init(&pci->dev, 0, 1, true, false, &err);
+if (ret == -ENOTSUP) {
+fprintf(stdout, "MSIX INIT FAILED\n");
+}
 }

 static void serial_pci_exit(PCIDevice *dev)
@@ -83,6 +112,7 @@ static const VMStateDescription vmstate_pci_serial = {


 static Property serial_pci_properties[] = {
 DEFINE_PROP_UINT8("prog_if",  PCISerialState, prog_if, 0x02),
+DEFINE_PROP_BOOL("msi",  PCISerialState, msi_enabled, false),
 DEFINE_PROP_END_OF_LIST(),
 };


--
Jazz is not dead. It just smells funny...



Re: [PATCH] target/arm: Report KVM's actual PSCI version to guest in dtb

2022-02-24 Thread Marc Zyngier via

On 2022-02-24 13:46, Peter Maydell wrote:

When we're using KVM, the PSCI implementation is provided by the
kernel, but QEMU has to tell the guest about it via the device tree.
Currently we look at the KVM_CAP_ARM_PSCI_0_2 capability to determine
if the kernel is providing at least PSCI 0.2, but if the kernel
provides a newer version than that we will still only tell the guest
it has PSCI 0.2.  (This is fairly harmless; it just means the guest
won't use newer parts of the PSCI API.)

The kernel exposes the specific PSCI version it is implementing via
the ONE_REG API; use this to report in the dtb that the PSCI
implementation is 1.0-compatible if appropriate.  (The device tree
binding currently only distinguishes "pre-0.2", "0.2-compatible" and
"1.0-compatible".)

Signed-off-by: Peter Maydell 


Reviewed-by: Marc Zyngier 

M.
--
Who you jivin' with that Cosmik Debris?



Re: [PULL 18/38] hw/arm/virt: Honor highmem setting when computing the memory map

2022-02-13 Thread Marc Zyngier
[+ Alex for HVF]

On Sun, 13 Feb 2022 05:05:33 +,
Akihiko Odaki  wrote:
> 
> On 2022/01/20 21:36, Peter Maydell wrote:
> > From: Marc Zyngier 
> > 
> > Even when the VM is configured with highmem=off, the highest_gpa
> > field includes devices that are above the 4GiB limit.
> > Similarly, nothing seems to check that the memory is within
> > the limit set by the highmem=off option.
> > 
> > This leads to failures in virt_kvm_type() on systems that have
> > a crippled IPA range, as the reported IPA space is larger than
> > what it should be.
> > 
> > Instead, honor the user-specified limit to only use the devices
> > at the lowest end of the spectrum, and fail if we have memory
> > crossing the 4GiB limit.
> > 
> > Reviewed-by: Andrew Jones 
> > Reviewed-by: Eric Auger 
> > Signed-off-by: Marc Zyngier 
> > Message-id: 20220114140741.1358263-4-...@kernel.org
> > Signed-off-by: Peter Maydell 
> > ---
> >   hw/arm/virt.c | 10 +++---
> >   1 file changed, 7 insertions(+), 3 deletions(-)
> > 
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > index 62bdce1eb4b..3b839ba78ba 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -1670,7 +1670,7 @@ static uint64_t virt_cpu_mp_affinity(VirtMachineState 
> > *vms, int idx)
> >   static void virt_set_memmap(VirtMachineState *vms)
> >   {
> >   MachineState *ms = MACHINE(vms);
> > -hwaddr base, device_memory_base, device_memory_size;
> > +hwaddr base, device_memory_base, device_memory_size, memtop;
> >   int i;
> > vms->memmap = extended_memmap;
> > @@ -1697,7 +1697,11 @@ static void virt_set_memmap(VirtMachineState *vms)
> >   device_memory_size = ms->maxram_size - ms->ram_size + ms->ram_slots * 
> > GiB;
> > /* Base address of the high IO region */
> > -base = device_memory_base + ROUND_UP(device_memory_size, GiB);
> > +memtop = base = device_memory_base + ROUND_UP(device_memory_size, GiB);
> > +if (!vms->highmem && memtop > 4 * GiB) {
> > +error_report("highmem=off, but memory crosses the 4GiB limit\n");
> > +exit(EXIT_FAILURE);
> > +}
> >   if (base < device_memory_base) {
> >   error_report("maxmem/slots too huge");
> >   exit(EXIT_FAILURE);
> > @@ -1714,7 +1718,7 @@ static void virt_set_memmap(VirtMachineState *vms)
> >   vms->memmap[i].size = size;
> >   base += size;
> >   }
> > -vms->highest_gpa = base - 1;
> > +vms->highest_gpa = (vms->highmem ? base : memtop) - 1;
> >   if (device_memory_size > 0) {
> >   ms->device_memory = g_malloc0(sizeof(*ms->device_memory));
> >   ms->device_memory->base = device_memory_base;
> 
> Hi,
> This breaks in a case where highmem is disabled but can have more than
> 4 GiB of RAM. M1 (Apple Silicon) actually can have 36-bit PA with HVF,
> which is not enough for highmem MMIO but is enough to contain 32 GiB
> of RAM.

Funny. The whole point of this series is to make it all work correctly
on M1.

> Where the magic number of 4 GiB / 32-bit came from?

Not exactly a magic number. From QEMU's docs/system/arm/virt.rst:


highmem
  Set ``on``/``off`` to enable/disable placing devices and RAM in physical
  address space above 32 bits. The default is ``on`` for machine types
  later than ``virt-2.12``.


TL;DR: Removing the bogus 'highmem=off' option from your command-line
should get you going with large memory spaces, up to the IPA limit.

The fact that you could run with 32GB of RAM while mandating that the
guest IPA space was limited to 32bit was nothing but a bug, further
"exploited" by HVF to allow disabling the highmem devices which are
out of reach given the HW limitations (see [1] for details on the
discussion, specially around patch 3).

This is now fixed, and has been extended to work with any IPA size
(including 36bit machines such as M1).

> I also don't quite understand what failures virt_kvm_type() had.

QEMU works by first computing the memory map and passing the required
IPA limit to KVM as part of the VM type. By failing to take into
account the initial constraints on the IPA space (either via a
command-line option such as 'highmem', or by using the value provided
by KVM itself), QEMU would try to create a VM that cannot run on the
HW, and KVM would simply return an error.

All of this is documented as part of the KVM/arm64 API [2]. And with
this fixed, QEMU is able to correctly drive KVM on M1.
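As a rough standalone sketch of that ordering — derive the IPA size from the highest guest physical address, then have VM creation fail if it exceeds what the HW supports (the bit-counting and the check are illustrative; the real mechanism is the IPA size encoded in the VM type passed to KVM_CREATE_VM):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch, not QEMU/KVM code. */

/* Smallest IPA size (in bits) covering highest_gpa; KVM's floor is 32. */
static int ipa_bits_for(uint64_t highest_gpa)
{
    int bits = 32;
    while (bits < 64 && (highest_gpa >> bits) != 0) {
        bits++;
    }
    return bits;
}

/* Model of KVM rejecting a VM type that asks for more IPA bits than
 * the HW provides (as on a 36-bit M1). */
static int kvm_vm_type_check(int requested_ipa_bits, int host_max_ipa_bits)
{
    return requested_ipa_bits <= host_max_ipa_bits ? 0 : -1;
}
```

The bug being fixed is the first step producing a `highest_gpa` (and thus a requested IPA size) larger than the user-visible limits implied, so the second step failed.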

M.

[1] https://lore.kernel.org/all/2021082211.1290891-1-...@kernel.org
[2] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/virt/kvm/api.rst#n138

-- 
Without deviation from the norm, progress is not possible.



Re: Raspberry Pi?

2022-01-26 Thread Marc Zyngier

On 2022-01-26 02:59, Philippe Mathieu-Daudé via wrote:

Hi,

On 26/1/22 00:59, Kenneth Adam Miller wrote:

Hello all,

I would like to emulate something on a pi so that I don't have to 
pay as high of a translation penalty since the guest and host will 
share the same arch. I'm finding that on some forums that people have 
been having trouble getting QEMU to run on raspberry pi. The posts are 
kind of old, in 2019.


Does anyone know if this has been addressed since then?


What you ask is whether you can run an AArch64 guest (virt machine?) on a
Raspi4 host, is that right? IIRC it should work straight away using
"-machine virt,gic-version=host". Cc'ing qemu-arm@ list to verify.


Note that only an RPi-4 will provide any sort of performance, assuming
the OP wants to use KVM as the acceleration backend.

The original RPi has no support for virtualisation (ARM 1176), and
the two following models are deprived of a GIC, making them a bit
useless (we have *some* support code in KVM, but I'm pretty sure it
has bitrot by now).

M.
--
Jazz is not dead. It just smells funny...



[PATCH v5 6/6] hw/arm/virt: Drop superfluous checks against highmem

2022-01-14 Thread Marc Zyngier
Now that the devices present in the extended memory map are checked
against the available PA space and disabled when they don't fit,
there is no need to keep the same checks against highmem, as
highmem really is a shortcut for the PA space being 32bit.

Reviewed-by: Eric Auger 
Signed-off-by: Marc Zyngier 
---
 hw/arm/virt-acpi-build.c | 2 --
 hw/arm/virt.c| 5 +
 2 files changed, 1 insertion(+), 6 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 0757c28f69..449fab0080 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -947,8 +947,6 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
 acpi_add_table(table_offsets, tables_blob);
 build_fadt_rev5(tables_blob, tables->linker, vms, dsdt);
 
-vms->highmem_redists &= vms->highmem;
-
 acpi_add_table(table_offsets, tables_blob);
 build_madt(tables_blob, tables->linker, vms);
 
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 053791cc44..4524f3807d 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2171,9 +2171,6 @@ static void machvirt_init(MachineState *machine)
 
 virt_flash_fdt(vms, sysmem, secure_sysmem ?: sysmem);
 
-vms->highmem_mmio &= vms->highmem;
-vms->highmem_redists &= vms->highmem;
-
 create_gic(vms, sysmem);
 
 virt_cpu_post_init(vms, sysmem);
@@ -2192,7 +2189,7 @@ static void machvirt_init(MachineState *machine)
machine->ram_size, "mach-virt.tag");
 }
 
-vms->highmem_ecam &= vms->highmem && (!firmware_loaded || aarch64);
+vms->highmem_ecam &= (!firmware_loaded || aarch64);
 
 create_rtc(vms);
 
-- 
2.30.2




[PATCH v5 1/6] hw/arm/virt: Add a control for the highmem PCIe MMIO

2022-01-14 Thread Marc Zyngier
Just like we can control the enablement of the highmem PCIe ECAM
region using highmem_ecam, let's add a control for the highmem
PCIe MMIO region.

Similarly to highmem_ecam, this region is disabled when highmem
is off.

Signed-off-by: Marc Zyngier 
---
 hw/arm/virt-acpi-build.c | 10 --
 hw/arm/virt.c|  7 +--
 include/hw/arm/virt.h|  1 +
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index f2514ce77c..449fab0080 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -158,10 +158,9 @@ static void acpi_dsdt_add_virtio(Aml *scope,
 }
 
 static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
-  uint32_t irq, bool use_highmem, bool highmem_ecam,
-  VirtMachineState *vms)
+  uint32_t irq, VirtMachineState *vms)
 {
-int ecam_id = VIRT_ECAM_ID(highmem_ecam);
+int ecam_id = VIRT_ECAM_ID(vms->highmem_ecam);
 struct GPEXConfig cfg = {
 .mmio32 = memmap[VIRT_PCIE_MMIO],
 .pio= memmap[VIRT_PCIE_PIO],
@@ -170,7 +169,7 @@ static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
 .bus= vms->bus,
 };
 
-if (use_highmem) {
+if (vms->highmem_mmio) {
 cfg.mmio64 = memmap[VIRT_HIGH_PCIE_MMIO];
 }
 
@@ -869,8 +868,7 @@ build_dsdt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
 acpi_dsdt_add_fw_cfg(scope, &memmap[VIRT_FW_CFG]);
 acpi_dsdt_add_virtio(scope, &memmap[VIRT_MMIO],
 (irqmap[VIRT_MMIO] + ARM_SPI_BASE), NUM_VIRTIO_TRANSPORTS);
-acpi_dsdt_add_pci(scope, memmap, (irqmap[VIRT_PCIE] + ARM_SPI_BASE),
-  vms->highmem, vms->highmem_ecam, vms);
+acpi_dsdt_add_pci(scope, memmap, irqmap[VIRT_PCIE] + ARM_SPI_BASE, vms);
 if (vms->acpi_dev) {
 build_ged_aml(scope, "\\_SB."GED_DEVICE,
   HOTPLUG_HANDLER(vms->acpi_dev),
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index b45b52c90e..ed8ea96acc 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1412,7 +1412,7 @@ static void create_pcie(VirtMachineState *vms)
  mmio_reg, base_mmio, size_mmio);
 memory_region_add_subregion(get_system_memory(), base_mmio, mmio_alias);
 
-if (vms->highmem) {
+if (vms->highmem_mmio) {
 /* Map high MMIO space */
 MemoryRegion *high_mmio_alias = g_new0(MemoryRegion, 1);
 
@@ -1466,7 +1466,7 @@ static void create_pcie(VirtMachineState *vms)
 qemu_fdt_setprop_sized_cells(ms->fdt, nodename, "reg",
  2, base_ecam, 2, size_ecam);
 
-if (vms->highmem) {
+if (vms->highmem_mmio) {
 qemu_fdt_setprop_sized_cells(ms->fdt, nodename, "ranges",
  1, FDT_PCI_RANGE_IOPORT, 2, 0,
  2, base_pio, 2, size_pio,
@@ -2105,6 +2105,8 @@ static void machvirt_init(MachineState *machine)
 
 virt_flash_fdt(vms, sysmem, secure_sysmem ?: sysmem);
 
+vms->highmem_mmio &= vms->highmem;
+
 create_gic(vms, sysmem);
 
 virt_cpu_post_init(vms, sysmem);
@@ -2802,6 +2804,7 @@ static void virt_instance_init(Object *obj)
 vms->gic_version = VIRT_GIC_VERSION_NOSEL;
 
 vms->highmem_ecam = !vmc->no_highmem_ecam;
+vms->highmem_mmio = true;
 
 if (vmc->no_its) {
 vms->its = false;
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index dc6b66ffc8..9c54acd10d 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -143,6 +143,7 @@ struct VirtMachineState {
 bool secure;
 bool highmem;
 bool highmem_ecam;
+bool highmem_mmio;
 bool its;
 bool tcg_its;
 bool virt;
-- 
2.30.2




[PATCH v5 2/6] hw/arm/virt: Add a control for the highmem redistributors

2022-01-14 Thread Marc Zyngier
Just like we can control the enablement of the highmem PCIe region
using highmem_ecam, let's add a control for the highmem GICv3
redistributor region.

Similarly to highmem_ecam, these redistributors are disabled when
highmem is off.

Reviewed-by: Andrew Jones 
Signed-off-by: Marc Zyngier 
---
 hw/arm/virt-acpi-build.c | 2 ++
 hw/arm/virt.c| 2 ++
 include/hw/arm/virt.h| 4 +++-
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 449fab0080..0757c28f69 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -947,6 +947,8 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
 acpi_add_table(table_offsets, tables_blob);
 build_fadt_rev5(tables_blob, tables->linker, vms, dsdt);
 
+vms->highmem_redists &= vms->highmem;
+
 acpi_add_table(table_offsets, tables_blob);
 build_madt(tables_blob, tables->linker, vms);
 
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ed8ea96acc..e734a75850 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2106,6 +2106,7 @@ static void machvirt_init(MachineState *machine)
 virt_flash_fdt(vms, sysmem, secure_sysmem ?: sysmem);
 
 vms->highmem_mmio &= vms->highmem;
+vms->highmem_redists &= vms->highmem;
 
 create_gic(vms, sysmem);
 
@@ -2805,6 +2806,7 @@ static void virt_instance_init(Object *obj)
 
 vms->highmem_ecam = !vmc->no_highmem_ecam;
 vms->highmem_mmio = true;
+vms->highmem_redists = true;
 
 if (vmc->no_its) {
 vms->its = false;
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index 9c54acd10d..dc9fa26faa 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -144,6 +144,7 @@ struct VirtMachineState {
 bool highmem;
 bool highmem_ecam;
 bool highmem_mmio;
+bool highmem_redists;
 bool its;
 bool tcg_its;
 bool virt;
@@ -190,7 +191,8 @@ static inline int virt_gicv3_redist_region_count(VirtMachineState *vms)
 
 assert(vms->gic_version == VIRT_GIC_VERSION_3);
 
-return MACHINE(vms)->smp.cpus > redist0_capacity ? 2 : 1;
+return (MACHINE(vms)->smp.cpus > redist0_capacity &&
+vms->highmem_redists) ? 2 : 1;
 }
 
 #endif /* QEMU_ARM_VIRT_H */
-- 
2.30.2
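The last hunk's change to virt_gicv3_redist_region_count() amounts to the following decision (standalone sketch, not QEMU code):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the redistributor region count logic: a second (high)
 * redistributor region is only used when region 0 overflows AND the
 * highmem redistributors are still enabled. When they are disabled
 * and the CPUs don't fit, region sizing fails elsewhere. */
static int redist_region_count(int cpus, int redist0_capacity,
                               bool highmem_redists)
{
    return (cpus > redist0_capacity && highmem_redists) ? 2 : 1;
}
```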




[PATCH v5 3/6] hw/arm/virt: Honor highmem setting when computing the memory map

2022-01-14 Thread Marc Zyngier
Even when the VM is configured with highmem=off, the highest_gpa
field includes devices that are above the 4GiB limit.
Similarly, nothing seems to check that the memory is within
the limit set by the highmem=off option.

This leads to failures in virt_kvm_type() on systems that have
a crippled IPA range, as the reported IPA space is larger than
what it should be.

Instead, honor the user-specified limit to only use the devices
at the lowest end of the spectrum, and fail if we have memory
crossing the 4GiB limit.

Reviewed-by: Andrew Jones 
Reviewed-by: Eric Auger 
Signed-off-by: Marc Zyngier 
---
 hw/arm/virt.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index e734a75850..ecc3e3e5b0 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1663,7 +1663,7 @@ static uint64_t virt_cpu_mp_affinity(VirtMachineState *vms, int idx)
 static void virt_set_memmap(VirtMachineState *vms)
 {
 MachineState *ms = MACHINE(vms);
-hwaddr base, device_memory_base, device_memory_size;
+hwaddr base, device_memory_base, device_memory_size, memtop;
 int i;
 
 vms->memmap = extended_memmap;
@@ -1690,7 +1690,11 @@ static void virt_set_memmap(VirtMachineState *vms)
 device_memory_size = ms->maxram_size - ms->ram_size + ms->ram_slots * GiB;
 
 /* Base address of the high IO region */
-base = device_memory_base + ROUND_UP(device_memory_size, GiB);
+memtop = base = device_memory_base + ROUND_UP(device_memory_size, GiB);
+if (!vms->highmem && memtop > 4 * GiB) {
+error_report("highmem=off, but memory crosses the 4GiB limit\n");
+exit(EXIT_FAILURE);
+}
 if (base < device_memory_base) {
 error_report("maxmem/slots too huge");
 exit(EXIT_FAILURE);
@@ -1707,7 +1711,7 @@ static void virt_set_memmap(VirtMachineState *vms)
 vms->memmap[i].size = size;
 base += size;
 }
-vms->highest_gpa = base - 1;
+vms->highest_gpa = (vms->highmem ? base : memtop) - 1;
 if (device_memory_size > 0) {
 ms->device_memory = g_malloc0(sizeof(*ms->device_memory));
 ms->device_memory->base = device_memory_base;
-- 
2.30.2




[PATCH v5 4/6] hw/arm/virt: Use the PA range to compute the memory map

2022-01-14 Thread Marc Zyngier
The highmem attribute is nothing but another way to express the
PA range of a VM. To support HW that has a smaller PA range than
what QEMU assumes, pass this PA range to the virt_set_memmap()
function, allowing it to correctly exclude highmem devices
if they are outside of the PA range.

Signed-off-by: Marc Zyngier 
---
 hw/arm/virt.c | 64 +--
 1 file changed, 52 insertions(+), 12 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ecc3e3e5b0..a427676b50 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1660,7 +1660,7 @@ static uint64_t virt_cpu_mp_affinity(VirtMachineState *vms, int idx)
 return arm_cpu_mp_affinity(idx, clustersz);
 }
 
-static void virt_set_memmap(VirtMachineState *vms)
+static void virt_set_memmap(VirtMachineState *vms, int pa_bits)
 {
 MachineState *ms = MACHINE(vms);
 hwaddr base, device_memory_base, device_memory_size, memtop;
@@ -1678,6 +1678,14 @@ static void virt_set_memmap(VirtMachineState *vms)
 exit(EXIT_FAILURE);
 }
 
+/*
+ * !highmem is exactly the same as limiting the PA space to 32bit,
+ * irrespective of the underlying capabilities of the HW.
+ */
+if (!vms->highmem) {
+pa_bits = 32;
+}
+
 /*
  * We compute the base of the high IO region depending on the
  * amount of initial and device memory. The device memory start/size
@@ -1691,8 +1699,9 @@ static void virt_set_memmap(VirtMachineState *vms)
 
 /* Base address of the high IO region */
 memtop = base = device_memory_base + ROUND_UP(device_memory_size, GiB);
-if (!vms->highmem && memtop > 4 * GiB) {
-error_report("highmem=off, but memory crosses the 4GiB limit\n");
+if (memtop > BIT_ULL(pa_bits)) {
+   error_report("Addressing limited to %d bits, but memory exceeds it by %llu bytes\n",
+pa_bits, memtop - BIT_ULL(pa_bits));
 exit(EXIT_FAILURE);
 }
 if (base < device_memory_base) {
@@ -1711,7 +1720,13 @@ static void virt_set_memmap(VirtMachineState *vms)
 vms->memmap[i].size = size;
 base += size;
 }
-vms->highest_gpa = (vms->highmem ? base : memtop) - 1;
+
+/*
+ * If base fits within pa_bits, all good. If it doesn't, limit it
+ * to the end of RAM, which is guaranteed to fit within pa_bits.
+ */
+vms->highest_gpa = (base <= BIT_ULL(pa_bits) ? base : memtop) - 1;
+
 if (device_memory_size > 0) {
 ms->device_memory = g_malloc0(sizeof(*ms->device_memory));
 ms->device_memory->base = device_memory_base;
@@ -1902,12 +1917,43 @@ static void machvirt_init(MachineState *machine)
 unsigned int smp_cpus = machine->smp.cpus;
 unsigned int max_cpus = machine->smp.max_cpus;
 
+if (!cpu_type_valid(machine->cpu_type)) {
+error_report("mach-virt: CPU type %s not supported", machine->cpu_type);
+exit(1);
+}
+
+possible_cpus = mc->possible_cpu_arch_ids(machine);
+
 /*
  * In accelerated mode, the memory map is computed earlier in kvm_type()
  * to create a VM with the right number of IPA bits.
  */
 if (!vms->memmap) {
-virt_set_memmap(vms);
+Object *cpuobj;
+ARMCPU *armcpu;
+int pa_bits;
+
+/*
+ * Instanciate a temporary CPU object to find out about what
+ * we are about to deal with. Once this is done, get rid of
+ * the object.
+ */
+cpuobj = object_new(possible_cpus->cpus[0].type);
+armcpu = ARM_CPU(cpuobj);
+
+if (object_property_get_bool(cpuobj, "aarch64", NULL)) {
+pa_bits = arm_pamax(armcpu);
+} else if (arm_feature(&armcpu->env, ARM_FEATURE_LPAE)) {
+/* v7 with LPAE */
+pa_bits = 40;
+} else {
+/* Anything else */
+pa_bits = 32;
+}
+
+object_unref(cpuobj);
+
+virt_set_memmap(vms, pa_bits);
 }
 
 /* We can probe only here because during property set
@@ -1915,11 +1961,6 @@ static void machvirt_init(MachineState *machine)
  */
 finalize_gic_version(vms);
 
-if (!cpu_type_valid(machine->cpu_type)) {
-error_report("mach-virt: CPU type %s not supported", machine->cpu_type);
-exit(1);
-}
-
 if (vms->secure) {
 /*
  * The Secure view of the world is the same as the NonSecure,
@@ -1989,7 +2030,6 @@ static void machvirt_init(MachineState *machine)
 
 create_fdt(vms);
 
-possible_cpus = mc->possible_cpu_arch_ids(machine);
 assert(possible_cpus->len == max_cpus);
 for (n = 0; n < possible_cpus->len; n++) {
 Object *cpuobj;
@@ -2646,7 +2686,7 @@ static int virt_kvm_type(MachineState *ms, const char *type_str)
 max_vm_pa_size = kvm_arm_get_max_vm_ipa_size(ms, &fixed_ipa);
 
 /* we fr
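The temporary-CPU probe added to machvirt_init() above reduces to this decision (standalone sketch; pamax stands in for arm_pamax(), and the two flags mirror the aarch64/LPAE feature checks):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the PA-size selection: 64-bit CPUs report
 * their own PARange; 32-bit v7 with LPAE gets 40 bits; anything else
 * is limited to 32 bits. */
static int probe_pa_bits(bool aarch64, bool lpae, int pamax)
{
    if (aarch64) {
        return pamax;   /* 64-bit CPU: use its PARange */
    }
    if (lpae) {
        return 40;      /* v7 with LPAE */
    }
    return 32;          /* anything else */
}
```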

[PATCH v5 0/6] target/arm: Reduced-IPA space and highmem fixes

2022-01-14 Thread Marc Zyngier
Here's yet another stab at enabling QEMU on systems with
pathologically reduced IPA ranges such as the Apple M1 (previous
version at [1]). Eventually, we're able to run a KVM guest with more
than just 3GB of RAM on a system with a 36bit IPA space, and at most
123 vCPUs.

This also addresses some pathological QEMU behaviours, where the
highmem property is used as a flag allowing exposure of devices that
can't possibly fit in the PA space of the VM, resulting in a guest
failure.

In the end, we generalise the notion of PA space when exposing
individual devices in the expanded memory map, and treat highmem as
another flavour of PA space restriction.

This series does a few things:

- introduce new attributes to control the enabling of the highmem
  GICv3 redistributors and the highmem PCIe MMIO range

- correctly cap the PA range when highmem is off

- generalise the highmem behaviour to any PA range

- disable each highmem device region that doesn't fit in the PA range

- cleanup uses of highmem outside of virt_set_memmap()

This has been tested on an M1-based Mac-mini running Linux v5.16-rc6
with both KVM and TCG.

* From v4: [1]

  - Moved cpu_type_valid() check before we compute the memory map
  - Drop useless MAX() when computing highest_gpa
  - Fixed more deviations from the QEMU coding style
  - Collected Eric's RBs, with thanks

[1]: https://lore.kernel.org/r/20220107163324.2491209-1-...@kernel.org

Marc Zyngier (6):
  hw/arm/virt: Add a control for the highmem PCIe MMIO
  hw/arm/virt: Add a control for the highmem redistributors
  hw/arm/virt: Honor highmem setting when computing the memory map
  hw/arm/virt: Use the PA range to compute the memory map
  hw/arm/virt: Disable highmem devices that don't fit in the PA range
  hw/arm/virt: Drop superfluous checks against highmem

 hw/arm/virt-acpi-build.c | 10 ++--
 hw/arm/virt.c| 98 ++--
 include/hw/arm/virt.h|  5 +-
 3 files changed, 91 insertions(+), 22 deletions(-)

-- 
2.30.2




[PATCH v5 5/6] hw/arm/virt: Disable highmem devices that don't fit in the PA range

2022-01-14 Thread Marc Zyngier
In order to only keep the highmem devices that actually fit in
the PA range, check their location against the range and update
highest_gpa if they fit. If they don't, mark them as disabled.

Signed-off-by: Marc Zyngier 
---
 hw/arm/virt.c | 34 --
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index a427676b50..053791cc44 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1712,21 +1712,43 @@ static void virt_set_memmap(VirtMachineState *vms, int pa_bits)
 base = vms->memmap[VIRT_MEM].base + LEGACY_RAMLIMIT_BYTES;
 }
 
+/* We know for sure that at least the memory fits in the PA space */
+vms->highest_gpa = memtop - 1;
+
 for (i = VIRT_LOWMEMMAP_LAST; i < ARRAY_SIZE(extended_memmap); i++) {
 hwaddr size = extended_memmap[i].size;
+bool fits;
 
 base = ROUND_UP(base, size);
 vms->memmap[i].base = base;
 vms->memmap[i].size = size;
+
+/*
+ * Check each device to see if they fit in the PA space,
+ * moving highest_gpa as we go.
+ *
+ * For each device that doesn't fit, disable it.
+ */
+fits = (base + size) <= BIT_ULL(pa_bits);
+if (fits) {
+vms->highest_gpa = base + size - 1;
+}
+
+switch (i) {
+case VIRT_HIGH_GIC_REDIST2:
+vms->highmem_redists &= fits;
+break;
+case VIRT_HIGH_PCIE_ECAM:
+vms->highmem_ecam &= fits;
+break;
+case VIRT_HIGH_PCIE_MMIO:
+vms->highmem_mmio &= fits;
+break;
+}
+
 base += size;
 }
 
-/*
- * If base fits within pa_bits, all good. If it doesn't, limit it
- * to the end of RAM, which is guaranteed to fit within pa_bits.
- */
-vms->highest_gpa = (base <= BIT_ULL(pa_bits) ? base : memtop) - 1;
-
 if (device_memory_size > 0) {
 ms->device_memory = g_malloc0(sizeof(*ms->device_memory));
 ms->device_memory->base = device_memory_base;
-- 
2.30.2
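The walk in this patch — place each high region, mark it enabled only if it ends below 1 << pa_bits, and advance highest_gpa as regions fit — can be sketched standalone as (region sizes and names made up, not QEMU's memmap):

```c
#include <assert.h>
#include <stdint.h>

#define BIT_ULL(n) (1ULL << (n))

struct region {
    uint64_t base, size;
    int enabled;
};

/* Sketch: lay out high regions above 'base', disabling each one that
 * doesn't fit in the PA space. Region sizes are powers of two, so
 * alignment is a simple mask. Returns the updated highest_gpa, which
 * starts at the guaranteed-to-fit top of RAM. */
static uint64_t place_regions(struct region *r, int n, uint64_t base,
                              int pa_bits, uint64_t highest_gpa)
{
    for (int i = 0; i < n; i++) {
        base = (base + r[i].size - 1) & ~(r[i].size - 1);
        r[i].base = base;
        r[i].enabled = (base + r[i].size) <= BIT_ULL(pa_bits);
        if (r[i].enabled) {
            highest_gpa = base + r[i].size - 1;
        }
        base += r[i].size;
    }
    return highest_gpa;
}
```

Note that a disabled region still consumes its slot in the layout; only the enabled flag and highest_gpa change, matching the patch's behaviour.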




Re: [PATCH v3] hw/arm/virt: KVM: Enable PAuth when supported by the host

2022-01-11 Thread Marc Zyngier
On Tue, 11 Jan 2022 13:58:49 +,
Peter Maydell  wrote:
> 
> On Sat, 8 Jan 2022 at 13:42, Marc Zyngier  wrote:
> >
> > On 2022-01-07 20:23, Richard Henderson wrote:
> > > On 1/7/22 7:01 AM, Marc Zyngier wrote:
> > >> @@ -1380,17 +1380,10 @@ void arm_cpu_finalize_features(ARMCPU *cpu,
> > >> Error **errp)
> > >>   return;
> > >>   }
> > >>   -/*
> > >> - * KVM does not support modifications to this feature.
> > >> - * We have not registered the cpu properties when KVM
> > >> - * is in use, so the user will not be able to set them.
> > >> - */
> > >> -if (!kvm_enabled()) {
> > >> -arm_cpu_pauth_finalize(cpu, &local_err);
> > >> -if (local_err != NULL) {
> > >> +arm_cpu_pauth_finalize(cpu, &local_err);
> > >> +if (local_err != NULL) {
> > >>   error_propagate(errp, local_err);
> > >>   return;
> > >> -}
> > >>   }
> > >
> > > Indentation is still off -- error + return should be out-dented one
> > > level.
> > >
> >
> > Duh. Clearly, my brain can't spot these. Apologies for the extra noise.
> >
> > > Otherwise,
> > > Reviewed-by: Richard Henderson 
> >
> > Thanks. I'll repost a version shortly, unless someone shouts.
> 
> Don't worry about it -- I've applied this to target-arm.next and
> fixed the indent there.

Awesome, thanks Peter.

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH] virtio-pci: fix up config interrupt handling

2022-01-10 Thread Marc Zyngier
Hi Michael,

On Sun, 09 Jan 2022 17:49:19 +,
"Michael S. Tsirkin"  wrote:
> 
> Fixes a couple of issues with irqfd use by config interrupt:
> - Rearrange initialization so cleanup happens in the reverse order
> - Don't use irqfd for config when not in use for data path
> I am not sure this is a complete fix though: I think we
> are better off limiting the effect to vdpa devices
> with config interrupt support. Or even bypass irqfd
> for config completely and inject into KVM using ioctl?
> The advantage would be less FDs used.
> This would mean mostly reverting the patchset though.
> 
> Fixes: d5d24d859c ("virtio-pci: add support for configure interrupt")
> Cc: "Cindy Lu" 
> Signed-off-by: Michael S. Tsirkin 

This doesn't seem to fix the problems I'm seeing here with a KVM/arm64
guest (the issue exists with and without this patch)

On initial boot:


Loading Linux 5.7.0-1-arm64 ...
Loading initial ramdisk ...
qemu-system-aarch64: virtio-blk failed to set guest notifier (-16), ensure -accel kvm is set.
qemu-system-aarch64: virtio_bus_start_ioeventfd: failed. Fallback to userspace (slower).
qemu-system-aarch64: virtio-scsi: Failed to set guest notifiers (-16), ensure -accel kvm is set.
qemu-system-aarch64: virtio_bus_start_ioeventfd: failed. Fallback to userspace (slower).


The guest is functional though. However, on reboot:


Loading Linux 5.7.0-1-arm64 ...
Loading initial ramdisk ...
qemu-system-aarch64: ../hw/pci/msix.c:622: msix_unset_vector_notifiers: Assertion `dev->msix_vector_use_notifier && dev->msix_vector_release_notifier' failed.


Reverting d5d24d859c fixes the issue. For the record, my qemu command
line:

/home/maz/qemu/build/qemu-system-aarch64 -m 1G -smp 8 -cpu host,aarch64=on 
-machine virt,accel=kvm,gic-version=host -nographic -drive 
if=pflash,format=raw,readonly=on,file=/usr/share/AAVMF/AAVMF_CODE.fd -drive 
if=pflash,format=raw,file=bullseye/bBwcgtDY2UwXklV6.fd -netdev user,id=hostnet0 
-device virtio-net-pci,netdev=hostnet0 -drive 
if=none,format=raw,cache=none,aio=native,file=bullseye/bBwcgtDY2UwXklV6.img,id=disk0
 -device virtio-blk-pci,drive=disk0 -drive 
file=debian-testing-arm64-netinst-preseed.iso,id=cdrom,if=none,media=cdrom 
-device virtio-scsi-pci -device scsi-cd,drive=cdrom

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v4 5/6] hw/arm/virt: Disable highmem devices that don't fit in the PA range

2022-01-10 Thread Marc Zyngier
On Mon, 10 Jan 2022 17:12:50 +,
Eric Auger  wrote:
> 
> Hi Marc,
> 
> On 1/7/22 5:33 PM, Marc Zyngier wrote:
> > In order to only keep the highmem devices that actually fit in
> > the PA range, check their location against the range and update
> > highest_gpa if they fit. If they don't, mark them them as disabled.
> s/them them/them
> >
> > Signed-off-by: Marc Zyngier 
> > ---
> >  hw/arm/virt.c | 34 --
> >  1 file changed, 28 insertions(+), 6 deletions(-)
> >
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > index db4b0636e1..70b4773b3e 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -1711,21 +1711,43 @@ static void virt_set_memmap(VirtMachineState *vms, 
> > int pa_bits)
> >  base = vms->memmap[VIRT_MEM].base + LEGACY_RAMLIMIT_BYTES;
> >  }
> >  
> > +/* We know for sure that at least the memory fits in the PA space */
> > +vms->highest_gpa = memtop - 1;
> > +
> >  for (i = VIRT_LOWMEMMAP_LAST; i < ARRAY_SIZE(extended_memmap); i++) {
> >  hwaddr size = extended_memmap[i].size;
> > +bool fits;
> >  
> >  base = ROUND_UP(base, size);
> >  vms->memmap[i].base = base;
> >  vms->memmap[i].size = size;
> > +
> > +/*
> > + * Check each device to see if they fit in the PA space,
> > + * moving highest_gpa as we go.
> > + *
> > + * For each device that doesn't fit, disable it.
> > + */
> > +fits = (base + size) <= BIT_ULL(pa_bits);
> > +if (fits) {
> > +vms->highest_gpa = MAX(vms->highest_gpa, base + size - 1);
> why do you need the MAX()?

Well spotted, I don't. Since we build the memmap by moving base
upward, I can directly use 'base + size - 1' as the new highest_gpa
value.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v4 2/6] hw/arm/virt: Add a control for the the highmem redistributors

2022-01-10 Thread Marc Zyngier
On Mon, 10 Jan 2022 15:47:47 +,
Peter Maydell  wrote:
> 
> On Mon, 10 Jan 2022 at 15:45, Marc Zyngier  wrote:
> > $ /home/maz/vminstall/qemu-hack -m 1G -smp 256 -cpu host -machine 
> > virt,accel=kvm,gic-version=3,highmem=on -nographic -drive 
> > if=pflash,format=raw,readonly=on,file=/usr/share/AAVMF/AAVMF_CODE.fd
> > qemu-hack: warning: Number of SMP cpus requested (256) exceeds the 
> > recommended cpus supported by KVM (8)
> > qemu-hack: warning: Number of hotpluggable cpus requested (256) exceeds the 
> > recommended cpus supported by KVM (8)
> > qemu-hack: Capacity of the redist regions(123) is less than number of 
> > vcpus(256)
> 
> Side question: why is KVM_CAP_NR_VCPUS returning 8 for
> "recommended cpus supported by KVM" ? Is something still
> assuming GICv2 CPU limits?

No, it is only that KVM_CAP_NR_VCPUS is defined as returning the
number of physical CPUs (and this test machine has only 8 of them).

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v4 4/6] hw/arm/virt: Use the PA range to compute the memory map

2022-01-10 Thread Marc Zyngier
On Mon, 10 Jan 2022 15:38:56 +,
Eric Auger  wrote:
> 
> Hi Marc,
> 
> On 1/7/22 5:33 PM, Marc Zyngier wrote:
> > The highmem attribute is nothing but another way to express the
> > PA range of a VM. To support HW that has a smaller PA range than
> > what QEMU assumes, pass this PA range to the virt_set_memmap()
> > function, allowing it to correctly exclude highmem devices
> > if they are outside of the PA range.
> >
> > Signed-off-by: Marc Zyngier 
> > ---
> >  hw/arm/virt.c | 53 ---
> >  1 file changed, 46 insertions(+), 7 deletions(-)
> >
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > index 57c55e8a37..db4b0636e1 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -1660,7 +1660,7 @@ static uint64_t virt_cpu_mp_affinity(VirtMachineState 
> > *vms, int idx)
> >  return arm_cpu_mp_affinity(idx, clustersz);
> >  }
> >  
> > -static void virt_set_memmap(VirtMachineState *vms)
> > +static void virt_set_memmap(VirtMachineState *vms, int pa_bits)
> >  {
> >  MachineState *ms = MACHINE(vms);
> >  hwaddr base, device_memory_base, device_memory_size, memtop;
> > @@ -1678,6 +1678,13 @@ static void virt_set_memmap(VirtMachineState *vms)
> >  exit(EXIT_FAILURE);
> >  }
> >  
> > +/*
> > + * !highmem is exactly the same as limiting the PA space to 32bit,
> > + * irrespective of the underlying capabilities of the HW.
> > + */
> > +if (!vms->highmem)
> > +   pa_bits = 32;
> you need {} according to the QEMU coding style. Welcome to a new shiny
> world :-)

Yeah. Between the reduced indentation and the avalanche of braces, my
brain fails to pattern-match blocks of code. Amusing how inflexible
you become after a couple of decades...

> > +
> >  /*
> >   * We compute the base of the high IO region depending on the
> >   * amount of initial and device memory. The device memory start/size
> > @@ -1691,8 +1698,9 @@ static void virt_set_memmap(VirtMachineState *vms)
> >  
> >  /* Base address of the high IO region */
> >  memtop = base = device_memory_base + ROUND_UP(device_memory_size, GiB);
> > -if (!vms->highmem && memtop > 4 * GiB) {
> > -error_report("highmem=off, but memory crosses the 4GiB limit\n");
> > +if (memtop > BIT_ULL(pa_bits)) {
> > +   error_report("Addressing limited to %d bits, but memory exceeds it 
> > by %llu bytes\n",
> > +pa_bits, memtop - BIT_ULL(pa_bits));
> >  exit(EXIT_FAILURE);
> >  }
> >  if (base < device_memory_base) {
> > @@ -1711,7 +1719,13 @@ static void virt_set_memmap(VirtMachineState *vms)
> >  vms->memmap[i].size = size;
> >  base += size;
> >  }
> > -vms->highest_gpa = (vms->highmem ? base : memtop) - 1;
> > +
> > +/*
> > + * If base fits within pa_bits, all good. If it doesn't, limit it
> > + * to the end of RAM, which is guaranteed to fit within pa_bits.
> > + */
> > +vms->highest_gpa = (base <= BIT_ULL(pa_bits) ? base : memtop) - 1;
> > +
> >  if (device_memory_size > 0) {
> >  ms->device_memory = g_malloc0(sizeof(*ms->device_memory));
> >  ms->device_memory->base = device_memory_base;
> > @@ -1902,12 +1916,38 @@ static void machvirt_init(MachineState *machine)
> >  unsigned int smp_cpus = machine->smp.cpus;
> >  unsigned int max_cpus = machine->smp.max_cpus;
> Move the cpu_type check before?
> 
>     if (!cpu_type_valid(machine->cpu_type)) {
>     error_report("mach-virt: CPU type %s not supported",
> machine->cpu_type);
>     exit(1);
>     }
> >

Yes, very good point. I wonder why this was tucked away past
computing the memory map and the GIC configuration... Anyway, I'll
move it up.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v4 2/6] hw/arm/virt: Add a control for the highmem redistributors

2022-01-10 Thread Marc Zyngier
Hi Eric,

On Mon, 10 Jan 2022 15:35:44 +,
Eric Auger  wrote:
> 
> Hi Marc,
> 
> On 1/7/22 5:33 PM, Marc Zyngier wrote:

[...]

> > @@ -190,7 +191,8 @@ static inline int 
> > virt_gicv3_redist_region_count(VirtMachineState *vms)
> >  
> >  assert(vms->gic_version == VIRT_GIC_VERSION_3);
> >  
> > -return MACHINE(vms)->smp.cpus > redist0_capacity ? 2 : 1;
> > +return (MACHINE(vms)->smp.cpus > redist0_capacity &&
> > +vms->highmem_redists) ? 2 : 1;
> If we fail to use the high redist region, is there any check that the
> number of vcpus does not exceed the first redist region capacity.
> Did you check that config, does it nicely fail?

I did, and it does (example on M1 with KVM):

$ /home/maz/vminstall/qemu-hack -m 1G -smp 256 -cpu host -machine 
virt,accel=kvm,gic-version=3,highmem=on -nographic -drive 
if=pflash,format=raw,readonly=on,file=/usr/share/AAVMF/AAVMF_CODE.fd
qemu-hack: warning: Number of SMP cpus requested (256) exceeds the recommended 
cpus supported by KVM (8)
qemu-hack: warning: Number of hotpluggable cpus requested (256) exceeds the 
recommended cpus supported by KVM (8)
qemu-hack: Capacity of the redist regions(123) is less than number of vcpus(256)

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v3] hw/arm/virt: KVM: Enable PAuth when supported by the host

2022-01-08 Thread Marc Zyngier

On 2022-01-07 20:23, Richard Henderson wrote:

On 1/7/22 7:01 AM, Marc Zyngier wrote:
@@ -1380,17 +1380,10 @@ void arm_cpu_finalize_features(ARMCPU *cpu, 
Error **errp)

  return;
  }
  -/*
- * KVM does not support modifications to this feature.
- * We have not registered the cpu properties when KVM
- * is in use, so the user will not be able to set them.
- */
-if (!kvm_enabled()) {
-arm_cpu_pauth_finalize(cpu, &local_err);
-if (local_err != NULL) {
+arm_cpu_pauth_finalize(cpu, &local_err);
+if (local_err != NULL) {
  error_propagate(errp, local_err);
  return;
-}
  }


Indentation is still off -- error + return should be out-dented one 
level.




Duh. Clearly, my brain can't spot these. Apologies for the extra noise.


Otherwise,
Reviewed-by: Richard Henderson 


Thanks. I'll repost a version shortly, unless someone shouts.

M.
--
Jazz is not dead. It just smells funny...



Re: [PATCH v3 3/5] hw/arm/virt: Honor highmem setting when computing the memory map

2022-01-07 Thread Marc Zyngier
On Fri, 07 Jan 2022 18:48:16 +,
Peter Maydell  wrote:
> 
> On Fri, 7 Jan 2022 at 18:18, Marc Zyngier  wrote:
> > This is a chicken and egg problem: you need the IPA size to compute
> > the memory map, and you need the memory map to compute the IPA
> > size. Fun, isn't it?
> >
> > At the moment, virt_set_memmap() doesn't know about the IPA space,
> > generates a highest_gpa that may not work, and we end up failing
> > because the resulting VM type is out of bounds.
> >
> > My solution to that is to feed the *maximum* IPA size to
> > virt_set_memmap(), compute the memory map there, and then use
> > highest_gpa to compute the actual IPA size that is used to create the
> > VM. By knowing the IPA limit in virt_set_memmap(), I'm able to keep it
> > in check and avoid generating an unusable memory map.
> 
> Is there any reason not to just always create the VM with the
> maximum supported IPA size, rather than trying to create it
> with the smallest IPA size that will work? (ie skip the last
> step of computing the IPA size to create the VM with)

That gives KVM the opportunity to reduce the depth of the S2 page
tables. On HW that supports a large PA space, there is a real
advantage in keeping these shallow if at all possible.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v3 3/5] hw/arm/virt: Honor highmem setting when computing the memory map

2022-01-07 Thread Marc Zyngier
Hi Eric,

On Fri, 07 Jan 2022 17:15:19 +,
Eric Auger  wrote:
> 
> Hi Marc,
> 
> On 1/6/22 10:26 PM, Marc Zyngier wrote:
> > On Wed, 05 Jan 2022 09:22:39 +,
> > Eric Auger  wrote:
> >> Hi Marc,
> >>
> >> On 12/27/21 10:16 PM, Marc Zyngier wrote:
> >>> Even when the VM is configured with highmem=off, the highest_gpa
> >>> field includes devices that are above the 4GiB limit.
> >>> Similarily, nothing seem to check that the memory is within
> >>> the limit set by the highmem=off option.
> >>>
> >>> This leads to failures in virt_kvm_type() on systems that have
> >>> a crippled IPA range, as the reported IPA space is larger than
> >>> what it should be.
> >>>
> >>> Instead, honor the user-specified limit to only use the devices
> >>> at the lowest end of the spectrum, and fail if we have memory
> >>> crossing the 4GiB limit.
> >>>
> >>> Reviewed-by: Andrew Jones 
> >>> Signed-off-by: Marc Zyngier 
> >>> ---
> >>>  hw/arm/virt.c | 9 -
> >>>  1 file changed, 8 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> >>> index 8b600d82c1..84dd3b36fb 100644
> >>> --- a/hw/arm/virt.c
> >>> +++ b/hw/arm/virt.c
> >>> @@ -1678,6 +1678,11 @@ static void virt_set_memmap(VirtMachineState *vms)
> >>>  exit(EXIT_FAILURE);
> >>>  }
> >>>  
> >>> +if (!vms->highmem &&
> >>> +vms->memmap[VIRT_MEM].base + ms->maxram_size > 4 * GiB) {
> >>> +error_report("highmem=off, but memory crosses the 4GiB limit\n");
> >>> +exit(EXIT_FAILURE);
> >> The memory is composed of initial memory and device memory.
> >> device memory is put after the initial memory but has a 1GB alignment
> >> On top of that you have 1G page alignment per device memory slot
> >>
> >> so potentially the highest mem address is larger than
> >> vms->memmap[VIRT_MEM].base + ms->maxram_size.
> >> I would rather do the check on device_memory_base + device_memory_size
> > Yup, that's a good point.
> >
> > There is also a corner case in one of the later patches where I check
> > this limit against the PA using the rounded-up device_memory_size.
> > This could result in returning an error if the last memory slot would
> > still fit in the PA space, but the rounded-up quantity wouldn't. I
> > don't think it matters much, but I'll fix it anyway.
> >
> >>> +}
> >>>  /*
> >>>   * We compute the base of the high IO region depending on the
> >>>   * amount of initial and device memory. The device memory start/size
> >>> @@ -1707,7 +1712,9 @@ static void virt_set_memmap(VirtMachineState *vms)
> >>>  vms->memmap[i].size = size;
> >>>  base += size;
> >>>  }
> >>> -vms->highest_gpa = base - 1;
> >>> +vms->highest_gpa = (vms->highmem ?
> >>> +base :
> >>> +vms->memmap[VIRT_MEM].base + ms->maxram_size) - 
> >>> 1;
> >> As per the previous comment this looks wrong to me if !highmem.
> > Agreed.
> >
> >> If !highmem, if RAM requirements are low we still could get benefit from
> >> REDIST2 and HIGH ECAM which could fit within the 4GB limit. But maybe we
> >> simply don't care?
> > I don't see how. These devices live at a minimum of 256GB, which
> > contradicts the very meaning of !highmem being a 4GB limit.
> Yes I corrected the above statement afterwards, sorry for the noise.
> >
> >> If we don't, why don't we simply skip the extended_memmap overlay as
> >> suggested in v2? I did not get your reply sorry.
> > Because although this makes sense if you only care about a 32bit
> > limit, we eventually want to check against an arbitrary PA limit and
> > enable the individual devices that do fit in that space.
> 
> In my understanding that is what virt_kvm_type() was supposed to do by
> testing the result of kvm_arm_get_max_vm_ipa_size and requested_pa_size
> (which accounted the high regions) and exiting if they were
> incompatible. But I must miss something.

This is a chicken and egg problem: you need the IPA size to compute
the memory map, and you need the memory map to compute the IPA
size. Fun, isn't it?

At the moment, virt_set_memmap() doesn't know about the IPA space,
generates a highest_gpa that may not work, and we end up failing
because the resulting VM type is out of bounds.

My solution to that is to feed the *maximum* IPA size to
virt_set_memmap(), compute the memory map there, and then use
highest_gpa to compute the actual IPA size that is used to create the
VM. By knowing the IPA limit in virt_set_memmap(), I'm able to keep it
in check and avoid generating an unusable memory map.

I've tried to make that clearer in my v4. Hopefully I succeeded.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



[PATCH v4 5/6] hw/arm/virt: Disable highmem devices that don't fit in the PA range

2022-01-07 Thread Marc Zyngier
In order to only keep the highmem devices that actually fit in
the PA range, check their location against the range and update
highest_gpa if they fit. If they don't, mark them as disabled.

Signed-off-by: Marc Zyngier 
---
 hw/arm/virt.c | 34 --
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index db4b0636e1..70b4773b3e 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1711,21 +1711,43 @@ static void virt_set_memmap(VirtMachineState *vms, int 
pa_bits)
 base = vms->memmap[VIRT_MEM].base + LEGACY_RAMLIMIT_BYTES;
 }
 
+/* We know for sure that at least the memory fits in the PA space */
+vms->highest_gpa = memtop - 1;
+
 for (i = VIRT_LOWMEMMAP_LAST; i < ARRAY_SIZE(extended_memmap); i++) {
 hwaddr size = extended_memmap[i].size;
+bool fits;
 
 base = ROUND_UP(base, size);
 vms->memmap[i].base = base;
 vms->memmap[i].size = size;
+
+/*
+ * Check each device to see if they fit in the PA space,
+ * moving highest_gpa as we go.
+ *
+ * For each device that doesn't fit, disable it.
+ */
+fits = (base + size) <= BIT_ULL(pa_bits);
+if (fits) {
+vms->highest_gpa = MAX(vms->highest_gpa, base + size - 1);
+}
+
+switch (i) {
+case VIRT_HIGH_GIC_REDIST2:
+vms->highmem_redists &= fits;
+break;
+case VIRT_HIGH_PCIE_ECAM:
+vms->highmem_ecam &= fits;
+break;
+case VIRT_HIGH_PCIE_MMIO:
+vms->highmem_mmio &= fits;
+break;
+}
+
 base += size;
 }
 
-/*
- * If base fits within pa_bits, all good. If it doesn't, limit it
- * to the end of RAM, which is guaranteed to fit within pa_bits.
- */
-vms->highest_gpa = (base <= BIT_ULL(pa_bits) ? base : memtop) - 1;
-
 if (device_memory_size > 0) {
 ms->device_memory = g_malloc0(sizeof(*ms->device_memory));
 ms->device_memory->base = device_memory_base;
-- 
2.30.2




[PATCH v4 6/6] hw/arm/virt: Drop superfluous checks against highmem

2022-01-07 Thread Marc Zyngier
Now that the devices present in the extended memory map are checked
against the available PA space and disabled when they don't fit,
there is no need to keep the same checks against highmem, as
highmem really is a shortcut for the PA space being 32bit.

Signed-off-by: Marc Zyngier 
---
 hw/arm/virt-acpi-build.c | 2 --
 hw/arm/virt.c| 5 +
 2 files changed, 1 insertion(+), 6 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 505c61e88e..cdac009419 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -946,8 +946,6 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables 
*tables)
 acpi_add_table(table_offsets, tables_blob);
 build_fadt_rev5(tables_blob, tables->linker, vms, dsdt);
 
-vms->highmem_redists &= vms->highmem;
-
 acpi_add_table(table_offsets, tables_blob);
 build_madt(tables_blob, tables->linker, vms);
 
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 70b4773b3e..641c6a9c31 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2170,9 +2170,6 @@ static void machvirt_init(MachineState *machine)
 
 virt_flash_fdt(vms, sysmem, secure_sysmem ?: sysmem);
 
-vms->highmem_mmio &= vms->highmem;
-vms->highmem_redists &= vms->highmem;
-
 create_gic(vms, sysmem);
 
 virt_cpu_post_init(vms, sysmem);
@@ -2191,7 +2188,7 @@ static void machvirt_init(MachineState *machine)
machine->ram_size, "mach-virt.tag");
 }
 
-vms->highmem_ecam &= vms->highmem && (!firmware_loaded || aarch64);
+vms->highmem_ecam &= (!firmware_loaded || aarch64);
 
 create_rtc(vms);
 
-- 
2.30.2




[PATCH v4 1/6] hw/arm/virt: Add a control for the highmem PCIe MMIO

2022-01-07 Thread Marc Zyngier
Just like we can control the enablement of the highmem PCIe ECAM
region using highmem_ecam, let's add a control for the highmem
PCIe MMIO region.

Similarly to highmem_ecam, this region is disabled when highmem
is off.

Signed-off-by: Marc Zyngier 
---
 hw/arm/virt-acpi-build.c | 10 --
 hw/arm/virt.c|  7 +--
 include/hw/arm/virt.h|  1 +
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index d0f4867fdf..cdac009419 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -158,10 +158,9 @@ static void acpi_dsdt_add_virtio(Aml *scope,
 }
 
 static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
-  uint32_t irq, bool use_highmem, bool 
highmem_ecam,
-  VirtMachineState *vms)
+  uint32_t irq, VirtMachineState *vms)
 {
-int ecam_id = VIRT_ECAM_ID(highmem_ecam);
+int ecam_id = VIRT_ECAM_ID(vms->highmem_ecam);
 struct GPEXConfig cfg = {
 .mmio32 = memmap[VIRT_PCIE_MMIO],
 .pio= memmap[VIRT_PCIE_PIO],
@@ -170,7 +169,7 @@ static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry 
*memmap,
 .bus= vms->bus,
 };
 
-if (use_highmem) {
+if (vms->highmem_mmio) {
 cfg.mmio64 = memmap[VIRT_HIGH_PCIE_MMIO];
 }
 
@@ -868,8 +867,7 @@ build_dsdt(GArray *table_data, BIOSLinker *linker, 
VirtMachineState *vms)
 acpi_dsdt_add_fw_cfg(scope, &memmap[VIRT_FW_CFG]);
 acpi_dsdt_add_virtio(scope, &memmap[VIRT_MMIO],
 (irqmap[VIRT_MMIO] + ARM_SPI_BASE), NUM_VIRTIO_TRANSPORTS);
-acpi_dsdt_add_pci(scope, memmap, (irqmap[VIRT_PCIE] + ARM_SPI_BASE),
-  vms->highmem, vms->highmem_ecam, vms);
+acpi_dsdt_add_pci(scope, memmap, irqmap[VIRT_PCIE] + ARM_SPI_BASE, vms);
 if (vms->acpi_dev) {
 build_ged_aml(scope, "\\_SB."GED_DEVICE,
   HOTPLUG_HANDLER(vms->acpi_dev),
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 4593fea1ce..b9ce81f4a1 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1412,7 +1412,7 @@ static void create_pcie(VirtMachineState *vms)
  mmio_reg, base_mmio, size_mmio);
 memory_region_add_subregion(get_system_memory(), base_mmio, mmio_alias);
 
-if (vms->highmem) {
+if (vms->highmem_mmio) {
 /* Map high MMIO space */
 MemoryRegion *high_mmio_alias = g_new0(MemoryRegion, 1);
 
@@ -1466,7 +1466,7 @@ static void create_pcie(VirtMachineState *vms)
 qemu_fdt_setprop_sized_cells(ms->fdt, nodename, "reg",
  2, base_ecam, 2, size_ecam);
 
-if (vms->highmem) {
+if (vms->highmem_mmio) {
 qemu_fdt_setprop_sized_cells(ms->fdt, nodename, "ranges",
  1, FDT_PCI_RANGE_IOPORT, 2, 0,
  2, base_pio, 2, size_pio,
@@ -2105,6 +2105,8 @@ static void machvirt_init(MachineState *machine)
 
 virt_flash_fdt(vms, sysmem, secure_sysmem ?: sysmem);
 
+vms->highmem_mmio &= vms->highmem;
+
 create_gic(vms, sysmem);
 
 virt_cpu_post_init(vms, sysmem);
@@ -2802,6 +2804,7 @@ static void virt_instance_init(Object *obj)
 vms->gic_version = VIRT_GIC_VERSION_NOSEL;
 
 vms->highmem_ecam = !vmc->no_highmem_ecam;
+vms->highmem_mmio = true;
 
 if (vmc->no_its) {
 vms->its = false;
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index dc6b66ffc8..9c54acd10d 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -143,6 +143,7 @@ struct VirtMachineState {
 bool secure;
 bool highmem;
 bool highmem_ecam;
+bool highmem_mmio;
 bool its;
 bool tcg_its;
 bool virt;
-- 
2.30.2




[PATCH v4 3/6] hw/arm/virt: Honor highmem setting when computing the memory map

2022-01-07 Thread Marc Zyngier
Even when the VM is configured with highmem=off, the highest_gpa
field includes devices that are above the 4GiB limit.
Similarly, nothing seems to check that the memory is within
the limit set by the highmem=off option.

This leads to failures in virt_kvm_type() on systems that have
a crippled IPA range, as the reported IPA space is larger than
what it should be.

Instead, honor the user-specified limit to only use the devices
at the lowest end of the spectrum, and fail if we have memory
crossing the 4GiB limit.

Reviewed-by: Andrew Jones 
Signed-off-by: Marc Zyngier 
---
 hw/arm/virt.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 4d1d629432..57c55e8a37 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1663,7 +1663,7 @@ static uint64_t virt_cpu_mp_affinity(VirtMachineState 
*vms, int idx)
 static void virt_set_memmap(VirtMachineState *vms)
 {
 MachineState *ms = MACHINE(vms);
-hwaddr base, device_memory_base, device_memory_size;
+hwaddr base, device_memory_base, device_memory_size, memtop;
 int i;
 
 vms->memmap = extended_memmap;
@@ -1690,7 +1690,11 @@ static void virt_set_memmap(VirtMachineState *vms)
 device_memory_size = ms->maxram_size - ms->ram_size + ms->ram_slots * GiB;
 
 /* Base address of the high IO region */
-base = device_memory_base + ROUND_UP(device_memory_size, GiB);
+memtop = base = device_memory_base + ROUND_UP(device_memory_size, GiB);
+if (!vms->highmem && memtop > 4 * GiB) {
+error_report("highmem=off, but memory crosses the 4GiB limit\n");
+exit(EXIT_FAILURE);
+}
 if (base < device_memory_base) {
 error_report("maxmem/slots too huge");
 exit(EXIT_FAILURE);
@@ -1707,7 +1711,7 @@ static void virt_set_memmap(VirtMachineState *vms)
 vms->memmap[i].size = size;
 base += size;
 }
-vms->highest_gpa = base - 1;
+vms->highest_gpa = (vms->highmem ? base : memtop) - 1;
 if (device_memory_size > 0) {
 ms->device_memory = g_malloc0(sizeof(*ms->device_memory));
 ms->device_memory->base = device_memory_base;
-- 
2.30.2




[PATCH v4 4/6] hw/arm/virt: Use the PA range to compute the memory map

2022-01-07 Thread Marc Zyngier
The highmem attribute is nothing but another way to express the
PA range of a VM. To support HW that has a smaller PA range than
what QEMU assumes, pass this PA range to the virt_set_memmap()
function, allowing it to correctly exclude highmem devices
if they are outside of the PA range.

Signed-off-by: Marc Zyngier 
---
 hw/arm/virt.c | 53 ---
 1 file changed, 46 insertions(+), 7 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 57c55e8a37..db4b0636e1 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1660,7 +1660,7 @@ static uint64_t virt_cpu_mp_affinity(VirtMachineState 
*vms, int idx)
 return arm_cpu_mp_affinity(idx, clustersz);
 }
 
-static void virt_set_memmap(VirtMachineState *vms)
+static void virt_set_memmap(VirtMachineState *vms, int pa_bits)
 {
 MachineState *ms = MACHINE(vms);
 hwaddr base, device_memory_base, device_memory_size, memtop;
@@ -1678,6 +1678,13 @@ static void virt_set_memmap(VirtMachineState *vms)
 exit(EXIT_FAILURE);
 }
 
+/*
+ * !highmem is exactly the same as limiting the PA space to 32bit,
+ * irrespective of the underlying capabilities of the HW.
+ */
+if (!vms->highmem)
+   pa_bits = 32;
+
 /*
  * We compute the base of the high IO region depending on the
  * amount of initial and device memory. The device memory start/size
@@ -1691,8 +1698,9 @@ static void virt_set_memmap(VirtMachineState *vms)
 
 /* Base address of the high IO region */
 memtop = base = device_memory_base + ROUND_UP(device_memory_size, GiB);
-if (!vms->highmem && memtop > 4 * GiB) {
-error_report("highmem=off, but memory crosses the 4GiB limit\n");
+if (memtop > BIT_ULL(pa_bits)) {
+   error_report("Addressing limited to %d bits, but memory exceeds it 
by %llu bytes\n",
+pa_bits, memtop - BIT_ULL(pa_bits));
 exit(EXIT_FAILURE);
 }
 if (base < device_memory_base) {
@@ -1711,7 +1719,13 @@ static void virt_set_memmap(VirtMachineState *vms)
 vms->memmap[i].size = size;
 base += size;
 }
-vms->highest_gpa = (vms->highmem ? base : memtop) - 1;
+
+/*
+ * If base fits within pa_bits, all good. If it doesn't, limit it
+ * to the end of RAM, which is guaranteed to fit within pa_bits.
+ */
+vms->highest_gpa = (base <= BIT_ULL(pa_bits) ? base : memtop) - 1;
+
 if (device_memory_size > 0) {
 ms->device_memory = g_malloc0(sizeof(*ms->device_memory));
 ms->device_memory->base = device_memory_base;
@@ -1902,12 +1916,38 @@ static void machvirt_init(MachineState *machine)
 unsigned int smp_cpus = machine->smp.cpus;
 unsigned int max_cpus = machine->smp.max_cpus;
 
+possible_cpus = mc->possible_cpu_arch_ids(machine);
+
 /*
  * In accelerated mode, the memory map is computed earlier in kvm_type()
  * to create a VM with the right number of IPA bits.
  */
 if (!vms->memmap) {
-virt_set_memmap(vms);
+Object *cpuobj;
+ARMCPU *armcpu;
+int pa_bits;
+
+/*
+ * Instantiate a temporary CPU object to find out about what
+ * we are about to deal with. Once this is done, get rid of
+ * the object.
+ */
+cpuobj = object_new(possible_cpus->cpus[0].type);
+armcpu = ARM_CPU(cpuobj);
+
+if (object_property_get_bool(cpuobj, "aarch64", NULL)) {
+pa_bits = arm_pamax(armcpu);
+} else if (arm_feature(&armcpu->env, ARM_FEATURE_LPAE)) {
+/* v7 with LPAE */
+pa_bits = 40;
+} else {
+/* Anything else */
+pa_bits = 32;
+}
+
+object_unref(cpuobj);
+
+virt_set_memmap(vms, pa_bits);
 }
 
 /* We can probe only here because during property set
@@ -1989,7 +2029,6 @@ static void machvirt_init(MachineState *machine)
 
 create_fdt(vms);
 
-possible_cpus = mc->possible_cpu_arch_ids(machine);
 assert(possible_cpus->len == max_cpus);
 for (n = 0; n < possible_cpus->len; n++) {
 Object *cpuobj;
@@ -2646,7 +2685,7 @@ static int virt_kvm_type(MachineState *ms, const char 
*type_str)
 max_vm_pa_size = kvm_arm_get_max_vm_ipa_size(ms, &fixed_ipa);
 
 /* we freeze the memory map to compute the highest gpa */
-virt_set_memmap(vms);
+virt_set_memmap(vms, max_vm_pa_size);
 
 requested_pa_size = 64 - clz64(vms->highest_gpa);
 
-- 
2.30.2




[PATCH v4 0/6] target/arm: Reduced-IPA space and highmem fixes

2022-01-07 Thread Marc Zyngier
Here's another stab at enabling QEMU on systems with pathologically
reduced IPA ranges such as the Apple M1 (previous version at [1]).
Eventually, we're able to run a KVM guest with more than just 3GB of
RAM on a system with a 36bit IPA space, and at most 123 vCPUs.

This also addresses some pathological QEMU behaviours, where the
highmem property is used as a flag allowing exposure of devices that
can't possibly fit in the PA space of the VM, resulting in a guest
failure.

In the end, we generalise the notion of PA space when exposing
individual devices in the expanded memory map, and treat highmem as
another flavour of PA space restriction.

This series does a few things:

- introduce new attributes to control the enabling of the highmem
  GICv3 redistributors and the highmem PCIe MMIO range

- correctly cap the PA range when highmem is off

- generalise the highmem behaviour to any PA range

- disable each highmem device region that doesn't fit in the PA range

- cleanup uses of highmem outside of virt_set_memmap()

This has been tested on an M1-based Mac-mini running Linux v5.16-rc6
with both KVM and TCG.

* From v3 [1]:

  - Introduced highmem_mmio as the MMIO pendant to highmem_ecam after
Eric made it plain that I was misguided in using highmem_ecam to
gate the MMIO region.

  - Fixed the way the top of RAM is enforced (using the device memory
size, rounded up to the nearest GB). I long debated *not* using
the rounded up version, but finally decided that it would be the
least surprising, given that each slot is supposed to hold a full
GB.

  - Now allowing some of the highmem devices to be individually
enabled if they fit in the PA range. For example, a system with a
39bit PA range and at most 255GB of RAM can use the highmem redist
and PCIe ECAM ranges, but not the high PCIe range.

  - Dropped some of Andrew's RBs, as the code significantly changed.

[1] https://lore.kernel.org/r/20211227211642.994461-1-...@kernel.org

Marc Zyngier (6):
  hw/arm/virt: Add a control for the highmem PCIe MMIO
  hw/arm/virt: Add a control for the highmem redistributors
  hw/arm/virt: Honor highmem setting when computing the memory map
  hw/arm/virt: Use the PA range to compute the memory map
  hw/arm/virt: Disable highmem devices that don't fit in the PA range
  hw/arm/virt: Drop superfluous checks against highmem

 hw/arm/virt-acpi-build.c | 10 ++---
 hw/arm/virt.c| 87 +++-
 include/hw/arm/virt.h|  5 ++-
 3 files changed, 85 insertions(+), 17 deletions(-)

-- 
2.30.2




[PATCH v4 2/6] hw/arm/virt: Add a control for the highmem redistributors

2022-01-07 Thread Marc Zyngier
Just like we can control the enablement of the highmem PCIe region
using highmem_ecam, let's add a control for the highmem GICv3
redistributor region.

Similarly to highmem_ecam, these redistributors are disabled when
highmem is off.

Reviewed-by: Andrew Jones 
Signed-off-by: Marc Zyngier 
---
 hw/arm/virt-acpi-build.c | 2 ++
 hw/arm/virt.c| 2 ++
 include/hw/arm/virt.h| 4 +++-
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index cdac009419..505c61e88e 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -946,6 +946,8 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables 
*tables)
 acpi_add_table(table_offsets, tables_blob);
 build_fadt_rev5(tables_blob, tables->linker, vms, dsdt);
 
+vms->highmem_redists &= vms->highmem;
+
 acpi_add_table(table_offsets, tables_blob);
 build_madt(tables_blob, tables->linker, vms);
 
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index b9ce81f4a1..4d1d629432 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2106,6 +2106,7 @@ static void machvirt_init(MachineState *machine)
 virt_flash_fdt(vms, sysmem, secure_sysmem ?: sysmem);
 
 vms->highmem_mmio &= vms->highmem;
+vms->highmem_redists &= vms->highmem;
 
 create_gic(vms, sysmem);
 
@@ -2805,6 +2806,7 @@ static void virt_instance_init(Object *obj)
 
 vms->highmem_ecam = !vmc->no_highmem_ecam;
 vms->highmem_mmio = true;
+vms->highmem_redists = true;
 
 if (vmc->no_its) {
 vms->its = false;
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index 9c54acd10d..dc9fa26faa 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -144,6 +144,7 @@ struct VirtMachineState {
 bool highmem;
 bool highmem_ecam;
 bool highmem_mmio;
+bool highmem_redists;
 bool its;
 bool tcg_its;
 bool virt;
@@ -190,7 +191,8 @@ static inline int 
virt_gicv3_redist_region_count(VirtMachineState *vms)
 
 assert(vms->gic_version == VIRT_GIC_VERSION_3);
 
-return MACHINE(vms)->smp.cpus > redist0_capacity ? 2 : 1;
+return (MACHINE(vms)->smp.cpus > redist0_capacity &&
+vms->highmem_redists) ? 2 : 1;
 }
 
 #endif /* QEMU_ARM_VIRT_H */
-- 
2.30.2




[PATCH v3] hw/arm/virt: KVM: Enable PAuth when supported by the host

2022-01-07 Thread Marc Zyngier
Add basic support for Pointer Authentication when running a KVM
guest on a host that supports it, loosely based on the SVE
support.

Although the feature is enabled by default when the host advertises
it, it is possible to disable it by setting the 'pauth=off' CPU
property. The 'pauth' comment is removed from cpu-features.rst,
as it is now common to both TCG and KVM.

Tested on an Apple M1 running 5.16-rc6.
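
For reference, the resulting property can be exercised from the command
line like so (a sketch only; firmware and disk arguments are elided):

```
# PAuth enabled by default when the host advertises it
qemu-system-aarch64 -machine virt,accel=kvm -cpu host ...

# Explicitly disabled
qemu-system-aarch64 -machine virt,accel=kvm -cpu host,pauth=off ...
```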

Cc: Eric Auger 
Cc: Richard Henderson 
Cc: Peter Maydell 
Reviewed-by: Andrew Jones 
Signed-off-by: Marc Zyngier 
---
* From v2:
  - Fixed indentation and spelling
  - Slightly reworked the KVM handling in arm_cpu_pauth_finalize
(no functional changes)
  - Picked Andrew's RB tag

 docs/system/arm/cpu-features.rst |  4 
 target/arm/cpu.c | 12 +++-
 target/arm/cpu.h |  1 +
 target/arm/cpu64.c   | 31 +++
 target/arm/kvm64.c   | 21 +
 5 files changed, 52 insertions(+), 17 deletions(-)

diff --git a/docs/system/arm/cpu-features.rst b/docs/system/arm/cpu-features.rst
index 584eb17097..3e626c4b68 100644
--- a/docs/system/arm/cpu-features.rst
+++ b/docs/system/arm/cpu-features.rst
@@ -217,10 +217,6 @@ TCG VCPU Features
 TCG VCPU features are CPU features that are specific to TCG.
 Below is the list of TCG VCPU features and their descriptions.
 
-  pauthEnable or disable ``FEAT_Pauth``, pointer
-   authentication.  By default, the feature is
-   enabled with ``-cpu max``.
-
   pauth-impdef When ``FEAT_Pauth`` is enabled, either the
*impdef* (Implementation Defined) algorithm
is enabled or the *architected* QARMA algorithm
diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index a211804fd3..f3c09931e4 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -1380,17 +1380,10 @@ void arm_cpu_finalize_features(ARMCPU *cpu, Error **errp)
 return;
 }
 
-/*
- * KVM does not support modifications to this feature.
- * We have not registered the cpu properties when KVM
- * is in use, so the user will not be able to set them.
- */
-if (!kvm_enabled()) {
-arm_cpu_pauth_finalize(cpu, &local_err);
-if (local_err != NULL) {
+arm_cpu_pauth_finalize(cpu, &local_err);
+if (local_err != NULL) {
 error_propagate(errp, local_err);
 return;
-}
 }
 }
 
@@ -2091,6 +2084,7 @@ static void arm_host_initfn(Object *obj)
 kvm_arm_set_cpu_features_from_host(cpu);
 if (arm_feature(&cpu->env, ARM_FEATURE_AARCH64)) {
 aarch64_add_sve_properties(obj);
+aarch64_add_pauth_properties(obj);
 }
 #else
 hvf_arm_set_cpu_features_from_host(cpu);
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index e33f37b70a..c6a4d50e82 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -1076,6 +1076,7 @@ void aarch64_sve_narrow_vq(CPUARMState *env, unsigned vq);
 void aarch64_sve_change_el(CPUARMState *env, int old_el,
int new_el, bool el0_a64);
 void aarch64_add_sve_properties(Object *obj);
+void aarch64_add_pauth_properties(Object *obj);
 
 /*
  * SVE registers are encoded in KVM's memory in an endianness-invariant format.
diff --git a/target/arm/cpu64.c b/target/arm/cpu64.c
index 15245a60a8..8786be7783 100644
--- a/target/arm/cpu64.c
+++ b/target/arm/cpu64.c
@@ -630,6 +630,15 @@ void arm_cpu_pauth_finalize(ARMCPU *cpu, Error **errp)
 int arch_val = 0, impdef_val = 0;
 uint64_t t;
 
+/* Exit early if PAuth is enabled, and fall through to disable it */
+if (kvm_enabled() && cpu->prop_pauth) {
+if (!cpu_isar_feature(aa64_pauth, cpu)) {
+error_setg(errp, "'pauth' feature not supported by KVM on this host");
+}
+
+return;
+}
+
 /* TODO: Handle HaveEnhancedPAC, HaveEnhancedPAC2, HaveFPAC. */
 if (cpu->prop_pauth) {
 if (cpu->prop_pauth_impdef) {
@@ -655,6 +664,23 @@ static Property arm_cpu_pauth_property =
 static Property arm_cpu_pauth_impdef_property =
 DEFINE_PROP_BOOL("pauth-impdef", ARMCPU, prop_pauth_impdef, false);
 
+void aarch64_add_pauth_properties(Object *obj)
+{
+ARMCPU *cpu = ARM_CPU(obj);
+
+/* Default to PAUTH on, with the architected algorithm on TCG. */
+qdev_property_add_static(DEVICE(obj), &arm_cpu_pauth_property);
+if (kvm_enabled()) {
+/*
+ * Mirror PAuth support from the probed sysregs back into the
+ * property for KVM. Is it just a bit backward? Yes it is!
+ */
+cpu->prop_pauth = cpu_isar_feature(aa64_pauth, cpu);
+} else {
+qdev_property_add_static(DEVICE(obj), &arm_cpu_pauth_impdef_property);
+}
+}
+
 /* -cpu m

Re: [PATCH v3 3/5] hw/arm/virt: Honor highmem setting when computing the memory map

2022-01-06 Thread Marc Zyngier
On Wed, 05 Jan 2022 09:22:39 +,
Eric Auger  wrote:
> 
> Hi Marc,
> 
> On 12/27/21 10:16 PM, Marc Zyngier wrote:
> > Even when the VM is configured with highmem=off, the highest_gpa
> > field includes devices that are above the 4GiB limit.
> > Similarly, nothing seems to check that the memory is within
> > the limit set by the highmem=off option.
> >
> > This leads to failures in virt_kvm_type() on systems that have
> > a crippled IPA range, as the reported IPA space is larger than
> > what it should be.
> >
> > Instead, honor the user-specified limit to only use the devices
> > at the lowest end of the spectrum, and fail if we have memory
> > crossing the 4GiB limit.
> >
> > Reviewed-by: Andrew Jones 
> > Signed-off-by: Marc Zyngier 
> > ---
> >  hw/arm/virt.c | 9 -
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > index 8b600d82c1..84dd3b36fb 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -1678,6 +1678,11 @@ static void virt_set_memmap(VirtMachineState *vms)
> >  exit(EXIT_FAILURE);
> >  }
> >  
> > +if (!vms->highmem &&
> > +vms->memmap[VIRT_MEM].base + ms->maxram_size > 4 * GiB) {
> > +error_report("highmem=off, but memory crosses the 4GiB limit\n");
> > +exit(EXIT_FAILURE);
> 
> The memory is composed of initial memory and device memory.
> device memory is put after the initial memory but has a 1GB alignment
> On top of that you have 1G page alignment per device memory slot
> 
> so potentially the highest mem address is larger than
> vms->memmap[VIRT_MEM].base + ms->maxram_size.
> I would rather do the check on device_memory_base + device_memory_size

Yup, that's a good point.

There is also a corner case in one of the later patches where I check
this limit against the PA using the rounded-up device_memory_size.
This could result in returning an error if the last memory slot would
still fit in the PA space, but the rounded-up quantity wouldn't. I
don't think it matters much, but I'll fix it anyway.
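The rounding corner case described here can be seen with a toy computation (the ROUND_UP macro mirrors QEMU's osdep one; the helper name and check are hypothetical, not the actual QEMU code):

```c
#include <assert.h>
#include <stdint.h>

#define GiB             (1ULL << 30)
#define ROUND_UP(n, d)  ((((n) + (d) - 1) / (d)) * (d))

/*
 * Hypothetical sketch: the top of memory is computed from the
 * GiB-rounded device memory size, so a configuration whose exact top
 * fits the PA space can still be rejected once the size is rounded up.
 */
static int rounded_top_exceeds_pa(uint64_t dev_mem_base,
                                  uint64_t dev_mem_size,
                                  unsigned pa_bits)
{
    uint64_t top = dev_mem_base + ROUND_UP(dev_mem_size, GiB);

    return top > (1ULL << pa_bits);
}
```

With a 36-bit PA space (64 GiB), a device memory size just one byte over a GiB boundary is enough to push the rounded-up top past the limit.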

> > +}
> >  /*
> >   * We compute the base of the high IO region depending on the
> >   * amount of initial and device memory. The device memory start/size
> > @@ -1707,7 +1712,9 @@ static void virt_set_memmap(VirtMachineState *vms)
> >  vms->memmap[i].size = size;
> >  base += size;
> >  }
> > -vms->highest_gpa = base - 1;
> > +vms->highest_gpa = (vms->highmem ?
> > +base :
> > +vms->memmap[VIRT_MEM].base + ms->maxram_size) - 1;
> As per the previous comment this looks wrong to me if !highmem.

Agreed.

> If !highmem, if RAM requirements are low we still could get benefit from
> REDIST2 and HIGH ECAM which could fit within the 4GB limit. But maybe we
> simply don't care?

I don't see how. These devices live at a minimum of 256GB, which
contradicts the very meaning of !highmem being a 4GB limit.

> If we don't, why don't we simply skip the extended_memmap overlay as
> suggested in v2? I did not get your reply sorry.

Because although this makes sense if you only care about a 32bit
limit, we eventually want to check against an arbitrary PA limit and
enable the individual devices that do fit in that space.

In order to do that, we need to compute the base addresses for these
extra devices. Also, computing 3 base addresses isn't going to be
massively expensive.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v2 1/5] hw/arm/virt: Key enablement of highmem PCIe on highmem_ecam

2022-01-06 Thread Marc Zyngier
Hi Eric,

On Wed, 05 Jan 2022 09:41:19 +,
Eric Auger  wrote:
> 
> couldn't you simply introduce highmem_redist which is truly missing. You
> could set it in virt_set_memmap() in case you skip extended_map overlay
> and use it in virt_gicv3_redist_region_count() as you did?
> In addition to the device memory top address check against the 4GB limit
> if !highmem, we should be fine then?

No, highmem really isn't nearly enough.

Imagine you have (like I do) a system with 36 bits of IPA space.
Create a VM with 8GB of RAM (which means the low-end of IPA space is
already 9GB). Obviously, highmem=true here. With the current code, we
will try to expose this PCI MMIO range, which falls way out of the IPA
space (you need at least 40 bits of IPA to even cover it with the
smallest configuration).

highmem really is a control that says 'things may live above 4GB'. It
doesn't say *how far* above 4GB it can be placed. Which is what I am
trying to address.
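The distinction can be sketched as a toy check (names hypothetical, not QEMU code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/*
 * Toy illustration of the point above: highmem only *allows* a region
 * to live above 4GiB; whether it is actually reachable also depends on
 * the guest's PA (IPA) range. Hypothetical helper, not QEMU code.
 */
static bool region_reachable(bool highmem, uint64_t base, uint64_t size,
                             unsigned pa_bits)
{
    uint64_t pa_limit = 1ULL << pa_bits;

    if (base + size <= 4 * GiB) {
        return true;                  /* low region: always reachable */
    }
    /* High region: must be allowed *and* fit in the PA space. */
    return highmem && base + size <= pa_limit;
}
```

With a 36-bit IPA space (64 GiB), the 512 GiB MMIO range starting at 256 GiB fails this check even with highmem=on, which is exactly the example above.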

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v2] hw/arm/virt: KVM: Enable PAuth when supported by the host

2022-01-06 Thread Marc Zyngier
On Thu, 06 Jan 2022 18:26:29 +,
Richard Henderson  wrote:
> 
> Mm.  It does beg the question of why KVM exposes multiple bits.  If
> they must be tied, then it only serves to make the interface more
> complicated than necessary.  We would be better served to have a
> single bit to control all of PAuth.

In hindsight, there is a lot I would change in the KVM userspace ABI,
and a lot I should have pushed back on. Unfortunately, there is little
we can do now to fix it (userspace expecting this behaviour has been
in the wild for almost 3 years already).

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v2] hw/arm/virt: KVM: Enable PAuth when supported by the host

2022-01-06 Thread Marc Zyngier
On Thu, 06 Jan 2022 17:20:33 +,
Richard Henderson  wrote:
> 
> On 1/6/22 1:16 AM, Marc Zyngier wrote:
> >>> +static bool kvm_arm_pauth_supported(void)
> >>> +{
> >>> +return (kvm_check_extension(kvm_state, KVM_CAP_ARM_PTRAUTH_ADDRESS) &&
> >>> +kvm_check_extension(kvm_state, KVM_CAP_ARM_PTRAUTH_GENERIC));
> >>> +}
> >> 
> >> Do we really need to have them both set to play the game?  Given that
> >> the only thing that happens is that we disable whatever host support
> >> exists, can we have "pauth enabled" mean whatever subset the host has?
> > 
> > The host will always expose either both features or none, and that's
> > part of the ABI. From the bit of kernel documentation located in
> > Documentation/virt/kvm/api.rst:
> > 
> > 
> > 4.82 KVM_ARM_VCPU_INIT
> > --
> > [...]
> >  - KVM_ARM_VCPU_PTRAUTH_ADDRESS: Enables Address Pointer authentication
> >    for arm64 only.
> >    Depends on KVM_CAP_ARM_PTRAUTH_ADDRESS.
> >    If KVM_CAP_ARM_PTRAUTH_ADDRESS and KVM_CAP_ARM_PTRAUTH_GENERIC are
> >    both present, then both KVM_ARM_VCPU_PTRAUTH_ADDRESS and
> >    KVM_ARM_VCPU_PTRAUTH_GENERIC must be requested or neither must be
> >    requested.
> > 
> >  - KVM_ARM_VCPU_PTRAUTH_GENERIC: Enables Generic Pointer authentication
> >    for arm64 only.
> >    Depends on KVM_CAP_ARM_PTRAUTH_GENERIC.
> >    If KVM_CAP_ARM_PTRAUTH_ADDRESS and KVM_CAP_ARM_PTRAUTH_GENERIC are
> >    both present, then both KVM_ARM_VCPU_PTRAUTH_ADDRESS and
> >    KVM_ARM_VCPU_PTRAUTH_GENERIC must be requested or neither must be
> >    requested.
> > 
> > 
> > KVM will reject the initialisation if only one of the features is
> > requested, so checking and enabling both makes sense to me.
> 
> Well, no, that's not what that says.  It says that *if* both host
> flags are set, then both guest flags must be set or both unset.

Indeed. But KVM never returns just one flag. It only exposes both or
none.

> It's probably all academic anyway, because I can't actually imagine a
> vendor implementing ADDR and not GENERIC, but in theory we ought to be
> able to support a host with only ADDR.

We explicitly decided against supporting such a configuration. If
someone comes up with such a contraption, guests won't be able to see
it.

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v2] hw/arm/virt: KVM: Enable PAuth when supported by the host

2022-01-06 Thread Marc Zyngier
Hi Richard,

On Wed, 05 Jan 2022 21:36:55 +,
Richard Henderson  wrote:
> 
> On 1/3/22 10:05 AM, Marc Zyngier wrote:
> > -/*
> > - * KVM does not support modifications to this feature.
> > - * We have not registered the cpu properties when KVM
> > - * is in use, so the user will not be able to set them.
> > - */
> > -if (!kvm_enabled()) {
> > -arm_cpu_pauth_finalize(cpu, &local_err);
> > -if (local_err != NULL) {
> > +   arm_cpu_pauth_finalize(cpu, &local_err);
> > +   if (local_err != NULL) {
> >   error_propagate(errp, local_err);
> >   return;
> > -}
> > -}
> > +   }
> 
> Looks like the indentation is off?

Most probably. I only just discovered how to use the QEMU style for
Emacs, and was indenting things by hand before that (yes, pretty
painful and likely to lead to issues; there is a TAB instead of a set
of spaces there...).

> 
> > +static bool kvm_arm_pauth_supported(void)
> > +{
> > +return (kvm_check_extension(kvm_state, KVM_CAP_ARM_PTRAUTH_ADDRESS) &&
> > +kvm_check_extension(kvm_state, KVM_CAP_ARM_PTRAUTH_GENERIC));
> > +}
> 
> Do we really need to have them both set to play the game?  Given that
> the only thing that happens is that we disable whatever host support
> exists, can we have "pauth enabled" mean whatever subset the host has?

The host will always expose either both features or none, and that's
part of the ABI. From the bit of kernel documentation located in
Documentation/virt/kvm/api.rst:


4.82 KVM_ARM_VCPU_INIT
----------------------
[...]
- KVM_ARM_VCPU_PTRAUTH_ADDRESS: Enables Address Pointer authentication
  for arm64 only.
  Depends on KVM_CAP_ARM_PTRAUTH_ADDRESS.
  If KVM_CAP_ARM_PTRAUTH_ADDRESS and KVM_CAP_ARM_PTRAUTH_GENERIC are
  both present, then both KVM_ARM_VCPU_PTRAUTH_ADDRESS and
  KVM_ARM_VCPU_PTRAUTH_GENERIC must be requested or neither must be
  requested.

- KVM_ARM_VCPU_PTRAUTH_GENERIC: Enables Generic Pointer authentication
  for arm64 only.
  Depends on KVM_CAP_ARM_PTRAUTH_GENERIC.
  If KVM_CAP_ARM_PTRAUTH_ADDRESS and KVM_CAP_ARM_PTRAUTH_GENERIC are
  both present, then both KVM_ARM_VCPU_PTRAUTH_ADDRESS and
  KVM_ARM_VCPU_PTRAUTH_GENERIC must be requested or neither must be
  requested.


KVM will reject the initialisation if only one of the features is
requested, so checking and enabling both makes sense to me.

> 
> > @@ -521,6 +527,17 @@ bool kvm_arm_get_host_cpu_features(ARMHostCPUFeatures *ahcf)
> >*/
> >   struct kvm_vcpu_init init = { .target = -1, };
> >   +/*
> > + * Ask for Pointer Authentication if supported. We can't play the
> > + * SVE trick of synthetising the ID reg as KVM won't tell us
> 
> synthesizing

Yup.

> 
> > + * whether we have the architected or IMPDEF version of PAuth, so
> > + * we have to use the actual ID regs.
> > + */
> > +if (kvm_arm_pauth_supported()) {
> > +init.features[0] |= (1 << KVM_ARM_VCPU_PTRAUTH_ADDRESS |
> > +1 << KVM_ARM_VCPU_PTRAUTH_GENERIC);
> 
> Align the two 1's.

Gah, another of these... Will fix.

> 
> Otherwise, it looks good.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v2 1/5] hw/arm/virt: Key enablement of highmem PCIe on highmem_ecam

2022-01-04 Thread Marc Zyngier
Hi Eric,

On Tue, 04 Jan 2022 15:31:33 +,
Eric Auger  wrote:
> 
> Hi Marc,
> 
> On 12/27/21 4:53 PM, Marc Zyngier wrote:
> > Hi Eric,
> >
> > Picking this up again after a stupidly long time...
> >
> > On Mon, 04 Oct 2021 13:00:21 +0100,
> > Eric Auger  wrote:
> >> Hi Marc,
> >>
> >> On 10/3/21 6:46 PM, Marc Zyngier wrote:
> >>> Currently, the highmem PCIe region is oddly keyed on the highmem
> >>> attribute instead of highmem_ecam. Move the enablement of this PCIe
> >>> region over to highmem_ecam.
> >>>
> >>> Signed-off-by: Marc Zyngier 
> >>> ---
> >>>  hw/arm/virt-acpi-build.c | 10 --
> >>>  hw/arm/virt.c|  4 ++--
> >>>  2 files changed, 6 insertions(+), 8 deletions(-)
> >>>
> >>> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> >>> index 037cc1fd82..d7bef0e627 100644
> >>> --- a/hw/arm/virt-acpi-build.c
> >>> +++ b/hw/arm/virt-acpi-build.c
> >>> @@ -157,10 +157,9 @@ static void acpi_dsdt_add_virtio(Aml *scope,
> >>>  }
> >>>  
> >>>  static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
> >>> -  uint32_t irq, bool use_highmem, bool highmem_ecam,
> >>> -  VirtMachineState *vms)
> >>> +  uint32_t irq, VirtMachineState *vms)
> >>>  {
> >>> -int ecam_id = VIRT_ECAM_ID(highmem_ecam);
> >>> +int ecam_id = VIRT_ECAM_ID(vms->highmem_ecam);
> >>>  struct GPEXConfig cfg = {
> >>>  .mmio32 = memmap[VIRT_PCIE_MMIO],
> >>>  .pio= memmap[VIRT_PCIE_PIO],
> >>> @@ -169,7 +168,7 @@ static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
> >>>  .bus= vms->bus,
> >>>  };
> >>>  
> >>> -if (use_highmem) {
> >>> +if (vms->highmem_ecam) {
> >> highmem_ecam is more restrictive than use_highmem:
> >> vms->highmem_ecam &= vms->highmem && (!firmware_loaded || aarch64);
> >>
> >> If I remember correctly there was a problem using highmem ECAM with 32b
> >> AAVMF FW.
> >>
> >> However 5125f9cd2532 ("hw/arm/virt: Add high MMIO PCI region, 512G in
> >> size") introduced high MMIO PCI region without this constraint.
> > Then I really don't understand the point of this highmem_ecam. We only
> > register the highmem version if highmem_ecam is set (see the use of
> > VIRT_ECAM_ID() to pick the right ECAM window).
> 
> but aren't we talking about different regions? On one hand the [high]
> MMIO region (512GB wide) and the [high] ECAM region (256MB large).
> To me you can enable either independently. High MMIO region is used by
> some devices likes ivshmem/video cards while high ECAM was introduced to
> extend the number of supported buses: 601d626d148a (hw/arm/virt: Add a
> new 256MB ECAM region).
> 
> with the above change the high MMIO region won't be set with 32b
> FW+kernel and LPAE whereas it is currently.
> 
> high ECAM was not supported by 32b FW, hence the highmem_ecam.
> 
> but maybe I miss your point?

There are two issues.

First, I have been conflating the ECAM and MMIO ranges, and you only
made me realise that they were supposed to be independent.  I still
think the keying on highmem is wrong, but the main issue is that the
highmem* flags don't quite describe the shape of the platform.

All that these booleans indicate is whether the features they describe
(the high MMIO range, the high ECAM range, and in one of my patches
the high RD range) are *allowed* to live above 4GB; they do not
express whether those features are actually usable (i.e. fit in the PA
range).

Maybe we need to be more thorough in the way we describe the extended
region in the VirtMachineState structure:

- highmem: overall control for anything that *can* live above 4GB
- highmem_ecam: Has a PCIe ECAM region above 256GB
- highmem_mmio: Has a PCIe MMIO region above 256GB
- highmem_redist: Has 512 RDs above 256GB

Crucially, the last 3 items must fit in the PA range or be disabled.

We have highmem_ecam which is keyed on highmem, but not on the PA
range.  highmem_mmio doesn't exist at all (we use highmem instead),
and I'm only introducing highmem_redist.

For these 3 ranges, we should have something like

vms->highmem_xxx &= (vms->highmem &&
                     (vms->memmap[XXX].base + vms->memmap[XXX].size) <
                     vms->highest_gpa);

and treat them as independent entities.  Unless someone shouts, I'm
going to go ahead and implement this logic.
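A minimal sketch of that per-region rule (hypothetical names; it uses an inclusive region end against highest_gpa, glossing over the exact off-by-one treatment):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/*
 * Hypothetical sketch of the proposed rule: a highmem_xxx flag survives
 * only if highmem is on and the region's last byte is addressable
 * within the guest-physical address space.
 */
static bool highmem_region_enabled(bool highmem, uint64_t base,
                                   uint64_t size, uint64_t highest_gpa)
{
    return highmem && (base + size - 1 <= highest_gpa);
}
```

For example, a 512 GiB range based at 256 GiB survives with a 40-bit highest GPA but is disabled with a 36-bit one, matching the independent-entities treatment described above.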

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH] hw/arm/virt: KVM: Enable PAuth when supported by the host

2022-01-03 Thread Marc Zyngier
Hi Andrew,

On Mon, 03 Jan 2022 13:46:01 +,
Andrew Jones  wrote:
> 
> Hi Marc,
> 
> On Tue, Dec 28, 2021 at 06:23:47PM +, Marc Zyngier wrote:
> > Add basic support for Pointer Authentication when running a KVM
> > guest and that the host supports it, loosely based on the SVE
> > support.
> > 
> > Although the feature is enabled by default when the host advertises
> > it, it is possible to disable it by setting the 'pauth=off' CPU
> > property.
> > 
> > Tested on an Apple M1 running 5.16-rc6.
> > 
> > Cc: Eric Auger 
> > Cc: Andrew Jones 
> > Cc: Richard Henderson 
> > Cc: Peter Maydell 
> > Signed-off-by: Marc Zyngier 
> > ---
> >  docs/system/arm/cpu-features.rst |  5 +
> >  target/arm/cpu.c |  1 +
> >  target/arm/cpu.h |  1 +
> >  target/arm/cpu64.c   | 36 
> >  target/arm/kvm.c | 13 
> >  target/arm/kvm64.c   | 10 +
> >  target/arm/kvm_arm.h |  7 +++
> >  7 files changed, 73 insertions(+)
> > 
> > diff --git a/docs/system/arm/cpu-features.rst b/docs/system/arm/cpu-features.rst
> > index 584eb17097..c9e39546a5 100644
> > --- a/docs/system/arm/cpu-features.rst
> > +++ b/docs/system/arm/cpu-features.rst
> > @@ -211,6 +211,11 @@ the list of KVM VCPU features and their descriptions.
> > influence the guest scheduler behavior and/or be
> > exposed to the guest userspace.
> >  
> > +  pauth            Enable or disable ``FEAT_Pauth``, pointer
> > +                   authentication.  By default, the feature is enabled
> > +                   with ``-cpu host`` if supported by both the host
> > +                   kernel and the hardware.
> > +
> 
> Thanks for considering a documentation update. In this case, though, I
> think we should delete the "TCG VCPU Features" pauth paragraph, rather
> than add a new "KVM VCPU Features" pauth paragraph. We don't need to
> document each CPU feature. We just document complex ones, like sve*,
> KVM specific ones (kvm-*), and TCG specific ones (now only pauth-impdef).

Sure, works for me. Do we need to keep a trace of the available
options? I'm not sure how a user is supposed to find out about those
(I always end-up grepping through the code base, and something tells
me I'm doing it wrong...). The QMP stuff flies way over my head.

> >  TCG VCPU Features
> >  =
> >  
> > diff --git a/target/arm/cpu.c b/target/arm/cpu.c
> > index a211804fd3..68b09cbc6a 100644
> > --- a/target/arm/cpu.c
> > +++ b/target/arm/cpu.c
> > @@ -2091,6 +2091,7 @@ static void arm_host_initfn(Object *obj)
> >  kvm_arm_set_cpu_features_from_host(cpu);
> >  if (arm_feature(&cpu->env, ARM_FEATURE_AARCH64)) {
> >  aarch64_add_sve_properties(obj);
> > +aarch64_add_pauth_properties(obj);
> >  }
> >  #else
> >  hvf_arm_set_cpu_features_from_host(cpu);
> > diff --git a/target/arm/cpu.h b/target/arm/cpu.h
> > index e33f37b70a..c6a4d50e82 100644
> > --- a/target/arm/cpu.h
> > +++ b/target/arm/cpu.h
> > @@ -1076,6 +1076,7 @@ void aarch64_sve_narrow_vq(CPUARMState *env, unsigned vq);
> >  void aarch64_sve_change_el(CPUARMState *env, int old_el,
> > int new_el, bool el0_a64);
> >  void aarch64_add_sve_properties(Object *obj);
> > +void aarch64_add_pauth_properties(Object *obj);
> >  
> >  /*
> >   * SVE registers are encoded in KVM's memory in an endianness-invariant format.
> > diff --git a/target/arm/cpu64.c b/target/arm/cpu64.c
> > index 15245a60a8..305c0e19fe 100644
> > --- a/target/arm/cpu64.c
> > +++ b/target/arm/cpu64.c
> > @@ -625,6 +625,42 @@ void aarch64_add_sve_properties(Object *obj)
> >  #endif
> >  }
> >  
> > +static bool cpu_arm_get_pauth(Object *obj, Error **errp)
> > +{
> > +ARMCPU *cpu = ARM_CPU(obj);
> > +return cpu_isar_feature(aa64_pauth, cpu);
> > +}
> > +
> > +static void cpu_arm_set_pauth(Object *obj, bool value, Error **errp)
> > +{
> > +ARMCPU *cpu = ARM_CPU(obj);
> > +uint64_t t;
> > +
> > +if (value) {
> > +if (!kvm_arm_pauth_supported()) {
> > +error_setg(errp, "'pauth' feature not supported by KVM on this host");
> > +}
> > +
> > +   

[PATCH v2] hw/arm/virt: KVM: Enable PAuth when supported by the host

2022-01-03 Thread Marc Zyngier
Add basic support for Pointer Authentication when running a KVM
guest, provided that the host supports it, loosely based on the SVE
support.

Although the feature is enabled by default when the host advertises
it, it is possible to disable it by setting the 'pauth=off' CPU
property. The 'pauth' comment is removed from cpu-features.rst,
as it is now common to both TCG and KVM.

Tested on an Apple M1 running 5.16-rc6.

Cc: Eric Auger 
Cc: Andrew Jones 
Cc: Richard Henderson 
Cc: Peter Maydell 
Signed-off-by: Marc Zyngier 
---
* From v1:
  - Drop 'pauth' documentation
  - Make the TCG path common to both TCG and KVM
  - Some tidying up

 docs/system/arm/cpu-features.rst |  4 
 target/arm/cpu.c | 14 --
 target/arm/cpu.h |  1 +
 target/arm/cpu64.c   | 33 
 target/arm/kvm64.c   | 21 
 5 files changed, 55 insertions(+), 18 deletions(-)

diff --git a/docs/system/arm/cpu-features.rst b/docs/system/arm/cpu-features.rst
index 584eb17097..3e626c4b68 100644
--- a/docs/system/arm/cpu-features.rst
+++ b/docs/system/arm/cpu-features.rst
@@ -217,10 +217,6 @@ TCG VCPU Features
 TCG VCPU features are CPU features that are specific to TCG.
 Below is the list of TCG VCPU features and their descriptions.
 
-  pauth            Enable or disable ``FEAT_Pauth``, pointer
-                   authentication.  By default, the feature is
-                   enabled with ``-cpu max``.
-
   pauth-impdef When ``FEAT_Pauth`` is enabled, either the
*impdef* (Implementation Defined) algorithm
is enabled or the *architected* QARMA algorithm
diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index a211804fd3..d96cc4ef18 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -1380,18 +1380,11 @@ void arm_cpu_finalize_features(ARMCPU *cpu, Error **errp)
 return;
 }
 
-/*
- * KVM does not support modifications to this feature.
- * We have not registered the cpu properties when KVM
- * is in use, so the user will not be able to set them.
- */
-if (!kvm_enabled()) {
-arm_cpu_pauth_finalize(cpu, &local_err);
-if (local_err != NULL) {
+   arm_cpu_pauth_finalize(cpu, &local_err);
+   if (local_err != NULL) {
 error_propagate(errp, local_err);
 return;
-}
-}
+   }
 }
 
 if (kvm_enabled()) {
@@ -2091,6 +2084,7 @@ static void arm_host_initfn(Object *obj)
 kvm_arm_set_cpu_features_from_host(cpu);
 if (arm_feature(&cpu->env, ARM_FEATURE_AARCH64)) {
 aarch64_add_sve_properties(obj);
+aarch64_add_pauth_properties(obj);
 }
 #else
 hvf_arm_set_cpu_features_from_host(cpu);
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index e33f37b70a..c6a4d50e82 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -1076,6 +1076,7 @@ void aarch64_sve_narrow_vq(CPUARMState *env, unsigned vq);
 void aarch64_sve_change_el(CPUARMState *env, int old_el,
int new_el, bool el0_a64);
 void aarch64_add_sve_properties(Object *obj);
+void aarch64_add_pauth_properties(Object *obj);
 
 /*
  * SVE registers are encoded in KVM's memory in an endianness-invariant format.
diff --git a/target/arm/cpu64.c b/target/arm/cpu64.c
index 15245a60a8..d5c0bce1c4 100644
--- a/target/arm/cpu64.c
+++ b/target/arm/cpu64.c
@@ -630,6 +630,17 @@ void arm_cpu_pauth_finalize(ARMCPU *cpu, Error **errp)
 int arch_val = 0, impdef_val = 0;
 uint64_t t;
 
+if (kvm_enabled()) {
+if (cpu->prop_pauth) {
+if (!cpu_isar_feature(aa64_pauth, cpu)) {
+error_setg(errp, "'pauth' feature not supported by KVM on this host");
+}
+
+return;
+}
+/* Fall through to disable PAuth */
+}
+
 /* TODO: Handle HaveEnhancedPAC, HaveEnhancedPAC2, HaveFPAC. */
 if (cpu->prop_pauth) {
 if (cpu->prop_pauth_impdef) {
@@ -655,6 +666,23 @@ static Property arm_cpu_pauth_property =
 static Property arm_cpu_pauth_impdef_property =
 DEFINE_PROP_BOOL("pauth-impdef", ARMCPU, prop_pauth_impdef, false);
 
+void aarch64_add_pauth_properties(Object *obj)
+{
+ARMCPU *cpu = ARM_CPU(obj);
+
+/* Default to PAUTH on, with the architected algorithm on TCG. */
+qdev_property_add_static(DEVICE(obj), &arm_cpu_pauth_property);
+if (kvm_enabled()) {
+/*
+ * Mirror PAuth support from the probed sysregs back into the
+ * property for KVM. Is it just a bit backward? Yes it is!
+ */
+cpu->prop_pauth = cpu_isar_feature(aa64_pauth, cpu);
+} else {
+qdev_property_add_static(DEVICE(obj), &arm_cpu_pauth_impdef_property);
+}
+}
+
 /* -cpu max: if KVM is enabled, like

[PATCH] hw/arm/virt: KVM: Enable PAuth when supported by the host

2021-12-28 Thread Marc Zyngier
Add basic support for Pointer Authentication when running a KVM
guest, provided that the host supports it, loosely based on the SVE
support.

Although the feature is enabled by default when the host advertises
it, it is possible to disable it by setting the 'pauth=off' CPU
property.

Tested on an Apple M1 running 5.16-rc6.

Cc: Eric Auger 
Cc: Andrew Jones 
Cc: Richard Henderson 
Cc: Peter Maydell 
Signed-off-by: Marc Zyngier 
---
 docs/system/arm/cpu-features.rst |  5 +
 target/arm/cpu.c |  1 +
 target/arm/cpu.h |  1 +
 target/arm/cpu64.c   | 36 
 target/arm/kvm.c | 13 
 target/arm/kvm64.c   | 10 +
 target/arm/kvm_arm.h |  7 +++
 7 files changed, 73 insertions(+)

diff --git a/docs/system/arm/cpu-features.rst b/docs/system/arm/cpu-features.rst
index 584eb17097..c9e39546a5 100644
--- a/docs/system/arm/cpu-features.rst
+++ b/docs/system/arm/cpu-features.rst
@@ -211,6 +211,11 @@ the list of KVM VCPU features and their descriptions.
influence the guest scheduler behavior and/or be
exposed to the guest userspace.
 
+  pauth            Enable or disable ``FEAT_Pauth``, pointer
+                   authentication.  By default, the feature is enabled
+                   with ``-cpu host`` if supported by both the host
+                   kernel and the hardware.
+
 TCG VCPU Features
 =
 
diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index a211804fd3..68b09cbc6a 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -2091,6 +2091,7 @@ static void arm_host_initfn(Object *obj)
 kvm_arm_set_cpu_features_from_host(cpu);
 if (arm_feature(&cpu->env, ARM_FEATURE_AARCH64)) {
 aarch64_add_sve_properties(obj);
+aarch64_add_pauth_properties(obj);
 }
 #else
 hvf_arm_set_cpu_features_from_host(cpu);
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index e33f37b70a..c6a4d50e82 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -1076,6 +1076,7 @@ void aarch64_sve_narrow_vq(CPUARMState *env, unsigned vq);
 void aarch64_sve_change_el(CPUARMState *env, int old_el,
int new_el, bool el0_a64);
 void aarch64_add_sve_properties(Object *obj);
+void aarch64_add_pauth_properties(Object *obj);
 
 /*
  * SVE registers are encoded in KVM's memory in an endianness-invariant format.
diff --git a/target/arm/cpu64.c b/target/arm/cpu64.c
index 15245a60a8..305c0e19fe 100644
--- a/target/arm/cpu64.c
+++ b/target/arm/cpu64.c
@@ -625,6 +625,42 @@ void aarch64_add_sve_properties(Object *obj)
 #endif
 }
 
+static bool cpu_arm_get_pauth(Object *obj, Error **errp)
+{
+ARMCPU *cpu = ARM_CPU(obj);
+return cpu_isar_feature(aa64_pauth, cpu);
+}
+
+static void cpu_arm_set_pauth(Object *obj, bool value, Error **errp)
+{
+ARMCPU *cpu = ARM_CPU(obj);
+uint64_t t;
+
+if (value) {
+if (!kvm_arm_pauth_supported()) {
+error_setg(errp, "'pauth' feature not supported by KVM on this host");
+}
+
+return;
+}
+
+/*
+ * If the host supports PAuth, we only end-up here if the user has
+ * disabled the support, and value is false.
+ */
+t = cpu->isar.id_aa64isar1;
+t = FIELD_DP64(t, ID_AA64ISAR1, APA, value);
+t = FIELD_DP64(t, ID_AA64ISAR1, GPA, value);
+t = FIELD_DP64(t, ID_AA64ISAR1, API, value);
+t = FIELD_DP64(t, ID_AA64ISAR1, GPI, value);
+cpu->isar.id_aa64isar1 = t;
+}
+
+void aarch64_add_pauth_properties(Object *obj)
+{
+object_property_add_bool(obj, "pauth", cpu_arm_get_pauth, cpu_arm_set_pauth);
+}
+
 void arm_cpu_pauth_finalize(ARMCPU *cpu, Error **errp)
 {
 int arch_val = 0, impdef_val = 0;
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index bbf1ce7ba3..71e2f46ce8 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -84,6 +84,7 @@ bool kvm_arm_create_scratch_host_vcpu(const uint32_t *cpus_to_try,
 if (vmfd < 0) {
 goto err;
 }
+
 cpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);
 if (cpufd < 0) {
 goto err;
@@ -94,6 +95,18 @@ bool kvm_arm_create_scratch_host_vcpu(const uint32_t *cpus_to_try,
 goto finish;
 }
 
+/*
+ * Ask for Pointer Authentication if supported. We can't play the
+ * SVE trick of synthetising the ID reg as KVM won't tell us
+ * whether we have the architected or IMPDEF version of PAuth, so
+ * we have to use the actual ID regs.
+ */
+if (ioctl(vmfd, KVM_CHECK_EXTENSION, KVM_CAP_ARM_PTRAUTH_ADDRESS) > 0 &&
+ioctl(vmfd, KVM_CHECK_EXTENSION, KVM_CAP_ARM_PTRAUTH_GENERIC) > 0) {
+init->features[0] |= (1 << KVM_ARM_VCPU_PTRAUTH_ADDRESS |
+  1 << KVM_ARM_VCPU_PTRAUTH_GENERIC);
+}
+
 if (init->target == -1) {
  

[PATCH v3 4/5] hw/arm/virt: Use the PA range to compute the memory map

2021-12-27 Thread Marc Zyngier
The highmem attribute is nothing but another way to express the
PA range of a VM. To support HW that has a smaller PA range than
what QEMU assumes, pass this PA range to the virt_set_memmap()
function, allowing it to correctly exclude highmem devices
if they are outside of the PA range.

Signed-off-by: Marc Zyngier 
---
 hw/arm/virt.c | 64 ---
 1 file changed, 50 insertions(+), 14 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 84dd3b36fb..212079e7a6 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1660,10 +1660,10 @@ static uint64_t virt_cpu_mp_affinity(VirtMachineState *vms, int idx)
 return arm_cpu_mp_affinity(idx, clustersz);
 }
 
-static void virt_set_memmap(VirtMachineState *vms)
+static void virt_set_memmap(VirtMachineState *vms, int pa_bits)
 {
 MachineState *ms = MACHINE(vms);
-hwaddr base, device_memory_base, device_memory_size;
+hwaddr base, device_memory_base, device_memory_size, memtop;
 int i;
 
 vms->memmap = extended_memmap;
@@ -1678,11 +1678,9 @@ static void virt_set_memmap(VirtMachineState *vms)
 exit(EXIT_FAILURE);
 }
 
-if (!vms->highmem &&
-vms->memmap[VIRT_MEM].base + ms->maxram_size > 4 * GiB) {
-error_report("highmem=off, but memory crosses the 4GiB limit\n");
-exit(EXIT_FAILURE);
-}
+if (!vms->highmem)
+   pa_bits = 32;
+
 /*
  * We compute the base of the high IO region depending on the
  * amount of initial and device memory. The device memory start/size
@@ -1695,7 +1693,12 @@ static void virt_set_memmap(VirtMachineState *vms)
 device_memory_size = ms->maxram_size - ms->ram_size + ms->ram_slots * GiB;
 
 /* Base address of the high IO region */
-base = device_memory_base + ROUND_UP(device_memory_size, GiB);
+memtop = base = device_memory_base + ROUND_UP(device_memory_size, GiB);
+if (memtop > BIT_ULL(pa_bits)) {
+   error_report("Addressing limited to %d bits, but memory exceeds it by %llu bytes\n",
+pa_bits, memtop - BIT_ULL(pa_bits));
+exit(EXIT_FAILURE);
+}
 if (base < device_memory_base) {
 error_report("maxmem/slots too huge");
 exit(EXIT_FAILURE);
@@ -1712,9 +1715,17 @@ static void virt_set_memmap(VirtMachineState *vms)
 vms->memmap[i].size = size;
 base += size;
 }
-vms->highest_gpa = (vms->highmem ?
-base :
-vms->memmap[VIRT_MEM].base + ms->maxram_size) - 1;
+
+/*
+ * If base fits within pa_bits, all good. If it doesn't, limit it
+ * to the end of RAM, which is guaranteed to fit within pa_bits.
+ */
+if (base <= BIT_ULL(pa_bits)) {
+vms->highest_gpa = base - 1;
+} else {
+vms->highest_gpa = memtop - 1;
+}
+
 if (device_memory_size > 0) {
 ms->device_memory = g_malloc0(sizeof(*ms->device_memory));
 ms->device_memory->base = device_memory_base;
@@ -1905,12 +1916,38 @@ static void machvirt_init(MachineState *machine)
 unsigned int smp_cpus = machine->smp.cpus;
 unsigned int max_cpus = machine->smp.max_cpus;
 
+possible_cpus = mc->possible_cpu_arch_ids(machine);
+
 /*
  * In accelerated mode, the memory map is computed earlier in kvm_type()
  * to create a VM with the right number of IPA bits.
  */
 if (!vms->memmap) {
-virt_set_memmap(vms);
+Object *cpuobj;
+ARMCPU *armcpu;
+int pa_bits;
+
+/*
+ * Instantiate a temporary CPU object to find out about what
+ * we are about to deal with. Once this is done, get rid of
+ * the object.
+ */
+cpuobj = object_new(possible_cpus->cpus[0].type);
+armcpu = ARM_CPU(cpuobj);
+
+if (object_property_get_bool(cpuobj, "aarch64", NULL)) {
+pa_bits = arm_pamax(armcpu);
+} else if (arm_feature(&armcpu->env, ARM_FEATURE_LPAE)) {
+/* v7 with LPAE */
+pa_bits = 40;
+} else {
+/* Anything else */
+pa_bits = 32;
+}
+
+object_unref(cpuobj);
+
+virt_set_memmap(vms, pa_bits);
 }
 
 /* We can probe only here because during property set
@@ -1992,7 +2029,6 @@ static void machvirt_init(MachineState *machine)
 
 create_fdt(vms);
 
-possible_cpus = mc->possible_cpu_arch_ids(machine);
 assert(possible_cpus->len == max_cpus);
 for (n = 0; n < possible_cpus->len; n++) {
 Object *cpuobj;
@@ -2648,7 +2684,7 @@ static int virt_kvm_type(MachineState *ms, const char *type_str)
 max_vm_pa_size = kvm_arm_get_max_vm_ipa_size(ms, &fixed_ipa);
 
 /* we freeze the memory map to compute the highest gpa */
-virt_set_memmap(vms);
+virt_set_memmap(vms, max_vm_pa_size);
 
 requested_pa_size = 64 - clz64(vms->highest_gpa);
 
-- 
2.30.2
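For the "aarch64" branch in the hunk above, arm_pamax() ultimately derives the PA width from the ID_AA64MMFR0_EL1.PARange field. A minimal standalone decoder is sketched below; pa_range_to_bits() and the 48-bit fallback for reserved encodings are this example's own choices, not QEMU code, though the table itself follows the Arm ARM encodings:

```c
#include <stddef.h>

/*
 * Decode ID_AA64MMFR0_EL1.PARange (bits [3:0]) into a PA width in
 * bits -- roughly what arm_pamax() boils down to for an AArch64 CPU.
 * Falling back to 48 bits for reserved encodings is an assumption of
 * this sketch, not necessarily what QEMU does.
 */
static int pa_range_to_bits(unsigned int parange)
{
    static const int bits[] = { 32, 36, 40, 42, 44, 48, 52 };

    if (parange < sizeof(bits) / sizeof(bits[0])) {
        return bits[parange];
    }
    return 48; /* fallback for reserved encodings */
}
```

For example, an Apple M1 host reports PARange = 0b0001, i.e. a 36-bit PA space, which is exactly the "pathologically reduced" case this series is about.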




[PATCH v3 0/5] target/arm: Reduced-IPA space and highmem=off fixes

2021-12-27 Thread Marc Zyngier
Here's another stab at enabling QEMU on systems with pathologically
reduced IPA ranges such as the Apple M1 (previous version at [1]).
Eventually, we're able to run a KVM guest with more than just 3GB of
RAM on a system with a 36bit IPA space, and at most 123 vCPUs.

This series does a few things:
- decouple the enabling of the highmem PCIe region from the highmem
  attribute
- introduce a new attribute to control the enabling of the highmem
  GICv3 redistributors
- correctly cap the PA range when highmem is off
- generalise the highmem behaviour to any PA range
- disable both highmem PCIe and GICv3 RDs when they are outside of the
  PA range

This has been tested on an M1-based Mac-mini running Linux v5.16-rc6
with both KVM and TCG.

* From v2:
  - Fixed checking of the maximum memory against the IPA space
  - Fixed TCG memory map creation
  - Rebased on top of QEMU's 89f3bfa326
  - Collected Andrew's RBs, with thanks

[1] https://lore.kernel.org/r/20211003164605.3116450-1-...@kernel.org
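
The capping policy the series converges on can be sketched in isolation. This is an illustrative toy, not the actual QEMU code: cap_highest_gpa() and highmem_devs are names invented for this example, standing in for the virt_set_memmap() logic and the highmem_ecam/highmem_redists flags:

```c
#include <stdint.h>

#define BIT_ULL(n) (1ULL << (n))

/*
 * If the end of the high IO region fits within the PA range, use it
 * as the highest GPA; otherwise fall back to the top of RAM (which
 * is guaranteed to fit) and flag the highmem devices as disabled.
 */
static uint64_t cap_highest_gpa(uint64_t io_base, uint64_t memtop,
                                int pa_bits, int *highmem_devs)
{
    if (io_base <= BIT_ULL(pa_bits)) {
        return io_base - 1;
    }
    *highmem_devs = 0; /* highmem_ecam/highmem_redists get cleared */
    return memtop - 1;
}
```

With a 36-bit IPA space (Apple M1), a high IO region ending at 1TiB gets capped back to the end of RAM, and the highmem PCIe/GICv3 regions are dropped rather than placed where the guest cannot address them.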

Marc Zyngier (5):
  hw/arm/virt: Key enablement of highmem PCIe on highmem_ecam
  hw/arm/virt: Add a control for the highmem redistributors
  hw/arm/virt: Honor highmem setting when computing the memory map
  hw/arm/virt: Use the PA range to compute the memory map
  hw/arm/virt: Disable highmem devices that don't fit in the PA range

 hw/arm/virt-acpi-build.c | 12 +++
 hw/arm/virt.c| 67 ++--
 include/hw/arm/virt.h|  4 ++-
 3 files changed, 67 insertions(+), 16 deletions(-)

-- 
2.30.2




[PATCH v3 1/5] hw/arm/virt: Key enablement of highmem PCIe on highmem_ecam

2021-12-27 Thread Marc Zyngier
Currently, the highmem PCIe region is oddly keyed on the highmem
attribute instead of highmem_ecam. Move the enablement of this PCIe
region over to highmem_ecam.

Reviewed-by: Andrew Jones 
Signed-off-by: Marc Zyngier 
---
 hw/arm/virt-acpi-build.c | 10 --
 hw/arm/virt.c|  4 ++--
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index d0f4867fdf..d04c107fd8 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -158,10 +158,9 @@ static void acpi_dsdt_add_virtio(Aml *scope,
 }
 
 static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
-  uint32_t irq, bool use_highmem, bool highmem_ecam,
-  VirtMachineState *vms)
+  uint32_t irq, VirtMachineState *vms)
 {
-int ecam_id = VIRT_ECAM_ID(highmem_ecam);
+int ecam_id = VIRT_ECAM_ID(vms->highmem_ecam);
 struct GPEXConfig cfg = {
 .mmio32 = memmap[VIRT_PCIE_MMIO],
 .pio= memmap[VIRT_PCIE_PIO],
@@ -170,7 +169,7 @@ static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
 .bus= vms->bus,
 };
 
-if (use_highmem) {
+if (vms->highmem_ecam) {
 cfg.mmio64 = memmap[VIRT_HIGH_PCIE_MMIO];
 }
 
@@ -868,8 +867,7 @@ build_dsdt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
 acpi_dsdt_add_fw_cfg(scope, &memmap[VIRT_FW_CFG]);
 acpi_dsdt_add_virtio(scope, &memmap[VIRT_MMIO],
 (irqmap[VIRT_MMIO] + ARM_SPI_BASE), NUM_VIRTIO_TRANSPORTS);
-acpi_dsdt_add_pci(scope, memmap, (irqmap[VIRT_PCIE] + ARM_SPI_BASE),
-  vms->highmem, vms->highmem_ecam, vms);
+acpi_dsdt_add_pci(scope, memmap, irqmap[VIRT_PCIE] + ARM_SPI_BASE, vms);
 if (vms->acpi_dev) {
 build_ged_aml(scope, "\\_SB."GED_DEVICE,
   HOTPLUG_HANDLER(vms->acpi_dev),
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 6bce595aba..a54dc43175 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1412,7 +1412,7 @@ static void create_pcie(VirtMachineState *vms)
  mmio_reg, base_mmio, size_mmio);
 memory_region_add_subregion(get_system_memory(), base_mmio, mmio_alias);
 
-if (vms->highmem) {
+if (vms->highmem_ecam) {
 /* Map high MMIO space */
 MemoryRegion *high_mmio_alias = g_new0(MemoryRegion, 1);
 
@@ -1466,7 +1466,7 @@ static void create_pcie(VirtMachineState *vms)
 qemu_fdt_setprop_sized_cells(ms->fdt, nodename, "reg",
  2, base_ecam, 2, size_ecam);
 
-if (vms->highmem) {
+if (vms->highmem_ecam) {
 qemu_fdt_setprop_sized_cells(ms->fdt, nodename, "ranges",
  1, FDT_PCI_RANGE_IOPORT, 2, 0,
  2, base_pio, 2, size_pio,
-- 
2.30.2




[PATCH v3 3/5] hw/arm/virt: Honor highmem setting when computing the memory map

2021-12-27 Thread Marc Zyngier
Even when the VM is configured with highmem=off, the highest_gpa
field includes devices that are above the 4GiB limit.
Similarly, nothing seems to check that the memory is within
the limit set by the highmem=off option.

This leads to failures in virt_kvm_type() on systems that have
a crippled IPA range, as the reported IPA space is larger than
what it should be.

Instead, honor the user-specified limit to only use the devices
at the lowest end of the spectrum, and fail if we have memory
crossing the 4GiB limit.

Reviewed-by: Andrew Jones 
Signed-off-by: Marc Zyngier 
---
 hw/arm/virt.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 8b600d82c1..84dd3b36fb 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1678,6 +1678,11 @@ static void virt_set_memmap(VirtMachineState *vms)
 exit(EXIT_FAILURE);
 }
 
+if (!vms->highmem &&
+vms->memmap[VIRT_MEM].base + ms->maxram_size > 4 * GiB) {
+error_report("highmem=off, but memory crosses the 4GiB limit\n");
+exit(EXIT_FAILURE);
+}
 /*
  * We compute the base of the high IO region depending on the
  * amount of initial and device memory. The device memory start/size
@@ -1707,7 +1712,9 @@ static void virt_set_memmap(VirtMachineState *vms)
 vms->memmap[i].size = size;
 base += size;
 }
-vms->highest_gpa = base - 1;
+vms->highest_gpa = (vms->highmem ?
+base :
+vms->memmap[VIRT_MEM].base + ms->maxram_size) - 1;
 if (device_memory_size > 0) {
 ms->device_memory = g_malloc0(sizeof(*ms->device_memory));
 ms->device_memory->base = device_memory_base;
-- 
2.30.2




[PATCH v3 2/5] hw/arm/virt: Add a control for the highmem redistributors

2021-12-27 Thread Marc Zyngier
Just like we can control the enablement of the highmem PCIe region
using highmem_ecam, let's add a control for the highmem GICv3
redistributor region.

Similarly to highmem_ecam, these redistributors are disabled when
highmem is off.

Reviewed-by: Andrew Jones 
Signed-off-by: Marc Zyngier 
---
 hw/arm/virt-acpi-build.c | 2 ++
 hw/arm/virt.c| 3 +++
 include/hw/arm/virt.h| 4 +++-
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index d04c107fd8..fcbff9d835 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -946,6 +946,8 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
 acpi_add_table(table_offsets, tables_blob);
 build_fadt_rev5(tables_blob, tables->linker, vms, dsdt);
 
+vms->highmem_redists &= vms->highmem;
+
 acpi_add_table(table_offsets, tables_blob);
 build_madt(tables_blob, tables->linker, vms);
 
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index a54dc43175..8b600d82c1 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2105,6 +2105,8 @@ static void machvirt_init(MachineState *machine)
 
 virt_flash_fdt(vms, sysmem, secure_sysmem ?: sysmem);
 
+vms->highmem_redists &= vms->highmem;
+
 create_gic(vms, sysmem);
 
 virt_cpu_post_init(vms, sysmem);
@@ -2802,6 +2804,7 @@ static void virt_instance_init(Object *obj)
 vms->gic_version = VIRT_GIC_VERSION_NOSEL;
 
 vms->highmem_ecam = !vmc->no_highmem_ecam;
+vms->highmem_redists = true;
 
 if (vmc->no_its) {
 vms->its = false;
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index dc6b66ffc8..726623a176 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -143,6 +143,7 @@ struct VirtMachineState {
 bool secure;
 bool highmem;
 bool highmem_ecam;
+bool highmem_redists;
 bool its;
 bool tcg_its;
 bool virt;
@@ -189,7 +190,8 @@ static inline int virt_gicv3_redist_region_count(VirtMachineState *vms)
 
 assert(vms->gic_version == VIRT_GIC_VERSION_3);
 
-return MACHINE(vms)->smp.cpus > redist0_capacity ? 2 : 1;
+return (MACHINE(vms)->smp.cpus > redist0_capacity &&
+vms->highmem_redists) ? 2 : 1;
 }
 
 #endif /* QEMU_ARM_VIRT_H */
-- 
2.30.2
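
The virt_gicv3_redist_region_count() change in the hunk above is easy to exercise in isolation. The toy below is a sketch with simplified names, not the actual QEMU helper; redist0_capacity is taken as 123, the low-region vCPU limit mentioned in the cover letter:

```c
/*
 * Simplified stand-in for virt_gicv3_redist_region_count(): once the
 * highmem redistributors are disabled, only the low region is
 * reported, even when the vCPU count exceeds its capacity -- the
 * second, high region is simply not exposed.
 */
static int redist_region_count(unsigned int smp_cpus,
                               unsigned int redist0_capacity,
                               int highmem_redists)
{
    return (smp_cpus > redist0_capacity && highmem_redists) ? 2 : 1;
}
```

So with highmem_redists cleared by the PA-range check, a machine asking for more vCPUs than the low region can hold still reports a single region, and the overflow has to be caught elsewhere.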




[PATCH v3 5/5] hw/arm/virt: Disable highmem devices that don't fit in the PA range

2021-12-27 Thread Marc Zyngier
Make sure both the highmem PCIe and GICv3 regions are disabled when
they don't fully fit in the PA range.

Reviewed-by: Andrew Jones 
Signed-off-by: Marc Zyngier 
---
 hw/arm/virt.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 212079e7a6..18e615070f 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1723,6 +1723,9 @@ static void virt_set_memmap(VirtMachineState *vms, int pa_bits)
 if (base <= BIT_ULL(pa_bits)) {
 vms->highest_gpa = base - 1;
 } else {
+/* Advertise that we have disabled the highmem devices */
+vms->highmem_ecam = false;
+vms->highmem_redists = false;
 vms->highest_gpa = memtop - 1;
 }
 
-- 
2.30.2




Re: [PATCH v2 4/5] hw/arm/virt: Use the PA range to compute the memory map

2021-12-27 Thread Marc Zyngier
On Mon, 04 Oct 2021 11:11:10 +0100,
Andrew Jones  wrote:
> 
> On Sun, Oct 03, 2021 at 05:46:04PM +0100, Marc Zyngier wrote:
> > The highmem attribute is nothing but another way to express the
> > PA range of a VM. To support HW that has a smaller PA range then
> > what QEMU assumes, pass this PA range to the virt_set_memmap()
> > function, allowing it to correctly exclude highmem devices
> > if they are outside of the PA range.
> > 
> > Signed-off-by: Marc Zyngier 
> > ---
> >  hw/arm/virt.c | 46 +++---
> >  1 file changed, 35 insertions(+), 11 deletions(-)
> > 
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > index 9d2abdbd5f..a572e0c9d9 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -1610,10 +1610,10 @@ static uint64_t virt_cpu_mp_affinity(VirtMachineState *vms, int idx)
> >  return arm_cpu_mp_affinity(idx, clustersz);
> >  }
> >  
> > -static void virt_set_memmap(VirtMachineState *vms)
> > +static void virt_set_memmap(VirtMachineState *vms, int pa_bits)
> >  {
> >  MachineState *ms = MACHINE(vms);
> > -hwaddr base, device_memory_base, device_memory_size;
> > +hwaddr base, device_memory_base, device_memory_size, memtop;
> >  int i;
> >  
> >  vms->memmap = extended_memmap;
> > @@ -1628,9 +1628,12 @@ static void virt_set_memmap(VirtMachineState *vms)
> >  exit(EXIT_FAILURE);
> >  }
> >  
> > -if (!vms->highmem &&
> > -vms->memmap[VIRT_MEM].base + ms->maxram_size > 4 * GiB) {
> > -error_report("highmem=off, but memory crosses the 4GiB limit\n");
> > +if (!vms->highmem)
> > +   pa_bits = 32;
> > +
> > +if (vms->memmap[VIRT_MEM].base + ms->maxram_size > BIT_ULL(pa_bits)) {
> > +   error_report("Addressing limited to %d bits, but memory exceeds it by %llu bytes\n",
> > +pa_bits, vms->memmap[VIRT_MEM].base + ms->maxram_size - BIT_ULL(pa_bits));
> >  exit(EXIT_FAILURE);
> >  }
> >  /*
> > @@ -1645,7 +1648,7 @@ static void virt_set_memmap(VirtMachineState *vms)
> >  device_memory_size = ms->maxram_size - ms->ram_size + ms->ram_slots * GiB;
> >  
> >  /* Base address of the high IO region */
> > -base = device_memory_base + ROUND_UP(device_memory_size, GiB);
> > +memtop = base = device_memory_base + ROUND_UP(device_memory_size, GiB);
> >  if (base < device_memory_base) {
> >  error_report("maxmem/slots too huge");
> >  exit(EXIT_FAILURE);
> > @@ -1662,9 +1665,17 @@ static void virt_set_memmap(VirtMachineState *vms)
> >  vms->memmap[i].size = size;
> >  base += size;
> >  }
> > -vms->highest_gpa = (vms->highmem ?
> > -base :
> > -vms->memmap[VIRT_MEM].base + ms->maxram_size) - 1;
> > +
> > +/*
> > + * If base fits within pa_bits, all good. If it doesn't, limit it
> > + * to the end of RAM, which is guaranteed to fit within pa_bits.
> 
> We tested that
> 
>   vms->memmap[VIRT_MEM].base + ms->maxram_size
> 
> fits within pa_bits, but here we're setting highest_gpa to
> 
>   ROUND_UP(vms->memmap[VIRT_MEM].base + ms->ram_size, GiB) +
>   ROUND_UP(ms->maxram_size - ms->ram_size + ms->ram_slots * GiB, GiB)
> 
> which will be larger. Shouldn't we test memtop instead to make this
> guarantee?

Yes, well spotted.

> 
> 
> > + */
> > +if (base <= BIT_ULL(pa_bits)) {
> > +vms->highest_gpa = base -1;
> > +} else {
> > +vms->highest_gpa = memtop - 1;
> > +}
> > +
> >  if (device_memory_size > 0) {
> >  ms->device_memory = g_malloc0(sizeof(*ms->device_memory));
> >  ms->device_memory->base = device_memory_base;
> > @@ -1860,7 +1871,20 @@ static void machvirt_init(MachineState *machine)
> >   * to create a VM with the right number of IPA bits.
> >   */
> >  if (!vms->memmap) {
> > -virt_set_memmap(vms);
> > +ARMCPU *armcpu = ARM_CPU(first_cpu);
> 
> 
> I think it's too early to use first_cpu here (although, I'll admit I'm
> always confused as to what gets initialized when...) Assuming we need to
> realize the cpus first, then we don't do that until a bit further down
> in this function. I wonder if it's possi

Re: [PATCH v2 3/5] hw/arm/virt: Honor highmem setting when computing the memory map

2021-12-27 Thread Marc Zyngier
On Mon, 04 Oct 2021 13:23:41 +0100,
Eric Auger  wrote:
> 
> Hi Marc,
> 
> On 10/3/21 6:46 PM, Marc Zyngier wrote:
> > Even when the VM is configured with highmem=off, the highest_gpa
> > field includes devices that are above the 4GiB limit.
> > Similarily, nothing seem to check that the memory is within
> > the limit set by the highmem=off option.
> >
> > This leads to failures in virt_kvm_type() on systems that have
> > a crippled IPA range, as the reported IPA space is larger than
> > what it should be.
> >
> > Instead, honor the user-specified limit to only use the devices
> > at the lowest end of the spectrum, and fail if we have memory
> > crossing the 4GiB limit.
> >
> > Signed-off-by: Marc Zyngier 
> > ---
> >  hw/arm/virt.c | 9 -
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > index bcf58f677d..9d2abdbd5f 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -1628,6 +1628,11 @@ static void virt_set_memmap(VirtMachineState *vms)
> >  exit(EXIT_FAILURE);
> >  }
> >  
> > +if (!vms->highmem &&
> > +vms->memmap[VIRT_MEM].base + ms->maxram_size > 4 * GiB) {
> > +error_report("highmem=off, but memory crosses the 4GiB limit\n");
> > +exit(EXIT_FAILURE);
> > +}
> >  /*
> >   * We compute the base of the high IO region depending on the
> >   * amount of initial and device memory. The device memory start/size
> > @@ -1657,7 +1662,9 @@ static void virt_set_memmap(VirtMachineState *vms)
> >  vms->memmap[i].size = size;
> >  base += size;
> >  }
> > -vms->highest_gpa = base - 1;
> > +vms->highest_gpa = (vms->highmem ?
> > +base :
> > +vms->memmap[VIRT_MEM].base + ms->maxram_size) - 1;
> I think I would have preferred to have
> 
> if (vms->highmem) {
>    for (i = VIRT_LOWMEMMAP_LAST; i < ARRAY_SIZE(extended_memmap); i++) {
>     hwaddr size = extended_memmap[i].size;
> 
>     base = ROUND_UP(base, size);
>     vms->memmap[i].base = base;
>     vms->memmap[i].size = size;
>     base += size;
>     }
> }
> as it is useless to execute that code and create new memmap entries in
> case of !highmem.

I agree that it is a bit useless when we only have highmem. But we
really want to deal with arbitrary IPA spaces (see how this changes in
the follow-up patches), and we need to check that everything fits in
the IPA space (and fix things up if they don't).

> 
> But nevertheless, this looks correct

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v2 1/5] hw/arm/virt: Key enablement of highmem PCIe on highmem_ecam

2021-12-27 Thread Marc Zyngier
Hi Eric,

Picking this up again after a stupidly long time...

On Mon, 04 Oct 2021 13:00:21 +0100,
Eric Auger  wrote:
> 
> Hi Marc,
> 
> On 10/3/21 6:46 PM, Marc Zyngier wrote:
> > Currently, the highmem PCIe region is oddly keyed on the highmem
> > attribute instead of highmem_ecam. Move the enablement of this PCIe
> > region over to highmem_ecam.
> >
> > Signed-off-by: Marc Zyngier 
> > ---
> >  hw/arm/virt-acpi-build.c | 10 --
> >  hw/arm/virt.c|  4 ++--
> >  2 files changed, 6 insertions(+), 8 deletions(-)
> >
> > diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> > index 037cc1fd82..d7bef0e627 100644
> > --- a/hw/arm/virt-acpi-build.c
> > +++ b/hw/arm/virt-acpi-build.c
> > @@ -157,10 +157,9 @@ static void acpi_dsdt_add_virtio(Aml *scope,
> >  }
> >  
> >  static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
> > -  uint32_t irq, bool use_highmem, bool highmem_ecam,
> > -  VirtMachineState *vms)
> > +  uint32_t irq, VirtMachineState *vms)
> >  {
> > -int ecam_id = VIRT_ECAM_ID(highmem_ecam);
> > +int ecam_id = VIRT_ECAM_ID(vms->highmem_ecam);
> >  struct GPEXConfig cfg = {
> >  .mmio32 = memmap[VIRT_PCIE_MMIO],
> >  .pio= memmap[VIRT_PCIE_PIO],
> > @@ -169,7 +168,7 @@ static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
> >  .bus= vms->bus,
> >  };
> >  
> > -if (use_highmem) {
> > +if (vms->highmem_ecam) {
> highmem_ecam is more restrictive than use_highmem:
> vms->highmem_ecam &= vms->highmem && (!firmware_loaded || aarch64);
> 
> If I remember correctly there was a problem using highmem ECAM with 32b
> AAVMF FW.
>
> However 5125f9cd2532 ("hw/arm/virt: Add high MMIO PCI region, 512G in
> size") introduced high MMIO PCI region without this constraint.

Then I really don't understand the point of this highmem_ecam. We only
register the highmem version if highmem_ecam is set (see the use of
VIRT_ECAM_ID() to pick the right ECAM window).

So keying this on highmem makes it expose a device that may not be
there in the first place since, as you pointed out, highmem_ecam can
be false in cases where highmem is true.

> So to me we should keep vms->highmem here

I really must be missing how this is supposed to work.

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH for-6.2? 0/2] arm_gicv3: Fix handling of LPIs in list registers

2021-11-29 Thread Marc Zyngier
Hi Peter,

On Fri, 26 Nov 2021 16:39:13 +,
Peter Maydell  wrote:
> 
> (Marc: cc'd you on this one in case you're still using QEMU
> to test KVM stuff with, in which case you might have run into
> the bug this is fixing.)

Amusingly enough, I have recently fixed [1] a very similar issue with
the ICV_*_EL1 emulation that KVM uses when dealing with sub-par HW
(ThunderX and M1).

When writing this a very long while ago, I modelled it so that LPIs
wouldn't have an Active state, similar to bare metal. As it turns out,
the pseudocode actually treats LPIs almost like any other interrupt,
and is quite happy to carry an active bit that eventually gets exposed
to the hypervisor.

I don't think this ever caused any issue, but I'd be pretty happy to
see the QEMU implementation fixed.

For the whole series:

Reviewed-by: Marc Zyngier 

Thanks,

M.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/arm64/kvm/hyp/vgic-v3-sr.c?id=9d449c71bd8f74282e84213c8f0b8328293ab0a7

-- 
Without deviation from the norm, progress is not possible.



[PATCH v2 5/5] hw/arm/virt: Disable highmem devices that don't fit in the PA range

2021-10-03 Thread Marc Zyngier
Make sure both the highmem PCIe and GICv3 regions are disabled when
they don't fully fit in the PA range.

Signed-off-by: Marc Zyngier 
---
 hw/arm/virt.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index a572e0c9d9..756f67b6c8 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1673,6 +1673,9 @@ static void virt_set_memmap(VirtMachineState *vms, int 
pa_bits)
 if (base <= BIT_ULL(pa_bits)) {
 vms->highest_gpa = base -1;
 } else {
+/* Advertise that we have disabled the highmem devices */
+vms->highmem_ecam = false;
+vms->highmem_redists = false;
 vms->highest_gpa = memtop - 1;
 }
 
-- 
2.30.2




[PATCH v2 3/5] hw/arm/virt: Honor highmem setting when computing the memory map

2021-10-03 Thread Marc Zyngier
Even when the VM is configured with highmem=off, the highest_gpa
field includes devices that are above the 4GiB limit.
Similarly, nothing seems to check that the memory is within
the limit set by the highmem=off option.

This leads to failures in virt_kvm_type() on systems that have
a crippled IPA range, as the reported IPA space is larger than
what it should be.

Instead, honor the user-specified limit to only use the devices
at the lowest end of the spectrum, and fail if we have memory
crossing the 4GiB limit.

Signed-off-by: Marc Zyngier 
---
 hw/arm/virt.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index bcf58f677d..9d2abdbd5f 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1628,6 +1628,11 @@ static void virt_set_memmap(VirtMachineState *vms)
 exit(EXIT_FAILURE);
 }
 
+if (!vms->highmem &&
+vms->memmap[VIRT_MEM].base + ms->maxram_size > 4 * GiB) {
+error_report("highmem=off, but memory crosses the 4GiB limit\n");
+exit(EXIT_FAILURE);
+}
 /*
  * We compute the base of the high IO region depending on the
  * amount of initial and device memory. The device memory start/size
@@ -1657,7 +1662,9 @@ static void virt_set_memmap(VirtMachineState *vms)
 vms->memmap[i].size = size;
 base += size;
 }
-vms->highest_gpa = base - 1;
+vms->highest_gpa = (vms->highmem ?
+base :
+vms->memmap[VIRT_MEM].base + ms->maxram_size) - 1;
 if (device_memory_size > 0) {
 ms->device_memory = g_malloc0(sizeof(*ms->device_memory));
 ms->device_memory->base = device_memory_base;
-- 
2.30.2




[PATCH v2 0/5] target/arm: Reduced-IPA space and highmem=off fixes

2021-10-03 Thread Marc Zyngier
Here's another stab at enabling QEMU on systems with pathologically
reduced IPA ranges such as the Apple M1 (original version at [1]).
Eventually, we're able to run a KVM guest with more than just 3GB of
RAM on a system with a 36bit IPA space, and at most 123 vCPUs.

This series does a few things:
- decouple the enabling of the highmem PCIe region from the highmem
  attribute
- introduce a new attribute to control the enabling of the highmem
  GICv3 redistributors
- correctly cap the PA range when highmem is off
- generalise the highmem behaviour to any PA range
- disable both highmem PCIe and GICv3 RDs when they are outside of the
  PA range

This has been tested on an M1-based Mac-mini running Linux v5.15-rc3.

[1] https://lore.kernel.org/r/2021082211.1290891-1-...@kernel.org

Marc Zyngier (5):
  hw/arm/virt: Key enablement of highmem PCIe on highmem_ecam
  hw/arm/virt: Add a control for the highmem redistributors
  hw/arm/virt: Honor highmem setting when computing the memory map
  hw/arm/virt: Use the PA range to compute the memory map
  hw/arm/virt: Disable highmem devices that don't fit in the PA range

 hw/arm/virt-acpi-build.c | 12 -
 hw/arm/virt.c| 53 ++--
 include/hw/arm/virt.h|  4 ++-
 3 files changed, 54 insertions(+), 15 deletions(-)

-- 
2.30.2




[PATCH v2 1/5] hw/arm/virt: Key enablement of highmem PCIe on highmem_ecam

2021-10-03 Thread Marc Zyngier
Currently, the highmem PCIe region is oddly keyed on the highmem
attribute instead of highmem_ecam. Move the enablement of this PCIe
region over to highmem_ecam.

Signed-off-by: Marc Zyngier 
---
 hw/arm/virt-acpi-build.c | 10 --
 hw/arm/virt.c|  4 ++--
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 037cc1fd82..d7bef0e627 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -157,10 +157,9 @@ static void acpi_dsdt_add_virtio(Aml *scope,
 }
 
 static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
-  uint32_t irq, bool use_highmem, bool highmem_ecam,
-  VirtMachineState *vms)
+  uint32_t irq, VirtMachineState *vms)
 {
-int ecam_id = VIRT_ECAM_ID(highmem_ecam);
+int ecam_id = VIRT_ECAM_ID(vms->highmem_ecam);
 struct GPEXConfig cfg = {
 .mmio32 = memmap[VIRT_PCIE_MMIO],
 .pio= memmap[VIRT_PCIE_PIO],
@@ -169,7 +168,7 @@ static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
 .bus= vms->bus,
 };
 
-if (use_highmem) {
+if (vms->highmem_ecam) {
 cfg.mmio64 = memmap[VIRT_HIGH_PCIE_MMIO];
 }
 
@@ -712,8 +711,7 @@ build_dsdt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
 acpi_dsdt_add_fw_cfg(scope, &memmap[VIRT_FW_CFG]);
 acpi_dsdt_add_virtio(scope, &memmap[VIRT_MMIO],
 (irqmap[VIRT_MMIO] + ARM_SPI_BASE), NUM_VIRTIO_TRANSPORTS);
-acpi_dsdt_add_pci(scope, memmap, (irqmap[VIRT_PCIE] + ARM_SPI_BASE),
-  vms->highmem, vms->highmem_ecam, vms);
+acpi_dsdt_add_pci(scope, memmap, (irqmap[VIRT_PCIE] + ARM_SPI_BASE), vms);
 if (vms->acpi_dev) {
 build_ged_aml(scope, "\\_SB."GED_DEVICE,
   HOTPLUG_HANDLER(vms->acpi_dev),
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 7170aaacd5..8021d545c3 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1362,7 +1362,7 @@ static void create_pcie(VirtMachineState *vms)
  mmio_reg, base_mmio, size_mmio);
 memory_region_add_subregion(get_system_memory(), base_mmio, mmio_alias);
 
-if (vms->highmem) {
+if (vms->highmem_ecam) {
 /* Map high MMIO space */
 MemoryRegion *high_mmio_alias = g_new0(MemoryRegion, 1);
 
@@ -1416,7 +1416,7 @@ static void create_pcie(VirtMachineState *vms)
 qemu_fdt_setprop_sized_cells(ms->fdt, nodename, "reg",
  2, base_ecam, 2, size_ecam);
 
-if (vms->highmem) {
+if (vms->highmem_ecam) {
 qemu_fdt_setprop_sized_cells(ms->fdt, nodename, "ranges",
  1, FDT_PCI_RANGE_IOPORT, 2, 0,
  2, base_pio, 2, size_pio,
-- 
2.30.2




[PATCH v2 4/5] hw/arm/virt: Use the PA range to compute the memory map

2021-10-03 Thread Marc Zyngier
The highmem attribute is nothing but another way to express the
PA range of a VM. To support HW that has a smaller PA range than
what QEMU assumes, pass this PA range to the virt_set_memmap()
function, allowing it to correctly exclude highmem devices
if they are outside of the PA range.

Signed-off-by: Marc Zyngier 
---
 hw/arm/virt.c | 46 +++---
 1 file changed, 35 insertions(+), 11 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 9d2abdbd5f..a572e0c9d9 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1610,10 +1610,10 @@ static uint64_t virt_cpu_mp_affinity(VirtMachineState *vms, int idx)
 return arm_cpu_mp_affinity(idx, clustersz);
 }
 
-static void virt_set_memmap(VirtMachineState *vms)
+static void virt_set_memmap(VirtMachineState *vms, int pa_bits)
 {
 MachineState *ms = MACHINE(vms);
-hwaddr base, device_memory_base, device_memory_size;
+hwaddr base, device_memory_base, device_memory_size, memtop;
 int i;
 
 vms->memmap = extended_memmap;
@@ -1628,9 +1628,12 @@ static void virt_set_memmap(VirtMachineState *vms)
 exit(EXIT_FAILURE);
 }
 
-if (!vms->highmem &&
-vms->memmap[VIRT_MEM].base + ms->maxram_size > 4 * GiB) {
-error_report("highmem=off, but memory crosses the 4GiB limit\n");
+if (!vms->highmem)
+   pa_bits = 32;
+
+if (vms->memmap[VIRT_MEM].base + ms->maxram_size > BIT_ULL(pa_bits)) {
+   error_report("Addressing limited to %d bits, but memory exceeds it by %llu bytes\n",
+pa_bits, vms->memmap[VIRT_MEM].base + ms->maxram_size - BIT_ULL(pa_bits));
 exit(EXIT_FAILURE);
 }
 /*
@@ -1645,7 +1648,7 @@ static void virt_set_memmap(VirtMachineState *vms)
 device_memory_size = ms->maxram_size - ms->ram_size + ms->ram_slots * GiB;
 
 /* Base address of the high IO region */
-base = device_memory_base + ROUND_UP(device_memory_size, GiB);
+memtop = base = device_memory_base + ROUND_UP(device_memory_size, GiB);
 if (base < device_memory_base) {
 error_report("maxmem/slots too huge");
 exit(EXIT_FAILURE);
@@ -1662,9 +1665,17 @@ static void virt_set_memmap(VirtMachineState *vms)
 vms->memmap[i].size = size;
 base += size;
 }
-vms->highest_gpa = (vms->highmem ?
-base :
-vms->memmap[VIRT_MEM].base + ms->maxram_size) - 1;
+
+/*
+ * If base fits within pa_bits, all good. If it doesn't, limit it
+ * to the end of RAM, which is guaranteed to fit within pa_bits.
+ */
+if (base <= BIT_ULL(pa_bits)) {
+vms->highest_gpa = base -1;
+} else {
+vms->highest_gpa = memtop - 1;
+}
+
 if (device_memory_size > 0) {
 ms->device_memory = g_malloc0(sizeof(*ms->device_memory));
 ms->device_memory->base = device_memory_base;
@@ -1860,7 +1871,20 @@ static void machvirt_init(MachineState *machine)
  * to create a VM with the right number of IPA bits.
  */
 if (!vms->memmap) {
-virt_set_memmap(vms);
+ARMCPU *armcpu = ARM_CPU(first_cpu);
+int pa_bits;
+
+if (object_property_get_bool(OBJECT(first_cpu), "aarch64", NULL)) {
+pa_bits = arm_pamax(armcpu);
+} else if (arm_feature(&armcpu->env, ARM_FEATURE_LPAE)) {
+/* v7 with LPAE */
+pa_bits = 40;
+} else {
+/* Anything else */
+pa_bits = 32;
+}
+
+virt_set_memmap(vms, pa_bits);
 }
 
 /* We can probe only here because during property set
@@ -2596,7 +2620,7 @@ static int virt_kvm_type(MachineState *ms, const char *type_str)
 max_vm_pa_size = kvm_arm_get_max_vm_ipa_size(ms, &fixed_ipa);
 
 /* we freeze the memory map to compute the highest gpa */
-virt_set_memmap(vms);
+virt_set_memmap(vms, max_vm_pa_size);
 
 requested_pa_size = 64 - clz64(vms->highest_gpa);
 
-- 
2.30.2




[PATCH v2 2/5] hw/arm/virt: Add a control for the highmem redistributors

2021-10-03 Thread Marc Zyngier
Just like we can control the enablement of the highmem PCIe region
using highmem_ecam, let's add a control for the highmem GICv3
redistributor region.

Similarly to highmem_ecam, these redistributors are disabled when
highmem is off.

Signed-off-by: Marc Zyngier 
---
 hw/arm/virt-acpi-build.c | 2 ++
 hw/arm/virt.c| 3 +++
 include/hw/arm/virt.h| 4 +++-
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index d7bef0e627..f0d0b662b7 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -792,6 +792,8 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables 
*tables)
 acpi_add_table(table_offsets, tables_blob);
 build_fadt_rev5(tables_blob, tables->linker, vms, dsdt);
 
+vms->highmem_redists &= vms->highmem;
+
 acpi_add_table(table_offsets, tables_blob);
 build_madt(tables_blob, tables->linker, vms);
 
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 8021d545c3..bcf58f677d 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2053,6 +2053,8 @@ static void machvirt_init(MachineState *machine)
 
 virt_flash_fdt(vms, sysmem, secure_sysmem ?: sysmem);
 
+vms->highmem_redists &= vms->highmem;
+
 create_gic(vms, sysmem);
 
 virt_cpu_post_init(vms, sysmem);
@@ -2750,6 +2752,7 @@ static void virt_instance_init(Object *obj)
 vms->gic_version = VIRT_GIC_VERSION_NOSEL;
 
 vms->highmem_ecam = !vmc->no_highmem_ecam;
+vms->highmem_redists = true;
 
 if (vmc->no_its) {
 vms->its = false;
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index b461b8d261..787cc8a27d 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -141,6 +141,7 @@ struct VirtMachineState {
 bool secure;
 bool highmem;
 bool highmem_ecam;
+bool highmem_redists;
 bool its;
 bool tcg_its;
 bool virt;
@@ -187,7 +188,8 @@ static inline int 
virt_gicv3_redist_region_count(VirtMachineState *vms)
 
 assert(vms->gic_version == VIRT_GIC_VERSION_3);
 
-return MACHINE(vms)->smp.cpus > redist0_capacity ? 2 : 1;
+return (MACHINE(vms)->smp.cpus > redist0_capacity &&
+vms->highmem_redists) ? 2 : 1;
 }
 
 #endif /* QEMU_ARM_VIRT_H */
-- 
2.30.2




Re: [PATCH v9 07/11] hvf: arm: Implement PSCI handling

2021-09-15 Thread Marc Zyngier
On Wed, 15 Sep 2021 11:58:29 +0100,
Alexander Graf  wrote:
> 
> 
> On 15.09.21 11:46, Marc Zyngier wrote:
> > On Mon, 13 Sep 2021 13:30:57 +0100,
> > Peter Maydell  wrote:
> >> On Mon, 13 Sept 2021 at 13:02, Alexander Graf  wrote:
> >>>
> >>> On 13.09.21 13:44, Peter Maydell wrote:
> >>>> On Mon, 13 Sept 2021 at 12:07, Alexander Graf  wrote:
> >>>>> To keep your train of thought though, what would you do if we encounter
> >>>>> a conduit that is different from the chosen one? Today, I am aware of 2
> >>>>> different implementations: TCG injects #UD [1] while KVM sets x0 to -1 
> >>>>> [2].
> >>>> If the SMC or HVC insn isn't being used for PSCI then it should
> >>>> have its standard architectural behaviour.
> >>> Why?
> >> QEMU's assumption here is that there are basically two scenarios
> >> for these instructions:
> >>  (1) we're providing an emulation of firmware that uses this
> >>  instruction (and only this insn, not the other one) to
> >>  provide PSCI services
> >>  (2) we're not emulating any firmware at all, we're running it
> >>  in the guest, and that guest firmware is providing PSCI
> >>
> >> In case (1) we provide a PSCI ABI on the end of the insn.
> >> In case (2) we provide the architectural behaviour for the insn
> >> so that the guest firmware can use it.
> >>
> >> We don't currently have
> >>  (3) we're providing an emulation of firmware that does something
> >>  other than providing PSCI services on this instruction
> >>
> >> which is what I think you're asking for. (Alternatively, you might
> >> be after "provide PSCI via SMC, not HVC", ie use a different conduit.
> >> If hvf documents that SMC is guaranteed to trap that would be
> >> possible, I guess.)
> >>
> >>> Also, why does KVM behave differently?
> >> Looks like Marc made KVM set x0 to -1 for SMC calls in kernel commit
> >> c0938c72f8070aa; conveniently he's on the cc list here so we can
> >> ask him :-)
> > If we got a SMC trap into KVM, that's because the HW knows about it,
> > so injecting an UNDEF is rather counter productive (we don't hide the
> > fact that EL3 actually exists).
> 
> 
> This is the part where you and Peter disagree :). What would you suggest
> to do to create consistency between KVM and TCG based EL0/1 only VMs?

I don't think we disagree. We simply have different implementation
choices. The KVM "firmware" can only be used with HVC, and not
SMC. SMC is reserved for cases where the guest talks to the actual
EL3, or an emulation of it in the case of NV.

As for consistency between TCG and KVM, I have no plan for that
whatsoever. Both implementations are valid, and they don't have to be
identical. Even more, diversity is important, as it weeds out silly
assumptions that are baked into non-portable SW.

Windows doesn't boot? I won't lose any sleep over it.

> 
> > However, we don't implement anything on the back of this instruction,
> > so we just return NOT_IMPLEMENTED (-1). With NV, we actually use it as
> > a guest hypervisor can use it for PSCI and SMC is guaranteed to trap
> > even if EL3 doesn't exist in the HW.
> >
> > For the brain-damaged case where there is no EL3, SMC traps and the
> > hypervisor doesn't actually advertise EL3, that's likely a guest
> > bug. Tough luck.
> >
> > Side note: Not sure what HVF does, but on the M1 running Linux, SMC
> > appears to trap to EL2 with EC=0x3f, which is a reserved exception
> > class. This of course results in an UNDEF being injected because as
> > far as KVM is concerned, this should never happen.
>
> Could that be yet another magical implementation specific MSR bit that
> needs to be set? Hvf returns 0x17 (EC_AA64_SMC) for SMC calls.

That's possible, but that's not something KVM will do. Also, from what
I understand of HVF, this value is what you get in userspace, and it
says nothing of what the kernel side does. It could well be
translating the invalid EC into something else, after having read the
instruction from the guest for all I know.

It is pretty obvious that this HW is not a valid implementation of the
architecture and if it decides to screw itself up, I'm happy to
oblige.

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v9 07/11] hvf: arm: Implement PSCI handling

2021-09-15 Thread Marc Zyngier
On Mon, 13 Sep 2021 13:30:57 +0100,
Peter Maydell  wrote:
> 
> On Mon, 13 Sept 2021 at 13:02, Alexander Graf  wrote:
> >
> >
> > On 13.09.21 13:44, Peter Maydell wrote:
> > > On Mon, 13 Sept 2021 at 12:07, Alexander Graf  wrote:
> > >> To keep your train of thought though, what would you do if we encounter
> > >> a conduit that is different from the chosen one? Today, I am aware of 2
> > >> different implementations: TCG injects #UD [1] while KVM sets x0 to -1 
> > >> [2].
> > > If the SMC or HVC insn isn't being used for PSCI then it should
> > > have its standard architectural behaviour.
> >
> > Why?
> 
> QEMU's assumption here is that there are basically two scenarios
> for these instructions:
>  (1) we're providing an emulation of firmware that uses this
>  instruction (and only this insn, not the other one) to
>  provide PSCI services
>  (2) we're not emulating any firmware at all, we're running it
>  in the guest, and that guest firmware is providing PSCI
> 
> In case (1) we provide a PSCI ABI on the end of the insn.
> In case (2) we provide the architectural behaviour for the insn
> so that the guest firmware can use it.
> 
> We don't currently have
>  (3) we're providing an emulation of firmware that does something
>  other than providing PSCI services on this instruction
> 
> which is what I think you're asking for. (Alternatively, you might
> be after "provide PSCI via SMC, not HVC", ie use a different conduit.
> If hvf documents that SMC is guaranteed to trap that would be
> possible, I guess.)
> 
> > Also, why does KVM behave differently?
> 
> Looks like Marc made KVM set x0 to -1 for SMC calls in kernel commit
> c0938c72f8070aa; conveniently he's on the cc list here so we can
> ask him :-)

If we got a SMC trap into KVM, that's because the HW knows about it,
so injecting an UNDEF is rather counter productive (we don't hide the
fact that EL3 actually exists).

However, we don't implement anything on the back of this instruction,
so we just return NOT_IMPLEMENTED (-1). With NV, we actually use it as
a guest hypervisor can use it for PSCI and SMC is guaranteed to trap
even if EL3 doesn't exist in the HW.

For the brain-damaged case where there is no EL3, SMC traps and the
hypervisor doesn't actually advertise EL3, that's likely a guest
bug. Tough luck.

Side note: Not sure what HVF does, but on the M1 running Linux, SMC
appears to trap to EL2 with EC=0x3f, which is a reserved exception
class. This of course results in an UNDEF being injected because as
far as KVM is concerned, this should never happen.

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH 3/3] docs/system/arm/virt: Fix documentation for the 'highmem' option

2021-09-08 Thread Marc Zyngier
On Tue, 07 Sep 2021 19:25:23 +0100,
Peter Maydell  wrote:
> 
> On Tue, 7 Sept 2021 at 18:10, Marc Zyngier  wrote:
> >
> > Hi Peter,
> >
> > On Tue, 07 Sep 2021 13:51:13 +0100,
> > Peter Maydell  wrote:
> > >
> > > On Sun, 22 Aug 2021 at 15:45, Marc Zyngier  wrote:
> > > >
> > > > The documentation for the 'highmem' option indicates that it controls
> > > > the placement of both devices and RAM. The actual behaviour of QEMU
> > > > seems to be that RAM is allowed to go beyond the 4GiB limit, and
> > > > that only devices are constraint by this option.
> > > >
> > > > Align the documentation with the actual behaviour.
> > >
> > > I think it would be better to align the behaviour with the documentation.
> > >
> > > The intent of 'highmem' is to allow a configuration for use with guests
> > > that can't address more than 32 bits (originally, 32-bit guests without
> > > LPAE support compiled in). It seems like a bug that we allow the user
> > > to specify more RAM than will fit into that 32-bit range. We should
> > > instead make QEMU exit with an error if the user tries to specify
> > > both highmem=off and a memory size that's too big to fit.
> >
> > I'm happy to address this if you are OK with the change in user
> > visible behaviour.
> >
> > However, I am still struggling with my original goal, which is to
> > allow QEMU to create a usable KVM_based VM on systems with a small IPA
> > space (36 bits on the system I have). What would an acceptable way to
> > convey this to the code that deals with the virt memory map so that it
> > falls back to something that actually works?
> 
> Hmm, so at the moment we can either do "fits in 32 bits" or
> "assumes at least 40 bits" but not 36 ?

Exactly. I have the gut feeling that we need a 'gpa_bits' option that
would limit the guest physical range and generalise highmem. High IO
ranges would simply not be available if the GPA range isn't big
enough.

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH 3/3] docs/system/arm/virt: Fix documentation for the 'highmem' option

2021-09-07 Thread Marc Zyngier
Hi Peter,

On Tue, 07 Sep 2021 13:51:13 +0100,
Peter Maydell  wrote:
> 
> On Sun, 22 Aug 2021 at 15:45, Marc Zyngier  wrote:
> >
> > The documentation for the 'highmem' option indicates that it controls
> > the placement of both devices and RAM. The actual behaviour of QEMU
> > seems to be that RAM is allowed to go beyond the 4GiB limit, and
> > that only devices are constrained by this option.
> >
> > Align the documentation with the actual behaviour.
> 
> I think it would be better to align the behaviour with the documentation.
> 
> The intent of 'highmem' is to allow a configuration for use with guests
> that can't address more than 32 bits (originally, 32-bit guests without
> LPAE support compiled in). It seems like a bug that we allow the user
> to specify more RAM than will fit into that 32-bit range. We should
> instead make QEMU exit with an error if the user tries to specify
> both highmem=off and a memory size that's too big to fit.

I'm happy to address this if you are OK with the change in user
visible behaviour.

However, I am still struggling with my original goal, which is to
allow QEMU to create a usable KVM_based VM on systems with a small IPA
space (36 bits on the system I have). What would an acceptable way to
convey this to the code that deals with the virt memory map so that it
falls back to something that actually works?

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH 0/3] target/arm: Reduced-IPA space and highmem=off fixes

2021-08-22 Thread Marc Zyngier

On 2021-08-22 15:44, Marc Zyngier wrote:

With the availability of a fruity range of arm64 systems, it becomes
obvious that QEMU doesn't deal very well with limited IPA ranges when
used as a front-end for KVM.

This short series aims at making QEMU usable on such systems:
- the first patch makes the creation of a scratch VM IPA-limit aware
- the second one actually removes the highmem devices from the
computed IPA range when highmem=off
- the last one addresses an imprecision in the documentation for the
highmem option

This has been tested on an M1-based Mac-mini running Linux v5.14-rc6.


I realise I haven't been very clear in my description of the above.
With this series, using 'highmem=off' results in a usable VM, while
sticking to the default 'highmem=on' still generates an error.

M.
--
Jazz is not dead. It just smells funny...



[PATCH 0/3] target/arm: Reduced-IPA space and highmem=off fixes

2021-08-22 Thread Marc Zyngier
With the availability of a fruity range of arm64 systems, it becomes
obvious that QEMU doesn't deal very well with limited IPA ranges when
used as a front-end for KVM.

This short series aims at making QEMU usable on such systems:
- the first patch makes the creation of a scratch VM IPA-limit aware
- the second one actually removes the highmem devices from the
computed IPA range when highmem=off
- the last one addresses an imprecision in the documentation for the
highmem option

This has been tested on an M1-based Mac-mini running Linux v5.14-rc6.

Marc Zyngier (3):
  hw/arm/virt: KVM: Probe for KVM_CAP_ARM_VM_IPA_SIZE when creating
scratch VM
  hw/arm/virt: Honor highmem setting when computing highest_gpa
  docs/system/arm/virt: Fix documentation for the 'highmem' option

 docs/system/arm/virt.rst |  6 +++---
 hw/arm/virt.c| 10 +++---
 target/arm/kvm.c |  7 ++-
 3 files changed, 16 insertions(+), 7 deletions(-)

-- 
2.30.2




[PATCH 1/3] hw/arm/virt: KVM: Probe for KVM_CAP_ARM_VM_IPA_SIZE when creating scratch VM

2021-08-22 Thread Marc Zyngier
Although we probe for the IPA limits imposed by KVM (and the hardware)
when computing the memory map, we still use the old style '0' when
creating a scratch VM in kvm_arm_create_scratch_host_vcpu().

On systems that are severely IPA challenged (such as the Apple M1),
this results in a failure as KVM cannot use the default 40-bit IPA
size that '0' represents.

Instead, probe for the extension and use the reported IPA limit
if available.

Cc: Andrew Jones 
Cc: Eric Auger 
Cc: Peter Maydell 
Signed-off-by: Marc Zyngier 
---
 target/arm/kvm.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index d8381ba224..cc3371a99b 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -70,12 +70,17 @@ bool kvm_arm_create_scratch_host_vcpu(const uint32_t 
*cpus_to_try,
   struct kvm_vcpu_init *init)
 {
 int ret = 0, kvmfd = -1, vmfd = -1, cpufd = -1;
+int max_vm_pa_size;
 
 kvmfd = qemu_open_old("/dev/kvm", O_RDWR);
 if (kvmfd < 0) {
 goto err;
 }
-vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
+max_vm_pa_size = ioctl(kvmfd, KVM_CHECK_EXTENSION, 
KVM_CAP_ARM_VM_IPA_SIZE);
+if (max_vm_pa_size < 0) {
+max_vm_pa_size = 0;
+}
+vmfd = ioctl(kvmfd, KVM_CREATE_VM, max_vm_pa_size);
 if (vmfd < 0) {
 goto err;
 }
-- 
2.30.2




[PATCH 3/3] docs/system/arm/virt: Fix documentation for the 'highmem' option

2021-08-22 Thread Marc Zyngier
The documentation for the 'highmem' option indicates that it controls
the placement of both devices and RAM. The actual behaviour of QEMU
seems to be that RAM is allowed to go beyond the 4GiB limit, and
that only devices are constrained by this option.

Align the documentation with the actual behaviour.

Cc: Andrew Jones 
Cc: Eric Auger 
Cc: Peter Maydell 
Signed-off-by: Marc Zyngier 
---
 docs/system/arm/virt.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/system/arm/virt.rst b/docs/system/arm/virt.rst
index 59acf0eeaf..e206e7565d 100644
--- a/docs/system/arm/virt.rst
+++ b/docs/system/arm/virt.rst
@@ -86,9 +86,9 @@ mte
   Arm Memory Tagging Extensions. The default is ``off``.
 
 highmem
-  Set ``on``/``off`` to enable/disable placing devices and RAM in physical
-  address space above 32 bits. The default is ``on`` for machine types
-  later than ``virt-2.12``.
+  Set ``on``/``off`` to enable/disable placing devices in physical address
+  space above 32 bits. RAM in excess of 3GiB will always be placed above
+  32 bits. The default is ``on`` for machine types later than ``virt-2.12``.
 
 gic-version
   Specify the version of the Generic Interrupt Controller (GIC) to provide.
-- 
2.30.2




[PATCH 2/3] hw/arm/virt: Honor highmem setting when computing highest_gpa

2021-08-22 Thread Marc Zyngier
Even when the VM is configured with highmem=off, the highest_gpa
field includes devices that are above the 4GiB limit, which is
what highmem=off is supposed to enforce. This leads to failures
in virt_kvm_type() on systems that have a crippled IPA range,
as the reported IPA space is larger than what it should be.

Instead, honor the user-specified limit to only use the devices
at the lowest end of the spectrum.

Note that this doesn't affect memory, which is still allowed to
go beyond 4GiB with highmem=on configurations.

Cc: Andrew Jones 
Cc: Eric Auger 
Cc: Peter Maydell 
Signed-off-by: Marc Zyngier 
---
 hw/arm/virt.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 81eda46b0b..bc189e30b8 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1598,7 +1598,7 @@ static uint64_t virt_cpu_mp_affinity(VirtMachineState 
*vms, int idx)
 static void virt_set_memmap(VirtMachineState *vms)
 {
 MachineState *ms = MACHINE(vms);
-hwaddr base, device_memory_base, device_memory_size;
+hwaddr base, device_memory_base, device_memory_size, ceiling;
 int i;
 
 vms->memmap = extended_memmap;
@@ -1625,7 +1625,7 @@ static void virt_set_memmap(VirtMachineState *vms)
 device_memory_size = ms->maxram_size - ms->ram_size + ms->ram_slots * GiB;
 
 /* Base address of the high IO region */
-base = device_memory_base + ROUND_UP(device_memory_size, GiB);
+ceiling = base = device_memory_base + ROUND_UP(device_memory_size, GiB);
 if (base < device_memory_base) {
 error_report("maxmem/slots too huge");
 exit(EXIT_FAILURE);
@@ -1642,7 +1642,11 @@ static void virt_set_memmap(VirtMachineState *vms)
 vms->memmap[i].size = size;
 base += size;
 }
-vms->highest_gpa = base - 1;
+if (vms->highmem) {
+   /* If we have highmem, move the IPA limit to the top */
+   ceiling = base;
+}
+vms->highest_gpa = ceiling - 1;
 if (device_memory_size > 0) {
 ms->device_memory = g_malloc0(sizeof(*ms->device_memory));
 ms->device_memory->base = device_memory_base;
-- 
2.30.2




Re: [PATCH v17 5/6] KVM: arm64: ioctl to fetch/store tags in a guest

2021-06-24 Thread Marc Zyngier
)
> + return -EINVAL;
> +
> + gfn = gpa_to_gfn(guest_ipa);
> +
> + mutex_lock(&kvm->slots_lock);
> +
> + while (length > 0) {
> + kvm_pfn_t pfn = gfn_to_pfn_prot(kvm, gfn, write, NULL);
> + void *maddr;
> + unsigned long num_tags;
> + struct page *page;
> +
> + if (is_error_noslot_pfn(pfn)) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + page = pfn_to_online_page(pfn);
> + if (!page) {
> + /* Reject ZONE_DEVICE memory */
> + ret = -EFAULT;
> + goto out;
> + }
> + maddr = page_address(page);
> +
> + if (!write) {
> + if (test_bit(PG_mte_tagged, &page->flags))
> + num_tags = mte_copy_tags_to_user(tags, maddr,
> + MTE_GRANULES_PER_PAGE);
> + else
> + /* No tags in memory, so write zeros */
> + num_tags = MTE_GRANULES_PER_PAGE -
> + clear_user(tags, MTE_GRANULES_PER_PAGE);
> + kvm_release_pfn_clean(pfn);
> + } else {
> + num_tags = mte_copy_tags_from_user(maddr, tags,
> + MTE_GRANULES_PER_PAGE);
> + kvm_release_pfn_dirty(pfn);
> + }
> +
> + if (num_tags != MTE_GRANULES_PER_PAGE) {
> + ret = -EFAULT;
> +     goto out;
> + }
> +
> + /* Set the flag after checking the write completed fully */
> + if (write)
> + set_bit(PG_mte_tagged, &page->flags);

This ended up catching my eye as I was merging some other patches.

This set_bit() occurs *after* the page has been released, meaning it
could have been evicted and reused in the interval. I plan to fix it
as below. Please let me know if that works for you.

Thanks,

M.

From a78d3206378a7101659fbc2a4bf01cb9376c4793 Mon Sep 17 00:00:00 2001
From: Marc Zyngier 
Date: Thu, 24 Jun 2021 14:21:05 +0100
Subject: [PATCH] KVM: arm64: Set the MTE tag bit before releasing the page

Setting a page flag without holding a reference to the page
is living dangerously. In the tag-writing path, we drop the
reference to the page by calling kvm_release_pfn_dirty(),
and only then set the PG_mte_tagged bit.

It would be safer to do it the other way round.

Fixes: f0376edb1ddca ("KVM: arm64: Add ioctl to fetch/store tags in a guest")
Cc: Steven Price 
Cc: Catalin Marinas 
Signed-off-by: Marc Zyngier 
---
 arch/arm64/kvm/guest.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 4ddb20017b2f..60815ae477cf 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -1053,6 +1053,14 @@ long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
} else {
num_tags = mte_copy_tags_from_user(maddr, tags,
MTE_GRANULES_PER_PAGE);
+
+   /*
+* Set the flag after checking the write
+* completed fully
+*/
+   if (num_tags == MTE_GRANULES_PER_PAGE)
+   set_bit(PG_mte_tagged, &page->flags);
+
kvm_release_pfn_dirty(pfn);
}
 
@@ -1061,10 +1069,6 @@ long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
goto out;
}
 
-   /* Set the flag after checking the write completed fully */
-   if (write)
-   set_bit(PG_mte_tagged, &page->flags);
-
gfn++;
tags += num_tags;
length -= PAGE_SIZE;
-- 
2.30.2


-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v17 0/6] MTE support for KVM guest

2021-06-22 Thread Marc Zyngier
On Mon, 21 Jun 2021 12:17:10 +0100, Steven Price wrote:
> This series adds support for using the Arm Memory Tagging Extensions
> (MTE) in a KVM guest.
> 
> Changes since v16[1]:
> 
>  - Dropped the first patch ("Handle race when synchronising tags") as
>it's not KVM specific and by restricting MAP_SHARED in KVM there is
>no longer a dependency.
> 
> [...]

Applied to next, thanks!

[1/6] arm64: mte: Sync tags for pages where PTE is untagged
  commit: 69e3b846d8a753f9f279f29531ca56b0f7563ad0
[2/6] KVM: arm64: Introduce MTE VM feature
  commit: ea7fc1bb1cd1b92b42b1d9273ce7e231d3dc9321
[3/6] KVM: arm64: Save/restore MTE registers
  commit: e1f358b5046479d2897f23b1d5b092687c6e7a67
[4/6] KVM: arm64: Expose KVM_ARM_CAP_MTE
  commit: 673638f434ee4a00319e254ade338c57618d6f7e
[5/6] KVM: arm64: ioctl to fetch/store tags in a guest
  commit: f0376edb1ddcab19a473b4bf1fbd5b6bbed3705b
[6/6] KVM: arm64: Document MTE capability and ioctl
  commit: 04c02c201d7e8149ae336ead69fb64e4e6f94bc9

I performed a number of changes in user_mem_abort(), so please
have a look at the result. It is also pretty late in the merge
cycle, so if anything looks amiss, I'll just drop it.

Cheers,

M.
-- 
Without deviation from the norm, progress is not possible.





Re: [PATCH v17 2/6] KVM: arm64: Introduce MTE VM feature

2021-06-22 Thread Marc Zyngier
On Mon, 21 Jun 2021 18:00:20 +0100,
Fuad Tabba  wrote:
> 
> Hi,
> 
> On Mon, Jun 21, 2021 at 12:18 PM Steven Price  wrote:
> >
> > Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
> > for a VM. This will expose the feature to the guest and automatically
> > tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
> > storage) to ensure that the guest cannot see stale tags, and so that
> > the tags are correctly saved/restored across swap.
> >
> > Actually exposing the new capability to user space happens in a later
> > patch.
> >
> > Reviewed-by: Catalin Marinas 
> > Signed-off-by: Steven Price 
> > ---
> >  arch/arm64/include/asm/kvm_emulate.h |  3 ++
> >  arch/arm64/include/asm/kvm_host.h|  3 ++
> >  arch/arm64/kvm/hyp/exception.c   |  3 +-
> >  arch/arm64/kvm/mmu.c | 64 +++-
> >  arch/arm64/kvm/sys_regs.c|  7 +++
> >  include/uapi/linux/kvm.h |  1 +
> >  6 files changed, 79 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_emulate.h 
> > b/arch/arm64/include/asm/kvm_emulate.h
> > index 01b9857757f2..fd418955e31e 100644
> > --- a/arch/arm64/include/asm/kvm_emulate.h
> > +++ b/arch/arm64/include/asm/kvm_emulate.h
> > @@ -84,6 +84,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
> > if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
> > vcpu_el1_is_32bit(vcpu))
> > vcpu->arch.hcr_el2 |= HCR_TID2;
> > +
> > +   if (kvm_has_mte(vcpu->kvm))
> > +   vcpu->arch.hcr_el2 |= HCR_ATA;
> >  }
> >
> >  static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
> > diff --git a/arch/arm64/include/asm/kvm_host.h 
> > b/arch/arm64/include/asm/kvm_host.h
> > index 7cd7d5c8c4bc..afaa5333f0e4 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -132,6 +132,8 @@ struct kvm_arch {
> >
> > u8 pfr0_csv2;
> > u8 pfr0_csv3;
> > +   /* Memory Tagging Extension enabled for the guest */
> > +   bool mte_enabled;
> >  };
> 
> nit: newline before the comment/new member
> 
> >
> >  struct kvm_vcpu_fault_info {
> > @@ -769,6 +771,7 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
> >  #define kvm_arm_vcpu_sve_finalized(vcpu) \
> > ((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
> >
> > +#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
> >  #define kvm_vcpu_has_pmu(vcpu) \
> > (test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
> >
> > diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
> > index 11541b94b328..0418399e0a20 100644
> > --- a/arch/arm64/kvm/hyp/exception.c
> > +++ b/arch/arm64/kvm/hyp/exception.c
> > @@ -112,7 +112,8 @@ static void enter_exception64(struct kvm_vcpu *vcpu, 
> > unsigned long target_mode,
> > new |= (old & PSR_C_BIT);
> > new |= (old & PSR_V_BIT);
> >
> > -   // TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
> > +   if (kvm_has_mte(vcpu->kvm))
> > +   new |= PSR_TCO_BIT;
> >
> > new |= (old & PSR_DIT_BIT);
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index c10207fed2f3..52326b739357 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -822,6 +822,45 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
> > *memslot,
> > return PAGE_SIZE;
> >  }
> >
> > +/*
> > + * The page will be mapped in stage 2 as Normal Cacheable, so the VM will 
> > be
> > + * able to see the page's tags and therefore they must be initialised 
> > first. If
> > + * PG_mte_tagged is set, tags have already been initialised.
> > + *
> > + * The race in the test/set of the PG_mte_tagged flag is handled by:
> > + * - preventing VM_SHARED mappings in a memslot with MTE preventing two VMs
> > + *   racing to sanitise the same page
> > + * - mmap_lock protects between a VM faulting a page in and the VMM 
> > performing
> > + *   an mprotect() to add VM_MTE
> > + */
> > +static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
> > +unsigned long size)
> > +{
> > +   unsigned long i, nr_pages = size >> PAGE_SHIFT;
> > +   struct page *page;
> > +
> > +   if (!kvm_has_mte(kvm))
> > +   return 0;
> > +
> > +   /*
> > +* pfn_to_online_page() is used to reject ZONE_DEVICE pages
> > +* that may not support tags.
> > +*/
> > +   page = pfn_to_online_page(pfn);
> > +
> > +   if (!page)
> > +   return -EFAULT;
> > +
> > +   for (i = 0; i < nr_pages; i++, page++) {
> > +   if (!test_bit(PG_mte_tagged, &page->flags)) {
> > +   mte_clear_page_tags(page_address(page));
> > +   set_bit(PG_mte_tagged, &page->flags);
> > +   }
> > +   }
> > +
> > +   return 0;
> > +}
> > +
> >  static int us

Re: [PATCH v17 5/6] KVM: arm64: ioctl to fetch/store tags in a guest

2021-06-22 Thread Marc Zyngier
Hi Fuad,

On Tue, 22 Jun 2021 09:56:22 +0100,
Fuad Tabba  wrote:
> 
> Hi,
> 
> 
> On Mon, Jun 21, 2021 at 12:18 PM Steven Price  wrote:
> >
> > The VMM may not wish to have its own mapping of guest memory mapped
> > with PROT_MTE because this causes problems if the VMM has tag checking
> > enabled (the guest controls the tags in physical RAM and it's unlikely
> > the tags are correct for the VMM).
> >
> > Instead add a new ioctl which allows the VMM to easily read/write the
> > tags from guest memory, allowing the VMM's mapping to be non-PROT_MTE
> > while the VMM can still read/write the tags for the purpose of
> > migration.
> >
> > Reviewed-by: Catalin Marinas 
> > Signed-off-by: Steven Price 
> > ---
> >  arch/arm64/include/asm/kvm_host.h |  3 ++
> >  arch/arm64/include/asm/mte-def.h  |  1 +
> >  arch/arm64/include/uapi/asm/kvm.h | 11 +
> >  arch/arm64/kvm/arm.c  |  7 +++
> >  arch/arm64/kvm/guest.c| 82 +++
> >  include/uapi/linux/kvm.h  |  1 +
> >  6 files changed, 105 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h 
> > b/arch/arm64/include/asm/kvm_host.h
> > index 309e36cc1b42..6a2ac4636d42 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -729,6 +729,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
> >  int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
> >struct kvm_device_attr *attr);
> >
> > +long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
> > +   struct kvm_arm_copy_mte_tags *copy_tags);
> > +
> >  /* Guest/host FPSIMD coordination helpers */
> >  int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
> >  void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu);
> > diff --git a/arch/arm64/include/asm/mte-def.h 
> > b/arch/arm64/include/asm/mte-def.h
> > index cf241b0f0a42..626d359b396e 100644
> > --- a/arch/arm64/include/asm/mte-def.h
> > +++ b/arch/arm64/include/asm/mte-def.h
> > @@ -7,6 +7,7 @@
> >
> >  #define MTE_GRANULE_SIZE   UL(16)
> >  #define MTE_GRANULE_MASK   (~(MTE_GRANULE_SIZE - 1))
> > +#define MTE_GRANULES_PER_PAGE  (PAGE_SIZE / MTE_GRANULE_SIZE)
> >  #define MTE_TAG_SHIFT  56
> >  #define MTE_TAG_SIZE   4
> >  #define MTE_TAG_MASK   GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE - 
> > 1)), MTE_TAG_SHIFT)
> > diff --git a/arch/arm64/include/uapi/asm/kvm.h 
> > b/arch/arm64/include/uapi/asm/kvm.h
> > index 24223adae150..b3edde68bc3e 100644
> > --- a/arch/arm64/include/uapi/asm/kvm.h
> > +++ b/arch/arm64/include/uapi/asm/kvm.h
> > @@ -184,6 +184,17 @@ struct kvm_vcpu_events {
> > __u32 reserved[12];
> >  };
> >
> > +struct kvm_arm_copy_mte_tags {
> > +   __u64 guest_ipa;
> > +   __u64 length;
> > +   void __user *addr;
> > +   __u64 flags;
> > +   __u64 reserved[2];
> > +};
> > +
> > +#define KVM_ARM_TAGS_TO_GUEST  0
> > +#define KVM_ARM_TAGS_FROM_GUEST1
> > +
> >  /* If you need to interpret the index values, here is the key: */
> >  #define KVM_REG_ARM_COPROC_MASK0x0FFF
> >  #define KVM_REG_ARM_COPROC_SHIFT   16
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 28ce26a68f09..511f3716fe33 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -1359,6 +1359,13 @@ long kvm_arch_vm_ioctl(struct file *filp,
> >
> > return 0;
> > }
> > +   case KVM_ARM_MTE_COPY_TAGS: {
> > +   struct kvm_arm_copy_mte_tags copy_tags;
> > +
> > +   if (copy_from_user(&copy_tags, argp, sizeof(copy_tags)))
> > +   return -EFAULT;
> > +   return kvm_vm_ioctl_mte_copy_tags(kvm, &copy_tags);
> > +   }
> > default:
> > return -EINVAL;
> > }
> > diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> > index 5cb4a1cd5603..4ddb20017b2f 100644
> > --- a/arch/arm64/kvm/guest.c
> > +++ b/arch/arm64/kvm/guest.c
> > @@ -995,3 +995,85 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
> >
> > return ret;
> >  }
> > +
> > +long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
> > +   struct kvm_arm_copy_mte_tags *copy_tags)
> > +{
> > +   gpa_t guest_ipa = copy_tags->guest_ipa;
> > +   size_t length = copy_tags->length;
> > +   void __user *tags = copy_tags->addr;
> > +   gpa_t gfn;
> > +   bool write = !(copy_tags->flags & KVM_ARM_TAGS_FROM_GUEST);
> > +   int ret = 0;
> > +
> > +   if (!kvm_has_mte(kvm))
> > +   return -EINVAL;
> > +
> > +   if (copy_tags->reserved[0] || copy_tags->reserved[1])
> > +   return -EINVAL;
> > +
> > +   if (copy_tags->flags & ~KVM_ARM_TAGS_FROM_GUEST)
> > +   return -EINVAL;
> > +
> > +   if (length & ~PAGE_MASK || guest_ipa & ~PAGE_MASK)
> > +   return -EINVAL;
> > +
> > +   gfn = gpa_to_gfn(gue

Re: [PATCH v17 6/6] KVM: arm64: Document MTE capability and ioctl

2021-06-22 Thread Marc Zyngier
On Tue, 22 Jun 2021 10:42:42 +0100,
Fuad Tabba  wrote:
> 
> Hi,
> 
> 
> On Mon, Jun 21, 2021 at 12:18 PM Steven Price  wrote:
> >
> > A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
> > granting a guest access to the tags, and provides a mechanism for the
> > VMM to enable it.
> >
> > A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
> > access the tags of a guest without having to maintain a PROT_MTE mapping
> > in userspace. The above capability gates access to the ioctl.
> >
> > Reviewed-by: Catalin Marinas 
> > Signed-off-by: Steven Price 
> > ---
> >  Documentation/virt/kvm/api.rst | 61 ++
> >  1 file changed, 61 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 7fcb2fd38f42..97661a97943f 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -5034,6 +5034,43 @@ see KVM_XEN_VCPU_SET_ATTR above.
> >  The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
> >  with the KVM_XEN_VCPU_GET_ATTR ioctl.
> >
> > +4.130 KVM_ARM_MTE_COPY_TAGS
> > +---
> > +
> > +:Capability: KVM_CAP_ARM_MTE
> > +:Architectures: arm64
> > +:Type: vm ioctl
> > +:Parameters: struct kvm_arm_copy_mte_tags
> > +:Returns: number of bytes copied, < 0 on error (-EINVAL for incorrect
> > +  arguments, -EFAULT if memory cannot be accessed).
> > +
> > +::
> > +
> > +  struct kvm_arm_copy_mte_tags {
> > +   __u64 guest_ipa;
> > +   __u64 length;
> > +   void __user *addr;
> > +   __u64 flags;
> > +   __u64 reserved[2];
> > +  };
> > +
> > +Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
> > +``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The 
> > ``addr``
> > +field must point to a buffer which the tags will be copied to or from.
> > +
> > +``flags`` specifies the direction of copy, either 
> > ``KVM_ARM_TAGS_TO_GUEST`` or
> > +``KVM_ARM_TAGS_FROM_GUEST``.
> > +
> > +The size of the buffer to store the tags is ``(length / 16)`` bytes
> > +(granules in MTE are 16 bytes long). Each byte contains a single tag
> > +value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
> > +``PTRACE_POKEMTETAGS``.
> > +
> > +If an error occurs before any data is copied then a negative error code is
> > +returned. If some tags have been copied before an error occurs then the 
> > number
> > +of bytes successfully copied is returned. If the call completes 
> > successfully
> > +then ``length`` is returned.
> > +
> >  5. The kvm_run structure
> >  
> >
> > @@ -6362,6 +6399,30 @@ default.
> >
> >  See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
> >
> > +7.26 KVM_CAP_ARM_MTE
> > +
> > +
> > +:Architectures: arm64
> > +:Parameters: none
> > +
> > +This capability indicates that KVM (and the hardware) supports exposing the
> > +Memory Tagging Extensions (MTE) to the guest. It must also be enabled by 
> > the
> > +VMM before creating any VCPUs to allow the guest access. Note that MTE is 
> > only
> > +available to a guest running in AArch64 mode and enabling this capability 
> > will
> > +cause attempts to create AArch32 VCPUs to fail.
> 
> I was wondering if there might be an issue with AArch32 at EL0 and
> MTE, because I think that even if AArch64 at EL1 is disallowed, the

Did you mean AArch32 here?

> guest can still run AArch32 at EL0.

I don't get your question:

- If the guest is AArch32 at EL1, there is no MTE whatsoever (where
  would you place the tag?)

- If the guest is AArch64, it can have MTE enabled or not,
  irrespective of the EL. If this guest decides to run an AArch32 EL0,
  the architecture rules still apply, and it cannot expose MTE to its
  own 32bit userspace. Nothing that KVM needs to do about this.

What KVM enforces is that at the point where the guest is in charge,
we have a consistent architectural behaviour.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.
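
As a concrete illustration of the sizing rule documented above (one tag
byte per 16-byte granule, matching PTRACE_PEEKMTETAGS), a VMM-side
request could be assembled roughly as follows. This is a sketch, not
authoritative: the struct mirrors the uapi definition quoted in this
thread, while the helper names are invented and the real ioctl number
comes from <linux/kvm.h>.

```c
/*
 * Sketch: sizing the tag buffer and filling a KVM_ARM_MTE_COPY_TAGS
 * request. Struct layout mirrors the uapi definition quoted in this
 * thread; helper names here are illustrative only.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MTE_GRANULE_SIZE 16UL		/* MTE granules are 16 bytes long */
#define KVM_ARM_TAGS_FROM_GUEST 1

struct kvm_arm_copy_mte_tags {
	uint64_t guest_ipa;
	uint64_t length;
	void *addr;			/* void __user * in the kernel */
	uint64_t flags;
	uint64_t reserved[2];
};

/* One tag byte per granule: a range of N bytes needs N/16 tag bytes. */
static inline size_t mte_tag_buf_size(uint64_t length)
{
	return length / MTE_GRANULE_SIZE;
}

/*
 * Fill the request; the caller would then issue
 * ioctl(vm_fd, KVM_ARM_MTE_COPY_TAGS, &ct) and check for a return
 * value equal to length (fewer bytes means a partial copy).
 */
static inline void mte_copy_tags_req(struct kvm_arm_copy_mte_tags *ct,
				     uint64_t ipa, uint64_t len, void *buf)
{
	memset(ct, 0, sizeof(*ct));	/* reserved[] must be zero */
	ct->guest_ipa = ipa;		/* must be PAGE_SIZE aligned */
	ct->length = len;		/* must be PAGE_SIZE aligned */
	ct->addr = buf;
	ct->flags = KVM_ARM_TAGS_FROM_GUEST;
}
```

For a single 4K page this works out to a 256-byte tag buffer.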



Re: [PATCH v17 4/6] KVM: arm64: Expose KVM_ARM_CAP_MTE

2021-06-22 Thread Marc Zyngier
On Tue, 22 Jun 2021 09:07:51 +0100,
Fuad Tabba  wrote:
> 
> Hi,
> 
> On Mon, Jun 21, 2021 at 12:18 PM Steven Price  wrote:
> >
> > It's now safe for the VMM to enable MTE in a guest, so expose the
> > capability to user space.
> >
> > Reviewed-by: Catalin Marinas 
> > Signed-off-by: Steven Price 
> > ---
> >  arch/arm64/kvm/arm.c  | 9 +
> >  arch/arm64/kvm/reset.c| 4 
> >  arch/arm64/kvm/sys_regs.c | 3 +++
> >  3 files changed, 16 insertions(+)
> >
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index e720148232a0..28ce26a68f09 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -93,6 +93,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > r = 0;
> > kvm->arch.return_nisv_io_abort_to_user = true;
> > break;
> > +   case KVM_CAP_ARM_MTE:
> > +   if (!system_supports_mte() || kvm->created_vcpus)
> > +   return -EINVAL;
> > +   r = 0;
> > +   kvm->arch.mte_enabled = true;
> > +   break;
> > default:
> > r = -EINVAL;
> > break;
> > @@ -237,6 +243,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
> > ext)
> >  */
> > r = 1;
> > break;
> > +   case KVM_CAP_ARM_MTE:
> > +   r = system_supports_mte();
> > +   break;
> > case KVM_CAP_STEAL_TIME:
> > r = kvm_arm_pvtime_supported();
> > break;
> > diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> > index d37ebee085cf..9e6922b9503a 100644
> > --- a/arch/arm64/kvm/reset.c
> > +++ b/arch/arm64/kvm/reset.c
> > @@ -244,6 +244,10 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
> > switch (vcpu->arch.target) {
> > default:
> > if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
> > +   if (vcpu->kvm->arch.mte_enabled) {
> > +   ret = -EINVAL;
> > +   goto out;
> > +   }
> > pstate = VCPU_RESET_PSTATE_SVC;
> > } else {
> > pstate = VCPU_RESET_PSTATE_EL1;
> 
> nit: I was wondering whether this check would be better suited in
> kvm_vcpu_set_target, rather than here (kvm_reset_vcpu). kvm_reset_vcpu
> is called by kvm_vcpu_set_target, but kvm_vcpu_set_target is where
> checking for supported features happens. It might be better to group
> all such checks together. I don't think that there is any risk of this
> feature being toggled by the other call path to kvm_reset_vcpu (via
> check_vcpu_requests).

We already group the 32bit related compatibility checks in
vcpu_allowed_register_width(), and this is where I think this should
move to. I've provisionally added the change below.

M.

diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 9e6922b9503a..cba7872d69a8 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -176,6 +176,10 @@ static bool vcpu_allowed_register_width(struct kvm_vcpu 
*vcpu)
if (!cpus_have_const_cap(ARM64_HAS_32BIT_EL1) && is32bit)
return false;
 
+   /* MTE is incompatible with AArch32 */
+   if (kvm_has_mte(vcpu->kvm) && is32bit)
+   return false;
+
/* Check that the vcpus are either all 32bit or all 64bit */
kvm_for_each_vcpu(i, tmp, vcpu->kvm) {
if (vcpu_has_feature(tmp, KVM_ARM_VCPU_EL1_32BIT) != is32bit)
@@ -244,10 +248,6 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
switch (vcpu->arch.target) {
default:
if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
-   if (vcpu->kvm->arch.mte_enabled) {
-   ret = -EINVAL;
-   goto out;
-   }
pstate = VCPU_RESET_PSTATE_SVC;
} else {
pstate = VCPU_RESET_PSTATE_EL1;


-- 
Without deviation from the norm, progress is not possible.
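
The ordering the patch above enforces (KVM_CAP_ARM_MTE may only be
enabled before any vCPU exists, and an MTE-enabled VM rejects AArch32
vCPUs) can be modelled with a small toy sketch. Names and structures
below are stand-ins, not the real kvm internals.

```c
/*
 * Toy model (not kernel code) of the constraints in this patch:
 * enabling MTE fails once vCPUs have been created, and creating an
 * EL1_32BIT vCPU fails once MTE is enabled.
 */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_kvm {
	int created_vcpus;
	bool mte_enabled;
};

static int toy_enable_mte(struct toy_kvm *kvm, bool system_supports_mte)
{
	if (!system_supports_mte || kvm->created_vcpus)
		return -EINVAL;
	kvm->mte_enabled = true;
	return 0;
}

static int toy_create_vcpu(struct toy_kvm *kvm, bool el1_32bit)
{
	if (el1_32bit && kvm->mte_enabled)
		return -EINVAL;		/* MTE is incompatible with AArch32 */
	kvm->created_vcpus++;
	return 0;
}
```

In other words, the VMM must decide on MTE up front, before its first
KVM_CREATE_VCPU call.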



Re: [PATCH v16 3/7] KVM: arm64: Introduce MTE VM feature

2021-06-21 Thread Marc Zyngier
On Fri, 18 Jun 2021 14:28:22 +0100,
Steven Price  wrote:
> 
> Add a new VM feature 'KVM_ARM_CAP_MTE' which enables memory tagging
> for a VM. This will expose the feature to the guest and automatically
> tag memory pages touched by the VM as PG_mte_tagged (and clear the tag
> storage) to ensure that the guest cannot see stale tags, and so that
> the tags are correctly saved/restored across swap.
> 
> Actually exposing the new capability to user space happens in a later
> patch.
> 
> Signed-off-by: Steven Price 
> ---
>  arch/arm64/include/asm/kvm_emulate.h |  3 ++
>  arch/arm64/include/asm/kvm_host.h|  3 ++
>  arch/arm64/kvm/hyp/exception.c   |  3 +-
>  arch/arm64/kvm/mmu.c | 62 +++-
>  arch/arm64/kvm/sys_regs.c|  7 
>  include/uapi/linux/kvm.h |  1 +
>  6 files changed, 77 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_emulate.h 
> b/arch/arm64/include/asm/kvm_emulate.h
> index f612c090f2e4..6bf776c2399c 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -84,6 +84,9 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
>   if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) ||
>   vcpu_el1_is_32bit(vcpu))
>   vcpu->arch.hcr_el2 |= HCR_TID2;
> +
> + if (kvm_has_mte(vcpu->kvm))
> + vcpu->arch.hcr_el2 |= HCR_ATA;
>  }
>  
>  static inline unsigned long *vcpu_hcr(struct kvm_vcpu *vcpu)
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 7cd7d5c8c4bc..afaa5333f0e4 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -132,6 +132,8 @@ struct kvm_arch {
>  
>   u8 pfr0_csv2;
>   u8 pfr0_csv3;
> + /* Memory Tagging Extension enabled for the guest */
> + bool mte_enabled;
>  };
>  
>  struct kvm_vcpu_fault_info {
> @@ -769,6 +771,7 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
>  #define kvm_arm_vcpu_sve_finalized(vcpu) \
>   ((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
>  
> +#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
>  #define kvm_vcpu_has_pmu(vcpu)   \
>   (test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))
>  
> diff --git a/arch/arm64/kvm/hyp/exception.c b/arch/arm64/kvm/hyp/exception.c
> index 73629094f903..56426565600c 100644
> --- a/arch/arm64/kvm/hyp/exception.c
> +++ b/arch/arm64/kvm/hyp/exception.c
> @@ -112,7 +112,8 @@ static void enter_exception64(struct kvm_vcpu *vcpu, 
> unsigned long target_mode,
>   new |= (old & PSR_C_BIT);
>   new |= (old & PSR_V_BIT);
>  
> - // TODO: TCO (if/when ARMv8.5-MemTag is exposed to guests)
> + if (kvm_has_mte(vcpu->kvm))
> + new |= PSR_TCO_BIT;
>  
>   new |= (old & PSR_DIT_BIT);
>  
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c5d1f3c87dbd..f5305b7561ad 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -822,6 +822,45 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
> *memslot,
>   return PAGE_SIZE;
>  }
>  
> +/*
> + * The page will be mapped in stage 2 as Normal Cacheable, so the VM will be
> + * able to see the page's tags and therefore they must be initialised first. 
> If
> + * PG_mte_tagged is set, tags have already been initialised.
> + *
> + * The race in the test/set of the PG_mte_tagged flag is handled by:
> + * - preventing VM_SHARED mappings in a memslot with MTE preventing two VMs
> > + *   racing to sanitise the same page
> + * - mmap_lock protects between a VM faulting a page in and the VMM 
> performing
> + *   an mprotect() to add VM_MTE
> + */
> +static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
> +  unsigned long size)
> +{
> + unsigned long i, nr_pages = size >> PAGE_SHIFT;
> + struct page *page;
> +
> + if (!kvm_has_mte(kvm))
> + return 0;
> +
> + /*
> +  * pfn_to_online_page() is used to reject ZONE_DEVICE pages
> +  * that may not support tags.
> +  */
> + page = pfn_to_online_page(pfn);
> +
> + if (!page)
> + return -EFAULT;
> +
> + for (i = 0; i < nr_pages; i++, page++) {
> + if (!test_bit(PG_mte_tagged, &page->flags)) {
> + mte_clear_page_tags(page_address(page));
> + set_bit(PG_mte_tagged, &page->flags);
> + }
> + }
> +
> + return 0;
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> struct kvm_memory_slot *memslot, unsigned long hva,
> unsigned long fault_status)
> @@ -971,8 +1010,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   if (writable)
>   prot |= KVM_PGTABLE_PROT_W;
>  
> - if (fault_status != FSC_PERM && !device)
> + if (fault_status != FSC_PERM &

Re: [PATCH v16 1/7] arm64: mte: Handle race when synchronising tags

2021-06-18 Thread Marc Zyngier

On 2021-06-18 15:40, Catalin Marinas wrote:

On Fri, Jun 18, 2021 at 02:28:20PM +0100, Steven Price wrote:

mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
before restoring/zeroing the MTE tags. However if another thread were to
race and attempt to sync the tags on the same page before the first
thread had completed restoring/zeroing then it would see the flag is
already set and continue without waiting. This would potentially expose
the previous contents of the tags to user space, and cause any updates
that user space makes before the restoring/zeroing has completed to
potentially be lost.

Since this code is run from atomic contexts we can't just lock the page
during the process. Instead implement a new (global) spinlock to protect
the mte_sync_page_tags() function.

Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped
in user-space with PROT_MTE")
Reviewed-by: Catalin Marinas 
Signed-off-by: Steven Price 


Although I reviewed this patch, I think we should drop it from this
series and restart the discussion with the Chromium guys on what/if they
need PROT_MTE with MAP_SHARED. It currently breaks if you have two
PROT_MTE mappings but if they are ok with only one of the mappings being
PROT_MTE, I'm happy to just document it.

Not sure whether subsequent patches depend on it though.


I'd certainly like it to be independent of the KVM series, specially
as this series is pretty explicit that this MTE lock is not required
for KVM.

This will require some rework of patch #2, I believe. And while we're
at it, a rebase on 5.13-rc4 wouldn't hurt, as both patches #3 and #5
conflict with it...

Thanks,

M.
--
Jazz is not dead. It just smells funny...



Re: [PATCH v15 0/7] MTE support for KVM guest

2021-06-17 Thread Marc Zyngier
On Thu, 17 Jun 2021 14:24:25 +0100,
Steven Price  wrote:
> 
> On 17/06/2021 14:15, Marc Zyngier wrote:
> > On Thu, 17 Jun 2021 13:13:22 +0100,
> > Catalin Marinas  wrote:
> >>
> >> On Mon, Jun 14, 2021 at 10:05:18AM +0100, Steven Price wrote:
> >>> I realise there are still open questions[1] around the performance of
> >>> this series (the 'big lock', tag_sync_lock, introduced in the first
> >>> patch). But there should be no impact on non-MTE workloads and until we
> >>> get real MTE-enabled hardware it's hard to know whether there is a need
> >>> for something more sophisticated or not. Peter Collingbourne's patch[3]
> >>> to clear the tags at page allocation time should hide more of the impact
> >>> for non-VM cases. So the remaining concern is around VM startup which
> >>> could be effectively serialised through the lock.
> >> [...]
> >>> [1]: https://lore.kernel.org/r/874ke7z3ng.wl-maz%40kernel.org
> >>
> >> Start-up, VM resume, migration could be affected by this lock, basically
> >> any time you fault a page into the guest. As you said, for now it should
> >> be fine as long as the hardware doesn't support MTE or qemu doesn't
> >> enable MTE in guests. But the problem won't go away.
> > 
> > Indeed. And I find it odd to say "it's not a problem, we don't have
> > any HW available". By this token, why should we merge this work the
> > first place, or any of the MTE work that has gone into the kernel over
> > the past years?
> > 
> >> We have a partial solution with an array of locks to mitigate against
> >> this but there's still the question of whether we should actually bother
> >> for something that's unlikely to happen in practice: MAP_SHARED memory
> >> in guests (ignoring the stage 1 case for now).
> >>
> >> If MAP_SHARED in guests is not a realistic use-case, we have the vma in
> >> user_mem_abort() and if the VM_SHARED flag is set together with MTE
> >> enabled for guests, we can reject the mapping.
> > 
> > That's a reasonable approach. I wonder whether we could do that right
> > at the point where the memslot is associated with the VM, like this:
> > 
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index a36a2e3082d8..ebd3b3224386 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1376,6 +1376,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> > if (!vma)
> > break;
> >  
> > +   if (kvm_has_mte(kvm) && vma->vm_flags & VM_SHARED)
> > +   return -EINVAL;
> > +
> > /*
> >  * Take the intersection of this VMA with the memory region
> >  */
> > 
> > which takes the problem out of the fault path altogether? We document
> > the restriction and move on. With that, we can use a non-locking
> > version of mte_sync_page_tags().
> 
> Does this deal with the case where the VMAs are changed after the
> memslot is created? While we can do the check here to give the VMM a
> heads-up if it gets it wrong, I think we also need it in
> user_mem_abort() to deal with a VMM which mmap()s over the VA of the
> memslot. Or am I missing something?

No, you're right. I wish the memslot API wasn't so lax... Anyway, even
a VMA flag check in user_mem_abort() will be cheaper than this new BKL.

> But if everyone is happy with the restriction (just for KVM) of not
> allowing MTE+VM_SHARED then that sounds like a good way forward.

Definitely works for me.

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v15 0/7] MTE support for KVM guest

2021-06-17 Thread Marc Zyngier
On Thu, 17 Jun 2021 13:13:22 +0100,
Catalin Marinas  wrote:
> 
> On Mon, Jun 14, 2021 at 10:05:18AM +0100, Steven Price wrote:
> > I realise there are still open questions[1] around the performance of
> > this series (the 'big lock', tag_sync_lock, introduced in the first
> > patch). But there should be no impact on non-MTE workloads and until we
> > get real MTE-enabled hardware it's hard to know whether there is a need
> > for something more sophisticated or not. Peter Collingbourne's patch[3]
> > to clear the tags at page allocation time should hide more of the impact
> > for non-VM cases. So the remaining concern is around VM startup which
> > could be effectively serialised through the lock.
> [...]
> > [1]: https://lore.kernel.org/r/874ke7z3ng.wl-maz%40kernel.org
> 
> Start-up, VM resume, migration could be affected by this lock, basically
> any time you fault a page into the guest. As you said, for now it should
> be fine as long as the hardware doesn't support MTE or qemu doesn't
> enable MTE in guests. But the problem won't go away.

Indeed. And I find it odd to say "it's not a problem, we don't have
any HW available". By this token, why should we merge this work the
first place, or any of the MTE work that has gone into the kernel over
the past years?

> We have a partial solution with an array of locks to mitigate against
> this but there's still the question of whether we should actually bother
> for something that's unlikely to happen in practice: MAP_SHARED memory
> in guests (ignoring the stage 1 case for now).
> 
> If MAP_SHARED in guests is not a realistic use-case, we have the vma in
> user_mem_abort() and if the VM_SHARED flag is set together with MTE
> enabled for guests, we can reject the mapping.

That's a reasonable approach. I wonder whether we could do that right
at the point where the memslot is associated with the VM, like this:

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a36a2e3082d8..ebd3b3224386 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1376,6 +1376,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
if (!vma)
break;
 
+   if (kvm_has_mte(kvm) && vma->vm_flags & VM_SHARED)
+   return -EINVAL;
+
/*
 * Take the intersection of this VMA with the memory region
 */

which takes the problem out of the fault path altogether? We document
the restriction and move on. With that, we can use a non-locking
version of mte_sync_page_tags().

> We can discuss the stage 1 case separately from this series.

Works for me.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.
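
The memslot-time check proposed above is a one-line predicate; sketched
in isolation (flag value and function name below are stand-ins, not the
kernel's), it amounts to:

```c
/*
 * Illustrative sketch of the proposed check: reject a memslot whose
 * backing VMA is VM_SHARED when MTE is enabled for the guest. The
 * flag value and function name are stand-ins for this sketch.
 */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_VM_SHARED 0x08UL		/* stand-in for VM_SHARED */

static int toy_check_memslot_vma(bool kvm_has_mte, unsigned long vm_flags)
{
	if (kvm_has_mte && (vm_flags & TOY_VM_SHARED))
		return -EINVAL;		/* MAP_SHARED + MTE unsupported */
	return 0;
}
```

As the thread notes, the same predicate would also need to run in
user_mem_abort(), since the VMM can mmap() over the memslot's VA range
after registration.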



Re: [PATCH v14 1/8] arm64: mte: Handle race when synchronising tags

2021-06-09 Thread Marc Zyngier
On Wed, 09 Jun 2021 11:51:34 +0100,
Steven Price  wrote:
> 
> On 09/06/2021 11:30, Marc Zyngier wrote:
> > On Mon, 07 Jun 2021 12:08:09 +0100,
> > Steven Price  wrote:
> >>
> >> mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
> >> before restoring/zeroing the MTE tags. However if another thread were to
> >> race and attempt to sync the tags on the same page before the first
> >> thread had completed restoring/zeroing then it would see the flag is
> >> already set and continue without waiting. This would potentially expose
> >> the previous contents of the tags to user space, and cause any updates
> >> that user space makes before the restoring/zeroing has completed to
> >> potentially be lost.
> >>
> >> Since this code is run from atomic contexts we can't just lock the page
> >> during the process. Instead implement a new (global) spinlock to protect
> >> the mte_sync_page_tags() function.
> >>
> >> Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped in 
> >> user-space with PROT_MTE")
> >> Reviewed-by: Catalin Marinas 
> >> Signed-off-by: Steven Price 
> >> ---
> >>  arch/arm64/kernel/mte.c | 20 +---
> >>  1 file changed, 17 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> >> index 125a10e413e9..a3583a7fd400 100644
> >> --- a/arch/arm64/kernel/mte.c
> >> +++ b/arch/arm64/kernel/mte.c
> >> @@ -25,6 +25,7 @@
> >>  u64 gcr_kernel_excl __ro_after_init;
> >>  
> >>  static bool report_fault_once = true;
> >> +static DEFINE_SPINLOCK(tag_sync_lock);
> >>  
> >>  #ifdef CONFIG_KASAN_HW_TAGS
> >>  /* Whether the MTE asynchronous mode is enabled. */
> >> @@ -34,13 +35,22 @@ EXPORT_SYMBOL_GPL(mte_async_mode);
> >>  
> >>  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool 
> >> check_swap)
> >>  {
> >> +  unsigned long flags;
> >>pte_t old_pte = READ_ONCE(*ptep);
> >>  
> >> +  spin_lock_irqsave(&tag_sync_lock, flags);
> > 
> > having thought a bit more about this after an offline discussion with
> > Catalin: why can't this lock be made per mm? We can't really share
> > tags across processes anyway, so this is limited to threads from the
> > same process.
> 
> Currently there's nothing stopping processes sharing tags (mmap(...,
> PROT_MTE, MAP_SHARED)) - I agree making use of this is tricky and it
> would have been nice if this had just been prevented from the
> beginning.

I don't think it should be prevented. I think it should be made clear
that it is unreliable and that it will result in tag corruption.

> Given the above, clearly the lock can't be per mm and robust.

I don't think we need to make it robust. The architecture actively
prevents sharing if the tags are also shared, just like we can't
really expect the VMM to share tags with the guest.

> > I'd also like it to be documented that page sharing can only reliably
> > work with tagging if only one of the mappings is using tags.
> 
> I'm not entirely clear whether you mean "can only reliably work" to be
> "is practically impossible to coordinate tag values", or whether you are
> proposing to (purposefully) introduce the race with a per-mm lock? (and
> document it).

The latter. You can obviously communicate your tags to another task,
but this should come with attached restrictions (mlock?).

> I guess we could have a per-mm lock and handle the race if user space
> screws up with the outcome being lost tags (double clear).
> 
> But it feels to me like it could come back to bite in the future since
> VM_SHARED|VM_MTE will almost always work and I fear someone will start
> using it since it's permitted by the kernel.

I'm really worried that performance is going to suck even on a small
system, and this global lock will be heavily contended, even without
considering KVM.

Catalin, any thoughts?

M.

-- 
Without deviation from the norm, progress is not possible.



Re: [PATCH v14 1/8] arm64: mte: Handle race when synchronising tags

2021-06-09 Thread Marc Zyngier
On Mon, 07 Jun 2021 12:08:09 +0100,
Steven Price  wrote:
> 
> mte_sync_tags() used test_and_set_bit() to set the PG_mte_tagged flag
> before restoring/zeroing the MTE tags. However if another thread were to
> race and attempt to sync the tags on the same page before the first
> thread had completed restoring/zeroing then it would see the flag is
> already set and continue without waiting. This would potentially expose
> the previous contents of the tags to user space, and cause any updates
> that user space makes before the restoring/zeroing has completed to
> potentially be lost.
> 
> Since this code is run from atomic contexts we can't just lock the page
> during the process. Instead implement a new (global) spinlock to protect
> the mte_sync_page_tags() function.
> 
> Fixes: 34bfeea4a9e9 ("arm64: mte: Clear the tags when a page is mapped in 
> user-space with PROT_MTE")
> Reviewed-by: Catalin Marinas 
> Signed-off-by: Steven Price 
> ---
>  arch/arm64/kernel/mte.c | 20 +---
>  1 file changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index 125a10e413e9..a3583a7fd400 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -25,6 +25,7 @@
>  u64 gcr_kernel_excl __ro_after_init;
>  
>  static bool report_fault_once = true;
> +static DEFINE_SPINLOCK(tag_sync_lock);
>  
>  #ifdef CONFIG_KASAN_HW_TAGS
>  /* Whether the MTE asynchronous mode is enabled. */
> @@ -34,13 +35,22 @@ EXPORT_SYMBOL_GPL(mte_async_mode);
>  
>  static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool 
> check_swap)
>  {
> + unsigned long flags;
>   pte_t old_pte = READ_ONCE(*ptep);
>  
> + spin_lock_irqsave(&tag_sync_lock, flags);

having thought a bit more about this after an offline discussion with
Catalin: why can't this lock be made per mm? We can't really share
tags across processes anyway, so this is limited to threads from the
same process.

I'd also like it to be documented that page sharing can only reliably
work with tagging if only one of the mappings is using tags.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.
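
The race described in the commit message can be shown deterministically
with a step-by-step simulation (toy structures, not kernel code): with a
bare test_and_set_bit(), the losing thread observes the flag already set
and proceeds while the winner is still clearing the tags.

```c
/*
 * Deterministic illustration of the race: with bare test_and_set_bit()
 * semantics, thread B sees PG_mte_tagged already set and continues
 * without waiting, although thread A has not yet finished clearing the
 * tags. Toy structures only; the fix holds a lock across both steps.
 */
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	bool pg_mte_tagged;
	bool tags_clean;
};

/* models test_and_set_bit(PG_mte_tagged, &page->flags) */
static bool test_and_set_tagged(struct toy_page *p)
{
	bool old = p->pg_mte_tagged;
	p->pg_mte_tagged = true;
	return old;
}

/*
 * Interleaving with the unlocked scheme:
 *   A: test_and_set -> false, begins clearing (not done yet)
 *   B: test_and_set -> true, continues without waiting
 * B now treats the page as tagged although the tags are still stale.
 */
static bool race_exposes_stale_tags(void)
{
	struct toy_page page = { false, false };

	bool a_saw_set = test_and_set_tagged(&page);	/* A wins the race */
	bool b_saw_set = test_and_set_tagged(&page);	/* B races in */
	bool b_uses_stale = b_saw_set && !page.tags_clean;

	page.tags_clean = true;		/* A finishes clearing too late */
	return !a_saw_set && b_uses_stale;
}
```

Holding tag_sync_lock across both the flag test and the clear closes
exactly this window, at the cost of the serialisation debated above.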



Re: [PATCH v12 8/8] KVM: arm64: Document MTE capability and ioctl

2021-05-20 Thread Marc Zyngier
On Wed, 19 May 2021 15:09:23 +0100,
Steven Price  wrote:
> 
> On 17/05/2021 19:09, Marc Zyngier wrote:
> > On Mon, 17 May 2021 13:32:39 +0100,
> > Steven Price  wrote:
> >>
> >> A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
> >> granting a guest access to the tags, and provides a mechanism for the
> >> VMM to enable it.
> >>
> >> A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
> >> access the tags of a guest without having to maintain a PROT_MTE mapping
> >> in userspace. The above capability gates access to the ioctl.
> >>
> >> Signed-off-by: Steven Price 
> >> ---
> >>  Documentation/virt/kvm/api.rst | 53 ++
> >>  1 file changed, 53 insertions(+)
> >>
> >> diff --git a/Documentation/virt/kvm/api.rst 
> >> b/Documentation/virt/kvm/api.rst
> >> index 22d077562149..a31661b870ba 100644
> >> --- a/Documentation/virt/kvm/api.rst
> >> +++ b/Documentation/virt/kvm/api.rst
> >> @@ -5034,6 +5034,40 @@ see KVM_XEN_VCPU_SET_ATTR above.
> >>  The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
> >>  with the KVM_XEN_VCPU_GET_ATTR ioctl.
> >>  
> >> +4.130 KVM_ARM_MTE_COPY_TAGS
> >> +---
> >> +
> >> +:Capability: KVM_CAP_ARM_MTE
> >> +:Architectures: arm64
> >> +:Type: vm ioctl
> >> +:Parameters: struct kvm_arm_copy_mte_tags
> >> +:Returns: 0 on success, < 0 on error
> >> +
> >> +::
> >> +
> >> +  struct kvm_arm_copy_mte_tags {
> >> +  __u64 guest_ipa;
> >> +  __u64 length;
> >> +  union {
> >> +  void __user *addr;
> >> +  __u64 padding;
> >> +  };
> >> +  __u64 flags;
> >> +  __u64 reserved[2];
> >> +  };
> > 
> > This doesn't exactly match the structure in the previous patch :-(.
> 
> :( I knew there was a reason I didn't include it in the documentation
> for the first 9 versions... I'll fix this up, thanks for spotting it.
> 
> >> +
> >> +Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
> >> +``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The 
> >> ``addr``
> >> +field must point to a buffer which the tags will be copied to or from.
> >> +
> >> +``flags`` specifies the direction of copy, either 
> >> ``KVM_ARM_TAGS_TO_GUEST`` or
> >> +``KVM_ARM_TAGS_FROM_GUEST``.
> >> +
> >> +The size of the buffer to store the tags is ``(length / 
> >> MTE_GRANULE_SIZE)``
> > 
> > Should we add a UAPI definition for MTE_GRANULE_SIZE?
> 
> I wasn't sure whether to export this or not. The ioctl is based around
> the existing ptrace interface (PTRACE_{PEEK,POKE}MTETAGS) which doesn't
> expose a UAPI definition. Admittedly the documentation there also just
> says "16-byte granule" rather than MTE_GRANULE_SIZE.
> 
> So I'll just remove the reference to MTE_GRANULE_SIZE in the
> documentation unless you feel that we should have a UAPI definition.

Dropping the mention of this symbol and replacing it by the value 16
matches the architecture and doesn't require any extra UAPI
definition, so let's just do that.

> 
> >> +bytes (i.e. 1/16th of the corresponding size). Each byte contains a 
> >> single tag
> >> +value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
> >> +``PTRACE_POKEMTETAGS``.
> >> +
> >>  5. The kvm_run structure
> >>  
> >>  
> >> @@ -6362,6 +6396,25 @@ default.
> >>  
> >>  See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
> >>  
> >> +7.26 KVM_CAP_ARM_MTE
> >> +
> >> +
> >> +:Architectures: arm64
> >> +:Parameters: none
> >> +
> >> +This capability indicates that KVM (and the hardware) supports exposing 
> >> the
> >> +Memory Tagging Extensions (MTE) to the guest. It must also be enabled by 
> >> the
> >> +VMM before the guest will be granted access.
> >> +
> >> +When enabled the guest is able to access tags associated with any memory 
> >> given
> >> +to the guest. KVM will ensure that the pages are flagged 
> >> ``PG_mte_tagged`` so
> >> +that the tags are maintained during swap or hibernation of the host; 
> >> however
> >> +the VMM needs to manually save/restore the tags as appropriate if the VM 
> >> is
> >> +migrated.
> >> +
> >> +When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl 
> >> to
> >> +perform a bulk copy of tags to/from the guest.
> >> +
> > 
> > Missing limitation to AArch64 guests.
> 
> As mentioned previously it's not technically limited to AArch64, but
> I'll expand this to make it clear that MTE isn't usable from a AArch32 VCPU.

I believe the architecture is quite clear that it *is* limited to
AArch64. The clarification is welcome though.

M.

-- 
Without deviation from the norm, progress is not possible.



  1   2   3   >