Hi Eric,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on v5.10-rc4]
[also build test ERROR on next-20201116]
[cannot apply to vfio/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--bas
Small refactor so that nVHE's handle_trap uses a switch on the Exception
Class value of ESR_EL2 in preparation for adding a handler of SMC32/64.
Signed-off-by: David Brazdil
---
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 15 ---
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a
Add a handler of PSCI SMCs in nVHE hyp code. The handler is initialized
with the version used by the host's PSCI driver and the function IDs it
was configured with. If the SMC function ID matches one of the
configured PSCI calls (for v0.1) or falls into the PSCI function ID
range (for v0.2+), the S
Add a handler of the CPU_ON PSCI call from host. When invoked, it looks
up the logical CPU ID corresponding to the provided MPIDR and populates
the state struct of the target CPU with the provided x0, pc. It then
calls CPU_ON itself, with an entry point in hyp that initializes EL2
state before retu
KVM by default keeps the stub vector installed and installs the nVHE
vector only briefly for init and later on demand. Change this policy
to install the vector at init and then never uninstall it if the kernel
was given the protected KVM command line parameter.
Signed-off-by: David Brazdil
---
a
When compiling with __KVM_NVHE_HYPERVISOR__ redefine per_cpu_offset() to
__hyp_per_cpu_offset() which looks up the base of the nVHE per-CPU
region of the given cpu and computes its offset from the
.hyp.data..percpu section.
This enables use of per_cpu_ptr() helpers in nVHE hyp code. Until now
only
Add rules for renaming the .data..ro_after_init ELF section in KVM nVHE
object files to .hyp.data..ro_after_init, linking it into the kernel
and mapping it in hyp at runtime.
The section is RW to the host, then mapped RO in hyp. The expectation is
that the host populates the variables in the secti
Add a handler of CPU_SUSPEND host PSCI SMCs. The SMC can either enter
a sleep state indistinguishable from a WFI or a deeper sleep state that
behaves like a CPU_OFF+CPU_ON.
The handler saves r0,pc of the host and makes the same call to EL3 with
the hyp CPU entry point. It either returns back to th
Add handler of host SMCs in KVM nVHE trap handler. Forward all SMCs to
EL3 and propagate the result back to EL1. This is done in preparation
for validating host SMCs in KVM nVHE protected mode.
The implementation assumes that firmware uses SMCCC v1.2 or older. That
means x0-x17 can be used both fo
With protected nVHE hyp code interception host's PSCI CPU_ON/SUSPEND
SMCs, the host starts seeing new CPUs boot in EL1 instead of EL2. The
kernel logic that keeps track of the boot mode needs to be adjusted.
Add a static key enabled if KVM protected nVHE initialization is
successful.
When the key
While protected nVHE KVM is installed, start trapping all host SMCs.
By default, these are simply forwarded to EL3, but PSCI SMCs are
validated first.
Create new constant HCR_HOST_NVHE_PROTECTED_FLAGS with the new set of HCR
flags to use while the nVHE vector is installed when the kernel was
boote
When nVHE hyp starts interception host's PSCI CPU_ON SMCs, it will need
to install KVM on the newly booted CPU before returning to the host. Add
an entry point which expects the same kvm_nvhe_init_params struct as the
__kvm_hyp_init HVC in the CPU_ON context argument (x0).
The entry point initiali
Add an early parameter that allows users to opt into protected KVM mode
when using the nVHE hypervisor. In this mode, guest state will be kept
private from the host. This will primarily involve enabling stage-2
address translation for the host, restricting DMA to host memory, and
filtering host SMC
Once we start initializing KVM on newly booted cores before the rest of
the kernel, parameters to __do_hyp_init will need to be provided by EL2
rather than EL1. At that point it will not be possible to pass its four
arguments directly because PSCI_CPU_ON only supports one context
argument.
Refacto
MAIR_EL2 is currently initialized to the value of MAIR_EL1, which itself
is set to a constant MAIR_ELx_SET.
Initialize MAIR_EL2 to MAIR_ELx_SET directly in preparation of allowing
KVM to start CPU cores itself and not initializing itself before ERETing
to EL1. In that case, MAIR_EL2 will be initia
Add a host-initialized constant to KVM nVHE hyp code for converting
between EL2 linear map virtual addresses and physical addresses.
Also add `__hyp_pa` macro that performs the conversion.
Signed-off-by: David Brazdil
---
arch/arm64/kvm/hyp/nvhe/psci-relay.c | 3 +++
arch/arm64/kvm/va_layout.c
When the a CPU is booted in EL2, the kernel checks for VHE support and
initializes the CPU core accordingly. For nVHE it also installs the stub
vectors and drops down to EL1.
Once KVM gains the ability to boot cores without going through the
kernel entry point, it will need to initialize the CPU t
In preparation for adding a CPU entry point in nVHE hyp code, extract
most of __do_hyp_init hypervisor initialization code into a common
helper function. This will be invoked by the entry point to install KVM
on the newly booted CPU.
Signed-off-by: David Brazdil
---
arch/arm64/kvm/hyp/nvhe/hyp-i
All nVHE hyp code is currently executed as handlers of host's HVCs. This
will change as nVHE starts intercepting host's PSCI CPU_ON SMCs. The
newly booted CPU will need to initialize EL2 state and then enter the
host. Add __host_enter function that branches into the existing
host state-restoring co
When KVM starts validating host's PSCI requests, it will need to map
MPIDR back to the CPU ID. To this end, copy cpu_logical_map into nVHE
hyp memory when KVM is initialized.
Only copy the information for CPUs that are online at the point of KVM
initialization so that KVM rejects CPUs whose featur
Forward the following PSCI SMCs issued by host to EL3 as they do not
require the hypervisor's intervention. This assumes that EL3 correctly
implements the PSCI specification.
Only function IDs implemented in Linux are included.
Where both 32-bit and 64-bit variants exist, it is assumed that the h
CPU index should never be negative. Change the signature of
(set_)cpu_logical_map to take an unsigned int.
Signed-off-by: David Brazdil
---
arch/arm64/include/asm/smp.h | 4 ++--
arch/arm64/kernel/setup.c| 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/includ
As we progress towards being able to keep guest state private to the
host running nVHE hypervisor, this series allows the hypervisor to
install itself on newly booted CPUs before the host is allowed to run
on them.
All functionality described below is opt-in, guarded by an early param
'kvm-arm.pro
KVM currently initializes MAIR_EL2 to the value of MAIR_EL1. In
preparation for initializing MAIR_EL2 before MAIR_EL1, move the constant
into a shared header file. Since it is used for EL1 and EL2, rename to
MAIR_ELx_SET.
Signed-off-by: David Brazdil
---
arch/arm64/include/asm/memory.h | 29
KVM's host PSCI SMC filter needs to be aware of the PSCI version of the
system but currently it is impossible to distinguish between v0.1 and
PSCI disabled because both have get_version == NULL.
Populate get_version for v0.1 with a function that returns a constant.
psci_opt.get_version is current
Function IDs used by PSCI are configurable for v0.1 via DT/APCI. If the
host is using PSCI v0.1, KVM's host PSCI proxy needs to use the same IDs.
Expose the array holding the information with a read-only accessor.
Signed-off-by: David Brazdil
---
drivers/firmware/psci/psci.c | 14 ++
Hi Marc,
> > +* Check for VHE being present. x2 being non-zero indicates that we
> > +* do have VHE, and that the kernel is intended to run at EL2.
> > */
> > mrs x2, id_aa64mmfr1_el1
> > ubfxx2, x2, #ID_AA64MMFR1_VHE_SHIFT, #4
> > -#else
> > - mov x2, xzr
> > -#
Hi Eric,
I love your patch! Perhaps something to improve:
[auto build test WARNING on v5.10-rc4]
[also build test WARNING on next-20201116]
[cannot apply to vfio/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--bas
On 16/11/20 14:22, Zenghui Yu wrote:
It makes no sense to track dirty pages for those un-migratable memory
regions (e.g., Memory BAR region of the VFIO PCI device) and doing so
will potentially lead to some unpleasant issues during migration [1].
Skip dirty tracking for those regions by evaluati
On Mon, 16 Nov 2020 21:22:10 +0800
Zenghui Yu wrote:
> It makes no sense to track dirty pages for those un-migratable memory
> regions (e.g., Memory BAR region of the VFIO PCI device) and doing so
> will potentially lead to some unpleasant issues during migration [1].
>
> Skip dirty tracking for
Hi Eric,
I love your patch! Perhaps something to improve:
[auto build test WARNING on iommu/next]
[also build test WARNING on linus/master v5.10-rc4 next-20201116]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--bas
> > #ifdef CONFIG_CPU_PM
> >DEFINE(CPU_CTX_SP, offsetof(struct cpu_suspend_ctx, sp));
> > diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> > b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> > index 1697d25756e9..f999a35b2c8c 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> > +++ b/ar
Hi Eric,
I love your patch! Perhaps something to improve:
[auto build test WARNING on iommu/next]
[also build test WARNING on linus/master v5.10-rc4 next-20201116]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--bas
Hi Marc,
On 2020/11/16 22:10, Marc Zyngier wrote:
My take is that only if the "[Re]Distributor base address" is specified
in the system memory map, will the user-provided kvm_device_attr.offset
make sense. And we can then handle the access to the register which is
defined by "base address + offs
On 2020-11-16 13:09, Zenghui Yu wrote:
Hi Marc,
On 2020/11/16 1:04, Marc Zyngier wrote:
Hi Zenghui,
On 2020-11-13 14:28, Zenghui Yu wrote:
It's expected that users will access registers in the redistributor
*if*
the RD has been initialized properly. Unfortunately userspace can be
bogus
enoug
Hi Eric,
I love your patch! Perhaps something to improve:
[auto build test WARNING on v5.10-rc4]
[also build test WARNING on next-20201116]
[cannot apply to vfio/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--bas
It makes no sense to track dirty pages for those un-migratable memory
regions (e.g., Memory BAR region of the VFIO PCI device) and doing so
will potentially lead to some unpleasant issues during migration [1].
Skip dirty tracking for those regions by evaluating if the region is
migratable before s
Hi Marc,
On 2020/11/16 1:04, Marc Zyngier wrote:
Hi Zenghui,
On 2020-11-13 14:28, Zenghui Yu wrote:
It's expected that users will access registers in the redistributor *if*
the RD has been initialized properly. Unfortunately userspace can be
bogus
enough to access registers before setting th
On 2020-11-16 11:49, David Brazdil wrote:
> #ifdef CONFIG_CPU_PM
>DEFINE(CPU_CTX_SP, offsetof(struct cpu_suspend_ctx, sp));
> diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> index 1697d25756e9..f999a35b2c8c 100644
> --- a/arch/arm64/kvm/h
In preparation for vSVA, let's register a DMA fault response region,
where the userspace will push the page responses and increment the
head of the buffer. The kernel will pop those responses and inject them
on iommu side.
Signed-off-by: Eric Auger
---
drivers/vfio/pci/vfio_pci.c | 114 +
When the userspace increments the head of the page response
buffer ring, let's push the response into the iommu layer.
This is done through a workqueue that pops the responses from
the ring buffer and increment the tail.
Signed-off-by: Eric Auger
---
drivers/vfio/pci/vfio_pci.c | 40
From: Tina Zhang
Caps the number of irqs with fixed indexes and uses capability chains
to chain device specific irqs.
Signed-off-by: Tina Zhang
Signed-off-by: Eric Auger
[Eric: Put cap_offset at the end of the vfio_irq_info struct,
remove GFX IRQ at the moment and remove any reference to this
Register an IOMMU fault handler which records faults in
the DMA FAULT region ring buffer. In a subsequent patch, we
will add the signaling of a specific eventfd to allow the
userspace to be notified whenever a new fault as shown up.
Signed-off-by: Eric Auger
---
v11 -> v12:
- take the fault_queu
Implement IRQ capability chain infrastructure. All interrupt
indexes beyond VFIO_PCI_NUM_IRQS are handled as extended
interrupts. They are registered with a specific type/subtype
and supported flags.
Signed-off-by: Eric Auger
---
drivers/vfio/pci/vfio_pci.c | 99 +++--
The DMA FAULT region contains the fault ring buffer.
There is benefit to let the userspace mmap this area.
Expose this mmappable area through a sparse mmap entry
and implement the mmap operation.
Signed-off-by: Eric Auger
---
v8 -> v9:
- remove unused index local variable in vfio_pci_fault_mmap
Add a new specific DMA_FAULT region aiming to exposed nested mode
translation faults. This region only is exposed if the device
is attached to a nested domain.
The region has a ring buffer that contains the actual fault
records plus a header allowing to handle it (tail/head indices,
max capacity,
Register the VFIO_IRQ_TYPE_NESTED/VFIO_IRQ_SUBTYPE_DMA_FAULT
IRQ that allows to signal a nested mode DMA fault.
Signed-off-by: Eric Auger
---
v10 -> v11:
- the irq now is registered in vfio_pci_dma_fault_init()
in case the domain is nested
---
drivers/vfio/pci/vfio_pci.c | 21 +++
Add a new IRQ type/subtype to get notification on nested
stage DMA faults.
Signed-off-by: Eric Auger
---
include/uapi/linux/vfio.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 0e2bfbeccd08..1e5c82f9d14d 100644
--- a/include/ua
The VFIO API was enhanced to support nested stage control: a bunch of
new iotcls, one DMA FAULT region and an associated specific IRQ.
Let's document the process to follow to set up nested mode.
Signed-off-by: Eric Auger
---
v11 -> v12:
s/VFIO_REGION_INFO_CAP_PRODUCER_FAULT/VFIO_REGION_INFO_CA
This series brings the VFIO part of HW nested paging support
in the SMMUv3.
This is a rebase on top of v5.10-rc4
The series depends on:
[PATCH v12 00/15] SMMUv3 Nested Stage Setup (IOMMU part)
3 new IOCTLs are introduced that allow the userspace to
1) pass the guest stage 1 configuration
2) pass
This patch adds the VFIO_IOMMU_SET_MSI_BINDING ioctl which aim
to (un)register the guest MSI binding to the host. This latter
then can use those stage 1 bindings to build a nested stage
binding targeting the physical MSIs.
Signed-off-by: Eric Auger
---
v10 -> v11:
- renamed ustruct into msi_bin
From: "Liu, Yi L"
This patch adds an VFIO_IOMMU_SET_PASID_TABLE ioctl
which aims to pass the virtual iommu guest configuration
to the host. This latter takes the form of the so-called
PASID table.
Signed-off-by: Jacob Pan
Signed-off-by: Liu, Yi L
Signed-off-by: Eric Auger
---
v11 -> v12:
- u
From: "Liu, Yi L"
When the guest "owns" the stage 1 translation structures, the host
IOMMU driver has no knowledge of caching structure updates unless
the guest invalidation requests are trapped and passed down to the
host.
This patch adds the VFIO_IOMMU_CACHE_INVALIDATE ioctl with aims
at prop
In order to cascade guest CFGI_CD, let's add PASID cache invalidation
per PASID.
Signed-off-by: Eric Auger
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 +---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
b/drivers/
Up to now, when the type was UNMANAGED, we used to
allocate IOVA pages within a reserved IOVA MSI range.
If both the host and the guest are exposed with SMMUs, each
would allocate an IOVA. The guest allocates an IOVA (gIOVA)
to map onto the guest MSI doorbell (gDB). The Host allocates
another IOVA
Nested mode currently is not compatible with HW MSI reserved regions.
Indeed MSI transactions targeting this MSI doorbells bypass the SMMU.
Let's check nested mode is not attempted in such configuration.
Signed-off-by: Eric Auger
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 23
Implement domain-selective and page-selective IOTLB invalidations.
Signed-off-by: Eric Auger
---
v7 -> v8:
- ASID based invalidation using iommu_inv_pasid_info
- check ARCHID/PASID flags in addr based invalidation
- use __arm_smmu_tlb_inv_context and __arm_smmu_tlb_inv_range_nosync
v6 -> v7
- c
In preparation for vSVA, let's accept userspace provided configs
with more than one CD. We check the max CD against the host iommu
capability and also the format (linear versus 2 level).
Signed-off-by: Eric Auger
Signed-off-by: Shameer Kolothum
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |
When a stage 1 related fault event is read from the event queue,
let's propagate it to potential external fault listeners, ie. users
who registered a fault handler.
Signed-off-by: Eric Auger
---
v8 -> v9:
- adapt to the removal of IOMMU_FAULT_UNRECOV_PERM_VALID:
only look at IOMMU_FAULT_UNRECO
On attach_pasid_table() we program STE S1 related info set
by the guest into the actual physical STEs. At minimum
we need to program the context descriptor GPA and compute
whether the stage1 is translated/bypassed or aborted.
Signed-off-by: Eric Auger
---
v7 -> v8:
- remove smmu->features check,
With nested stage support, soon we will need to invalidate
S1 contexts and ranges tagged with an unmanaged asid, this
latter being managed by the guest. So let's introduce 2 helpers
that allow to invalidate with externally managed ASIDs
Signed-off-by: Eric Auger
---
drivers/iommu/arm/arm-smmu-v3
In nested mode we enforce the rule that all devices belonging
to the same iommu_domain share the same msi_domain.
Indeed if there were several physical MSI doorbells being used
within a single iommu_domain, it becomes really difficult to
resolve the nested stage mapping translating into the correc
The bind/unbind_guest_msi() callbacks check the domain
is NESTED and redirect to the dma-iommu implementation.
Signed-off-by: Eric Auger
---
v6 -> v7:
- remove device handle argument
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 43 +
1 file changed, 43 insertions(+)
d
This series brings the IOMMU part of HW nested paging support
in the SMMUv3. The VFIO part is submitted separately.
The IOMMU API is extended to support 2 new API functionalities:
1) pass the guest stage 1 configuration
2) pass stage 1 MSI bindings
Then those capabilities gets implemented in the
In virtualization use case, when a guest is assigned
a PCI host device, protected by a virtual IOMMU on the guest,
the physical IOMMU must be programmed to be consistent with
the guest mappings. If the physical IOMMU supports two
translation stages it makes sense to program guest mappings
onto the
When handling faults from the event or PRI queue, we need to find the
struct device associated to a SID. Add a rb_tree to keep track of SIDs.
Signed-off-by: Eric Auger
Signed-off-by: Jean-Philippe Brucker
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 99 +
drivers/iommu/
On ARM, MSI are translated by the SMMU. An IOVA is allocated
for each MSI doorbell. If both the host and the guest are exposed
with SMMUs, we end up with 2 different IOVAs allocated by each.
guest allocates an IOVA (gIOVA) to map onto the guest MSI
doorbell (gDB). The Host allocates another IOVA (h
When nested stage translation is setup, both s1_cfg and
s2_cfg are allocated.
We introduce a new smmu domain abort field that will be set
upon guest stage1 configuration passing.
arm_smmu_write_strtab_ent() is modified to write both stage
fields in the STE and deal with the abort field.
In neste
In preparation for the introduction of nested stages
let's turn s1_cfg and s2_cfg fields into pointers which are
dynamically allocated depending on the smmu_domain stage.
In nested mode, both stages will coexist and s1_cfg will
be allocated when the guest configuration gets passed.
Signed-off-by:
Hi Marc,
Friendly ping, it is some time since I sent this patch according to your last
advice...
Besides, recently we found that the mmio delay on GICv4.1 system is about 10
times higher
than that on GICv4.0 system in kvm-unit-tests (the specific data is as
follows). By the
way, HiSilicon GICv
On 2020/11/15 23:03, Paolo Bonzini wrote:
On 15/11/20 15:31, Zenghui Yu wrote:
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 71951fe4dc..0958db1a08 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1806,7 +1806,10 @@ bool memory_region_is_ram_device(MemoryRegion *mr)
uint8_t mem
71 matches
Mail list logo