On Tue, Nov 11, 2025 at 1:53 AM Oliver Upton <[email protected]> wrote:
>
> Hi Jiaqi,
>
> On Mon, Nov 03, 2025 at 12:45:50PM -0800, Jiaqi Yan wrote:
> > On Mon, Nov 3, 2025 at 10:17 AM Jose Marinho <[email protected]> wrote:
> > >
> > > Thank you for these patches.
> >
> > Thanks for your comments, Jose!
> >
> > >
> > > On 10/13/2025 7:59 PM, Jiaqi Yan wrote:
> > > > When APEI fails to handle a stage-2 synchronous external abort (SEA),
> > > > today KVM injects an asynchronous SError to the VCPU then resumes it,
> > > > which usually results in unpleasant guest kernel panic.
> > > >
> > > > One major situation of guest SEA is when vCPU consumes recoverable
> > > > uncorrected memory error (UER). Although SError and guest kernel panic
> > > > effectively stops the propagation of corrupted memory, guest may
> > > > re-use the corrupted memory if auto-rebooted; in worse case, guest
> > > > boot may run into poisoned memory. So there is room to recover from
> > > > an UER in a more graceful manner.
> > > >
> > > > Alternatively KVM can redirect the synchronous SEA event to VMM to
> > > > - Reduce blast radius if possible. VMM can inject a SEA to VCPU via
> > > >   KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison
> > > >   consumption or fault is not from guest kernel, blast radius can be
> > > >   limited to the triggering thread in guest userspace, so VM can
> > > >   keep running.
> > > > - Allow VMM to protect from future memory poison consumption by
> > > >   unmapping the page from stage-2, or to interrupt guest of the
> > > >   poisoned page so guest kernel can unmap it from stage-1 page table.
> > > > - Allow VMM to track SEA events that VM customers care about, to restart
> > > >   VM when certain number of distinct poison events have happened,
> > > >   to provide observability to customers in log management UI.
> > > >
> > > > Introduce an userspace-visible feature to enable VMM handle SEA:
> > > > - KVM_CAP_ARM_SEA_TO_USER. As the alternative fallback behavior
> > > >   when host APEI fails to claim a SEA, userspace can opt in this new
> > > >   capability to let KVM exit to userspace during SEA if it is not
> > > >   owned by host.
> > > > - KVM_EXIT_ARM_SEA. A new exit reason is introduced for this.
> > > >   KVM fills kvm_run.arm_sea with as much as possible information about
> > > >   the SEA, enabling VMM to emulate SEA to guest by itself.
> > > >   - Sanitized ESR_EL2. The general rule is to keep only the bits
> > > >     useful for userspace and relevant to guest memory.
> > > >   - Flags indicating if faulting guest physical address is valid.
> > > >   - Faulting guest physical and virtual addresses if valid.
> > > >
> > > > Signed-off-by: Jiaqi Yan <[email protected]>
> > > > Co-developed-by: Oliver Upton <[email protected]>
> > > > Signed-off-by: Oliver Upton <[email protected]>
> > > > ---
> > > >  arch/arm64/include/asm/kvm_host.h |  2 +
> > > >  arch/arm64/kvm/arm.c              |  5 +++
> > > >  arch/arm64/kvm/mmu.c              | 68 ++++++++++++++++++++++++++++++-
> > > >  include/uapi/linux/kvm.h          | 10 +++++
> > > >  4 files changed, 84 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/arm64/include/asm/kvm_host.h
> > > > b/arch/arm64/include/asm/kvm_host.h
> > > > index b763293281c88..e2c65b14e60c4 100644
> > > > --- a/arch/arm64/include/asm/kvm_host.h
> > > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > > @@ -350,6 +350,8 @@ struct kvm_arch {
> > > >  #define KVM_ARCH_FLAG_GUEST_HAS_SVE 9
> > > >  	/* MIDR_EL1, REVIDR_EL1, and AIDR_EL1 are writable from userspace
> > > > */
> > > >  #define KVM_ARCH_FLAG_WRITABLE_IMP_ID_REGS 10
> > > > +	/* Unhandled SEAs are taken to userspace */
> > > > +#define KVM_ARCH_FLAG_EXIT_SEA 11
> > > >  	unsigned long flags;
> > > >
> > > >  	/* VM-wide vCPU feature set */
> > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > > index f21d1b7f20f8e..888600df79c40 100644
> > > > --- a/arch/arm64/kvm/arm.c
> > > > +++ b/arch/arm64/kvm/arm.c
> > > > @@ -132,6 +132,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > > >  		}
> > > >  		mutex_unlock(&kvm->lock);
> > > >  		break;
> > > > +	case KVM_CAP_ARM_SEA_TO_USER:
> > > > +		r = 0;
> > > > +		set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
> > > > +		break;
> > > >  	default:
> > > >  		break;
> > > >  	}
> > > > @@ -327,6 +331,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm,
> > > > long ext)
> > > >  	case KVM_CAP_IRQFD_RESAMPLE:
> > > >  	case KVM_CAP_COUNTER_OFFSET:
> > > >  	case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS:
> > > > +	case KVM_CAP_ARM_SEA_TO_USER:
> > > >  		r = 1;
> > > >  		break;
> > > >  	case KVM_CAP_SET_GUEST_DEBUG2:
> > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > index 7cc964af8d305..09210b6ab3907 100644
> > > > --- a/arch/arm64/kvm/mmu.c
> > > > +++ b/arch/arm64/kvm/mmu.c
> > > > @@ -1899,8 +1899,48 @@ static void handle_access_fault(struct kvm_vcpu
> > > > *vcpu, phys_addr_t fault_ipa)
> > > >  	read_unlock(&vcpu->kvm->mmu_lock);
> > > >  }
> > > >
> > > > +/*
> > > > + * Returns true if the SEA should be handled locally within KVM if the
> > > > abort
> > > > + * is caused by a kernel memory allocation (e.g. stage-2 table memory).
> > > > + */
> > > > +static bool host_owns_sea(struct kvm_vcpu *vcpu, u64 esr)
> > > > +{
> > > > +	/*
> > > > +	 * Without FEAT_RAS HCR_EL2.TEA is RES0, meaning any external
> > > > abort
> > > > +	 * taken from a guest EL to EL2 is due to a host-imposed access
> > > > (e.g.
> > > > +	 * stage-2 PTW).
> > > > +	 */
> > > > +	if (!cpus_have_final_cap(ARM64_HAS_RAS_EXTN))
> > > > +		return true;
> > > > +
> > > > +	/* KVM owns the VNCR when the vCPU isn't in a nested context. */
> > > > +	if (is_hyp_ctxt(vcpu) && (esr & ESR_ELx_VNCR))
> > > Is this check valid only for a "Data Abort"?
> >
> > Yes, the VNCR bit is specific to a Data Abort (provided we can only
> > reach host_owns_sea if kvm_vcpu_abt_issea).
> > I don't think we need to explicitly exclude the check here for
> > Instruction Abort.
>
> You can take an external abort on an instruction fetch, in which case
> bit 13 of the ISS (VNCR bit for data abort) is RES0. So this does need
> to check for a data abort.

Agreed and thanks for correcting me, Oliver! I will fix this in v5.

>
> Thanks,
> Oliver
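
For illustration only, here is a minimal userspace sketch of the point
above: test ESR bit 13 as VNCR only after confirming that the exception
class is a Data Abort from a lower EL. This is not the v5 change. The
helper names esr_ec() and sea_is_vncr_abort() are hypothetical, and the
constants are redefined locally on the assumption that they match
arch/arm64/include/asm/esr.h.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Local copies of the ESR_ELx field definitions so that this sketch
 * builds outside the kernel. Assumed to match asm/esr.h.
 */
#define ESR_ELx_EC_SHIFT	26
#define ESR_ELx_EC_MASK		(UINT64_C(0x3F) << ESR_ELx_EC_SHIFT)
#define ESR_ELx_EC_IABT_LOW	UINT64_C(0x20)	/* Instruction Abort, lower EL */
#define ESR_ELx_EC_DABT_LOW	UINT64_C(0x24)	/* Data Abort, lower EL */
#define ESR_ELx_VNCR		(UINT64_C(1) << 13)

/* Hypothetical helper: extract the exception class from an ESR value. */
static inline uint64_t esr_ec(uint64_t esr)
{
	return (esr & ESR_ELx_EC_MASK) >> ESR_ELx_EC_SHIFT;
}

/*
 * Hypothetical helper: report a VNCR-related abort only for Data Aborts.
 * For an Instruction Abort, ISS bit 13 is RES0, so it is never
 * interpreted as VNCR here.
 */
bool sea_is_vncr_abort(uint64_t esr)
{
	if (esr_ec(esr) != ESR_ELx_EC_DABT_LOW)
		return false;

	return (esr & ESR_ELx_VNCR) != 0;
}
```

With this shape, an ESR whose EC is an Instruction Abort is rejected
before bit 13 is looked at, whatever value that bit happens to hold.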
