This was identified as a potential fix for an issue we analyzed in our Enterprise support, where guests would hang before the boot-loader after being rebooted from within the guest (after applying updates for RHEL 8).
https://lore.kernel.org/lkml/20230608090348.414990-1-gs...@redhat.com/

Suggested-by: Stefan Hanreich <s.hanre...@proxmox.com>
Signed-off-by: Stoiko Ivanov <s.iva...@proxmox.com>
---
 ...l-stage2-mapping-on-invalid-memory-s.patch | 122 ++++++++++++++++++
 1 file changed, 122 insertions(+)
 create mode 100644 patches/kernel/0025-KVM-Avoid-illegal-stage2-mapping-on-invalid-memory-s.patch

diff --git a/patches/kernel/0025-KVM-Avoid-illegal-stage2-mapping-on-invalid-memory-s.patch b/patches/kernel/0025-KVM-Avoid-illegal-stage2-mapping-on-invalid-memory-s.patch
new file mode 100644
index 000000000000..d50aab8e4d7c
--- /dev/null
+++ b/patches/kernel/0025-KVM-Avoid-illegal-stage2-mapping-on-invalid-memory-s.patch
@@ -0,0 +1,122 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Gavin Shan <gs...@redhat.com>
+Date: Thu, 15 Jun 2023 15:42:59 +1000
+Subject: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot
+
+commit 2230f9e1171a2e9731422a14d1bbc313c0b719d1 upstream.
+
+We run into guest hang in edk2 firmware when KSM is kept as running on
+the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash
+device (TYPE_PFLASH_CFI01) during the operation of sector erasing or
+buffered write. The status is returned by reading the memory region of
+the pflash device and the read request should have been forwarded to QEMU
+and emulated by it. Unfortunately, the read request is covered by an
+illegal stage2 mapping when the guest hang issue occurs. The read request
+is completed with QEMU bypassed and wrong status is fetched. The edk2
+firmware runs into an infinite loop with the wrong status.
+
+The illegal stage2 mapping is populated due to same page sharing by KSM
+at (C) even the associated memory slot has been marked as invalid at (B)
+when the memory slot is requested to be deleted. It's notable that the
+active and inactive memory slots can't be swapped when we're in the middle
+of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count
+is elevated, and kvm_swap_active_memslots() will busy loop until it reaches
+to zero again. Besides, the swapping from the active to the inactive memory
+slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(),
+corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots().
+
+  CPU-A                    CPU-B
+  -----                    -----
+  ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
+  kvm_vm_ioctl_set_memory_region
+  kvm_set_memory_region
+  __kvm_set_memory_region
+  kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
+    kvm_invalidate_memslot
+      kvm_copy_memslot
+      kvm_replace_memslot
+      kvm_swap_active_memslots        (A)
+      kvm_arch_flush_shadow_memslot   (B)
+                           same page sharing by KSM
+                           kvm_mmu_notifier_invalidate_range_start
+                                 :
+                           kvm_mmu_notifier_change_pte
+                             kvm_handle_hva_range
+                             __kvm_handle_hva_range
+                             kvm_set_spte_gfn         (C)
+                                 :
+                           kvm_mmu_notifier_invalidate_range_end
+
+Fix the issue by skipping the invalid memory slot at (C) to avoid the
+illegal stage2 mapping so that the read request for the pflash's status
+is forwarded to QEMU and emulated by it. In this way, the correct pflash's
+status can be returned from QEMU to break the infinite loop in the edk2
+firmware.
+
+We tried a git-bisect and the first problematic commit is cd4c71835228 ("
+KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With this,
+clean_dcache_guest_page() is called after the memory slots are iterated
+in kvm_mmu_notifier_change_pte(). clean_dcache_guest_page() is called
+before the iteration on the memory slots before this commit. This change
+literally enlarges the racy window between kvm_mmu_notifier_change_pte()
+and memory slot removal so that we're able to reproduce the issue in a
+practical test case. However, the issue exists since commit d5d8184d35c9
+("KVM: ARM: Memory virtualization setup").
+
+Cc: sta...@vger.kernel.org # v3.9+
+Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
+Reported-by: Shuai Hu <hsh...@redhat.com>
+Reported-by: Zhenyu Zhang <zheny...@redhat.com>
+Signed-off-by: Gavin Shan <gs...@redhat.com>
+Reviewed-by: David Hildenbrand <da...@redhat.com>
+Reviewed-by: Oliver Upton <oliver.up...@linux.dev>
+Reviewed-by: Peter Xu <pet...@redhat.com>
+Reviewed-by: Sean Christopherson <sea...@google.com>
+Reviewed-by: Shaoqin Huang <shahu...@redhat.com>
+Message-Id: <20230615054259.14911-1-gs...@redhat.com>
+Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
+Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
+(cherry picked from commit 953dd7e2df8181d5ce4117fca347992d616f0621)
+Signed-off-by: Stoiko Ivanov <s.iva...@proxmox.com>
+---
+ virt/kvm/kvm_main.c | 20 +++++++++++++++++++-
+ 1 file changed, 19 insertions(+), 1 deletion(-)
+
+diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
+index db159be9d5b8..6deb43c2d091 100644
+--- a/virt/kvm/kvm_main.c
++++ b/virt/kvm/kvm_main.c
+@@ -636,6 +636,24 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
+ 
+ 	return __kvm_handle_hva_range(kvm, &range);
+ }
++
++static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
++{
++	/*
++	 * Skipping invalid memslots is correct if and only change_pte() is
++	 * surrounded by invalidate_range_{start,end}(), which is currently
++	 * guaranteed by the primary MMU.  If that ever changes, KVM needs to
++	 * unmap the memslot instead of skipping the memslot to ensure that KVM
++	 * doesn't hold references to the old PFN.
++	 */
++	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
++
++	if (range->slot->flags & KVM_MEMSLOT_INVALID)
++		return false;
++
++	return kvm_set_spte_gfn(kvm, range);
++}
++
+ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
+ 					struct mm_struct *mm,
+ 					unsigned long address,
+@@ -656,7 +674,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
+ 	if (!READ_ONCE(kvm->mmu_notifier_count))
+ 		return;
+ 
+-	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
++	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_change_spte_gfn);
+ }
+ 
+ void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
-- 
2.39.2


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
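
For readers who want to see the shape of the fix outside the kernel tree, the following is a minimal user-space sketch of the "skip the invalid slot" wrapper. It is only an illustration under stated assumptions: struct slot, struct range, SLOT_INVALID, set_spte() and change_spte() are hypothetical stand-ins for struct kvm_memory_slot, struct kvm_gfn_range, KVM_MEMSLOT_INVALID, kvm_set_spte_gfn() and kvm_change_spte_gfn(). It models neither the locking nor the mn_active_invalidate_count check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for KVM_MEMSLOT_INVALID; not the kernel's value. */
#define SLOT_INVALID (1u << 0)

/* Hypothetical, simplified memory slot: one flag word and one mapping. */
struct slot {
	unsigned int flags;
	int mapped_pfn;		/* -1 means "no stage2 mapping installed" */
};

/* Hypothetical, simplified range handed to the notifier handler. */
struct range {
	struct slot *slot;
	int new_pfn;
};

/* Models the old handler: installs the mapping unconditionally. */
static bool set_spte(struct range *r)
{
	r->slot->mapped_pfn = r->new_pfn;
	return true;
}

/*
 * Models the new handler: a slot that is already marked invalid (it is
 * being deleted) is skipped, so no mapping is installed behind its back.
 */
static bool change_spte(struct range *r)
{
	if (r->slot->flags & SLOT_INVALID)
		return false;

	return set_spte(r);
}
```

Calling change_spte() on a slot with SLOT_INVALID set leaves mapped_pfn untouched, which in the real scenario is what lets the guest's pflash status read fall through to QEMU. On a valid slot it behaves like set_spte().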