This was identified as a potential fix for an issue we analyzed in our
Enterprise support, where guests would hang before the boot-loader
after being rebooted from within the guest (after applying updates for
RHEL 8).

https://lore.kernel.org/lkml/20230608090348.414990-1-gs...@redhat.com/

Suggested-by: Stefan Hanreich <s.hanre...@proxmox.com>
Signed-off-by: Stoiko Ivanov <s.iva...@proxmox.com>
---
 ...l-stage2-mapping-on-invalid-memory-s.patch | 122 ++++++++++++++++++
 1 file changed, 122 insertions(+)
 create mode 100644 
patches/kernel/0025-KVM-Avoid-illegal-stage2-mapping-on-invalid-memory-s.patch

diff --git 
a/patches/kernel/0025-KVM-Avoid-illegal-stage2-mapping-on-invalid-memory-s.patch
 
b/patches/kernel/0025-KVM-Avoid-illegal-stage2-mapping-on-invalid-memory-s.patch
new file mode 100644
index 000000000000..d50aab8e4d7c
--- /dev/null
+++ 
b/patches/kernel/0025-KVM-Avoid-illegal-stage2-mapping-on-invalid-memory-s.patch
@@ -0,0 +1,122 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Gavin Shan <gs...@redhat.com>
+Date: Thu, 15 Jun 2023 15:42:59 +1000
+Subject: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot
+
+commit 2230f9e1171a2e9731422a14d1bbc313c0b719d1 upstream.
+
+We run into guest hang in edk2 firmware when KSM is kept as running on
+the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash
+device (TYPE_PFLASH_CFI01) during the operation of sector erasing or
+buffered write. The status is returned by reading the memory region of
+the pflash device and the read request should have been forwarded to QEMU
+and emulated by it. Unfortunately, the read request is covered by an
+illegal stage2 mapping when the guest hang issue occurs. The read request
+is completed with QEMU bypassed and wrong status is fetched. The edk2
+firmware runs into an infinite loop with the wrong status.
+
+The illegal stage2 mapping is populated due to same page sharing by KSM
+at (C) even the associated memory slot has been marked as invalid at (B)
+when the memory slot is requested to be deleted. It's notable that the
+active and inactive memory slots can't be swapped when we're in the middle
+of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count
+is elevated, and kvm_swap_active_memslots() will busy loop until it reaches
+to zero again. Besides, the swapping from the active to the inactive memory
+slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(),
+corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots().
+
+  CPU-A                    CPU-B
+  -----                    -----
+                           ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
+                           kvm_vm_ioctl_set_memory_region
+                           kvm_set_memory_region
+                           __kvm_set_memory_region
+                           kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
+                             kvm_invalidate_memslot
+                               kvm_copy_memslot
+                               kvm_replace_memslot
+                               kvm_swap_active_memslots        (A)
+                               kvm_arch_flush_shadow_memslot   (B)
+  same page sharing by KSM
+  kvm_mmu_notifier_invalidate_range_start
+        :
+  kvm_mmu_notifier_change_pte
+    kvm_handle_hva_range
+    __kvm_handle_hva_range
+    kvm_set_spte_gfn            (C)
+        :
+  kvm_mmu_notifier_invalidate_range_end
+
+Fix the issue by skipping the invalid memory slot at (C) to avoid the
+illegal stage2 mapping so that the read request for the pflash's status
+is forwarded to QEMU and emulated by it. In this way, the correct pflash's
+status can be returned from QEMU to break the infinite loop in the edk2
+firmware.
+
+We tried a git-bisect and the first problematic commit is cd4c71835228 ("
+KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With this,
+clean_dcache_guest_page() is called after the memory slots are iterated
+in kvm_mmu_notifier_change_pte(). clean_dcache_guest_page() is called
+before the iteration on the memory slots before this commit. This change
+literally enlarges the racy window between kvm_mmu_notifier_change_pte()
+and memory slot removal so that we're able to reproduce the issue in a
+practical test case. However, the issue exists since commit d5d8184d35c9
+("KVM: ARM: Memory virtualization setup").
+
+Cc: sta...@vger.kernel.org # v3.9+
+Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
+Reported-by: Shuai Hu <hsh...@redhat.com>
+Reported-by: Zhenyu Zhang <zheny...@redhat.com>
+Signed-off-by: Gavin Shan <gs...@redhat.com>
+Reviewed-by: David Hildenbrand <da...@redhat.com>
+Reviewed-by: Oliver Upton <oliver.up...@linux.dev>
+Reviewed-by: Peter Xu <pet...@redhat.com>
+Reviewed-by: Sean Christopherson <sea...@google.com>
+Reviewed-by: Shaoqin Huang <shahu...@redhat.com>
+Message-Id: <20230615054259.14911-1-gs...@redhat.com>
+Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
+Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
+(cherry picked from commit 953dd7e2df8181d5ce4117fca347992d616f0621)
+Signed-off-by: Stoiko Ivanov <s.iva...@proxmox.com>
+---
+ virt/kvm/kvm_main.c | 20 +++++++++++++++++++-
+ 1 file changed, 19 insertions(+), 1 deletion(-)
+
+diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
+index db159be9d5b8..6deb43c2d091 100644
+--- a/virt/kvm/kvm_main.c
++++ b/virt/kvm/kvm_main.c
+@@ -636,6 +636,24 @@ static __always_inline int 
kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
+ 
+       return __kvm_handle_hva_range(kvm, &range);
+ }
++
++static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
++{
++      /*
++       * Skipping invalid memslots is correct if and only change_pte() is
++       * surrounded by invalidate_range_{start,end}(), which is currently
++       * guaranteed by the primary MMU.  If that ever changes, KVM needs to
++       * unmap the memslot instead of skipping the memslot to ensure that KVM
++       * doesn't hold references to the old PFN.
++       */
++      WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
++
++      if (range->slot->flags & KVM_MEMSLOT_INVALID)
++              return false;
++
++      return kvm_set_spte_gfn(kvm, range);
++}
++
+ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
+                                       struct mm_struct *mm,
+                                       unsigned long address,
+@@ -656,7 +674,7 @@ static void kvm_mmu_notifier_change_pte(struct 
mmu_notifier *mn,
+       if (!READ_ONCE(kvm->mmu_notifier_count))
+               return;
+ 
+-      kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
++      kvm_handle_hva_range(mn, address, address + 1, pte, 
kvm_change_spte_gfn);
+ }
+ 
+ void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
-- 
2.39.2



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

Reply via email to