Hi Jonathan,

On 11/6/25 12:37 AM, Jonathan Cameron wrote:
On Wed,  5 Nov 2025 21:44:53 +1000
Gavin Shan <[email protected]> wrote:

In the combination of 64KiB host and 4KiB guest, a problematic host
page affects 16x guest pages that can be owned by different threads.
It means 16x memory errors can be raised at once due to the parallel
accesses to those 16x guest pages on the guest. Unfortunately, QEMU
can't deliver them one by one because we just one GHES error block,

we have just one


Thanks, fixed locally.

corresponding one read acknowledgement register. It can eventually
cause QEMU crash dump due to the contention on that register, meaning
the current memory error can't be delivered before the previous error
isn't acknowledged.

Imporve push_ghes_memory_errors() to push 16x consecutive memory errors
Improve


Thanks, fixed locally.

under this situation to avoid the contention on the read acknowledgement
register.

Signed-off-by: Gavin Shan <[email protected]>
Hi Gavin

Silly question that never occurred to me before:
What happens if we just report a single larger error?

The CPER record has a Physical Address Mask that I think lets us say we
are only reporting at a 64KiB granularity.

In Linux, drivers/edac/ghes_edac.c seems to handle this via e->grain.
https://elixir.bootlin.com/linux/v6.18-rc4/source/drivers/edac/ghes_edac.c#L346

I haven't chased the whole path through to whether this does appropriate poisoning
on the guest though.


We have the following call trace to handle a CPER error record. The e->grain
derived from the Physical Address Mask is used to determine the size for
memory scrubbing at (a). The page isolation at (b) is based on the reported
Physical Address. So a larger Physical Address Mask won't help isolate more
pages, per my understanding.

do_sea
  apei_claim_sea
    ghes_notify_sea
      ghes_in_nmi_spool_from_list
        ghes_in_nmi_queue_one_entry
        irq_work_queue                          // ghes_proc_irq_work
          ghes_proc_in_irq
            ghes_do_proc
              atomic_notifier_call_chain        // (a) ghes_report_chain
                ghes_edac_report_mem_error
                  edac_raw_mc_handle_error
              ghes_handle_memory_failure
                ghes_do_memory_failure
                  memory_failure_cb
                    memory_failure              // (b) Isolate the page
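
To make that concrete, here is a stand-alone user-space sketch (not kernel
code; the helper names and the example address are made up, and a 4KiB kernel
page size is assumed) of roughly how (a) turns the Physical Address Mask into
e->grain, while (b) still isolates only the single page containing the
reported address:

/*
 * Stand-alone sketch (not kernel code): how a CPER Physical Address Mask
 * translates into an EDAC grain versus the single page that gets isolated.
 * Helper names are made up for illustration.
 */
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

#define KERNEL_PAGE_SHIFT 12                   /* assume 4KiB kernel pages */

/* Roughly what ghes_edac.c does: e->grain = ~physical_addr_mask + 1 */
static uint64_t grain_from_mask(uint64_t physical_addr_mask)
{
    return ~physical_addr_mask + 1;
}

/* The memory-failure path reduces the report to a single PFN */
static uint64_t pfn_from_addr(uint64_t physical_addr)
{
    return physical_addr >> KERNEL_PAGE_SHIFT;
}

int main(void)
{
    uint64_t paddr = 0x40013488ULL;            /* example faulting address */
    uint64_t mask_64k = ~((64ULL * 1024) - 1); /* 64KiB-granularity report */

    printf("grain        = 0x%" PRIx64 "\n", grain_from_mask(mask_64k));
    printf("isolated pfn = 0x%" PRIx64 "\n", pfn_from_addr(paddr));
    return 0;
}

With a 64KiB-granularity mask it prints a grain of 0x10000 but still only one
PFN (0x40013), so the wider mask only widens the scrub size at (a), not the
isolation at (b).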

Thanks,
Gavin

---
  target/arm/kvm.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++--
  1 file changed, 50 insertions(+), 2 deletions(-)

diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 5b151eda3c..d7de8262da 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -11,6 +11,7 @@
   */
  #include "qemu/osdep.h"
+#include "qemu/units.h"
  #include <sys/ioctl.h>
  #include <linux/kvm.h>
@@ -2432,12 +2433,59 @@ int kvm_arch_get_registers(CPUState *cs, Error **errp)
  static void push_ghes_memory_errors(CPUState *c, AcpiGhesState *ags,
                                      uint64_t paddr, Error **errp)
  {
+    uint64_t val, start, end, guest_pgsz, host_pgsz;
      uint64_t addresses[16];
+    uint32_t num_of_addresses;
+    int ret;
+
+    /*
+     * Determine the guest page size from TCR_EL1. The guest can change it
+     * at any time, so it has to be read dynamically.
+     */
+    ret = read_sys_reg64(c->kvm_fd, &val, ARM64_SYS_REG(3, 0, 2, 0, 2));
+    if (ret) {
+        error_setg(errp, "Error %" PRId32 " to read TCR_EL1 register", ret);
+        return;
+    }
+
+    switch (extract64(val, 14, 2)) {
+    case 0:
+        guest_pgsz = 4 * KiB;
+        break;
+    case 1:
+        guest_pgsz = 64 * KiB;
+        break;
+    case 2:
+        guest_pgsz = 16 * KiB;
+        break;
+    default:
+        error_setg(errp, "Unknown page size from TCR_EL1 (0x%" PRIx64 ")", val);
+        return;
+    }
+
+    host_pgsz = qemu_real_host_page_size();
+    start = paddr & ~(host_pgsz - 1);
+    end = start + host_pgsz;
+    num_of_addresses = 0;
-    addresses[0] = paddr;
+    while (start < end) {
+        /*
+         * The precise physical address is provided for the affected
+         * guest page that contains @paddr. Otherwise, the starting
+         * address of the guest page is provided.
+         */
+        if (paddr >= start && paddr < (start + guest_pgsz)) {
+            addresses[num_of_addresses++] = paddr;
+        } else {
+            addresses[num_of_addresses++] = start;
+        }
+
+        start += guest_pgsz;
+    }
      kvm_cpu_synchronize_state(c);
-    acpi_ghes_memory_errors(ags, ACPI_HEST_SRC_ID_SYNC, addresses, 1, errp);
+    acpi_ghes_memory_errors(ags, ACPI_HEST_SRC_ID_SYNC,
+                            addresses, num_of_addresses, errp);
      kvm_inject_arm_sea(c);
  }
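
For reference, here is a stand-alone user-space rendering of the splitting
loop above (the 64KiB/4KiB sizes and the address 0x40013488 are made-up
example values), showing that the precise faulting address is kept only for
the guest page that actually contains it:

/*
 * Illustration of the loop in push_ghes_memory_errors(): split one poisoned
 * host page into guest-page-sized error reports, keeping the precise address
 * only for the guest page that contains the fault.
 */
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint64_t host_pgsz = 64 * 1024;     /* 64KiB host page */
    uint64_t guest_pgsz = 4 * 1024;     /* 4KiB guest page (TG0 == 0) */
    uint64_t paddr = 0x40013488ULL;     /* example faulting address */
    uint64_t addresses[16];
    uint32_t num_of_addresses = 0;
    uint64_t start = paddr & ~(host_pgsz - 1);
    uint64_t end = start + host_pgsz;

    while (start < end) {
        if (paddr >= start && paddr < start + guest_pgsz) {
            addresses[num_of_addresses++] = paddr;   /* precise address */
        } else {
            addresses[num_of_addresses++] = start;   /* guest page base */
        }
        start += guest_pgsz;
    }

    for (uint32_t i = 0; i < num_of_addresses; i++) {
        printf("CPER #%" PRIu32 ": 0x%" PRIx64 "\n", i, addresses[i]);
    }
    return 0;
}

It prints 16 entries, one per 4KiB guest page in the poisoned 64KiB host page;
entry 3 carries the precise 0x40013488 and the rest carry the guest page bases.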


