Hi John,

I'd like to emphasize that ignoring the SRAO error for a VM is a real problem in at least one specific (rare) case that I'm currently working on: VM migration.
Context:

- When there is a poisoned page in the VM address space, the migration can't read it and skips it, treating it as a zero-filled page. The VM kernel (which handled the vMCE) will have marked its associated page as poisoned, and if the VM touches the page, the VM kernel generates the associated MCE because it already knows about the poisoned page.

- When we ignore the vMCE in the case of a SIGBUS/BUS_MCEERR_AO error (which is what this patch does), we rely entirely on the hypervisor to send an SRAR error to qemu when the page is touched: the AMD VM kernel will receive the SIGBUS/BUS_MCEERR_AR and deal with it, thanks to your changes here.

So it looks like the mechanism works fine... unless the VM has migrated between the SRAO error and the first time it actually touches the poisoned page, which is when it would have received the SRAR error!

In this case, its new address space (created on the migration destination) will have a zero page where we had a poisoned page. The AMD VM kernel (which never dealt with the SRAO) doesn't know about the poisoned page, so it will access the page and find only zeros... We have a memory corruption!

It is a very rare window, but the most reasonable way to fix it would be to make the AMD emulation deal with SRAO errors instead of ignoring them.

Do you agree with my analysis?
Would an AMD platform generate an SRAO signal to a process (SIGBUS/BUS_MCEERR_AO) in case of a real hardware error?

Thanks,
William.
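
P.S. To make the window concrete, here is a minimal toy model of the scenario in C. It is only a sketch of the reasoning above, not qemu or kernel code: the names `toy_vm`, `inject_srao`, `migrate`, `touch` and `run_scenario` are all hypothetical, and one byte stands in for the content of a whole page.

```c
#include <stdbool.h>
#include <string.h>

#define NPAGES 4

/* What happens when the guest touches a page (hypothetical toy model). */
enum touch_result {
    TOUCH_DATA,       /* access succeeds and returns the page content */
    TOUCH_GUEST_MCE,  /* guest kernel already knows the page is poisoned */
    TOUCH_HOST_SRAR   /* host reports the access (SIGBUS/BUS_MCEERR_AR) */
};

struct toy_vm {
    unsigned char data[NPAGES];   /* one byte stands in for a page */
    bool host_poisoned[NPAGES];   /* host-side knowledge of the poison */
    bool guest_poisoned[NPAGES];  /* guest kernel state, set by a vMCE */
};

/* An SRAO event: the host always records it; the guest only learns
 * about it if the vMCE is forwarded instead of being ignored. */
static void inject_srao(struct toy_vm *vm, int page, bool forward_to_guest)
{
    vm->host_poisoned[page] = true;
    if (forward_to_guest)
        vm->guest_poisoned[page] = true;
}

/* Migration: a poisoned page cannot be read, so it is skipped and ends
 * up zero-filled on the destination. The destination host memory is
 * healthy, so host_poisoned is not carried over. The guest kernel's own
 * bookkeeping is part of guest state and therefore migrates with it. */
static struct toy_vm migrate(const struct toy_vm *src)
{
    struct toy_vm dst;
    memset(&dst, 0, sizeof(dst));
    for (int i = 0; i < NPAGES; i++) {
        if (!src->host_poisoned[i])
            dst.data[i] = src->data[i];
        dst.guest_poisoned[i] = src->guest_poisoned[i];
    }
    return dst;
}

/* Guest access to a page. */
static enum touch_result touch(const struct toy_vm *vm, int page,
                               unsigned char *out)
{
    if (vm->guest_poisoned[page])
        return TOUCH_GUEST_MCE;
    if (vm->host_poisoned[page])
        return TOUCH_HOST_SRAR;
    *out = vm->data[page];
    return TOUCH_DATA;
}

/* Page 2 holds 0xAB, gets an SRAO, and is touched afterwards,
 * optionally after a migration. */
enum touch_result run_scenario(bool forward_srao, bool migrate_first,
                               unsigned char *out)
{
    struct toy_vm src, dst;
    const struct toy_vm *vm = &src;

    memset(&src, 0, sizeof(src));
    src.data[2] = 0xAB;
    inject_srao(&src, 2, forward_srao);
    if (migrate_first) {
        dst = migrate(&src);
        vm = &dst;
    }
    return touch(vm, 2, out);
}
```

In this model, `run_scenario(false, false, &b)` ends with the host SRAR path, which is the case that works today. `run_scenario(false, true, &b)` returns `TOUCH_DATA` with `b == 0` where the page used to hold `0xAB`, which is the silent corruption I'm describing. `run_scenario(true, true, &b)` ends with the guest raising the MCE itself, because its poison bookkeeping migrated with it.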