Hi John,

I'd like to emphasize that ignoring the SRAO error for a VM is a real
problem, at least for one specific (rare) case I'm currently working
on: VM migration.

Context:

- In the case of a poisoned page in the VM address space, the migration
can't read it and will skip this page, considering it as a zero-filled
page. The VM kernel (that handled the vMCE) would have marked its
associated page as poisoned, and if the VM touches the page, the VM
kernel generates the associated MCE because it already knows about the
poisoned page.

- When we ignore the vMCE in the case of a SIGBUS/BUS_MCEERR_AO error
(what this patch does), we entirely rely on the Hypervisor to send an
SRAR error to qemu when the page is touched: The AMD VM kernel will
receive the SIGBUS/BUS_MCEERR_AR and deal with it, thanks to your
changes here.

So the mechanism appears to work fine... unless the VM has migrated
between the SRAO error and the first time it actually touches the
poisoned page and gets an SRAR error! In that case, its new address
space (created on the migration destination) will have a zero page
where we had a poisoned page, and the AMD VM kernel (which never dealt
with the SRAO) doesn't know about the poisoned page and will access it,
finding only zeros. We have memory corruption!

It is a very rare window, but in order to fix it the most reasonable
course of action would be to make the AMD emulation deal with SRAO
errors, instead of ignoring them.

Do you agree with my analysis?
Would an AMD platform generate an SRAO signal to a process
(SIGBUS/BUS_MCEERR_AO) in case of a real hardware error?

Thanks,
William.
