Just a note to inform you that I've submitted a new patch on a
separate thread -- dealing with VM live migration after receiving
memory errors:
https://lore.kernel.org/qemu-devel/20231013150839.867164-3-william.ro...@oracle.com/

This patch belongs to a 2 patches set that should fix the migration in
case of memory errors received and handled by the VM before the
migration request.

For the moment this other patch only fixes the ARM case ignoring
SIGBUS/BUS_MCEERR_AO errors, but the same mechanism should be used with
AMD ignoring SIGBUS/BUS_MCEERR_AO too. Using the same new parameter
to the kvm_hwpoison_page_add function in kvm_arch_on_sigbus_vcpu with:

    kvm_hwpoison_page_add(ram_addr, (code == BUS_MCEERR_AR));

Of course we'll have to wait for this above patch to be integrated first.

HTH,
William.


On 9/19/23 00:00, William Roche wrote:
> Hi John,
>
> I'd like to put the emphasis on the fact that ignoring the SRAO error
> for a VM is a real problem at least for a specific (rare) case I'm
> currently working on: The VM migration.
>
> Context:
>
> - In the case of a poisoned page in the VM address space, the migration
> can't read it and will skip this page, considering it as a zero-filled
> page. The VM kernel (that handled the vMCE) would have marked it's
> associated page as poisoned, and if the VM touches the page, the VM
> kernel generates the associated MCE because it already knows about the
> poisoned page.
>
> - When we ignore the vMCE in the case of a SIGBUS/BUS_MCEERR_AO error
> (what this patch does), we entirely rely on the Hypervisor to send an
> SRAR error to qemu when the page is touched: The AMD VM kernel will
> receive the SIGBUS/BUS_MCEERR_AR and deal with it, thanks to your
> changes here.
>
> So it looks like the mechanism works fine... unless the VM has migrated
> between the SRAO error and the first time it really touches the poisoned
> page to get an SRAR error !  In this case, its new address space
> (created on the migration destination) will have a zero-page where we
> had a poisoned page, and the AMD VM Kernel (that never dealt with the
> SRAO) doesn't know about the poisoned page and will access the page
> finding only zeros...  We have a memory corruption !
>
> It is a very rare window, but in order to fix it the most reasonable
> course of action would be to make the AMD emulation deal with SRAO
> errors, instead of ignoring them.
>
> Do you agree with my analysis ?
> Would an AMD platform generate SRAO signal to a process
> (SIGBUS/BUS_MCEERR_AO) in case of a real hardware error ?
>
> Thanks,
> William.

Reply via email to