Hi Geng,

On 11/04/2020 13:17, Dongjiu Geng wrote:
> When the RAS Extension is implemented, b0b011000, 0b011100,
> 0b011101, 0b011110, and 0b011111, are not used and reserved
> to the DFSC[5:0] of ESR_ELx, but the code still checks these
> unused bits, so remove them.

They aren't unused: CPUs without the RAS extensions may still generate these.

kvm_handle_guest_abort() wants to know if this is an external abort.
KVM doesn't really care if the CPU has the RAS extensions or not; it's the arch
code's job to sort all that out.
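
A rough sketch of that check, assuming the fault status encodings from the
Arm ARM (0b010000 and 0b0101xx for synchronous external aborts, 0b011000 and
0b0111xx for synchronous parity/ECC errors). The ex_* names are made up for
this example and are not the kernel's helpers:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names only. Values are 6-bit DFSC/IFSC encodings. */
#define EX_FSC_SEA       0x10u /* 0b010000: sync external abort, not on a table walk */
#define EX_FSC_SEA_TTW0  0x14u /* 0b010100..0b010111: same, on a table walk, level 0..3 */
#define EX_FSC_SECC      0x18u /* 0b011000: sync parity/ECC error, not on a table walk */
#define EX_FSC_SECC_TTW0 0x1cu /* 0b011100..0b011111: same, on a table walk, level 0..3 */

/* Extract the 6-bit fault status code from an ESR_ELx value. */
uint32_t ex_esr_fsc(uint64_t esr)
{
	return (uint32_t)(esr & 0x3f);
}

/*
 * Treat the parity/ECC codes as external aborts too: they are only
 * reserved when the RAS extensions are implemented, so a CPU without
 * them may still report one.
 */
bool ex_is_external_abort(uint64_t esr)
{
	uint32_t fsc = ex_esr_fsc(esr);

	if (fsc == EX_FSC_SEA || fsc == EX_FSC_SECC)
		return true;
	if (fsc >= EX_FSC_SEA_TTW0 && fsc <= EX_FSC_SEA_TTW0 + 3)
		return true;
	if (fsc >= EX_FSC_SECC_TTW0 && fsc <= EX_FSC_SECC_TTW0 + 3)
		return true;
	return false;
}
```

The caller never asks whether the CPU has the RAS extensions; it only asks
whether the abort was external.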


> If the handling of guest ras data error fails, it should
> inject data instead of SError to let the guest recover as
> much as possible.

(I don't quite follow your point here).

If KVM injected a synchronous external abort due to a RAS error here, then you
wouldn't be able to support firmware-first RAS with Qemu. I don't think this is
what you want.


The handling is (and should be) decoupled.

KVM guests aren't special. Whatever happens for a normal user-space process is
what should happen here. KVM is just doing the plumbing:

When the hypervisor takes an external abort due to the guest, it should plumb
the error into the arch code to be handled. This is what would happen for a
normal EL0 process.
This is what do_sea() and kvm_handle_guest_sea() do with apei_claim_sea().

If the RAS code says it handled this error, then we can continue. For
user-space, we return to user-space. For a guest, we return to the guest. (For
user-space this piece is not quite complete in mainline, see:
https://lore.kernel.org/linux-acpi/20200228174817.74278-4-james.mo...@arm.com/ )

This first part happens even if the errors are notified by IRQs, or found in a
polled buffer.

The RAS code may have 'handled' the memory by unmapping it, and marking the
corresponding page as HWPOISONed. If user-space tries to access this, it will
be given a SIGBUS:MCEERR_AR. If a guest tries to do this, the same thing
happens. (The signal goes to Qemu).
(See do_page_fault()'s use of the MCEERR si_codes, and
kvm_send_hwpoison_signal().)

This second part is the same regardless of how the kernel discovered the RAS
error in the first place.
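
To illustrate the receiving side, here is a minimal user-space sketch of how
a process (or a VMM such as Qemu) could tell a memory-failure SIGBUS apart
from any other. BUS_MCEERR_AR and BUS_MCEERR_AO are the Linux si_code values;
the ex_* enum and function are hypothetical and only there for the example:

```c
#define _GNU_SOURCE /* make sure siginfo_t and BUS_MCEERR_* are visible on glibc */
#include <signal.h>
#include <stddef.h>

/* Illustrative classification, not taken from any real VMM. */
enum ex_sigbus_kind {
	EX_SIGBUS_OTHER,           /* not a memory-failure report */
	EX_SIGBUS_ACTION_REQUIRED, /* BUS_MCEERR_AR: poisoned memory was accessed */
	EX_SIGBUS_ACTION_OPTIONAL, /* BUS_MCEERR_AO: poison reported, not yet consumed */
};

/* Classify a siginfo_t as it would be handed to an SA_SIGINFO handler. */
enum ex_sigbus_kind ex_classify_sigbus(const siginfo_t *si)
{
	if (si == NULL || si->si_signo != SIGBUS)
		return EX_SIGBUS_OTHER;

	switch (si->si_code) {
	case BUS_MCEERR_AR:
		return EX_SIGBUS_ACTION_REQUIRED;
	case BUS_MCEERR_AO:
		return EX_SIGBUS_ACTION_OPTIONAL;
	default:
		return EX_SIGBUS_OTHER;
	}
}
```

In a real handler si_addr would identify the poisoned location. What the VMM
then does with it (for example, notifying its guest) is outside this sketch.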


If the RAS code says it did not handle this error, it means it wasn't a RAS
error, or your platform doesn't support RAS. For an external abort there is
very little the hypervisor can do in this situation. It does what KVM has
always done: inject an asynchronous external abort.
This should only happen if the host has failed to handle the error. KVM's use
of an asynchronous abort is the simplest one-size-fits-all approach.
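
Put together, the decision looks roughly like the stand-alone model below.
This is a sketch of the control flow only, not the kernel code: the ex_*
names are invented, and the claim callback is a stand-in for the arch RAS
hook (apei_claim_sea() above), whose real signature is not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>

/* What the hypervisor does next (illustrative names). */
enum ex_guest_action {
	EX_RESUME_GUEST,  /* RAS code handled the error: return to the guest */
	EX_INJECT_SERROR, /* not handled: inject an asynchronous external abort */
};

/* Hypothetical stand-in for the arch hook; true means "claimed and handled". */
typedef bool (*ex_claim_fn)(void *ctx);

enum ex_guest_action ex_handle_guest_external_abort(ex_claim_fn claim, void *ctx)
{
	/* Plumb the error into the arch code first, as for an EL0 process. */
	if (claim != NULL && claim(ctx))
		return EX_RESUME_GUEST;

	/* Not a RAS error, or no RAS support: fall back to the async abort. */
	return EX_INJECT_SERROR;
}

/* Two trivial claim callbacks, to exercise both outcomes. */
bool ex_claim_always(void *ctx) { (void)ctx; return true; }
bool ex_claim_never(void *ctx)  { (void)ctx; return false; }
```

Nothing in this path depends on which fault status code was reported, or on
whether the CPU implements the RAS extensions.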

Are you seeing this happen? If so, what are the circumstances? Did the host
handle the error? (If not: why not!)


Thanks,

James
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
