I've been mentioning this on a regular basis, but the state of MCE
handling with Xen seems poor.

I find the present handling of MCE in Xen an odd choice.  Having Xen do
most of the handling of MCE events matches the behavior of a traditional
stand-alone hypervisor.  Yet Xen's original design pushed any task not
requiring hypervisor action onto Domain 0.

MCE seems a perfect match for sharing responsibility with Domain 0.
Domain 0 needs to know about every MCE event, since this is where system
administrators will expect to find logs.  In fact, if the event is a
Correctable Error, then *only* Domain 0 needs to know.  For a CE, Xen
may need to take no action at all (though some implementations could
need help) and the affected domain would need to take no action.  It is
strictly for Uncorrectable Errors that action besides logging is needed.
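
To make the CE/UE split concrete, here is a minimal sketch of the
classification step.  The VAL and UC bit positions are the architectural
IA32_MCi_STATUS bits on x86; the names mce_record, mce_classify() and
mce_needs_action() are hypothetical and exist only for illustration,
they are not taken from Xen or Linux.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural IA32_MCi_STATUS bits (x86 machine-check architecture). */
#define MCI_STATUS_VAL (1ULL << 63) /* bank holds a valid error record */
#define MCI_STATUS_UC  (1ULL << 61) /* error was NOT corrected by hardware */

/* Hypothetical names, for illustration only. */
enum mce_class { MCE_NONE, MCE_CE, MCE_UE };

struct mce_record {
    uint64_t status; /* raw IA32_MCi_STATUS value */
    uint64_t addr;   /* raw IA32_MCi_ADDR value, if any */
};

/* Classify one bank record as "nothing", correctable or uncorrectable. */
enum mce_class mce_classify(const struct mce_record *rec)
{
    assert(rec != NULL);
    if (!(rec->status & MCI_STATUS_VAL))
        return MCE_NONE;
    return (rec->status & MCI_STATUS_UC) ? MCE_UE : MCE_CE;
}

/* Only a UE needs anything beyond logging in Domain 0. */
int mce_needs_action(const struct mce_record *rec)
{
    return mce_classify(rec) == MCE_UE;
}
```

Under this sketch a CE never leaves Domain 0's logging path; only a
record for which mce_needs_action() returns nonzero would involve Xen.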

For a UE memory error, the best approach might be for Domain 0 to decode
the error.  Once Domain 0 determines the error is a UE, it would invoke a
hypercall to pass the GPFN to Xen.  Xen would then forcibly unmap the
page (similar to what Linux does to userspace for corrupted pages).  Xen
would then identify what the page was used for, alert the affected
domain, and return that information to Domain 0.
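
The flow above could be sketched roughly as follows.  This is a
user-space model under stated assumptions, not an existing Xen
interface: xen_offline_page() stands in for the proposed hypercall, and
struct frame, struct offline_result and dom0_handle_memory_error() are
all hypothetical names.  A real implementation would work on Xen's own
page bookkeeping rather than a flat array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MCI_STATUS_VAL (1ULL << 63) /* valid error record */
#define MCI_STATUS_UC  (1ULL << 61) /* uncorrected error */

/* Hypothetical: what a frame was being used for. */
enum page_use { PAGE_FREE, PAGE_GUEST_RAM, PAGE_XEN_HEAP };

/* Hypothetical stand-in for Xen's per-frame bookkeeping. */
struct frame {
    uint64_t gpfn;
    int owner;           /* owning domain ID, -1 if none */
    enum page_use use;
    bool mapped;
    bool owner_alerted;
};

/* Hypothetical reply handed back to Domain 0. */
struct offline_result {
    int owner;
    enum page_use use;
};

/* "Xen side" of the proposed hypercall: unmap, alert owner, report use. */
int xen_offline_page(struct frame *frames, size_t n, uint64_t gpfn,
                     struct offline_result *res)
{
    assert(frames != NULL && res != NULL);
    for (size_t i = 0; i < n; i++) {
        struct frame *f = &frames[i];
        if (f->gpfn != gpfn)
            continue;
        f->mapped = false;                  /* forcibly unmap the page */
        f->owner_alerted = (f->owner >= 0); /* alert the owning domain */
        res->owner = f->owner;
        res->use = f->use;
        return 0;
    }
    return -1; /* unknown frame */
}

/*
 * "Domain 0 side": decode, then call into Xen only for a UE.
 * Returns 1 if the page was offlined, 0 if logging alone suffices,
 * -1 if Xen could not find the frame.
 */
int dom0_handle_memory_error(struct frame *frames, size_t n,
                             uint64_t status, uint64_t gpfn,
                             struct offline_result *res)
{
    if (!(status & MCI_STATUS_VAL) || !(status & MCI_STATUS_UC))
        return 0; /* nothing valid, or a CE: log only */
    return xen_offline_page(frames, n, gpfn, res) == 0 ? 1 : -1;
}
```

The point of the split is that all decoding stays in Domain 0, while Xen
only performs the one step that requires hypervisor privilege.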


The key advantage of this approach is that it makes MCE handling behave
very much like MCE handling without Xen.  Documentation about how MCEs
are reported/decoded would apply equally to Xen.  Another rather
important point is that it means less maintenance work to keep MCE
handling working with cutting-edge hardware.  I've noticed one vendor
being sluggish about getting patches into Linux and I fear similar
issues may apply more severely to Xen.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sig...@m5p.com  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445


