>> That means there were no VALID=1, EN=1, S=1 errors anywhere.  But there
>> might be some other things logged that would help us understand.
>
> By "other things" you mean other MCEs?

Logs with EN=0 and/or S=0.  They may have interesting information, and have
a good chance of being useful (especially if they are from some functional
unit that isn't part of the buggy behavior). Bad data flowing through multiple
functional units can leave a trail of logged entries (perhaps as many as four
units may see and log a single error). Only one of them should signal the
machine check (to avoid a shutdown caused by a nested machine check).
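
A minimal sketch of that filtering, assuming the raw MCi_STATUS values
have already been read into an array. The bit positions (VAL=63, EN=60,
S=56) are the architectural ones from the Intel SDM; the enum and
function names are hypothetical and are not existing kernel interfaces.
Note that S is only defined on parts that report software error recovery
support (MCG_SER_P in MCG_CAP), so on older parts it reads as 0 here.

```c
#include <stdint.h>

/* Architectural bit positions in IA32_MCi_STATUS. */
#define MCI_STATUS_VAL (1ULL << 63)  /* bank holds a valid log entry */
#define MCI_STATUS_EN  (1ULL << 60)  /* error reporting was enabled */
#define MCI_STATUS_S   (1ULL << 56)  /* this bank signaled the machine check */

/* Hypothetical classification, for illustration only. */
enum mce_log_kind {
    MCE_LOG_EMPTY,     /* VAL=0: nothing logged */
    MCE_LOG_SIGNALED,  /* VAL=1, EN=1, S=1: the entry that raised the #MC */
    MCE_LOG_SILENT     /* VAL=1 with EN=0 and/or S=0: logged, not signaled */
};

enum mce_log_kind classify_status(uint64_t status)
{
    if (!(status & MCI_STATUS_VAL))
        return MCE_LOG_EMPTY;
    if ((status & MCI_STATUS_EN) && (status & MCI_STATUS_S))
        return MCE_LOG_SIGNALED;
    return MCE_LOG_SILENT;
}

/* Count the valid entries that were logged but did not signal. */
int count_silent_logs(const uint64_t *status, int nbanks)
{
    int i, n = 0;

    for (i = 0; i < nbanks; i++)
        if (classify_status(status[i]) == MCE_LOG_SILENT)
            n++;
    return n;
}
```

If count_silent_logs() comes back non-zero while no bank classifies as
MCE_LOG_SIGNALED, those silent entries are the ones worth dumping.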

> Oh, cpu errata. So this would mean that we can't even rely on the
> contents of the MCA banks, can we?
>
> In any case, is any of the information in the MCA banks in such cases
> even usable then? Because if not, we're definitely barking up the wrong
> tree...

See above - I think that even if a bug in the core means the right bits
aren't being set in the MCi_STATUS register, we could still get good data
from devices out in the uncore.

-Tony
