> I'm curious if anyone has any experience with ECC uncorrectable errors
> (specifically not the identification of), but which specific dimm in
> the chassis it's pointing to.

we've had good luck using EDAC to pin down bad dimms - at least those
that cause _correctable_ errors. our uncorrectable errors trigger panics.
that's selectable, though - I guess you could turn it off
(/sys/module/edac_mc/panic_on_ue).

> The mcelog in linux doesn't seem to report the dimm slot correctly on
> my supermicro boards.

I prefer the hardware-topology-based naming that edac uses (controller,
channel, chipselect). I guess recent versions of edac have a user-space
tool that will translate that for you (but of course, you have to verify
the topo-to-label mapping yourself anyway).

regards, mark hahn.
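
for reference, a minimal sketch of walking the edac counters by topology (controller, chipselect row, channel). it assumes the older csrow-based sysfs layout (/sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count, with an optional ch*_dimm_label beside each counter); newer kernels may lay this out differently. the function name edac_channel_counts is made up for illustration, and the label is only as good as whatever was loaded into it.

```python
import glob
import os
import re


def _read(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def edac_channel_counts(root="/sys/devices/system/edac/mc"):
    """Return (controller, csrow, channel, ce_count, label) for each channel.

    Assumes the legacy csrow-based EDAC sysfs layout; `root` is a parameter
    so the function can be pointed at a copy of the tree.
    """
    rows = []
    pattern = os.path.join(root, "mc*", "csrow*", "ch*_ce_count")
    for path in sorted(glob.glob(pattern)):
        match = re.match(r"ch(\d+)_ce_count$", os.path.basename(path))
        if not match:
            continue
        channel = int(match.group(1))
        csrow_dir = os.path.dirname(path)
        csrow = os.path.basename(csrow_dir)
        controller = os.path.basename(os.path.dirname(csrow_dir))
        raw = _read(path)
        # Treat a missing or non-numeric counter as zero errors.
        ce_count = int(raw) if raw and raw.isdigit() else 0
        # The label file may be absent or empty if no labels were loaded.
        label = _read(os.path.join(csrow_dir, "ch%d_dimm_label" % channel)) or ""
        rows.append((controller, csrow, channel, ce_count, label))
    return rows


if __name__ == "__main__":
    for controller, csrow, channel, ce_count, label in edac_channel_counts():
        if ce_count:
            print("%s %s ch%d: %d correctable errors (label: %s)"
                  % (controller, csrow, channel, ce_count, label or "unset"))
```

run on a node, this prints only the channels with a nonzero correctable count; mapping a given controller/csrow/channel to a physical slot still has to be checked against the board by hand.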
