-----Original Message-----

>Date: Thu, 22 Oct 2015 11:09:50 -0700
>From: John Baldwin <j...@freebsd.org>
>To: freebsd-hardware@freebsd.org
>Cc: Dieter BSD <dieter...@gmail.com>, freebsd-hack...@freebsd.org
>Subject: Re: ECC support
>Message-ID: <1492434.22kxskh...@ralph.baldwin.cx>
>Content-Type: text/plain; charset="us-ascii"
>
>The problem is that there are other fields to decode and you can only fit so 
>much in one line.

At Panasas, we did in-kernel parsing and got it down to a one-liner like this:

    Detected HW Err (CMC) - Correctable ECC error Channel:0; Dimm:0; Syndrome:2151686160


But that was only for corrected main-memory ECC errors; for all other MCAs, it 
was a multi-line format (which I think we got from backporting MCA support from 
8-STABLE):

    MCA: Bank 8, Status 0xb20000000004008f
    MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
    MCA: Vendor "GenuineIntel", ID 0x106e4, APIC ID 0
    MCA: CPU 0 UNCOR PCC GEN channel ?? memory error
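
For anyone wondering how the last line falls out of the Status word on the first, here is a minimal sketch of the decoding. It is not the Panasas or FreeBSD code; the names mca_summarize and mm_type are made up for illustration. It assumes the architectural IA32_MCi_STATUS layout from the Intel SDM (VAL, OVER, UC and PCC in bits 63, 62, 61 and 57) and the memory-controller error-code format 0000 0000 1MMM CCCC in the low 16 bits, with bit 12 ignored.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Architectural flag bits in IA32_MCi_STATUS. */
#define MC_STATUS_VAL	(1ULL << 63)	/* record is valid */
#define MC_STATUS_OVER	(1ULL << 62)	/* an earlier record was overwritten */
#define MC_STATUS_UC	(1ULL << 61)	/* error was not corrected */
#define MC_STATUS_PCC	(1ULL << 57)	/* processor context corrupt */

/* Memory transaction type, taken from the MMM field (bits 6:4). */
static const char *
mm_type(uint16_t code)
{
	static const char *names[] = { "GEN", "RD", "WR", "AC", "MS" };
	unsigned mmm = (code >> 4) & 0x7;

	return (mmm < 5 ? names[mmm] : "???");
}

/* Hypothetical helper: write a one-line summary of 'status' into 'buf'. */
int
mca_summarize(uint64_t status, char *buf, size_t len)
{
	uint16_t code = (uint16_t)(status & 0xffff);	/* MCA error code */
	const char *sev = (status & MC_STATUS_UC) ? "UNCOR " : "COR ";
	const char *pcc = (status & MC_STATUS_PCC) ? "PCC " : "";
	const char *over = (status & MC_STATUS_OVER) ? "OVER " : "";
	char chan[4];

	if ((status & MC_STATUS_VAL) == 0)
		return (snprintf(buf, len, "no valid record"));

	/* Anything that is not a memory-controller error is left undecoded. */
	if ((code & 0xef80) != 0x0080)
		return (snprintf(buf, len, "%s%s%sother error, code 0x%04x",
		    sev, pcc, over, code));

	/* CCCC == 1111 means the channel was not specified. */
	if ((code & 0xf) == 0xf)
		snprintf(chan, sizeof(chan), "??");
	else
		snprintf(chan, sizeof(chan), "%u", code & 0xfu);

	return (snprintf(buf, len, "%s%s%s%s channel %s memory error",
	    sev, pcc, over, mm_type(code), chan));
}
```

Feeding it the Bank 8 status above, 0xb20000000004008f, gives "UNCOR PCC GEN channel ?? memory error": bits 63, 61 and 57 are set, and the error code 0x008f has MMM = 000 and CCCC = 1111.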


>Also, there is not a CPU-independent way to know the address of an ECC error.  
>On Intel Core i3/5/7 (anything with QPI) you can identify the individual DIMM 
>at least, but the label that the motherboard manufacturer uses varies by 
>manufacturer.  (You can maybe scrape that text from the SMBIOS tables,

That's exactly what we did when using off-the-shelf motherboards. We were able 
to extract the name of the DIMM slot, as defined in SMBIOS, as well as the part 
and serial numbers of the DIMM, and the physical address range of the DIMM. For 
example:

    hw.mem.dimm.s: locator   serial#  part#              bank                       size     addr0         addrN
    hw.mem.dimm.0: DIMM_A1   DC917AEF 36KDZS2G72PZ-1G4D1 [NODE 0 CHANNEL 0 DIMM 0]  16384MB  0x00000000000 0x003FFFFFFFF
    hw.mem.dimm.1: DIMM_B1   DDA0C793 36KDZS2G72PZ-1G4D1 [NODE 0 CHANNEL 1 DIMM 0]  16384MB  0x00400000000 0x007FFFFFFFF
    hw.mem.dimm.2: DIMM_C1   DDA0C7B6 36KDZS2G72PZ-1G4D1 [NODE 0 CHANNEL 2 DIMM 0]  16384MB  0x00800000000 0x00BFFFFFFFF
    hw.mem.dimm.3: DIMM_D1   DDA0C7DE 36KDZS2G72PZ-1G4D1 [NODE 0 CHANNEL 3 DIMM 0]  16384MB  0x00C00000000 0x00FFFFFFFFF
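
With the address ranges in hand, mapping a reported physical address back to a slot is a simple range lookup. A minimal sketch, using the four ranges from the listing above; struct dimm_range and dimm_for_addr are hypothetical names, not the actual code. It assumes each DIMM owns one contiguous, non-interleaved range, which holds for this listing but not for every memory controller configuration, and that the MCA record carried a valid address in the first place.

```c
#include <stddef.h>
#include <stdint.h>

/* One row of the listing: slot locator plus its physical address range. */
struct dimm_range {
	const char	*locator;
	uint64_t	addr0;	/* first byte owned by the DIMM */
	uint64_t	addrN;	/* last byte owned by the DIMM (inclusive) */
};

/* Values copied from the hw.mem.dimm output shown above. */
static const struct dimm_range dimm_table[] = {
	{ "DIMM_A1", 0x00000000000ULL, 0x003FFFFFFFFULL },
	{ "DIMM_B1", 0x00400000000ULL, 0x007FFFFFFFFULL },
	{ "DIMM_C1", 0x00800000000ULL, 0x00BFFFFFFFFULL },
	{ "DIMM_D1", 0x00C00000000ULL, 0x00FFFFFFFFFULL },
};

/* Return the locator of the DIMM containing 'pa', or NULL if none does. */
const char *
dimm_for_addr(uint64_t pa)
{
	size_t i;

	for (i = 0; i < sizeof(dimm_table) / sizeof(dimm_table[0]); i++) {
		if (pa >= dimm_table[i].addr0 && pa <= dimm_table[i].addrN)
			return (dimm_table[i].locator);
	}
	return (NULL);
}
```

Each range spans 0x400000000 bytes, which is the 16384MB in the size column, so an address of 0x400000000 is the first byte of DIMM_B1.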


Re-whacking that code for -CURRENT and getting it upstream has been on my to-do 
list for a depressingly long time; it keeps getting pre-empted. :-S


>but only if they aren't wrong which they sometimes are, and good luck knowing 
>if they are wrong or right.)

Making sure the SMBIOS identifier matches the label on the motherboard is part 
of the process of validating the motherboard as usable by us. :-)

>Digital UNIX had the luxury of running on hardware built by the same company, 
>not on a random assortment of boards built by various vendors.  FreeBSD does 
>not.

Yeah. Like I said, we scraped SMBIOS *for off-the-shelf motherboards*. For our 
in-house designs, we hardcoded the Channel/DIMM mapping into an unambiguous 
form inside the driver itself.
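
A hardcoded mapping of that kind can be as simple as a static table indexed by channel and DIMM number. This is only a sketch of the idea, not the in-house driver: slot_labels and slot_label are hypothetical names, and the labels are borrowed from the off-the-shelf listing earlier in this mail purely for illustration (four channels, one DIMM per channel).

```c
#include <stddef.h>

#define NCHANNELS		4
#define NDIMMS_PER_CHANNEL	1

/* Board-specific silkscreen labels, indexed by [channel][dimm]. */
static const char *const slot_labels[NCHANNELS][NDIMMS_PER_CHANNEL] = {
	{ "DIMM_A1" },
	{ "DIMM_B1" },
	{ "DIMM_C1" },
	{ "DIMM_D1" },
};

/* Return the label for a channel/DIMM pair, or NULL if out of range. */
const char *
slot_label(unsigned channel, unsigned dimm)
{
	if (channel >= NCHANNELS || dimm >= NDIMMS_PER_CHANNEL)
		return (NULL);
	return (slot_labels[channel][dimm]);
}
```

The point of baking the table into the driver is that the label no longer depends on whatever the BIOS vendor put in SMBIOS.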

>sysutils/mcelog does some more verbose decoding of MCA records, but I find it 
>to be equally gibberish for anyone not intimately familiar with a specific CPU.
>
>I wrote a tool for a previous employer that was able to do some simple parsing 
>of MCA errors for Supermicro X7-X10 boards (Intel CPUs) and give a short 
>summary that was used in a nagios check.  However, it only handles a narrow 
>set of systems.
>
>https://github.com/freebsd/freebsd/compare/master...bsdjhb:ecc

Oooo, that looks nice! Is this something that can be committed to the main 
tree? If nothing else, I'll need to make a note of the way you're getting the 
MCA records into userland.

Thanks,

Ravi

>-- 
>John Baldwin
_______________________________________________
freebsd-hardware@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-hardware
To unsubscribe, send any mail to "freebsd-hardware-unsubscr...@freebsd.org"
