--- Orion Poplawski <[EMAIL PROTECTED]> wrote:

> Robert Hancock wrote:
> > Orion Poplawski wrote:
> >> Can someone please explain to me what these mean?
> >>
> >> EDAC k8 MC1: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
> >> EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, syndrome 0xc8f4, row 1, channel 0, label "": k8_edac
> >> EDAC MC1: CE - no information available: k8_edac Error Overflow set
> >> EDAC k8 MC1: extended error code: ECC chipkill x4 error
> >>
> >> Thanks!
> >>
> >
> > Sounds like you're having some memory ECC errors.. some Memtest86, etc.
> > runs may be in order. You may be able to figure out from this info what
> > DIMM is having the problem.
> >
>
> That was my assumption as well, but was hoping someone could decode the
> above information and point me to the problem chip. I ran Memtest86
> overnight but found no problems, but don't know if it needs to run in a
> particular ECC mode.
>
> This is a dual proc 275 system with 4 1GB DIMMs. Guessing that MC1 is
> the controller on the second CPU. Would row 1 be the second DIMM?

No, that would be the FIRST DIMM, on channel 0.

Each DIMM has 2 Chip Select rows (CSROWs), and each csrow spans both
channels. On a 4-socket memory array, therefore, CSROWs 0 and 1 are on
the first DIMM row and CSROWs 2 and 3 are on the second DIMM row:

    WWWWWWWWW   XXXXXXXXXXX
    YYYYYYYYY   ZZZZZZZZZZZ

The W and Y DIMMs are channel 0.
The X and Z DIMMs are channel 1.

csrows 0 and 1 span the Y and Z DIMMs.
csrows 2 and 3 span the W and X DIMMs.

The mapping problem is then working out which of the above goes to which
silk-screen-labeled socket on the mobo. Usually they are labeled:

    H0_DIMM2A   H0_DIMM2B
    H0_DIMM1A   H0_DIMM1B

where A is channel 0 and B is channel 1, "DIMM1" indicates CSROWs 0
and 1, and "DIMM2" indicates CSROWs 2 and 3.

The string 'label ""' can be filled in by a userspace script to properly
identify the DIMM silk screen according to the motherboard used.

The lines with "EDAC MC1:" are EDAC core output messages, while the
"EDAC k8 MC1:" lines are messages from the EDAC memory controller driver.

"CE" is a correctable error.
MC1 is memory controller 1 (0-based).
ECC ChipKill x4 is what found the error and corrected it.

The FRU (field replaceable unit) is the DIMM located at socket
H1_DIMM1A, according to the labeling I mentioned above: row 1 falls
under "DIMM1", channel 0 is "A", and the H1 prefix is taken here to
follow memory controller 1.

Caveat: the detector is not 100% perfect, but it gives a general area to
look at, namely the DIMM it specifies. Other faults can sometimes cause
what looks like a memory error, but a bad DIMM is the root cause of the
vast majority of such errors.

In addition, memtest86+ doesn't find all the bad memory in all cases,
but it is still a VERY useful tool.

doug thompson
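
For anyone who wants to script the decoding described above, here is a minimal Python sketch. It pulls the MC number, csrow and channel out of the "EDAC MC1: CE page ..." line and turns them into a silk screen name. The function names are hypothetical, and the H<mc>_DIMM<n><A|B> pattern is only the "usual" labeling mentioned above, so treat the result as a guess until it is checked against the actual board.

```python
import re

# Matches the EDAC core "CE" line format quoted in this thread.
# Other kernel versions may print this line differently.
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page 0x[0-9a-fA-F]+, offset 0x[0-9a-fA-F]+, "
    r"grain \d+, syndrome 0x[0-9a-fA-F]+, "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def parse_ce_line(line):
    """Return (mc, row, channel) from a CE line, or None if it does not match."""
    m = CE_LINE.search(line)
    if m is None:
        return None
    return int(m.group("mc")), int(m.group("row")), int(m.group("channel"))


def silk_screen_label(mc, row, channel):
    """Guess the socket label, ASSUMING the H<mc>_DIMM<n><A|B> convention.

    Each DIMM row holds two csrows (0-1 -> DIMM1, 2-3 -> DIMM2);
    channel 0 is "A" and channel 1 is "B".
    """
    if channel not in (0, 1):
        raise ValueError("expected channel 0 or 1")
    dimm_row = row // 2 + 1
    return "H%d_DIMM%d%s" % (mc, dimm_row, "AB"[channel])


if __name__ == "__main__":
    line = ('EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, '
            'syndrome 0xc8f4, row 1, channel 0, label "": k8_edac')
    parsed = parse_ce_line(line)
    if parsed is not None:
        print(silk_screen_label(*parsed))
```

Fed the CE line from the original report, this gives H1_DIMM1A, the same socket named above. On a board with different silk screen names, only silk_screen_label() would need to change.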