> On Wed, Jan 30, 2013 at 11:29:47AM -0500, Michael Madore wrote:
>> Supermicro H8QGi-F server board (AMD SR5690/SR5670/SP5100 Chipset)
>> 4 X AMD Opteron 6276 processors
>> 32 X 8GB (256GB) DDR3-1600 ECC Registered memory
>> Debian with kernel 3.2.35-2
>>
>> We have received the following two hardware errors:
>>
>> 9/10/12
>>
>> [591006.120039] [Hardware Error]: CPU:58
>> MC2_STATUS[-|CE|MiscV|-|-|-|-|CECC]: 0x9842c000000c0176
>> [591006.120048] [Hardware Error]: Combined Unit Error: VB Data/ECC error.
>> [591006.120052] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: EV
>>
>> 1/21/12
>>
>> [549004.336097] [Hardware Error]: CPU:40
>> MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c3444e0001f010b
>> [549004.336111] [Hardware Error]: MC4_ADDR: 0x000000000000e480
>> [549004.336117] [Hardware Error]: Northbridge Error (node 5): ECC
>> Error in the Probe Filter directory.
>> [549004.336125] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: GEN
>>
>> If I understand correctly, both of these errors represent single bit
>> corrected errors in the CPU cache.
>
> Internal CPU structures, victim buffer the first and the second in the
> probe filter which is part of L3.
>
>> On both occasions the system continued to function normally after the
>> error was reported.
>
> As expected; both are single-bit ECC errors which were corrected and
> system state wasn't influenced.
>
>> Is receiving two such errors (on different CPUs) over such a time span
>> cause for concern?
>
> Not really. I'd say, only if the error rate starts increasing over time
> and the error types keep repeating.
>
>> The end user is concerned there is a serious hardware problem. I'm
>> reluctant to start replacing CPUs, however, without seeing a repeated
>> pattern of errors.
>
> Yes, no need to replace, simply watch the error rates. Maybe check the
> temperature of the CPUs, possibly improve cooling are some of the things
> that come to mind.
Hi Boris,

Thank you for the information. The system has just received a third error:

[573603.432036] [Hardware Error]: CPU:32 MC4_STATUS[-|CE|MiscV|-|AddrV|-|Poison|CECC]: 0x9c43ccb0011c017b
[573603.432045] [Hardware Error]: MC4_ADDR: 0x0000002782598940
[573603.432048] [Hardware Error]: Northbridge Error (node 4): L3 ECC data cache error.
[573603.432054] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: EV

This is on a different node than the previous two errors. Each node has its own L3, correct?

Would you still advocate watching and waiting?

Thanks,

Mike
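For anyone comparing the three reports, the bracketed flags in the MCn_STATUS lines can be reproduced from the raw 64-bit value. Below is a minimal sketch in Python. It assumes the field order and bit positions that the log lines above appear to follow: Over (bit 62), UE/CE (bit 61), MiscV (bit 59), PCC (bit 57), AddrV (bit 58), Deferred (bit 44), Poison (bit 43), and the two-bit ECC field in bits 46:45. The function name decode_mc_status is made up for illustration and is not part of the kernel or any tool.

```python
def decode_mc_status(status: int) -> str:
    """Return a flag summary like "[-|CE|MiscV|-|AddrV|-|-|CECC]".

    Bit positions are assumptions inferred from the log format in this
    thread; check them against the processor documentation before
    relying on them.
    """
    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    fields = [
        "Over" if bit(62) else "-",      # an earlier error was overwritten
        "UE" if bit(61) else "CE",       # uncorrected vs. corrected error
        "MiscV" if bit(59) else "-",     # MISC register contents valid
        "PCC" if bit(57) else "-",       # processor context corrupt
        "AddrV" if bit(58) else "-",     # ADDR register contents valid
        "Deferred" if bit(44) else "-",  # deferred error (assumed position)
        "Poison" if bit(43) else "-",    # poisoned data (assumed position)
    ]

    # Bits 46:45 taken together: 2 means corrected ECC,
    # any other non-zero value is reported as uncorrected ECC.
    ecc = (status >> 45) & 0x3
    if ecc:
        fields.append("CECC" if ecc == 2 else "UECC")

    return "[" + "|".join(fields) + "]"


if __name__ == "__main__":
    # The three raw values reported in this thread.
    for raw in (0x9842C000000C0176, 0x9C3444E0001F010B, 0x9C43CCB0011C017B):
        print(f"{raw:#018x} {decode_mc_status(raw)}")
```

Run against the three values above, the sketch yields the same flag strings as the kernel log lines, including the Poison flag that is set only in the third report.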

