[EMAIL PROTECTED] ~]# mcelog --k8 --ascii <mce3.txt
CPU 0 4 northbridge TSC 23f6fd4262e9
RIP 00:413bd600000000
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 3d3c
bit40 = error found by scrub
bit45 = uncorrected ecc error
bit61 = error uncorrected
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS f41e21003d080a13 MCGSTATUS 7
Kernel panic - not syncing: Uncorrected machine check

and again :) this is with chipkill and everything that should help
debug/correct RAM enabled. it took a while to crash but eventually it
did.

I'm going to attack both RAM and the cooling.

On 2/1/06, mike <[EMAIL PROTECTED]> wrote:
> Yeah, I looked at the memory. It's got a PQI sticker (at least one set)
>
> It's SUPPOSED to be "(Samsung, Micron, Elpida, Infineon, Hynix OEM)" -
> which would align it basically with what Supermicro suggests:
>
> http://supermicro.com/Aplus/support/resources/memory/?sz=1.0&mspd=0.4&mtyp=9&id=51EF70624CA791283EC434A52DA0D4E2
>
> Anyway, I called Supermicro. I'm going to order their
> recommended/proper heatsink, air shroud, and then also call up the
> vendor I got the RAM from and tell them they did not deliver the
> proper stuff. They'll put up a fight, because they don't do good
> business - so tomorrow looks to be fun.
>
> Hopefully between those two any cooling and any RAM issues will be out
> of the equation.
>
> On 2/1/06, Paul Brook <[EMAIL PROTECTED]> wrote:
> > On Wednesday 01 February 2006 16:47, mike wrote:
> > > After running memtest86 (V3.3) for at least 24 hours, I came back and
> > > saw that each machine completed 61-63 cycles of tests, with 0
> > > errors...
> > >
> > > However, I did look through the BIOS for cache disabling - and it
> > > doesn't appear I can disable the CPU cache.
> > >
> > > I did turn on chipkill and some other supposed ECC memory "helpers"
> > > and instantly had the machine crash twice.
> > >
> > > [EMAIL PROTECTED] ~]# mcelog --k8 --ascii <mce2.txt
> > > CPU 0 4 northbridge TSC 2
> > > Northbridge Chipkill ECC error
> > > Chipkill ECC syndrome = 6ca0
> > > bit32 = err cpu0
> > > bit45 = uncorrected ecc error
> > > bit57 = processor context corrupt
> > > bit61 = error uncorrected
> > > bus error 'local node origin, request didn't time out
> > > generic read mem transaction
> > > memory access, level generic'
> > > STATUS b65020016c080813 MCGSTATUS 4
> > > 332ff8453 ADDR 7ff5faf0
> > > Kernel panic - not syncing: Machine check
> >
> > I had something similar, and it turned out the motherboard just didn't like
> > the brand/model of memory I was using. Replacing it with a different make
> > (this time one that was on the motherboard's recommended list) fixed the
> > problem.
> >
> > Paul
> >
>
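
For anyone cross-checking the two dumps in this thread by hand: the flagged bits and the chipkill syndrome can be read straight out of the STATUS value. Below is a rough Python sketch, not mcelog's own code. The bit descriptions are copied from the mcelog output quoted in this thread and nothing else is named. The syndrome layout (upper byte in bits 31:24, lower byte in bits 54:47) is an assumption about the K8 northbridge status register; it does reproduce the 3d3c and 6ca0 values mcelog printed. The helper names (BIT_NAMES, set_bits, chipkill_syndrome, describe) are made up for illustration.

```python
# Bit descriptions copied from the mcelog output in this thread.
# Bits that mcelog did not describe there are deliberately left unnamed.
BIT_NAMES = {
    32: "err cpu0",
    40: "error found by scrub",
    45: "uncorrected ecc error",
    57: "processor context corrupt",
    61: "error uncorrected",
    62: "error overflow (multiple errors)",
}


def set_bits(status, lo=32, hi=63):
    """Return the positions of all set bits of status in the range lo..hi."""
    return [bit for bit in range(lo, hi + 1) if (status >> bit) & 1]


def chipkill_syndrome(status):
    """Assemble the 16-bit chipkill syndrome from the STATUS value.

    ASSUMPTION: syndrome[15:8] sits in bits 31:24 and syndrome[7:0]
    in bits 54:47 of the K8 northbridge status register.
    """
    high = (status >> 24) & 0xFF
    low = (status >> 47) & 0xFF
    return (high << 8) | low


def describe(status):
    """Build mcelog-like lines for the bits this thread gives names for."""
    lines = ["Chipkill ECC syndrome = %04x" % chipkill_syndrome(status)]
    for bit in set_bits(status):
        if bit in BIT_NAMES:
            lines.append("bit%d = %s" % (bit, BIT_NAMES[bit]))
    lines.append("STATUS %016x" % status)
    return lines


if __name__ == "__main__":
    # The two STATUS values posted in this thread (mce3.txt, then mce2.txt).
    for value in (0xF41E21003D080A13, 0xB65020016C080813):
        print("\n".join(describe(value)))
        print()
```

Both crashes have bit45 and bit61 set, i.e. uncorrected ECC errors in each case; only the newer dump adds the scrub and overflow bits.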