It's somewhat of a stretch since you say it *suddenly* lost the use of the bank of memory, but it could be that the processor for that particular bank of memory isn't properly seated.

We've had two such systems in the past couple of years. With the first, memtest86 kept reporting errors at consecutive addresses once it crossed the memory boundary where the affected processor's memory controller took over. I swapped out memory modules, fiddled with memory settings, and rearranged the cards, all to no avail. What finally fixed the problem was taking the processors out and seating each in the other's slot. At that point I had figured the processor itself was damaged and that I'd surely get errors in the other side of the memory region, but to my surprise I did not.

The second system flat out refused to recognize one whole bank of memory, like yours. After swapping out the memory didn't work, I tried the processor-swapping trick and it worked perfectly afterward. Even swapping them back into their original arrangement worked the second time.

So, if all else fails, you may want to try swapping the processors around or reseating them. It just might save you some headaches dealing with SuperMicro's RMA and tech support departments.

On 1/15/09, Chris Samuel <[email protected]> wrote:
>
> ----- "Francesco Pietra" <[email protected]> wrote:
>
>> Therefore, is it any software way to check if the CPUs are fully in
>> order, including the memory controller? lshw and other software
>> provided only partial help in my hands.
>
> Make sure that you have ECC turned to MAX in your BIOS,
> on our SuperMicro mainboards that enables scrubs of RAM
> and CPU caches as well as spotting ECC memory errors.
>
> For some reason the SuperMicro BIOS's we've had recently
> have defaulted to turning ECC off which isn't particularly
> useful, especially on motherboards that can only take ECC
> memory! We found that the hard way recently, and you
> can work that out from the output of dmidecode like this:
>
> dmidecode | grep -A7 "Physical Memory Array" | grep "Error Correction"|
> grep ECC
>
> Make sure you're also running mcelog to pull any MCE
> or ECC hardware reports that the kernel has recorded
> from the CPUs out to a logfile.
>
> We find that running it with the --k8 and --dmi options
> is important to decode more information about these events.
>
> cheers!
> Chris
> --
> Christopher Samuel - (03) 9925 4751 - Systems Manager
> The Victorian Partnership for Advanced Computing
> P.O. Box 201, Carlton South, VIC 3053, Australia
> VPAC is a not-for-profit Registered Research Agency
> _______________________________________________
> Beowulf mailing list, [email protected]
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
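
For what it's worth, the dmidecode check quoted above can be wrapped up so that it prints the actual error correction mode rather than only matching on "ECC". This is a minimal sketch and not something from Chris's mail: ecc_types and ecc_ok are hypothetical helper names, and it assumes dmidecode prints an unindented "Physical Memory Array" heading followed by indented fields, one of which is "Error Correction Type".

```shell
#!/bin/sh
# Sketch only: ecc_types and ecc_ok are hypothetical helpers, not dmidecode features.

# Read dmidecode output on stdin and print the "Error Correction Type"
# value of every Physical Memory Array record, one per line.
ecc_types() {
    awk '
        /^Physical Memory Array/ { in_array = 1; next }
        /^[^ \t]/                { in_array = 0 }  # unindented line: a new record begins
        in_array && /Error Correction Type:/ {
            sub(/^[ \t]*Error Correction Type:[ \t]*/, "")
            print
        }
    '
}

# Succeed only if at least one array was found and none of them reports "None".
ecc_ok() {
    types=$(ecc_types)
    [ -n "$types" ] || return 1
    ! printf '%s\n' "$types" | grep -qx 'None'
}

# Usage, as root on the affected node (DMI type 16 is the Physical Memory Array):
#   dmidecode -t 16 | ecc_types
#   dmidecode -t 16 | ecc_ok || echo "ECC appears to be off, check the BIOS"
```

Reading from stdin keeps the parsing separate from the privileged dmidecode call, so the same functions can be run over output saved from several nodes and compared.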
