Re: Memory error logged in /var/log/messages

Alfred Bartsch Tue, 20 Nov 2018 01:11:56 -0800


On 19.11.18 at 14:10, Patrick M. Hausen wrote:
> Hi all,
> 
> One of our production servers, running 11.2p3, is logging this every
> couple of minutes:
> 
> Nov 19 11:48:06 ph002 kernel: MCA: CPU 0 COR (5) OVER MS channel 3 memory error
> Nov 19 11:48:06 ph002 kernel: MCA: Address 0x1f709a48c0
> Nov 19 11:48:06 ph002 kernel: MCA: Misc 0x90010000040188c
> Nov 19 11:48:06 ph002 kernel: MCA: Bank 12, Status 0xcc00010c000800c3
> Nov 19 11:48:06 ph002 kernel: MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
> Nov 19 11:48:06 ph002 kernel: MCA: Vendor "GenuineIntel", ID 0x406f1, APIC ID 0
> 
> The address and core vary, but it is always bank 12.
> 
> Applications seem to be unaffected; we use ECC memory, of course.
> 
> Is the OS able to work around these errors and merely notifying us, or
> is in-memory data already getting corrupted?
> 
> I'm at a bit of a loss identifying which DIMM might be the cause, so I
> contacted Supermicro support. They answered:
> 
>> We can't really answer this; we do not know how various OSes map the
>> memory slots.
>> Our advice is always to look at IPMI, but if that doesn't log any
>> issues then we're not sure you're looking at a hardware issue.
>>
>> But assuming the OS looks at the ranks of a module as a bank and you
>> use dual-rank memory, then it should logically point at DIMMC2.
> 
> They are right about the IPMI (I told them when opening the case):
> there is nothing at all in the event log.
> 
> Can they be correct that it might not even be a hardware issue?
> If not, how can I be sure which DIMM is to blame? Spare parts are ready,
> but I'd like to have a rather short maintenance break outside regular
> business hours.
> 
> I'll attach a dmesg.boot. The hardware is an X10DRW-NT mainboard in a
> SYS-1028R-WTNRT server platform.
> 
> Thanks for any hints,
> Patrick


Hi Patrick,

we had a similar experience with one of our servers (HP DL380 G7): tons
of MCA errors concerning a single memory bank. This bank number did not
correspond to a specific memory slot (HP labels them A to I for each
CPU). Neither iLO nor mcelog output was of any help to me.

We did not notice any data loss, but to get rid of these annoying
messages, I did the following: After taking the server out of
production, I removed pairs of memory modules until the MCA messages
stopped, so the last pair removed contained the problematic module.
Re-adding one module of that pair then showed which of the two was
defective (a 50 percent chance of picking the bad one first). After
replacing this module, the server no longer complained about memory
problems.

There should definitely be a more sophisticated method to identify
problematic memory modules. Perhaps there is someone on the list who is
able to shed some light on this kind of error.
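
As a small step in that direction, the Status word from the quoted log
can at least be decoded by hand. The sketch below is only an
illustration: the function name decode_mc_status is made up here, and
the bit positions are the architectural IA32_MCi_STATUS fields as I
understand them from the Intel SDM (VAL 63, OVER 62, UC 61, EN 60,
MISCV 59, ADDRV 58, PCC 57, and the memory controller error code
0000 0000 1MMM CCCC in the low 16 bits). It does not map the error to a
DIMM slot; the "bank" in these messages is a machine-check register
bank, not a memory rank.

```python
# Sketch: decode the architectural part of an Intel MCi_STATUS value.
# Assumption: bit layout as described in the Intel SDM machine-check chapter.

FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# MMM field of a memory controller error code (bits 6:4).
MEM_OPS = {0: "generic", 1: "read", 2: "write",
           3: "address/command", 4: "scrub"}


def decode_mc_status(status):
    """Return the flag names and, if applicable, memory error details."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF  # architectural MCA error code
    result = {
        "flags": flags,
        "corrected": ((status >> 61) & 1) == 0,  # UC bit clear
        "mca_code": code,
    }
    # Memory controller errors have the form 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        result["memory_op"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        result["channel"] = None if channel == 0xF else channel
    return result


if __name__ == "__main__":
    # Status value copied from the log quoted above.
    print(decode_mc_status(0xCC00010C000800C3))
```

For the value in the quoted log, this yields a corrected (UC clear)
scrub error on channel 3 with the overflow flag set, which agrees with
the "COR ... OVER MS channel 3" text the kernel printed. That suggests
ECC is correcting the errors, but it says nothing about which module is
responsible.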

-- 
Sincerely
Alfred Bartsch
Data-Service GmbH
Beethovenstr. 2A
23617 Stockelsdorf
phone: +49 451 490010 fax: +49 451 4900123
Lübeck District Court (Amtsgericht Lübeck), HRB 318 BS
Managing directors: Wilfried Paepcke, Dr. Andreas Longwitz,
Dr. Hans-Martin Rasch, Dr. Uwe Szyszka
_______________________________________________
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"

Re: Memory error logged in /var/log/messages