On 19.11.18 at 14:10, Patrick M. Hausen wrote:
> Hi all,
>
> one of our production servers, 11.2p3, is logging this every couple of minutes:
>
> Nov 19 11:48:06 ph002 kernel: MCA: CPU 0 COR (5) OVER MS channel 3 memory error
> Nov 19 11:48:06 ph002 kernel: MCA: Address 0x1f709a48c0
> Nov 19 11:48:06 ph002 kernel: MCA: Misc 0x90010000040188c
> Nov 19 11:48:06 ph002 kernel: MCA: Bank 12, Status 0xcc00010c000800c3
> Nov 19 11:48:06 ph002 kernel: MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
> Nov 19 11:48:06 ph002 kernel: MCA: Vendor "GenuineIntel", ID 0x406f1, APIC ID 0
>
> Address and core vary, but it is always bank 12.
>
> It seems like applications are unaffected; we use ECC memory, of course.
>
> Is the OS able to work around these errors and is just notifying us, or is
> in-memory data already getting corrupted?
>
> I’m at a bit of a loss identifying which DIMM might be the cause, so I
> contacted Supermicro support. They answered:
>
>> We can't really answer this, we do not know how various OS's map the memory
>> slots.
>> Our advice is always to look at IPMI, but if that doesn't log any issues
>> then we're not sure you're looking at a hardware issue.
>>
>> But assuming the OS looks at the ranks of a module as a bank and you use
>> dual rank memory then it should logically point at DIMMC2.
>
> They are right about the IPMI (I told them when opening the case): there’s
> nothing at all in the event log.
>
> Can they be correct that it might not even be a hardware issue?
> If not, how can I be sure which DIMM is to blame? Spare parts are ready, but
> I’d like to have a rather short maintenance break outside regular business
> hours.
>
> I’ll attach a dmesg.boot. HW is an X10DRW-NT mainboard, SYS-1028R-WTNRT server
> platform.
>
> Thanks for any hints,
> Patrick
>
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
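As a side note on reading the quoted log: the Status value can be decoded by hand. The following is only a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout and the memory-controller compound error code (0000 0000 1MMM CCCC) as described in the Intel SDM. The function name decode_mci_status and the dictionary keys are made up for this example and are not part of FreeBSD or any tool.

```python
# Sketch: decode an IA32_MCi_STATUS value as printed on the "Status" log line.
# Bit positions are assumed from the architectural layout in the Intel SDM;
# decode_mci_status is a hypothetical helper, not an existing API.

FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten before it was read
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}

# Transaction type (MMM) of a memory-controller error code,
# using the same abbreviations that appear in the kernel message.
MEM_TRANSACTION = {0: "GEN", 1: "RD", 2: "WR", 3: "AC", 4: "MS"}


def decode_mci_status(status):
    """Return the flag names and, for memory errors, transaction and channel."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF  # MCA error code, bits 15:0
    result = {
        "flags": flags,
        "corrected": "VAL" in flags and "UC" not in flags,
        "mca_error_code": code,
    }
    if (code & 0xFF80) == 0x0080:  # matches 0000 0000 1MMM CCCC
        channel = code & 0xF
        result["memory_transaction"] = MEM_TRANSACTION.get((code >> 4) & 0x7, "reserved")
        result["channel"] = None if channel == 0xF else channel  # 0xF: not specified
    return result


if __name__ == "__main__":
    print(decode_mci_status(0xCC00010C000800C3))
```

For the value in the log this yields a valid, corrected (UC clear) error with the overflow flag set, transaction type MS (memory scrubbing) and channel 3, which agrees with the "COR ... OVER MS channel 3 memory error" line. Note that "Bank 12" refers to a machine-check register bank inside the CPU, not to a DIMM rank; how such a bank maps to a physical slot is model-specific and cannot be derived from this decode alone.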
Hi Patrick,

we had a similar experience with one of our servers (HP DL380 G7): tons
of MCA errors concerning a single memory bank. The bank number did not
correspond to a specific memory slot (HP labels them A to I for each
CPU). The iLO and mcelog output was of no help to me.

We did not notice any data loss, but to get rid of these annoying
messages I did the following: after taking the server out of production,
I removed pairs of memory modules until the MCA messages stopped. The
last pair removed therefore contained the problematic module. Re-adding
one module of this last pair then identified the defective module, with
a 50 percent chance of picking it first. After replacing this module,
the server no longer complained about memory problems.

There should definitely be a more sophisticated method of identifying
problematic memory modules. Perhaps someone on the list can shed some
light on this kind of error.

-- 
Sincerely
Alfred Bartsch
Data-Service GmbH
Beethovenstr. 2A
23617 Stockelsdorf
phone: +49 451 490010 fax: +49 451 4900123
Amtsgericht Lübeck, HRB 318 BS
Managing directors: Wilfried Paepcke, Dr. Andreas Longwitz,
Dr. Hans-Martin Rasch, Dr. Uwe Szyszka