On 11/28/2015 11:01 AM, Dan Johansson wrote:
> I have started noticing the following messages in the dmesg output (and
> in the log-files) on my Gentoo rig:
> 
> [46545.779803] [Hardware Error]: Corrected error, no action required.
> [46545.779984] [Hardware Error]: CPU:3 (15:2:0) MC2_STATUS[Over|CE|MiscV|-|AddrV|-|-|CECC]: 0xdc2540f000040136
> [46545.780434] [Hardware Error]: MC2 Error Address: 0x00000002cc215138
> [46545.780605] [Hardware Error]: MC2 Error: Fill ECC error on data fills.
> [46545.783764] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD
> [46545.784088] mce: [Hardware Error]: Machine check events logged
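
As a side note, the flags the kernel prints can be recovered from the raw
MC2_STATUS value by hand. Below is a rough Python sketch of that decoding.
The flag bits 63..57 are the architectural MCi_STATUS layout; treating
bits 46:45 as the CECC/UECC field is my understanding of how AMD CPUs of
this generation report ECC, so take that part as an assumption. The
function name decode_mc_status is made up for this example.

```python
# Architectural MCi_STATUS flag bits (bit position -> name).
FLAGS = {
    63: "Val",    # register contains a valid error
    62: "Over",   # an earlier error was overwritten (overflow)
    61: "UC",     # uncorrected error; if clear, the error was corrected
    60: "En",     # error reporting was enabled
    59: "MiscV",  # the MISC register holds valid data
    58: "AddrV",  # the ADDR register holds a valid address
    57: "PCC",    # processor context corrupt
}


def decode_mc_status(status):
    """Return the names of the flags set in a 64-bit MCi_STATUS value."""
    names = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    # Assumption: bits 46:45 carry the AMD ECC indication
    # (value 2 = corrected ECC error, any other non-zero value = uncorrected).
    ecc = (status >> 45) & 0x3
    if ecc == 2:
        names.append("CECC")
    elif ecc:
        names.append("UECC")
    return names


if __name__ == "__main__":
    # Raw value taken from the quoted dmesg output.
    print(decode_mc_status(0xDC2540F000040136))
```

For the value in your log, UC is clear and CECC is set, which matches the
"Corrected error, no action required" line.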

Are you using ECC memory? I saw the same errors on a machine I had just
finished building, which turned out to have some faulty ECC DIMMs
installed.

> I have been running memtest for some time (~100h) and have not gotten
> any error message - so I am suspecting that this is a CPU problem. Am I
> correct?

In my case memtest didn't find any errors after a night of running
either, but once I booted Gentoo the errors became more frequent the
longer the machine was up and the more packages I compiled.
I think the version of memtest I was running didn't take ECC corrections
into account: from memtest's point of view every test passed, even
though the memory needed error correction to make sure everything was
read and written properly.

> If it was just these error-messages I would not be that worried, but I
> have started to get a lot of "hangers" on this rig when compiling larger
> packages. Could there be a relation to the error-messages?

I'd try to find the DIMM that's causing these errors and see how your
machine runs without it installed. I used EDAC [0] and edac-utils [1]
to find my faulty DIMMs.
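
If you'd rather look at the raw counters than install edac-utils, the
EDAC driver exposes corrected-error counts under sysfs. Here is a minimal
Python sketch that lists them. It assumes an EDAC driver for your memory
controller is loaded and that the counters sit in the usual place under
/sys/devices/system/edac/mc (per-csrow ce_count files, or per-DIMM
dimm_ce_count files on newer kernels). The function name read_ce_counts
is just something I picked for the example.

```python
import glob
import os

# Usual location of the EDAC memory-controller tree in sysfs.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_ce_counts(root=EDAC_ROOT):
    """Return {path relative to root: corrected-error count}."""
    counts = {}
    # Older layout: one counter per chip-select row.
    # Newer layout: one counter per DIMM.
    patterns = ("mc*/csrow*/ce_count", "mc*/dimm*/dimm_ce_count")
    for pattern in patterns:
        for path in sorted(glob.glob(os.path.join(root, pattern))):
            try:
                with open(path) as f:
                    counts[os.path.relpath(path, root)] = int(f.read().strip())
            except (OSError, ValueError):
                # Skip counters that cannot be read or parsed.
                continue
    return counts


if __name__ == "__main__":
    counts = read_ce_counts()
    if not counts:
        print("no EDAC counters found (is the EDAC driver loaded?)")
    # Show the noisiest row/DIMM first.
    for name, count in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"{count:8d}  {name}")
```

A row or DIMM whose count keeps climbing while you compile is the one I
would pull first.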

- Boy

[0] https://www.kernel.org/doc/Documentation/edac.txt
[1] https://packages.gentoo.org/package/sys-apps/edac-utils
