Lemme add some more folks to CC.
On Wed, Jan 04, 2017 at 04:42:18PM +0100, Paul Menzel wrote:
> Dear Linux folks,
>
>
> The logs contain the following messages.
>
> From Linux 4.10-rc2+ (0f64df301240 Merge branch 'parisc-4.10-2' of
> git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux):
>
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee0000000040110a
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ff40 MISC 47880018086
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1483543069 SOCKET 0 APIC 0 microcode 0
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: ee0000000040110a
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ce40 MISC 7880018086
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1483543069 SOCKET 0 APIC 0 microcode 0
>
> I am able to reproduce this also with Linux 4.8.11 from Debian Sid/unstable.
>
> Installing *mcelog* 144+dfsg-1, the file below is created.
>
> ```
> $ more /var/log/mcelog
> Hardware event. This is not a software error.
> MCE 0
> CPU 0 BANK 6
> MISC 47880018086 ADDR fef1ff40
> TIME 1483543069 Wed Jan 4 16:17:49 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> Hardware event. This is not a software error.
> MCE 1
> CPU 0 BANK 7
> MISC 7880018086 ADDR fef1ce40
> TIME 1483543069 Wed Jan 4 16:17:49 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> Hardware event. This is not a software error.
> MCE 0
> CPU 0 BANK 6
> MISC 47880018086 ADDR fef1ff40
> TIME 1483543581 Wed Jan 4 16:26:21 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> Hardware event. This is not a software error.
> MCE 1
> CPU 0 BANK 7
> MISC 7880018086 ADDR fef1ce40
> TIME 1483543581 Wed Jan 4 16:26:21 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> ```
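
For anyone who wants to cross-check the mcelog decoding above against the raw
STATUS value, here is a minimal sketch. It assumes the architectural
IA32_MCi_STATUS layout from the Intel SDM (flag bits 63..57, model-specific
code in bits 31:16, MCA error code in bits 15:0) and the compound
"cache hierarchy" code form 000F 0001 RRRR TTLL. The function and table names
are made up for this example; they are not taken from mcelog or the kernel.

```python
# Sketch only: decode architectural fields of an IA32_MCi_STATUS value.
# Bit positions are assumed from the Intel SDM; all names are hypothetical.

STATUS_FLAGS = {
    63: "VAL",    # register contents valid
    62: "OVER",   # error overflow
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}

REQUESTS = ["ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
TYPES = ["INSTRUCTION", "DATA", "GENERIC"]
LEVELS = ["L0", "L1", "L2", "GENERIC"]


def decode_mci_status(status):
    """Split a 64-bit status value into flag names and the two code fields."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": status & 0xFFFF,
    }


def decode_cache_error(mca_code):
    """Decode the compound form 000F 0001 RRRR TTLL; return None otherwise."""
    # Every bit except F (bit 12) in the upper byte must match 000x 0001.
    if mca_code & 0xEF00 != 0x0100:
        return None
    rrrr = (mca_code >> 4) & 0xF
    tt = (mca_code >> 2) & 0x3
    ll = mca_code & 0x3
    return {
        "filtering": bool(mca_code & 0x1000),  # F bit: corrected filtering
        "request": REQUESTS[rrrr] if rrrr < len(REQUESTS) else "RESERVED",
        "type": TYPES[tt] if tt < len(TYPES) else "RESERVED",
        "level": LEVELS[ll],
    }


if __name__ == "__main__":
    decoded = decode_mci_status(0xEE0000000040110A)
    print(decoded["flags"], hex(decoded["model_specific_code"]))
    print(decode_cache_error(decoded["mca_code"]))
```

Fed the value from banks 6 and 7, this gives the same flags mcelog lists
(overflow, uncorrected, MISC/ADDR valid, context corrupt) and the same
generic Level-2 cache error with the filtering bit set. It also shows that
the EN bit (bit 60) is clear, which the mcelog text does not spell out.
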
>
> It looks like it’s a common problem on this machine [1].
>
> > First, I fear that I cannot really give good answers to your questions. I
> > also own a Dell XPS 13 (9360) and see the same MCE messages. I'm in contact
> > with Dell Support because of these. They replaced the mainboard but it did
> > not help. Same messages in the logs. At some point they concluded that it
> > is probably a false positive. They had no idea what is causing it, though
> > (mcelog/kernel/Intel problem?). The correspondence with Support is still
> > ongoing.
> >
> > <rant> Btw, talking to Dell Support is a very unpleasant experience. They
> > seem to only suggest the "standard" solutions like resetting the firmware,
> > running self-health tests and so on. I didn't have the impression that I
> > was talking to someone with technical insight. </rant>
> >
> > To add more details, I see the same issue on Fedora 24 so it seems not to
> > be related to Ubuntu.
> >
> > Regarding your questions:
> >
> > What do these errors mean and should I worry about them?
> >
> > I don't know. Dell Support thinks those are false positives.
> >
> > Could these hardware errors be the cause of the freezes of the entire
> > system?
> >
> > Besides the messages my system works fine. I'd guess the freeze is a
> > different issue.
> >
> > Should I have the laptop (or parts) replaced by the manufacturer?
> >
> > Replacing the mainboard did not fix the MCE issue. It might solve the
> > freezing issue, although it seems that this was fixed by a kernel update.
> >
> > Are there any other actions I should take?
> >
> > If you are not already in contact with Support, contact them. Maybe they
> > will come up with a real solution once they see that it affects more
> > customers.
>
> Could you please tell me if and where I should open an issue in the Linux
> bug tracker [2]?
>
> Any ideas are welcome.
>
>
> Kind regards,
>
> Paul
>
>
> [1]
> https://unix.stackexchange.com/questions/324237/understanding-machine-check-exceptions-mce/330283
> [2] https://bugzilla.kernel.org/
>
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.