Re: Dell XPS13: MCE (Hardware Error) reported

Borislav Petkov Wed, 04 Jan 2017 15:07:37 -0800

Lemme add some more folks to CC.

On Wed, Jan 04, 2017 at 04:42:18PM +0100, Paul Menzel wrote:
> Dear Linux folks,
> 
> 
> The logs contain the following messages.
> 
> From Linux 4.10-rc2+ (0f64df301240 Merge branch 'parisc-4.10-2' of
> git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux):
> 
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee0000000040110a
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ff40 MISC 47880018086
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1483543069 SOCKET 0 APIC 0 microcode 0
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: ee0000000040110a
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ce40 MISC 7880018086
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1483543069 SOCKET 0 APIC 0 microcode 0
> 
> I am able to reproduce this also with Linux 4.8.11 from Debian Sid/unstable.
> 
> Installing *mcelog* 144+dfsg-1, the file below is created.
> 
> ```
> $ more /var/log/mcelog
> Hardware event. This is not a software error.
> MCE 0
> CPU 0 BANK 6
> MISC 47880018086 ADDR fef1ff40
> TIME 1483543069 Wed Jan  4 16:17:49 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> Hardware event. This is not a software error.
> MCE 1
> CPU 0 BANK 7
> MISC 7880018086 ADDR fef1ce40
> TIME 1483543069 Wed Jan  4 16:17:49 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> Hardware event. This is not a software error.
> MCE 0
> CPU 0 BANK 6
> MISC 47880018086 ADDR fef1ff40
> TIME 1483543581 Wed Jan  4 16:26:21 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> Hardware event. This is not a software error.
> MCE 1
> CPU 0 BANK 7
> MISC 7880018086 ADDR fef1ce40
> TIME 1483543581 Wed Jan  4 16:26:21 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> ```
> 
> It looks like it’s a common problem on this machine [1].
> 
> > First, I fear that I cannot really give good answers to your questions. I 
> > also own a Dell XPS 13 (9360) and see the same MCE messages. I'm in contact 
> > with Dell Support because of these. They replaced the mainboard but it did 
> > not help. Same messages in the logs. At some point they concluded that it 
> > is probably a false positive. They had no idea what is causing it, though 
> > (mcelog/kernel/Intel problem?). The correspondence with Support is still 
> > ongoing.
> > 
> > <rant> Btw, talking to Dell Support is a very unpleasant experience. They 
> > seem to only suggest the "standard" solutions like resetting the Firmware, 
> > running self-health tests and so on. I didn't have the impression that I 
> > was talking to someone with some technical insight. </rant>
> > 
> > To add more details, I see the same issue on Fedora 24 so it seems not to 
> > be related to Ubuntu.
> > 
> > Regarding your questions:
> > 
> >     What do these errors mean and should I worry about them?
> > 
> > I don't know. Dell Support thinks those are false positives.
> > 
> >     Could these hardware errors be the cause of the freezes of the entire 
> > system?
> > 
> > Besides the messages my system works fine. I'd guess the freeze is a 
> > different issue.
> > 
> >     Should I have the laptop (or parts) replaced by the manufacturer?
> > 
> > Replacing the mainboard did not fix the MCE issue. It might solve the 
> > freezing issue, although it seems that this was fixed by a kernel update.
> > 
> >     Are there any other actions I should take?
> > 
> > If you are not already in contact with Support, contact them. Maybe they 
> > will come up with a real solution once they see that it affects more 
> > customers.
> 
> Could you please tell me if and where I should open an issue in the Linux
> bug tracker [2]?
> 
> Any ideas are welcome.
> 
> 
> Kind regards,
> 
> Paul
> 
> 
> [1] 
> https://unix.stackexchange.com/questions/324237/understanding-machine-check-exceptions-mce/330283
> [2] https://bugzilla.kernel.org/
>
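
For anyone reading along, the STATUS value in the quoted mcelog output can
be decoded by hand. Below is a minimal sketch in Python, assuming the
architectural IA32_MCi_STATUS bit layout and the cache-hierarchy compound
error code format (000F 0001 RRRR TTLL) described in the Intel SDM. The
function and table names are made up for illustration and are not taken
from mcelog or the kernel.

```python
# Assumed layout (Intel SDM, Machine-Check Architecture):
#   bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC
#   bits 31:16 model-specific error code, bits 15:0 MCA error code
STATUS_FLAGS = [
    (63, "VAL"), (62, "OVER"), (61, "UC"), (60, "EN"),
    (59, "MISCV"), (58, "ADDRV"), (57, "PCC"),
]

CACHE_LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}       # LL field
CACHE_TYPE = {0: "instruction", 1: "data", 2: "generic"}      # TT field


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    mca_code = status & 0xFFFF
    result = {
        "flags": [name for bit, name in STATUS_FLAGS if (status >> bit) & 1],
        "model_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
        "cache_error": None,
    }
    # Cache-hierarchy compound code: 000F 0001 RRRR TTLL (F = filtering bit).
    if (mca_code & 0xEF00) == 0x0100:
        result["cache_error"] = {
            "filtering": bool(mca_code & 0x1000),
            "request": (mca_code >> 4) & 0xF,   # RRRR, 0 = generic error
            "type": CACHE_TYPE.get((mca_code >> 2) & 0x3, "reserved"),
            "level": CACHE_LEVEL[mca_code & 0x3],
        }
    return result


if __name__ == "__main__":
    # STATUS value reported for banks 6 and 7 in the logs above.
    print(decode_mci_status(0xEE0000000040110A))
```

Under those assumptions the flags line up with what mcelog printed (error
overflow, uncorrected, MISC/ADDR valid, processor context corrupt, L2
generic cache error with the filtering bit set). The EN bit is clear in
this value.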


-- 
Regards/Gruss,
    Boris.
