Re: [PATCH v3] x86/mce: Try printing all machine check banks known before panic

2014-11-23 Thread Borislav Petkov
On Fri, Nov 21, 2014 at 09:59:49PM +, Luck, Tony wrote: > > Oh, cpu errata. So this would mean that we can't even rely on the > > contents of the MCA banks, can we? > > > > In any case, is any of the information in the MCA banks in such cases > > even usable then? Because if not, we're definite

Re: [PATCH v3] x86/mce: Try printing all machine check banks known before panic

2014-11-22 Thread Borislav Petkov
On Sat, Nov 22, 2014 at 11:32:00PM +0800, rui wang wrote: > But that means mcelog buffer will have to become circular, and we can > only dump the last 32 errors. There must be a reason why it wasn't > designed as circular. Is there? Please do tell because I don't know the reason why. -- Regards/

Re: [PATCH v3] x86/mce: Try printing all machine check banks known before panic

2014-11-22 Thread rui wang
On 11/22/14, Borislav Petkov wrote: > On Sat, Nov 22, 2014 at 10:16:49AM +0800, rui wang wrote: >> I think both possibilities are valid. But experiments show that the >> error logs are not in the dmesg preserved by kdump in /var/crash/ >> after panic and reboot, and not in the mcelog.entry[] array

Re: [PATCH v3] x86/mce: Try printing all machine check banks known before panic

2014-11-22 Thread Borislav Petkov
On Sat, Nov 22, 2014 at 10:16:49AM +0800, rui wang wrote: > I think both possibilities are valid. But experiments show that the > error logs are not in the dmesg preserved by kdump in /var/crash/ > after panic and reboot, and not in the mcelog.entry[] array in the > kernel. So they must be somewher

Re: [PATCH v3] x86/mce: Try printing all machine check banks known before panic

2014-11-21 Thread rui wang
On 11/22/14, Borislav Petkov wrote: >... there are two possibilities: > > * error got logged into mcelog and is long out to dmesg. > > So we go look at dmesg. Not very easy to do when we panic, I know, so we > better make sure we have serial connected. > > > [ Btw., we can know when userspace is

RE: [PATCH v3] x86/mce: Try printing all machine check banks known before panic

2014-11-21 Thread Luck, Tony
>> That means there were no VALID=1, EN=1, S=1 errors anywhere. But there >> might be some other things logged that would help us understand. > > By "other things" you mean other MCEs? Logs with EN=0 and/or S=0. They may have interesting information, and have a good chance of being useful (espec
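[Editor's illustration, not part of the original message.] The filtering described above can be sketched as a pair of predicates on an IA32_MCi_STATUS value. The bit positions follow the Intel SDM layout and the macro names mirror the kernel's `MCI_STATUS_*` definitions. The two helper functions are hypothetical, written only to show why a bank logged with EN=0 or S=0 would be skipped by a strict scan yet could still hold useful information. Note that the S bit is only meaningful on CPUs that support software error recovery.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in IA32_MCi_STATUS as laid out in the Intel SDM. */
#define MCI_STATUS_VAL (1ULL << 63)  /* register holds valid information */
#define MCI_STATUS_UC  (1ULL << 61)  /* error was not corrected */
#define MCI_STATUS_EN  (1ULL << 60)  /* error reporting was enabled */
#define MCI_STATUS_S   (1ULL << 56)  /* error was signaled via a machine check */

/* Hypothetical helper: true if a scan that insists on
 * VAL=1, EN=1 and S=1 would keep this bank. */
static bool strict_filter_keeps(uint64_t status)
{
    const uint64_t need = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_S;

    return (status & need) == need;
}

/* Hypothetical helper: true if the bank holds anything at all,
 * which is the looser test one might apply just before panicking. */
static bool worth_dumping(uint64_t status)
{
    return (status & MCI_STATUS_VAL) != 0;
}
```

A bank with VAL=1 and UC=1 but EN=0 fails `strict_filter_keeps()` while still passing `worth_dumping()`, which is the kind of record the message above suggests printing.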

Re: [PATCH v3] x86/mce: Try printing all machine check banks known before panic

2014-11-21 Thread Borislav Petkov
On Fri, Nov 21, 2014 at 09:31:56PM +, Luck, Tony wrote: > > > >/* > > * No machine check event found. Must be some external > > * source or one CPU is hung. Panic. > > */ > >if (global_worst <= MCE_KEEP_SEVERITY && mca_cfg.tolerant < 3) > >

RE: [PATCH v3] x86/mce: Try printing all machine check banks known before panic

2014-11-21 Thread Luck, Tony
> >/* > * No machine check event found. Must be some external > * source or one CPU is hung. Panic. > */ >if (global_worst <= MCE_KEEP_SEVERITY && mca_cfg.tolerant < 3) >mce_panic("Machine check from unknown source", NULL, NULL); > > Provided

Re: [PATCH v3] x86/mce: Try printing all machine check banks known before panic

2014-11-21 Thread Borislav Petkov
On Fri, Nov 21, 2014 at 05:20:53PM +, Luck, Tony wrote: > > leave them in. Then you can read them out again on panic time. The mce > > log buffer will have to become a circular buffer or something like that. > > This is a mixed bag. If there are a bunch of errors so that we overflow the > bu

RE: [PATCH v3] x86/mce: Try printing all machine check banks known before panic

2014-11-21 Thread Luck, Tony
> leave them in. Then you can read them out again on panic time. The mce > log buffer will have to become a circular buffer or something like that. This is a mixed bag. If there are a bunch of errors so that we overflow the buffer, then general wisdom says that people want to see the first error
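[Editor's illustration, not part of the original message.] The trade-off raised here is between keeping the first errors and keeping the most recent ones once a fixed-size log overflows. It can be sketched with two insertion policies over the same buffer. This is a minimal sketch and not the kernel's mcelog code: the struct, the function names, and the use of a plain `uint64_t` as the record are all assumptions made for illustration. The length of 32 is taken from the "last 32 errors" figure mentioned elsewhere in the thread.

```c
#include <stddef.h>
#include <stdint.h>

#define LOG_LEN 32  /* buffer size, per the "last 32 errors" remark in the thread */

/* Hypothetical fixed-size error log; a record is reduced to one integer. */
struct err_log {
    uint64_t entry[LOG_LEN];
    size_t   total;  /* number of records ever offered to the log */
};

/* Keep-first policy: once the buffer is full, later records are dropped,
 * so the first error of a burst always survives. */
static void log_keep_first(struct err_log *log, uint64_t rec)
{
    if (log->total < LOG_LEN)
        log->entry[log->total] = rec;
    log->total++;
}

/* Keep-last (circular) policy: once the buffer is full, the oldest
 * record is overwritten, so only the most recent LOG_LEN survive. */
static void log_keep_last(struct err_log *log, uint64_t rec)
{
    log->entry[log->total % LOG_LEN] = rec;
    log->total++;
}

/* Oldest record still held under the circular policy. */
static uint64_t log_oldest_circular(const struct err_log *log)
{
    size_t start = log->total > LOG_LEN ? log->total % LOG_LEN : 0;

    return log->entry[start];
}
```

After 40 records numbered 1 to 40, the keep-first log still begins with record 1, while the circular log's oldest surviving record is 9: the first eight errors, possibly including the root cause, are gone.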

Re: [PATCH v3] x86/mce: Try printing all machine check banks known before panic

2014-11-21 Thread Borislav Petkov
On Fri, Nov 21, 2014 at 09:20:59AM +0800, rui wang wrote: > We've found there are cases after mce_log() has been called, we then > decide to panic, but print_mce() can't find anything in the mcelog > buffer. I think the mcelog buffer can be consumed by the user space > daemon (possibly on a differe

Re: [PATCH v3] x86/mce: Try printing all machine check banks known before panic

2014-11-20 Thread rui wang
On 11/20/14, Borislav Petkov wrote: > On Wed, Nov 19, 2014 at 11:34:10PM +, Luck, Tony wrote: >> The SDM has this to say about EN=0 (in section 15.10.4.1 of volume 3B): >> >>When the EN flag is zero but the VAL and UC flags are one in >>the IA32_MCi_STATUS register, the reported uncorr

Re: [PATCH v3] x86/mce: Try printing all machine check banks known before panic

2014-11-20 Thread Borislav Petkov
On Wed, Nov 19, 2014 at 11:34:10PM +, Luck, Tony wrote: > The SDM has this to say about EN=0 (in section 15.10.4.1 of volume 3B): > >When the EN flag is zero but the VAL and UC flags are one in >the IA32_MCi_STATUS register, the reported uncorrected error >in this bank is not enabl

RE: [PATCH v3] x86/mce: Try printing all machine check banks known before panic

2014-11-19 Thread Luck, Tony
>> No information besides that it is a machine check. This happens in two cases: >> 1) The CPU logs the error with the MCi_STATUS.EN bit set to zero, and Linux >>ignores EN=0 entries (as it should). > Well, I guess we shouldn't anymore. Apparently hw forgets to set the > bit when raising an MC

[PATCH v3] x86/mce: Try printing all machine check banks known before panic

2014-11-19 Thread ruiv . wang
From: Rui Wang There are cases when a machine check panics without giving any information about the error: [ 177.806166] Kernel panic - not syncing: Machine check from unknown source No information besides that it is a machine check. This happens in two cases: 1) The CPU logs the error with t

Re: [PATCH v3] x86/mce: Try printing all machine check banks known before panic

2014-11-19 Thread Borislav Petkov
On Wed, Nov 19, 2014 at 05:22:41PM +0800, ruiv.w...@gmail.com wrote: > From: Rui Wang > > There are cases when a machine check panics without giving any information > about the error: > > [ 177.806166] Kernel panic - not syncing: Machine check from unknown source > > No information besides th