On Fri, Nov 21, 2014 at 09:59:49PM +, Luck, Tony wrote:
> > Oh, cpu errata. So this would mean that we can't even rely on the
> > contents of the MCA banks, can we?
> >
> > In any case, is any of the information in the MCA banks in such cases
> > even usable then? Because if not, we're definite
On Sat, Nov 22, 2014 at 11:32:00PM +0800, rui wang wrote:
> But that means mcelog buffer will have to become circular, and we can
> only dump the last 32 errors. There must be a reason why it wasn't
> designed as circular.
Is there? Please do tell because I don't know the reason why.
--
Regards/
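
For reference, a minimal userspace sketch of what "circular" would mean
here, assuming a fixed 32-slot log as in the figure quoted above. The names
struct rec, struct ring, ring_add() and ring_get() are made up for
illustration and are not the kernel's mcelog code. Once the ring is full,
each new record overwrites the oldest one, so only the last 32 survive:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_LEN 32	/* the "last 32 errors" figure from the thread */

/* Hypothetical record type; the real struct mce carries far more fields. */
struct rec {
	uint64_t status;
	uint64_t addr;
};

struct ring {
	struct rec entry[LOG_LEN];
	size_t next;	/* slot the next record will be written to */
	size_t count;	/* number of valid records, at most LOG_LEN */
};

/* Append a record, overwriting the oldest one once the ring is full. */
static void ring_add(struct ring *r, const struct rec *m)
{
	assert(r && m);
	r->entry[r->next] = *m;
	r->next = (r->next + 1) % LOG_LEN;
	if (r->count < LOG_LEN)
		r->count++;
}

/* Return the i-th oldest surviving record (0 == oldest), or NULL. */
static const struct rec *ring_get(const struct ring *r, size_t i)
{
	size_t start;

	if (i >= r->count)
		return NULL;
	start = (r->next + LOG_LEN - r->count) % LOG_LEN;
	return &r->entry[(start + i) % LOG_LEN];
}
```

The trade-off is the one raised elsewhere in the thread: a ring keeps the
most recent errors, while a fill-once buffer keeps the first ones.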
On 11/22/14, Borislav Petkov wrote:
> On Sat, Nov 22, 2014 at 10:16:49AM +0800, rui wang wrote:
>> I think both possibilities are valid. But experiments show that the
>> error logs are not in the dmesg preserved by kdump in /var/crash/
>> after panic and reboot, and not in the mcelog.entry[] array
On Sat, Nov 22, 2014 at 10:16:49AM +0800, rui wang wrote:
> I think both possibilities are valid. But experiments show that the
> error logs are not in the dmesg preserved by kdump in /var/crash/
> after panic and reboot, and not in the mcelog.entry[] array in the
> kernel. So they must be somewher
On 11/22/14, Borislav Petkov wrote:
>... there are two possibilities:
>
> * error got logged into mcelog and is long out to dmesg.
>
> So we go look at dmesg. Not very easy to do when we panic, I know, so we
> better make sure we have serial connected.
>
>
> [ Btw., we can know when userspace is
>> That means there were no VALID=1, EN=1, S=1 errors anywhere. But there
>> might be some other things logged that would help us understand.
>
> By "other things" you mean other MCEs?
Logs with EN=0 and/or S=0. They may have interesting information, and have
a good chance of being useful (espec
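
A rough sketch of the distinction being discussed, assuming the usual
IA32_MCi_STATUS bit positions (VAL=63, UC=61, EN=60, S=56). The helper
names mci_strict_match() and mci_worth_dumping() are invented for this
example and do not exist in the kernel. Note also that S is only
architecturally defined when the CPU advertises software error recovery:

```c
#include <stdbool.h>
#include <stdint.h>

/* IA32_MCi_STATUS flag bits (positions as used in the SDM). */
#define MCI_STATUS_VAL	(1ULL << 63)	/* register contents are valid */
#define MCI_STATUS_UC	(1ULL << 61)	/* uncorrected error */
#define MCI_STATUS_EN	(1ULL << 60)	/* error reporting was enabled */
#define MCI_STATUS_S	(1ULL << 56)	/* signaled via machine check */

/* Strict filter: only banks with VALID=1, EN=1 and S=1 count. */
static bool mci_strict_match(uint64_t status)
{
	const uint64_t want = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_S;

	return (status & want) == want;
}

/*
 * Relaxed filter (hypothetical): anything with VAL=1 is worth dumping
 * before a panic, even if EN=0 and/or S=0.
 */
static bool mci_worth_dumping(uint64_t status)
{
	return (status & MCI_STATUS_VAL) != 0;
}
```

With the strict filter, a bank holding VAL=1, UC=1, EN=0 is skipped
entirely; the relaxed one would still print it on the way to the panic.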
On Fri, Nov 21, 2014 at 09:31:56PM +, Luck, Tony wrote:
> >
> >        /*
> >         * No machine check event found. Must be some external
> >         * source or one CPU is hung. Panic.
> >         */
> >        if (global_worst <= MCE_KEEP_SEVERITY && mca_cfg.tolerant < 3)
> >
>
>        /*
>         * No machine check event found. Must be some external
>         * source or one CPU is hung. Panic.
>         */
>        if (global_worst <= MCE_KEEP_SEVERITY && mca_cfg.tolerant < 3)
>                mce_panic("Machine check from unknown source", NULL, NULL);
>
> Provided
On Fri, Nov 21, 2014 at 05:20:53PM +, Luck, Tony wrote:
> > leave them in. Then you can read them out again on panic time. The mce
> > log buffer will have to become a circular buffer or something like that.
>
> This is a mixed bag. If there are a bunch of errors so that we overflow the
> bu
> leave them in. Then you can read them out again on panic time. The mce
> log buffer will have to become a circular buffer or something like that.
This is a mixed bag. If there are a bunch of errors so that we overflow the
buffer, then general wisdom says that people want to see the first error
On Fri, Nov 21, 2014 at 09:20:59AM +0800, rui wang wrote:
> We've found there are cases after mce_log() has been called, we then
> decide to panic, but print_mce() can't find anything in the mcelog
> buffer. I think the mcelog buffer can be consumed by the user space
> daemon (possibly on a differe
On 11/20/14, Borislav Petkov wrote:
> On Wed, Nov 19, 2014 at 11:34:10PM +, Luck, Tony wrote:
>> The SDM has this to say about EN=0 (in section 15.10.4.1 of volume 3B):
>>
>>    When the EN flag is zero but the VAL and UC flags are one in
>>    the IA32_MCi_STATUS register, the reported uncorr
On Wed, Nov 19, 2014 at 11:34:10PM +, Luck, Tony wrote:
> The SDM has this to say about EN=0 (in section 15.10.4.1 of volume 3B):
>
>    When the EN flag is zero but the VAL and UC flags are one in
>    the IA32_MCi_STATUS register, the reported uncorrected error
>    in this bank is not enabl
>> No information besides that it is a machine check. This happens in two cases:
>> 1) The CPU logs the error with the MCi_STATUS.EN bit set to zero, and Linux
>>    ignores EN=0 entries (as it should).
> Well, I guess we shouldn't anymore. Apparently hw forgets to set the
> bit when raising an MC
From: Rui Wang
There are cases when a machine check panics without giving any information
about the error:
[ 177.806166] Kernel panic - not syncing: Machine check from unknown source
No information besides that it is a machine check. This happens in two cases:
1) The CPU logs the error with t
On Wed, Nov 19, 2014 at 05:22:41PM +0800, ruiv.w...@gmail.com wrote:
> From: Rui Wang
>
> There are cases when a machine check panics without giving any information
> about the error:
>
> [ 177.806166] Kernel panic - not syncing: Machine check from unknown source
>
> No information besides th