Re: Extended H/W error log driver

2013-10-17 Thread Borislav Petkov
On Thu, Oct 17, 2013 at 05:37:22PM +0530, Naveen N. Rao wrote:
> That's me raising both my hands :)
:-)
> If you feel so strongly about it. "Corrected Error" is an oxymoron.
> It's really just the hardware notifying us.
Yeah, but we can't write "We just corrected a single-bit flip in DIMM array

Re: Extended H/W error log driver

2013-10-17 Thread Naveen N. Rao
On 10/16/2013 12:53 AM, Borislav Petkov wrote: On Wed, Oct 16, 2013 at 12:40:40AM +0530, Naveen N. Rao wrote: +2 ;) You're counting for 2 people, huh? That's me raising both my hands :) :-) While at it, I wonder if we're better off calling these "Hardware events" rather than "Hardware e

Re: Extended H/W error log driver

2013-10-15 Thread Borislav Petkov
On Wed, Oct 16, 2013 at 12:40:40AM +0530, Naveen N. Rao wrote:
> +2 ;)
You're counting for 2 people, huh? :-)
> While at it, I wonder if we're better off calling these "Hardware
> events" rather than "Hardware errors".
Oh, please no. That's that euphemistic lying which serves no one. And here's

Re: Extended H/W error log driver

2013-10-15 Thread Naveen N. Rao
On 2013/10/15 09:15AM, Tony Luck wrote:
> On Tue, Oct 15, 2013 at 2:28 AM, Borislav Petkov wrote:
> > We can even add a hint for the user like:
> >
> > "Above errors have been corrected by the hardware and require no
> > further action."
> >
> > Btw, this is valid for both dmesg and trace

Re: Extended H/W error log driver

2013-10-15 Thread Tony Luck
On Tue, Oct 15, 2013 at 2:28 AM, Borislav Petkov wrote:
> We can even add a hint for the user like:
>
> "Above errors have been corrected by the hardware and require no
> further action."
>
> Btw, this is valid for both dmesg and trace event output.
>
> Because from my experience so far p
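The hint proposed in the quoted text can be illustrated with a small user-space sketch. This is not code from the patch series: the function name `user_hint` and the rule "emit the hint only when every reported severity is corrected" are assumptions made here for illustration, and only the hint string itself comes from the thread.

```python
# Hint text as quoted in the thread.
HINT = ("Above errors have been corrected by the hardware and require no "
        "further action.")


def user_hint(severities):
    """Return the hint only if every severity is 'corrected' (assumed rule).

    `severities` is a list of strings such as the values printed after
    "event severity:" in the log output. Hypothetical helper, not kernel code.
    """
    normalized = [s.strip().lower() for s in severities]
    if normalized and all(s == "corrected" for s in normalized):
        return HINT
    return None


if __name__ == "__main__":
    print(user_hint(["corrected", "corrected"]))
    print(user_hint(["corrected", "fatal"]))
```

Returning `None` for a mixed or empty report keeps the reassuring message from being attached to anything that still needs attention.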

Re: Extended H/W error log driver

2013-10-15 Thread Borislav Petkov
On Tue, Oct 15, 2013 at 12:07:31AM -0400, Chen Gong wrote:
> Some errors have multiple sub sections like below:
>
> [ 1442.070522] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
> [ 1442.070528] {2}[Hardware Error]: event severity: corrected
> [ 1442.070531] {2}[

Re: Extended H/W error log driver

2013-10-14 Thread Chen Gong
On Mon, Oct 14, 2013 at 12:55:33PM +0200, Borislav Petkov wrote:
> Date: Mon, 14 Oct 2013 12:55:33 +0200
> From: Borislav Petkov
> To: Chen Gong
> Cc: tony.l...@intel.com, linux-kernel@vger.kernel.org,
> linux-a...@vger.kernel.org
> Subject: Re: Extended H/W error log driver

Re: Extended H/W error log driver

2013-10-14 Thread Borislav Petkov
On Mon, Oct 14, 2013 at 02:49:40AM -0400, Chen Gong wrote:
> On Fri, Oct 11, 2013 at 10:04:27AM +0200, Borislav Petkov wrote:
> > > [56005.786154] {4}Hardware error detected on CPU0
> > > [56005.786159] {4}event severity: corrected
> > > [56005.786162] {4}sub_event[0], severity: corrected
> > > >

Re: Extended H/W error log driver

2013-10-14 Thread Chen Gong
On Fri, Oct 11, 2013 at 10:04:27AM +0200, Borislav Petkov wrote:
> Date: Fri, 11 Oct 2013 10:04:27 +0200
> From: Borislav Petkov
> To: "Chen, Gong"
> Cc: tony.l...@intel.com, linux-kernel@vger.kernel.org,
> linux-a...@vger.kernel.org
> Subject: Re: Extended H/W e

Re: Extended H/W error log driver

2013-10-11 Thread Borislav Petkov
On Fri, Oct 11, 2013 at 02:54:13PM +, Luck, Tony wrote:
> It's such a simple goal - I can't believe it took this long to get
> here :-)
Right, I'd guess some standard's body needed to be persuaded :-)
> > Btw, what's "Memriser1"?
>
> Each memory controller on this machine routes to a plug-in

RE: Extended H/W error log driver

2013-10-11 Thread Luck, Tony
>> [56005.785981] {3}physical_address: 0x000851fe
>> [56005.786027] {3}DIMM location: Memriser1 CHANNEL A DIMM 0
>
> Very good guys, I've been waiting for years for this to be possible,
> good job! :-)
It's such a simple goal - I can't believe it took this long to get here :-)
> Btw, what

Re: Extended H/W error log driver

2013-10-11 Thread Borislav Petkov
On Fri, Oct 11, 2013 at 02:32:38AM -0400, Chen, Gong wrote:
> [56005.785917] {3}Hardware error detected on CPU0
> [56005.785959] {3}event severity: corrected
> [56005.785975] {3}sub_event[0], severity: corrected
> [56005.785977] {3}section_type: memory error
> [56005.785981] {3}physical_address: 0x

Re: Extended H/W error log driver

2013-10-11 Thread Joe Perches
On Fri, 2013-10-11 at 02:32 -0400, Chen, Gong wrote:
> This patch series adds an enhanced MCA event logging driver provided by Intel.
[]
> dmesg output:
>
> [56005.785917] {3}Hardware error detected on CPU0
> [56005.785959] {3}event severity: corrected
> [56005.785975] {3}sub_event[0], severity: c
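
For readers who want to post-process the dmesg output quoted in this thread, here is a minimal user-space sketch that groups lines by their `{N}` record number and splits `key: value` fields. It is an illustration only, not part of the patch series: the names `LINE_RE`, `group_records`, and `fields` are hypothetical, and the line layout is inferred solely from the samples quoted above (including the optional `[Hardware Error]:` prefix seen in the APEI variant).

```python
import re
from collections import defaultdict

# Matches lines such as "[56005.785959] {3}event severity: corrected".
# The "[Hardware Error]: " prefix from the APEI output is treated as optional.
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+\{(?P<seq>\d+)\}"
    r"(?:\[Hardware Error\]:\s*)?(?P<body>.*)$"
)


def group_records(lines):
    """Group log lines by the {N} record number; return {N: [body, ...]}."""
    records = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            records[int(match.group("seq"))].append(match.group("body").strip())
    return dict(records)


def fields(bodies):
    """Split 'key: value' bodies into a dict; bodies without ': ' are skipped."""
    out = {}
    for body in bodies:
        key, sep, value = body.partition(": ")
        if sep:
            out[key] = value
    return out


# Sample lines copied from the messages quoted in this thread.
sample = """\
[56005.785917] {3}Hardware error detected on CPU0
[56005.785959] {3}event severity: corrected
[56005.785975] {3}sub_event[0], severity: corrected
[56005.785977] {3}section_type: memory error
[56005.786027] {3}DIMM location: Memriser1 CHANNEL A DIMM 0
""".splitlines()

records = group_records(sample)
info = fields(records[3])
print(info["event severity"], "|", info["DIMM location"])
```

Keying on the `{N}` prefix rather than on timestamps keeps lines from one error record together even if other kernel messages are interleaved in the log.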