Re: [PATCH v10 1/3] aerdrv: Trace Event for AER
On Fri, Dec 06, 2013 at 11:11:07PM +0800, Ethan Zhao wrote:
> > @@ -63,10 +63,10 @@ TRACE_EVENT(aer_event,
> >
> >      TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
> >              __get_str(dev_name),
> > -            __entry->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
> > -                    __entry->severity == HW_EVENT_ERR_FATAL ?
> > +            __entry->severity == AER_CORRECTABLE ? "Corrected" :
> > +                    __entry->severity == AER_FATAL ?
> >                      "Fatal" : "Uncorrected",
>
> Why not "Fatal" : "Non-fatal"? Per the PCIe spec, 'Fatal' and
> 'Non-fatal' are sub-categories of "Uncorrected". But here
> "Uncorrected" means "Non-fatal".

... and just to denote that, it'll probably be best to say:

        __entry->severity == AER_CORRECTABLE ? "Corrected" :
                __entry->severity == AER_FATAL ?
                "Fatal" : "Uncorrected, non-fatal"

right?

Btw, Rui, your patch is whitespace-damaged so next time please try
sending from a real mail client which doesn't mangle whitespace and not
from the gmail web interface. Sending the patch to yourself and trying
to apply it is always a good test for that.

Thanks.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
Re: [PATCH v10 1/3] aerdrv: Trace Event for AER
On Fri, Dec 6, 2013 at 5:06 PM, rui wang wrote:
> On 12/5/13, Borislav Petkov wrote:
>
>> Yes, the AER tracepoint above should use the AER_* defines and not the
>> HW_EVENT_ERR_* ones which are for memory errors.
>>
>> Wanna send a fix?
>>
>
> Yes. Does it translate into something like this?
>
> From: Rui Wang
> Date: Fri, 6 Dec 2013 16:47:46 +0800
> Subject: [PATCH] Fix severity usage in aer trace event
>
> Signed-off-by: Rui Wang
> ---
>  include/trace/events/ras.h |    8
>  1 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/include/trace/events/ras.h b/include/trace/events/ras.h
> index 88b8783..e2a17d8 100644
> --- a/include/trace/events/ras.h
> +++ b/include/trace/events/ras.h
> @@ -5,7 +5,7 @@
>  #define _TRACE_AER_H
>
>  #include
> -#include
> +#include
>
>
>  /*
> @@ -63,10 +63,10 @@ TRACE_EVENT(aer_event,
>
>      TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
>              __get_str(dev_name),
> -            __entry->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
> -                    __entry->severity == HW_EVENT_ERR_FATAL ?
> +            __entry->severity == AER_CORRECTABLE ? "Corrected" :
> +                    __entry->severity == AER_FATAL ?
>                      "Fatal" : "Uncorrected",

Why not "Fatal" : "Non-fatal"? Per the PCIe spec, 'Fatal' and
'Non-fatal' are sub-categories of "Uncorrected". But here
"Uncorrected" means "Non-fatal".

Thanks,
Ethan

> -            __entry->severity == HW_EVENT_ERR_CORRECTED ?
> +            __entry->severity == AER_CORRECTABLE ?
>              __print_flags(__entry->status, "|", aer_correctable_errors) :
>              __print_flags(__entry->status, "|", aer_uncorrectable_errors))
>  );
> --
> 1.7.5.4
>
> Regards,
> Rui
Re: [PATCH v10 1/3] aerdrv: Trace Event for AER
On 12/5/13, Borislav Petkov wrote:
> Yes, the AER tracepoint above should use the AER_* defines and not the
> HW_EVENT_ERR_* ones which are for memory errors.
>
> Wanna send a fix?
>

Yes. Does it translate into something like this?

From: Rui Wang
Date: Fri, 6 Dec 2013 16:47:46 +0800
Subject: [PATCH] Fix severity usage in aer trace event

Signed-off-by: Rui Wang
---
 include/trace/events/ras.h |    8
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/ras.h b/include/trace/events/ras.h
index 88b8783..e2a17d8 100644
--- a/include/trace/events/ras.h
+++ b/include/trace/events/ras.h
@@ -5,7 +5,7 @@
 #define _TRACE_AER_H

 #include
-#include
+#include


 /*
@@ -63,10 +63,10 @@ TRACE_EVENT(aer_event,

     TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
             __get_str(dev_name),
-            __entry->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
-                    __entry->severity == HW_EVENT_ERR_FATAL ?
+            __entry->severity == AER_CORRECTABLE ? "Corrected" :
+                    __entry->severity == AER_FATAL ?
                     "Fatal" : "Uncorrected",
-            __entry->severity == HW_EVENT_ERR_CORRECTED ?
+            __entry->severity == AER_CORRECTABLE ?
             __print_flags(__entry->status, "|", aer_correctable_errors) :
             __print_flags(__entry->status, "|", aer_uncorrectable_errors))
 );
--
1.7.5.4

Regards,
Rui
Re: [BUG] Re: [PATCH v10 1/3] aerdrv: Trace Event for AER
On Thu, Dec 05, 2013 at 11:21:10AM -0700, Betty Dall wrote:
> The definition of the GHES_SEV* matches up with the error severity
> definition of the CPER records as defined in the UEFI spec section
> N.2.1:
> "Indicates the severity of the error condition. The severity of
> the error record corresponds to the most severe error
> section.
> 0 - Recoverable (also called non-fatal uncorrected)
> 1 - Fatal
> 2 - Corrected
> 3 - Informational
> All other values are reserved.
> Note that severity of "Informational" indicates that the record
> could be safely ignored by error handling software."

Actually, we can go even one radical step further and drop
ghes_severity() completely because GHES severity in the ACPI spec 5.0 is
defined almost exactly the same:

"18.3.2.6.1 Generic Error Data

...

Identifies the error severity of the reported error:
0 – Recoverable
1 – Fatal
2 – Corrected
3 – None
Note: This is the error severity of the entire event. Each Generic Error
Data Entry also includes its own Error Severity field."

I don't know which version of the spec dictated

enum {
        GHES_SEV_NO = 0x0,
        GHES_SEV_CORRECTED = 0x1,
        GHES_SEV_RECOVERABLE = 0x2,
        GHES_SEV_PANIC = 0x3,
};

though and whether we're going to have to differentiate between the old
and GHES numerical severity levels. Which, if we have to, would be very
nasty...

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
Re: [BUG] Re: [PATCH v10 1/3] aerdrv: Trace Event for AER
On Wed, 2013-12-04 at 23:28 +0800, Ethan Zhao wrote:
> Rui,
>    Agreed. There are many such confusing error type definitions that
> need to be standardized or unified; some of them are ambiguous or
> inconsistent, and some of them violate the ACPI/PCI specs.
>
> According to the ACPI spec, 'FATAL' is in fact a sub-category of
> 'UNCORRECTABLE'; the other one is 'NON-FATAL'. So how should the
> 'UNCORRECTABLE's below be understood? Do they mean
> uncorrectable-non-fatal? One can only guess, which is confusing.
>
> acpi\actbl1.h
>
> #define ACPI_EINJ_PROCESSOR_CORRECTABLE     (1)
> #define ACPI_EINJ_PROCESSOR_UNCORRECTABLE   (1<<1)
> #define ACPI_EINJ_PROCESSOR_FATAL           (1<<2)
> #define ACPI_EINJ_MEMORY_CORRECTABLE        (1<<3)
> #define ACPI_EINJ_MEMORY_UNCORRECTABLE      (1<<4)
> #define ACPI_EINJ_MEMORY_FATAL              (1<<5)
> #define ACPI_EINJ_PCIX_CORRECTABLE          (1<<6)
> #define ACPI_EINJ_PCIX_UNCORRECTABLE        (1<<7)
> #define ACPI_EINJ_PCIX_FATAL                (1<<8)
> #define ACPI_EINJ_PLATFORM_CORRECTABLE      (1<<9)
> #define ACPI_EINJ_PLATFORM_UNCORRECTABLE    (1<<10)
> #define ACPI_EINJ_PLATFORM_FATAL            (1<<11)
>
> edac.h
>
> enum hw_event_mc_err_type {
>         HW_EVENT_ERR_CORRECTED,
>         HW_EVENT_ERR_UNCORRECTED,
>         HW_EVENT_ERR_FATAL,
>         HW_EVENT_ERR_INFO,
> };
>
> ghes.h
> enum {
>         GHES_SEV_NO = 0x0,
>         GHES_SEV_CORRECTED = 0x1,
>         GHES_SEV_RECOVERABLE = 0x2,
>         GHES_SEV_PANIC = 0x3,
> };
>
> What's the meaning of GHES_SEV_PANIC? Why not 'FATAL', just as
> described in ACPI spec section 18.3.2.6.1:
> "
> Error Severity 4 16 Identifies the error severity of the reported error:
> 0 – Recoverable
> 1 – Fatal
> 2 – Corrected
> 3 – None
> "
> If there is some other intention, note that it is translated into
> 'FATAL' later anyway:
>
>         case GHES_SEV_PANIC:
>                 type = HW_EVENT_ERR_FATAL;
>
> And these look reasonable:
> aer.h
>
> #define AER_NONFATAL    0
> #define AER_FATAL       1
> #define AER_CORRECTABLE 2

The definition of the GHES_SEV* matches up with the error severity
definition of the CPER records as defined in the UEFI spec section
N.2.1:
"Indicates the severity of the error condition. The severity of
the error record corresponds to the most severe error
section.
0 - Recoverable (also called non-fatal uncorrected)
1 - Fatal
2 - Corrected
3 - Informational
All other values are reserved.
Note that severity of "Informational" indicates that the record
could be safely ignored by error handling software."

The ghes code uses the CPER record's severity and always calls the
function ghes_severity() to convert to the GHES_SEV value. Since the
ACPI spec defines the GHES severity, it makes sense to maintain an enum
for it and use ghes_severity() to convert where necessary.
This is what I am thinking:

Author: Betty Dall
Date:   Thu Dec 5 11:05:43 2013 -0700

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a30bc31..c59144e 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -301,16 +301,18 @@ static inline int ghes_severity(int severity)
 {
         switch (severity) {
         case CPER_SEV_INFORMATIONAL:
-                return GHES_SEV_NO;
+                return GHES_SEV_NONE;
         case CPER_SEV_CORRECTED:
                 return GHES_SEV_CORRECTED;
         case CPER_SEV_RECOVERABLE:
                 return GHES_SEV_RECOVERABLE;
         case CPER_SEV_FATAL:
-                return GHES_SEV_PANIC;
+                return GHES_SEV_FATAL;
         default:
                 /* Unknown, go panic */
-                return GHES_SEV_PANIC;
+                pr_warn(FW_WARN GHES_PFX
+                        "Invalid CPER severity: %d\n", severity);
+                return GHES_SEV_FATAL;
         }
 }

@@ -828,7 +830,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
         if (ret == NMI_DONE)
                 goto out;

-        if (sev_global >= GHES_SEV_PANIC) {
+        if (sev_global >= GHES_SEV_FATAL) {
                 oops_begin();
                 ghes_print_queued_estatus();
                 __ghes_print_estatus(KERN_EMERG, ghes_global->generic,
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index dfd60d0..7cefa89 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -39,10 +39,10 @@ struct ghes_estatus_cache {
 };

 enum {
-        GHES_SEV_NO = 0x0,
-        GHES_SEV_CORRECTED = 0x1,
-        GHES_SEV_RECOVERABLE = 0x2,
-        GHES_SEV_PANIC = 0x3,
+        GHES_SEV_RECOVERABLE = 0x0,
+        GHES_SEV_FATAL = 0x1,
+        GHES_SEV_CORRECTED = 0x2,
+        GHES_SEV_NONE = 0x3,
 };

 /* From drivers/edac/ghes_edac.c */

-Betty
Re: [PATCH v10 1/3] aerdrv: Trace Event for AER
On Mon, Dec 02, 2013 at 01:05:16PM +0800, rui wang wrote:
> > +    TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
> > +            __get_str(dev_name),
> > +            __entry->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
> > +                    __entry->severity == HW_EVENT_ERR_FATAL ?
> > +                    "Fatal" : "Uncorrected",
> > +            __entry->severity == HW_EVENT_ERR_CORRECTED ?
> > +            __print_flags(__entry->status, "|", aer_correctable_errors) :
> > +            __print_flags(__entry->status, "|", aer_uncorrectable_errors))
> > +);
>
> This causes inconsistency between dmesg and the trace event output.
> When dmesg says "severity=Corrected", the trace event says
> "severity=Fatal". What happens is that HW_EVENT_ERR_CORRECTED is
> defined in edac.h:
>
> enum hw_event_mc_err_type {
>         HW_EVENT_ERR_CORRECTED,
>         HW_EVENT_ERR_UNCORRECTED,
>         HW_EVENT_ERR_FATAL,
>         HW_EVENT_ERR_INFO,
> };
>
> while aer_print_error() uses aer_error_severity_string[] defined as:
>
> static const char *aer_error_severity_string[] = {
>         "Uncorrected (Non-Fatal)",
>         "Uncorrected (Fatal)",
>         "Corrected"
> };
>
> In this case dmesg is correct because info->severity is assigned in
> aer_isr_one_error() using the definitions in include/linux/ras.h:
> #define AER_NONFATAL    0
> #define AER_FATAL       1
> #define AER_CORRECTABLE 2
>
> So which one is the standard? Is there a plan to unify all these names?

Yes, the AER tracepoint above should use the AER_* defines and not the
HW_EVENT_ERR_* ones which are for memory errors.

Wanna send a fix?

Thanks.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
Re: [BUG] Re: [PATCH v10 1/3] aerdrv: Trace Event for AER
Rui,
   Agreed. There are many such confusing error type definitions that
need to be standardized or unified; some of them are ambiguous or
inconsistent, and some of them violate the ACPI/PCI specs.

According to the ACPI spec, 'FATAL' is in fact a sub-category of
'UNCORRECTABLE'; the other one is 'NON-FATAL'. So how should the
'UNCORRECTABLE's below be understood? Do they mean
uncorrectable-non-fatal? One can only guess, which is confusing.

acpi\actbl1.h

#define ACPI_EINJ_PROCESSOR_CORRECTABLE     (1)
#define ACPI_EINJ_PROCESSOR_UNCORRECTABLE   (1<<1)
#define ACPI_EINJ_PROCESSOR_FATAL           (1<<2)
#define ACPI_EINJ_MEMORY_CORRECTABLE        (1<<3)
#define ACPI_EINJ_MEMORY_UNCORRECTABLE      (1<<4)
#define ACPI_EINJ_MEMORY_FATAL              (1<<5)
#define ACPI_EINJ_PCIX_CORRECTABLE          (1<<6)
#define ACPI_EINJ_PCIX_UNCORRECTABLE        (1<<7)
#define ACPI_EINJ_PCIX_FATAL                (1<<8)
#define ACPI_EINJ_PLATFORM_CORRECTABLE      (1<<9)
#define ACPI_EINJ_PLATFORM_UNCORRECTABLE    (1<<10)
#define ACPI_EINJ_PLATFORM_FATAL            (1<<11)

edac.h

enum hw_event_mc_err_type {
        HW_EVENT_ERR_CORRECTED,
        HW_EVENT_ERR_UNCORRECTED,
        HW_EVENT_ERR_FATAL,
        HW_EVENT_ERR_INFO,
};

ghes.h
enum {
        GHES_SEV_NO = 0x0,
        GHES_SEV_CORRECTED = 0x1,
        GHES_SEV_RECOVERABLE = 0x2,
        GHES_SEV_PANIC = 0x3,
};

What's the meaning of GHES_SEV_PANIC? Why not 'FATAL', just as
described in ACPI spec section 18.3.2.6.1:
"
Error Severity 4 16 Identifies the error severity of the reported error:
0 – Recoverable
1 – Fatal
2 – Corrected
3 – None
"
If there is some other intention, note that it is translated into
'FATAL' later anyway:

        case GHES_SEV_PANIC:
                type = HW_EVENT_ERR_FATAL;

And these look reasonable:
aer.h

#define AER_NONFATAL    0
#define AER_FATAL       1
#define AER_CORRECTABLE 2

Thanks,
Ethan

On Wed, Dec 4, 2013 at 11:10 AM, rui wang wrote:
> Resending adding Mauro's new Email address...
>
>
> On 1/17/13, Lance Ortiz wrote:
>> This header file will define a new trace event that will be triggered when
>> a AER event occurs. The following data will be provided to the trace
>> event.
>> >> char * dev_name - The name of the slot where the device resides >> ([domain:]bus:device.function). >> >> u32 status - Either the correctable or uncorrectable register >> indicating what error or errors have been see. >> >> u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED >> >> The trace event will also provide a trace string that may look like: >> >> ":05:00.0 PCIe Bus Error:severity=Uncorrected (Non-Fatal), Poisoned >> TLP" >> >> v1-v2 Move header from include/ras/aer_event.h to >> include/trace/events/ras.h >> v3-v4 Cleaned up comments and commit header >> v4-v5 More cleanup remove () from if statement in print. >> Renamed string define to be more specific. >> v5-v6 change TRACE_SYSTEM define to be ras and not aer. >> >> Signed-off-by: Lance Ortiz >> Acked-by: Mauro Carvalho Chehab >> Acked-by: Tony Luck >> --- >> >> include/trace/events/ras.h | 77 >> >> 1 files changed, 77 insertions(+), 0 deletions(-) >> create mode 100644 include/trace/events/ras.h >> >> diff --git a/include/trace/events/ras.h b/include/trace/events/ras.h >> new file mode 100644 >> index 000..88b8783 >> --- /dev/null >> +++ b/include/trace/events/ras.h >> @@ -0,0 +1,77 @@ >> +#undef TRACE_SYSTEM >> +#define TRACE_SYSTEM ras >> + >> +#if !defined(_TRACE_AER_H) || defined(TRACE_HEADER_MULTI_READ) >> +#define _TRACE_AER_H >> + >> +#include >> +#include >> + >> + >> +/* >> + * PCIe AER Trace event >> + * >> + * These events are generated when hardware detects a corrected or >> + * uncorrected event on a PCIe device. The event report has >> + * the following structure: >> + * >> + * char * dev_name - The name of the slot where the device resides >> + * ([domain:]bus:device.function). 
>> + * u32 status - Either the correctable or uncorrectable >> register >> + * indicating what error or errors have been seen >> + * u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED >> + */ >> + >> +#define aer_correctable_errors \ >> + {BIT(0),"Receiver Error"}, \ >> + {BIT(6),"Bad TLP"}, \ >> + {BIT(7),"Bad DLLP"},\ >> + {BIT(8),"RELAY_NUM Rollover"}, \ >> + {BIT(12), "Replay Timer Timeout"},\ >> + {BIT(13), "Advisory Non-Fatal"} >> + >> +#define aer_uncorrectable_errors \ >> + {BIT(4),"Data Link Protocol"}, \ >> + {BIT(12), "Poisoned TLP"},\ >> + {BIT(13), "Flow Control Protocol"}, \ >> + {BIT(14), "Completion Timeout"}, \ >> + {BIT(15), "Completer Abort"}, \ >> + {BIT(16), "Unexpected Completion"}, \ >> + {BIT(17), "Receiver Overflow"},
[BUG] Re: [PATCH v10 1/3] aerdrv: Trace Event for AER
Resending adding Mauro's new Email address... On 1/17/13, Lance Ortiz wrote: > This header file will define a new trace event that will be triggered when > a AER event occurs. The following data will be provided to the trace > event. > > char * dev_name - The name of the slot where the device resides > ([domain:]bus:device.function). > > u32 status - Either the correctable or uncorrectable register > indicating what error or errors have been see. > > u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED > > The trace event will also provide a trace string that may look like: > > ":05:00.0 PCIe Bus Error:severity=Uncorrected (Non-Fatal), Poisoned > TLP" > > v1-v2 Move header from include/ras/aer_event.h to > include/trace/events/ras.h > v3-v4 Cleaned up comments and commit header > v4-v5 More cleanup remove () from if statement in print. > Renamed string define to be more specific. > v5-v6 change TRACE_SYSTEM define to be ras and not aer. > > Signed-off-by: Lance Ortiz > Acked-by: Mauro Carvalho Chehab > Acked-by: Tony Luck > --- > > include/trace/events/ras.h | 77 > > 1 files changed, 77 insertions(+), 0 deletions(-) > create mode 100644 include/trace/events/ras.h > > diff --git a/include/trace/events/ras.h b/include/trace/events/ras.h > new file mode 100644 > index 000..88b8783 > --- /dev/null > +++ b/include/trace/events/ras.h > @@ -0,0 +1,77 @@ > +#undef TRACE_SYSTEM > +#define TRACE_SYSTEM ras > + > +#if !defined(_TRACE_AER_H) || defined(TRACE_HEADER_MULTI_READ) > +#define _TRACE_AER_H > + > +#include > +#include > + > + > +/* > + * PCIe AER Trace event > + * > + * These events are generated when hardware detects a corrected or > + * uncorrected event on a PCIe device. The event report has > + * the following structure: > + * > + * char * dev_name - The name of the slot where the device resides > + * ([domain:]bus:device.function). 
> + * u32 status - Either the correctable or uncorrectable register > + * indicating what error or errors have been seen > + * u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED > + */ > + > +#define aer_correctable_errors \ > + {BIT(0),"Receiver Error"}, \ > + {BIT(6),"Bad TLP"}, \ > + {BIT(7),"Bad DLLP"},\ > + {BIT(8),"RELAY_NUM Rollover"}, \ > + {BIT(12), "Replay Timer Timeout"},\ > + {BIT(13), "Advisory Non-Fatal"} > + > +#define aer_uncorrectable_errors \ > + {BIT(4),"Data Link Protocol"}, \ > + {BIT(12), "Poisoned TLP"},\ > + {BIT(13), "Flow Control Protocol"}, \ > + {BIT(14), "Completion Timeout"}, \ > + {BIT(15), "Completer Abort"}, \ > + {BIT(16), "Unexpected Completion"}, \ > + {BIT(17), "Receiver Overflow"}, \ > + {BIT(18), "Malformed TLP"}, \ > + {BIT(19), "ECRC"},\ > + {BIT(20), "Unsupported Request"} > + > +TRACE_EVENT(aer_event, > + TP_PROTO(const char *dev_name, > + const u32 status, > + const u8 severity), > + > + TP_ARGS(dev_name, status, severity), > + > + TP_STRUCT__entry( > + __string( dev_name, dev_name) > + __field(u32,status ) > + __field(u8, severity) > + ), > + > + TP_fast_assign( > + __assign_str(dev_name, dev_name); > + __entry->status = status; > + __entry->severity = severity; > + ), > + > + TP_printk("%s PCIe Bus Error: severity=%s, %s\n", > + __get_str(dev_name), > + __entry->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" : > + __entry->severity == HW_EVENT_ERR_FATAL ? > + "Fatal" : "Uncorrected", > + __entry->severity == HW_EVENT_ERR_CORRECTED ? > + __print_flags(__entry->status, "|", aer_correctable_errors) : > + __print_flags(__entry->status, "|", aer_uncorrectable_errors)) > +); Here's a bug causing inconsistency between dmesg and the trace event output. When dmesg says "severity=Corrected", the trace event says "severity=Fatal". 
What happens is that HW_EVENT_ERR_CORRECTED is defined in edac.h: enum hw_event_mc_err_type { HW_EVENT_ERR_CORRECTED, HW_EVENT_ERR_UNCORRECTED, HW_EVENT_ERR_FATAL, HW_EVENT_ERR_INFO, }; while aer_print_error() uses aer_error_severity_string[] defined as: static const char *aer_error_severity_string[] = { "Uncorrected (Non-Fatal)", "Uncorrected (Fatal)", "Corrected" }; In this case dmesg is correct because in
Re: [PATCH v10 1/3] aerdrv: Trace Event for AER
On 1/17/13, Lance Ortiz wrote: > This header file will define a new trace event that will be triggered when > a AER event occurs. The following data will be provided to the trace > event. > > char * dev_name - The name of the slot where the device resides > ([domain:]bus:device.function). > > u32 status - Either the correctable or uncorrectable register > indicating what error or errors have been see. > > u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED > > The trace event will also provide a trace string that may look like: > > ":05:00.0 PCIe Bus Error:severity=Uncorrected (Non-Fatal), Poisoned > TLP" > > v1-v2 Move header from include/ras/aer_event.h to > include/trace/events/ras.h > v3-v4 Cleaned up comments and commit header > v4-v5 More cleanup remove () from if statement in print. > Renamed string define to be more specific. > v5-v6 change TRACE_SYSTEM define to be ras and not aer. > > Signed-off-by: Lance Ortiz > Acked-by: Mauro Carvalho Chehab > Acked-by: Tony Luck > --- > > include/trace/events/ras.h | 77 > > 1 files changed, 77 insertions(+), 0 deletions(-) > create mode 100644 include/trace/events/ras.h > > diff --git a/include/trace/events/ras.h b/include/trace/events/ras.h > new file mode 100644 > index 000..88b8783 > --- /dev/null > +++ b/include/trace/events/ras.h > @@ -0,0 +1,77 @@ > +#undef TRACE_SYSTEM > +#define TRACE_SYSTEM ras > + > +#if !defined(_TRACE_AER_H) || defined(TRACE_HEADER_MULTI_READ) > +#define _TRACE_AER_H > + > +#include > +#include > + > + > +/* > + * PCIe AER Trace event > + * > + * These events are generated when hardware detects a corrected or > + * uncorrected event on a PCIe device. The event report has > + * the following structure: > + * > + * char * dev_name - The name of the slot where the device resides > + * ([domain:]bus:device.function). 
> + * u32 status - Either the correctable or uncorrectable register > + * indicating what error or errors have been seen > + * u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED > + */ > + > +#define aer_correctable_errors \ > + {BIT(0),"Receiver Error"}, \ > + {BIT(6),"Bad TLP"}, \ > + {BIT(7),"Bad DLLP"},\ > + {BIT(8),"RELAY_NUM Rollover"}, \ > + {BIT(12), "Replay Timer Timeout"},\ > + {BIT(13), "Advisory Non-Fatal"} > + > +#define aer_uncorrectable_errors \ > + {BIT(4),"Data Link Protocol"}, \ > + {BIT(12), "Poisoned TLP"},\ > + {BIT(13), "Flow Control Protocol"}, \ > + {BIT(14), "Completion Timeout"}, \ > + {BIT(15), "Completer Abort"}, \ > + {BIT(16), "Unexpected Completion"}, \ > + {BIT(17), "Receiver Overflow"}, \ > + {BIT(18), "Malformed TLP"}, \ > + {BIT(19), "ECRC"},\ > + {BIT(20), "Unsupported Request"} > + > +TRACE_EVENT(aer_event, > + TP_PROTO(const char *dev_name, > + const u32 status, > + const u8 severity), > + > + TP_ARGS(dev_name, status, severity), > + > + TP_STRUCT__entry( > + __string( dev_name, dev_name) > + __field(u32,status ) > + __field(u8, severity) > + ), > + > + TP_fast_assign( > + __assign_str(dev_name, dev_name); > + __entry->status = status; > + __entry->severity = severity; > + ), > + > + TP_printk("%s PCIe Bus Error: severity=%s, %s\n", > + __get_str(dev_name), > + __entry->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" : > + __entry->severity == HW_EVENT_ERR_FATAL ? > + "Fatal" : "Uncorrected", > + __entry->severity == HW_EVENT_ERR_CORRECTED ? > + __print_flags(__entry->status, "|", aer_correctable_errors) : > + __print_flags(__entry->status, "|", aer_uncorrectable_errors)) > +); This causes inconsistency between dmesg and the trace event output. When dmesg says "severity=Corrected", the trace event says "severity=Fatal". 
What happens is that HW_EVENT_ERR_CORRECTED is defined in edac.h: enum hw_event_mc_err_type { HW_EVENT_ERR_CORRECTED, HW_EVENT_ERR_UNCORRECTED, HW_EVENT_ERR_FATAL, HW_EVENT_ERR_INFO, }; while aer_print_error() uses aer_error_severity_string[] defined as: static const char *aer_error_severity_string[] = { "Uncorrected (Non-Fatal)", "Uncorrected (Fatal)", "Corrected" }; In this case dmesg is correct because info->severity is assigned in aer_isr_one_error() using the
[PATCH v10 1/3] aerdrv: Trace Event for AER
This header file will define a new trace event that will be triggered when
an AER event occurs. The following data will be provided to the trace
event.

char * dev_name - The name of the slot where the device resides
                  ([domain:]bus:device.function).

u32 status      - Either the correctable or uncorrectable register
                  indicating what error or errors have been seen.

u8 severity     - error severity 0:NONFATAL 1:FATAL 2:CORRECTED

The trace event will also provide a trace string that may look like:

":05:00.0 PCIe Bus Error:severity=Uncorrected (Non-Fatal), Poisoned
TLP"

v1-v2 Move header from include/ras/aer_event.h to
      include/trace/events/ras.h
v3-v4 Cleaned up comments and commit header
v4-v5 More cleanup remove () from if statement in print.
      Renamed string define to be more specific.
v5-v6 change TRACE_SYSTEM define to be ras and not aer.

Signed-off-by: Lance Ortiz
Acked-by: Mauro Carvalho Chehab
Acked-by: Tony Luck
---

 include/trace/events/ras.h |   77
 1 files changed, 77 insertions(+), 0 deletions(-)
 create mode 100644 include/trace/events/ras.h

diff --git a/include/trace/events/ras.h b/include/trace/events/ras.h
new file mode 100644
index 000..88b8783
--- /dev/null
+++ b/include/trace/events/ras.h
@@ -0,0 +1,77 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM ras
+
+#if !defined(_TRACE_AER_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_AER_H
+
+#include
+#include
+
+
+/*
+ * PCIe AER Trace event
+ *
+ * These events are generated when hardware detects a corrected or
+ * uncorrected event on a PCIe device. The event report has
+ * the following structure:
+ *
+ * char * dev_name - The name of the slot where the device resides
+ *                   ([domain:]bus:device.function).
+ * u32 status      - Either the correctable or uncorrectable register
+ *                   indicating what error or errors have been seen
+ * u8 severity     - error severity 0:NONFATAL 1:FATAL 2:CORRECTED
+ */
+
+#define aer_correctable_errors              \
+        {BIT(0),  "Receiver Error"},        \
+        {BIT(6),  "Bad TLP"},               \
+        {BIT(7),  "Bad DLLP"},              \
+        {BIT(8),  "RELAY_NUM Rollover"},    \
+        {BIT(12), "Replay Timer Timeout"},  \
+        {BIT(13), "Advisory Non-Fatal"}
+
+#define aer_uncorrectable_errors            \
+        {BIT(4),  "Data Link Protocol"},    \
+        {BIT(12), "Poisoned TLP"},          \
+        {BIT(13), "Flow Control Protocol"}, \
+        {BIT(14), "Completion Timeout"},    \
+        {BIT(15), "Completer Abort"},       \
+        {BIT(16), "Unexpected Completion"}, \
+        {BIT(17), "Receiver Overflow"},     \
+        {BIT(18), "Malformed TLP"},         \
+        {BIT(19), "ECRC"},                  \
+        {BIT(20), "Unsupported Request"}
+
+TRACE_EVENT(aer_event,
+        TP_PROTO(const char *dev_name,
+                 const u32 status,
+                 const u8 severity),
+
+        TP_ARGS(dev_name, status, severity),
+
+        TP_STRUCT__entry(
+                __string(       dev_name,       dev_name        )
+                __field(        u32,            status          )
+                __field(        u8,             severity        )
+        ),
+
+        TP_fast_assign(
+                __assign_str(dev_name, dev_name);
+                __entry->status         = status;
+                __entry->severity       = severity;
+        ),
+
+        TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
+                __get_str(dev_name),
+                __entry->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
+                        __entry->severity == HW_EVENT_ERR_FATAL ?
+                        "Fatal" : "Uncorrected",
+                __entry->severity == HW_EVENT_ERR_CORRECTED ?
+                __print_flags(__entry->status, "|", aer_correctable_errors) :
+                __print_flags(__entry->status, "|", aer_uncorrectable_errors))
+);
+
+#endif /* _TRACE_AER_H */
+
+/* This part must be outside protection */
+#include