Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On 2014/6/28 6:10, Luck, Tony wrote: >>> Not all machine checks are fatal - it would be bad for us to go into >>> an infinite spin instead of executing the recovery code. >> >> Then for the time being extlog shouldn't hook into the decoder chain >> but into mce_process_work, i.e. the last should call it. Or maybe add >> another notifier which is not atomic... > > I spoke too quickly. The only MCE for which we have recovery code are > those that hit in application code. So the processor that is trying to do > the printk() can't possibly be holding the locks. Other processors might > have held the lock at the time of the MCE - but they have all returned > from the handler at the time we try the printk - so they will make progess > and release the lock so that we can acquire it. Thank you for your reply. When we got a MCE which hit in application code, it will be broadcast to other processors immediately. Other processors who might have held the lock at the time of MCE, have no chance to release the lock and return from the printk. Isn't it? I know this rarely happens in production environments, but I think it's still a risk here. So it's very good if we have a printk safe in atomic context in the future. -- Thanks, XiuQi > > -Tony > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On 2014/6/28 6:10, Luck, Tony wrote: Not all machine checks are fatal - it would be bad for us to go into an infinite spin instead of executing the recovery code. Then for the time being extlog shouldn't hook into the decoder chain but into mce_process_work, i.e. the last should call it. Or maybe add another notifier which is not atomic... I spoke too quickly. The only MCE for which we have recovery code are those that hit in application code. So the processor that is trying to do the printk() can't possibly be holding the locks. Other processors might have held the lock at the time of the MCE - but they have all returned from the handler at the time we try the printk - so they will make progess and release the lock so that we can acquire it. Thank you for your reply. When we got a MCE which hit in application code, it will be broadcast to other processors immediately. Other processors who might have held the lock at the time of MCE, have no chance to release the lock and return from the printk. Isn't it? I know this rarely happens in production environments, but I think it's still a risk here. So it's very good if we have a printk safe in atomic context in the future. -- Thanks, XiuQi -Tony -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On Fri, Jun 27, 2014 at 10:10:48PM +, Luck, Tony wrote: > I spoke too quickly. The only MCE for which we have recovery code are > those that hit in application code. So the processor that is trying to > do the printk() can't possibly be holding the locks. Other processors > might have held the lock at the time of the MCE - but they have all > returned from the handler at the time we try the printk - so they will > make progess and release the lock so that we can acquire it. That could explain why we're not seeing hangs left and right. :-) -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
>> Not all machine checks are fatal - it would be bad for us to go into >> an infinite spin instead of executing the recovery code. > > Then for the time being extlog shouldn't hook into the decoder chain > but into mce_process_work, i.e. the last should call it. Or maybe add > another notifier which is not atomic... I spoke too quickly. The only MCE for which we have recovery code are those that hit in application code. So the processor that is trying to do the printk() can't possibly be holding the locks. Other processors might have held the lock at the time of the MCE - but they have all returned from the handler at the time we try the printk - so they will make progess and release the lock so that we can acquire it. -Tony
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On Fri, Jun 27, 2014 at 08:43:14PM +, Luck, Tony wrote: > Not all machine checks are fatal - it would be bad for us to go into > an infinite spin instead of executing the recovery code. Then for the time being extlog shouldn't hook into the decoder chain but into mce_process_work, i.e. the last should call it. Or maybe add another notifier which is not atomic... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
>> There's a logbuf_lock in printk. If logbuf_lock is held by other cpu, >> it'll lead to an infinity spin here. Isn't it? > > Yes, but we want to take the risk and print something out before the > machine dies instead of waiting to get into printk-safe context first > and maybe corrupt state. Not all machine checks are fatal - it would be bad for us to go into an infinite spin instead of executing the recovery code. > Besides, there's work currently going on to make printk safe in atomic > context so... Good - we need this. -Tony
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On Fri, Jun 27, 2014 at 01:34:45PM +0800, Xie XiuQi wrote: > The call graph is like this, > do_machine_check > -> mce_log > -> atomic_notifier_call_chain(_mce_decoder_chain ...) >-> ... > -> extlog_print > -> print_extlog_rcd > -> __print_extlog_rcd >-> printk > > There's a logbuf_lock in printk. If logbuf_lock is held by other cpu, > it'll lead to an infinity spin here. Isn't it? Yes, but we want to take the risk and print something out before the machine dies instead of waiting to get into printk-safe context first and maybe corrupt state. Besides, there's work currently going on to make printk safe in atomic context so... -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On Fri, Jun 27, 2014 at 01:34:45PM +0800, Xie XiuQi wrote: The call graph is like this, do_machine_check - mce_log - atomic_notifier_call_chain(x86_mce_decoder_chain ...) - ... - extlog_print - print_extlog_rcd - __print_extlog_rcd - printk There's a logbuf_lock in printk. If logbuf_lock is held by other cpu, it'll lead to an infinity spin here. Isn't it? Yes, but we want to take the risk and print something out before the machine dies instead of waiting to get into printk-safe context first and maybe corrupt state. Besides, there's work currently going on to make printk safe in atomic context so... -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
There's a logbuf_lock in printk. If logbuf_lock is held by other cpu, it'll lead to an infinity spin here. Isn't it? Yes, but we want to take the risk and print something out before the machine dies instead of waiting to get into printk-safe context first and maybe corrupt state. Not all machine checks are fatal - it would be bad for us to go into an infinite spin instead of executing the recovery code. Besides, there's work currently going on to make printk safe in atomic context so... Good - we need this. -Tony
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On Fri, Jun 27, 2014 at 08:43:14PM +, Luck, Tony wrote: Not all machine checks are fatal - it would be bad for us to go into an infinite spin instead of executing the recovery code. Then for the time being extlog shouldn't hook into the decoder chain but into mce_process_work, i.e. the last should call it. Or maybe add another notifier which is not atomic... -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
Not all machine checks are fatal - it would be bad for us to go into an infinite spin instead of executing the recovery code. Then for the time being extlog shouldn't hook into the decoder chain but into mce_process_work, i.e. the last should call it. Or maybe add another notifier which is not atomic... I spoke too quickly. The only MCE for which we have recovery code are those that hit in application code. So the processor that is trying to do the printk() can't possibly be holding the locks. Other processors might have held the lock at the time of the MCE - but they have all returned from the handler at the time we try the printk - so they will make progess and release the lock so that we can acquire it. -Tony
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On Fri, Jun 27, 2014 at 10:10:48PM +, Luck, Tony wrote: I spoke too quickly. The only MCE for which we have recovery code are those that hit in application code. So the processor that is trying to do the printk() can't possibly be holding the locks. Other processors might have held the lock at the time of the MCE - but they have all returned from the handler at the time we try the printk - so they will make progess and release the lock so that we can acquire it. That could explain why we're not seeing hangs left and right. :-) -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On 2013/10/18 20:37, Naveen N. Rao wrote: > On 10/18/2013 01:53 PM, Chen, Gong wrote: >> This H/W error log driver (a.k.a eMCA driver) is implemented based on >> http://www.intel.com/content/www/us/en/architecture-and-technology/enhanced-mca-logging-xeon-paper.html >> >> After errors are captured, more valuable information can be >> got via this new enhanced H/W error log driver. >> >> v3 -> v2: fix a MACRO definition error and some cleanup >> v2 -> v1: eliminate spin_lock & minor fixes suggested by Boris >> >> Signed-off-by: Chen, Gong >> --- >> arch/x86/include/asm/mce.h | 5 + >> arch/x86/kernel/cpu/mcheck/mce.c | 20 +++ >> drivers/acpi/Kconfig | 20 +++ >> drivers/acpi/Makefile| 2 + [...] >> +} >> +EXPORT_SYMBOL_GPL(unregister_elog_handler); >> + >> /* >>* Poll for corrected events or events that happened before reset. >>* Those are just logged through /dev/mcelog. >> @@ -624,6 +641,9 @@ void machine_check_poll(enum mcp_flags flags, >> mce_banks_t *b) >> (m.status & (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC))) >> continue; >> >> +if (mce_ext_err_print) >> +mce_ext_err_print(NULL, m.extcpu, i); >> + > > Can we use the notifier chain we already have: mce_register_decode_chain()? > EDAC uses this and I'm wondering if it is a good fit here. As an added bonus, > it seems to honor dont_log_ce option as well. Hi everyone, I have a question here, is it safe when we use printk in MCE context? The call graph is like this, do_machine_check -> mce_log -> atomic_notifier_call_chain(_mce_decoder_chain ...) -> ... -> extlog_print -> print_extlog_rcd -> __print_extlog_rcd -> printk There's a logbuf_lock in printk. If logbuf_lock is held by other cpu, it'll lead to an infinity spin here. Isn't it? -- Thanks, XiuQi > >> mce_read_aux(, i); >> >> if (!(flags & MCP_TIMESTAMP)) >> diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig >> index 22327e6..c67ec61 100644 >> --- a/drivers/acpi/Kconfig >> +++ b/drivers/acpi/Kconfig >> @@ -372,4 +372,24 @@ config ACPI_BGRT >> >> source "drivers/acpi/apei/Kconfig" >> >> +config ACPI_EXTLOG >> +tristate "Extended Error Log support" >> +depends on X86_MCE ... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On 2013/10/18 20:37, Naveen N. Rao wrote: On 10/18/2013 01:53 PM, Chen, Gong wrote: This H/W error log driver (a.k.a eMCA driver) is implemented based on http://www.intel.com/content/www/us/en/architecture-and-technology/enhanced-mca-logging-xeon-paper.html After errors are captured, more valuable information can be got via this new enhanced H/W error log driver. v3 - v2: fix a MACRO definition error and some cleanup v2 - v1: eliminate spin_lock minor fixes suggested by Boris Signed-off-by: Chen, Gong gong.c...@linux.intel.com --- arch/x86/include/asm/mce.h | 5 + arch/x86/kernel/cpu/mcheck/mce.c | 20 +++ drivers/acpi/Kconfig | 20 +++ drivers/acpi/Makefile| 2 + [...] +} +EXPORT_SYMBOL_GPL(unregister_elog_handler); + /* * Poll for corrected events or events that happened before reset. * Those are just logged through /dev/mcelog. @@ -624,6 +641,9 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b) (m.status (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC))) continue; +if (mce_ext_err_print) +mce_ext_err_print(NULL, m.extcpu, i); + Can we use the notifier chain we already have: mce_register_decode_chain()? EDAC uses this and I'm wondering if it is a good fit here. As an added bonus, it seems to honor dont_log_ce option as well. Hi everyone, I have a question here, is it safe when we use printk in MCE context? The call graph is like this, do_machine_check - mce_log - atomic_notifier_call_chain(x86_mce_decoder_chain ...) - ... - extlog_print - print_extlog_rcd - __print_extlog_rcd - printk There's a logbuf_lock in printk. If logbuf_lock is held by other cpu, it'll lead to an infinity spin here. Isn't it? -- Thanks, XiuQi mce_read_aux(m, i); if (!(flags MCP_TIMESTAMP)) diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig index 22327e6..c67ec61 100644 --- a/drivers/acpi/Kconfig +++ b/drivers/acpi/Kconfig @@ -372,4 +372,24 @@ config ACPI_BGRT source drivers/acpi/apei/Kconfig +config ACPI_EXTLOG +tristate Extended Error Log support +depends on X86_MCE ... -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On 10/22/2013 12:33 AM, Luck, Tony wrote: But yes, this is possible and it would make it all even cleaner and simpler by simply not needing the reg/dereg interfaces for mce_ext_err_print but adding it to the chain. So this is on top of the 9 patch series (using the V4 that Chen Gong posted for part 4/9 and V3 for all the others). Obviously it should be folded back into the series if we go this way. It's a bit simplistic right now - the registered function just returns NOTIFY_DONE in all cases so it will not disturb processing by any other registered functions - we can make it smarter later. Looks good. We obviously need to ensure this gets called before EDAC, if at all. The other question is w.r.t conflicts with EDAC, which we can re-visit as part of the discussions around a new trace event. Thanks, Naveen -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On Mon, Oct 21, 2013 at 03:39:20PM -0700, Tony Luck wrote: > I folded that back into the series. Also switched out the test on > whether to print the "No further action is required" message to only > do so for corrected errors. Cleaned up some of the commit messages, > > The result is sitting at: > git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git eMCA > > Anything we missed? Doesn't look so, at a first glance. But I agree with you - this stuff will be subject to change as we go along and we make up our mind about what exactly is sufficient and necessary to do proper decoding. I like the idea of keeping an open mind about it. :-) Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On Mon, Oct 21, 2013 at 03:39:20PM -0700, Tony Luck wrote: I folded that back into the series. Also switched out the test on whether to print the No further action is required message to only do so for corrected errors. Cleaned up some of the commit messages, The result is sitting at: git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git eMCA Anything we missed? Doesn't look so, at a first glance. But I agree with you - this stuff will be subject to change as we go along and we make up our mind about what exactly is sufficient and necessary to do proper decoding. I like the idea of keeping an open mind about it. :-) Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On 10/22/2013 12:33 AM, Luck, Tony wrote: But yes, this is possible and it would make it all even cleaner and simpler by simply not needing the reg/dereg interfaces for mce_ext_err_print but adding it to the chain. So this is on top of the 9 patch series (using the V4 that Chen Gong posted for part 4/9 and V3 for all the others). Obviously it should be folded back into the series if we go this way. It's a bit simplistic right now - the registered function just returns NOTIFY_DONE in all cases so it will not disturb processing by any other registered functions - we can make it smarter later. Looks good. We obviously need to ensure this gets called before EDAC, if at all. The other question is w.r.t conflicts with EDAC, which we can re-visit as part of the discussions around a new trace event. Thanks, Naveen -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On Mon, Oct 21, 2013 at 12:03 PM, Luck, Tony wrote: > So this is on top of the 9 patch series (using the V4 that Chen Gong > posted for part 4/9 and V3 for all the others). Obviously it should > be folded back into the series if we go this way. > > It's a bit simplistic right now - the registered function just returns > NOTIFY_DONE in all cases so it will not disturb processing by any other > registered functions - we can make it smarter later. I folded that back into the series. Also switched out the test on whether to print the "No further action is required" message to only do so for corrected errors. Cleaned up some of the commit messages, The result is sitting at: git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git eMCA Anything we missed? -Tony -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
> But yes, this is possible and it would make it all even cleaner > and simpler by simply not needing the reg/dereg interfaces for > mce_ext_err_print but adding it to the chain. So this is on top of the 9 patch series (using the V4 that Chen Gong posted for part 4/9 and V3 for all the others). Obviously it should be folded back into the series if we go this way. It's a bit simplistic right now - the registered function just returns NOTIFY_DONE in all cases so it will not disturb processing by any other registered functions - we can make it smarter later. -Tony --- diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h index 072b2f80a345..8b8e72522737 100644 --- a/arch/x86/include/asm/mce.h +++ b/arch/x86/include/asm/mce.h @@ -188,9 +188,6 @@ enum mcp_flags { MCP_DONTLOG = (1 << 2), /* only clear, don't log */ }; -void register_elog_handler(int (*f)(const char *, int, int)); -void unregister_elog_handler(int (*f)(const char *, int, int)); - void machine_check_poll(enum mcp_flags flags, mce_banks_t *b); int mce_notify_irq(void); diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index 981e0d3ed49d..b3218cdee95f 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -48,8 +48,6 @@ #include "mce-internal.h" -static int (*mce_ext_err_print)(const char *, int, int); - static DEFINE_MUTEX(mce_chrdev_read_mutex); #define rcu_dereference_check_mce(p) \ @@ -578,21 +576,6 @@ static void mce_read_aux(struct mce *m, int i) DEFINE_PER_CPU(unsigned, mce_poll_count); -void register_elog_handler(int (*f)(const char *, int, int)) -{ - mce_ext_err_print = f; -} -EXPORT_SYMBOL_GPL(register_elog_handler); - -void unregister_elog_handler(int (*f)(const char *, int, int)) -{ - if (f) { - WARN_ON(mce_ext_err_print != f); - mce_ext_err_print = NULL; - } -} -EXPORT_SYMBOL_GPL(unregister_elog_handler); - /* * Poll for corrected events or events that happened before reset. * Those are just logged through /dev/mcelog. @@ -641,9 +624,6 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b) (m.status & (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC))) continue; - if (mce_ext_err_print) - mce_ext_err_print(NULL, m.extcpu, i); - mce_read_aux(, i); if (!(flags & MCP_TIMESTAMP)) diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c index 1bc657d3d053..eb0d7792ecc1 100644 --- a/drivers/acpi/acpi_extlog.c +++ b/drivers/acpi/acpi_extlog.c @@ -130,22 +130,26 @@ static int print_extlog_rcd(const char *pfx, return 1; } -static int extlog_print(const char *pfx, int cpu, int bank) +static int extlog_print(struct notifier_block *nb, unsigned long val, + void *data) { + struct mce *mce = (struct mce *)data; + int bank = mce->bank; + int cpu = mce->extcpu; struct acpi_generic_status *estatus; int rc; estatus = extlog_elog_entry_check(cpu, bank); if (estatus == NULL) - return -EINVAL; + return NOTIFY_DONE; memcpy(elog_buf, (void *)estatus, ELOG_ENTRY_LEN); /* clear record status to enable BIOS to update it again */ estatus->block_status = 0; - rc = print_extlog_rcd(pfx, (struct acpi_generic_status *)elog_buf, cpu); + rc = print_extlog_rcd(NULL, (struct acpi_generic_status *)elog_buf, cpu); - return rc; + return NOTIFY_DONE; } static int extlog_get_dsm(acpi_handle handle, int rev, int func, u64 *ret) @@ -213,6 +217,9 @@ static bool extlog_get_l1addr(void) return true; } +static struct notifier_block extlog_mce_dec = { + .notifier_call = extlog_print, +}; static int __init extlog_init(void) { @@ -279,7 +286,7 @@ static int __init extlog_init(void) if (elog_buf == NULL) goto err_release_elog; - register_elog_handler(extlog_print); + mce_register_decode_chain(_mce_dec); /* enable OS to be involved to take over management from BIOS */ ((struct extlog_l1_head *)extlog_l1_addr)->flags |= FLAG_OS_OPTIN; @@ -300,7 +307,7 @@ err: static void __exit extlog_exit(void) { - unregister_elog_handler(extlog_print); + mce_unregister_decode_chain(_mce_dec); ((struct extlog_l1_head *)extlog_l1_addr)->flags &= ~FLAG_OS_OPTIN; if (extlog_l1_addr) acpi_os_unmap_memory(extlog_l1_addr, l1_size); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On 10/20/2013 01:51 PM, Borislav Petkov wrote: On Sun, Oct 20, 2013 at 03:06:15AM -0400, Chen Gong wrote: Oh, yes it is. Furthermore, it reminds me where is the best place to put cper.c from I write this patch series. CPER really doesn't dpend on APEI even ACPI. Maybe lib/ ia an option. I can update this patch and if it is OK, I can add another separate patch to change this dependency. Make sense? Yeah, for some reason it is part of the UEFI spec but APEI uses it too. Well, I guess you can add it there as "default n" and have the rest of the code select it in Kconfig. Yup, I think that would be a good idea to just separate out the CPER stuff from the APEI code, though I think your enhanced MCA logging code will need to depend on both CPER and ACPI since you use the ACPI structures as well. Thanks, Naveen -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On 10/20/2013 01:51 PM, Borislav Petkov wrote: On Sun, Oct 20, 2013 at 03:06:15AM -0400, Chen Gong wrote: Oh, yes it is. Furthermore, it reminds me where is the best place to put cper.c from I write this patch series. CPER really doesn't dpend on APEI even ACPI. Maybe lib/ ia an option. I can update this patch and if it is OK, I can add another separate patch to change this dependency. Make sense? Yeah, for some reason it is part of the UEFI spec but APEI uses it too. Well, I guess you can add it there as default n and have the rest of the code select it in Kconfig. Yup, I think that would be a good idea to just separate out the CPER stuff from the APEI code, though I think your enhanced MCA logging code will need to depend on both CPER and ACPI since you use the ACPI structures as well. Thanks, Naveen -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
But yes, this is possible and it would make it all even cleaner and simpler by simply not needing the reg/dereg interfaces for mce_ext_err_print but adding it to the chain. So this is on top of the 9 patch series (using the V4 that Chen Gong posted for part 4/9 and V3 for all the others). Obviously it should be folded back into the series if we go this way. It's a bit simplistic right now - the registered function just returns NOTIFY_DONE in all cases so it will not disturb processing by any other registered functions - we can make it smarter later. -Tony --- diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h index 072b2f80a345..8b8e72522737 100644 --- a/arch/x86/include/asm/mce.h +++ b/arch/x86/include/asm/mce.h @@ -188,9 +188,6 @@ enum mcp_flags { MCP_DONTLOG = (1 2), /* only clear, don't log */ }; -void register_elog_handler(int (*f)(const char *, int, int)); -void unregister_elog_handler(int (*f)(const char *, int, int)); - void machine_check_poll(enum mcp_flags flags, mce_banks_t *b); int mce_notify_irq(void); diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index 981e0d3ed49d..b3218cdee95f 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -48,8 +48,6 @@ #include mce-internal.h -static int (*mce_ext_err_print)(const char *, int, int); - static DEFINE_MUTEX(mce_chrdev_read_mutex); #define rcu_dereference_check_mce(p) \ @@ -578,21 +576,6 @@ static void mce_read_aux(struct mce *m, int i) DEFINE_PER_CPU(unsigned, mce_poll_count); -void register_elog_handler(int (*f)(const char *, int, int)) -{ - mce_ext_err_print = f; -} -EXPORT_SYMBOL_GPL(register_elog_handler); - -void unregister_elog_handler(int (*f)(const char *, int, int)) -{ - if (f) { - WARN_ON(mce_ext_err_print != f); - mce_ext_err_print = NULL; - } -} -EXPORT_SYMBOL_GPL(unregister_elog_handler); - /* * Poll for corrected events or events that happened before reset. * Those are just logged through /dev/mcelog. @@ -641,9 +624,6 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b) (m.status (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC))) continue; - if (mce_ext_err_print) - mce_ext_err_print(NULL, m.extcpu, i); - mce_read_aux(m, i); if (!(flags MCP_TIMESTAMP)) diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c index 1bc657d3d053..eb0d7792ecc1 100644 --- a/drivers/acpi/acpi_extlog.c +++ b/drivers/acpi/acpi_extlog.c @@ -130,22 +130,26 @@ static int print_extlog_rcd(const char *pfx, return 1; } -static int extlog_print(const char *pfx, int cpu, int bank) +static int extlog_print(struct notifier_block *nb, unsigned long val, + void *data) { + struct mce *mce = (struct mce *)data; + int bank = mce-bank; + int cpu = mce-extcpu; struct acpi_generic_status *estatus; int rc; estatus = extlog_elog_entry_check(cpu, bank); if (estatus == NULL) - return -EINVAL; + return NOTIFY_DONE; memcpy(elog_buf, (void *)estatus, ELOG_ENTRY_LEN); /* clear record status to enable BIOS to update it again */ estatus-block_status = 0; - rc = print_extlog_rcd(pfx, (struct acpi_generic_status *)elog_buf, cpu); + rc = print_extlog_rcd(NULL, (struct acpi_generic_status *)elog_buf, cpu); - return rc; + return NOTIFY_DONE; } static int extlog_get_dsm(acpi_handle handle, int rev, int func, u64 *ret) @@ -213,6 +217,9 @@ static bool extlog_get_l1addr(void) return true; } +static struct notifier_block extlog_mce_dec = { + .notifier_call = extlog_print, +}; static int __init extlog_init(void) { @@ -279,7 +286,7 @@ static int __init extlog_init(void) if (elog_buf == NULL) goto err_release_elog; - register_elog_handler(extlog_print); + mce_register_decode_chain(extlog_mce_dec); /* enable OS to be involved to take over management from BIOS */ ((struct extlog_l1_head *)extlog_l1_addr)-flags |= FLAG_OS_OPTIN; @@ -300,7 +307,7 @@ err: static void __exit extlog_exit(void) { - unregister_elog_handler(extlog_print); + mce_unregister_decode_chain(extlog_mce_dec); ((struct extlog_l1_head *)extlog_l1_addr)-flags = ~FLAG_OS_OPTIN; if (extlog_l1_addr) acpi_os_unmap_memory(extlog_l1_addr, l1_size); -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On Mon, Oct 21, 2013 at 12:03 PM, Luck, Tony tony.l...@intel.com wrote: So this is on top of the 9 patch series (using the V4 that Chen Gong posted for part 4/9 and V3 for all the others). Obviously it should be folded back into the series if we go this way. It's a bit simplistic right now - the registered function just returns NOTIFY_DONE in all cases so it will not disturb processing by any other registered functions - we can make it smarter later. I folded that back into the series. Also switched out the test on whether to print the No further action is required message to only do so for corrected errors. Cleaned up some of the commit messages, The result is sitting at: git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git eMCA Anything we missed? -Tony -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
Btw, your mailer is generating that Mail-Followup-To header which removes you from the To: list and puts everyone else on To: instead. And of course, the patches you've sent with git-send-email don't have that header and replying to all there is fine. And Tony's replies don't have it so replying to him is fine. >From reading this here: http://cr.yp.to/proto/replyto.html your mail client seems to think you're subscribed to some list and thus drops your mail address from Mail-Followup-To. On Sun, Oct 20, 2013 at 03:06:15AM -0400, Chen Gong wrote: > Oh, yes it is. Furthermore, it reminds me where is the best place > to put cper.c from I write this patch series. CPER really doesn't > dpend on APEI even ACPI. Maybe lib/ ia an option. I can update this > patch and if it is OK, I can add another separate patch to change this > dependency. Make sense? Yeah, for some reason it is part of the UEFI spec but APEI uses it too. Well, I guess you can add it there as "default n" and have the rest of the code select it in Kconfig. > Sigh, it looks like I have m a little bit hurry. Yeah, why is that? :-) -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
[...] > >diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig > >index 22327e6..c67ec61 100644 > >--- a/drivers/acpi/Kconfig > >+++ b/drivers/acpi/Kconfig > >@@ -372,4 +372,24 @@ config ACPI_BGRT > > > > source "drivers/acpi/apei/Kconfig" > > > >+config ACPI_EXTLOG > >+tristate "Extended Error Log support" > >+depends on X86_MCE > > I think you also have a dependancy on ACPI_APEI for apei_estatus_print() > Oh, yes it is. Furthermore, it reminds me where is the best place to put cper.c from I write this patch series. CPER really doesn't dpend on APEI even ACPI. Maybe lib/ ia an option. I can update this patch and if it is OK, I can add another separate patch to change this dependency. Make sense? > >+default n > >+help > >+ Certain usages such as Predictive Failure Analysis (PFA) require > >+ more information about the error than what can be described in > >+ processor machine check banks. Most server processors log > >+ additional information about the error in processor uncore > >+ registers. Since the addresses and layout of these registers vary > >+ widely from one processor to another, system software cannot > >+ readily make use of them. To complicate matters further, some of > >+ the additional error information cannot be constructed space > >+ between "additional" and "error" without detailed knowledge > > Oops... looks like copy+paste went wrong ;) > Sigh, it looks like I have m a little bit hurry. signature.asc Description: Digital signature
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
[...] diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig index 22327e6..c67ec61 100644 --- a/drivers/acpi/Kconfig +++ b/drivers/acpi/Kconfig @@ -372,4 +372,24 @@ config ACPI_BGRT source drivers/acpi/apei/Kconfig +config ACPI_EXTLOG +tristate Extended Error Log support +depends on X86_MCE I think you also have a dependancy on ACPI_APEI for apei_estatus_print() Oh, yes it is. Furthermore, it reminds me where is the best place to put cper.c from I write this patch series. CPER really doesn't dpend on APEI even ACPI. Maybe lib/ ia an option. I can update this patch and if it is OK, I can add another separate patch to change this dependency. Make sense? +default n +help + Certain usages such as Predictive Failure Analysis (PFA) require + more information about the error than what can be described in + processor machine check banks. Most server processors log + additional information about the error in processor uncore + registers. Since the addresses and layout of these registers vary + widely from one processor to another, system software cannot + readily make use of them. To complicate matters further, some of + the additional error information cannot be constructed space + between additional and error without detailed knowledge Oops... looks like copy+paste went wrong ;) Sigh, it looks like I have m a little bit hurry. signature.asc Description: Digital signature
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
Btw, your mailer is generating that Mail-Followup-To header which removes you from the To: list and puts everyone else on To: instead. And of course, the patches you've sent with git-send-email don't have that header and replying to all there is fine. And Tony's replies don't have it so replying to him is fine. From reading this here: http://cr.yp.to/proto/replyto.html your mail client seems to think you're subscribed to some list and thus drops your mail address from Mail-Followup-To. On Sun, Oct 20, 2013 at 03:06:15AM -0400, Chen Gong wrote: Oh, yes it is. Furthermore, it reminds me where is the best place to put cper.c from I write this patch series. CPER really doesn't dpend on APEI even ACPI. Maybe lib/ ia an option. I can update this patch and if it is OK, I can add another separate patch to change this dependency. Make sense? Yeah, for some reason it is part of the UEFI spec but APEI uses it too. Well, I guess you can add it there as default n and have the rest of the code select it in Kconfig. Sigh, it looks like I have m a little bit hurry. Yeah, why is that? :-) -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On Fri, Oct 18, 2013 at 06:07:56PM +0530, Naveen N. Rao wrote: > Date: Fri, 18 Oct 2013 18:07:56 +0530 > From: "Naveen N. Rao" > To: "Chen, Gong" , tony.l...@intel.com, > b...@alien8.de, j...@perches.com, m.che...@samsung.com > CC: aroza...@redhat.com, linux-a...@vger.kernel.org, > linux-kernel@vger.kernel.org > Subject: Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 > platform > User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 > Thunderbird/24.0 > [...] > >+ > >+MODULE_AUTHOR("Chen, Gong "); > >+MODULE_DESCRIPTION("Extended Error Log Driver"); > > "Extended MCA Error Log Driver"? > Looks fine to me. Tony, would you please help to fix it when you pick up the patch? Thanks in advance! signature.asc Description: Digital signature
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On Fri, Oct 18, 2013 at 10:22:26PM +, Luck, Tony wrote: > @@ -154,6 +154,10 @@ void mce_log(struct mce *mce) > /* Emit the trace record: */ > trace_mce_record(mce); > > + if (mce_ext_err_print) > + if (mce_ext_err_print(NULL, m.extcpu, i)) > + return; > + > ret = atomic_notifier_call_chain(_mce_decoder_chain, 0, mce); > if (ret == NOTIFY_STOP) > return; > > If we move mce_ext_err_print() this far ... then it's only one line further > down > to have it be part of the x86_mce_decoder_chain as suggested by Naveen. Right, if you want mce_ext_err_print() to be the first and the only one called on the chain, then you'd have to play with the priority. But yes, this is possible and it would make it all even cleaner and simpler by simply not needing the reg/dereg interfaces for mce_ext_err_print but adding it to the chain. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On Fri, Oct 18, 2013 at 10:22:26PM +, Luck, Tony wrote: @@ -154,6 +154,10 @@ void mce_log(struct mce *mce) /* Emit the trace record: */ trace_mce_record(mce); + if (mce_ext_err_print) + if (mce_ext_err_print(NULL, m.extcpu, i)) + return; + ret = atomic_notifier_call_chain(x86_mce_decoder_chain, 0, mce); if (ret == NOTIFY_STOP) return; If we move mce_ext_err_print() this far ... then it's only one line further down to have it be part of the x86_mce_decoder_chain as suggested by Naveen. Right, if you want mce_ext_err_print() to be the first and the only one called on the chain, then you'd have to play with the priority. But yes, this is possible and it would make it all even cleaner and simpler by simply not needing the reg/dereg interfaces for mce_ext_err_print but adding it to the chain. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On Fri, Oct 18, 2013 at 06:07:56PM +0530, Naveen N. Rao wrote: Date: Fri, 18 Oct 2013 18:07:56 +0530 From: Naveen N. Rao naveen.n@linux.vnet.ibm.com To: Chen, Gong gong.c...@linux.intel.com, tony.l...@intel.com, b...@alien8.de, j...@perches.com, m.che...@samsung.com CC: aroza...@redhat.com, linux-a...@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.0 [...] + +MODULE_AUTHOR(Chen, Gong gong.c...@intel.com); +MODULE_DESCRIPTION(Extended Error Log Driver); Extended MCA Error Log Driver? Looks fine to me. Tony, would you please help to fix it when you pick up the patch? Thanks in advance! signature.asc Description: Digital signature
RE: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
@@ -154,6 +154,10 @@ void mce_log(struct mce *mce) /* Emit the trace record: */ trace_mce_record(mce); + if (mce_ext_err_print) + if (mce_ext_err_print(NULL, m.extcpu, i)) + return; + ret = atomic_notifier_call_chain(_mce_decoder_chain, 0, mce); if (ret == NOTIFY_STOP) return; If we move mce_ext_err_print() this far ... then it's only one line further down to have it be part of the x86_mce_decoder_chain as suggested by Naveen. -Tony
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On Fri, Oct 18, 2013 at 08:57:22PM +, Luck, Tony wrote: > Long term ... I'd be happy to see mce_log() go away. But we need to > have a robust, well tested replacement in place for some time before > such a move is up for discussion. Basically a userspace daemon consuming the tracepoint or plural, tracepoints. > Yes - double error reporting should be avoided. Right. > Our first platforms to implement this only do so for memory errors. > This could change in the future (the UEFI appendix N error record has > defined sub-sections for lots of types of errors). Ok. > Currently EDAC hooked into the mce even notification chain provides a > return code to indicate whether it completely processed the error, or > whether to fall through to the rest of mce_log(): > > if (ret == NOTIFY_STOP) > return; > > Having both EDAC and this new extended error log both registered on this > chain would probably not be helpful in most cases. Not only that - you don't need EDAC because all the information is in the MCA registers and the eMCA supplement, if there is one. EDAC would be used on older systems which don't sport eMCA. Now, concerning the current situation, we probably want to do something like this: diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index b1b04123f3d9..382c78eaf474 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -154,6 +154,10 @@ void mce_log(struct mce *mce) /* Emit the trace record: */ trace_mce_record(mce); + if (mce_ext_err_print) + if (mce_ext_err_print(NULL, m.extcpu, i)) + return; + ret = atomic_notifier_call_chain(_mce_decoder_chain, 0, mce); if (ret == NOTIFY_STOP) return; -- Right, we've moved the eMCA print thingie to mce_log so that we get a chance to run the first TP issuing the raw MCA registers and then run the eMCA TP as a follow-up. We've taught mce_ext_err_print() to return a true/false retval to denote: * true: it has collected data successfully, no need to go down the reporting chain * false: eMCA failed somehow, log the error down and trigger mcelog in userspace. How does that sound? > Not sure if we should handle that with user education to not load both > an EDAC and ext_log driver or if there should be some enforcement. Definitely enforcement. The flags thing I was telling you about recently could be one way to do it. > trace_mce_record() dumps the raw data from the machine check banks. I > think there may still be a case for having this. Analysis tools that > look at this trace as well should be smart enough to connect the dots. Yes, sure. The more non-overlaping data we get, the better. Thanks. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
> Hmm, that's a good question you raise: but the more important question > is, do you guys - Gong and Tony - want to replace the logging we're > already doing, i.e. mce_log() with extlog or not. Long term ... I'd be happy to see mce_log() go away. But we need to have a robust, well tested replacement in place for some time before such a move is up for discussion. > Because if you want to replace the current logging you actually have to > exit machine_check_poll() after having done mce_ext_err_print() so that > the rest of the chain doesn't see the error. Yes - double error reporting should be avoided. > And, does mce_ext_err_print only report DRAM ECC errors or other error > types too? Our first platforms to implement this only do so for memory errors. This could change in the future (the UEFI appendix N error record has defined sub-sections for lots of types of errors). Currently EDAC hooked into the mce even notification chain provides a return code to indicate whether it completely processed the error, or whether to fall through to the rest of mce_log(): if (ret == NOTIFY_STOP) return; Having both EDAC and this new extended error log both registered on this chain would probably not be helpful in most cases. Not sure if we should handle that with user education to not load both an EDAC and ext_log driver or if there should be some enforcement. > Btw, if we keep both, then we're going to have two tracepoints - > trace_mce_record() in mce_log() and this one - issuing each a record for > the same event. Which is not really what we want I'd say... trace_mce_record() dumps the raw data from the machine check banks. I think there may still be a case for having this. Analysis tools that look at this trace as well should be smart enough to connect the dots. -Tony N�r��yb�X��ǧv�^�){.n�+{zX����ܨ}���Ơz�:+v���zZ+��+zf���h���~i���z��w���?�&�)ߢf��^jǫy�m��@A�a��� 0��h���i
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On Fri, Oct 18, 2013 at 06:07:56PM +0530, Naveen N. Rao wrote: > >@@ -624,6 +641,9 @@ void machine_check_poll(enum mcp_flags flags, > >mce_banks_t *b) > > (m.status & (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC))) > > continue; > > > >+if (mce_ext_err_print) > >+mce_ext_err_print(NULL, m.extcpu, i); > >+ > > Can we use the notifier chain we already have: > mce_register_decode_chain()? EDAC uses this and I'm wondering if it > is a good fit here. As an added bonus, it seems to honor dont_log_ce > option as well. Hmm, that's a good question you raise: but the more important question is, do you guys - Gong and Tony - want to replace the logging we're already doing, i.e. mce_log() with extlog or not. Because if you want to replace the current logging you actually have to exit machine_check_poll() after having done mce_ext_err_print() so that the rest of the chain doesn't see the error. And, does mce_ext_err_print only report DRAM ECC errors or other error types too? Btw, if we keep both, then we're going to have two tracepoints - trace_mce_record() in mce_log() and this one - issuing each a record for the same event. Which is not really what we want I'd say... Thanks. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On 10/18/2013 01:53 PM, Chen, Gong wrote: This H/W error log driver (a.k.a eMCA driver) is implemented based on http://www.intel.com/content/www/us/en/architecture-and-technology/enhanced-mca-logging-xeon-paper.html After errors are captured, more valuable information can be got via this new enhanced H/W error log driver. v3 -> v2: fix a MACRO definition error and some cleanup v2 -> v1: eliminate spin_lock & minor fixes suggested by Boris Signed-off-by: Chen, Gong --- arch/x86/include/asm/mce.h | 5 + arch/x86/kernel/cpu/mcheck/mce.c | 20 +++ drivers/acpi/Kconfig | 20 +++ drivers/acpi/Makefile| 2 + drivers/acpi/acpi_extlog.c | 319 +++ drivers/acpi/bus.c | 3 +- include/linux/acpi.h | 1 + 7 files changed, 369 insertions(+), 1 deletion(-) create mode 100644 drivers/acpi/acpi_extlog.c diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h index cbe6b9e..072b2f8 100644 --- a/arch/x86/include/asm/mce.h +++ b/arch/x86/include/asm/mce.h @@ -16,6 +16,7 @@ #define MCG_EXT_CNT_SHIFT 16 #define MCG_EXT_CNT(c)(((c) & MCG_EXT_CNT_MASK) >> MCG_EXT_CNT_SHIFT) #define MCG_SER_P (1ULL<<24) /* MCA recovery/new status bits */ +#define MCG_ELOG_P (1ULL<<26) /* Extended error log supported */ /* MCG_STATUS register defines */ #define MCG_STATUS_RIPV (1ULL<<0) /* restart ip valid */ @@ -186,6 +187,10 @@ enum mcp_flags { MCP_UC = (1 << 1),/* log uncorrected errors */ MCP_DONTLOG = (1 << 2), /* only clear, don't log */ }; + +void register_elog_handler(int (*f)(const char *, int, int)); +void unregister_elog_handler(int (*f)(const char *, int, int)); + void machine_check_poll(enum mcp_flags flags, mce_banks_t *b); int mce_notify_irq(void); diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index b3218cd..981e0d3 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -48,6 +48,8 @@ #include "mce-internal.h" +static int (*mce_ext_err_print)(const char *, int, int); + static DEFINE_MUTEX(mce_chrdev_read_mutex); #define rcu_dereference_check_mce(p) \ @@ -576,6 +578,21 @@ static void mce_read_aux(struct mce *m, int i) DEFINE_PER_CPU(unsigned, mce_poll_count); +void register_elog_handler(int (*f)(const char *, int, int)) +{ + mce_ext_err_print = f; +} +EXPORT_SYMBOL_GPL(register_elog_handler); + +void unregister_elog_handler(int (*f)(const char *, int, int)) +{ + if (f) { + WARN_ON(mce_ext_err_print != f); + mce_ext_err_print = NULL; + } +} +EXPORT_SYMBOL_GPL(unregister_elog_handler); + /* * Poll for corrected events or events that happened before reset. * Those are just logged through /dev/mcelog. @@ -624,6 +641,9 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b) (m.status & (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC))) continue; + if (mce_ext_err_print) + mce_ext_err_print(NULL, m.extcpu, i); + Can we use the notifier chain we already have: mce_register_decode_chain()? EDAC uses this and I'm wondering if it is a good fit here. As an added bonus, it seems to honor dont_log_ce option as well. mce_read_aux(, i); if (!(flags & MCP_TIMESTAMP)) diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig index 22327e6..c67ec61 100644 --- a/drivers/acpi/Kconfig +++ b/drivers/acpi/Kconfig @@ -372,4 +372,24 @@ config ACPI_BGRT source "drivers/acpi/apei/Kconfig" +config ACPI_EXTLOG + tristate "Extended Error Log support" + depends on X86_MCE I think you also have a dependancy on ACPI_APEI for apei_estatus_print() + default n + help + Certain usages such as Predictive Failure Analysis (PFA) require + more information about the error than what can be described in + processor machine check banks. Most server processors log + additional information about the error in processor uncore + registers. Since the addresses and layout of these registers vary + widely from one processor to another, system software cannot + readily make use of them. To complicate matters further, some of + the additional error information cannot be constructed space + between "additional" and "error" without detailed knowledge Oops... looks like copy+paste went wrong ;) + about platform topology. + + Enhanced MCA Logging allows firmware to provide additional error + information to system software, synchronous with MCE or CMCI. This + driver adds support for that functionality. + endif # ACPI diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile index cdaf68b..bce34af 100644 --- a/drivers/acpi/Makefile +++
[PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
This H/W error log driver (a.k.a eMCA driver) is implemented based on http://www.intel.com/content/www/us/en/architecture-and-technology/enhanced-mca-logging-xeon-paper.html After errors are captured, more valuable information can be got via this new enhanced H/W error log driver. v3 -> v2: fix a MACRO definition error and some cleanup v2 -> v1: eliminate spin_lock & minor fixes suggested by Boris Signed-off-by: Chen, Gong --- arch/x86/include/asm/mce.h | 5 + arch/x86/kernel/cpu/mcheck/mce.c | 20 +++ drivers/acpi/Kconfig | 20 +++ drivers/acpi/Makefile| 2 + drivers/acpi/acpi_extlog.c | 319 +++ drivers/acpi/bus.c | 3 +- include/linux/acpi.h | 1 + 7 files changed, 369 insertions(+), 1 deletion(-) create mode 100644 drivers/acpi/acpi_extlog.c diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h index cbe6b9e..072b2f8 100644 --- a/arch/x86/include/asm/mce.h +++ b/arch/x86/include/asm/mce.h @@ -16,6 +16,7 @@ #define MCG_EXT_CNT_SHIFT 16 #define MCG_EXT_CNT(c) (((c) & MCG_EXT_CNT_MASK) >> MCG_EXT_CNT_SHIFT) #define MCG_SER_P (1ULL<<24) /* MCA recovery/new status bits */ +#define MCG_ELOG_P (1ULL<<26) /* Extended error log supported */ /* MCG_STATUS register defines */ #define MCG_STATUS_RIPV (1ULL<<0) /* restart ip valid */ @@ -186,6 +187,10 @@ enum mcp_flags { MCP_UC = (1 << 1), /* log uncorrected errors */ MCP_DONTLOG = (1 << 2), /* only clear, don't log */ }; + +void register_elog_handler(int (*f)(const char *, int, int)); +void unregister_elog_handler(int (*f)(const char *, int, int)); + void machine_check_poll(enum mcp_flags flags, mce_banks_t *b); int mce_notify_irq(void); diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index b3218cd..981e0d3 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -48,6 +48,8 @@ #include "mce-internal.h" +static int (*mce_ext_err_print)(const char *, int, int); + static DEFINE_MUTEX(mce_chrdev_read_mutex); #define rcu_dereference_check_mce(p) \ @@ -576,6 +578,21 @@ static void mce_read_aux(struct mce *m, int i) DEFINE_PER_CPU(unsigned, mce_poll_count); +void register_elog_handler(int (*f)(const char *, int, int)) +{ + mce_ext_err_print = f; +} +EXPORT_SYMBOL_GPL(register_elog_handler); + +void unregister_elog_handler(int (*f)(const char *, int, int)) +{ + if (f) { + WARN_ON(mce_ext_err_print != f); + mce_ext_err_print = NULL; + } +} +EXPORT_SYMBOL_GPL(unregister_elog_handler); + /* * Poll for corrected events or events that happened before reset. * Those are just logged through /dev/mcelog. @@ -624,6 +641,9 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b) (m.status & (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC))) continue; + if (mce_ext_err_print) + mce_ext_err_print(NULL, m.extcpu, i); + mce_read_aux(, i); if (!(flags & MCP_TIMESTAMP)) diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig index 22327e6..c67ec61 100644 --- a/drivers/acpi/Kconfig +++ b/drivers/acpi/Kconfig @@ -372,4 +372,24 @@ config ACPI_BGRT source "drivers/acpi/apei/Kconfig" +config ACPI_EXTLOG + tristate "Extended Error Log support" + depends on X86_MCE + default n + help + Certain usages such as Predictive Failure Analysis (PFA) require + more information about the error than what can be described in + processor machine check banks. Most server processors log + additional information about the error in processor uncore + registers. Since the addresses and layout of these registers vary + widely from one processor to another, system software cannot + readily make use of them. To complicate matters further, some of + the additional error information cannot be constructed space + between "additional" and "error" without detailed knowledge + about platform topology. + + Enhanced MCA Logging allows firmware to provide additional error + information to system software, synchronous with MCE or CMCI. This + driver adds support for that functionality. + endif # ACPI diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile index cdaf68b..bce34af 100644 --- a/drivers/acpi/Makefile +++ b/drivers/acpi/Makefile @@ -82,3 +82,5 @@ processor-$(CONFIG_CPU_FREQ) += processor_perflib.o obj-$(CONFIG_ACPI_PROCESSOR_AGGREGATOR) += acpi_pad.o obj-$(CONFIG_ACPI_APEI)+= apei/ + +obj-$(CONFIG_ACPI_EXTLOG) += acpi_extlog.o diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c new file mode 100644 index 000..afeab59 --- /dev/null +++ b/drivers/acpi/acpi_extlog.c
[PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
This H/W error log driver (a.k.a eMCA driver) is implemented based on http://www.intel.com/content/www/us/en/architecture-and-technology/enhanced-mca-logging-xeon-paper.html After errors are captured, more valuable information can be got via this new enhanced H/W error log driver. v3 - v2: fix a MACRO definition error and some cleanup v2 - v1: eliminate spin_lock minor fixes suggested by Boris Signed-off-by: Chen, Gong gong.c...@linux.intel.com --- arch/x86/include/asm/mce.h | 5 + arch/x86/kernel/cpu/mcheck/mce.c | 20 +++ drivers/acpi/Kconfig | 20 +++ drivers/acpi/Makefile| 2 + drivers/acpi/acpi_extlog.c | 319 +++ drivers/acpi/bus.c | 3 +- include/linux/acpi.h | 1 + 7 files changed, 369 insertions(+), 1 deletion(-) create mode 100644 drivers/acpi/acpi_extlog.c diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h index cbe6b9e..072b2f8 100644 --- a/arch/x86/include/asm/mce.h +++ b/arch/x86/include/asm/mce.h @@ -16,6 +16,7 @@ #define MCG_EXT_CNT_SHIFT 16 #define MCG_EXT_CNT(c) (((c) MCG_EXT_CNT_MASK) MCG_EXT_CNT_SHIFT) #define MCG_SER_P (1ULL24) /* MCA recovery/new status bits */ +#define MCG_ELOG_P (1ULL26) /* Extended error log supported */ /* MCG_STATUS register defines */ #define MCG_STATUS_RIPV (1ULL0) /* restart ip valid */ @@ -186,6 +187,10 @@ enum mcp_flags { MCP_UC = (1 1), /* log uncorrected errors */ MCP_DONTLOG = (1 2), /* only clear, don't log */ }; + +void register_elog_handler(int (*f)(const char *, int, int)); +void unregister_elog_handler(int (*f)(const char *, int, int)); + void machine_check_poll(enum mcp_flags flags, mce_banks_t *b); int mce_notify_irq(void); diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index b3218cd..981e0d3 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -48,6 +48,8 @@ #include mce-internal.h +static int (*mce_ext_err_print)(const char *, int, int); + static DEFINE_MUTEX(mce_chrdev_read_mutex); #define rcu_dereference_check_mce(p) \ @@ -576,6 +578,21 @@ static void mce_read_aux(struct mce *m, int i) DEFINE_PER_CPU(unsigned, mce_poll_count); +void register_elog_handler(int (*f)(const char *, int, int)) +{ + mce_ext_err_print = f; +} +EXPORT_SYMBOL_GPL(register_elog_handler); + +void unregister_elog_handler(int (*f)(const char *, int, int)) +{ + if (f) { + WARN_ON(mce_ext_err_print != f); + mce_ext_err_print = NULL; + } +} +EXPORT_SYMBOL_GPL(unregister_elog_handler); + /* * Poll for corrected events or events that happened before reset. * Those are just logged through /dev/mcelog. @@ -624,6 +641,9 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b) (m.status (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC))) continue; + if (mce_ext_err_print) + mce_ext_err_print(NULL, m.extcpu, i); + mce_read_aux(m, i); if (!(flags MCP_TIMESTAMP)) diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig index 22327e6..c67ec61 100644 --- a/drivers/acpi/Kconfig +++ b/drivers/acpi/Kconfig @@ -372,4 +372,24 @@ config ACPI_BGRT source drivers/acpi/apei/Kconfig +config ACPI_EXTLOG + tristate Extended Error Log support + depends on X86_MCE + default n + help + Certain usages such as Predictive Failure Analysis (PFA) require + more information about the error than what can be described in + processor machine check banks. Most server processors log + additional information about the error in processor uncore + registers. Since the addresses and layout of these registers vary + widely from one processor to another, system software cannot + readily make use of them. To complicate matters further, some of + the additional error information cannot be constructed space + between additional and error without detailed knowledge + about platform topology. + + Enhanced MCA Logging allows firmware to provide additional error + information to system software, synchronous with MCE or CMCI. This + driver adds support for that functionality. + endif # ACPI diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile index cdaf68b..bce34af 100644 --- a/drivers/acpi/Makefile +++ b/drivers/acpi/Makefile @@ -82,3 +82,5 @@ processor-$(CONFIG_CPU_FREQ) += processor_perflib.o obj-$(CONFIG_ACPI_PROCESSOR_AGGREGATOR) += acpi_pad.o obj-$(CONFIG_ACPI_APEI)+= apei/ + +obj-$(CONFIG_ACPI_EXTLOG) += acpi_extlog.o diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c new file mode 100644 index 000..afeab59 --- /dev/null +++ b/drivers/acpi/acpi_extlog.c @@
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On 10/18/2013 01:53 PM, Chen, Gong wrote: This H/W error log driver (a.k.a eMCA driver) is implemented based on http://www.intel.com/content/www/us/en/architecture-and-technology/enhanced-mca-logging-xeon-paper.html After errors are captured, more valuable information can be got via this new enhanced H/W error log driver. v3 - v2: fix a MACRO definition error and some cleanup v2 - v1: eliminate spin_lock minor fixes suggested by Boris Signed-off-by: Chen, Gong gong.c...@linux.intel.com --- arch/x86/include/asm/mce.h | 5 + arch/x86/kernel/cpu/mcheck/mce.c | 20 +++ drivers/acpi/Kconfig | 20 +++ drivers/acpi/Makefile| 2 + drivers/acpi/acpi_extlog.c | 319 +++ drivers/acpi/bus.c | 3 +- include/linux/acpi.h | 1 + 7 files changed, 369 insertions(+), 1 deletion(-) create mode 100644 drivers/acpi/acpi_extlog.c diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h index cbe6b9e..072b2f8 100644 --- a/arch/x86/include/asm/mce.h +++ b/arch/x86/include/asm/mce.h @@ -16,6 +16,7 @@ #define MCG_EXT_CNT_SHIFT 16 #define MCG_EXT_CNT(c)(((c) MCG_EXT_CNT_MASK) MCG_EXT_CNT_SHIFT) #define MCG_SER_P (1ULL24) /* MCA recovery/new status bits */ +#define MCG_ELOG_P (1ULL26) /* Extended error log supported */ /* MCG_STATUS register defines */ #define MCG_STATUS_RIPV (1ULL0) /* restart ip valid */ @@ -186,6 +187,10 @@ enum mcp_flags { MCP_UC = (1 1),/* log uncorrected errors */ MCP_DONTLOG = (1 2), /* only clear, don't log */ }; + +void register_elog_handler(int (*f)(const char *, int, int)); +void unregister_elog_handler(int (*f)(const char *, int, int)); + void machine_check_poll(enum mcp_flags flags, mce_banks_t *b); int mce_notify_irq(void); diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index b3218cd..981e0d3 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -48,6 +48,8 @@ #include mce-internal.h +static int (*mce_ext_err_print)(const char *, int, int); + static DEFINE_MUTEX(mce_chrdev_read_mutex); #define rcu_dereference_check_mce(p) \ @@ -576,6 +578,21 @@ static void mce_read_aux(struct mce *m, int i) DEFINE_PER_CPU(unsigned, mce_poll_count); +void register_elog_handler(int (*f)(const char *, int, int)) +{ + mce_ext_err_print = f; +} +EXPORT_SYMBOL_GPL(register_elog_handler); + +void unregister_elog_handler(int (*f)(const char *, int, int)) +{ + if (f) { + WARN_ON(mce_ext_err_print != f); + mce_ext_err_print = NULL; + } +} +EXPORT_SYMBOL_GPL(unregister_elog_handler); + /* * Poll for corrected events or events that happened before reset. * Those are just logged through /dev/mcelog. @@ -624,6 +641,9 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b) (m.status (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC))) continue; + if (mce_ext_err_print) + mce_ext_err_print(NULL, m.extcpu, i); + Can we use the notifier chain we already have: mce_register_decode_chain()? EDAC uses this and I'm wondering if it is a good fit here. As an added bonus, it seems to honor dont_log_ce option as well. mce_read_aux(m, i); if (!(flags MCP_TIMESTAMP)) diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig index 22327e6..c67ec61 100644 --- a/drivers/acpi/Kconfig +++ b/drivers/acpi/Kconfig @@ -372,4 +372,24 @@ config ACPI_BGRT source drivers/acpi/apei/Kconfig +config ACPI_EXTLOG + tristate Extended Error Log support + depends on X86_MCE I think you also have a dependancy on ACPI_APEI for apei_estatus_print() + default n + help + Certain usages such as Predictive Failure Analysis (PFA) require + more information about the error than what can be described in + processor machine check banks. Most server processors log + additional information about the error in processor uncore + registers. Since the addresses and layout of these registers vary + widely from one processor to another, system software cannot + readily make use of them. To complicate matters further, some of + the additional error information cannot be constructed space + between additional and error without detailed knowledge Oops... looks like copy+paste went wrong ;) + about platform topology. + + Enhanced MCA Logging allows firmware to provide additional error + information to system software, synchronous with MCE or CMCI. This + driver adds support for that functionality. + endif # ACPI diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile index cdaf68b..bce34af 100644 --- a/drivers/acpi/Makefile +++
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On Fri, Oct 18, 2013 at 06:07:56PM +0530, Naveen N. Rao wrote: @@ -624,6 +641,9 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b) (m.status (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC))) continue; +if (mce_ext_err_print) +mce_ext_err_print(NULL, m.extcpu, i); + Can we use the notifier chain we already have: mce_register_decode_chain()? EDAC uses this and I'm wondering if it is a good fit here. As an added bonus, it seems to honor dont_log_ce option as well. Hmm, that's a good question you raise: but the more important question is, do you guys - Gong and Tony - want to replace the logging we're already doing, i.e. mce_log() with extlog or not. Because if you want to replace the current logging you actually have to exit machine_check_poll() after having done mce_ext_err_print() so that the rest of the chain doesn't see the error. And, does mce_ext_err_print only report DRAM ECC errors or other error types too? Btw, if we keep both, then we're going to have two tracepoints - trace_mce_record() in mce_log() and this one - issuing each a record for the same event. Which is not really what we want I'd say... Thanks. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
Hmm, that's a good question you raise: but the more important question is, do you guys - Gong and Tony - want to replace the logging we're already doing, i.e. mce_log() with extlog or not. Long term ... I'd be happy to see mce_log() go away. But we need to have a robust, well tested replacement in place for some time before such a move is up for discussion. Because if you want to replace the current logging you actually have to exit machine_check_poll() after having done mce_ext_err_print() so that the rest of the chain doesn't see the error. Yes - double error reporting should be avoided. And, does mce_ext_err_print only report DRAM ECC errors or other error types too? Our first platforms to implement this only do so for memory errors. This could change in the future (the UEFI appendix N error record has defined sub-sections for lots of types of errors). Currently EDAC hooked into the mce even notification chain provides a return code to indicate whether it completely processed the error, or whether to fall through to the rest of mce_log(): if (ret == NOTIFY_STOP) return; Having both EDAC and this new extended error log both registered on this chain would probably not be helpful in most cases. Not sure if we should handle that with user education to not load both an EDAC and ext_log driver or if there should be some enforcement. Btw, if we keep both, then we're going to have two tracepoints - trace_mce_record() in mce_log() and this one - issuing each a record for the same event. Which is not really what we want I'd say... trace_mce_record() dumps the raw data from the machine check banks. I think there may still be a case for having this. Analysis tools that look at this trace as well should be smart enough to connect the dots. -Tony N�r��yb�X��ǧv�^�){.n�+{zX����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf��^jǫy�m��@A�a��� 0��h���i
Re: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
On Fri, Oct 18, 2013 at 08:57:22PM +, Luck, Tony wrote: Long term ... I'd be happy to see mce_log() go away. But we need to have a robust, well tested replacement in place for some time before such a move is up for discussion. Basically a userspace daemon consuming the tracepoint or plural, tracepoints. Yes - double error reporting should be avoided. Right. Our first platforms to implement this only do so for memory errors. This could change in the future (the UEFI appendix N error record has defined sub-sections for lots of types of errors). Ok. Currently EDAC hooked into the mce even notification chain provides a return code to indicate whether it completely processed the error, or whether to fall through to the rest of mce_log(): if (ret == NOTIFY_STOP) return; Having both EDAC and this new extended error log both registered on this chain would probably not be helpful in most cases. Not only that - you don't need EDAC because all the information is in the MCA registers and the eMCA supplement, if there is one. EDAC would be used on older systems which don't sport eMCA. Now, concerning the current situation, we probably want to do something like this: diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index b1b04123f3d9..382c78eaf474 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -154,6 +154,10 @@ void mce_log(struct mce *mce) /* Emit the trace record: */ trace_mce_record(mce); + if (mce_ext_err_print) + if (mce_ext_err_print(NULL, m.extcpu, i)) + return; + ret = atomic_notifier_call_chain(x86_mce_decoder_chain, 0, mce); if (ret == NOTIFY_STOP) return; -- Right, we've moved the eMCA print thingie to mce_log so that we get a chance to run the first TP issuing the raw MCA registers and then run the eMCA TP as a follow-up. We've taught mce_ext_err_print() to return a true/false retval to denote: * true: it has collected data successfully, no need to go down the reporting chain * false: eMCA failed somehow, log the error down and trigger mcelog in userspace. How does that sound? Not sure if we should handle that with user education to not load both an EDAC and ext_log driver or if there should be some enforcement. Definitely enforcement. The flags thing I was telling you about recently could be one way to do it. trace_mce_record() dumps the raw data from the machine check banks. I think there may still be a case for having this. Analysis tools that look at this trace as well should be smart enough to connect the dots. Yes, sure. The more non-overlaping data we get, the better. Thanks. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v3 4/9] ACPI, x86: Extended error log driver for x86 platform
@@ -154,6 +154,10 @@ void mce_log(struct mce *mce) /* Emit the trace record: */ trace_mce_record(mce); + if (mce_ext_err_print) + if (mce_ext_err_print(NULL, m.extcpu, i)) + return; + ret = atomic_notifier_call_chain(x86_mce_decoder_chain, 0, mce); if (ret == NOTIFY_STOP) return; If we move mce_ext_err_print() this far ... then it's only one line further down to have it be part of the x86_mce_decoder_chain as suggested by Naveen. -Tony