[x86_64 MCE] [RFC] mce.c race condition (or: when evil hacks are the only options)

2007-07-12 Thread Joshua Wise
 to be stable, although it's sill ugly.

Thoughts?

joshua

[1] 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4f84e4be53a04a65d97bf0faa0c8f99e29bc0170

[2] http://investor.google.com/conduct.html

--

From: Joshua Wise <[EMAIL PROTECTED]>

Background:
 In some situations, mce_log would race against mce_read and deadlock. This
 race condition is described in more detail in the body of the associated
 e-mail.

Description:
 This patch allows the machine check exception handler to time out after
 spinning for too long waiting for the deadlock to finish. This may lose a
 machine check exception from time to time, but it's certainly better than
 bringing down the system.

Remaining issues:
 * Should the whole log structure just be rewritten as a ring buffer?

Testing:
 I injected single bit errors on CPU0 in a while loop overnight while
 running mced on CPU1. Previously, this would crash the system after some
 minutes, but the system survived the entire night this way.

Credits:
 Thanks to Tim Hockin <[EMAIL PROTECTED]> and Mike Waychison
 <[EMAIL PROTECTED]> for sitting down with me and combing through this to
 help me find the race.

Patch:
 This patch is against git 0471448f4d017470995d8a2272dc8c06dbed3b77.

Signed-off-by: Joshua Wise <[EMAIL PROTECTED]>

--

diff --git a/arch/x86_64/kernel/mce.c b/arch/x86_64/kernel/mce.c
index aa1d159..87ff9dd 100644
--- a/arch/x86_64/kernel/mce.c
+++ b/arch/x86_64/kernel/mce.c
@@ -30,6 +30,8 @@ #include 
 #define MISC_MCELOG_MINOR 227
 #define NR_BANKS 6
 
+#define MCE_LOG_RETRIES 1 /* if we retry 100,000,000 times, we are 
probably in the synchronize_sched race */
+
 atomic_t mce_entry;
 
 static int mce_dont_init;
@@ -61,7 +63,7 @@ struct mce_log mcelog = { 
 
 void mce_log(struct mce *mce)
 {
-   unsigned next, entry;
+   unsigned next, entry, attempts = 0;
atomic_inc(_events);
mce->finished = 0;
wmb();
@@ -70,6 +72,13 @@ void mce_log(struct mce *mce)
/* The rmb forces the compiler to reload next in each
iteration */
rmb();
+   
+   /* Deal with the synchronize_sched race */
+   if (attempts++ > MCE_LOG_RETRIES) {
+   set_bit(MCE_OVERFLOW, );
+   return;
+   }
+   
for (;;) {
/* When the buffer fills up discard new entries. Assume
   that the earlier errors are the more interesting. */From: Joshua Wise <[EMAIL PROTECTED]>

Background:
 In some situations, mce_log would race against mce_read and deadlock. This
 race condition is described in more detail in the body of the associated
 e-mail.

Description:
 This patch allows the machine check exception handler to time out after
 spinning for too long waiting for the deadlock to finish. This may lose a
 machine check exception from time to time, but it's certainly better than
 bringing down the system.

Remaining issues:
 * Should the whole log structure just be rewritten as a ring buffer?

Testing:
 I injected single bit errors on CPU0 in a while loop overnight while
 running mced on CPU1. Previously, this would crash the system after some
 minutes, but the system survived the entire night this way.

Credits:
 Thanks to Tim Hockin <[EMAIL PROTECTED]> and Mike Waychison
 <[EMAIL PROTECTED]> for sitting down with me and combing through this to
 help me find the race.

Patch:
 This patch is against git 0471448f4d017470995d8a2272dc8c06dbed3b77.

Signed-off-by: Joshua Wise <[EMAIL PROTECTED]>

--

diff --git a/arch/x86_64/kernel/mce.c b/arch/x86_64/kernel/mce.c
index aa1d159..87ff9dd 100644
--- a/arch/x86_64/kernel/mce.c
+++ b/arch/x86_64/kernel/mce.c
@@ -30,6 +30,8 @@ #include 
 #define MISC_MCELOG_MINOR 227
 #define NR_BANKS 6
 
+#define MCE_LOG_RETRIES 1 /* if we retry 100,000,000 times, we are 
probably in the synchronize_sched race */
+
 atomic_t mce_entry;
 
 static int mce_dont_init;
@@ -61,7 +63,7 @@ struct mce_log mcelog = { 
 
 void mce_log(struct mce *mce)
 {
-   unsigned next, entry;
+   unsigned next, entry, attempts = 0;
atomic_inc(_events);
mce->finished = 0;
wmb();
@@ -70,6 +72,13 @@ void mce_log(struct mce *mce)
/* The rmb forces the compiler to reload next in each
iteration */
rmb();
+   
+   /* Deal with the synchronize_sched race */
+   if (attempts++ > MCE_LOG_RETRIES) {
+   set_bit(MCE_OVERFLOW, );
+   return;
+   }
+   
for (;;) {
/* When the buffer fills up discard new entries. Assume
   that the earlier errors are the more interesting. */


[x86_64 MCE] [RFC] mce.c race condition (or: when evil hacks are the only options)

2007-07-12 Thread Joshua Wise
 to be stable, although it's sill ugly.

Thoughts?

joshua

[1] 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4f84e4be53a04a65d97bf0faa0c8f99e29bc0170

[2] http://investor.google.com/conduct.html

--

From: Joshua Wise [EMAIL PROTECTED]

Background:
 In some situations, mce_log would race against mce_read and deadlock. This
 race condition is described in more detail in the body of the associated
 e-mail.

Description:
 This patch allows the machine check exception handler to time out after
 spinning for too long waiting for the deadlock to finish. This may lose a
 machine check exception from time to time, but it's certainly better than
 bringing down the system.

Remaining issues:
 * Should the whole log structure just be rewritten as a ring buffer?

Testing:
 I injected single bit errors on CPU0 in a while loop overnight while
 running mced on CPU1. Previously, this would crash the system after some
 minutes, but the system survived the entire night this way.

Credits:
 Thanks to Tim Hockin [EMAIL PROTECTED] and Mike Waychison
 [EMAIL PROTECTED] for sitting down with me and combing through this to
 help me find the race.

Patch:
 This patch is against git 0471448f4d017470995d8a2272dc8c06dbed3b77.

Signed-off-by: Joshua Wise [EMAIL PROTECTED]

--

diff --git a/arch/x86_64/kernel/mce.c b/arch/x86_64/kernel/mce.c
index aa1d159..87ff9dd 100644
--- a/arch/x86_64/kernel/mce.c
+++ b/arch/x86_64/kernel/mce.c
@@ -30,6 +30,8 @@ #include asm/smp.h
 #define MISC_MCELOG_MINOR 227
 #define NR_BANKS 6
 
+#define MCE_LOG_RETRIES 1 /* if we retry 100,000,000 times, we are 
probably in the synchronize_sched race */
+
 atomic_t mce_entry;
 
 static int mce_dont_init;
@@ -61,7 +63,7 @@ struct mce_log mcelog = { 
 
 void mce_log(struct mce *mce)
 {
-   unsigned next, entry;
+   unsigned next, entry, attempts = 0;
atomic_inc(mce_events);
mce-finished = 0;
wmb();
@@ -70,6 +72,13 @@ void mce_log(struct mce *mce)
/* The rmb forces the compiler to reload next in each
iteration */
rmb();
+   
+   /* Deal with the synchronize_sched race */
+   if (attempts++  MCE_LOG_RETRIES) {
+   set_bit(MCE_OVERFLOW, mcelog.flags);
+   return;
+   }
+   
for (;;) {
/* When the buffer fills up discard new entries. Assume
   that the earlier errors are the more interesting. */From: Joshua Wise [EMAIL PROTECTED]

Background:
 In some situations, mce_log would race against mce_read and deadlock. This
 race condition is described in more detail in the body of the associated
 e-mail.

Description:
 This patch allows the machine check exception handler to time out after
 spinning for too long waiting for the deadlock to finish. This may lose a
 machine check exception from time to time, but it's certainly better than
 bringing down the system.

Remaining issues:
 * Should the whole log structure just be rewritten as a ring buffer?

Testing:
 I injected single bit errors on CPU0 in a while loop overnight while
 running mced on CPU1. Previously, this would crash the system after some
 minutes, but the system survived the entire night this way.

Credits:
 Thanks to Tim Hockin [EMAIL PROTECTED] and Mike Waychison
 [EMAIL PROTECTED] for sitting down with me and combing through this to
 help me find the race.

Patch:
 This patch is against git 0471448f4d017470995d8a2272dc8c06dbed3b77.

Signed-off-by: Joshua Wise [EMAIL PROTECTED]

--

diff --git a/arch/x86_64/kernel/mce.c b/arch/x86_64/kernel/mce.c
index aa1d159..87ff9dd 100644
--- a/arch/x86_64/kernel/mce.c
+++ b/arch/x86_64/kernel/mce.c
@@ -30,6 +30,8 @@ #include asm/smp.h
 #define MISC_MCELOG_MINOR 227
 #define NR_BANKS 6
 
+#define MCE_LOG_RETRIES 1 /* if we retry 100,000,000 times, we are 
probably in the synchronize_sched race */
+
 atomic_t mce_entry;
 
 static int mce_dont_init;
@@ -61,7 +63,7 @@ struct mce_log mcelog = { 
 
 void mce_log(struct mce *mce)
 {
-   unsigned next, entry;
+   unsigned next, entry, attempts = 0;
atomic_inc(mce_events);
mce-finished = 0;
wmb();
@@ -70,6 +72,13 @@ void mce_log(struct mce *mce)
/* The rmb forces the compiler to reload next in each
iteration */
rmb();
+   
+   /* Deal with the synchronize_sched race */
+   if (attempts++  MCE_LOG_RETRIES) {
+   set_bit(MCE_OVERFLOW, mcelog.flags);
+   return;
+   }
+   
for (;;) {
/* When the buffer fills up discard new entries. Assume
   that the earlier errors are the more interesting. */


[PATCH] Print utsname on Oops on all architectures

2007-07-05 Thread Joshua Wise
Hopefully this patch has not been munged by pine; I have, minimally,
unchecked the mung-patches-sent-to-lkml option in pine's config. In the case
that it has been munged, I have also attached it.

--

From: Joshua Wise <[EMAIL PROTECTED]>

Background:
 This patch is a follow-on to "Info dump on Oops or panic()" [1].
 
 On some architectures, the kernel printed some information on the running
 kernel, but not on all architectures. The information printed was generally
 the version and build number, but it was not located in a consistant place,
 and some architectures did not print it at all.

Description:
 This patch uses the already-existing die_chain to print utsname information
 on Oops. This patch also removes the architecture-specific utsname
 printers. To avoid crashing the system further (and hence not printing the
 Oops) in the case where the system is so hopelessly smashed that utsname
 might be destroyed, we vsprintf the utsname data into a static buffer
 first, and then just print that on crash.

Testing:
 I wrote a module that does a *(int*)0 = 0; and observed that I got my
 utsname data printed.

Potential impact:
 This adds another line to the Oops output, causing the first few lines to
 potentially scroll off the screen. This also adds a few more pointer
 dereferences in the Oops path, because it adds to the die_chain notifier
 chain, reducing the likelihood that the Oops will be printed if there is
 very bad memory corruption.

Patch:
 This patch is against git 0471448f4d017470995d8a2272dc8c06dbed3b77.

Signed-Off-By: Joshua Wise <[EMAIL PROTECTED]>

[1] http://lkml.org/lkml/2007/6/28/316

--

diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 92b6162..13f342e 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -20,7 +20,6 @@ #include 
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 8423617..e4448f0 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -28,7 +28,6 @@ #include 
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
@@ -203,10 +202,8 @@ void __show_regs(struct pt_regs *regs)
unsigned long flags;
char buf[64];
 
-   printk("CPU: %d%s  (%s %.*s)\n",
-   smp_processor_id(), print_tainted(), init_utsname()->release,
-   (int)strcspn(init_utsname()->version, " "),
-   init_utsname()->version);
+   printk("CPU: %d%s\n",
+   smp_processor_id(), print_tainted());
print_symbol("PC is at %s\n", instruction_pointer(regs));
print_symbol("LR is at %s\n", regs->ARM_lr);
printk("pc : [<%08lx>]lr : [<%08lx>]psr: %08lx\n"
diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
index 06dfa65..1703243 100644
--- a/arch/i386/kernel/process.c
+++ b/arch/i386/kernel/process.c
@@ -27,7 +27,6 @@ #include 
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -308,10 +307,8 @@ void show_regs(struct pt_regs * regs)
 
if (user_mode_vm(regs))
printk(" ESP: %04x:%08lx",0x & regs->xss,regs->esp);
-   printk(" EFLAGS: %08lx%s  (%s %.*s)\n",
-  regs->eflags, print_tainted(), init_utsname()->release,
-  (int)strcspn(init_utsname()->version, " "),
-  init_utsname()->version);
+   printk(" EFLAGS: %08lx%s\n",
+  regs->eflags, print_tainted());
printk("EAX: %08lx EBX: %08lx ECX: %08lx EDX: %08lx\n",
regs->eax,regs->ebx,regs->ecx,regs->edx);
printk("ESI: %08lx EDI: %08lx EBP: %08lx",
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 6e2f035..04c4abd 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -32,7 +32,6 @@ #include 
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
@@ -414,8 +413,8 @@ void show_regs(struct pt_regs * regs)
 
printk("NIP: "REG" LR: "REG" CTR: "REG"\n",
   regs->nip, regs->link, regs->ctr);
-   printk("REGS: %p TRAP: %04lx   %s  (%s)\n",
-  regs, regs->trap, print_tainted(), init_utsname()->release);
+   printk("REGS: %p TRAP: %04lx   %s\n",
+  regs, regs->trap, print_tainted());
printk("MSR: "REG" ", regs->msr);
printbits(regs->msr, msr_bits);
printk("  CR: %08lx  XER: %08lx\n", regs->ccr, regs->xer);
diff --git a/arch/x86_64/kernel/process.c b/arch/x86_64/kernel/process.c
index 5909039..94fb7e3 100644
--- a/arch/x86_64/kernel/process.c
+++ b/arch/x86_64/kern

[PATCH] Print utsname on Oops on all architectures

2007-07-05 Thread Joshua Wise
Hopefully this patch has not been munged by pine; I have, minimally,
unchecked the mung-patches-sent-to-lkml option in pine's config. In the case
that it has been munged, I have also attached it.

--

From: Joshua Wise [EMAIL PROTECTED]

Background:
 This patch is a follow-on to Info dump on Oops or panic() [1].
 
 On some architectures, the kernel printed some information on the running
 kernel, but not on all architectures. The information printed was generally
 the version and build number, but it was not located in a consistant place,
 and some architectures did not print it at all.

Description:
 This patch uses the already-existing die_chain to print utsname information
 on Oops. This patch also removes the architecture-specific utsname
 printers. To avoid crashing the system further (and hence not printing the
 Oops) in the case where the system is so hopelessly smashed that utsname
 might be destroyed, we vsprintf the utsname data into a static buffer
 first, and then just print that on crash.

Testing:
 I wrote a module that does a *(int*)0 = 0; and observed that I got my
 utsname data printed.

Potential impact:
 This adds another line to the Oops output, causing the first few lines to
 potentially scroll off the screen. This also adds a few more pointer
 dereferences in the Oops path, because it adds to the die_chain notifier
 chain, reducing the likelihood that the Oops will be printed if there is
 very bad memory corruption.

Patch:
 This patch is against git 0471448f4d017470995d8a2272dc8c06dbed3b77.

Signed-Off-By: Joshua Wise [EMAIL PROTECTED]

[1] http://lkml.org/lkml/2007/6/28/316

--

diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 92b6162..13f342e 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -20,7 +20,6 @@ #include linux/ptrace.h
 #include linux/slab.h
 #include linux/user.h
 #include linux/a.out.h
-#include linux/utsname.h
 #include linux/time.h
 #include linux/major.h
 #include linux/stat.h
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 8423617..e4448f0 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -28,7 +28,6 @@ #include linux/cpu.h
 #include linux/elfcore.h
 #include linux/pm.h
 #include linux/tick.h
-#include linux/utsname.h
 
 #include asm/leds.h
 #include asm/processor.h
@@ -203,10 +202,8 @@ void __show_regs(struct pt_regs *regs)
unsigned long flags;
char buf[64];
 
-   printk(CPU: %d%s  (%s %.*s)\n,
-   smp_processor_id(), print_tainted(), init_utsname()-release,
-   (int)strcspn(init_utsname()-version,  ),
-   init_utsname()-version);
+   printk(CPU: %d%s\n,
+   smp_processor_id(), print_tainted());
print_symbol(PC is at %s\n, instruction_pointer(regs));
print_symbol(LR is at %s\n, regs-ARM_lr);
printk(pc : [%08lx]lr : [%08lx]psr: %08lx\n
diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
index 06dfa65..1703243 100644
--- a/arch/i386/kernel/process.c
+++ b/arch/i386/kernel/process.c
@@ -27,7 +27,6 @@ #include linux/vmalloc.h
 #include linux/user.h
 #include linux/a.out.h
 #include linux/interrupt.h
-#include linux/utsname.h
 #include linux/delay.h
 #include linux/reboot.h
 #include linux/init.h
@@ -308,10 +307,8 @@ void show_regs(struct pt_regs * regs)
 
if (user_mode_vm(regs))
printk( ESP: %04x:%08lx,0x  regs-xss,regs-esp);
-   printk( EFLAGS: %08lx%s  (%s %.*s)\n,
-  regs-eflags, print_tainted(), init_utsname()-release,
-  (int)strcspn(init_utsname()-version,  ),
-  init_utsname()-version);
+   printk( EFLAGS: %08lx%s\n,
+  regs-eflags, print_tainted());
printk(EAX: %08lx EBX: %08lx ECX: %08lx EDX: %08lx\n,
regs-eax,regs-ebx,regs-ecx,regs-edx);
printk(ESI: %08lx EDI: %08lx EBP: %08lx,
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 6e2f035..04c4abd 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -32,7 +32,6 @@ #include linux/module.h
 #include linux/kallsyms.h
 #include linux/mqueue.h
 #include linux/hardirq.h
-#include linux/utsname.h
 
 #include asm/pgtable.h
 #include asm/uaccess.h
@@ -414,8 +413,8 @@ void show_regs(struct pt_regs * regs)
 
printk(NIP: REG LR: REG CTR: REG\n,
   regs-nip, regs-link, regs-ctr);
-   printk(REGS: %p TRAP: %04lx   %s  (%s)\n,
-  regs, regs-trap, print_tainted(), init_utsname()-release);
+   printk(REGS: %p TRAP: %04lx   %s\n,
+  regs, regs-trap, print_tainted());
printk(MSR: REG , regs-msr);
printbits(regs-msr, msr_bits);
printk(  CR: %08lx  XER: %08lx\n, regs-ccr, regs-xer);
diff --git a/arch/x86_64/kernel/process.c b/arch/x86_64/kernel/process.c
index 5909039..94fb7e3 100644
--- a/arch/x86_64/kernel/process.c
+++ b/arch/x86_64/kernel

Re: [PATCH] [revised -- version 2] Info dump on Oops or panic()

2007-06-28 Thread Joshua Wise

On Thu, 28 Jun 2007, Andrew Morton wrote:

Your email client is doing space-stuffing.  It's easy enough to fix at this
end, but even easier if you fix it ;)


Aw darn :( Stupid PINE. I'll fix it for the next patch.


+   atomic_notifier_call_chain(_dumper_list, 0, NULL);

[...]

So...  Please consider abandoning the notifier-chain and just go for a
simple function call.



As we discovered some minutes ago, there appears to be infrastructure for this
already -- the die_chain. For thsoe of you who don't know, the die_chain
gets called from notify_die. It exists on almost all architectures. On i386,
it gets called from die() on arch/i386/kernel/traps.c:417 -- before
registers are scrolled by. So, if we want our own specific output there,
like utsname or uptime, then we can get it.

There exists a DIE_PANIC type on some architectures, but it's never actually
... used. So, I will probably write a patch to add it on all architectures,
and use it in the panic routines.


+ATOMIC_NOTIFIER_HEAD(info_dumper_list);
+EXPORT_SYMBOL(info_dumper_list);


That export isn't needed.


If I want someone else to access it, like a module, it is... But, I guess
if I wanted to act as per canon, I should just do register functions, and
export those. On the die_chain, those are EXPORT_SYMBOL_GPL, FWIW


Again, a deref of current->nsproxy->uts_ns->name at oops-time has risks.

This string could be precalculated, no?


Yes, and will be.


I don't know what to do here.  It will be hard to find a read-the-time
function which is a) lockless and b) available on all architectures and
configs.

If you can find a way to use plain old jiffies, that'd be good.


jiffies sounds good enough to me.

There seems to be some opposition to the utsname and uptime patches. I'll
take a look at those here and see what I can do to make those a little more
pleasing to non-Google users. Expect a patch for DIE_PANIC tomorrow...

joshua

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [revised -- version 2] Info dump on Oops or panic()

2007-06-28 Thread Joshua Wise

On Thu, 28 Jun 2007, Kyle McMartin wrote:

woah woah woah. This could push critical bits of the register dump or
stacktrace off the screen... Barring any other problems with the patch,
this should probably be dumped in the oops header, not trailing it where
it could hide critical debugging info.


Okay, fair enough. Fixed version follows. I also fixed some checkpatch
issues that I missed before.

However, please note that in general, the info dumped by this will consist
of only a few lines. In our implementations, I believe that we only dump one
line per notifier.

joshua

--
From: Joshua Wise <[EMAIL PROTECTED]>

Version: 2

Background:
 When managing a large number of servers, as Google does, it's sometimes
 useful to get an "at-a-glance" view of a machine when it crashes. When no
 other post-mortem is possible, it's often useful to know how long the
 machine has been powered on, which kernel it was running, and possibly
 other information that a driver may have logged. Up until now, there was no
 real way to do this other than to keep statistics outside to the machine.

Description:
 This patch adds a call chain to be invoked when the system oopses or
 panics. Traps have been added to i386 and x86_64 kernels, but it should be
 trivial to add support for more architectures. Two sample notifiers have
 been added -- an uptime printer, and a utsname printer for showing various
 statistics about the running system.

Remaining issues:
 * Is do_posix_clock_monotonic_gettime really the right API?
 * Should this be on by default, or a disablable Kconfig entry?

Testing:
 I accidentally built my testing kernel without the IDE drivers. It panic()ed
 immediately, and informed me that my uptime was 0.38 seconds.

Credits:
 This code was mainly written by Mike Waychison <[EMAIL PROTECTED]> and
 Masoud Sharbiani <[EMAIL PROTECTED]>.

Changelog:
 v2 -- fixed some checkpatch issues. Moved dumping to before Oopses, as per
 the suggestion of Kyle McMartin <[EMAIL PROTECTED]>.

Patch:
 This patch is against git 48d8d7ee5dd17c64833e0343ab4ae8ef01cc2648.

Signed-off-by: Joshua Wise <[EMAIL PROTECTED]>
Acked-by: Mike Waychison <[EMAIL PROTECTED]>

--

diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
index 06dfa65..f969c04 100644
--- a/arch/i386/kernel/process.c
+++ b/arch/i386/kernel/process.c
@@ -301,6 +301,8 @@ void show_regs(struct pt_regs * regs)
 {
unsigned long cr0 = 0L, cr2 = 0L, cr3 = 0L, cr4 = 0L;

+   atomic_notifier_call_chain(_dumper_list, 0, NULL);
+
printk("\n");
printk("Pid: %d, comm: %20s\n", current->pid, current->comm);
printk("EIP: %04x:[<%08lx>] CPU: %d\n",0x & regs->xcs,regs->eip, 
smp_processor_id());
diff --git a/arch/i386/kernel/traps.c b/arch/i386/kernel/traps.c
index 90da057..2ad7f8c 100644
--- a/arch/i386/kernel/traps.c
+++ b/arch/i386/kernel/traps.c
@@ -289,6 +289,9 @@ void show_registers(struct pt_regs *regs
ss = regs->xss & 0x;
}
print_modules();
+
+   atomic_notifier_call_chain(_dumper_list, 0, NULL);
+
printk(KERN_EMERG "CPU:%d\n"
KERN_EMERG "EIP:%04x:[<%08lx>]%s VLI\n"
KERN_EMERG "EFLAGS: %08lx   (%s %.*s)\n",
diff --git a/arch/x86_64/kernel/process.c b/arch/x86_64/kernel/process.c
index 5909039..e7f0298 100644
--- a/arch/x86_64/kernel/process.c
+++ b/arch/x86_64/kernel/process.c
@@ -353,6 +353,7 @@ void __show_regs(struct pt_regs * regs)

 void show_regs(struct pt_regs *regs)
 {
+   atomic_notifier_call_chain(_dumper_list, 0, NULL);
printk("CPU %d:", smp_processor_id());
__show_regs(regs);
show_trace(NULL, regs, (void *)(regs + 1));
diff --git a/arch/x86_64/kernel/traps.c b/arch/x86_64/kernel/traps.c
index aac1c0b..3d228d0 100644
--- a/arch/x86_64/kernel/traps.c
+++ b/arch/x86_64/kernel/traps.c
@@ -410,6 +410,8 @@ void show_registers(struct pt_regs *regs
const int cpu = smp_processor_id();
struct task_struct *cur = cpu_pda(cpu)->pcurrent;

+   atomic_notifier_call_chain(_dumper_list, 0, NULL);
+
rsp = regs->rsp;
printk("CPU %d ", cpu);
__show_regs(regs);
diff --git a/include/asm-blackfin/macros.h b/include/asm-blackfin/macros.h
deleted file mode 100644
index e69de29..000
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 7a48525..2058e58 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -104,6 +104,7 @@ #define abs(x) ({   \
})

 extern struct atomic_notifier_head panic_notifier_list;
+extern struct atomic_notifier_head info_dumper_list;
 extern long (*panic_blink)(long time);
 NORET_TYPE void panic(const char * fmt, ...)
__attribute__ ((NORET_AND format (printf, 1, 2)));
diff --git a/kernel/panic.c b/kernel/panic.c
index 6

[PATCH] Info dump on Oops or panic()

2007-06-28 Thread Joshua Wise

From: Joshua Wise <[EMAIL PROTECTED]>

Background:
 When managing a large number of servers, as Google does, it's sometimes
 useful to get an "at-a-glance" view of a machine when it crashes. When no
 other post-mortem is possible, it's often useful to know how long the
 machine has been powered on, which kernel it was running, and possibly
 other information that a driver may have logged. Up until now, there was no
 real way to do this other than to keep statistics outside to the machine.

Description:
 This patch adds a call chain to be invoked when the system oopses or
 panics. Traps have been added to i386 and x86_64 kernels, but it should be
 trivial to add support for more architectures. Two sample notifiers have
 been added -- an uptime printer, and a utsname printer for showing various
 statistics about the running system.

Remaining issues:
 * Is do_posix_clock_monotonic_gettime really the right API?
 * Should this be on by default, or a disablable Kconfig entry?

Testing:
 I accidentally built my testing kernel without the IDE drivers. It panic()ed
 immediately, and informed me that my uptime was 0.38 seconds.

Credits:
 This code was mainly written by Mike Waychison <[EMAIL PROTECTED]> and
 Masoud Sharbiani <[EMAIL PROTECTED]>.

Patch:
 This patch is against git 48d8d7ee5dd17c64833e0343ab4ae8ef01cc2648.

Signed-off-by: Joshua Wise <[EMAIL PROTECTED]>
Acked-by: Mike Waychison <[EMAIL PROTECTED]>

--

diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
index 06dfa65..42d4397 100644
--- a/arch/i386/kernel/process.c
+++ b/arch/i386/kernel/process.c
@@ -325,6 +325,8 @@ void show_regs(struct pt_regs * regs)
cr4 = read_cr4_safe();
printk("CR0: %08lx CR2: %08lx CR3: %08lx CR4: %08lx\n", cr0, cr2, cr3, 
cr4);
show_trace(NULL, regs, >esp);
+ 
+	atomic_notifier_call_chain(_dumper_list, 0, NULL);

 }

 /*
diff --git a/arch/i386/kernel/traps.c b/arch/i386/kernel/traps.c
index 90da057..16668c2 100644
--- a/arch/i386/kernel/traps.c
+++ b/arch/i386/kernel/traps.c
@@ -341,6 +341,8 @@ void show_registers(struct pt_regs *regs
}
}
printk("\n");
+ 
+	atomic_notifier_call_chain(_dumper_list, 0, NULL);

 }

 int is_valid_bugaddr(unsigned long eip)
diff --git a/arch/x86_64/kernel/process.c b/arch/x86_64/kernel/process.c
index 5909039..7b0619a 100644
--- a/arch/x86_64/kernel/process.c
+++ b/arch/x86_64/kernel/process.c
@@ -356,6 +356,7 @@ void show_regs(struct pt_regs *regs)
printk("CPU %d:", smp_processor_id());
__show_regs(regs);
show_trace(NULL, regs, (void *)(regs + 1));
+   atomic_notifier_call_chain(_dumper_list, 0, NULL);
 }

 /*
diff --git a/arch/x86_64/kernel/traps.c b/arch/x86_64/kernel/traps.c
index aac1c0b..e443613 100644
--- a/arch/x86_64/kernel/traps.c
+++ b/arch/x86_64/kernel/traps.c
@@ -439,6 +439,8 @@ bad:
}
}
printk("\n");
+ 
+	atomic_notifier_call_chain(_dumper_list, 0, NULL);

 }

 int is_valid_bugaddr(unsigned long rip)
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 7a48525..2058e58 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -104,6 +104,7 @@ #define abs(x) ({   \
})

 extern struct atomic_notifier_head panic_notifier_list;
+extern struct atomic_notifier_head info_dumper_list;
 extern long (*panic_blink)(long time);
 NORET_TYPE void panic(const char * fmt, ...)
__attribute__ ((NORET_AND format (printf, 1, 2)));
diff --git a/kernel/panic.c b/kernel/panic.c
index 623d182..8bae618 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -97,6 +97,7 @@ #ifdef CONFIG_SMP
 #endif

atomic_notifier_call_chain(_notifier_list, 0, buf);
+   atomic_notifier_call_chain(_dumper_list, 0, NULL);

if (!panic_blink)
panic_blink = no_blink;
diff --git a/kernel/sys.c b/kernel/sys.c
index 872271c..5bbe926 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -98,6 +98,39 @@ struct pid *cad_pid;
 EXPORT_SYMBOL(cad_pid);

 /*
+ * Notifier list for kernel code which wants to printk hardware specific
+ * information at Oops or panic time.
+ */
+
+ATOMIC_NOTIFIER_HEAD(info_dumper_list);
+EXPORT_SYMBOL(info_dumper_list);
+
+/*
+ * Dump out UTS info on oops / panic.
+ */
+ 
+static int dump_utsname(struct notifier_block *self, unsigned long v, void *p)

+{
+   printk ("%s %s %s %s %s %s\n",
+   utsname()->sysname,
+   utsname()->nodename,
+   utsname()->release,
+   utsname()->version,
+   utsname()->machine,
+   utsname()->domainname);
+   return 0;
+}
+
+static struct notifier_block utsname_notifier;
+
+static int __init register_utsname_dump(void) {
+   utsname_notifier.notifier_call = dump_utsname;
+   atomic_notifier_chain_register(_dumper_list, _notifier);
+   return

[PATCH] Info dump on Oops or panic()

2007-06-28 Thread Joshua Wise

From: Joshua Wise [EMAIL PROTECTED]

Background:
 When managing a large number of servers, as Google does, it's sometimes
 useful to get an at-a-glance view of a machine when it crashes. When no
 other post-mortem is possible, it's often useful to know how long the
 machine has been powered on, which kernel it was running, and possibly
 other information that a driver may have logged. Up until now, there was no
 real way to do this other than to keep statistics outside to the machine.

Description:
 This patch adds a call chain to be invoked when the system oopses or
 panics. Traps have been added to i386 and x86_64 kernels, but it should be
 trivial to add support for more architectures. Two sample notifiers have
 been added -- an uptime printer, and a utsname printer for showing various
 statistics about the running system.

Remaining issues:
 * Is do_posix_clock_monotonic_gettime really the right API?
 * Should this be on by default, or a disablable Kconfig entry?

Testing:
 I accidentally built my testing kernel without the IDE drivers. It panic()ed
 immediately, and informed me that my uptime was 0.38 seconds.

Credits:
 This code was mainly written by Mike Waychison [EMAIL PROTECTED] and
 Masoud Sharbiani [EMAIL PROTECTED].

Patch:
 This patch is against git 48d8d7ee5dd17c64833e0343ab4ae8ef01cc2648.

Signed-off-by: Joshua Wise [EMAIL PROTECTED]
Acked-by: Mike Waychison [EMAIL PROTECTED]

--

diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
index 06dfa65..42d4397 100644
--- a/arch/i386/kernel/process.c
+++ b/arch/i386/kernel/process.c
@@ -325,6 +325,8 @@ void show_regs(struct pt_regs * regs)
cr4 = read_cr4_safe();
printk(CR0: %08lx CR2: %08lx CR3: %08lx CR4: %08lx\n, cr0, cr2, cr3, 
cr4);
show_trace(NULL, regs, regs-esp);
+ 
+	atomic_notifier_call_chain(info_dumper_list, 0, NULL);

 }

 /*
diff --git a/arch/i386/kernel/traps.c b/arch/i386/kernel/traps.c
index 90da057..16668c2 100644
--- a/arch/i386/kernel/traps.c
+++ b/arch/i386/kernel/traps.c
@@ -341,6 +341,8 @@ void show_registers(struct pt_regs *regs
}
}
printk(\n);
+ 
+	atomic_notifier_call_chain(info_dumper_list, 0, NULL);

 }

 int is_valid_bugaddr(unsigned long eip)
diff --git a/arch/x86_64/kernel/process.c b/arch/x86_64/kernel/process.c
index 5909039..7b0619a 100644
--- a/arch/x86_64/kernel/process.c
+++ b/arch/x86_64/kernel/process.c
@@ -356,6 +356,7 @@ void show_regs(struct pt_regs *regs)
printk(CPU %d:, smp_processor_id());
__show_regs(regs);
show_trace(NULL, regs, (void *)(regs + 1));
+   atomic_notifier_call_chain(info_dumper_list, 0, NULL);
 }

 /*
diff --git a/arch/x86_64/kernel/traps.c b/arch/x86_64/kernel/traps.c
index aac1c0b..e443613 100644
--- a/arch/x86_64/kernel/traps.c
+++ b/arch/x86_64/kernel/traps.c
@@ -439,6 +439,8 @@ bad:
}
}
printk(\n);
+ 
+	atomic_notifier_call_chain(info_dumper_list, 0, NULL);

 }

 int is_valid_bugaddr(unsigned long rip)
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 7a48525..2058e58 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -104,6 +104,7 @@ #define abs(x) ({   \
})

 extern struct atomic_notifier_head panic_notifier_list;
+extern struct atomic_notifier_head info_dumper_list;
 extern long (*panic_blink)(long time);
 NORET_TYPE void panic(const char * fmt, ...)
__attribute__ ((NORET_AND format (printf, 1, 2)));
diff --git a/kernel/panic.c b/kernel/panic.c
index 623d182..8bae618 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -97,6 +97,7 @@ #ifdef CONFIG_SMP
 #endif

atomic_notifier_call_chain(panic_notifier_list, 0, buf);
+   atomic_notifier_call_chain(info_dumper_list, 0, NULL);

if (!panic_blink)
panic_blink = no_blink;
diff --git a/kernel/sys.c b/kernel/sys.c
index 872271c..5bbe926 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -98,6 +98,39 @@ struct pid *cad_pid;
 EXPORT_SYMBOL(cad_pid);

 /*
+ * Notifier list for kernel code which wants to printk hardware specific
+ * information at Oops or panic time.
+ */
+
+ATOMIC_NOTIFIER_HEAD(info_dumper_list);
+EXPORT_SYMBOL(info_dumper_list);
+
+/*
+ * Dump out UTS info on oops / panic.
+ */
+ 
+static int dump_utsname(struct notifier_block *self, unsigned long v, void *p)

+{
+   printk (%s %s %s %s %s %s\n,
+   utsname()-sysname,
+   utsname()-nodename,
+   utsname()-release,
+   utsname()-version,
+   utsname()-machine,
+   utsname()-domainname);
+   return 0;
+}
+
+static struct notifier_block utsname_notifier;
+
+static int __init register_utsname_dump(void) {
+   utsname_notifier.notifier_call = dump_utsname;
+   atomic_notifier_chain_register(info_dumper_list, utsname_notifier);
+   return 0;
+}
+__initcall(register_utsname_dump);
+
+/*
  * Notifier list for kernel code which wants

Re: [PATCH] [revised -- version 2] Info dump on Oops or panic()

2007-06-28 Thread Joshua Wise

On Thu, 28 Jun 2007, Kyle McMartin wrote:

woah woah woah. This could push critical bits of the register dump or
stacktrace off the screen... Barring any other problems with the patch,
this should probably be dumped in the oops header, not trailing it where
it could hide critical debugging info.


Okay, fair enough. Fixed version follows. I also fixed some checkpatch
issues that I missed before.

However, please note that in general, the info dumped by this will consist
of only a few lines. In our implementations, I believe that we only dump one
line per notifier.

joshua

--
From: Joshua Wise [EMAIL PROTECTED]

Version: 2

Background:
 When managing a large number of servers, as Google does, it's sometimes
 useful to get an at-a-glance view of a machine when it crashes. When no
 other post-mortem is possible, it's often useful to know how long the
 machine has been powered on, which kernel it was running, and possibly
 other information that a driver may have logged. Up until now, there was no
 real way to do this other than to keep statistics outside to the machine.

Description:
 This patch adds a call chain to be invoked when the system oopses or
 panics. Traps have been added to i386 and x86_64 kernels, but it should be
 trivial to add support for more architectures. Two sample notifiers have
 been added -- an uptime printer, and a utsname printer for showing various
 statistics about the running system.

Remaining issues:
 * Is do_posix_clock_monotonic_gettime really the right API?
 * Should this be on by default, or a disablable Kconfig entry?

Testing:
 I accidentally built my testing kernel without the IDE drivers. It panic()ed
 immediately, and informed me that my uptime was 0.38 seconds.

Credits:
 This code was mainly written by Mike Waychison [EMAIL PROTECTED] and
 Masoud Sharbiani [EMAIL PROTECTED].

Changelog:
 v2 -- fixed some checkpatch issues. Moved dumping to before Oopses, as per
 the suggestion of Kyle McMartin [EMAIL PROTECTED].

Patch:
 This patch is against git 48d8d7ee5dd17c64833e0343ab4ae8ef01cc2648.

Signed-off-by: Joshua Wise [EMAIL PROTECTED]
Acked-by: Mike Waychison [EMAIL PROTECTED]

--

diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
index 06dfa65..f969c04 100644
--- a/arch/i386/kernel/process.c
+++ b/arch/i386/kernel/process.c
@@ -301,6 +301,8 @@ void show_regs(struct pt_regs * regs)
 {
unsigned long cr0 = 0L, cr2 = 0L, cr3 = 0L, cr4 = 0L;

+   atomic_notifier_call_chain(info_dumper_list, 0, NULL);
+
printk(\n);
printk(Pid: %d, comm: %20s\n, current-pid, current-comm);
printk(EIP: %04x:[%08lx] CPU: %d\n,0x  regs-xcs,regs-eip, 
smp_processor_id());
diff --git a/arch/i386/kernel/traps.c b/arch/i386/kernel/traps.c
index 90da057..2ad7f8c 100644
--- a/arch/i386/kernel/traps.c
+++ b/arch/i386/kernel/traps.c
@@ -289,6 +289,9 @@ void show_registers(struct pt_regs *regs
ss = regs-xss  0x;
}
print_modules();
+
+   atomic_notifier_call_chain(info_dumper_list, 0, NULL);
+
printk(KERN_EMERG CPU:%d\n
KERN_EMERG EIP:%04x:[%08lx]%s VLI\n
KERN_EMERG EFLAGS: %08lx   (%s %.*s)\n,
diff --git a/arch/x86_64/kernel/process.c b/arch/x86_64/kernel/process.c
index 5909039..e7f0298 100644
--- a/arch/x86_64/kernel/process.c
+++ b/arch/x86_64/kernel/process.c
@@ -353,6 +353,7 @@ void __show_regs(struct pt_regs * regs)

 void show_regs(struct pt_regs *regs)
 {
+   atomic_notifier_call_chain(info_dumper_list, 0, NULL);
printk(CPU %d:, smp_processor_id());
__show_regs(regs);
show_trace(NULL, regs, (void *)(regs + 1));
diff --git a/arch/x86_64/kernel/traps.c b/arch/x86_64/kernel/traps.c
index aac1c0b..3d228d0 100644
--- a/arch/x86_64/kernel/traps.c
+++ b/arch/x86_64/kernel/traps.c
@@ -410,6 +410,8 @@ void show_registers(struct pt_regs *regs
const int cpu = smp_processor_id();
struct task_struct *cur = cpu_pda(cpu)-pcurrent;

+   atomic_notifier_call_chain(info_dumper_list, 0, NULL);
+
rsp = regs-rsp;
printk(CPU %d , cpu);
__show_regs(regs);
diff --git a/include/asm-blackfin/macros.h b/include/asm-blackfin/macros.h
deleted file mode 100644
index e69de29..000
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 7a48525..2058e58 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -104,6 +104,7 @@ #define abs(x) ({   \
})

 extern struct atomic_notifier_head panic_notifier_list;
+extern struct atomic_notifier_head info_dumper_list;
 extern long (*panic_blink)(long time);
 NORET_TYPE void panic(const char * fmt, ...)
__attribute__ ((NORET_AND format (printf, 1, 2)));
diff --git a/kernel/panic.c b/kernel/panic.c
index 623d182..8bae618 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -97,6 +97,7 @@ #ifdef CONFIG_SMP
 #endif

atomic_notifier_call_chain(panic_notifier_list, 0, buf

Re: [PATCH] [revised -- version 2] Info dump on Oops or panic()

2007-06-28 Thread Joshua Wise

On Thu, 28 Jun 2007, Andrew Morton wrote:

Your email client is doing space-stuffing.  It's easy enough to fix at this
end, but even easier if you fix it ;)


Aw darn :( Stupid PINE. I'll fix it for the next patch.


+   atomic_notifier_call_chain(info_dumper_list, 0, NULL);

[...]

So...  Please consider abandoning the notifier-chain and just go for a
simple function call.



As we discovered some minutes ago, there appears to be infrastructure for this
already -- the die_chain. For thsoe of you who don't know, the die_chain
gets called from notify_die. It exists on almost all architectures. On i386,
it gets called from die() on arch/i386/kernel/traps.c:417 -- before
registers are scrolled by. So, if we want our own specific output there,
like utsname or uptime, then we can get it.

There exists a DIE_PANIC type on some architectures, but it's never actually
... used. So, I will probably write a patch to add it on all architectures,
and use it in the panic routines.


+ATOMIC_NOTIFIER_HEAD(info_dumper_list);
+EXPORT_SYMBOL(info_dumper_list);


That export isn't needed.


If I want someone else to access it, like a module, it is... But, I guess
if I wanted to act as per canon, I should just do register functions, and
export those. On the die_chain, those are EXPORT_SYMBOL_GPL, FWIW


Again, a deref of current-nsproxy-uts_ns-name at oops-time has risks.

This string could be precalculated, no?


Yes, and will be.


I don't know what to do here.  It will be hard to find a read-the-time
function which is a) lockless and b) available on all architectures and
configs.

If you can find a way to use plain old jiffies, that'd be good.


jiffies sounds good enough to me.

There seems to be some opposition to the utsname and uptime patches. I'll
take a look at those here and see what I can do to make those a little more
pleasing to non-Google users. Expect a patch for DIE_PANIC tomorrow...

joshua

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] x86_64: Fix misplaced `continue' in mce.c

2007-06-21 Thread Joshua Wise

From: Joshua Wise <[EMAIL PROTECTED]>

Background:
 When a userspace application wants to know about machine check events, it
 opens /dev/mcelog and does a read(). Usually, we found that this interface
 works well, but in some cases, when the system was taking large numbers of
 machine check exceptions, the read() would hang. The system would output a
 soft-lockup warning, and the daemon reading from /dev/mcelog would suck up
 as much of a single CPU as it could spinning in system space.

Description:
 This patch fixes this bug. In particular, there was a "continue" inside a
 timeout loop that presumably was intended to break out of the outer loop,
 but instead caused the inner loop to continue. This patch also makes the
 condition for the break-out a little more evident by changing a
 !time_before to a time_after_eq.

Result:
 The read() no longer hangs in this test case.

Testing:
 On my system, I could replicate the bug with the following command:
   # for i in `seq 15000`; do ./inject_sbe.sh; done
 where inject_sbe.sh contains commands to inject a single-bit error into the
 next memory write transaction.

Patch:
 This patch is against git f1518a088bde6aea49e7c472ed6ab96178fcba3e.

Signed-off-by: Joshua Wise <[EMAIL PROTECTED]>
Signed-off-by: Tim Hockin <[EMAIL PROTECTED]>

--

diff --git a/arch/x86_64/kernel/mce.c b/arch/x86_64/kernel/mce.c
index a14375d..aa83023 100644
--- a/arch/x86_64/kernel/mce.c
+++ b/arch/x86_64/kernel/mce.c
@@ -497,15 +497,17 @@ static ssize_t mce_read(struct file *fil
for (i = 0; i < next; i++) {
unsigned long start = jiffies;
while (!mcelog.entry[i].finished) {
-   if (!time_before(jiffies, start + 2)) {
+   if (time_after_eq(jiffies, start + 2)) {
memset(mcelog.entry + i,0, sizeof(struct mce));
-   continue;
+   goto timeout;
}
cpu_relax();
}
smp_rmb();
err |= copy_to_user(buf, mcelog.entry + i, sizeof(struct mce));
 		buf += sizeof(struct mce); 
+	 timeout:

+   ;
}

memset(mcelog.entry, 0, next * sizeof(struct mce));
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] x86_64: Fix misplaced `continue' in mce.c

2007-06-21 Thread Joshua Wise

From: Joshua Wise [EMAIL PROTECTED]

Background:
 When a userspace application wants to know about machine check events, it
 opens /dev/mcelog and does a read(). Usually, we found that this interface
 works well, but in some cases, when the system was taking large numbers of
 machine check exceptions, the read() would hang. The system would output a
 soft-lockup warning, and the daemon reading from /dev/mcelog would suck up
 as much of a single CPU as it could spinning in system space.

Description:
 This patch fixes this bug. In particular, there was a continue inside a
 timeout loop that presumably was intended to break out of the outer loop,
 but instead caused the inner loop to continue. This patch also makes the
 condition for the break-out a little more evident by changing a
 !time_before to a time_after_eq.

Result:
 The read() no longer hangs in this test case.

Testing:
 On my system, I could replicate the bug with the following command:
   # for i in `seq 15000`; do ./inject_sbe.sh; done
 where inject_sbe.sh contains commands to inject a single-bit error into the
 next memory write transaction.

Patch:
 This patch is against git f1518a088bde6aea49e7c472ed6ab96178fcba3e.

Signed-off-by: Joshua Wise [EMAIL PROTECTED]
Signed-off-by: Tim Hockin [EMAIL PROTECTED]

--

diff --git a/arch/x86_64/kernel/mce.c b/arch/x86_64/kernel/mce.c
index a14375d..aa83023 100644
--- a/arch/x86_64/kernel/mce.c
+++ b/arch/x86_64/kernel/mce.c
@@ -497,15 +497,17 @@ static ssize_t mce_read(struct file *fil
for (i = 0; i  next; i++) {
unsigned long start = jiffies;
while (!mcelog.entry[i].finished) {
-   if (!time_before(jiffies, start + 2)) {
+   if (time_after_eq(jiffies, start + 2)) {
memset(mcelog.entry + i,0, sizeof(struct mce));
-   continue;
+   goto timeout;
}
cpu_relax();
}
smp_rmb();
err |= copy_to_user(buf, mcelog.entry + i, sizeof(struct mce));
 		buf += sizeof(struct mce); 
+	 timeout:

+   ;
}

memset(mcelog.entry, 0, next * sizeof(struct mce));
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPU_IDLE prevents resuming from STR [was: Re: 2.6.21-rc6-mm1]

2007-04-18 Thread Joshua Wise

On Tue, 17 Apr 2007, Shaohua Li wrote:

Looks there is init order issue of sysfs files. The new refreshed patch
should fix your bug.


Yes, that did fix the hang on resume from STR -- that now works fine.

However:
[EMAIL PROTECTED]:/sys/devices/system/cpu/cpuidle$ cat available_drivers 
current_driver


[EMAIL PROTECTED]:/sys/devices/system/cpu/cpuidle$ cat available_governors 
current_governor
ladder
ladder

Is this correct? For reference, my config is http://joshuawise.com/config.gz
-- I didn't see any options for cpuidle drivers to access ACPI states...

joshua
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPU_IDLE prevents resuming from STR [was: Re: 2.6.21-rc6-mm1]

2007-04-18 Thread Joshua Wise

On Tue, 17 Apr 2007, Shaohua Li wrote:

Looks there is init order issue of sysfs files. The new refreshed patch
should fix your bug.


Yes, that did fix the hang on resume from STR -- that now works fine.

However:
[EMAIL PROTECTED]:/sys/devices/system/cpu/cpuidle$ cat available_drivers 
current_driver

NULL
[EMAIL PROTECTED]:/sys/devices/system/cpu/cpuidle$ cat available_governors 
current_governor
ladder
ladder

Is this correct? For reference, my config is http://joshuawise.com/config.gz
-- I didn't see any options for cpuidle drivers to access ACPI states...

joshua
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPU_IDLE prevents resuming from STR [was: Re: 2.6.21-rc6-mm1]

2007-04-16 Thread Joshua Wise

On Mon, 16 Apr 2007, Shaohua Li wrote:

On Sat, 2007-04-14 at 01:45 +0200, Mattia Dongili wrote:

...

please check if the patch at
http://marc.info/?l=linux-acpi=117523651630038=2 fixed the issue


I have the same system as Mattia, and when I applied this patch and turned
CPU_IDLE back on, I got a panic on boot. Unfortunately, the EIP scrolled off
screen, so I can't get a line number.

(I had the same STR breakage as him; STR did not work with CPU_IDLE turned
on, and it did work with CPU_IDLE turned off.)

I'm running +rc6+mm(April 11) on a Sony VAIO SZ.

joshua
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CPU_IDLE prevents resuming from STR [was: Re: 2.6.21-rc6-mm1]

2007-04-16 Thread Joshua Wise

On Mon, 16 Apr 2007, Shaohua Li wrote:

On Sat, 2007-04-14 at 01:45 +0200, Mattia Dongili wrote:

...

please check if the patch at
http://marc.info/?l=linux-acpim=117523651630038w=2 fixed the issue


I have the same system as Mattia, and when I applied this patch and turned
CPU_IDLE back on, I got a panic on boot. Unfortunately, the EIP scrolled off
screen, so I can't get a line number.

(I had the same STR breakage as him; STR did not work with CPU_IDLE turned
on, and it did work with CPU_IDLE turned off.)

I'm running +rc6+mm(April 11) on a Sony VAIO SZ.

joshua
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NAPI poll routine happens in interrupt context?

2005-08-17 Thread Joshua Wise
On Wednesday 17 August 2005 12:43, Stephen Hemminger wrote:
> You will get more response to network issues on netdev@vger.kernel.org
Okay. Thanks.

> NAPI poll is usually called from softirq context.  This means that
> hardware interrupts are enabled, but it is not in a thread context that
> can sleep.
Okay. I wasn't aware of quite how it was "supposed" to be.

> You shouldn't be calling things that could sleep! If you are it
> is a bug.
I guess I'd better track down this bug, then :)

> Harald Welte is working on a generic virtual Ethernet device, perhaps
> you could collaborate with him.
I assume he is on this mailing list?

> The bug is that ipv6 is doing an operation to handle MIB statistics and
> the MIPS architecture math routines seem to need to sleep.
> Previous versions of SNMP code may have done atomic operations, but
> current 2.6 code uses per-cpu variables.
> Also, there is no might sleep in the current 2.6 MIPS code either
> so the problem is probably fixed if you use current 2.6.12 or later
> kernel.
Hm -- I am using 2.6.13-rc2.
Here is a new trace, showing the same issue with IPv4:

Debug: sleeping function called from invalid context at 
arch/mips/math-emu/dsemul.c:137 
in_atomic():1, irqs_disabled():0
Call Trace:
 [] __might_sleep+0x180/0x198 (kernel/sched.c:5223)
 [] mipsIRQ+0x130/0x1e0 (arch/mips/sc1000/mipsIRQ.S:95)
 [] ip_rcv+0x9c/0x7b0 (net/ipv4/ip_input.c:381)
 [] do_dsemulret+0x68/0x1a0 
(arch/mips/math-emu/dsemul.c:137)
 [] do_ade+0x24/0x550 (arch/mips/kernel/unaligned.c:506)
 [] handle_adel_int+0x3c/0x58 (arch/mips/kernel/genex.S:281)
 [] netif_receive_skb+0x1b0/0x2e0 (net/core/dev.c:1646)
 [] ip_rcv+0xa0/0x7b0 (net/ipv4/ip_input.c:394)
 [] printk+0x2c/0x38 (kernel/printk.c:515)
 [] netif_receive_skb+0x1b0/0x2e0 (net/core/dev.c:1646)
 [] lanlan_poll+0x3e0/0x440 (drivers/net/lanlan.c:246)
etc, etc.

CC:'ing to linux-mips for obvious reasons. This seems to stem from an 
unaligned access. If this is no longer appropriate for linux-kernel, feel 
free to stop CCing to there, and I will follow.

Thanks,
joshua
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NAPI poll routine happens in interrupt context?

2005-08-17 Thread Joshua Wise
It has come to my attention that the link that I posted previously was 
nonfunctional. It has been fixed.

As well, here are some other pertinent details:

This is kernel 2.6.13-rc2, the latest that works with MIPS SMP.

Here is a trace:
Debug: sleeping function called from invalid context at 
arch/mips/math-emu/dsemul.c:137
in_atomic():1, irqs_disabled():0
Call Trace:
 [] __might_sleep+0x180/0x198
 [] ipv6_rcv+0xc0/0x440
 [] do_dsemulret+0x68/0x1a0
 [] do_ade+0x24/0x550
 [] handle_adel_int+0x3c/0x58
 [] netif_receive_skb+0x1b0/0x2e0
 [] ipv6_rcv+0xc4/0x440
 [] netif_receive_skb+0x1b0/0x2e0
 [] lanlan_poll+0x3e0/0x440
 [] net_rx_action+0x16c/0x370
 [] net_rx_action+0x188/0x370
 [] __do_softirq+0x118/0x250
 [] __do_softirq+0x118/0x250
 [] do_softirq+0xb0/0xe0
 [] mipsIRQ+0x130/0x1e0
 [] r4k_wait+0x0/0x10
 [] cpu_idle+0x4c/0x68
 [] cpu_idle+0x44/0x68
 [] start_kernel+0x454/0x4e8
 [] start_kernel+0x44c/0x4e8

Apologies for any inconvenience.

joshua

On Wednesday 17 August 2005 09:32, Joshua Wise wrote:
> Hello LKML,
>
> I have recently been working on a network driver for an emulated
> ultra-simple network card, and I've run into a few snags with the NAPI. My
> current issue is that it seems to me that my poll routine is being called
> from an atomic context, so when poll calls rx, and rx calls
> netif_receive_skb, I end up with lots of __might_sleep warnings in the
> various network layers.
>
> This is not so good. I need every cycle I can get, as this emulator is
> incredibly slow, so burning cycles by printing out the reported badness is
> not really acceptible. Conceivably the badness itself is also an issue.
>
> Before posting here, I did search Google for "lkml napi poll interrupt",
> although I did not find anything relevant to my issue.
>
> If interested, the code is available at http://joshuawise.com/lanlan.c .
> Some notes:
>
> The virtual lan-lan is a very very simple device. It consists of an ioreg
> that maintains state of the device, as described by the ioreg bit defines.
> It also has an ioctlreg that can pass through ioctls to the Linux kernel
> tap device that it's sitting on top of. (This goes with the ifreq seen in
> the struct.) One must always write and read in word-aligned chunks to and
> from it, for simplicity's sake.
>
> Feel free to suggest any modifications that this device might need to make
> it more fully functional. Hopefully we can bring this driver to such a
> state where it will be usable as a replacement skeleton driver for the
> NAPI.
>
> Please cc: Aaron and myself, as neither of us are subscribed to lkml.
>
> Thanks in advance,
> joshua
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


NAPI poll routine happens in interrupt context?

2005-08-17 Thread Joshua Wise
Hello LKML,

I have recently been working on a network driver for an emulated ultra-simple 
network card, and I've run into a few snags with the NAPI. My current issue 
is that it seems to me that my poll routine is being called from an atomic 
context, so when poll calls rx, and rx calls netif_receive_skb, I end up with 
lots of __might_sleep warnings in the various network layers.

This is not so good. I need every cycle I can get, as this emulator is 
incredibly slow, so burning cycles by printing out the reported badness is 
not really acceptible. Conceivably the badness itself is also an issue.

Before posting here, I did search Google for "lkml napi poll interrupt", 
although I did not find anything relevant to my issue.

If interested, the code is available at http://joshuawise.com/lanlan.c . Some 
notes:

The virtual lan-lan is a very very simple device. It consists of an ioreg that 
maintains state of the device, as described by the ioreg bit defines. It also 
has an ioctlreg that can pass through ioctls to the Linux kernel tap device 
that it's sitting on top of. (This goes with the ifreq seen in the struct.) 
One must always write and read in word-aligned chunks to and from it, for 
simplicity's sake.

Feel free to suggest any modifications that this device might need to make it 
more fully functional. Hopefully we can bring this driver to such a state 
where it will be usable as a replacement skeleton driver for the NAPI.

Please cc: Aaron and myself, as neither of us are subscribed to lkml.

Thanks in advance,
joshua
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


NAPI poll routine happens in interrupt context?

2005-08-17 Thread Joshua Wise
Hello LKML,

I have recently been working on a network driver for an emulated ultra-simple 
network card, and I've run into a few snags with the NAPI. My current issue 
is that it seems to me that my poll routine is being called from an atomic 
context, so when poll calls rx, and rx calls netif_receive_skb, I end up with 
lots of __might_sleep warnings in the various network layers.

This is not so good. I need every cycle I can get, as this emulator is 
incredibly slow, so burning cycles by printing out the reported badness is 
not really acceptible. Conceivably the badness itself is also an issue.

Before posting here, I did search Google for lkml napi poll interrupt, 
although I did not find anything relevant to my issue.

If interested, the code is available at http://joshuawise.com/lanlan.c . Some 
notes:

The virtual lan-lan is a very very simple device. It consists of an ioreg that 
maintains state of the device, as described by the ioreg bit defines. It also 
has an ioctlreg that can pass through ioctls to the Linux kernel tap device 
that it's sitting on top of. (This goes with the ifreq seen in the struct.) 
One must always write and read in word-aligned chunks to and from it, for 
simplicity's sake.

Feel free to suggest any modifications that this device might need to make it 
more fully functional. Hopefully we can bring this driver to such a state 
where it will be usable as a replacement skeleton driver for the NAPI.

Please cc: Aaron and myself, as neither of us are subscribed to lkml.

Thanks in advance,
joshua
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NAPI poll routine happens in interrupt context?

2005-08-17 Thread Joshua Wise
It has come to my attention that the link that I posted previously was 
nonfunctional. It has been fixed.

As well, here are some other pertinent details:

This is kernel 2.6.13-rc2, the latest that works with MIPS SMP.

Here is a trace:
Debug: sleeping function called from invalid context at 
arch/mips/math-emu/dsemul.c:137
in_atomic():1, irqs_disabled():0
Call Trace:
 [801406e0] __might_sleep+0x180/0x198
 [802cec00] ipv6_rcv+0xc0/0x440
 [80140428] do_dsemulret+0x68/0x1a0
 [8010b3a4] do_ade+0x24/0x550
 [80102964] handle_adel_int+0x3c/0x58
 [80268160] netif_receive_skb+0x1b0/0x2e0
 [802cec04] ipv6_rcv+0xc4/0x440
 [80268160] netif_receive_skb+0x1b0/0x2e0
 [802572c8] lanlan_poll+0x3e0/0x440
 [8026868c] net_rx_action+0x16c/0x370
 [802686a8] net_rx_action+0x188/0x370
 [80154f28] __do_softirq+0x118/0x250
 [80154f28] __do_softirq+0x118/0x250
 [80155110] do_softirq+0xb0/0xe0
 [80101930] mipsIRQ+0x130/0x1e0
 [80101c90] r4k_wait+0x0/0x10
 [80103e6c] cpu_idle+0x4c/0x68
 [80103e64] cpu_idle+0x44/0x68
 [8037fcfc] start_kernel+0x454/0x4e8
 [8037fcf4] start_kernel+0x44c/0x4e8

Apologies for any inconvenience.

joshua

On Wednesday 17 August 2005 09:32, Joshua Wise wrote:
 Hello LKML,

 I have recently been working on a network driver for an emulated
 ultra-simple network card, and I've run into a few snags with the NAPI. My
 current issue is that it seems to me that my poll routine is being called
 from an atomic context, so when poll calls rx, and rx calls
 netif_receive_skb, I end up with lots of __might_sleep warnings in the
 various network layers.

 This is not so good. I need every cycle I can get, as this emulator is
 incredibly slow, so burning cycles by printing out the reported badness is
 not really acceptible. Conceivably the badness itself is also an issue.

 Before posting here, I did search Google for lkml napi poll interrupt,
 although I did not find anything relevant to my issue.

 If interested, the code is available at http://joshuawise.com/lanlan.c .
 Some notes:

 The virtual lan-lan is a very very simple device. It consists of an ioreg
 that maintains state of the device, as described by the ioreg bit defines.
 It also has an ioctlreg that can pass through ioctls to the Linux kernel
 tap device that it's sitting on top of. (This goes with the ifreq seen in
 the struct.) One must always write and read in word-aligned chunks to and
 from it, for simplicity's sake.

 Feel free to suggest any modifications that this device might need to make
 it more fully functional. Hopefully we can bring this driver to such a
 state where it will be usable as a replacement skeleton driver for the
 NAPI.

 Please cc: Aaron and myself, as neither of us are subscribed to lkml.

 Thanks in advance,
 joshua
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NAPI poll routine happens in interrupt context?

2005-08-17 Thread Joshua Wise
On Wednesday 17 August 2005 12:43, Stephen Hemminger wrote:
 You will get more response to network issues on netdev@vger.kernel.org
Okay. Thanks.

 NAPI poll is usually called from softirq context.  This means that
 hardware interrupts are enabled, but it is not in a thread context that
 can sleep.
Okay. I wasn't aware of quite how it was supposed to be.

 You shouldn't be calling things that could sleep! If you are it
 is a bug.
I guess I'd better track down this bug, then :)

 Harald Welte is working on a generic virtual Ethernet device, perhaps
 you could collaborate with him.
I assume he is on this mailing list?

 The bug is that ipv6 is doing an operation to handle MIB statistics and
 the MIPS architecture math routines seem to need to sleep.
 Previous versions of SNMP code may have done atomic operations, but
 current 2.6 code uses per-cpu variables.
 Also, there is no might sleep in the current 2.6 MIPS code either
 so the problem is probably fixed if you use current 2.6.12 or later
 kernel.
Hm -- I am using 2.6.13-rc2.
Here is a new trace, showing the same issue with IPv4:

Debug: sleeping function called from invalid context at 
arch/mips/math-emu/dsemul.c:137 
in_atomic():1, irqs_disabled():0
Call Trace:
 [801406e0] __might_sleep+0x180/0x198 (kernel/sched.c:5223)
 [80101930] mipsIRQ+0x130/0x1e0 (arch/mips/sc1000/mipsIRQ.S:95)
 [802860fc] ip_rcv+0x9c/0x7b0 (net/ipv4/ip_input.c:381)
 [80140428] do_dsemulret+0x68/0x1a0 
(arch/mips/math-emu/dsemul.c:137)
 [8010b3a4] do_ade+0x24/0x550 (arch/mips/kernel/unaligned.c:506)
 [80102964] handle_adel_int+0x3c/0x58 (arch/mips/kernel/genex.S:281)
 [80268260] netif_receive_skb+0x1b0/0x2e0 (net/core/dev.c:1646)
 [80286100] ip_rcv+0xa0/0x7b0 (net/ipv4/ip_input.c:394)
 [8014da5c] printk+0x2c/0x38 (kernel/printk.c:515)
 [80268260] netif_receive_skb+0x1b0/0x2e0 (net/core/dev.c:1646)
 [802573c8] lanlan_poll+0x3e0/0x440 (drivers/net/lanlan.c:246)
etc, etc.

CC:'ing to linux-mips for obvious reasons. This seems to stem from an 
unaligned access. If this is no longer appropriate for linux-kernel, feel 
free to stop CCing to there, and I will follow.

Thanks,
joshua
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/