Re: [patch] Change softlockup trigger limit using a kernel parameter

2007-07-19 Thread Ravikiran G Thirumalai
On Wed, Jul 18, 2007 at 11:09:07PM -0700, Andrew Morton wrote:
>On Wed, 18 Jul 2007 22:41:21 -0700 Ravikiran G Thirumalai <[EMAIL PROTECTED]> 
>wrote:
>
>> On Wed, Jul 18, 2007 at 04:08:58PM -0700, Andrew Morton wrote:
>> > On Mon, 16 Jul 2007 15:26:50 -0700
>> > Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>> >
>> > > Kernel warns of softlockups if the softlockup thread is not able to run
>> > > on a CPU for 10s.  It is useful to lower the softlockup warning
>> > > threshold in testing environments to catch potential lockups early.
>> > > Following patch adds a kernel parameter 'softlockup_lim' to control
>> > > the softlockup threshold.
>> > >
>> >
>> > Why not make it tunable at runtime?
>> 
>> Sure! Like a sysctl?
>> 

Hi Andrew,
Here's another take, incorporating all of your comments.

Thanks,
Kiran


Control the trigger limit for softlockup warnings.  This is useful for
debugging softlockups, by lowering the softlockup_thresh to identify
possible softlockups earlier.

This patch:
1. Adds a sysctl softlockup_thresh with valid values of 1-60s
   (Higher value to disable false positives)
2. Changes the softlockup printk to print the cpu softlockup time

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

---
Patch applies on top of Ingo's softlockup patches

 Documentation/sysctl/kernel.txt |8 
 include/linux/sched.h   |1 +
 kernel/softlockup.c |8 +---
 kernel/sysctl.c |   25 +++--
 4 files changed, 33 insertions(+), 9 deletions(-)

Index: linux-2.6.22/kernel/softlockup.c
===
--- linux-2.6.22.orig/kernel/softlockup.c   2007-07-18 11:15:18.506614500 
-0700
+++ linux-2.6.22/kernel/softlockup.c2007-07-18 21:39:20.498592750 -0700
@@ -23,6 +23,7 @@ static DEFINE_PER_CPU(unsigned long, pri
 static DEFINE_PER_CPU(struct task_struct *, watchdog_task);
 
 static int did_panic;
+int softlockup_thresh = 10;
 
 static int
 softlock_panic(struct notifier_block *this, unsigned long event, void *ptr)
@@ -101,7 +102,7 @@ void softlockup_tick(void)
wake_up_process(per_cpu(watchdog_task, this_cpu));
 
/* Warn about unreasonable 10+ seconds delays: */
-   if (now <= (touch_timestamp + 10))
+   if (now <= (touch_timestamp + softlockup_thresh))
return;
 
regs = get_irq_regs();
@@ -109,8 +110,9 @@ void softlockup_tick(void)
per_cpu(print_timestamp, this_cpu) = touch_timestamp;
 
spin_lock(_lock);
-   printk(KERN_ERR "BUG: soft lockup detected on CPU#%d! [%s:%d]\n",
-   this_cpu, current->comm, current->pid);
+   printk(KERN_ERR "BUG: soft lockup - CPU#%d stuck for %lus! [%s:%d]\n",
+   this_cpu, now - touch_timestamp,
+   current->comm, current->pid);
if (regs)
show_regs(regs);
else
Index: linux-2.6.22/kernel/sysctl.c
===
--- linux-2.6.22.orig/kernel/sysctl.c   2007-07-08 16:32:17.0 -0700
+++ linux-2.6.22/kernel/sysctl.c2007-07-19 13:42:50.275478000 -0700
@@ -79,6 +79,12 @@ extern int compat_log;
 extern int maps_protect;
 extern int sysctl_stat_interval;
 
+/* Constants used for minimum and  maximum */
+static int one = 1;
+static int zero;
+static int sixty = 60;
+static int one_hundred = 100;
+
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
 static int minolduid;
@@ -615,16 +621,23 @@ static ctl_table kern_table[] = {
.proc_handler   = _dointvec,
},
 #endif
+#ifdef CONFIG_DETECT_SOFTLOCKUP
+   {
+   .ctl_name   = CTL_UNNUMBERED,
+   .procname   = "softlockup_thresh",
+   .data   = _thresh,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = _dointvec_minmax,
+   .strategy   = _intvec,
+   .extra1 = ,
+   .extra2 = ,
+   },
+#endif
 
{ .ctl_name = 0 }
 };
 
-/* Constants for minimum and maximum testing in vm_table.
-   We use these as one-element integer vectors. */
-static int zero;
-static int one_hundred = 100;
-
-
 static ctl_table vm_table[] = {
{
.ctl_name   = VM_OVERCOMMIT_MEMORY,
Index: linux-2.6.22/Documentation/sysctl/kernel.txt
===
--- linux-2.6.22.orig/Documentation/sysctl/kernel.txt   2007-07-08 
16:32:17.0 -0700
+++ linux-2.6.22/Documentation/sysctl/kernel.txt2007-07-19 
13:42:09.748945250 -0700
@@ -320,6 +320,14 @@ kernel.  This value defaults to SHMMAX.
 
 ==
 
+softlockup_thresh:
+
+This value can be 

Re: [patch] Change softlockup trigger limit using a kernel parameter

2007-07-19 Thread Ravikiran G Thirumalai
On Thu, Jul 19, 2007 at 11:51:14AM -0700, Jeremy Fitzhardinge wrote:
>Ingo Molnar wrote:
>> just in case someone sees false positives and wants to turn it off.
>
>Why not make 0=off?

A patch to disable softlockup during boot already went in.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=97842216b8400fe9d1a20468959e2989180f8f79

It uses kernel boot parameter to disable softlockup, not exactly disabling
softlockup at run time though.  Do we still need 0=off?

Thanks,
Kiran
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Change softlockup trigger limit using a kernel parameter

2007-07-19 Thread Jeremy Fitzhardinge
Ingo Molnar wrote:
> just in case someone sees false positives and wants to turn it off.

Why not make 0=off?

J
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Change softlockup trigger limit using a kernel parameter

2007-07-19 Thread Ingo Molnar

* Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:

> > also, i think the valid threshold should be between 1 and 60 seconds 
> > i think.
> 
> 60 seconds!  Is that not a pretty high threshold?  The reason for 
> lowering the tolerance threshold from 10s is to catch bugs early in 
> lab environments, but why do we need to raise the tolerance thresh 
> beyond 10s?

just in case someone sees false positives and wants to turn it off.

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Change softlockup trigger limit using a kernel parameter

2007-07-19 Thread Ravikiran G Thirumalai
On Thu, Jul 19, 2007 at 11:11:42AM +0200, Ingo Molnar wrote:
>
>* Andrew Morton <[EMAIL PROTECTED]> wrote:
>
>> > +softlockup_thresh:
>> > +
>> > +This value can be used to lower the softlockup tolerance
>> > +threshold. The default threshold is 10s.  If a cpu is locked up
>> > +for 10s, the kernel complains.  Valid values are 1-10s.
>> > +
>> 
>> neato.
>
>please make sure this is applied after the softlockup watchdog patches i 
>already did. (in particular the cleanup patch, which this could interact 
>with)
>
>also, i think the valid threshold should be between 1 and 60 seconds i 
>think.

60 seconds!  Is that not a pretty high threshold?  The reason for lowering
the tolerance threshold from 10s is to catch bugs early in lab environments,
but why do we need to raise the tolerance thresh beyond 10s?

Thanks,
Kiran
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Change softlockup trigger limit using a kernel parameter

2007-07-19 Thread Ingo Molnar

* Andrew Morton <[EMAIL PROTECTED]> wrote:

> > +softlockup_thresh:
> > +
> > +This value can be used to lower the softlockup tolerance
> > +threshold. The default threshold is 10s.  If a cpu is locked up
> > +for 10s, the kernel complains.  Valid values are 1-10s.
> > +
> 
> neato.

please make sure this is applied after the softlockup watchdog patches i 
already did. (in particular the cleanup patch, which this could interact 
with)

also, i think the valid threshold should be between 1 and 60 seconds i 
think.

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Change softlockup trigger limit using a kernel parameter

2007-07-19 Thread Andrew Morton
On Wed, 18 Jul 2007 22:41:21 -0700 Ravikiran G Thirumalai <[EMAIL PROTECTED]> 
wrote:

> On Wed, Jul 18, 2007 at 04:08:58PM -0700, Andrew Morton wrote:
> > On Mon, 16 Jul 2007 15:26:50 -0700
> > Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> >
> > > Kernel warns of softlockups if the softlockup thread is not able to run
> > > on a CPU for 10s.  It is useful to lower the softlockup warning
> > > threshold in testing environments to catch potential lockups early.
> > > Following patch adds a kernel parameter 'softlockup_lim' to control
> > > the softlockup threshold.
> > >
> >
> > Why not make it tunable at runtime?
> 
> Sure! Like a sysctl?
> 
> Here's a patch that does that (On top of Ingo's
> softlockup-improve-debug-output.patch)
>
> ...
>
> --- linux-2.6.22.orig/kernel/sysctl.c 2007-07-08 16:32:17.0 -0700
> +++ linux-2.6.22/kernel/sysctl.c  2007-07-18 21:05:57.877436750 -0700
> @@ -78,6 +78,7 @@ extern int percpu_pagelist_fraction;
>  extern int compat_log;
>  extern int maps_protect;
>  extern int sysctl_stat_interval;
> +extern int softlockup_thresh;

Just because sysctl.c does this all over the place doesn't make it right ;)
Please, if poss, find a header file for it.

>  /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID 
> */
>  static int maxolduid = 65535;
> @@ -206,6 +207,10 @@ static ctl_table root_table[] = {
>   { .ctl_name = 0 }
>  };
>  
> +/* Constants for kernel table minimum and  maximum */
> +static int one = 1;
> +static int ten = 10;

I'd suggest that these go next to "zero", "two" and "one_hundred".  Move 'em
all to top-of-file where they should always have been.

>  static ctl_table kern_table[] = {
>   {
>   .ctl_name   = KERN_PANIC,
> @@ -615,6 +620,19 @@ static ctl_table kern_table[] = {
>   .proc_handler   = _dointvec,
>   },
>  #endif
> +#ifdef CONFIG_DETECT_SOFTLOCKUP
> + {
> + .ctl_name   = KERN_SOFTLOCKUP_THRESHOLD,
> + .procname   = "softlockup_thresh",
> + .data   = _thresh,
> + .maxlen = sizeof(int),
> + .mode   = 0644,
> + .proc_handler   = _dointvec_minmax,
> + .strategy   = _intvec,
> + .extra1 = ,
> + .extra2 = ,
> + },
> +#endif

argh.  There's supposed to be a big comment right here:

/*
 * NOTE: do not add new entries to this table unless you have read
 * Documentation/sysctl/ctl_unnumbered.txt
 */

I'll fix that up.  Please use CTL_UNNUMBERED.

>  };
> Index: linux-2.6.22/include/linux/sysctl.h
> ===
> --- linux-2.6.22.orig/include/linux/sysctl.h  2007-07-08 16:32:17.0 
> -0700
> +++ linux-2.6.22/include/linux/sysctl.h   2007-07-18 21:41:56.584347500 
> -0700
> @@ -165,6 +165,7 @@ enum
>   KERN_MAX_LOCK_DEPTH=74,
>   KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
>   KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
> + KERN_SOFTLOCKUP_THRESHOLD=77, /* int: softlockup tolerance threshold */
>  };

and zap this

> Index: linux-2.6.22/Documentation/sysctl/kernel.txt
> ===
> --- linux-2.6.22.orig/Documentation/sysctl/kernel.txt 2007-07-08 
> 16:32:17.0 -0700
> +++ linux-2.6.22/Documentation/sysctl/kernel.txt  2007-07-18 
> 22:07:29.460146250 -0700
> @@ -320,6 +320,14 @@ kernel.  This value defaults to SHMMAX.
>  
>  ==
>  
> +softlockup_thresh:
> +
> +This value can be used to lower the softlockup tolerance
> +threshold. The default threshold is 10s.  If a cpu is locked up
> +for 10s, the kernel complains.  Valid values are 1-10s.
> +

neato.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Change softlockup trigger limit using a kernel parameter

2007-07-19 Thread Ravikiran G Thirumalai
On Wed, Jul 18, 2007 at 04:08:58PM -0700, Andrew Morton wrote:
> On Mon, 16 Jul 2007 15:26:50 -0700
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>
> > Kernel warns of softlockups if the softlockup thread is not able to run
> > on a CPU for 10s.  It is useful to lower the softlockup warning
> > threshold in testing environments to catch potential lockups early.
> > Following patch adds a kernel parameter 'softlockup_lim' to control
> > the softlockup threshold.
> >
>
> Why not make it tunable at runtime?

Sure! Like a sysctl?

Here's a patch that does that (On top of Ingo's
softlockup-improve-debug-output.patch)

>
> >
> > Control the trigger limit for softlockup warnings.  This is useful for
> > debugging softlockups, by lowering the softlockup_lim to identify
> > possible softlockups earlier.
>
> Please check your patches with scripts/checkpatch.pl.

Yep will-do.
(checkpatch emitted one warning for the patch below, but that was because
of a 'stylo' that already exists in include/linux/sysctl.h -- which probably
needs a style change patch by itself)

---

Control the trigger limit for softlockup warnings.  This is useful for
debugging softlockups, by lowering the softlockup_thresh sysctl,
to identify possible softlockups earlier.

Patch also changes the softlockup printk to print the cpu softlockup time.

Signed-off-by: Ravikiran Thirumalai <[EMAIL PROTECTED]>
Signed-off-by: Shai Fultheim <[EMAIL PROTECTED]>

Index: linux-2.6.22/kernel/softlockup.c
===
--- linux-2.6.22.orig/kernel/softlockup.c   2007-07-18 11:15:18.506614500 
-0700
+++ linux-2.6.22/kernel/softlockup.c2007-07-18 21:39:20.498592750 -0700
@@ -23,6 +23,7 @@ static DEFINE_PER_CPU(unsigned long, pri
 static DEFINE_PER_CPU(struct task_struct *, watchdog_task);
 
 static int did_panic;
+int softlockup_thresh = 10;
 
 static int
 softlock_panic(struct notifier_block *this, unsigned long event, void *ptr)
@@ -101,7 +102,7 @@ void softlockup_tick(void)
wake_up_process(per_cpu(watchdog_task, this_cpu));
 
/* Warn about unreasonable 10+ seconds delays: */
-   if (now <= (touch_timestamp + 10))
+   if (now <= (touch_timestamp + softlockup_thresh))
return;
 
regs = get_irq_regs();
@@ -109,8 +110,9 @@ void softlockup_tick(void)
per_cpu(print_timestamp, this_cpu) = touch_timestamp;
 
spin_lock(_lock);
-   printk(KERN_ERR "BUG: soft lockup detected on CPU#%d! [%s:%d]\n",
-   this_cpu, current->comm, current->pid);
+   printk(KERN_ERR "BUG: soft lockup - CPU#%d stuck for %lus! [%s:%d]\n",
+   this_cpu, now - touch_timestamp,
+   current->comm, current->pid);
if (regs)
show_regs(regs);
else
Index: linux-2.6.22/kernel/sysctl.c
===
--- linux-2.6.22.orig/kernel/sysctl.c   2007-07-08 16:32:17.0 -0700
+++ linux-2.6.22/kernel/sysctl.c2007-07-18 21:05:57.877436750 -0700
@@ -78,6 +78,7 @@ extern int percpu_pagelist_fraction;
 extern int compat_log;
 extern int maps_protect;
 extern int sysctl_stat_interval;
+extern int softlockup_thresh;
 
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
@@ -206,6 +207,10 @@ static ctl_table root_table[] = {
{ .ctl_name = 0 }
 };
 
+/* Constants for kernel table minimum and  maximum */
+static int one = 1;
+static int ten = 10;
+
 static ctl_table kern_table[] = {
{
.ctl_name   = KERN_PANIC,
@@ -615,6 +620,19 @@ static ctl_table kern_table[] = {
.proc_handler   = _dointvec,
},
 #endif
+#ifdef CONFIG_DETECT_SOFTLOCKUP
+   {
+   .ctl_name   = KERN_SOFTLOCKUP_THRESHOLD,
+   .procname   = "softlockup_thresh",
+   .data   = _thresh,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = _dointvec_minmax,
+   .strategy   = _intvec,
+   .extra1 = ,
+   .extra2 = ,
+   },
+#endif
 
{ .ctl_name = 0 }
 };
Index: linux-2.6.22/include/linux/sysctl.h
===
--- linux-2.6.22.orig/include/linux/sysctl.h2007-07-08 16:32:17.0 
-0700
+++ linux-2.6.22/include/linux/sysctl.h 2007-07-18 21:41:56.584347500 -0700
@@ -165,6 +165,7 @@ enum
KERN_MAX_LOCK_DEPTH=74,
KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
+   KERN_SOFTLOCKUP_THRESHOLD=77, /* int: softlockup tolerance threshold */
 };
 
 
Index: linux-2.6.22/Documentation/sysctl/kernel.txt
===
--- 

Re: [patch] Change softlockup trigger limit using a kernel parameter

2007-07-19 Thread Ravikiran G Thirumalai
On Wed, Jul 18, 2007 at 04:08:58PM -0700, Andrew Morton wrote:
 On Mon, 16 Jul 2007 15:26:50 -0700
 Ravikiran G Thirumalai [EMAIL PROTECTED] wrote:

  Kernel warns of softlockups if the softlockup thread is not able to run
  on a CPU for 10s.  It is useful to lower the softlockup warning
  threshold in testing environments to catch potential lockups early.
  Following patch adds a kernel parameter 'softlockup_lim' to control
  the softlockup threshold.
 

 Why not make it tunable at runtime?

Sure! Like a sysctl?

Here's a patch that does that (On top of Ingo's
softlockup-improve-debug-output.patch)


 
  Control the trigger limit for softlockup warnings.  This is useful for
  debugging softlockups, by lowering the softlockup_lim to identify
  possible softlockups earlier.

 Please check your patches with scripts/checkpatch.pl.

Yep will-do.
(checkpatch emitted one warning for the patch below, but that was because
of a 'stylo' that already exists in include/linux/sysctl.h -- which probably
needs a style change patch by itself)

---

Control the trigger limit for softlockup warnings.  This is useful for
debugging softlockups, by lowering the softlockup_thresh sysctl,
to identify possible softlockups earlier.

Patch also changes the softlockup printk to print the cpu softlockup time.

Signed-off-by: Ravikiran Thirumalai [EMAIL PROTECTED]
Signed-off-by: Shai Fultheim [EMAIL PROTECTED]

Index: linux-2.6.22/kernel/softlockup.c
===
--- linux-2.6.22.orig/kernel/softlockup.c   2007-07-18 11:15:18.506614500 
-0700
+++ linux-2.6.22/kernel/softlockup.c2007-07-18 21:39:20.498592750 -0700
@@ -23,6 +23,7 @@ static DEFINE_PER_CPU(unsigned long, pri
 static DEFINE_PER_CPU(struct task_struct *, watchdog_task);
 
 static int did_panic;
+int softlockup_thresh = 10;
 
 static int
 softlock_panic(struct notifier_block *this, unsigned long event, void *ptr)
@@ -101,7 +102,7 @@ void softlockup_tick(void)
wake_up_process(per_cpu(watchdog_task, this_cpu));
 
/* Warn about unreasonable 10+ seconds delays: */
-   if (now = (touch_timestamp + 10))
+   if (now = (touch_timestamp + softlockup_thresh))
return;
 
regs = get_irq_regs();
@@ -109,8 +110,9 @@ void softlockup_tick(void)
per_cpu(print_timestamp, this_cpu) = touch_timestamp;
 
spin_lock(print_lock);
-   printk(KERN_ERR BUG: soft lockup detected on CPU#%d! [%s:%d]\n,
-   this_cpu, current-comm, current-pid);
+   printk(KERN_ERR BUG: soft lockup - CPU#%d stuck for %lus! [%s:%d]\n,
+   this_cpu, now - touch_timestamp,
+   current-comm, current-pid);
if (regs)
show_regs(regs);
else
Index: linux-2.6.22/kernel/sysctl.c
===
--- linux-2.6.22.orig/kernel/sysctl.c   2007-07-08 16:32:17.0 -0700
+++ linux-2.6.22/kernel/sysctl.c2007-07-18 21:05:57.877436750 -0700
@@ -78,6 +78,7 @@ extern int percpu_pagelist_fraction;
 extern int compat_log;
 extern int maps_protect;
 extern int sysctl_stat_interval;
+extern int softlockup_thresh;
 
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
@@ -206,6 +207,10 @@ static ctl_table root_table[] = {
{ .ctl_name = 0 }
 };
 
+/* Constants for kernel table minimum and  maximum */
+static int one = 1;
+static int ten = 10;
+
 static ctl_table kern_table[] = {
{
.ctl_name   = KERN_PANIC,
@@ -615,6 +620,19 @@ static ctl_table kern_table[] = {
.proc_handler   = proc_dointvec,
},
 #endif
+#ifdef CONFIG_DETECT_SOFTLOCKUP
+   {
+   .ctl_name   = KERN_SOFTLOCKUP_THRESHOLD,
+   .procname   = softlockup_thresh,
+   .data   = softlockup_thresh,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .strategy   = sysctl_intvec,
+   .extra1 = one,
+   .extra2 = ten,
+   },
+#endif
 
{ .ctl_name = 0 }
 };
Index: linux-2.6.22/include/linux/sysctl.h
===
--- linux-2.6.22.orig/include/linux/sysctl.h2007-07-08 16:32:17.0 
-0700
+++ linux-2.6.22/include/linux/sysctl.h 2007-07-18 21:41:56.584347500 -0700
@@ -165,6 +165,7 @@ enum
KERN_MAX_LOCK_DEPTH=74,
KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
+   KERN_SOFTLOCKUP_THRESHOLD=77, /* int: softlockup tolerance threshold */
 };
 
 
Index: linux-2.6.22/Documentation/sysctl/kernel.txt
===
--- 

Re: [patch] Change softlockup trigger limit using a kernel parameter

2007-07-19 Thread Andrew Morton
On Wed, 18 Jul 2007 22:41:21 -0700 Ravikiran G Thirumalai [EMAIL PROTECTED] 
wrote:

 On Wed, Jul 18, 2007 at 04:08:58PM -0700, Andrew Morton wrote:
  On Mon, 16 Jul 2007 15:26:50 -0700
  Ravikiran G Thirumalai [EMAIL PROTECTED] wrote:
 
   Kernel warns of softlockups if the softlockup thread is not able to run
   on a CPU for 10s.  It is useful to lower the softlockup warning
   threshold in testing environments to catch potential lockups early.
   Following patch adds a kernel parameter 'softlockup_lim' to control
   the softlockup threshold.
  
 
  Why not make it tunable at runtime?
 
 Sure! Like a sysctl?
 
 Here's a patch that does that (On top of Ingo's
 softlockup-improve-debug-output.patch)

 ...

 --- linux-2.6.22.orig/kernel/sysctl.c 2007-07-08 16:32:17.0 -0700
 +++ linux-2.6.22/kernel/sysctl.c  2007-07-18 21:05:57.877436750 -0700
 @@ -78,6 +78,7 @@ extern int percpu_pagelist_fraction;
  extern int compat_log;
  extern int maps_protect;
  extern int sysctl_stat_interval;
 +extern int softlockup_thresh;

Just because sysctl.c does this all over the place doesn't make it right ;)
Please, if poss, find a header file for it.

  /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID 
 */
  static int maxolduid = 65535;
 @@ -206,6 +207,10 @@ static ctl_table root_table[] = {
   { .ctl_name = 0 }
  };
  
 +/* Constants for kernel table minimum and  maximum */
 +static int one = 1;
 +static int ten = 10;

I'd suggest that these go next to zero, two and one_hundred.  Move 'em
all to top-of-file where they should always have been.

  static ctl_table kern_table[] = {
   {
   .ctl_name   = KERN_PANIC,
 @@ -615,6 +620,19 @@ static ctl_table kern_table[] = {
   .proc_handler   = proc_dointvec,
   },
  #endif
 +#ifdef CONFIG_DETECT_SOFTLOCKUP
 + {
 + .ctl_name   = KERN_SOFTLOCKUP_THRESHOLD,
 + .procname   = softlockup_thresh,
 + .data   = softlockup_thresh,
 + .maxlen = sizeof(int),
 + .mode   = 0644,
 + .proc_handler   = proc_dointvec_minmax,
 + .strategy   = sysctl_intvec,
 + .extra1 = one,
 + .extra2 = ten,
 + },
 +#endif

argh.  There's supposed to be a big comment right here:

/*
 * NOTE: do not add new entries to this table unless you have read
 * Documentation/sysctl/ctl_unnumbered.txt
 */

I'll fix that up.  Please use CTL_UNNUMBERED.

  };
 Index: linux-2.6.22/include/linux/sysctl.h
 ===
 --- linux-2.6.22.orig/include/linux/sysctl.h  2007-07-08 16:32:17.0 
 -0700
 +++ linux-2.6.22/include/linux/sysctl.h   2007-07-18 21:41:56.584347500 
 -0700
 @@ -165,6 +165,7 @@ enum
   KERN_MAX_LOCK_DEPTH=74,
   KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
   KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
 + KERN_SOFTLOCKUP_THRESHOLD=77, /* int: softlockup tolerance threshold */
  };

and zap this

 Index: linux-2.6.22/Documentation/sysctl/kernel.txt
 ===
 --- linux-2.6.22.orig/Documentation/sysctl/kernel.txt 2007-07-08 
 16:32:17.0 -0700
 +++ linux-2.6.22/Documentation/sysctl/kernel.txt  2007-07-18 
 22:07:29.460146250 -0700
 @@ -320,6 +320,14 @@ kernel.  This value defaults to SHMMAX.
  
  ==
  
 +softlockup_thresh:
 +
 +This value can be used to lower the softlockup tolerance
 +threshold. The default threshold is 10s.  If a cpu is locked up
 +for 10s, the kernel complains.  Valid values are 1-10s.
 +

neato.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Change softlockup trigger limit using a kernel parameter

2007-07-19 Thread Ingo Molnar

* Andrew Morton [EMAIL PROTECTED] wrote:

  +softlockup_thresh:
  +
  +This value can be used to lower the softlockup tolerance
  +threshold. The default threshold is 10s.  If a cpu is locked up
  +for 10s, the kernel complains.  Valid values are 1-10s.
  +
 
 neato.

please make sure this is applied after the softlockup watchdog patches i 
already did. (in particular the cleanup patch, which this could interact 
with)

also, i think the valid threshold should be between 1 and 60 seconds i 
think.

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Change softlockup trigger limit using a kernel parameter

2007-07-19 Thread Ravikiran G Thirumalai
On Thu, Jul 19, 2007 at 11:11:42AM +0200, Ingo Molnar wrote:

* Andrew Morton [EMAIL PROTECTED] wrote:

  +softlockup_thresh:
  +
  +This value can be used to lower the softlockup tolerance
  +threshold. The default threshold is 10s.  If a cpu is locked up
  +for 10s, the kernel complains.  Valid values are 1-10s.
  +
 
 neato.

please make sure this is applied after the softlockup watchdog patches i 
already did. (in particular the cleanup patch, which this could interact 
with)

also, i think the valid threshold should be between 1 and 60 seconds i 
think.

60 seconds!  Is that not a pretty high threshold?  The reason for lowering
the tolerance threshold from 10s is to catch bugs early in lab environments,
but why do we need to raise the tolerance thresh beyond 10s?

Thanks,
Kiran
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Change softlockup trigger limit using a kernel parameter

2007-07-19 Thread Ingo Molnar

* Ravikiran G Thirumalai [EMAIL PROTECTED] wrote:

  also, i think the valid threshold should be between 1 and 60 seconds 
  i think.
 
 60 seconds!  Is that not a pretty high threshold?  The reason for 
 lowering the tolerance threshold from 10s is to catch bugs early in 
 lab environments, but why do we need to raise the tolerance thresh 
 beyond 10s?

just in case someone sees false positives and wants to turn it off.

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Change softlockup trigger limit using a kernel parameter

2007-07-19 Thread Jeremy Fitzhardinge
Ingo Molnar wrote:
 just in case someone sees false positives and wants to turn it off.

Why not make 0=off?

J
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Change softlockup trigger limit using a kernel parameter

2007-07-19 Thread Ravikiran G Thirumalai
On Thu, Jul 19, 2007 at 11:51:14AM -0700, Jeremy Fitzhardinge wrote:
Ingo Molnar wrote:
 just in case someone sees false positives and wants to turn it off.

Why not make 0=off?

A patch to disable softlockup during boot already went in.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=97842216b8400fe9d1a20468959e2989180f8f79

It uses kernel boot parameter to disable softlockup, not exactly disabling
softlockup at run time though.  Do we still need 0=off?

Thanks,
Kiran
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Change softlockup trigger limit using a kernel parameter

2007-07-19 Thread Ravikiran G Thirumalai
On Wed, Jul 18, 2007 at 11:09:07PM -0700, Andrew Morton wrote:
On Wed, 18 Jul 2007 22:41:21 -0700 Ravikiran G Thirumalai [EMAIL PROTECTED] 
wrote:

 On Wed, Jul 18, 2007 at 04:08:58PM -0700, Andrew Morton wrote:
  On Mon, 16 Jul 2007 15:26:50 -0700
  Ravikiran G Thirumalai [EMAIL PROTECTED] wrote:
 
   Kernel warns of softlockups if the softlockup thread is not able to run
   on a CPU for 10s.  It is useful to lower the softlockup warning
   threshold in testing environments to catch potential lockups early.
   Following patch adds a kernel parameter 'softlockup_lim' to control
   the softlockup threshold.
  
 
  Why not make it tunable at runtime?
 
 Sure! Like a sysctl?
 

Hi Andrew,
Here's another take, incorporating all of your comments.

Thanks,
Kiran


Control the trigger limit for softlockup warnings.  This is useful for
debugging softlockups, by lowering the softlockup_thresh to identify
possible softlockups earlier.

This patch:
1. Adds a sysctl softlockup_thresh with valid values of 1-60s
   (Higher value to disable false positives)
2. Changes the softlockup printk to print the cpu softlockup time

Signed-off-by: Ravikiran Thirumalai [EMAIL PROTECTED]
Signed-off-by: Shai Fultheim [EMAIL PROTECTED]

---
Patch applies on top of Ingo's softlockup patches

 Documentation/sysctl/kernel.txt |8 
 include/linux/sched.h   |1 +
 kernel/softlockup.c |8 +---
 kernel/sysctl.c |   25 +++--
 4 files changed, 33 insertions(+), 9 deletions(-)

Index: linux-2.6.22/kernel/softlockup.c
===
--- linux-2.6.22.orig/kernel/softlockup.c   2007-07-18 11:15:18.506614500 
-0700
+++ linux-2.6.22/kernel/softlockup.c2007-07-18 21:39:20.498592750 -0700
@@ -23,6 +23,7 @@ static DEFINE_PER_CPU(unsigned long, pri
 static DEFINE_PER_CPU(struct task_struct *, watchdog_task);
 
 static int did_panic;
+int softlockup_thresh = 10;
 
 static int
 softlock_panic(struct notifier_block *this, unsigned long event, void *ptr)
@@ -101,7 +102,7 @@ void softlockup_tick(void)
wake_up_process(per_cpu(watchdog_task, this_cpu));
 
/* Warn about unreasonable 10+ seconds delays: */
-   if (now = (touch_timestamp + 10))
+   if (now = (touch_timestamp + softlockup_thresh))
return;
 
regs = get_irq_regs();
@@ -109,8 +110,9 @@ void softlockup_tick(void)
per_cpu(print_timestamp, this_cpu) = touch_timestamp;
 
spin_lock(print_lock);
-   printk(KERN_ERR BUG: soft lockup detected on CPU#%d! [%s:%d]\n,
-   this_cpu, current-comm, current-pid);
+   printk(KERN_ERR BUG: soft lockup - CPU#%d stuck for %lus! [%s:%d]\n,
+   this_cpu, now - touch_timestamp,
+   current-comm, current-pid);
if (regs)
show_regs(regs);
else
Index: linux-2.6.22/kernel/sysctl.c
===
--- linux-2.6.22.orig/kernel/sysctl.c   2007-07-08 16:32:17.0 -0700
+++ linux-2.6.22/kernel/sysctl.c2007-07-19 13:42:50.275478000 -0700
@@ -79,6 +79,12 @@ extern int compat_log;
 extern int maps_protect;
 extern int sysctl_stat_interval;
 
+/* Constants used for minimum and  maximum */
+static int one = 1;
+static int zero;
+static int sixty = 60;
+static int one_hundred = 100;
+
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
 static int minolduid;
@@ -615,16 +621,23 @@ static ctl_table kern_table[] = {
.proc_handler   = proc_dointvec,
},
 #endif
+#ifdef CONFIG_DETECT_SOFTLOCKUP
+   {
+   .ctl_name   = CTL_UNNUMBERED,
+   .procname   = softlockup_thresh,
+   .data   = softlockup_thresh,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .strategy   = sysctl_intvec,
+   .extra1 = one,
+   .extra2 = sixty,
+   },
+#endif
 
{ .ctl_name = 0 }
 };
 
-/* Constants for minimum and maximum testing in vm_table.
-   We use these as one-element integer vectors. */
-static int zero;
-static int one_hundred = 100;
-
-
 static ctl_table vm_table[] = {
{
.ctl_name   = VM_OVERCOMMIT_MEMORY,
Index: linux-2.6.22/Documentation/sysctl/kernel.txt
===
--- linux-2.6.22.orig/Documentation/sysctl/kernel.txt   2007-07-08 
16:32:17.0 -0700
+++ linux-2.6.22/Documentation/sysctl/kernel.txt2007-07-19 
13:42:09.748945250 -0700
@@ -320,6 +320,14 @@ kernel.  This value defaults to SHMMAX.
 
 ==
 
+softlockup_thresh:
+
+This value can be used to lower the softlockup 

Re: [patch] Change softlockup trigger limit using a kernel parameter

2007-07-18 Thread Andrew Morton
On Mon, 16 Jul 2007 15:26:50 -0700
Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:

> Kernel warns of softlockups if the softlockup thread is not able to run
> on a CPU for 10s.  It is useful to lower the softlockup warning
> threshold in testing environments to catch potential lockups early.
> Following patch adds a kernel parameter 'softlockup_lim' to control
> the softlockup threshold.
> 

Why not make it tunable at runtime?

> 
> Control the trigger limit for softlockup warnings.  This is useful for
> debugging softlockups, by lowering the softlockup_lim to identify
> possible softlockups earlier.

Please check your patches with scripts/checkpatch.pl.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Change softlockup trigger limit using a kernel parameter

2007-07-18 Thread Andrew Morton
On Mon, 16 Jul 2007 15:26:50 -0700
Ravikiran G Thirumalai [EMAIL PROTECTED] wrote:

 Kernel warns of softlockups if the softlockup thread is not able to run
 on a CPU for 10s.  It is useful to lower the softlockup warning
 threshold in testing environments to catch potential lockups early.
 Following patch adds a kernel parameter 'softlockup_lim' to control
 the softlockup threshold.
 

Why not make it tunable at runtime?

 
 Control the trigger limit for softlockup warnings.  This is useful for
 debugging softlockups, by lowering the softlockup_lim to identify
 possible softlockups earlier.

Please check your patches with scripts/checkpatch.pl.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/