Re: [poll / rfc] kdb_stop_cpus

2011-06-22 Thread Andriy Gapon
on 04/06/2011 11:22 Andriy Gapon said the following:
 commit 458ebd9aca7e91fc6e0825c727c7220ab9f61016
 
 generic_stop_cpus: move timeout detection code from under DIAGNOSTIC
 
 ... and also increase it a bit.
 IMO it's better to detect and report the (rather serious) condition and
 allow a system to proceed somehow rather than be stuck in an endless
 loop.
 
 diff --git a/sys/kern/subr_smp.c b/sys/kern/subr_smp.c
 index ae52f4b..4bd766b 100644
 --- a/sys/kern/subr_smp.c
 +++ b/sys/kern/subr_smp.c
 @@ -232,12 +232,10 @@ generic_stop_cpus(cpumask_t map, u_int type)
   /* spin */
   cpu_spinwait();
   i++;
 -#ifdef DIAGNOSTIC
 - if (i == 10) {
 + if (i == 1) {
   printf(timeout stopping cpus\n);
   break;
   }
 -#endif
   }
 
   stopping_cpu = NOCPU;
 
 

I would like to commit the above, if nobody objects.

A to do item is adding some code to aid debugging of the timeout condition.  I
discussed this with Attilio, he doesn't see this as a show-stopper and he plans 
to
add the code at a later time.

-- 
Andriy Gapon
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: [poll / rfc] kdb_stop_cpus

2011-06-22 Thread Andriy Gapon
on 03/06/2011 18:13 Andriy Gapon said the following:
 
 I wonder if anybody uses kdb_stop_cpus with non-default value.

I would like to go ahead and remove kdb_stop_cpus tunable/sysctl if nobody 
objects.

 If, yes, I am very interested to learn about your usecase for it.
 
 I think that the default kdb behavior is the correct one, so it doesn't make 
 sense
 to have a knob to turn on incorrect behavior.
 But I may be missing something obvious.
 
 The comment in the code doesn't really satisfy me:
 /*
  * Flag indicating whether or not to IPI the other CPUs to stop them on
  * entering the debugger.  Sometimes, this will result in a deadlock as
  * stop_cpus() waits for the other cpus to stop, so we allow it to be
  * disabled.  In order to maximize the chances of success, use a hard
  * stop for that.
  */
 
 The hard stop should be sufficiently mighty.
 Yes, I am aware of supposedly extremely rare situations where a deadlock could
 happen even when using hard stop.  But I'd rather fix that than have this 
 switch.
 
 Oh, the commit message (from 2004) explains it:
 Add a new sysctl, debug.kdb.stop_cpus, which controls whether or not we
 attempt to IPI other cpus when entering the debugger in order to stop
 them while in the debugger.  The default remains to issue the stop;
 however, that can result in a hang if another cpu has interrupts disabled
 and is spinning, since the IPI won't be received and the KDB will wait
 indefinitely.  We probably need to add a timeout, but this is a useful
 stopgap in the mean time.
 
 But that was before we started using hard stop in this context (in 2009).
 


-- 
Andriy Gapon
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: [poll / rfc] kdb_stop_cpus

2011-06-07 Thread Andriy Gapon
on 05/06/2011 01:35 Attilio Rao said the following:
 2011/6/4 Andriy Gapon a...@freebsd.org:
 commit 458ebd9aca7e91fc6e0825c727c7220ab9f61016

generic_stop_cpus: move timeout detection code from under DIAGNOSTIC

... and also increase it a bit.
IMO it's better to detect and report the (rather serious) condition and
allow a system to proceed somehow rather than be stuck in an endless
loop.

 diff --git a/sys/kern/subr_smp.c b/sys/kern/subr_smp.c
 index ae52f4b..4bd766b 100644
 --- a/sys/kern/subr_smp.c
 +++ b/sys/kern/subr_smp.c
 @@ -232,12 +232,10 @@ generic_stop_cpus(cpumask_t map, u_int type)
/* spin */
cpu_spinwait();
i++;
 -#ifdef DIAGNOSTIC
 -   if (i == 10) {
 +   if (i == 1) {
printf(timeout stopping cpus\n);
break;
}
 -#endif
}

stopping_cpu = NOCPU;
 
 I'd also add the ability, once the deadlock is detected, to break in
 KDB, and put that under DIAGNOSTIC.
 I had such a patch and I used it to debug some deadlocks on shutdown
 code, but now it seems I can't find it anymore.

I think that this could be useful.
Of course, it would have to honor KDB_UNATTENDED.
However, I am not sure how to implement it safely.  E.g. panic() should stop 
other
CPUs before setting panicstr and if some CPU is stuck for good, then we would 
just
be recursively calling panic() until triple-fault.  Ditto for kdb_trap().

So if you could dig up your code for implementing this that would be useful.

-- 
Andriy Gapon
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: [poll / rfc] kdb_stop_cpus

2011-06-07 Thread Andriy Gapon
on 04/06/2011 12:11 Robert N. M. Watson said the following:
 
 On 4 Jun 2011, at 09:22, Andriy Gapon wrote:
 commit 458ebd9aca7e91fc6e0825c727c7220ab9f61016
 
 generic_stop_cpus: move timeout detection code from under DIAGNOSTIC
 
 ... and also increase it a bit. IMO it's better to detect and report the
 (rather serious) condition and allow a system to proceed somehow rather than
 be stuck in an endless loop.
 
 Agreed on detecting and reporting. It would be good to confirm that it works 
 in
 practice, however,

What is your concern here? :)
The code seems rather simple - the loop is no longer infinite.

 and also that there are no false positives. I'm not sure
 what the best test scenarios are for that.

As to the false positives - I think that that can only be verified by practice
(very wide testing), because that would greatly depend on hardware.
Maybe we should use some time-based approach instead of the iteration count
approach or maybe we should calibrate the iteration count based on hardware
characteristics...

-- 
Andriy Gapon
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: [poll / rfc] kdb_stop_cpus

2011-06-04 Thread Andriy Gapon
on 03/06/2011 20:57 Robert N. M. Watson said the following:
 
 On 3 Jun 2011, at 16:13, Andriy Gapon wrote:
 
 I wonder if anybody uses kdb_stop_cpus with non-default value. If, yes, I
 am very interested to learn about your usecase for it.
 
 The issue that prompted the sysctl was non-NMI IPIs being used to enter the
 debugger or reboot following a core hanging with interrupts disabled. With
 the switch to NMI IPIs in some of those circumstances, life is better -- at
 least, on hardware that supports non-maskable IPIs. I seem to recall sparc64
 doesn't, however?

Seems to be so as Nathan has also pointed out for PPC.
For this I also plan the following change:

commit 458ebd9aca7e91fc6e0825c727c7220ab9f61016

generic_stop_cpus: move timeout detection code from under DIAGNOSTIC

... and also increase it a bit.
IMO it's better to detect and report the (rather serious) condition and
allow a system to proceed somehow rather than be stuck in an endless
loop.

diff --git a/sys/kern/subr_smp.c b/sys/kern/subr_smp.c
index ae52f4b..4bd766b 100644
--- a/sys/kern/subr_smp.c
+++ b/sys/kern/subr_smp.c
@@ -232,12 +232,10 @@ generic_stop_cpus(cpumask_t map, u_int type)
/* spin */
cpu_spinwait();
i++;
-#ifdef DIAGNOSTIC
-   if (i == 10) {
+   if (i == 1) {
printf(timeout stopping cpus\n);
break;
}
-#endif
}

stopping_cpu = NOCPU;


 Not sure about MIPS, etc. Attilio has since significantly
 improved our shutdown behaviour -- initially, the switch to NMI IPIs broke
 other things (because certain IPIs then improperly preempted threads holding
 spinlocks), but that pretty much all seems worked out now.

-- 
Andriy Gapon
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: [poll / rfc] kdb_stop_cpus

2011-06-04 Thread Robert N. M. Watson

On 4 Jun 2011, at 09:22, Andriy Gapon wrote:

 on 03/06/2011 20:57 Robert N. M. Watson said the following:
 
 On 3 Jun 2011, at 16:13, Andriy Gapon wrote:
 
 I wonder if anybody uses kdb_stop_cpus with non-default value. If, yes, I
 am very interested to learn about your usecase for it.
 
 The issue that prompted the sysctl was non-NMI IPIs being used to enter the
 debugger or reboot following a core hanging with interrupts disabled. With
 the switch to NMI IPIs in some of those circumstances, life is better -- at
 least, on hardware that supports non-maskable IPIs. I seem to recall sparc64
 doesn't, however?
 
 Seems to be so as Nathan has also pointed out for PPC.
 For this I also plan the following change:
 
 commit 458ebd9aca7e91fc6e0825c727c7220ab9f61016
 
generic_stop_cpus: move timeout detection code from under DIAGNOSTIC
 
... and also increase it a bit.
IMO it's better to detect and report the (rather serious) condition and
allow a system to proceed somehow rather than be stuck in an endless
loop.

Agreed on detecting and reporting. It would be good to confirm that it works in 
practice, however, and also that there are no false positives. I'm not sure 
what the best test scenarios are for that.

Robert

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: [poll / rfc] kdb_stop_cpus

2011-06-04 Thread Attilio Rao
2011/6/3 Nathan Whitehorn nwhiteh...@freebsd.org:
 On 06/03/11 10:13, Andriy Gapon wrote:

 I wonder if anybody uses kdb_stop_cpus with non-default value.
 If, yes, I am very interested to learn about your usecase for it.

 I think that the default kdb behavior is the correct one, so it doesn't
 make sense
 to have a knob to turn on incorrect behavior.
 But I may be missing something obvious.

 The comment in the code doesn't really satisfy me:
 /*
  * Flag indicating whether or not to IPI the other CPUs to stop them on
  * entering the debugger.  Sometimes, this will result in a deadlock as
  * stop_cpus() waits for the other cpus to stop, so we allow it to be
  * disabled.  In order to maximize the chances of success, use a hard
  * stop for that.
  */

 The hard stop should be sufficiently mighty.
 Yes, I am aware of supposedly extremely rare situations where a deadlock
 could
 happen even when using hard stop.  But I'd rather fix that than have this
 switch.

 Oh, the commit message (from 2004) explains it:

 Add a new sysctl, debug.kdb.stop_cpus, which controls whether or not we
 attempt to IPI other cpus when entering the debugger in order to stop
 them while in the debugger.  The default remains to issue the stop;
 however, that can result in a hang if another cpu has interrupts disabled
 and is spinning, since the IPI won't be received and the KDB will wait
 indefinitely.  We probably need to add a timeout, but this is a useful
 stopgap in the mean time.

 But that was before we started using hard stop in this context (in 2009).

 Some non-x86 platforms (e.g. PPC) don't support real NMIs, and so this still
 applies.

Well, if I get Andriy's proposal right, he just wants to trim off the
possibility to not stop the CPUs on entering KDB. I'm not entirely
sure why there is a sysctl for disabling that and I really don't want
it.

Note that the missing of the NMI/privileged Interrupt is not going to
be a factor on this request, unless you are worried a lot by the easy
deadlock that a normal stop operation may lead.
If that is the case, I think that the upcoming work on skipping
locking during KDB/panic entering is going to help a lot for this
case. At that point removing the possibility to turn off CPU stopping
will be a good idea, IMHO.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: [poll / rfc] kdb_stop_cpus

2011-06-04 Thread Attilio Rao
2011/6/4 Andriy Gapon a...@freebsd.org:
 on 03/06/2011 20:57 Robert N. M. Watson said the following:

 On 3 Jun 2011, at 16:13, Andriy Gapon wrote:

 I wonder if anybody uses kdb_stop_cpus with non-default value. If, yes, I
 am very interested to learn about your usecase for it.

 The issue that prompted the sysctl was non-NMI IPIs being used to enter the
 debugger or reboot following a core hanging with interrupts disabled. With
 the switch to NMI IPIs in some of those circumstances, life is better -- at
 least, on hardware that supports non-maskable IPIs. I seem to recall sparc64
 doesn't, however?

 Seems to be so as Nathan has also pointed out for PPC.
 For this I also plan the following change:

 commit 458ebd9aca7e91fc6e0825c727c7220ab9f61016

    generic_stop_cpus: move timeout detection code from under DIAGNOSTIC

    ... and also increase it a bit.
    IMO it's better to detect and report the (rather serious) condition and
    allow a system to proceed somehow rather than be stuck in an endless
    loop.

 diff --git a/sys/kern/subr_smp.c b/sys/kern/subr_smp.c
 index ae52f4b..4bd766b 100644
 --- a/sys/kern/subr_smp.c
 +++ b/sys/kern/subr_smp.c
 @@ -232,12 +232,10 @@ generic_stop_cpus(cpumask_t map, u_int type)
                /* spin */
                cpu_spinwait();
                i++;
 -#ifdef DIAGNOSTIC
 -               if (i == 10) {
 +               if (i == 1) {
                        printf(timeout stopping cpus\n);
                        break;
                }
 -#endif
        }

        stopping_cpu = NOCPU;

I'd also add the ability, once the deadlock is detected, to break in
KDB, and put that under DIAGNOSTIC.
I had such a patch and I used it to debug some deadlocks on shutdown
code, but now it seems I can't find it anymore.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


[poll / rfc] kdb_stop_cpus

2011-06-03 Thread Andriy Gapon

I wonder if anybody uses kdb_stop_cpus with non-default value.
If, yes, I am very interested to learn about your usecase for it.

I think that the default kdb behavior is the correct one, so it doesn't make 
sense
to have a knob to turn on incorrect behavior.
But I may be missing something obvious.

The comment in the code doesn't really satisfy me:
/*
 * Flag indicating whether or not to IPI the other CPUs to stop them on
 * entering the debugger.  Sometimes, this will result in a deadlock as
 * stop_cpus() waits for the other cpus to stop, so we allow it to be
 * disabled.  In order to maximize the chances of success, use a hard
 * stop for that.
 */

The hard stop should be sufficiently mighty.
Yes, I am aware of supposedly extremely rare situations where a deadlock could
happen even when using hard stop.  But I'd rather fix that than have this 
switch.

Oh, the commit message (from 2004) explains it:
 Add a new sysctl, debug.kdb.stop_cpus, which controls whether or not we
 attempt to IPI other cpus when entering the debugger in order to stop
 them while in the debugger.  The default remains to issue the stop;
 however, that can result in a hang if another cpu has interrupts disabled
 and is spinning, since the IPI won't be received and the KDB will wait
 indefinitely.  We probably need to add a timeout, but this is a useful
 stopgap in the mean time.

But that was before we started using hard stop in this context (in 2009).

-- 
Andriy Gapon
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: [poll / rfc] kdb_stop_cpus

2011-06-03 Thread Nathan Whitehorn

On 06/03/11 10:13, Andriy Gapon wrote:


I wonder if anybody uses kdb_stop_cpus with non-default value.
If, yes, I am very interested to learn about your usecase for it.

I think that the default kdb behavior is the correct one, so it doesn't make 
sense
to have a knob to turn on incorrect behavior.
But I may be missing something obvious.

The comment in the code doesn't really satisfy me:
/*
  * Flag indicating whether or not to IPI the other CPUs to stop them on
  * entering the debugger.  Sometimes, this will result in a deadlock as
  * stop_cpus() waits for the other cpus to stop, so we allow it to be
  * disabled.  In order to maximize the chances of success, use a hard
  * stop for that.
  */

The hard stop should be sufficiently mighty.
Yes, I am aware of supposedly extremely rare situations where a deadlock could
happen even when using hard stop.  But I'd rather fix that than have this 
switch.

Oh, the commit message (from 2004) explains it:

Add a new sysctl, debug.kdb.stop_cpus, which controls whether or not we
attempt to IPI other cpus when entering the debugger in order to stop
them while in the debugger.  The default remains to issue the stop;
however, that can result in a hang if another cpu has interrupts disabled
and is spinning, since the IPI won't be received and the KDB will wait
indefinitely.  We probably need to add a timeout, but this is a useful
stopgap in the mean time.


But that was before we started using hard stop in this context (in 2009).


Some non-x86 platforms (e.g. PPC) don't support real NMIs, and so this 
still applies.

-Nathan

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: [poll / rfc] kdb_stop_cpus

2011-06-03 Thread Andriy Gapon
on 03/06/2011 18:28 Nathan Whitehorn said the following:
 On 06/03/11 10:13, Andriy Gapon wrote:

 I wonder if anybody uses kdb_stop_cpus with non-default value.
 If, yes, I am very interested to learn about your usecase for it.

 I think that the default kdb behavior is the correct one, so it doesn't make 
 sense
 to have a knob to turn on incorrect behavior.
 But I may be missing something obvious.

 The comment in the code doesn't really satisfy me:
 /*
   * Flag indicating whether or not to IPI the other CPUs to stop them on
   * entering the debugger.  Sometimes, this will result in a deadlock as
   * stop_cpus() waits for the other cpus to stop, so we allow it to be
   * disabled.  In order to maximize the chances of success, use a hard
   * stop for that.
   */

 The hard stop should be sufficiently mighty.
 Yes, I am aware of supposedly extremely rare situations where a deadlock 
 could
 happen even when using hard stop.  But I'd rather fix that than have this 
 switch.

 Oh, the commit message (from 2004) explains it:
 Add a new sysctl, debug.kdb.stop_cpus, which controls whether or not we
 attempt to IPI other cpus when entering the debugger in order to stop
 them while in the debugger.  The default remains to issue the stop;
 however, that can result in a hang if another cpu has interrupts disabled
 and is spinning, since the IPI won't be received and the KDB will wait
 indefinitely.  We probably need to add a timeout, but this is a useful
 stopgap in the mean time.

 But that was before we started using hard stop in this context (in 2009).
 
 Some non-x86 platforms (e.g. PPC) don't support real NMIs, and so this still 
 applies.

Well, even if it does, there are two things that can be done about that (and, 
IMO,
both are better than the manually controlled knob):

- quick and dirty: just let stop_cpus[_hard] timeout; this way good CPUs are
stopped and the bad ones are no worse than with kdb_stop_cpus=0.
- have a special reserved high priority interrupt, change 'disabling of
interrupts' to 'disabling of all interrupts except the special one' by employing
various kinds of interrupt priority registers (like it was done for splX stuff);
use the special interrupt like an IPI+NMI.

What do you think?

P.S. I think that the first quick and dirty thing should be done anyway,
regardless of any other changes and plans.

-- 
Andriy Gapon
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: [poll / rfc] kdb_stop_cpus

2011-06-03 Thread Robert N. M. Watson

On 3 Jun 2011, at 16:13, Andriy Gapon wrote:

 I wonder if anybody uses kdb_stop_cpus with non-default value.
 If, yes, I am very interested to learn about your usecase for it.

The issue that prompted the sysctl was non-NMI IPIs being used to enter the 
debugger or reboot following a core hanging with interrupts disabled. With the 
switch to NMI IPIs in some of those circumstances, life is better -- at least, 
on hardware that supports non-maskable IPIs. I seem to recall sparc64 doesn't, 
however? Not sure about MIPS, etc. Attilio has since significantly improved our 
shutdown behaviour -- initially, the switch to NMI IPIs broke other things 
(because certain IPIs then improperly preempted threads holding spinlocks), but 
that pretty much all seems worked out now.

Robert

 
 I think that the default kdb behavior is the correct one, so it doesn't make 
 sense
 to have a knob to turn on incorrect behavior.
 But I may be missing something obvious.
 
 The comment in the code doesn't really satisfy me:
 /*
 * Flag indicating whether or not to IPI the other CPUs to stop them on
 * entering the debugger.  Sometimes, this will result in a deadlock as
 * stop_cpus() waits for the other cpus to stop, so we allow it to be
 * disabled.  In order to maximize the chances of success, use a hard
 * stop for that.
 */
 
 The hard stop should be sufficiently mighty.
 Yes, I am aware of supposedly extremely rare situations where a deadlock could
 happen even when using hard stop.  But I'd rather fix that than have this 
 switch.
 
 Oh, the commit message (from 2004) explains it:
 Add a new sysctl, debug.kdb.stop_cpus, which controls whether or not we
 attempt to IPI other cpus when entering the debugger in order to stop
 them while in the debugger.  The default remains to issue the stop;
 however, that can result in a hang if another cpu has interrupts disabled
 and is spinning, since the IPI won't be received and the KDB will wait
 indefinitely.  We probably need to add a timeout, but this is a useful
 stopgap in the mean time.
 
 But that was before we started using hard stop in this context (in 2009).
 
 -- 
 Andriy Gapon

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org