Re: USB/coredump hangs in 8 and 9

2011-08-20 Thread Hans Petter Selasky
On Saturday 20 August 2011 19:09:02 Andriy Gapon wrote:
> on 20/08/2011 19:54 Hans Petter Selasky said the following:
> > On Saturday 20 August 2011 18:45:57 Andriy Gapon wrote:
> >> SCHEDULER_STOPPED
> > 
> > The USB code needs to check for the SCHEDULER_STOPPED and cold at the
> > present moment. If this state can be set during bootup, and cleared at
> > the same time like "cold", it would be very good.
> 
> Sorry again - not sure if I follow.
> SCHEDULER_STOPPED is supposed to be set on panic and never be reset.  It's
> like a mirror of 'cold' in a sense.

OK. Then you should add a test "&& !SCHEDULER_STOPPED()" where I pointed out:

static void
usbd_callback_wrapper(struct usb_xfer_queue *pq)
{
	struct usb_xfer *xfer = pq->curr;
	struct usb_xfer_root *info = xfer->xroot;

	USB_BUS_LOCK_ASSERT(info->bus, MA_OWNED);
	if (!mtx_owned(info->xfer_mtx) && !SCHEDULER_STOPPED()) {
		/*
		 * Cases that end up here:
		 *

And also ensure that no mutex asserts can trigger further panics.
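
One way to picture the "no further panics from mutex asserts" requirement is an ownership assertion that turns into a no-op once the scheduler is stopped. This is only a user-space sketch: `scheduler_stopped` stands in for SCHEDULER_STOPPED(), and `struct mock_mtx`, `mock_mtx_assert_owned()` and `assert_failures` are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: plain flag standing in for SCHEDULER_STOPPED(). */
static bool scheduler_stopped = false;

/* Counts the assertion failures that would otherwise become panics. */
static int assert_failures = 0;

/* Hypothetical mutex: records only whether the modeled thread holds it. */
struct mock_mtx {
	bool held;
};

/* Ownership assertion that is skipped entirely after the scheduler stops. */
static void
mock_mtx_assert_owned(const struct mock_mtx *m)
{
	assert(m != NULL);
	if (scheduler_stopped)
		return;		/* lock state is meaningless now, do not check */
	if (!m->held)
		assert_failures++;
}
```

With the flag set, a stale or skipped lock no longer counts as a failure, so the dump path cannot trip over it a second time.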

--HPS
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: USB/coredump hangs in 8 and 9

2011-08-20 Thread Andriy Gapon
on 20/08/2011 19:54 Hans Petter Selasky said the following:
> On Saturday 20 August 2011 18:45:57 Andriy Gapon wrote:
>> SCHEDULER_STOPPED
> 
> The USB code needs to check for the SCHEDULER_STOPPED and cold at the present 
> moment. If this state can be set during bootup, and cleared at the same time 
> like "cold", it would be very good.

Sorry again - not sure if I follow.
SCHEDULER_STOPPED is supposed to be set on panic and never be reset.  It's like
a mirror of 'cold' in a sense.

-- 
Andriy Gapon


Re: USB/coredump hangs in 8 and 9

2011-08-20 Thread Hans Petter Selasky
On Saturday 20 August 2011 18:45:57 Andriy Gapon wrote:
> SCHEDULER_STOPPED

The USB code needs to check for the SCHEDULER_STOPPED and cold at the present 
moment. If this state can be set during bootup, and cleared at the same time 
like "cold", it would be very good.

--HPS


Re: USB/coredump hangs in 8 and 9

2011-08-20 Thread Andriy Gapon
on 20/08/2011 16:35 Hans Petter Selasky said the following:
> On Friday 19 August 2011 18:32:13 Andriy Gapon wrote:
>> on 19/08/2011 00:24 Hans Petter Selasky said the following:
>>> On Thursday 18 August 2011 19:04:10 Andriy Gapon wrote:
>>>> If you can help Hans to figure out what is wrong with the USB subsystem
>>>> in this respect that would help us all.
>>>
>>> Hi,
>>>
>>> usb_busdma.c:   /* we use "mtx_owned()" instead of this function */
>>> usb_busdma.c:   owned = mtx_owned(uptag->mtx);
>>> usb_compat_linux.c: do_unlock = mtx_owned(&Giant) ? 0 : 1;
>>> usb_compat_linux.c: do_unlock = mtx_owned(&Giant) ? 0 : 1;
>>> usb_compat_linux.c: do_unlock = mtx_owned(&Giant) ? 0 : 1;
>>> usb_hub.c:  if (mtx_owned(&bus->bus_mtx)) {
>>> usb_transfer.c: if (!mtx_owned(info->xfer_mtx)) {
>>> usb_transfer.c: if (mtx_owned(xfer->xroot->xfer_mtx)) {
>>> usb_transfer.c: while (mtx_owned(&xroot->udev->bus->bus_mtx)) {
>>> usb_transfer.c: while (mtx_owned(xroot->xfer_mtx)) {
>>
>>> One fix you will need to do, if mtx_owned is not giving correct value is:
>> First, could you please clarify what is the correct, or rather - expected,
>> value in this case.  It's not immediately clear to me if we should
>> consider all locks as owned or un-owned in a situation where all locks are
>> actually skipped behind the scenes.
>> Maybe USB code should explicitly check for that condition as to not make
>> any unsafe assumptions.
>>
>> Second, it's not clear to me what the above list actually represents in the
>> context of this discussion.
> 
> Hi,
> 
> The mtx_owned() is not only used to assert mutex ownership, but also to
> figure out which context the function is being called from. If the correct
> mutex is not locked already we postpone the work until later. In the panic
> case, there is no way to postpone work, so this check should be skipped in
> case of panic, because there is no other thread to put work to.

Now I see, but I still cannot draw a conclusion.
So what would you suggest: should USB code explicitly check for panicstr (or
SCHEDULER_STOPPED in the future)?  Or what should mtx_owned() return: true or
false?

-- 
Andriy Gapon


Re: USB/coredump hangs in 8 and 9

2011-08-20 Thread Hans Petter Selasky
On Friday 19 August 2011 18:32:13 Andriy Gapon wrote:
> on 19/08/2011 00:24 Hans Petter Selasky said the following:
> > On Thursday 18 August 2011 19:04:10 Andriy Gapon wrote:
> >> If you can help Hans to figure out what is wrong with the USB subsystem
> >> in this respect that would help us all.
> > 
> > Hi,
> > 
> > usb_busdma.c:   /* we use "mtx_owned()" instead of this function */
> > usb_busdma.c:   owned = mtx_owned(uptag->mtx);
> > usb_compat_linux.c: do_unlock = mtx_owned(&Giant) ? 0 : 1;
> > usb_compat_linux.c: do_unlock = mtx_owned(&Giant) ? 0 : 1;
> > usb_compat_linux.c: do_unlock = mtx_owned(&Giant) ? 0 : 1;
> > usb_hub.c:  if (mtx_owned(&bus->bus_mtx)) {
> > usb_transfer.c: if (!mtx_owned(info->xfer_mtx)) {
> > usb_transfer.c: if (mtx_owned(xfer->xroot->xfer_mtx)) {
> > usb_transfer.c: while (mtx_owned(&xroot->udev->bus->bus_mtx)) {
> > usb_transfer.c: while (mtx_owned(xroot->xfer_mtx)) {
> 
> > One fix you will need to do, if mtx_owned is not giving correct value is:
> First, could you please clarify what is the correct, or rather - expected,
> value in this case.  It's not immediately clear to me if we should
> consider all locks as owned or un-owned in a situation where all locks are
> actually skipped behind the scenes.
> Maybe USB code should explicitly check for that condition as to not make
> any unsafe assumptions.
> 
> Second, it's not clear to me what the above list actually represents in the
> context of this discussion.

Hi,

The mtx_owned() is not only used to assert mutex ownership, but also to figure 
out which context the function is being called from. If the correct mutex is 
not locked already we postpone the work until later. In the panic case, there 
is no way to postpone work, so this check should be skipped in case of panic, 
because there is no other thread to put work to.
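
The decision described here can be sketched in user space as follows. The two booleans stand in for mtx_owned() on the transfer mutex and for the panic/SCHEDULER_STOPPED state; `struct xfer_sketch` and `xfer_callback_sketch()` are hypothetical names used only for illustration, not the real usb_transfer.c code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping: how often work was deferred or run directly. */
struct xfer_sketch {
	int deferred;	/* handed to another thread for later */
	int ran;	/* callback executed in the calling context */
};

/*
 * Defer only when the caller does not hold the transfer mutex AND there is
 * still a scheduler that can run the thread the work would be handed to.
 */
static void
xfer_callback_sketch(struct xfer_sketch *x, bool owns_xfer_mtx,
    bool scheduler_stopped)
{
	assert(x != NULL);
	if (!owns_xfer_mtx && !scheduler_stopped) {
		x->deferred++;
		return;
	}
	x->ran++;
}
```

After a panic the second condition is false, so the work always runs in the calling context instead of waiting for a thread that will never be scheduled.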

--HPS


Re: USB/coredump hangs in 8 and 9

2011-08-19 Thread Attilio Rao
2011/8/12 Andrew Boyer :
> Re: panic: bufwrite: buffer is not busy??? (originally on freebsd-net)
> Re: debugging frequent kernel panics on 8.2-RELEASE (originally on 
> freebsd-stable)
> Re: System hang in USB umass module while processing panic  (originally on 
> freebsd-usb)
>
> Hello Andriy and Hans,
>
> Sorry for tying in so many discussions on this topic, but I think I have an 
> explanation for the problems we have been reporting* with hanging coredumps 
> on multicore systems on 8.2-RELEASE, and it has implications for Andriy's 
> proposed scheduler patch** and for USB.
>
> In today's 8.X and 9.X branches, nothing that I can find stops the other CPUs 
> when the kernel panics, but many parts of the locking code get disabled (grep 
> on 'panicstr').  The 'bufwrite: buffer is not busy???' panic is caused by the 
> syncer encountering an error.  If that happens when it's on the dumping CPU 
> everything hangs.  If it's running on a different CPU, it will be blocked and 
> hidden by the panic_cpu spinlock in panic(), and the dump continues, polling 
> every attached keyboard for a Ctl-C.
>
> But, the new 8.X USB stack relies on multithreading.  (The new stack is the 
> variable that broke coredumps for us in the 7.1->8.2 transition, I think.)  
> SVN 224223 fixes a hang that would happen when dumpsys() polls the USB 
> keyboard (IPMI KVM, in our case).  That helps, but it only gets as far as 
> usb_process(), where it hangs in a loop around a cv_wait() call.  This is 
> easy to reproduce by adding code to the watchdog to break into the debugger 
> if panicstr is set.
>
> I am experimenting with Andriy's patch** to stop the scheduler and it seems 
> to be most of the way there, stopping the CPUs and disabling the rest of 
> locking.  There are a few places that still reference panicstr, but that's 
> minor.  These are the changes I made to the patch:
>  * Changed ukbd_do_poll() to return immediately if SCHEDULER_STOPPED() is 
> true, so that we don't hang up in USB.  ukbd_yield()  locks up in 
> DROP_GIANT(), and if you skip ukbd_yield(), usbd_transfer_poll() locks up 
> trying to drop mutexes.
>  * Changed the call to spinlock_enter() back to critical_enter(), so that 
> interrupts stay enabled and the hardclock still functions.

Which spinlock_enter() are you referring to here?
I think that having fast interrupt handlers running during
panic/shutdown is something we should avoid like hell.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: USB/coredump hangs in 8 and 9

2011-08-19 Thread Andriy Gapon
on 19/08/2011 00:24 Hans Petter Selasky said the following:
> On Thursday 18 August 2011 19:04:10 Andriy Gapon wrote:
>> If you can help Hans to figure out what is wrong with the USB subsystem in
>> this respect that would help us all.
> 
> Hi,
> 
> usb_busdma.c:   /* we use "mtx_owned()" instead of this function */
> usb_busdma.c:   owned = mtx_owned(uptag->mtx);
> usb_compat_linux.c: do_unlock = mtx_owned(&Giant) ? 0 : 1;
> usb_compat_linux.c: do_unlock = mtx_owned(&Giant) ? 0 : 1;
> usb_compat_linux.c: do_unlock = mtx_owned(&Giant) ? 0 : 1;
> usb_hub.c:  if (mtx_owned(&bus->bus_mtx)) {
> usb_transfer.c: if (!mtx_owned(info->xfer_mtx)) {
> usb_transfer.c: if (mtx_owned(xfer->xroot->xfer_mtx)) {
> usb_transfer.c: while (mtx_owned(&xroot->udev->bus->bus_mtx)) {
> usb_transfer.c: while (mtx_owned(xroot->xfer_mtx)) {
> 
> One fix you will need to do, if mtx_owned is not giving correct value is:

First, could you please clarify what the correct, or rather expected, value is
in this case.  It's not immediately clear to me whether we should consider all
locks as owned or un-owned in a situation where all locks are actually skipped
behind the scenes.
Maybe USB code should explicitly check for that condition so as not to make any
unsafe assumptions.

Second, it's not clear to me what the above list actually represents in the
context of this discussion.

> static void
> usbd_callback_wrapper(struct usb_xfer_queue *pq)
> {
> struct usb_xfer *xfer = pq->curr;
> struct usb_xfer_root *info = xfer->xroot;
> 
> USB_BUS_LOCK_ASSERT(info->bus, MA_OWNED);
> if (!mtx_owned(info->xfer_mtx)) {
> 
> The above "if" should be anded with && !paniced && !dumping ... or maybe the 
> new not scheduling variable is good for this purpose?
> 
> /*
>  * Cases that end up here:
>  *
> 
> #if USB_HAVE_BUSDMA
> if (mtx_owned(xfer->xroot->xfer_mtx)) {
> struct usb_xfer_queue *pq;
> 
> 
> This case is more like a BUS-DMA error case, and is not so important to 
> execute.
> 
> --HPS


-- 
Andriy Gapon


Re: USB/coredump hangs in 8 and 9

2011-08-18 Thread Hans Petter Selasky
On Thursday 18 August 2011 19:04:10 Andriy Gapon wrote:
> If you can help Hans to figure out what is wrong with the USB subsystem in
> this respect that would help us all.

Hi,

usb_busdma.c:   /* we use "mtx_owned()" instead of this function */
usb_busdma.c:   owned = mtx_owned(uptag->mtx);
usb_compat_linux.c: do_unlock = mtx_owned(&Giant) ? 0 : 1;
usb_compat_linux.c: do_unlock = mtx_owned(&Giant) ? 0 : 1;
usb_compat_linux.c: do_unlock = mtx_owned(&Giant) ? 0 : 1;
usb_hub.c:  if (mtx_owned(&bus->bus_mtx)) {
usb_transfer.c: if (!mtx_owned(info->xfer_mtx)) {
usb_transfer.c: if (mtx_owned(xfer->xroot->xfer_mtx)) {
usb_transfer.c: while (mtx_owned(&xroot->udev->bus->bus_mtx)) {
usb_transfer.c: while (mtx_owned(xroot->xfer_mtx)) {

One fix you will need to do, if mtx_owned is not giving correct value is:

static void
usbd_callback_wrapper(struct usb_xfer_queue *pq)
{
struct usb_xfer *xfer = pq->curr;
struct usb_xfer_root *info = xfer->xroot;

USB_BUS_LOCK_ASSERT(info->bus, MA_OWNED);
if (!mtx_owned(info->xfer_mtx)) {

The above "if" should be anded with && !paniced && !dumping ... or maybe the 
new not scheduling variable is good for this purpose?

/*
 * Cases that end up here:
 *

#if USB_HAVE_BUSDMA
if (mtx_owned(xfer->xroot->xfer_mtx)) {
struct usb_xfer_queue *pq;


This case is more like a BUS-DMA error case, and is not so important to 
execute.

--HPS


Re: USB/coredump hangs in 8 and 9

2011-08-18 Thread Andriy Gapon
on 12/08/2011 22:59 Andrew Boyer said the following:
> Re: panic: bufwrite: buffer is not busy??? (originally on freebsd-net)
> 
> Re: debugging frequent kernel panics on 8.2-RELEASE (originally on 
> freebsd-stable)
> 
> Re: System hang in USB umass module while processing panic  (originally on
> freebsd-usb)
> 
> Hello Andriy and Hans,
> 
> Sorry for tying in so many discussions on this topic, but I think I have an
> explanation for the problems we have been reporting* with hanging coredumps
> on multicore systems on 8.2-RELEASE, and it has implications for Andriy's
> proposed scheduler patch** and for USB.
> 
> In today's 8.X and 9.X branches, nothing that I can find stops the other CPUs
> when the kernel panics, but many parts of the locking code get disabled (grep
> on 'panicstr').  The 'bufwrite: buffer is not busy???' panic is caused by the
> syncer encountering an error.  If that happens when it's on the dumping CPU
> everything hangs.  If it's running on a different CPU, it will be blocked and
> hidden by the panic_cpu spinlock in panic(), and the dump continues, polling
> every attached keyboard for a Ctl-C.
> 
> But, the new 8.X USB stack relies on multithreading.  (The new stack is the
> variable that broke coredumps for us in the 7.1->8.2 transition, I think.)
> SVN 224223 fixes a hang that would happen when dumpsys() polls the USB
> keyboard (IPMI KVM, in our case).  That helps, but it only gets as far as
> usb_process(), where it hangs in a loop around a cv_wait() call.  This is
> easy to reproduce by adding code to the watchdog to break into the debugger
> if panicstr is set.
> 
> I am experimenting with Andriy's patch** to stop the scheduler and it seems
> to be most of the way there, stopping the CPUs and disabling the rest of
> locking.  There are a few places that still reference panicstr, but that's
> minor.  These are the changes I made to the patch:
>  * Changed ukbd_do_poll() to return immediately if SCHEDULER_STOPPED() is
> true, so that we don't hang up in USB.  ukbd_yield()  locks up in
> DROP_GIANT(), and if you skip ukbd_yield(), usbd_transfer_poll() locks up
> trying to drop mutexes.

Hmm, this is a little bit unexpected.  I thought that with the patch all the
mutex/lock operations would be skipped.
Can you please check which locks give you trouble and why?
I would like to improve the patch, so that all lock operations are bypassed
(whether locking or unlocking).

>  * Changed the call to spinlock_enter() back to critical_enter(), so that
> interrupts stay enabled and the hardclock still functions.

Not sure if I like this idea in general.

>  * Added code in the beginning of panic() to switch to CPU 0, so that we're
> able to service the hardclock interrupts and so that watchdog panics get
> through.

Also I wouldn't like switching a panic thread to a different CPU, as that messes
up a lot of state and is not safe for an arbitrary context.
Also, can you please clarify what you meant by "watchdog panics get through"?
Are you talking about SW_WATCHDOG specifically?

> This has worked 100% for me so far, although anyone using a USB keyboard or
> dump device would still be out of luck.
> 
> Thoughts?  It seems like stopping all of the other CPUs is the right thing to
> do on a panic (what are they doing otherwise?).  Are the USB issues fixable?
> If Andriy's patch gets committed it might just involve short-circuiting all
> of the locking in the polling path, but I haven't gotten that far yet.  I bet
> dumping to NFS will have the same problem.

I think that no subsystem should rely on working scheduling and interrupts in
post-panic world.  In fact, all the code for skipping locking is just a giant
hack/workaround in my opinion.  Ideally, all the subsystems that can be expected
to be called after panic should be aware of that and should check for that.  So
they should not attempt any locking or switching threads or rebinding CPUs or
expect interrupts, etc.  The environment should mirror early boot where we have
only one CPU, only one thread, no interrupts, only polling.
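
As a rough illustration of "no interrupts, only polling": a subsystem that normally sleeps until an interrupt signals completion would instead ask the hardware directly, with a bounded number of attempts. This is a user-space sketch only; `struct dev_sketch`, `dev_poll()` and `dev_wait_polled()` are hypothetical names and model no real driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device that reports completion after a fixed number of polls. */
struct dev_sketch {
	int polls_until_done;
};

/* Ask the device directly whether it is finished; no interrupt is involved. */
static bool
dev_poll(struct dev_sketch *d)
{
	if (d->polls_until_done > 0)
		d->polls_until_done--;
	return (d->polls_until_done == 0);
}

/*
 * Post-panic style wait: no sleeping, no thread switch, just a bounded
 * polling loop.  Returns the number of polls used, or -1 if the budget
 * ran out before the device finished.
 */
static int
dev_wait_polled(struct dev_sketch *d, int max_polls)
{
	assert(d != NULL && max_polls > 0);
	for (int i = 1; i <= max_polls; i++) {
		if (dev_poll(d))
			return (i);
	}
	return (-1);
}
```

The bound matters: with a single thread and no interrupts, an unbounded wait on a dead device is exactly the kind of hang this thread is about.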

If you can help Hans to figure out what is wrong with the USB subsystem in this
respect that would help us all.

Thank you for your testing and feedback!
-- 
Andriy Gapon


Re: USB/coredump hangs in 8 and 9

2011-08-12 Thread Hans Petter Selasky
On Friday 12 August 2011 21:59:21 Andrew Boyer wrote:
> Re: panic: bufwrite: buffer is not busy??? (originally on freebsd-net)
> Re: debugging frequent kernel panics on 8.2-RELEASE (originally on
> freebsd-stable)
> Re: System hang in USB umass module while processing panic (originally on
> freebsd-usb)
> 
> Hello Andriy and Hans,
> 
> Sorry for tying in so many discussions on this topic, but I think I have an
> explanation for the problems we have been reporting* with hanging
> coredumps on multicore systems on 8.2-RELEASE, and it has implications for
> Andriy's proposed scheduler patch** and for USB.
> 
> In today's 8.X and 9.X branches, nothing that I can find stops the other
> CPUs when the kernel panics, but many parts of the locking code get
> disabled (grep on 'panicstr').  The 'bufwrite: buffer is not busy???'
> panic is caused by the syncer encountering an error.  If that happens when
> it's on the dumping CPU everything hangs.  If it's running on a different
> CPU, it will be blocked and hidden by the panic_cpu spinlock in panic(),
> and the dump continues, polling every attached keyboard for a Ctl-C.
> 
> But, the new 8.X USB stack relies on multithreading.  (The new stack is the
> variable that broke coredumps for us in the 7.1->8.2 transition, I think.)
>  SVN 224223 fixes a hang that would happen when dumpsys() polls the USB
> keyboard (IPMI KVM, in our case).  That helps, but it only gets as far as
> usb_process(), where it hangs in a loop around a cv_wait() call.  This is
> easy to reproduce by adding code to the watchdog to break into the
> debugger if panicstr is set.
> 
> I am experimenting with Andriy's patch** to stop the scheduler and it seems
> to be most of the way there, stopping the CPUs and disabling the rest of
> locking.  There are a few places that still reference panicstr, but that's
> minor.  These are the changes I made to the patch:
>  * Changed ukbd_do_poll() to return immediately if SCHEDULER_STOPPED() is
> true, so that we don't hang up in USB.  ukbd_yield()  locks up in
> DROP_GIANT(), and if you skip ukbd_yield(), usbd_transfer_poll() locks up
> trying to drop mutexes.
>  * Changed the call to spinlock_enter() back to critical_enter(), so that
> interrupts stay enabled and the hardclock still functions.
>  * Added code in the beginning of panic() to switch to CPU 0, so that we're
> able to service the hardclock interrupts and so that watchdog panics get
> through.
> 
> This has worked 100% for me so far, although anyone using a USB keyboard or
> dump device would still be out of luck.
> 
> Thoughts?  It seems like stopping all of the other CPUs is the right thing
> to do on a panic (what are they doing otherwise?).  Are the USB issues
> fixable?  If Andriy's patch get committed it might just involve
> short-circuiting all of the locking in the polling path, but I haven't
> gotten that far yet.  I bet dumping to NFS will have the same problem.

Hi.

USB does not rely on multithreading when doing polling. It bypasses the
processing thread and calls the function directly. I can also add that USB has
recursion-checking flags, so that if important functions are already being
called, the code will simply return.

USB does not rely on locking after panic, except maybe on mtx_owned() returning
the correct value. Your approach of having the mtx_lock() / mtx_unlock()
functions simply do nothing will affect the USB polling ability if mtx_owned()
does not return true when the lock is locked. So maybe in the
SCHEDULER_STOPPED() case we should just steal the lock instead of just
returning. Also, I assume that all interrupts and all other processes are
blocked at the moment of panic or dump.
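
To make the mtx_owned() concern concrete, here is a small user-space model, not the kernel's mutex code. `scheduler_stopped` stands in for SCHEDULER_STOPPED(), and `struct mock_mtx` and the `mock_mtx_*` functions are hypothetical. It shows that when locking becomes a no-op, a naive ownership test reports "not owned", while a "stealing" variant reports "owned".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: plain flag standing in for SCHEDULER_STOPPED(). */
static bool scheduler_stopped = false;

/* Hypothetical mutex: records only whether the single modeled thread holds it. */
struct mock_mtx {
	bool held;
};

/* Lock operations turn into no-ops once the scheduler is stopped. */
static void
mock_mtx_lock(struct mock_mtx *m)
{
	assert(m != NULL);
	if (scheduler_stopped)
		return;
	m->held = true;
}

/* Naive ownership test: sees only what was recorded before the stop. */
static bool
mock_mtx_owned_naive(const struct mock_mtx *m)
{
	return (m->held);
}

/* "Stealing" variant: after the stop, the one running thread owns every lock. */
static bool
mock_mtx_owned_steal(const struct mock_mtx *m)
{
	return (scheduler_stopped || m->held);
}
```

Code that uses the ownership test to decide between "run now" and "postpone" behaves very differently under the two variants, which is why the expected return value has to be agreed on.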

--HPS


USB/coredump hangs in 8 and 9

2011-08-12 Thread Andrew Boyer
Re: panic: bufwrite: buffer is not busy??? (originally on freebsd-net)
Re: debugging frequent kernel panics on 8.2-RELEASE (originally on 
freebsd-stable)
Re: System hang in USB umass module while processing panic  (originally on 
freebsd-usb)

Hello Andriy and Hans,

Sorry for tying in so many discussions on this topic, but I think I have an 
explanation for the problems we have been reporting* with hanging coredumps on 
multicore systems on 8.2-RELEASE, and it has implications for Andriy's proposed 
scheduler patch** and for USB.

In today's 8.X and 9.X branches, nothing that I can find stops the other CPUs 
when the kernel panics, but many parts of the locking code get disabled (grep 
on 'panicstr').  The 'bufwrite: buffer is not busy???' panic is caused by the 
syncer encountering an error.  If that happens when it's on the dumping CPU 
everything hangs.  If it's running on a different CPU, it will be blocked and 
hidden by the panic_cpu spinlock in panic(), and the dump continues, polling 
every attached keyboard for a Ctl-C.

But, the new 8.X USB stack relies on multithreading.  (The new stack is the 
variable that broke coredumps for us in the 7.1->8.2 transition, I think.)  SVN 
224223 fixes a hang that would happen when dumpsys() polls the USB keyboard 
(IPMI KVM, in our case).  That helps, but it only gets as far as usb_process(), 
where it hangs in a loop around a cv_wait() call.  This is easy to reproduce by 
adding code to the watchdog to break into the debugger if panicstr is set.

I am experimenting with Andriy's patch** to stop the scheduler and it seems to 
be most of the way there, stopping the CPUs and disabling the rest of locking.  
There are a few places that still reference panicstr, but that's minor.  These 
are the changes I made to the patch:
 * Changed ukbd_do_poll() to return immediately if SCHEDULER_STOPPED() is true, 
so that we don't hang up in USB.  ukbd_yield()  locks up in DROP_GIANT(), and 
if you skip ukbd_yield(), usbd_transfer_poll() locks up trying to drop mutexes.
 * Changed the call to spinlock_enter() back to critical_enter(), so that 
interrupts stay enabled and the hardclock still functions.
 * Added code in the beginning of panic() to switch to CPU 0, so that we're 
able to service the hardclock interrupts and so that watchdog panics get 
through.
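
The first change in the list above amounts to an early-return guard. A minimal user-space sketch, assuming a plain flag `scheduler_stopped` in place of the patch's SCHEDULER_STOPPED() macro and a hypothetical `kbd_do_poll_sketch()` in place of ukbd_do_poll(); this is not the actual driver code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: plain flag standing in for SCHEDULER_STOPPED(). */
static bool scheduler_stopped = false;

/* Hypothetical keyboard state: counts how often the USB path was polled. */
struct kbd_sketch {
	int polls;
};

/* Hypothetical analogue of ukbd_do_poll(): skip all USB work after a panic. */
static void
kbd_do_poll_sketch(struct kbd_sketch *kbd)
{
	assert(kbd != NULL);
	if (scheduler_stopped)
		return;		/* never reach the yield/transfer-poll path */
	kbd->polls++;		/* placeholder for the real polling work */
}
```

The guard avoids the hang, at the price that the keyboard is simply not polled once the scheduler is stopped.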

This has worked 100% for me so far, although anyone using a USB keyboard or 
dump device would still be out of luck.

Thoughts?  It seems like stopping all of the other CPUs is the right thing to 
do on a panic (what are they doing otherwise?).  Are the USB issues fixable?  
If Andriy's patch gets committed it might just involve short-circuiting all of 
the locking in the polling path, but I haven't gotten that far yet.  I bet 
dumping to NFS will have the same problem.

Thanks,
  Andrew

* - http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/155421
** - http://people.freebsd.org/~avg/stop_scheduler_on_panic.8.x.diff
--
Andrew Boyer  abo...@averesystems.com



