Bug#648766: [sparc] BUG: NMI Watchdog detected LOCKUP on CPU0

2012-04-08 Thread Ben Hutchings
On Sat, 2012-04-07 at 13:40 -0400, David Miller wrote: From: Ben Hutchings b...@decadent.org.uk Date: Sat, 07 Apr 2012 18:21:38 +0100 cheetah_xcall_deliver() does appear to be relevant to the problem and it looks like it could loop indefinitely - though presumably only if a processor is

Bug#648766: [sparc] BUG: NMI Watchdog detected LOCKUP on CPU0

2012-04-08 Thread David Miller
From: Ben Hutchings b...@decadent.org.uk Date: Sun, 08 Apr 2012 22:12:06 +0100 Will the recipient NACK if the cross-call interrupt is disabled, or do the processors have a buffer/FIFO for such IRQs? Recipients NACK when their incoming cross-call queue is full. A cpu hung with PSTATE_IE clear
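
The retry-on-NACK behaviour discussed in this message can be illustrated with a toy model. This is only a sketch in Python, not the kernel's cheetah_xcall_deliver() code: the names deliver, accepts and max_rounds are hypothetical, and the round limit exists only so the example terminates (the thread suggests the real loop can spin indefinitely if a recipient never accepts).

```python
from typing import Callable, Iterable, Optional


def deliver(targets: Iterable[int],
            accepts: Callable[[int, int], bool],
            max_rounds: int = 1000) -> Optional[int]:
    """Toy model: resend to every CPU that NACKed until none remain.

    accepts(cpu, round_no) returns True if the CPU takes the cross-call
    in that round and False if it NACKs (for example, queue full).
    Returns the number of rounds used, or None if max_rounds is reached.
    max_rounds is an assumption added for the sketch only.
    """
    pending = set(targets)
    for round_no in range(1, max_rounds + 1):
        # Keep only the CPUs that NACKed in this round; they get retried.
        pending = {cpu for cpu in pending if not accepts(cpu, round_no)}
        if not pending:
            return round_no
    return None


# CPU 1 NACKs twice and then accepts; CPU 0 accepts immediately.
slow_cpu = lambda cpu, round_no: round_no >= 3 if cpu == 1 else True
# CPU 1 never accepts, modelling a recipient that never drains its queue.
hung_cpu = lambda cpu, round_no: cpu != 1

print(deliver([0, 1], slow_cpu))
print(deliver([0, 1], hung_cpu, max_rounds=50))
```

In the second call the sender gives up only because of the artificial limit; without it the loop would never exit, which is the situation a watchdog would report as a lockup.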

Bug#648766: [sparc] BUG: NMI Watchdog detected LOCKUP on CPU0

2012-04-07 Thread Jonathan Nieder
Kieron Gillespie wrote: With a completely clean install of Debian and every major version of the Linux kernel I haven't run into this error again. To be clear: are the kernels you are testing now upstream kernels or pre-compiled Debian ones? If I understood correctly before, Tibor experienced

Bug#648766: [sparc] BUG: NMI Watchdog detected LOCKUP on CPU0

2012-04-07 Thread Kieron Gillespie
They are compiled from the kernel source tars from snapshot.debian.org. I was experiencing the problem with the upstream kernels from kernel.org and the pre-compiled kernels from Debian. I am going to try to compile an upstream kernel again and see if it happens. But before that I am going to

Bug#648766: [sparc] BUG: NMI Watchdog detected LOCKUP on CPU0

2012-04-07 Thread Ben Hutchings
Summary for the SPARC maintainers: The NMI watchdog is firing on Sunfire 280R and Sun Blade2500 systems with one or both processors in cheetah_xcall_deliver(). This has been seen under 3.0, 3.2 and 3.3 and seems to be associated with disk I/O. Full bug log is at: http://bugs.debian.org/648766

Bug#648766: [sparc] BUG: NMI Watchdog detected LOCKUP on CPU0

2012-04-07 Thread David Miller
From: Ben Hutchings b...@decadent.org.uk Date: Sat, 07 Apr 2012 18:21:38 +0100 cheetah_xcall_deliver() does appear to be relevant to the problem and it looks like it could loop indefinitely - though presumably only if a processor is behaving strangely? It can only loop indefinitely if one of

Bug#648766: [sparc] BUG: NMI Watchdog detected LOCKUP on CPU0

2012-04-06 Thread Jonathan Nieder
Kieron Gillespie wrote: I am right now testing one major kernel version at a time, and on the 3.0.0-1 I got Just to be clear, if each time you test the version halfway between the newest known-good and oldest known-bad kernel then you only have to test log(n) kernels instead of n. :) That's
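
The halving strategy suggested in this message can be sketched as follows. This is a minimal illustration in Python, assuming the versions are ordered oldest to newest, the first is known good, the last is known bad, and every version after the first bad one is also bad. The function name bisect_first_bad, the is_bad callback and the choice of 3.0.0 as the first bad version are hypothetical and not taken from the bug log.

```python
from typing import Callable, Sequence, Tuple


def bisect_first_bad(versions: Sequence[str],
                     is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Return (first bad version, number of kernels actually tested).

    Assumes versions[0] is known good and versions[-1] is known bad.
    """
    good, bad = 0, len(versions) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2      # halfway between newest good and oldest bad
        tests += 1
        if is_bad(versions[mid]):
            bad = mid                # the regression is at mid or earlier
        else:
            good = mid               # the regression is after mid
    return versions[bad], tests


versions = ["2.6.32", "2.6.39", "3.0.0", "3.1.0", "3.2.13", "3.2.14", "3.3.1"]
# Hypothetical outcome for illustration: everything from 3.0.0 onward locks up.
first_bad, tests = bisect_first_bad(versions, lambda v: versions.index(v) >= 2)
print(first_bad, tests)
```

Note that this only works if the test for each kernel is reliable; with an intermittent lockup, a "good" result may need several runs before it can be trusted.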

Bug#648766: [sparc] BUG: NMI Watchdog detected LOCKUP on CPU0

2012-04-06 Thread Kieron Gillespie
That's what I would have done except I ran into a problem. With a completely clean install of Debian and every major version of the Linux kernel I haven't run into this error again. Of course, I was running a bare base install of Debian with only the ssh server installed. This error has yet to come up

Bug#648766: [sparc] BUG: NMI Watchdog detected LOCKUP on CPU0

2012-04-06 Thread Kieron Gillespie
I am right now testing one major kernel version at a time, and on the 3.0.0-1 I got an interesting error when I ran my brutality test on the system. sd 0:0:0:0: ABORT operation complete. I wonder if this is some symptom of the problem as well. It canceled the cat /dev/sda > /dev/null process

Bug#648766: [sparc] BUG: NMI Watchdog detected LOCKUP on CPU0

2012-04-05 Thread Kieron Gillespie
So what have I learned after lots of test cases? With SMP on or off, and the nouveau driver loaded or not, I have the same unstable behavior and crashing on Linux kernels 3.2.13, 3.2.14 and 3.3.1. All tests were run with only one CPU plugged in, both CPUs plugged in, with SMP on and off, with the

Bug#648766: [sparc] BUG: NMI Watchdog detected LOCKUP on CPU0

2012-04-05 Thread Jonathan Nieder
found 648766 linux-2.6/3.2.13-1 found 648766 linux-2.6/3.2.14-1 # 3.3.1 found 648766 linux-2.6/3.3-1~experimental.1 tags 648766 + upstream quit Kieron Gillespie wrote: Now with that said I can't seem to crash the 2.6.32 kernel in the same way with SMP off, haven't tried with SMP on yet, but I

Bug#648766: [sparc] BUG: NMI Watchdog detected LOCKUP on CPU0

2012-04-03 Thread Jonathan Nieder
severity 648766 important quit Kieron Gillespie wrote: I've attached some images to this bug message, not sure if they will appear Received; thanks much. What version of the kernel are you using? Full dmesg output from a normal boot would be useful as well, so we can get to know your

Bug#648766: [sparc] BUG: NMI Watchdog detected LOCKUP on CPU0

2012-04-03 Thread Kieron Gillespie
Here is the dmesg output from the current system running Linux 3.2.13 with SMP enabled and tickless disabled. [0.00] PROMLIB: Sun IEEE Boot Prom 'OBP 4.9.7 2004/05/27 07:31' [0.00] PROMLIB: Root node compatible: [0.00] Initializing cgroup subsys cpuset [0.00]

Bug#648766: [sparc] BUG: NMI Watchdog detected LOCKUP on CPU0

2012-04-03 Thread Jonathan Nieder
Kieron Gillespie wrote: Here is the dmesg output from the current system running Linux 3.2.13 with SMP enabled and tickless disabled. Great. Is this reproducible without nouveau? It might be possible to test by putting blacklist nouveau in /etc/modprobe.d/kg-disable-nouveau.conf

Bug#648766: [sparc] BUG: NMI Watchdog detected LOCKUP on CPU0

2012-04-03 Thread Kieron Gillespie
I have also noticed that, if I am reading the trace correctly, in both of my cases, in the original bug submitter's case, and in a bug posted on old.nabble.com, the crash always seems to happen when one CPU is doing cheetah_xcall_deliver, and the other CPU is in the same instruction in