Re: 2.6.22.1: hang with forcedeth driver?

2007-08-05 Thread Timo Jantunen
On Fri, 3 Aug 2007, Andrew Morton wrote:

> On Thu, 2 Aug 2007 11:57:21 +0300 (EEST)
> Timo Jantunen <[EMAIL PROTECTED]> wrote:
> 
> > Heip!
> > 
> > I have had a few total hangs with the 2.6.22.1 kernel. Everything suddenly
> > freezes and nothing works (SysRq keys, pinging the machine from the
> > network). Neither syslog nor netconsole has any relevant messages. I'm 99%
> > sure this didn't happen in 2.6.21.x kernels.
> 
> Try enabling the NMI watchdog?  (boot with nmi_watchdog=1 on the kernel
> command line) (or maybe nmi_watchdog=2).

nmi_watchdog=1 got me (in dmesg)
Testing NMI watchdog ... CPU#0: NMI appears to be stuck (0->0)!
CPU#1: NMI appears to be stuck (0->0)!

nmi_watchdog=2 got me
Testing NMI watchdog ... OK.

In both cases, when the machine hung, nothing happened and nothing more
was written to the console, syslog or netconsole (with the console loglevel
set to 9).


> otoh, I think NMI watchdog is always enabled on x86_64.  But the counts
> aren't going up.  I forget what the story is there, apart from
> lets-be-as-different-from-i386-as-we-can :(

As I said later in my message, I'm using an i386 kernel.


> > All hangs happened with relatively high network traffic (twice with mplayer
> > using a remote display, once with high network fs activity). I copied
> > gigabytes of files over NFS but couldn't duplicate the problem that way,
> > but OTOH I have also watched hours of video without problems, so the problem
> > doesn't occur often. (In the mplayer case, the file I play is usually on
> > a remote NFS disk, too.)
> > 
> > I started using mplayer from a text console and today I finally managed to
> > catch the first bit of information. The machine hung like before (even SysRq
> > keys don't print anything) but there was one console message:
> > 
> > eth0: too many iterations (6) in nv_nic_irq.
> > 
> > Looking at the sources, it seems that when too much work happens in a single
> > interrupt, the driver tries to disable device interrupts (and prints the above
> > message). It doesn't seem to be expecting the machine to hang afterwards.
> > 
> > I guess the quick fix for me is to increase the max_interrupt_work value so it
> > doesn't get hit as easily.
> 
> Well.  A key piece of information would be: did that help?

I added forcedeth.max_interrupt_work=16 to the kernel command line and
haven't seen the problem since. But I only hit the problem about once a
week, so I can't yet say whether it is gone for good.


For the tests with NMI I used forcedeth.max_interrupt_work=2 and managed to
reproduce the problem in a few minutes by copying from /dev/zero to an NFS
disk. I also did it without any binary drivers loaded.


As I see it, there are two separate problems: the "too many iterations"
message and the fact that the machine hangs after it. Since the same check
was in the previous kernel but I don't remember ever seeing it trigger, it
is possible that 2.6.22 changed timings or something similar, which made
hitting the default max_interrupt_work limit more likely.

I can try to get 2.6.21.x to hang with forcedeth.max_interrupt_work=2 if
you think that information is useful.


> > I have an nForce3 board and an Athlon 64 X2 CPU (but am using a 32-bit
> > kernel). I'm using the built-in Ethernet with the forcedeth driver. I also
> > have ATI/AMD binary drivers loaded and, at least in some of the cases,
> > VMware host drivers, too.

//T

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.22.1: hang with forcedeth driver?

2007-08-03 Thread Andrew Morton
On Thu, 2 Aug 2007 11:57:21 +0300 (EEST)
Timo Jantunen <[EMAIL PROTECTED]> wrote:

> Heip!
> 
> I have had a few total hangs with the 2.6.22.1 kernel. Everything suddenly
> freezes and nothing works (SysRq keys, pinging the machine from the
> network). Neither syslog nor netconsole has any relevant messages. I'm 99%
> sure this didn't happen in 2.6.21.x kernels.

Try enabling the NMI watchdog?  (boot with nmi_watchdog=1 on the kernel
command line) (or maybe nmi_watchdog=2).

otoh, I think NMI watchdog is always enabled on x86_64.  But the counts
aren't going up.  I forget what the story is there, apart from
lets-be-as-different-from-i386-as-we-can :(

> All hangs happened with relatively high network traffic (twice with mplayer
> using a remote display, once with high network fs activity). I copied
> gigabytes of files over NFS but couldn't duplicate the problem that way,
> but OTOH I have also watched hours of video without problems, so the problem
> doesn't occur often. (In the mplayer case, the file I play is usually on
> a remote NFS disk, too.)
> 
> I started using mplayer from a text console and today I finally managed to
> catch the first bit of information. The machine hung like before (even SysRq
> keys don't print anything) but there was one console message:
> 
> eth0: too many iterations (6) in nv_nic_irq.
> 
> Looking at the sources, it seems that when too much work happens in a single
> interrupt, the driver tries to disable device interrupts (and prints the above
> message). It doesn't seem to be expecting the machine to hang afterwards.
> 
> I guess the quick fix for me is to increase the max_interrupt_work value so it
> doesn't get hit as easily.
> 

Well.  A key piece of information would be: did that help?

> 
> I have an nForce3 board and an Athlon 64 X2 CPU (but am using a 32-bit
> kernel). I'm using the built-in Ethernet with the forcedeth driver. I also
> have ATI/AMD binary drivers loaded and, at least in some of the cases,
> VMware host drivers, too.
> 



2.6.22.1: hang with forcedeth driver?

2007-08-02 Thread Timo Jantunen
Heip!

I have had a few total hangs with the 2.6.22.1 kernel. Everything suddenly
freezes and nothing works (SysRq keys, pinging the machine from the
network). Neither syslog nor netconsole has any relevant messages. I'm 99%
sure this didn't happen in 2.6.21.x kernels.

All hangs happened with relatively high network traffic (twice with mplayer
using a remote display, once with high network fs activity). I copied
gigabytes of files over NFS but couldn't duplicate the problem that way,
but OTOH I have also watched hours of video without problems, so the problem
doesn't occur often. (In the mplayer case, the file I play is usually on
a remote NFS disk, too.)

I started using mplayer from a text console and today I finally managed to
catch the first bit of information. The machine hung like before (even SysRq
keys don't print anything) but there was one console message:

eth0: too many iterations (6) in nv_nic_irq.

Looking at the sources, it seems that when too much work happens in a single
interrupt, the driver tries to disable device interrupts (and prints the above
message). It doesn't seem to be expecting the machine to hang afterwards.

I guess the quick fix for me is to increase the max_interrupt_work value so it
doesn't get hit as easily.
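
To illustrate the mechanism described above, here is a minimal user-space
sketch of an interrupt handler with a bounded work loop. It is not the
forcedeth source: struct fake_nic, handle_pending() and fake_nic_irq() are
hypothetical names made up for this example, and only the idea of a
max_interrupt_work limit and the wording of the log message are taken from
the report. It assumes the limit counts loop iterations and that the handler
masks the device interrupt once the limit is reached.

```c
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the driver's tunable; 6 matches the count in the log line. */
static int max_interrupt_work = 6;

/* Hypothetical simulated device: events waiting and interrupt mask state. */
struct fake_nic {
    int pending_events;
    bool irq_enabled;
};

/* Service one pending event. Returns false if there was nothing to do. */
static bool handle_pending(struct fake_nic *nic)
{
    if (nic->pending_events == 0)
        return false;
    nic->pending_events--;
    return true;
}

/*
 * Bounded interrupt loop: keep servicing events until none remain or the
 * iteration budget is used up. When the budget is exhausted, mask the
 * device interrupt and log a message so that one busy device cannot keep
 * the CPU inside the handler indefinitely.
 * Returns the number of iterations performed.
 */
int fake_nic_irq(struct fake_nic *nic)
{
    int iterations = 0;

    while (handle_pending(nic)) {
        iterations++;
        if (iterations >= max_interrupt_work) {
            nic->irq_enabled = false;
            printf("eth0: too many iterations (%d) in fake_nic_irq.\n",
                   iterations);
            break;
        }
    }
    return iterations;
}
```

In this sketch, a device with 10 pending events leaves the handler after 6
iterations with its interrupt masked and 4 events still queued, while a
device with 3 pending events is drained normally. Whatever is supposed to
unmask the interrupt and pick up the leftover work is outside the sketch,
and it is that follow-up path that does not seem to work here.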


I have an nForce3 board and an Athlon 64 X2 CPU (but am using a 32-bit
kernel). I'm using the built-in Ethernet with the forcedeth driver. I also
have ATI/AMD binary drivers loaded and, at least in some of the cases,
VMware host drivers, too.


parts of .config (full .config available on request)
===cut
CONFIG_NO_HZ=y
CONFIG_SMP=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_FORCEDETH=y
# CONFIG_FORCEDETH_NAPI is not set
CONFIG_DETECT_SOFTLOCKUP=y
===cut


lspci
===cut
00:00.0 Host bridge: nVidia Corporation nForce3 250Gb Host Bridge (rev a1)
00:01.0 ISA bridge: nVidia Corporation nForce3 250Gb LPC Bridge (rev a2)
00:01.1 SMBus: nVidia Corporation nForce 250Gb PCI System Management (rev a1)
00:02.0 USB Controller: nVidia Corporation CK8S USB Controller (rev a1)
00:02.1 USB Controller: nVidia Corporation CK8S USB Controller (rev a1)
00:02.2 USB Controller: nVidia Corporation nForce3 EHCI USB 2.0 Controller (rev a2)
00:05.0 Bridge: nVidia Corporation CK8S Ethernet Controller (rev a2)
00:06.0 Multimedia audio controller: nVidia Corporation nForce3 250Gb AC'97 Audio Controller (rev a1)
00:08.0 IDE interface: nVidia Corporation CK8S Parallel ATA Controller (v2.5) (rev a2)
00:09.0 IDE interface: nVidia Corporation CK8S Serial ATA Controller (v2.5) (rev a2)
00:0a.0 IDE interface: nVidia Corporation CK8S Serial ATA Controller (v2.5) (rev a2)
00:0b.0 PCI bridge: nVidia Corporation nForce3 250Gb AGP Host to PCI Bridge (rev a2)
00:0e.0 PCI bridge: nVidia Corporation nForce3 250Gb PCI-to-PCI Bridge (rev a2)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:00.0 VGA compatible controller: ATI Technologies Inc R420 JP [Radeon X800XT]
01:00.1 Display controller: ATI Technologies Inc R420 [X800XT-PE] (Secondary)
02:0d.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
===cut

lspci -vvv, only eth0 device
===cut
00:05.0 Bridge: nVidia Corporation CK8S Ethernet Controller (rev a2)
Subsystem: Micro-Star International Co., Ltd. Unknown device 0250
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- SERR-