Re: R8169 driver & Netgear GA511 NIC - soft lockup/hang in rtl8169_phy_timer

2006-01-01 Thread Russell Steinthal
On Mon, 2006-01-02 at 00:15 +0100, Francois Romieu wrote: 
> Russell Steinthal <[EMAIL PROTECTED]> :
> [...]
> > I'm seeing a "hang" under 2.6.14.4 when the link goes up (I think) on a
> > Netgear GA511 PCMCIA gigabit NIC.  I put "hang" in quotes because while
> > the system is generally unresponsive, Magic SysRq responds and I can
> > return the system to normal by ejecting the card.
> 
> Can you reproduce the issue w/o link up at boot time with 2.6.15-rc7 ?

Yes, the same problem appears with the card inserted and no ethernet
cable plugged in using RC7.  On my Ubuntu system, the problem manifested
during the initialization of the Hardware Abstraction Layer; the
relevant dmesg excerpt is below:

BUG: soft lockup detected on CPU#0!

Pid: 0, comm:  swapper
EIP: 0060:[] CPU: 0
EIP is at rtl8169_phy_timer+0x8c/0xbd [r8169]
 EFLAGS: 0296Not tainted  (2.6.15-rc7)
EAX: 0001 EBX: ceeec000 ECX: 089e EDX: c0414000
ESI: ceeec260 EDI: d097e000 EBP: ceeec8f8 DS: 007b ES: 007b
CR0: 8005003b CR2: 0804fff5 CR3: 0fbfe000 CR4: 0690
 [] rtl8169_phy_timer+0x0/0xbd [r8169]
 [] run_timer_softirq+0xb9/0x208
 [] __do_softirq+0x4d/0x9c
 [] do_softirq+0x2d/0x2f
 [] irq_exit+0x38/0x3a
 [] do_IRQ+0x20/0x28
 [] common_interrupt+0x1a/0x20
 [] default_idle+0x2b/0x53
 [] cpu_idle+0x45/0x71
 [] start_kernel+0x16b/0x180
 [] unknown_bootoption+0x0/0x1bf
SysRq : HELP : loglevel0-8 reBoot tErm Full kIll saK showMem Nice
powerOff showPc unRaw Sync showTasks Unmount 
SysRq : Show Regs

Pid: 0, comm:  swapper
EIP: 0060:[] CPU: 0
EIP is at rtl8169_phy_timer+0x8c/0xbd [r8169]
 EFLAGS: 0296Not tainted  (2.6.15-rc7)
EAX: 0001 EBX: ceeec000 ECX: 1930 EDX: c0414000
ESI: ceeec260 EDI: d097e000 EBP: ceeec8f8 DS: 007b ES: 007b
CR0: 8005003b CR2: 0804fff5 CR3: 0fbfe000 CR4: 0690
 [] rtl8169_phy_timer+0x0/0xbd [r8169]
 [] run_timer_softirq+0xb9/0x208
 [] __do_softirq+0x4d/0x9c
 [] do_softirq+0x2d/0x2f
 [] irq_exit+0x38/0x3a
 [] do_IRQ+0x20/0x28
 [] common_interrupt+0x1a/0x20
 [] default_idle+0x2b/0x53
 [] cpu_idle+0x45/0x71
 [] start_kernel+0x16b/0x180
 [] unknown_bootoption+0x0/0x1bf
pccard: card ejected from slot 1
IBM machine detected. Enabling interrupts during APM calls.

To avoid making this message overly long, I've excerpted the other
portions of the boot logs.  The full versions are available, however,
at:

http://www.steinthal.us/kernel/2.6.15-rc7-boot.log (/var/log/messages)
http://www.steinthal.us/kernel/2.6.15-rc7-dmesg (dmesg output)

Note that the two logs are actually taken from different (successive)
boots, but under identical conditions.

> Please send:
> - .config

Again, here's the networking section of .config; the full .config is
available at http://www.steinthal.us/kernel/config-2.6.15-rc7

#
# Network device support
#
CONFIG_NETDEVICES=y
CONFIG_DUMMY=m
# CONFIG_BONDING is not set
# CONFIG_EQUALIZER is not set
# CONFIG_TUN is not set
# CONFIG_NET_SB1000 is not set

#
# ARCnet devices
#
# CONFIG_ARCNET is not set

#
# PHY device support
#
# CONFIG_PHYLIB is not set

#
# Ethernet (10 or 100Mbit)
#
CONFIG_NET_ETHERNET=y
CONFIG_MII=m
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NET_VENDOR_3COM is not set
# CONFIG_LANCE is not set
# CONFIG_NET_VENDOR_SMC is not set
# CONFIG_NET_VENDOR_RACAL is not set

#
# Tulip family network device support
#
# CONFIG_NET_TULIP is not set
# CONFIG_AT1700 is not set
# CONFIG_DEPCA is not set
# CONFIG_HP100 is not set
# CONFIG_NET_ISA is not set
CONFIG_NET_PCI=y
# CONFIG_PCNET32 is not set
# CONFIG_AMD8111_ETH is not set
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_AC3200 is not set
# CONFIG_APRICOT is not set
# CONFIG_B44 is not set
# CONFIG_FORCEDETH is not set
# CONFIG_CS89x0 is not set
# CONFIG_DGRS is not set
CONFIG_EEPRO100=m
# CONFIG_E100 is not set
# CONFIG_FEALNX is not set
# CONFIG_NATSEMI is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_8139CP is not set
# CONFIG_8139TOO is not set
# CONFIG_SIS900 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SUNDANCE is not set
# CONFIG_TLAN is not set
# CONFIG_VIA_RHINE is not set
# CONFIG_NET_POCKET is not set

#
# Ethernet (1000 Mbit)
#
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
# CONFIG_E1000 is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_R8169=m
# CONFIG_R8169_NAPI is not set
# CONFIG_SIS190 is not set
# CONFIG_SKGE is not set
# CONFIG_SK98LIN is not set
# CONFIG_VIA_VELOCITY is not set
# CONFIG_TIGON3 is not set
# CONFIG_BNX2 is not set

#
# Ethernet (1 Mbit)
#
# CONFIG_CHELSIO_T1 is not set
# CONFIG_IXGB is not set
# CONFIG_S2IO is not set

#
# Token Ring devices
#
# CONFIG_TR is not set

#
# Wireless LAN (non-hamradio)
#
CONFIG_NET_RADIO=y

#
# Obsolete Wireless cards support (pre-802.11)
#
# CONFIG_STRIP is not set
CONFIG_ARLAN=m
# CONFIG_WAVELAN is not set
# CONFIG_PCMCIA_WAVELAN is not set
# CONFIG_PCMCIA_NETWAVE is not set

#
# Wireless 802.11 Frequency Hopping card

Re: R8169 driver & Netgear GA511 NIC - soft lockup/hang in rtl8169_phy_timer

2006-01-01 Thread Francois Romieu
Russell Steinthal <[EMAIL PROTECTED]> :
[...]
> I'm seeing a "hang" under 2.6.14.4 when the link goes up (I think) on a
> Netgear GA511 PCMCIA gigabit NIC.  I put "hang" in quotes because while
> the system is generally unresponsive, Magic SysRq responds and I can
> return the system to normal by ejecting the card.

Can you reproduce the issue w/o link up at boot time with 2.6.15-rc7 ?

Please send:
- .config
- complete dmesg from boot + SysRq-P

r8169.o module could help.

-- 
Ueimor
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] latency tracer, 2.6.15-rc7

2006-01-01 Thread Lee Revell
On Fri, 2005-12-30 at 20:15 -0500, Lee Revell wrote: 
> On Fri, 2005-12-30 at 17:02 -0800, Linus Torvalds wrote:
> > 
> > On Fri, 30 Dec 2005, Lee Revell wrote:
> > > 
> > > It seems that the networking code's use of RCU can cause 10ms+
> > > latencies:
> > 
> > Hmm. Is there a big jump at the 10ms mark? Do you have a 100Hz timer 
> > source? 
> > 
> > A latency jump at 10ms would tend to indicate a missed wakeup that 
> > was "picked up" by the next timer tick.
> 
> No there are no large jumps, it really seems that this was the network
> code causing an RCU callback to drop ~2K routes at once.  Specifically
> RCU invokes dst_rcu_free 2085 times in a single batch
> (call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free) is only called from
> rt_free() and rt_drop()).
> 
> I have found that many of the paths in the network stack that can cause
> high latencies can be tuned via sysctls (for example
> net.ipv4.route.gc_thresh); this one may be the same.

On a related topic:

I thought I had solved the route cache flushing problem by tuning these
sysctls, but it does not seems to help.

The short version is that rt_run_flush and rt_garbage_collect can cause
15ms+ latencies when a lot of routes (up to 4096 it seems) are flushed
in one go:

$ grep "local_bh_enable (rt_run_flush)" /proc/latency_trace | wc -l
4096

$ grep "local_bh_enable (rt_garbage_collect)" /proc/latency_trace | wc
-l
4096

(rt_run_flush and rt_garbage_collect call spin_lock_bh/spin_unlock_bh
once for each flushed route so the above grep effectively counts the
number of routes flushed at once)

I reported this a year or so ago and it led to Ingo adding an option to
-rt (then called the voluntary preempt patch) to always run softirqs in
threads which makes rt_run_flush preemptible.

Anyway I thought I could work around this with mainline by tuning the
network stack to minimize the effect of route cache flushing on
scheduler latency using these sysctls to cause the route cache to be
flushed more often and/or limit the maximum size of the route cache:

$ sudo /sbin/sysctl -a | grep route
net.ipv4.route.gc_elasticity = 8
net.ipv4.route.gc_interval = 60
net.ipv4.route.gc_timeout = 300
net.ipv4.route.gc_min_interval_ms = 20
net.ipv4.route.gc_min_interval = 0
net.ipv4.route.max_size = 65536
net.ipv4.route.gc_thresh = 256
net.ipv4.route.max_delay = 10
net.ipv4.route.min_delay = 2

I tried lowering gc_min_interval_ms, gc_timeout, max_size, and gc_thresh
but rt_run_flush will still process up to 4096 (the size of the route
hash table?) routes at once.

(stop here if you don't care to interpret long latency_trace reports)

17ms+ latency caused by rt_run_flush then rt_garbage_collect processing
4096 routes:

preemption latency trace v1.1.5 on 2.6.15-rc7

 latency: 17286 us, #19154/19154, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0)
-  
| task: gtk-gnutella-8581 (uid:1000 nice:0 policy:0 rt_prio:0)
-  

 _--=> CPU#
/ _-=> irqs-off
   | / _=> need-resched
   || / _---=> hardirq/softirq 
   ||| / _--=> preempt-depth   
    /  
   | delay 
   cmd pid | time  |   caller  
  \   /|   \   |   /   
epiphany-8742  0dns80us : __trace_start_sched_wakeup (try_to_wake_up)
epiphany-8742  0dns81us : __trace_start_sched_wakeup <<...>-8581> (73 0)
epiphany-8742  0dns72us : preempt_schedule (__trace_start_sched_wakeup)
epiphany-8742  0dns62us : preempt_schedule (try_to_wake_up)
epiphany-8742  0dns53us : preempt_schedule (__wake_up)
epiphany-8742  0dns54us : preempt_schedule (ep_poll_safewake)
epiphany-8742  0dnH56us : do_IRQ (c011223e b 0)
epiphany-8742  0d.h.6us : __do_IRQ (do_IRQ)
epiphany-8742  0d.h17us+: mask_and_ack_8259A (__do_IRQ)
epiphany-8742  0d.h.   12us : handle_IRQ_event (__do_IRQ)
epiphany-8742  0d.h.   13us : usb_hcd_irq (handle_IRQ_event)
epiphany-8742  0d.h.   13us : uhci_irq (usb_hcd_irq)
epiphany-8742  0d.h.   14us : via_driver_irq_handler (handle_IRQ_event)
epiphany-8742  0d.h.   16us : rhine_interrupt (handle_IRQ_event)
epiphany-8742  0d.h.   16us : ioread16 (rhine_interrupt)
epiphany-8742  0d.h.   17us : ioread8 (rhine_interrupt)
epiphany-8742  0d.h.   18us : iowrite16 (rhine_interrupt)
epiphany-8742  0d.h.   19us : ioread8 (rhine_interrupt)
epiphany-8742  0d.h.   20us : rhine_tx (rhine_interrupt)
epiphany-8742  0d.h1   21us : raise_softirq_irqoff (rhine_tx)
epiphany-8742  0d.h.   22us : ioread16 (rhine_interrupt)
epiphany-8742  0d.h.   23us : ioread8 (rhine_interrupt)
epiphany-8742  0d.h1   25us : note_interrupt (__do_IRQ)
epiphany-8742  0d.h1   25us : end_8259A_irq (__do_IRQ)
epiphany-8742  0d.h1   26us : enable_8259A_irq (end_8259A_irq)
epiphany-8742  0dnH5   28us : irq_exit (do_IRQ)
epiphany-8742  0dns5   28us < (20977