Re: [E1000-devel] e1000: fix Tx hangs by disabling 64-bit DMA

Björn Stenberg Fri, 25 Feb 2011 00:20:16 -0800

Jesse Brandeburg wrote:
> okay, are you getting NETDEV WATCHDOG messages?


I have seen only one, half a day after booting the new kernel:

[41242.800042] ------------[ cut here ]------------
[41242.800061] WARNING: at /build/buildd-linux-2.6_2.6.37-1-i386-vxjyZA/linux-2.6-2.6.37/debian/build/source_i386_none/net/sched/sch_generic.c:258 dev_watchdog+0xec/0x17d()
[41242.800067] Hardware name: PowerEdge 1850
[41242.800070] NETDEV WATCHDOG: eth2 (e1000): transmit queue 0 timed out
[41242.800073] Modules linked in: binfmt_misc fuse ipmi_si ipmi_devintf ipmi_msghandler ide_generic ide_gd_mod ide_cd_mod ide_core radeon ttm drm_kms_helper drm i2c_algo_bit e752x_edac i2c_core power_supply edac_core video output tpm_tis tpm tpm_bios psmouse dcdbas evdev shpchp pcspkr processor serio_raw button thermal_sys pci_hotplug rng_core ext3 jbd mbcache sd_mod crc_t10dif uhci_hcd mptspi mptscsih mptbase scsi_transport_spi sg sr_mod cdrom ehci_hcd ata_generic usbcore ata_piix libata scsi_mod e1000 floppy nls_base [last unloaded: scsi_wait_scan]
[41242.800135] Pid: 0, comm: kworker/0:1 Not tainted 2.6.37-1-amd64 #1
[41242.800139] Call Trace:
[41242.800142]  <IRQ>  [<ffffffff81046ed4>] ? warn_slowpath_common+0x78/0x8c
[41242.800158]  [<ffffffff81046f87>] ? warn_slowpath_fmt+0x45/0x4a
[41242.800166]  [<ffffffff81015e62>] ? p4_pmu_enable_event+0x121/0x132
[41242.800171]  [<ffffffff8127e1a0>] ? netif_tx_lock+0x3d/0x65
[41242.800175]  [<ffffffff8127e2b4>] ? dev_watchdog+0xec/0x17d
[41242.800181]  [<ffffffff81015ea6>] ? p4_pmu_enable_all+0x33/0x46
[41242.800188]  [<ffffffff81052ff8>] ? run_timer_softirq+0x1cc/0x298
[41242.800193]  [<ffffffff8127e1c8>] ? dev_watchdog+0x0/0x17d
[41242.800201]  [<ffffffff81067971>] ? ktime_get+0x5f/0xb8
[41242.800208]  [<ffffffff8104c9b9>] ? __do_softirq+0xcf/0x1b6
[41242.800215]  [<ffffffff8100a91c>] ? call_softirq+0x1c/0x30
[41242.800220]  [<ffffffff8100bef7>] ? do_softirq+0x3f/0x79
[41242.800226]  [<ffffffff8104c851>] ? irq_exit+0x36/0x79
[41242.800234]  [<ffffffff8102120e>] ? smp_apic_timer_interrupt+0x87/0x94
[41242.800240]  [<ffffffff8100a3d3>] ? apic_timer_interrupt+0x13/0x20
[41242.800243]  <EOI>  [<ffffffff81010c71>] ? mwait_idle+0x81/0x8c
[41242.800251]  [<ffffffff81010c1e>] ? mwait_idle+0x2e/0x8c
[41242.800256]  [<ffffffff81008be9>] ? cpu_idle+0xb2/0x124
[41242.800265]  [<ffffffff81319f93>] ? _raw_spin_unlock_irqrestore+0xb/0x11
[41242.800271]  [<ffffffff8131334e>] ? start_secondary+0x1e5/0x1eb
[41242.800276] ---[ end trace 4b5c047ab36314c4 ]---

> what does ethtool -S eth2 say, are there lots of tx_timeout there?

What is "lots"? :-) I see 50 right now, after running for 3 days:

# ethtool -S eth2
NIC statistics:
     rx_packets: 59592562
     tx_packets: 91875040
     rx_bytes: 12104735304
     tx_bytes: 112256070528
     rx_broadcast: 396143
     tx_broadcast: 82
     rx_multicast: 310754
     tx_multicast: 315292
     rx_errors: 0
     tx_errors: 0
     tx_dropped: 0
     multicast: 310754
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_no_buffer_count: 0
     rx_missed_errors: 1409
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 0
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 50
     tx_restart_queue: 7688
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 12513512
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_long_byte_count: 12104735304
     rx_csum_offload_good: 57600914
     rx_csum_offload_errors: 373
     alloc_rx_buff_failed: 0
     tx_smbus: 0
     rx_smbus: 0
     dropped_smbus: 0
# uptime
 08:51:21 up 3 days, 14:17,  3 users,  load average: 0.79, 0.72, 0.72
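To put "lots" into numbers, the counter can be divided by the uptime. The following is only a rough sketch, not anything from the driver or from ethtool itself: the helper names parse_ethtool_stats and uptime_days are made up for illustration, the sample text is a trimmed copy of the output above, and the uptime values are typed in by hand from the uptime line.

```python
import re

def parse_ethtool_stats(text):
    """Parse 'name: value' lines from `ethtool -S` output into a dict of ints.

    Hypothetical helper for illustration; lines that are not a single
    counter name followed by an integer (e.g. 'NIC statistics:') are skipped.
    """
    stats = {}
    for line in text.splitlines():
        m = re.match(r"^\s*([A-Za-z0-9_]+):\s*(\d+)\s*$", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats

def uptime_days(days, hours, minutes):
    """Convert an uptime given as days/hours/minutes into fractional days."""
    return days + hours / 24 + minutes / (24 * 60)

# Trimmed copy of the `ethtool -S eth2` output quoted above.
SAMPLE = """\
NIC statistics:
     tx_packets: 91875040
     tx_timeout_count: 50
     tx_restart_queue: 7688
"""

stats = parse_ethtool_stats(SAMPLE)

# Uptime copied by hand from the `uptime` line: 3 days, 14:17.
days = uptime_days(3, 14, 17)

# Average number of Tx timeouts per day since boot.
timeouts_per_day = stats["tx_timeout_count"] / days

print(f"tx_timeout_count: {stats['tx_timeout_count']}")
print(f"uptime: {days:.2f} days")
print(f"timeouts per day: {timeouts_per_day:.1f}")
```

With the figures above this works out to roughly 14 timeouts per day on average, which assumes the counter started at zero at boot and says nothing about how the hangs cluster in time.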

> sometimes these can be false hangs.  If you are truly getting reset
> repeatedly like this every couple days or so, would you be willing to
> run a driver that logged a bunch of info to syslog when the tx hang
> occurs (it can keep the nic offline for a couple extra seconds while
> dumping all the info)  it doesn't have any run-time impact.

I can't vouch for every single hang line, but I ssh into this machine to read 
mail and I notice these freezes quite frequently, often several times a day. (I 
even got one just now while writing this mail.)

I'll be happy to run a debug driver.

-- 
Björn
