There are lots of reports of issues like this for tg3, but most of them end
in "did not hear from requester so closing" or "Fedora Core X is not
supported any more so closing". Does anyone know of a real solution to
this problem? Reading all the Bugzilla and Google results for it makes
my head hurt :-). Note that the error message "tg3_stop_block timed out"
is generic and can have many causes.
We have been experiencing intermittent network failures on systems running
SLF47 (kernel 2.6.9-78.0.1) with the tg3 3.86 driver. The affected nodes
hang waiting for their TCP transfers to finish, which never happens. The
failures are load-related.
This error has occurred from time to time over the life of these nodes,
but they ran more or less stably under SLF45/x86_64 with the tg3 3.77
driver.
There are reports of this issue for RHEL 5 too.
--------------------------------------------------------------------------
NETDEV WATCHDOG: eth1: transmit timed out
tg3: eth1: transmit timed out, resetting
tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000010]
tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4c00 enable_bit=2
tg3: eth1: Link is down.
tg3: eth1: Link is up at 1000 Mbps, full duplex.
tg3: eth1: Flow control is off for TX and off for RX.
--------------
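Because "tg3_stop_block timed out" is a generic message, it may help to check across several failures whether the same blocks time out on every reset and whether only eth1 is affected. Below is a minimal Python sketch under that assumption. The function name summarize_tg3_log is hypothetical (it is not part of tg3 or any tool), and the regular expressions only match the message formats shown in the log above.

```python
import re
from collections import Counter

# Hypothetical helper (not part of tg3): summarize the messages tg3 prints
# around a transmit timeout, using the formats seen in the log above.
TX_TIMEOUT_RE = re.compile(r"NETDEV WATCHDOG: (\S+): transmit timed out")
STOP_BLOCK_RE = re.compile(
    r"tg3_stop_block timed out, ofs=([0-9a-fA-F]+) enable_bit=\d+"
)

SAMPLE_LOG = """\
NETDEV WATCHDOG: eth1: transmit timed out
tg3: eth1: transmit timed out, resetting
tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4c00 enable_bit=2
"""


def summarize_tg3_log(log_text):
    """Return (tx_timeouts, stop_block_timeouts).

    tx_timeouts counts watchdog transmit timeouts per interface.
    stop_block_timeouts counts tg3_stop_block failures per register offset;
    the log prints offsets in hex without a 0x prefix.
    """
    tx_timeouts = Counter()
    stop_block_timeouts = Counter()
    for line in log_text.splitlines():
        m = TX_TIMEOUT_RE.search(line)
        if m:
            tx_timeouts[m.group(1)] += 1
        m = STOP_BLOCK_RE.search(line)
        if m:
            stop_block_timeouts[int(m.group(1), 16)] += 1
    return tx_timeouts, stop_block_timeouts


if __name__ == "__main__":
    tx, stop = summarize_tg3_log(SAMPLE_LOG)
    for iface, n in sorted(tx.items()):
        print(f"{iface}: {n} transmit timeout(s)")
    for ofs, n in sorted(stop.items()):
        print(f"ofs=0x{ofs:04x}: {n} stop_block timeout(s)")
```

Fed the full dmesg output from several failures, a stable set of offsets would suggest that the same hardware blocks fail to stop each time; a varying set would point more toward the "many causes" reading.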
From lspci -v:
01:05.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
Subsystem: Super Micro Computer Inc: Unknown device 1648
Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 193
Memory at fc9e0000 (64-bit, non-prefetchable) [size=64K]
Expansion ROM at <ignored> [disabled]
Capabilities: [40] PCI-X non-bridge device.
Capabilities: [48] Power Management version 2
Capabilities: [50] Vital Product Data
Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
dmesg output from boot:
tg3.c:v3.86 (November 9, 2007)
ACPI: PCI Interrupt 0000:01:05.0[A] -> GSI 29 (level, low) -> IRQ 185
divert: allocating divert_blk for eth0
eth0: Tigon3 [partno(BCM95704A6) rev 2100 PHY(5704)] (PCIX:133MHz:64-bit) 10/100/1000Base-T Ethernet 00:30:48:76:ec:4e
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] WireSpeed[1] TSOcap[0]
eth0: dma_rwctrl[769f4000] dma_mask[64-bit]
ACPI: PCI Interrupt 0000:01:05.1[B] -> GSI 30 (level, low) -> IRQ 193
divert: allocating divert_blk for eth1
eth1: Tigon3 [partno(BCM95704A6) rev 2100 PHY(5704)] (PCIX:133MHz:64-bit) 10/100/1000Base-T Ethernet 00:30:48:76:ec:4f
eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1]
eth1: dma_rwctrl[769f4000] dma_mask[64-bit]
[r...@fcdfcaf853 ~]# ethtool -i eth1
driver: tg3
version: 3.86
firmware-version: 5704-v3.36
bus-info: 0000:01:05.1
It is not clear whether this is the latest NIC firmware.
--------------------------
The hints we have gotten so far suggest that the newer v3.86 driver tries
to use one of the NIC's TCP offload features, which is broken on the
BCM5704.
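One way to compare the offload hint against the boot messages is to diff the capability flags tg3 prints for each port at probe time. The sketch below assumes only the "ethN: RXcsums[...] ... TSOcap[...]" line format shown in the dmesg output above; the helper names parse_tg3_flags and diff_flags are hypothetical.

```python
import re

# Hypothetical helper: extract the per-port capability flags tg3 prints at
# probe time, e.g. "eth1: RXcsums[1] LinkChgREG[0] ... TSOcap[1]".
FLAGS_LINE_RE = re.compile(r"^(\S+): (RXcsums\[.*)$")
FLAG_RE = re.compile(r"(\w+)\[(\d+)\]")

BOOT_LOG = """\
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] WireSpeed[1] TSOcap[0]
eth0: dma_rwctrl[769f4000] dma_mask[64-bit]
eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1]
eth1: dma_rwctrl[769f4000] dma_mask[64-bit]
"""


def parse_tg3_flags(dmesg_text):
    """Return {interface: {flag_name: value}} for tg3 capability lines."""
    result = {}
    for line in dmesg_text.splitlines():
        m = FLAGS_LINE_RE.match(line.strip())
        if m:
            iface, rest = m.groups()
            result[iface] = {name: int(val) for name, val in FLAG_RE.findall(rest)}
    return result


def diff_flags(flags, a, b):
    """Return {flag: (value_on_a, value_on_b)} for flags that differ."""
    names = set(flags[a]) | set(flags[b])
    return {
        n: (flags[a].get(n), flags[b].get(n))
        for n in sorted(names)
        if flags[a].get(n) != flags[b].get(n)
    }


if __name__ == "__main__":
    flags = parse_tg3_flags(BOOT_LOG)
    print(diff_flags(flags, "eth0", "eth1"))
    # prints {'ASF': (1, 0), 'TSOcap': (0, 1)}
```

On these boards only eth1, the port that fails, reports TSOcap[1] (eth0 reports TSOcap[0] and ASF[1]). That is consistent with the offload hint but does not prove it, since TSOcap only reports capability. If the hint is right, turning TSO off on the affected port with "ethtool -K eth1 tso off" and confirming the setting with "ethtool -k eth1" would be a cheap experiment.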
Any help is appreciated.
Steve Timm
Connie Sieh
--
------------------------------------------------------------------
Steven C. Timm, Ph.D (630) 840-8525
t...@fnal.gov http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.