e1000 TX unit hang (redux)

2006-07-11 Thread shaw
Hello All,

I have an e1000 card periodically misbehaving with the message 'Detected Tx 
unit hang'.   I've noticed this problem come up on netdev a couple of times 
and found the link to the bug tracking page--
http://sourceforge.net/tracker/index.php?func=detail&aid=1463045&group_id=42302&atid=447449

I've also seen the patch that I believe was placed in 2.6.16 and subsequently 
brought down to 2.4.2? that seems to address this problem by creating a 
tx_timeout_factor relative to the speed of the NIC.  However, there is no 
mention of this workaround/fix on the bug at the link above and I haven't 
found any discussion of it here on netdev.   Auke recommends turning off tso 
to see if that resolves the problem and this also seems to work, though I 
have as yet not been able to confirm this and would prefer a more performance 
friendly fix..if available ;)

Would one of you please give an update on the status of the bug? If a cause 
was ever found and if the tx_timeout_factor was intended as a fix or 
temporary workaround?   I feel like I must have missed something, because I 
never saw the tx_timeout_factor patch go through netdev at all..

Thanks again for your help,
Shaw
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: e1000 TX unit hang (redux)

2006-07-11 Thread Auke Kok

[EMAIL PROTECTED] wrote:
> I have an e1000 card periodically misbehaving with the message 'Detected Tx 
> unit hang'.   I've noticed this problem come up on netdev a couple of times 
> and found the link to the bug tracking page--
>
> http://sourceforge.net/tracker/index.php?func=detail&aid=1463045&group_id=42302&atid=447449
>
> I've also seen the patch that I believe was placed in 2.6.16 and subsequently 
> brought down to 2.4.2?


that's not only impossible but also unlikely - we don't push changes to 2.4 
kernels anymore a lot, I think the last change is likely older than 2.4.28 or so.


> that seems to address this problem by creating a
> tx_timeout_factor relative to the speed of the NIC.  However, there is no 
> mention of this workaround/fix on the bug at the link above and I haven't 
> found any discussion of it here on netdev. 


I wouldn't even know what patch you are talking about (?!)

> Auke recommends turning off tso 
> to see if that resolves the problem and this also seems to work, though I 
> have as yet not been able to confirm this and would prefer a more performance 
> friendly fix..if available ;)
>
> Would one of you please give an update on the status of the bug? If a cause 
> was ever found and if the tx_timeout_factor was intended as a fix or 
> temporary workaround?   I feel like I must have missed something, because I 
> never saw the tx_timeout_factor patch go through netdev at all..


One possible problem is a bad EEPROM bit, where the hardware might have been 
misconfigured. This only affects _some_ older e1000's. Any bug report therefore 
should include the output of `ethtool -e ethX` (as well as the `lspci -vv` 
output, of course). If you haven't already done so, please submit this to the 
bugtracker or to us by e-mail.


Cheers,

Auke


Re: e1000 TX unit hang (redux)

2006-07-11 Thread shawvrana
Hi Auke,

On Tuesday 11 July 2006 14:09, Auke Kok wrote:

> > that seems to address this problem by creating a
> > tx_timeout_factor relative to the speed of the NIC.  However, there is no
> > mention of this workaround/fix on the bug at the link above and I haven't
> > found any discussion of it here on netdev.
>
> I wouldn't even know what patch you are talking about (?!)

Ok, well, the patch is in 2.6.17.4 and looks to have been announced in the 
2.6.16-rc2 changelog -- http://lwn.net/Articles/170529/ -- and written by Jeff 
Kirsher.  I haven't been able to find a link to the original patch submission 
anywhere.  The code looks something like this now: 

/* Detect a transmit hang in hardware, this serializes the
 * check with the clearing of time_stamp and movement of i */
adapter->detect_tx_hung = FALSE;
if (tx_ring->buffer_info[eop].dma &&
    time_after(jiffies, tx_ring->buffer_info[eop].time_stamp +
               (adapter->tx_timeout_factor * HZ))
    && !(E1000_READ_REG(&adapter->hw, STATUS) &
         E1000_STATUS_TXOFF)) {

..where the tx_timeout_factor has been added and is set in the watchdog code 
based on the link speed. 
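
For anyone following along, the condition quoted above can be sketched in plain userspace C. This is only an illustration under stated assumptions: the names time_after_ul, tx_timeout_factor_for and tx_hang_suspected are made up for this sketch, the speed-to-factor values are placeholders rather than the driver's real numbers, and the tx_paused flag stands in for the STATUS register test against E1000_STATUS_TXOFF.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's time_after(): true when a is later
 * than b, and still correct when the tick counter wraps around. */
static bool time_after_ul(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* HYPOTHETICAL mapping from link speed (Mb/s) to timeout factor.
 * The values are placeholders for illustration, not taken from the driver. */
static unsigned int tx_timeout_factor_for(unsigned int link_speed_mbps)
{
    switch (link_speed_mbps) {
    case 10:
        return 8;   /* slow link: wait much longer before flagging a hang */
    case 100:
        return 2;
    default:
        return 1;   /* gigabit and anything else */
    }
}

/* Mirrors the shape of the quoted check: a hang is suspected only when
 * the buffer is still DMA-mapped, it has been pending for more than
 * factor * hz ticks, and the transmitter is not paused. */
static bool tx_hang_suspected(bool dma_mapped, unsigned long now,
                              unsigned long time_stamp, unsigned int factor,
                              unsigned long hz, bool tx_paused)
{
    assert(hz > 0);
    return dma_mapped &&
           time_after_ul(now, time_stamp + factor * hz) &&
           !tx_paused;
}
```

With hz at 100 and a factor of 1, a buffer stamped at tick 1000 is flagged from tick 1101 onward; with a factor of 8 the same buffer is not flagged until after tick 1800, which is the extra slack a slower link would get.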

> that's not only impossible but also unlikely - we don't push changes to 2.4 
> kernels anymore a lot, I think the last change is likely older than 2.4.28.

I'm sure you're right.  Jumped to conclusions on a patch I saw posted at 
redhat.. I'll be more careful next time :)

I'll also try to get some better debugging info from my side.

Thanks.
Shaw


e1000 TX unit hang

2006-07-05 Thread Phil Oester
I saw this error (once) in 2.6.13 a few weeks ago:

Jun 23 15:19:01 X kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun 23 15:19:01 X kernel:   TDH  7e
Jun 23 15:19:01 X kernel:   TDT  7f
Jun 23 15:19:01 X kernel:   next_to_use  7f
Jun 23 15:19:01 X kernel:   next_to_clean  7e
Jun 23 15:19:01 X kernel: buffer_info[next_to_clean]
Jun 23 15:19:01 X kernel:   dma  16ef9012
Jun 23 15:19:01 X kernel:   time_stamp   423845db
Jun 23 15:19:01 X kernel:   next_to_watch  7e
Jun 23 15:19:01 X kernel:   jiffies  423845db
Jun 23 15:19:01 X kernel:   next_to_watch.status 0

so upgraded to 2.6.17 and got a slew of them today - shown below.
E1000 maintainers: any ideas?

Phil
  


Jul  5 11:43:26 X kernel: e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
Jul  5 11:43:26 X kernel:   Tx Queue 0
Jul  5 11:43:26 X kernel:   TDH  a
Jul  5 11:43:26 X kernel:   TDT  a
Jul  5 11:43:26 X kernel:   next_to_use  a
Jul  5 11:43:26 X kernel:   next_to_clean  5f
Jul  5 11:43:26 X kernel: buffer_info[next_to_clean]
Jul  5 11:43:26 X kernel:   time_stamp   b6bc51
Jul  5 11:43:26 X kernel:   next_to_watch  5f
Jul  5 11:43:26 X kernel:   jiffies  b6bcc6
Jul  5 11:43:26 X kernel:   next_to_watch.status 1

Jul  5 11:43:33 X kernel: e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
Jul  5 11:43:34 X kernel:   Tx Queue 0
Jul  5 11:43:36 X kernel:   TDH  2c
Jul  5 11:43:38 X kernel:   TDT  2c
Jul  5 11:43:42 X kernel:   next_to_use  2c
Jul  5 11:43:45 X kernel:   next_to_clean  81
Jul  5 11:43:46 X kernel: buffer_info[next_to_clean]
Jul  5 11:43:47 X kernel:   time_stamp   b6be88
Jul  5 11:43:49 X kernel:   next_to_watch  81
Jul  5 11:43:52 X kernel:   jiffies  b6bf0e
Jul  5 11:43:53 X kernel:   next_to_watch.status 1

Jul  5 11:43:53 X kernel: e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
Jul  5 11:43:53 X kernel:   Tx Queue 0
Jul  5 11:43:53 X kernel:   TDH  ff
Jul  5 11:43:53 X kernel:   TDT  ff
Jul  5 11:43:53 X kernel:   next_to_use  ff
Jul  5 11:43:53 X kernel:   next_to_clean  54
Jul  5 11:43:53 X kernel: buffer_info[next_to_clean]
Jul  5 11:43:53 X kernel:   time_stamp   b6c06d
Jul  5 11:43:53 X kernel:   next_to_watch  54
Jul  5 11:43:53 X kernel:   jiffies  b6c0d2
Jul  5 11:43:53 X kernel:   next_to_watch.status 1

Jul  5 11:43:53 X kernel: e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
Jul  5 11:43:53 X kernel:   Tx Queue 0
Jul  5 11:43:53 X kernel:   TDH  81
Jul  5 11:43:53 X kernel:   TDT  81
Jul  5 11:43:53 X kernel:   next_to_use  81
Jul  5 11:43:53 X kernel:   next_to_clean  d6
Jul  5 11:43:53 X kernel: buffer_info[next_to_clean]
Jul  5 11:43:53 X kernel:   time_stamp   b6c0b8
Jul  5 11:43:53 X kernel:   next_to_watch  d6
Jul  5 11:43:53 X kernel:   jiffies  b6c19b
Jul  5 11:43:53 X kernel:   next_to_watch.status 1

Jul  5 11:43:53 X kernel: e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
Jul  5 11:43:53 X kernel:   Tx Queue 0
Jul  5 11:43:53 X kernel:   TDH  1b
Jul  5 11:43:53 X kernel:   TDT  1b
Jul  5 11:43:53 X kernel:   next_to_use  1b
Jul  5 11:43:53 X kernel:   next_to_clean  71
Jul  5 11:43:53 X kernel: buffer_info[next_to_clean]
Jul  5 11:43:53 X kernel:   time_stamp   b6c1d8
Jul  5 11:43:53 X kernel:   next_to_watch  71
Jul  5 11:43:53 X kernel:   jiffies  b6c255
Jul  5 11:43:53 X kernel:   next_to_watch.status 1

Jul  5 11:43:53 X kernel: e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
Jul  5 11:43:53 X kernel:   Tx Queue 0
Jul  5 11:43:53 X kernel:   TDH  9e
Jul  5 11:43:53 X kernel:   TDT  9e
Jul  5 11:43:53 X kernel:   next_to_use  9e
Jul  5 11:43:54 X kernel:   next_to_clean  f3
Jul  5 11:43:54 X kernel: buffer_info[next_to_clean]
Jul  5 11:43:54 X kernel:   time_stamp   b6c229
Jul  5 11:43:54 X kernel:   next_to_watch  f3
Jul  5 11:43:54 X kernel:   jiffies  b6c329
Jul  5 11:43:54 X kernel:   next_to_watch.status 1

Jul  5 11:43:54 X kernel: e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
Jul  5 11:43:54 X kernel:   Tx Queue 0
Jul  5 11:43:54 X kernel:   TDH  39
Jul  5 11:43:54 X kernel:   TDT  39
Jul  5 11:43:54 X kernel:   next_to_use  39
Jul  5 11:43:54 X kernel:   next_to_clean  8e
Jul  5 11:43:54 X kernel: buffer_info[next_to_clean]
Jul  5 11:43:54 X kernel:   time_stamp   b6c4a0
Jul  5 11:43:54 X kernel:   next_to_watch  8e
Jul  5 11:43:54 X kernel:   jiffies  b6c558
Jul  5 
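
As a reading aid for dumps like the ones above, here is a small C sketch that turns the hex fields into two numbers: how many jiffies the oldest buffer had been waiting when the message was printed, and how many descriptors were queued but not yet cleaned. The helper names are made up for this sketch, and the ring size is an assumption: the log does not report it, and 256 is used in the note below only because the indices in these dumps run up to ff.

```c
#include <assert.h>

/* Jiffies elapsed between the buffer's time_stamp and the moment the
 * dump was printed. Unsigned subtraction keeps this correct across a
 * counter wrap. */
static unsigned long stalled_jiffies(unsigned long jiffies,
                                     unsigned long time_stamp)
{
    return jiffies - time_stamp;
}

/* Descriptors handed to the ring (up to next_to_use) that the driver has
 * not cleaned yet (from next_to_clean). ring_size is an ASSUMPTION
 * supplied by the caller; it does not appear in the log. */
static unsigned int pending_descriptors(unsigned int next_to_use,
                                        unsigned int next_to_clean,
                                        unsigned int ring_size)
{
    assert(ring_size > 0);
    assert(next_to_use < ring_size && next_to_clean < ring_size);
    return (next_to_use + ring_size - next_to_clean) % ring_size;
}
```

For the first 2.6.17 dump (time_stamp b6bc51, jiffies b6bcc6, next_to_use a, next_to_clean 5f) this gives 0x75 = 117 jiffies and, with an assumed 256-entry ring, 171 uncleaned descriptors. For the single 2.6.13 dump it gives 0 jiffies and 1 descriptor.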

Re: e1000 TX unit hang

2006-07-05 Thread Auke Kok

Phil Oester wrote:

> I saw this error (once) in 2.6.13 a few weeks ago:
>
> Jun 23 15:19:01 X kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
> Jun 23 15:19:01 X kernel:   TDH  7e
> Jun 23 15:19:01 X kernel:   TDT  7f
> Jun 23 15:19:01 X kernel:   next_to_use  7f
> Jun 23 15:19:01 X kernel:   next_to_clean  7e
> Jun 23 15:19:01 X kernel: buffer_info[next_to_clean]
> Jun 23 15:19:01 X kernel:   dma  16ef9012
> Jun 23 15:19:01 X kernel:   time_stamp   423845db
> Jun 23 15:19:01 X kernel:   next_to_watch  7e
> Jun 23 15:19:01 X kernel:   jiffies  423845db
> Jun 23 15:19:01 X kernel:   next_to_watch.status 0
>
> so upgraded to 2.6.17 and got a slew of them today - shown below.
> E1000 maintainers: any ideas?


The issue is known and being worked on; unfortunately there is no more information yet.

We're tracking the issue and stuff (debug patches, etc) over here (at 
e1000.sf.net):


http://sourceforge.net/tracker/index.php?func=detail&aid=1463045&group_id=42302&atid=447449

For now, try to see if turning off tso using ethtool helps.


Cheers,

Auke