Re: [PATCH 11/22] e1000: Fire a link even interrupt instead of a watchdog at init.

2006-12-11 Thread Shaw Vrana
On Fri, Dec 08, 2006 at 03:03:09PM -0800, Kok, Auke wrote:
> 
> Instead of calling a watchdog event we let our interrupt handler
> cascade a link event. This allows us to spot link up immediately
> after _up() without racing against a new watchdog.
> 
> Signed-off-by: Jesse Brandeburg <[EMAIL PROTECTED]>
> Signed-off-by: Auke Kok <[EMAIL PROTECTED]>
> ---
> 
>  drivers/net/e1000/e1000_main.c |3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
> index 0ebd8e2..c5c764f 100644
> --- a/drivers/net/e1000/e1000_main.c
> +++ b/drivers/net/e1000/e1000_main.c
> @@ -558,7 +558,8 @@ e1000_up(struct e1000_adapter *adapter)
>  
>   clear_bit(__E1000_DOWN, &adapter->flags);
>  
> - mod_timer(&adapter->watchdog_timer, jiffies + 2 * HZ);
> + /* fire a link change interrupt to start the watchdog */
> + E1000_WRITE_REG(&adapter->hw, ICS, E1000_ICS_LSC);
>   return 0;
>  }

Hi Auke,

Could you explain the race, how it would be triggered, and what it might
look like if it were to occur?

Thanks,
Shaw
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: e100: inappropriate handling of shared interrupt ?

2006-12-01 Thread Shaw Vrana
Hi Auke,

On Mon, Nov 27, 2006 at 11:18:13AM -0800, Auke Kok wrote:
> >I'm seeing some odd behavior using the e100 driver for an intel ethernet
> >controller 82557/8/9 (revv 10).  It appears as if the e100 driver is
> >handling interrupts generated by another device, though I am not certain
> >of this..
> >
> >Using some printks, I see some odd packets received that are eventually
> >dropped somewhere up the stack.  The packets usually look something like
> >this:
> >
> >SrcAddr: 8.0.69.0 (bogus source ip)
> >DstAddr: 0.40.226.169  (bogus dest ip)
> >Protocol: 6
> >InputInt: 2
> >SrcPort: 20
> >DstPort: 8793
> >
> >The src address and dest. address are entirely bogus, the protocol is not
> >always TCP, but I've seen it be icmp or udp.  In addition, I see _nothing_
> >using tcpdump, which I also do not understand as I didn't think packets
> >were dropped before tcpdump.  I've seen this behavior on multiple machines
> >using the same hardware, but haven't been able to make much sense of it. 
> >These packets do not seem to affect the normal operation of the device,
> >i.e. it services correct ips/ports just as one would expect.
> >
> >B/c I haven't been able to see the packets using tcpdump, I have been
> >thinking that the packets were generated by the box itself.  The packets
> >appear to be constantly arriving, though it does not appear as if a packet
> >with the same src ip/dst ip arrives more than once, though I could be
> >wrong about this.
> >
> >From dmesg I see that the e100 is sharing irq 12.
> >
> >e100: Intel(R) PRO/100 Network Driver, 3.4.8-k2-NAPI
> >e100: Copyright(c) 1999-2005 Intel Corporation
> >PCI: Found IRQ 12 for device :01:04.0
> >PCI: Sharing IRQ 12 with :00:02.0
> >PCI: Sharing IRQ 12 with :00:1d.0
> >divert: allocating divert_blk for eth0
> >e100: eth0: e100_probe: addr 0xe8083000, irq 12, MAC addr 00:0E:B6:26:95:05
> >(This is the only other message I see mentioning irq 12)
> 
> what does /proc/interrupts say after the box is fully booted?

   CPU0   
  0: 1236112960  XT-PIC  timer
  2:  0  XT-PIC  cascade
  4: 144338  XT-PIC  serial
  5:2109514  XT-PIC  primary
  8:  0  XT-PIC  rtc
  9:  0  XT-PIC  ehci_hcd
 10:   55010907  XT-PIC  lan0_0
 12:   57079668  XT-PIC  wan0_0
 14:   28456949  XT-PIC  ide0
NMI:  0 
ERR:  0


> >Notice that 0 errors are reported..  How could this be?
> 
> use ethtool -S eth1 to get more information on errors etc.
> 
> It's unlikely that an irq problem shows up in the ifconfig error stats. 
> Those are completely different counters that don't interact.

NIC statistics:
 rx_packets: 25896640
 tx_packets: 33721440
 rx_bytes: 391691733
 tx_bytes: 2804738076
 rx_errors: 0
 tx_errors: 0
 rx_dropped: 0
 tx_dropped: 0
 multicast: 0
 collisions: 0
 rx_length_errors: 0
 rx_over_errors: 0
 rx_crc_errors: 0
 rx_frame_errors: 0
 rx_fifo_errors: 0
 rx_missed_errors: 0
 tx_aborted_errors: 0
 tx_carrier_errors: 0
 tx_fifo_errors: 0
 tx_heartbeat_errors: 0
 tx_window_errors: 0
 tx_deferred: 0
 tx_single_collisions: 0
 tx_multi_collisions: 0
 tx_flow_control_pause: 40967
 rx_flow_control_pause: 0
 rx_flow_control_unsupported: 0
 tx_tco_packets: 0
 rx_tco_packets: 0


Unfortunately I do not currently have a machine in the lab on which I can
reproduce this problem, so this data toook a little while to get.  I did
find a box with the same version of the NIC, but the weird packets did not
appear on there at all.  I have seen this problem personally, but that box
seems to have disappeared.

Is there any reason these packets would not be reported in the above
statistics?  And if an interrupt sharing error is out, where could these
packets be coming from?

> can you try with the latest e100 driver from e1000.sf.net ? I don't think 
> it solves it but it might help to try (doesn't hurt).

I'll fly the NIC home and test the new driver if worse comes to worse.. :(


Thanks for your input,
Shaw
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


e100: inappropriate handling of shared interrupt ?

2006-11-27 Thread Shaw Vrana
Hello All,

I'm seeing some odd behavior using the e100 driver for an intel ethernet
controller 82557/8/9 (revv 10).  It appears as if the e100 driver is
handling interrupts generated by another device, though I am not certain
of this..

Using some printks, I see some odd packets received that are eventually
dropped somewhere up the stack.  The packets usually look something like
this:

SrcAddr: 8.0.69.0 (bogus source ip)
DstAddr: 0.40.226.169  (bogus dest ip)
Protocol: 6
InputInt: 2
SrcPort: 20
DstPort: 8793

The src address and dest. address are entirely bogus, the protocol is not
always TCP, but I've seen it be icmp or udp.  In addition, I see _nothing_
using tcpdump, which I also do not understand as I didn't think packets
were dropped before tcpdump.  I've seen this behavior on multiple machines
using the same hardware, but haven't been able to make much sense of it. 
These packets do not seem to affect the normal operation of the device,
i.e. it services correct ips/ports just as one would expect.

B/c I haven't been able to see the packets using tcpdump, I have been
thinking that the packets were generated by the box itself.  The packets
appear to be constantly arriving, though it does not appear as if a packet
with the same src ip/dst ip arrives more than once, though I could be
wrong about this.

>From dmesg I see that the e100 is sharing irq 12.

e100: Intel(R) PRO/100 Network Driver, 3.4.8-k2-NAPI
e100: Copyright(c) 1999-2005 Intel Corporation
PCI: Found IRQ 12 for device :01:04.0
PCI: Sharing IRQ 12 with :00:02.0
PCI: Sharing IRQ 12 with :00:1d.0
divert: allocating divert_blk for eth0
e100: eth0: e100_probe: addr 0xe8083000, irq 12, MAC addr 00:0E:B6:26:95:05
(This is the only other message I see mentioning irq 12)
serio: i8042 AUX port at 0x60,0x64 irq 12

(output of ethtool -e)
Offset  Values
--  --
0x  00 0e b6 26 95 05 1b 0d ff ff 01 02 01 47 ff ff
0x0010  ff ff ff ff 00 5f 70 00 86 80 7f 00 ff ff ff ff
0x0020  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0030  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0040  ff ff ff ff ff ff 29 12 ff ff ff ff ff ff ff ff
0x0050  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0060  2c 01 00 40 06 41 ff ff ff ff ff ff ff ff ff ff
0x0070  ff ff ff ff ff ff ff ff ff ff ff ff ff ff b3 b5

eth1Link encap:Ethernet  HWaddr 00:0E:B6:26:95:05
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:3959305 errors:0 dropped:0 overruns:0 frame:0
TX packets:5337629 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:801040171 (763.9 MiB)  TX bytes:797939498 (760.9 MiB)
Interrupt:12 Base address:0xd000 Memory:e8083000-e8084000


Notice that 0 errors are reported..  How could this be?

ethtool eth1
Supported ports: [ TP MII ]
Supported link modes:   10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
Supports auto-negotiation: Yes
Advertised link modes:  10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
Advertised auto-negotiation: No
Speed: 100Mb/s
Duplex: Full
Port: MII
PHYAD: 1
Transceiver: internal
Auto-negotiation: off
Supports Wake-on: g
Wake-on: d
Current message level: 0x0007 (7)
Link detected: yes


Any ideas? or debugging info greatly appreciated.


Thanks,
Shaw

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: watchdog timeout panic in e1000 driver

2006-10-30 Thread Shaw Vrana
On Mon, Oct 30, 2006 at 09:30:24AM -0800, Auke Kok wrote:
> >Even if the total lock time can be reduced, it's possible that interrupt
> >handler is executed while the interrupted code is still holding the 
> >semaphore.
> >I think your method only decrease the frequency of this problem.
> >Why does reducing the lock time solve this problem?
> 
> there are several problems here that need addressing. It's not acceptable 
> for our driver to wait up to 15 seconds, and we can (presumably) reduce it 
> to milliseconds, so that would help a lot. We should in no case at all hold 
> it for any period longer than (give or take) half a second, so working 
> towards that is a very good step in the right direction.
> 
> Adding the timer task back may also help, as we are no longer trying to 
> aqcuire the sw_fw_semaphore in interrupt context, but we removed it for a 
> reason, and I need to dig up what reason this exactly was before we can 
> revert it. Jesse might know, so I'll talk to him. But this will not fix the 
> fact that the semaphore is held for a long time :)


Timer tasks that reschedule themselves are a pain.  The watchdog timer task
had a couple of race conditions that were thought to be better fixed by
removing it all together.  Please, let's not go down that road again!

Check out what you have to say about it, Auke. ;)
http://www.spinics.net/lists/netdev/msg03656.html


Shaw
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: e1000_xmit_frame and e1000_down racing with next_to_use?

2006-09-06 Thread Shaw Vrana
> On Wed, 6 Sep 2006 10:58:15 -0700 (PDT)
> [EMAIL PROTECTED] wrote:
>
>> Hello All,
>>
>> I have a question about the use of the tx_ring->next_to_use variable in
>> the e1000.  Specifically, I'm wondering about a race between the use of
>> next_to_use in e1000_xmit_frame and the clearing of next_to_use in
>> e1000_down via e1000_clean_tx_ring.
>>
>> Thread 1 (_xmit) ->  first = adapter->tx_ring.next_to_use;
>>  e1000_tx_map();
>> Thread 2 (_down) ->  e1000_clean_tx_ring();
>>  tx_ring->next_to_use = 0;
>> Thread 1 (_xmit) -> e1000_tx_queue();
>>
>> It seems that tx_ring.next_to_use could change between the time the
>> skbuff
>> is mapped in e1000_tx_map and the time it is reported to the hardware in
>> e1000_tx_queue.  While I don't see any memory leaks or possible oops, it
>> does seem possible that that an skbuff could be "lost" in the ring as it
>> will not be queued in the subsequent e1000_queue.
>>
>> If the race is possible, perhaps this could be the culprit behind the tx
>> timeouts we've seen reported in this list?  The watchdog will eventually
>> find the "lost" skbuff and mistakenly think that the hardware transmit
>> has
>> hung and stop the queue.
>>
>> Could one of you plese tell me how this race is avoided, if indeed it
>> is?
>>
>> Thanks,
>> Shaw
>>
>
> e1000_down calls netif_stop_queue() and that stops transmit requests.
> It doesn't handle the case of a transmit in flight during the e1000_down.
>
> Shouldn't clean_tx_ring acquire tx_ring->tx_lock to avoid that?

Hi Stephen,

Yes, holding the adapter->tx_lock is all that is needed. e1000_irq_disable
has been called prior to e1000_clean_tx_ring or the interrupt has never
been enabled (e1000_open), so a simple spin_lock should suffice. I've
included a patch against Garzik's netdev git tree.

Thanks,
Shaw


Protect against the race to modify tx_ring->next_to_use in the case of a
transmit in flight during e1000_down.

Signed-off-by: Shaw Vrana <[EMAIL PROTECTED]>
---

 drivers/net/e1000/e1000_main.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 726f43d..b327976 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -1937,6 +1937,8 @@ e1000_clean_tx_ring(struct e1000_adapter
unsigned long size;
unsigned int i;

+   spin_lock(&tx_ring->tx_lock);
+
/* Free all the Tx ring sk_buffs */

for (i = 0; i < tx_ring->count; i++) {
@@ -1957,6 +1959,8 @@ e1000_clean_tx_ring(struct e1000_adapter

writel(0, adapter->hw.hw_addr + tx_ring->tdh);
writel(0, adapter->hw.hw_addr + tx_ring->tdt);
+
+   spin_unlock(&tx_ring->tx_lock);
 }

 /**

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: e1000_down and tx_timeout worker race cleaning the transmit buffers

2006-04-29 Thread Shaw Vrana

Hi Auke,

On 4/26/06, Auke Kok <[EMAIL PROTECTED]> wrote:


> I'm concerned about the addition of the netif_running check to
> e1000_down.  While something like this is needed, I'm not familiar
> enough w/ the code to know if this is okay.
> All explanations and comments are greatly appreciated.
While I appreciate patches ;^) I think we're on a better path by making these
cleanups, and actually reducing the code in large places. I hope to be able to
push something out for RFC soon. Added benefit will be that we're dropping a
whole bunch of irq operations where we didn't need to (soft resets).


Well, it looks like my patch won't work as the e1000_close is called
by dev.c only after it clears the __LINK_STATE_START bit, which means
that e1000_down will exit prematurely when called from e1000_close in
my approach.  Any feedback on the approach would be appreciated, as
your upcoming patch sounds like it might be too aggressive to get put
into a stabilization patch.  ;)

I understand the need to fix the problems associated with the
watchdog_task as well, though I wonder if it wouldn't be better to
remove it altogether given the complexity of cleaning up after these
tasks in general.  I've personally had more problems with the watchdog
task than with the possible sleep in the watchdog timer code.  I can't
help but siding with the RedHat folks who currently ship a version of
the e1000 driver that fixes the mechanism used to sleep instead of the
watchdog_task approach.  Perhaps I missed the discussion of this, I'm
only finding the patch itself with google.

Thanks,
Shaw
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: e1000_down and tx_timeout worker race cleaning the transmit buffers

2006-04-20 Thread Shaw Vrana
On Thursday 20 April 2006 17:10, Michael Chan wrote:
> In tg3_remove_one(), we call flush_scheduled_work() in case the
> reset_task is still pending. Here, it is safe to call
> flush_scheduled_work() because we're not holding the rtnl. Again, when
> it runs, nothing bad will happen because it will see netif_running() ==
> 0.

I'll bite!  Here's a patch to add a call to flush_scheduled_work() in 
e1000_down.  It's against 2.6.16.9.

Shaw
diff -u -uprN -X linux-2.6.16.9/Documentation/dontdiff linux-2.6.16.9/drivers/net/e1000/e1000_main.c linux-2.6.16.9-patch/drivers/net/e1000/e1000_main.c
--- linux-2.6.16.9/drivers/net/e1000/e1000_main.c	2006-04-18 23:10:14.0 -0700
+++ linux-2.6.16.9-patch/drivers/net/e1000/e1000_main.c	2006-04-20 19:36:55.0 -0700
@@ -538,6 +538,7 @@ e1000_down(struct e1000_adapter *adapter
 	del_timer_sync(&adapter->tx_fifo_stall_timer);
 	del_timer_sync(&adapter->watchdog_timer);
 	del_timer_sync(&adapter->phy_info_timer);
+	flush_scheduled_work();	
 
 #ifdef CONFIG_E1000_NAPI
 	netif_poll_disable(netdev);