Re: [E1000-devel] 2.6.25.5, e1000e and intel SR2520 server - IPMI is disconnected when e1000e module is loaded

2008-06-17 Thread Brandeburg, Jesse
Arkadiusz Miskiewicz wrote:
> Hello,

hi, sorry to hear about your problem.
 
> I have a problem with e1000e driver on SR2520 intel platform, running
> 2.6.25.5 kernel.
> 
> When the e1000e driver is loaded, the IPMI connection to the machine is
> lost. This machine has two Ethernet ports that carry both normal traffic
> and IPMI traffic.
> 
> How to reproduce: configure IPMI on such a machine (assign it an IP), run
> mtr against that IP, then load e1000e on the host kernel.

what BMC version are you running?  do you have the latest bios update?
 
> mtr will show packet loss (if the IPMI SOL console is in use, the
> connection is dropped after a few failed ipmi-pings), and after a few
> seconds the IPMI connection comes back to life.
> 
> [4.136408] e1000e: Intel(R) PRO/1000 Network Driver - 0.2.0
> [4.136408] e1000e: Copyright (c) 1999-2007 Intel Corporation.
> [4.136408] ACPI: PCI Interrupt :07:00.0[A] -> GSI 18 (level, low) -> IRQ 18
> [4.136408] PCI: Setting latency timer of device :07:00.0 to 64
> [4.215667] eth0: (PCI Express:2.5GB/s:Width x4) 00:15:17:0c:d2:2c
> [4.215669] eth0: Intel(R) PRO/1000 Network Connection
> [4.215746] eth0: MAC: 3, PHY: 5, PBA No: 301000-000
> [4.216398] ACPI: PCI Interrupt :07:00.1[B] -> GSI 19 (level, low) -> IRQ 19
> [4.216488] PCI: Setting latency timer of device :07:00.1 to 64
> [4.280928] eth1: (PCI Express:2.5GB/s:Width x4) 00:15:17:0c:d2:2d
> [4.280931] eth1: Intel(R) PRO/1000 Network Connection
> [4.281008] eth1: MAC: 3, PHY: 5, PBA No: 301000-000
> [4.141027] Intel(R) PRO/1000 Network Driver - version 7.3.20-k2-NAPI
> [4.141029] Copyright (c) 1999-2006 Intel Corporation.

first, can you try e1000e from sourceforge.net/projects/e1000?  I want
to make sure we haven't fixed the issue already.  This sounds similar
to issues that we have fixed before.

second, can you please download, compile, and run ethregs from the same
sourceforge site, once before loading the driver (when ipmi works) and
once after?

let me know if you have trouble with ethregs or the e1000e driver.

jesse

-
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel


Re: [E1000-devel] e1000e not returning proper ETHTOOLS link up status?

2008-06-17 Thread Ronciak, John
No, it doesn't work like that.  We have lots of internal versions that
are used to introduce new code, fixes and possible new HW enablement.
When we get ready for a release, we branch the code and basically
freeze it; only bug fixes for the release are allowed.  The version is
bumped each time something in the code changes.  So we don't know what
the version will be at this time.


Cheers,
John
---
"Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety.", Benjamin
Franklin 1755 

-Original Message-
From: John DeFranco [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 17, 2008 8:55 AM
To: Ronciak, John
Cc: Allan, Bruce W; e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] e1000e not returning proper ETHTOOLS link
up status?

Hopefully one last question, would that be version 0.2.9.7? (assuming 
the one you are ready to release is 0.2.9.6).

Thanks!
-jd


Re: [E1000-devel] Issue with igb/82575 and Xen

2008-06-17 Thread Williams, Mitch A
This is a known issue, which will be fixed in the upcoming release.
You should be able to work around the issue by turning off TX
checksumming:
  $ ethtool -K  tx off

-Mitch

>-Original Message-
>From: [EMAIL PROTECTED] 
>[mailto:[EMAIL PROTECTED] On Behalf 
>Of David Parsley
[snip]
>I recently purchased a Dell 6950 w/ an Intel 82575 4-port add-in card,
>and installed RHEL5.2 on it.  I found that when a Xen VM is bridged to
>one of the Intel ports, networking for domU is somehow subtly broken.
>For instance, I can ping the VM, but if I try to ssh to it, the
>three-way handshake completes, but then I start getting icmp
>unreachable from the VM.  Bridging to the onboard broadcom interface
>on the same VLAN, the VM works fine.
>
>Using the latest stable 1.2.24 igb driver gives the same results.
>I've bugzilla'd this issue with RedHat:
>https://bugzilla.redhat.com/show_bug.cgi?id=451787



Re: [E1000-devel] e1000e not returning proper ETHTOOLS link up status?

2008-06-17 Thread John DeFranco
Hopefully one last question, would that be version 0.2.9.7? (assuming 
the one you are ready to release is 0.2.9.6).

Thanks!
-jd

Ronciak, John wrote:
> We are guessing the week of Aug. 15th.
>
>
> Cheers,
> John
> ---
> "Those who would give up essential Liberty, to purchase a little
> temporary Safety, deserve neither Liberty nor Safety.", Benjamin
> Franklin 1755
>
> -Original Message-
> From: John DeFranco [mailto:[EMAIL PROTECTED]
> Sent: Monday, June 16, 2008 9:43 AM
> To: Allan, Bruce W
> Cc: Ronciak, John; e1000-devel@lists.sourceforge.net
> Subject: Re: [E1000-devel] e1000e not returning proper ETHTOOLS link
> up status?
>
> Thanks. You said that this fix did not make it into the current version
> that is in process of being released. Any timeframe for when the next
> release would be?
>
> Thanks
> -jd
>
> Allan, Bruce W wrote:
>
>> Not a complete replacement.  Support for existing Intel PCIe GbE devices
>> was moved from e1000 to e1000e.  PCI and PCIx devices will continue to
>> be supported by e1000, and e1000e will continue to grow with support for
>> new PCIe parts.
>>
>>> -Original Message-
>>> From: John DeFranco [mailto:[EMAIL PROTECTED]
>>> Sent: Monday, June 16, 2008 9:04 AM
>>> To: Allan, Bruce W
>>> Cc: Ronciak, John; e1000-devel@lists.sourceforge.net
>>> Subject: Re: [E1000-devel] e1000e not returning proper ETHTOOLS link
>>> up status?
>>>
>>> Thanks everyone for your responses. This has been very helpful. I do
>>> have one additional question, is the e1000e driver meant as a complete
>>> replacement for the e1000 driver or just for a specific chipset/card?
>>>
>>> Thanks!
>>> -jd
>>>
>>> Allan, Bruce W wrote:
>>>
>>>> Unfortunately, that fix did not get into the version that will be
>>>> released shortly which is currently locked down.  It will most
>>>> definitely be in the following release posted to SourceForge, and I will
>>>> be working with Jeff K. to push this and other fixes upstream to
>>>> kernel.org.
>>>>
>>>>> -Original Message-
>>>>> From: Ronciak, John
>>>>> Sent: Friday, June 13, 2008 10:20 AM
>>>>> To: John DeFranco; Allan, Bruce W
>>>>> Cc: e1000-devel@lists.sourceforge.net
>>>>> Subject: RE: [E1000-devel] e1000e not returning proper ETHTOOLS link
>>>>> up status?
>>>>>
>>>>> It may be in the latest but we'll be posting a new version that absolutely
>>>>> have the fix in  about 2 weeks.
>>>>>
>>>>> Cheers,
>>>>> John
>>>>> ---
>>>>> "Those who would give up essential Liberty, to purchase a little temporary
>>>>> Safety, deserve neither Liberty nor Safety.", Benjamin Franklin 1755
>>>>>
>>>>> -Original Message-
>>>>> From: [EMAIL PROTECTED] [mailto:e1000-devel-[EMAIL PROTECTED] On Behalf Of John DeFranco
>>>>> Sent: Friday, June 13, 2008 8:57 AM
>>>>> To: Allan, Bruce W
>>>>> Cc: e1000-devel@lists.sourceforge.net
>>>>> Subject: Re: [E1000-devel] e1000e not returning proper ETHTOOLS link
>>>>> up status?
>>>>>
>>>>> This is great news, thanks. Do you have any idea what version this is
>>>>> fixed in?
>>>>>
>>>>> Allan, Bruce W wrote:
>>>>>
>>>>>> Yup, it's known and already fixed in-house.
>>>>>>
>>>>>> Essentially, the return from e1000_get_link() should be something like:
>>>>>>
>>>>>> return ((status & E1000_STATUS_LU) ? 1 : 0);
>>>>>>
>>>>>>> -Original Message-
>>>>>>> From: [EMAIL PROTECTED] [mailto:e1000-devel-[EMAIL PROTECTED] On Behalf Of John DeFranco
>>>>>>> Sent: Thursday, June 12, 2008 4:37 PM
>>>>>>> To: e1000-devel@lists.sourceforge.net
>>>>>>> Subject: [E1000-devel] e1000e not returning proper ETHTOOLS link up status?
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'm seeing what I consider a problem with getting link status via
>>>>>>> SIOCETHTOOL and the e1000e driver. According to all the data I have and
>>>>>>> based on how the e1000/e100/tg3 and any broadcom driver works if I issue
>>>>>>> something like the following:
>>>>>>>
>>>>>>>   edata.cmd = ETHTOOL_GLINK;
>>>>>>>   ifr.ifr_data = (caddr_t)&edata;
>>>>>>>
>>>>>>>   if (ioctl(mii_socket, SIOCETHTOOL, &ifr) != 0 ){
>>> 

[E1000-devel] Issue with igb/82575 and Xen

2008-06-17 Thread David Parsley
Hi all,

I recently purchased a Dell 6950 w/ an Intel 82575 4-port add-in card,
and installed RHEL5.2 on it.  I found that when a Xen VM is bridged to
one of the Intel ports, networking for domU is somehow subtly broken.
For instance, I can ping the VM, but if I try to ssh to it, the
three-way handshake completes, but then I start getting icmp
unreachable from the VM.  Bridging to the onboard broadcom interface
on the same VLAN, the VM works fine.

Using the latest stable 1.2.24 igb driver gives the same results.
I've bugzilla'd this issue with RedHat:
https://bugzilla.redhat.com/show_bug.cgi?id=451787

Regards,
David
-- 
David L. Parsley
Manager of Network Services, Bridgewater College
"If I have seen further, it is by standing on ye shoulders of giants"
- Isaac Newton



Re: [E1000-devel] [TCP]: TCP_DEFER_ACCEPT causes leak sockets

2008-06-17 Thread Ingo Molnar

* David Miller <[EMAIL PROTECTED]> wrote:

> From: Ingo Molnar <[EMAIL PROTECTED]>
> Date: Tue, 17 Jun 2008 11:27:06 +0200
> 
> > when i originally reported it i debugged it back to missing e1000 TX 
> > completion IRQs. I tried various versions of the driver to figure 
> > out whether new workarounds for e1000 cover it but it was fruitless. 
> > There is a 1000 msec internal watchdog timer IRQ within e1000 that 
> > gets things going if it's stuck.
> 
> Then that explains your latency, the chip is getting stuck and TX 
> interrupts stop, right.

note that the 1000 msecs timer is AFAIK internal to the e1000 
_hardware_, not the driver itself. I.e. probably the firmware detects 
and works around a hung transmitter. This is not detectable from the OS 
(it's not an OS timer), but it can be observed by a lot of testing on a 
totally quiescent system - which i did back then ;-)

i also played a lot with the various knobs of the e1000, none of which 
seemed to help.

/me digs in archives

i reported it to the e1000 folks in 2006:

  Date: Mon, 4 Dec 2006 11:24:00 +0100

against 2.6.19. The original report is below - with a trace and various 
things i tried to debug this.

i eventually got the suggestion from Auke to set RxIntDelay=8 which 
seemed to work around the issue - but since i use a built-in driver i 
don't have that setting here (RxIntDelay=8 is a module load parameter and 
not exposed via Kconfig methods) and the e1000 driver does not seem to 
have changed its default setting for RxIntDelay.

2.6.18-1.2849.fc6 was the last kernel that worked fine.

Ingo

>
Date: Wed, 13 Dec 2006 22:09:22 +0100
From: Ingo Molnar <[EMAIL PROTECTED]>
To: Auke Kok <[EMAIL PROTECTED]>
Subject: Re: e1000: 2.6.19 & long packet latencies
Cc: Jesse Brandeburg <[EMAIL PROTECTED]>,
"Ronciak, John" <[EMAIL PROTECTED]>

Jesse, et al.,

i'm having a weird packet processing latency problem with the e1000 
driver and recent kernels.

The symptom is this: if i connect to a T60 laptop (which has an on-board 
e1000) from the outside, i see large delays in network activity, and ssh 
sessions are very sluggish.

ping latencies show it best under a dynticks kernel (but vanilla 2.6.19 
is affected too):

 titan:~/linux/linux> ping e
 PING europe (10.0.1.15) 56(84) bytes of data.
 64 bytes from europe (10.0.1.15): icmp_seq=1 ttl=64 time=0.340 ms
 64 bytes from europe (10.0.1.15): icmp_seq=2 ttl=64 time=757 ms
 64 bytes from europe (10.0.1.15): icmp_seq=3 ttl=64 time=1001 ms
 64 bytes from europe (10.0.1.15): icmp_seq=4 ttl=64 time=1001 ms
 64 bytes from europe (10.0.1.15): icmp_seq=5 ttl=64 time=0.356 ms
 64 bytes from europe (10.0.1.15): icmp_seq=6 ttl=64 time=2127 ms
 64 bytes from europe (10.0.1.15): icmp_seq=7 ttl=64 time=1002 ms
 64 bytes from europe (10.0.1.15): icmp_seq=8 ttl=64 time=0.320 ms
 64 bytes from europe (10.0.1.15): icmp_seq=9 ttl=64 time=1002 ms
 64 bytes from europe (10.0.1.15): icmp_seq=10 ttl=64 time=2004 ms
 64 bytes from europe (10.0.1.15): icmp_seq=11 ttl=64 time=1002 ms
 64 bytes from europe (10.0.1.15): icmp_seq=12 ttl=64 time=0.303 ms
 64 bytes from europe (10.0.1.15): icmp_seq=13 ttl=64 time=1000 ms
 64 bytes from europe (10.0.1.15): icmp_seq=14 ttl=64 time=2010 ms
 64 bytes from europe (10.0.1.15): icmp_seq=15 ttl=64 time=1009 ms
 64 bytes from europe (10.0.1.15): icmp_seq=16 ttl=64 time=0.283 ms

i have traced this and the 1000/2000 msecs values come from some sort of 
e1000-internal 'heartbeat' interrupt. What seems to happen is that RX 
packet processing is delayed indefinitely and the IRQ just does not 
arrive.
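
One way to check a ping log for this pattern is to flag replies whose
round-trip time sits close to a whole multiple of the suspected period. The
sketch below is only an illustration: the function name and the 20 ms tolerance
used in the usage note are arbitrary assumptions, and times are kept in
microseconds to avoid floating point.

```c
/*
 * Returns 1 if rtt_us lies within tol_us of a non-zero multiple of
 * period_us, 0 otherwise.  All values are in microseconds.
 */
static int near_period_multiple(long rtt_us, long period_us, long tol_us)
{
    long k = (rtt_us + period_us / 2) / period_us;  /* nearest multiple */
    long diff;

    if (k < 1)
        return 0;               /* sub-period reply: the normal fast path */

    diff = rtt_us - k * period_us;
    if (diff < 0)
        diff = -diff;
    return diff <= tol_us;
}
```

With a period of 1000000 us and a tolerance of 20000 us, the 1001 ms and
2004 ms replies in the log above are flagged, while the 0.340 ms, 757 ms and
2127 ms replies are not.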

NOTE: the vanilla 2.6.19 kernel shows this too, but the ping delays are 
1/HZ.

here's a (filtered) trace of such a delay. IRQ 0x219 is the e1000 
interrupt:

  -0 0D.h1 761236us : do_IRQ (c0272a9b 219 0)
 IRQ_219-356   0 761412us+: e1000_intr (handle_IRQ_event)
 IRQ_219-356   0 761416us : e1000_clean_rx_irq (e1000_intr)
 IRQ_219-356   0 761418us+: e1000_clean_tx_irq (e1000_intr)
  -0 0D.h1 2760093us : do_IRQ (c0272a9b 219 0)
 IRQ_219-356   0 2760268us+: e1000_intr (handle_IRQ_event)
 IRQ_219-356   0 2760273us : e1000_clean_rx_irq (e1000_intr)
 IRQ_219-356   0 2760275us : e1000_clean_tx_irq (e1000_intr)
  -0 0D.h1 3804499us : do_IRQ (c0272a9b 219 0)
 IRQ_219-356   0 3804674us+: e1000_intr (handle_IRQ_event)
 IRQ_219-356   0 3804679us+: e1000_clean_rx_irq (e1000_intr)
 IRQ_219-356   0 3804761us : e1000_clean_tx_irq (e1000_intr)
 IRQ_219-356   0 3804763us : e1000_clean_rx_irq (e1000_intr)
 IRQ_219-356   0 3804765us : e1000_clean_tx_irq (e1000_intr)
softirq--7 0 3804810us : net_rx_action (ksoftirqd)
softirq--5 0D.h. 3805425us : do_IRQ (c01598ac 219 0)
 IRQ_219-356   0 3805499us+: e1000_intr (handle_IRQ_event)
 IRQ_219-356   0 3805504us : e1000_clean_rx_irq (e1000_intr)
 IRQ_219-356   0 3805506us : e1000_clean_tx_irq (e1000_intr)
 IRQ_219-356   0 3805547us : e1000_clean_rx_irq (e1000_int

Re: [E1000-devel] [TCP]: TCP_DEFER_ACCEPT causes leak sockets

2008-06-17 Thread Vitaliy Gusev
On 17 June 2008 12:09:58 Ingo Molnar wrote:
> * David Miller <[EMAIL PROTECTED]> wrote:
> > From: Ingo Molnar <[EMAIL PROTECTED]>
> > Date: Tue, 17 Jun 2008 09:26:58 +0200
> >
> > > So since there's no clear bug pattern and no sure reproducibility on
> > > my side i'd suggest we track this problem separately and "do
> > > nothing" right now. I've excluded this warning from my 'is the
> > > freshly booted kernel buggy' list of conditions of -tip testing so
> > > it's not holding me up.
> >
> > I'm going to push the revert through just to be safe and I think it's
> > a good idea to do so because all of those defer accept changes should
> > be resubmitted as a group for 2.6.27
>
> okay - in that case the full revert is well-tested on my side as well,
> fwiw.
>
> Tested-by: Ingo Molnar <[EMAIL PROTECTED]>

The revert patch fixes the socket leak problem.
Tested-by: Vitaliy Gusev <[EMAIL PROTECTED]>

>
> > > and i can apply any test-patch if that would be helpful - if it does
> > > a WARN_ON() i'll notice it. (pure extra debug printks with no stack
> > > trace are much harder to notice in automated tests)
> >
> > I don't have time to work on your bug, sorry.  Someone else will have
> > to step forward and help you with it.
>
> it's not really "my bug" - i just offered help to debug someone else's
> bug :-) This is pretty common hw so i guess there will be such reports.
>
> Let me describe what i'm doing exactly: i do a lot of randomized testing
> on about a dozen real systems (all across the x86 spectrum) so i tend to
> trigger a lot of mainline bugs pretty early on.
>
> My collection of kernel bugs for the last 8 months shows 1285 bugs
> (kernel crashes or build failures - about 50%/50%) triggered. One
> test-system alone has a serial log of 15 gigabytes - and there's a dozen
> of them. That's about 5 kernel bugs a day handled by me, on average.
>
> These systems have about 10 times the hardware variability of your
> Niagara system for example, and many of them are rather difficult to
> debug (laptops without serial port, etc.). So i physically cannot avoid
> and debug all bugs on all my test-systems, like you do on the Niagara. I
> will report bugs, i'll bisect anything that is bisectable (on average i
> bisect once a day), and i can add patches and report any test-results,
> and i'll of course debug any bugs that look like heavy mainline
> showstoppers.
>
> > FWIW I don't think your TX timeout problem has anything to do with
> > packet ordering.  The TX element of the network device is totally
> > stateless, but it's hanging under some set of circumstances to the
> > point where we timeout and reset the hardware to get it going again.
>
> ok. That's e1000 then. Cc:s added. Stock T60 laptop, 32-bit:
>
> 02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
> Subsystem: Lenovo ThinkPad T60
> Flags: bus master, fast devsel, latency 0, IRQ 16
> Memory at ee00 (32-bit, non-prefetchable) [size=128K]
> I/O ports at 2000 [size=32]
> Capabilities: 
> Kernel driver in use: e1000
>
> the problem is this non-fatal warning showing up after bootup,
> sporadically, in a non-reproducible way:
>
> [  173.354049] NETDEV WATCHDOG: eth0: transmit timed out
> [  173.354148] [ cut here ]
> [  173.354221] WARNING: at net/sched/sch_generic.c:222 dev_watchdog+0x9a/0xec()
> [  173.354298] Modules linked in:
> [  173.354421] Pid: 13452, comm: cc1 Tainted: GW 2.6.26-rc6-00273-g81ae43a-dirty #2573
> [  173.354516]  [] warn_on_slowpath+0x46/0x76
> [  173.354641]  [] ? try_to_wake_up+0x1d6/0x1e0
> [  173.354815]  [] ? trace_hardirqs_off+0xb/0xd
> [  173.357370]  [] ? default_wake_function+0xb/0xd
> [  173.357370]  [] ? trace_hardirqs_off_caller+0x15/0xc9
> [  173.357370]  [] ? trace_hardirqs_off+0xb/0xd
> [  173.357370]  [] ? trace_hardirqs_on+0xb/0xd
> [  173.357370]  [] ? trace_hardirqs_on_caller+0x16/0x15b
> [  173.357370]  [] ? trace_hardirqs_on+0xb/0xd
> [  173.357370]  [] ? _spin_unlock_irqrestore+0x5b/0x71
> [  173.357370]  [] ? __queue_work+0x2d/0x32
> [  173.357370]  [] ? queue_work+0x50/0x72
> [  173.357483]  [] ? schedule_work+0x14/0x16
> [  173.357654]  [] dev_watchdog+0x9a/0xec
> [  173.357783]  [] run_timer_softirq+0x13d/0x19d
> [  173.357905]  [] ? dev_watchdog+0x0/0xec
> [  173.358073]  [] ? dev_watchdog+0x0/0xec
> [  173.360804]  [] __do_softirq+0xb2/0x15c
> [  173.360804]  [] ? __do_softirq+0x0/0x15c
> [  173.360804]  [] do_softirq+0x84/0xe9
> [  173.360804]  [] irq_exit+0x4b/0x88
> [  173.360804]  [] smp_apic_timer_interrupt+0x73/0x81
> [  173.360804]  [] apic_timer_interrupt+0x2d/0x34
> [  173.360804]  ===
> [  173.360804] ---[ end trace a7919e7f17c0a725 ]---
>
> full report can be found at:
>
>http://lkml.org/lkml/2008/6/13/224
>
> i have 3 other test-systems with e1000 (with a similar CPU) which are
> _not_ showing this symptom, so this could be some model-specific e1000
> issue.
>
>

Re: [E1000-devel] [TCP]: TCP_DEFER_ACCEPT causes leak sockets

2008-06-17 Thread David Miller
From: Ingo Molnar <[EMAIL PROTECTED]>
Date: Tue, 17 Jun 2008 11:27:06 +0200

> when i originally reported it i debugged it back to missing e1000 TX 
> completion IRQs. I tried various versions of the driver to figure out 
> whether new workarounds for e1000 cover it but it was fruitless. There 
> is a 1000 msec internal watchdog timer IRQ within e1000 that gets things 
> going if it's stuck.

Then that explains your latency, the chip is getting stuck and
TX interrupts stop, right.

> But the line sch_generic.c:222 problem is new. It could be an 
> escalation of this same problem - not even the hw-internal watchdog 
> timeout fixing up things? So basically two levels of completion failed, 
> the third fallback level (a hard reset of the interface) helped things 
> get going. High score from me for networking layer robustness :-)

I think it is an escalation of the same problem.  My first thought
is that there must have been some change to the reset logic and it
isn't as foolproof as it used to be, especially under load.



Re: [E1000-devel] [TCP]: TCP_DEFER_ACCEPT causes leak sockets

2008-06-17 Thread Ingo Molnar

* David Miller <[EMAIL PROTECTED]> wrote:

> From: Ingo Molnar <[EMAIL PROTECTED]>
> Date: Tue, 17 Jun 2008 10:32:20 +0200
> 
> > those up to 1000 msec delays can be 'felt' via ssh too, if this 
> > problem triggers then the system is almost unusable via the network. 
> > Local latencies are perfect so it's an e1000 problem.
> 
> Or some kind of weird interrupt problem.
> 
> Such an interrupt level bug would also account for the TX timeouts 
> you're seeing btw.

when i originally reported it i debugged it back to missing e1000 TX 
completion IRQs. I tried various versions of the driver to figure out 
whether new workarounds for e1000 cover it but it was fruitless. There 
is a 1000 msec internal watchdog timer IRQ within e1000 that gets things 
going if it's stuck.

But the line sch_generic.c:222 problem is new. It could be an 
escalation of this same problem - not even the hw-internal watchdog 
timeout fixing up things? So basically two levels of completion failed, 
the third fallback level (a hard reset of the interface) helped things 
get going. High score from me for networking layer robustness :-)

Ingo
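
The third fallback level mentioned above is the generic netdev TX watchdog, the
source of the "NETDEV WATCHDOG: eth0: transmit timed out" message. Its core
condition can be modelled roughly as below. This is a simplified user-space
sketch and not the kernel code: the struct, its field names and the
millisecond time base are assumptions made for illustration.

```c
#include <stdbool.h>

/* Simplified model of the state the TX watchdog looks at. */
struct tx_queue_model {
    unsigned long last_tx_ms;   /* when the last transmit was started */
    unsigned long timeout_ms;   /* how long a stall is tolerated */
    bool stopped;               /* queue is stopped, waiting for completions */
};

/*
 * The watchdog fires only if the queue is stopped AND nothing has been
 * transmitted for longer than the timeout.  A running queue never
 * triggers it, however long it has been idle.
 */
static bool tx_watchdog_expired(const struct tx_queue_model *q,
                                unsigned long now_ms)
{
    return q->stopped && (now_ms - q->last_tx_ms) > q->timeout_ms;
}
```

When this condition holds, the driver's timeout handler is called, which in
the case discussed here resets the interface.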



Re: [E1000-devel] [TCP]: TCP_DEFER_ACCEPT causes leak sockets

2008-06-17 Thread David Miller
From: Ingo Molnar <[EMAIL PROTECTED]>
Date: Tue, 17 Jun 2008 10:32:20 +0200

> those up to 1000 msec delays can be 'felt' via ssh too, if this problem 
> triggers then the system is almost unusable via the network. Local 
> latencies are perfect so it's an e1000 problem.

Or some kind of weird interrupt problem.

Such an interrupt level bug would also account for the TX timeouts
you're seeing btw.



Re: [E1000-devel] [TCP]: TCP_DEFER_ACCEPT causes leak sockets

2008-06-17 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> 
> > FWIW I don't think your TX timeout problem has anything to do with 
> > packet ordering.  The TX element of the network device is totally 
> > stateless, but it's hanging under some set of circumstances to the 
> > point where we timeout and reset the hardware to get it going again.
> 
> ok. That's e1000 then. Cc:s added. Stock T60 laptop, 32-bit:
> 
> 02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet 
> Controller
> Subsystem: Lenovo ThinkPad T60
> Flags: bus master, fast devsel, latency 0, IRQ 16
> Memory at ee00 (32-bit, non-prefetchable) [size=128K]
> I/O ports at 2000 [size=32]
> Capabilities: 
> Kernel driver in use: e1000
> 
> the problem is this non-fatal warning showing up after bootup, 
> sporadically, in a non-reproducible way:
> 
> [  173.354049] NETDEV WATCHDOG: eth0: transmit timed out
> [  173.354148] [ cut here ]
> [  173.354221] WARNING: at net/sched/sch_generic.c:222 
> dev_watchdog+0x9a/0xec()
> [  173.354298] Modules linked in:
> [  173.354421] Pid: 13452, comm: cc1 Tainted: GW 
> 2.6.26-rc6-00273-g81ae43a-dirty #2573
> [  173.354516]  [] warn_on_slowpath+0x46/0x76
> [  173.354641]  [] ? try_to_wake_up+0x1d6/0x1e0
> [  173.354815]  [] ? trace_hardirqs_off+0xb/0xd
> [  173.357370]  [] ? default_wake_function+0xb/0xd
> [  173.357370]  [] ? trace_hardirqs_off_caller+0x15/0xc9
> [  173.357370]  [] ? trace_hardirqs_off+0xb/0xd
> [  173.357370]  [] ? trace_hardirqs_on+0xb/0xd
> [  173.357370]  [] ? trace_hardirqs_on_caller+0x16/0x15b
> [  173.357370]  [] ? trace_hardirqs_on+0xb/0xd
> [  173.357370]  [] ? _spin_unlock_irqrestore+0x5b/0x71
> [  173.357370]  [] ? __queue_work+0x2d/0x32
> [  173.357370]  [] ? queue_work+0x50/0x72
> [  173.357483]  [] ? schedule_work+0x14/0x16
> [  173.357654]  [] dev_watchdog+0x9a/0xec
> [  173.357783]  [] run_timer_softirq+0x13d/0x19d
> [  173.357905]  [] ? dev_watchdog+0x0/0xec
> [  173.358073]  [] ? dev_watchdog+0x0/0xec
> [  173.360804]  [] __do_softirq+0xb2/0x15c
> [  173.360804]  [] ? __do_softirq+0x0/0x15c
> [  173.360804]  [] do_softirq+0x84/0xe9
> [  173.360804]  [] irq_exit+0x4b/0x88
> [  173.360804]  [] smp_apic_timer_interrupt+0x73/0x81
> [  173.360804]  [] apic_timer_interrupt+0x2d/0x34
> [  173.360804]  ===
> [  173.360804] ---[ end trace a7919e7f17c0a725 ]---
> 
> full report can be found at:
> 
>http://lkml.org/lkml/2008/6/13/224
> 
> i have 3 other test-systems with e1000 (with a similar CPU) which are 
> _not_ showing this symptom, so this could be some model-specific e1000 
> issue.

btw., this reminds me that this is the same system that has a serious 
e1000 network latency bug which i have reported more than a year ago, 
but which still does not appear to be fixed in latest mainline:

 PING europe (10.0.1.15) 56(84) bytes of data.
 64 bytes from europe (10.0.1.15): icmp_seq=1 ttl=64 time=1.51 ms
 64 bytes from europe (10.0.1.15): icmp_seq=2 ttl=64 time=404 ms
 64 bytes from europe (10.0.1.15): icmp_seq=3 ttl=64 time=487 ms
 64 bytes from europe (10.0.1.15): icmp_seq=4 ttl=64 time=296 ms
 64 bytes from europe (10.0.1.15): icmp_seq=5 ttl=64 time=305 ms
 64 bytes from europe (10.0.1.15): icmp_seq=6 ttl=64 time=1011 ms
 64 bytes from europe (10.0.1.15): icmp_seq=7 ttl=64 time=0.209 ms
 64 bytes from europe (10.0.1.15): icmp_seq=8 ttl=64 time=763 ms
 64 bytes from europe (10.0.1.15): icmp_seq=9 ttl=64 time=1000 ms
 64 bytes from europe (10.0.1.15): icmp_seq=10 ttl=64 time=0.438 ms
 64 bytes from europe (10.0.1.15): icmp_seq=11 ttl=64 time=1000 ms
 64 bytes from europe (10.0.1.15): icmp_seq=12 ttl=64 time=0.299 ms
 ^C
 --- europe ping statistics ---
 12 packets transmitted, 12 received, 0% packet loss, time 11085ms

those up to 1000 msec delays can be 'felt' via ssh too, if this problem 
triggers then the system is almost unusable via the network. Local 
latencies are perfect so it's an e1000 problem.

Ingo



Re: [E1000-devel] [TCP]: TCP_DEFER_ACCEPT causes leak sockets

2008-06-17 Thread Ingo Molnar

* David Miller <[EMAIL PROTECTED]> wrote:

> From: Ingo Molnar <[EMAIL PROTECTED]>
> Date: Tue, 17 Jun 2008 09:26:58 +0200
> 
> > So since there's no clear bug pattern and no sure reproducibility on 
> > my side i'd suggest we track this problem separately and "do 
> > nothing" right now. I've excluded this warning from my 'is the 
> > freshly booted kernel buggy' list of conditions of -tip testing so 
> > it's not holding me up.
> 
> I'm going to push the revert through just to be safe and I think it's 
> a good idea to do so because all of those defer accept changes should 
> be resubmitted as a group for 2.6.27

okay - in that case the full revert is well-tested on my side as well, 
fwiw.

Tested-by: Ingo Molnar <[EMAIL PROTECTED]>

> > and i can apply any test-patch if that would be helpful - if it does 
> > a WARN_ON() i'll notice it. (pure extra debug printks with no stack 
> > trace are much harder to notice in automated tests)
> 
> I don't have time to work on your bug, sorry.  Someone else will have 
> to step forward and help you with it.

it's not really "my bug" - i just offered help to debug someone else's 
bug :-) This is pretty common hw so i guess there will be such reports.

Let me describe what i'm doing exactly: i do a lot of randomized testing 
on about a dozen real systems (all across the x86 spectrum) so i tend to 
trigger a lot of mainline bugs pretty early on.

My collection of kernel bugs for the last 8 months shows 1285 bugs 
(kernel crashes or build failures - about 50%/50%) triggered. One 
test-system alone has a serial log of 15 gigabytes - and there's a dozen 
of them. That's about 5 kernel bugs a day handled by me, on average.

These systems have about 10 times the hardware variability of your 
Niagara system for example, and many of them are rather difficult to 
debug (laptops without serial port, etc.). So i physically cannot avoid 
and debug all bugs on all my test-systems, like you do on the Niagara. I 
will report bugs, i'll bisect anything that is bisectable (on average i 
bisect once a day), and i can add patches and report any test-results, 
and i'll of course debug any bugs that look like heavy mainline 
showstoppers.

> FWIW I don't think your TX timeout problem has anything to do with 
> packet ordering.  The TX element of the network device is totally 
> stateless, but it's hanging under some set of circumstances to the 
> point where we timeout and reset the hardware to get it going again.

ok. That's e1000 then. Cc:s added. Stock T60 laptop, 32-bit:

02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
Subsystem: Lenovo ThinkPad T60
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at ee00 (32-bit, non-prefetchable) [size=128K]
I/O ports at 2000 [size=32]
Capabilities: 
Kernel driver in use: e1000

the problem is this non-fatal warning showing up after bootup, 
sporadically, in a non-reproducible way:

[  173.354049] NETDEV WATCHDOG: eth0: transmit timed out
[  173.354148] ------------[ cut here ]------------
[  173.354221] WARNING: at net/sched/sch_generic.c:222 dev_watchdog+0x9a/0xec()
[  173.354298] Modules linked in:
[  173.354421] Pid: 13452, comm: cc1 Tainted: GW 2.6.26-rc6-00273-g81ae43a-dirty #2573
[  173.354516]  [] warn_on_slowpath+0x46/0x76
[  173.354641]  [] ? try_to_wake_up+0x1d6/0x1e0
[  173.354815]  [] ? trace_hardirqs_off+0xb/0xd
[  173.357370]  [] ? default_wake_function+0xb/0xd
[  173.357370]  [] ? trace_hardirqs_off_caller+0x15/0xc9
[  173.357370]  [] ? trace_hardirqs_off+0xb/0xd
[  173.357370]  [] ? trace_hardirqs_on+0xb/0xd
[  173.357370]  [] ? trace_hardirqs_on_caller+0x16/0x15b
[  173.357370]  [] ? trace_hardirqs_on+0xb/0xd
[  173.357370]  [] ? _spin_unlock_irqrestore+0x5b/0x71
[  173.357370]  [] ? __queue_work+0x2d/0x32
[  173.357370]  [] ? queue_work+0x50/0x72
[  173.357483]  [] ? schedule_work+0x14/0x16
[  173.357654]  [] dev_watchdog+0x9a/0xec
[  173.357783]  [] run_timer_softirq+0x13d/0x19d
[  173.357905]  [] ? dev_watchdog+0x0/0xec
[  173.358073]  [] ? dev_watchdog+0x0/0xec
[  173.360804]  [] __do_softirq+0xb2/0x15c
[  173.360804]  [] ? __do_softirq+0x0/0x15c
[  173.360804]  [] do_softirq+0x84/0xe9
[  173.360804]  [] irq_exit+0x4b/0x88
[  173.360804]  [] smp_apic_timer_interrupt+0x73/0x81
[  173.360804]  [] apic_timer_interrupt+0x2d/0x34
[  173.360804]  ===
[  173.360804] ---[ end trace a7919e7f17c0a725 ]---

full report can be found at:

   http://lkml.org/lkml/2008/6/13/224

i have 3 other test-systems with e1000 (with a similar CPU) which are 
_not_ showing this symptom, so this could be some model-specific e1000 
issue.
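
Since the warning shows up sporadically, a small log scanner can help spot it in long serial logs. The following is a minimal sketch, not part of the original report; the function name find_tx_timeouts is hypothetical, and the pattern only matches the message wording seen in the trace above (other kernel versions may word it differently).

```python
import re

# Matches lines such as:
# "[  173.354049] NETDEV WATCHDOG: eth0: transmit timed out"
WATCHDOG_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+NETDEV WATCHDOG: (\S+): transmit timed out"
)


def find_tx_timeouts(log_text):
    """Return (timestamp_seconds, interface) for each transmit timeout found."""
    events = []
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return events


# Two lines copied from the trace above.
SAMPLE_LOG = """\
[  173.354049] NETDEV WATCHDOG: eth0: transmit timed out
[  173.354298] Modules linked in:
"""

if __name__ == "__main__":
    for timestamp, iface in find_tx_timeouts(SAMPLE_LOG):
        print(f"{iface}: transmit timeout at {timestamp}s after boot")
```

Comparing the hit counts across the affected T60 and the three unaffected e1000 systems would be one way to back up the suspicion that the issue is model-specific.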

Ingo

___