Re: [ofa-general] Re: IPoIB forwarding

2007-05-01 Thread Rick Jones

Loic Prylli wrote:

On 4/30/2007 2:12 PM, Rick Jones wrote:



Speaking of defaults, it would seem that the external 1.2.0 driver 
comes with 9000 bytes as the default MTU?  At least I think that is 
what I am seeing now that I've started looking more closely.


rick jones




That's the same for the in-kernel-tree code (9K MTU by default). 
Assuming this is not wanted, I will submit a patch for that.


While I like what that does for performance, and at the risk of putting 
words into the mouths of netdev, I suspect that 1500 bytes is indeed the 
desired default.  It matches the IEEE specs; I've yet to see a switch 
that enables Jumbo Frames by default; and not everything out there even 
agrees that Jumbo Frames means a 9000-byte MTU.  I think that 1500 bytes 
for an Ethernet device remains in line with the principle of least 
surprise.


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: IPoIB forwarding

2007-04-30 Thread Rick Jones

What version of the myri10ge driver is this?  With the 1.2.0 version
that comes with the 2.6.20.7 kernel, there is no myri10ge_lro module
parameter.

[EMAIL PROTECTED] ~]# modinfo myri10ge | grep -i lro
[EMAIL PROTECTED] ~]# 


And I've been testing IP forwarding using two Myricom 10-GigE NICs
without setting any special modprobe parameters.



Ethtool -i on the interface reports 1.2.0 as the driver version.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: IPoIB forwarding

2007-04-30 Thread Rick Jones

David Miller wrote:

From: Rick Jones [EMAIL PROTECTED]
Date: Fri, 27 Apr 2007 16:48:00 -0700



No problem - just to play whatif/devil's advocate for a bit
though... is there any way to tie that in with the setting of
net.ipv4.ip_forward (and/or its IPv6 counterpart)?



Even ignoring that, consider the potential issues this
kind of problem could be causing netfilter.



OK, I'll show my ignorance and bite - what sort of issues with 
netfilter?  Is it tied to link-local MTUs?


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: IPoIB forwarding

2007-04-30 Thread Rick Jones
Only the 1.2.0 version of the external driver makes LRO incompatible 
with forwarding. The problem should be fixed in version 1.3.0, released a 
few weeks ago (forwarding with myri10ge_lro enabled should then work); 
let us know otherwise.


Anyway, following David Miller's remark about netfilter, for the next 
version we might ask the user to explicitly enable LRO rather than 
making it the default.


Speaking of defaults, it would seem that the external 1.2.0 driver comes 
with 9000 bytes as the default MTU?  At least I think that is what I am 
seeing now that I've started looking more closely.


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: IPoIB forwarding

2007-04-27 Thread Rick Jones

Bryan Lawver wrote:
You're right about the ipoib module not combining packets (I believed you 
without checking) but I did nevertheless.  The ipoib_start_xmit 
routine is definitely handed a double packet, which means that the IP 
NIC driver or the kernel is combining two packets into a single super 
jumbo packet.  This issue is irrespective of the IP MTU setting because 
I have set all interfaces to 9000, yet ipoib accepts and forwards this 
17964-byte packet to the next IB node and on to the TCP stack, where it is 
never acknowledged.  This may not have come up in prior testing because 
I am using some of the fastest IP NICs, which have no trouble keeping up 
with or exceeding the bandwidth of the IB side.  This issue arises 
exactly every 8 packets...(ring buffer overrun??)


I will be at Sonoma for the next few days as many on this list will be.



Some NICs (esp 10G) support large receive offload - they coalesce TCP segments 
from the wire/fiber into larger ones they pass up the stack.  Perhaps that is 
happening here?
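
To make the arithmetic concrete, here is a minimal C sketch (illustrative
only, not driver code) of how two coalesced MSS-sized segments blow past a
9000-byte egress MTU; the numbers are the ones from Bryan's trace:

#include <stdio.h>

int main(void)
{
    int mss = 8960;                /* MSS from the trace */
    int hdrs = 40 + 4;             /* IP+TCP headers (40) plus IPoIB (4) */
    int merged = 2 * mss + hdrs;   /* what LRO hands up the stack */
    int egress_mtu = 9000;

    printf("merged segment: %d bytes\n", merged);      /* 17964 */
    if (merged > egress_mtu)
        printf("too big to forward over a %d-byte MTU link\n", egress_mtu);
    return 0;
}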


I'm going to go out a bit on a limb, cross the streams, and include netdev, 
because I suspect that if a system is acting as an IP router, one doesn't want 
large receive offload enabled.  That may need some discussion in netdev - it may 
then require some changes to default settings or some documentation 
enhancements.  That or I'll learn that the stack is already dealing with the 
issue...


rick jones


bryan



At 11:06 AM 4/26/2007, Michael S. Tsirkin wrote:


 Quoting Bryan Lawver [EMAIL PROTECTED]:
 Subject: Re: IPoIB forwarding

 Here's a tcpdump of the same sequence.  The TCP MSS is 8960 and it appears
 that two payloads are queued at ipoib, which combines them into a single
 17920 payload with a presumably correct IP header (40) and IB header
 (4).  The application or TCP stack does not acknowledge this double packet,
 i.e., it does not ACK until each of the 8960-byte packets is resent
 individually.  Being an IB newbie, I am guessing this combining is
 allowable but may violate TCP protocol.

IPoIB does nothing like this - it's just a network device so
it sends all packets out as is.

--
MST



___
general mailing list
[EMAIL PROTECTED]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit 
http://openib.org/mailman/listinfo/openib-general


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: IPoIB forwarding

2007-04-27 Thread Rick Jones

Bryan Lawver wrote:
I hit the IP NIC over the head with a hammer and turned off all offload 
features; I no longer get the super jumbo packet, and I have symmetric 
performance.  This NIC supports ethtool -K ethX tso/tx/rx/sg on/off, 
and I am not sure at this time which one I needed to whack, but all off 
solved the problem.


Yeah, that does seem like a rather broad remedy, but I guess if it works... :) 
And I suppose most of those offloads don't matter for a NIC being used in a router.


The only problem is we don't know whether it worked because it slowed down 
the 10G side or because it disabled LRO as a side effect. If I were to guess, 
of the things listed, I'd guess that receive CKO (checksum offload) would 
have that side effect.


Just what sort of 10G NIC was this anyway?  With that knowledge we could 
probably narrow things down to a more specific modprobe setting, or maybe even 
an ethtool command, for some suitable revision of ethtool.


rick jones



Thanks for listening and re enforcing my search process.

bryan

At 01:32 PM 4/27/2007, Rick Jones wrote:


Bryan Lawver wrote:

You're right about the ipoib module not combining packets (I believed 
you without checking) but I did nevertheless.  The ipoib_start_xmit 
routine is definitely handed a double packet, which means that the 
IP NIC driver or the kernel is combining two packets into a single 
super jumbo packet.  This issue is irrespective of the IP MTU setting 
because I have set all interfaces to 9000, yet ipoib accepts and 
forwards this 17964-byte packet to the next IB node and on to the TCP 
stack, where it is never acknowledged.  This may not have come up in prior 
testing because I am using some of the fastest IP NICs, which have no 
trouble keeping up with or exceeding the bandwidth of the IB side.  
This issue arises exactly every 8 packets...(ring buffer overrun??)

I will be at Sonoma for the next few days as many on this list will be.




Some NICs (esp 10G) support large receive offload - they coalesce TCP 
segments from the wire/fiber into larger ones they pass up the stack.  
Perhaps that is happening here?


I'm going to go out a bit on a limb, cross the streams, and include 
netdev, because I suspect that if a system is acting as an IP router, 
one doesn't want large receive offload enabled.  That may need some 
discussion in netdev - it may then require some changes to default 
settings or some documentation enhancements.  That or I'll learn that 
the stack is already dealing with the issue...


rick jones


bryan

At 11:06 AM 4/26/2007, Michael S. Tsirkin wrote:


 Quoting Bryan Lawver [EMAIL PROTECTED]:
 Subject: Re: IPoIB forwarding

 Here's a tcpdump of the same sequence.  The TCP MSS is 8960 and it appears
 that two payloads are queued at ipoib, which combines them into a single
 17920 payload with a presumably correct IP header (40) and IB header
 (4).  The application or TCP stack does not acknowledge this double packet,
 i.e., it does not ACK until each of the 8960-byte packets is resent
 individually.  Being an IB newbie, I am guessing this combining is
 allowable but may violate TCP protocol.

IPoIB does nothing like this - it's just a network device so
it sends all packets out as is.

--
MST



___
general mailing list
[EMAIL PROTECTED]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit 
http://openib.org/mailman/listinfo/openib-general


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: IPoIB forwarding

2007-04-27 Thread Rick Jones

Bryan Lawver wrote:
I had so much debugging turned on that it was not the slowing of the 
traffic but the non-coalescing that was the remedy.  The NIC is a 
Myricom NIC, and these are easy options to set.


As chance would have it, I've played with some Myricom myri10ge NICs recently, 
and even disabled large receive offload during some netperf tests :)  It is a 
modprobe option.  Going back now to the driver source and the README I see :-)



<excerpt>
Troubleshooting
===

Large Receive Offload (LRO) is enabled by default.  This will
interfere with forwarding TCP traffic.  If you plan to forward TCP
traffic (using the host with the Myri10GE NIC as a router or bridge),
you must disable LRO.  To disable LRO, load the myri10ge driver
with myri10ge_lro set to 0:

 # modprobe myri10ge myri10ge_lro=0

Alternatively, you can disable LRO at runtime by disabling
receive checksum offloading via ethtool:

  # ethtool -K eth2 rx off

</excerpt>

rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: IPoIB forwarding

2007-04-27 Thread Rick Jones

David Miller wrote:

From: Rick Jones [EMAIL PROTECTED]
Date: Fri, 27 Apr 2007 16:37:49 -0700



Large Receive Offload (LRO) is enabled by default.  This will
interfere with forwarding TCP traffic.  If you plan to forward TCP
traffic (using the host with the Myri10GE NIC as a router or bridge),
you must disable LRO.  To disable LRO, load the myri10ge driver
with myri10ge_lro set to 0:



LRO should be disabled by default if the driver does this.  This is a
major and unacceptable bug.

Thanks for pointing this out Rick.


No problem - just to play whatif/devil's advocate for a bit though... is there 
any way to tie that in with the setting of net.ipv4.ip_forward (and/or its IPv6 
counterpart)?
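
For what it's worth, a minimal user-space sketch of the what-if: the sysctl
path is real, but the decision logic is purely illustrative and not anything
the driver actually does.

#include <stdio.h>

/* Read net.ipv4.ip_forward via procfs. */
static int ip_forward_enabled(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/ip_forward", "r");
    int v = 0;

    if (f) {
        if (fscanf(f, "%d", &v) != 1)
            v = 0;
        fclose(f);
    }
    return v;
}

int main(void)
{
    if (ip_forward_enabled())
        printf("forwarding on: load myri10ge with myri10ge_lro=0\n");
    else
        printf("forwarding off: LRO should be harmless\n");
    return 0;
}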


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Unexcepted latency (order of 100-200 ms) with TCP (packet receive)

2007-04-26 Thread Rick Jones

Ilpo Järvinen wrote:

Hi,

...
Some time ago I noticed that with 2.6.18 I occasionally get latency
spikes as long as 100-200 ms in the TCP transfers between components
(I describe later how TCP was tuned during these tests to avoid
problems that occur with small segments). I started to investigate
the spikes, and here are the findings so far:
...
 - I placed a hub to get exact timings on the wire without potential 
   interference from tcpdump on the emulator host (test done with 2.6.18), 
   but to my great surprise the problem vanished completely


Sounds like tcpdump getting in the way?  How many CPUs do you have in 
the system, and have you tried some explicit binding of processes to 
different CPUs? (taskset etc...)


When running tcpdump are you simply sending raw traces to a file, or are 
you having the ASCII redirected to a file?  What about name resolution 
(i.e., are you using -n)?


 - Due to the hub test result, I tested 10/100/duplex settings and found 
   out that if the emulator host has 10fd, the problem does not occur with
   2.6 either?!? This could be due to luck but I cannot say for sure; yet
   the couple of tests I've run with 10fd did not show this...
 - Tried to change the cable & switch that connect the hosts together, no effect


To prove this with 100 Mbps, I set up routing so that with a host with a 10/FD 
configuration (known, based on the earlier tests, to be unlikely to cause 
errors) I collected all traffic between the emulator host and one of the 
packet capture hosts. Here is one example point where a long delay occurs 
(EMU is the emulator host; in the real log each packet is shown twice, I 
only leave the later one here):


1177577267.364596 IP CAP.35305 > EMU.52246: . 17231434:17232894(1460) ack 383357 win 16293
1177577267.364688 IP CAP.35305 > EMU.52246: P 17232894:17232946(52) ack 383357 win 16293
1177577267.366093 IP EMU.52246 > CAP.35305: . ack 17232894 win 32718
1177577267.493815 IP EMU.52246 > CAP.35305: P 383357:383379(22) ack 17232894 win 32718
1177577267.534252 IP CAP.35305 > EMU.52246: . ack 383379 win 16293


What is the length of the standalone ACK timer these days?


1177577267.534517 IP EMU.59050 > CAP.58452: P 624496:624528(32) ack 328 win 365
1177577267.534730 IP CAP.58452 > EMU.59050: . ack 624528 win 16293
1177577267.536267 IP CAP.35305 > EMU.52246: . 17232946:17234406(1460) ack 383379 win 16293
1177577267.536360 IP CAP.35305 > EMU.52246: P 17234406:17234458(52) ack 383379 win 16293
1177577267.537764 IP EMU.52246 > CAP.35305: . ack 17234406 win 32718
...
All things use TCP_NODELAY. The network is isolated so that 
no other traffic can cause unexpected effects. The emulator collects 
its log only to a memory buffer that is flushed through TCP only between 
tests (and thus does not cause timing problems).


Might tcp_abc have crept back-in?

rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: net-2.6.22 UDP stalls/hangs

2007-04-23 Thread Rick Jones

Oh well, one thing at a time.  The good news is that I can reproduce the
problem with netperf.

akpm:/usr/src/netperf-2.4.3# netperf -H akpm2 -t UDP_RR
UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to akpm2 
(172.18.116.155) port 0 AF_INET
netperf: receive_response: no response received. errno 0 counter 0

That's running netserver on the test machine.

The machine running netperf is 172.18.116.160 and the test machine running
netserver is 172.18.116.155

tcpdump from the test machine:

15:24:37.924210 802.1d config 8000.00:18:74:5d:04:66.80ae root 0066.00:15:c7:20:57:c0 pathcost 4 age 1 max 20 hello 2 fdelay 15
15:24:38.859309 IP 172.18.119.252.hsrp > 224.0.0.2.hsrp: HSRPv0-hello 20: state=standby group=1 addr=172.18.119.254
15:24:39.078273 IP 172.18.119.253.hsrp > 224.0.0.2.hsrp: HSRPv0-hello 20: state=active group=1 addr=172.18.119.254
15:24:39.924074 802.1d config 8000.00:18:74:5d:04:66.80ae root 0066.00:15:c7:20:57:c0 pathcost 4 age 1 max 20 hello 2 fdelay 15
15:24:40.017081 IP 172.24.0.7.domain > 172.18.116.57.37456: 59635 4/7/6 CNAME[|domain]
15:24:41.383433 IP 172.18.116.160.33137 > 172.18.116.155.12865: S 2760291763:2760291763(0) win 5840 <mss 1460,sackOK,timestamp 1967355840 0,nop,wscale 8>
15:24:41.383479 IP 172.18.116.155.12865 > 172.18.116.160.33137: S 1640262480:1640262480(0) ack 2760291764 win 5792 <mss 1460,sackOK,timestamp 7714 1967355840,nop,wscale 7>
15:24:41.383683 IP 172.18.116.160.33137 > 172.18.116.155.12865: . ack 1 win 23 <nop,nop,timestamp 1967355840 7714>
15:24:41.383883 IP 172.18.116.160.33137 > 172.18.116.155.12865: P 1:257(256) ack 1 win 23 <nop,nop,timestamp 1967355840 7714>
15:24:41.383902 IP 172.18.116.155.12865 > 172.18.116.160.33137: . ack 257 win 54 <nop,nop,timestamp 7714 1967355840>
15:24:41.384065 IP 172.18.116.155.12865 > 172.18.116.160.33137: P 1:257(256) ack 257 win 54 <nop,nop,timestamp 7714 1967355840>
15:24:41.587266 IP 172.18.116.155.12865 > 172.18.116.160.33137: P 1:257(256) ack 257 win 54 <nop,nop,timestamp 7765 1967355840>
15:24:41.839234 IP 172.18.119.252.hsrp > 224.0.0.2.hsrp: HSRPv0-hello 20: state=standby group=1 addr=172.18.119.254
15:24:41.924303 802.1d config 8000.00:18:74:5d:04:66.80ae root 0066.00:15:c7:20:57:c0 pathcost 4 age 1 max 20 hello 2 fdelay 15
15:24:41.995285 IP 172.18.116.155.12865 > 172.18.116.160.33137: P 1:257(256) ack 257 win 54 <nop,nop,timestamp 7867 1967355840>
15:24:42.030341 IP 172.18.119.253.hsrp > 224.0.0.2.hsrp: HSRPv0-hello 20: state=active group=1 addr=172.18.119.254
15:24:42.811330 IP 172.18.116.155.12865 > 172.18.116.160.33137: P 1:257(256) ack 257 win 54 <nop,nop,timestamp 8071 1967355840>
15:24:43.924183 802.1d config 8000.00:18:74:5d:04:66.80ae root 0066.00:15:c7:20:57:c0 pathcost 4 age 1 max 20 hello 2 fdelay 15
15:24:44.121880 IP 172.24.0.7.domain > 172.18.116.22.46700: 52073* 1/4/4 A[|domain]
15:24:44.443419 IP 172.18.116.155.12865 > 172.18.116.160.33137: P 1:257(256) ack 257 win 54 <nop,nop,timestamp 8479 1967355840>
15:24:44.723257 IP 172.18.119.252.hsrp > 224.0.0.2.hsrp: HSRPv0-hello 20: state=standby group=1 addr=172.18.119.254
15:24:44.886356 IP 172.18.119.253.hsrp > 224.0.0.2.hsrp: HSRPv0-hello 20: state=active group=1 addr=172.18.119.254
15:24:45.924263 802.1d config 8000.00:18:74:5d:04:66.80ae root 0066.00:15:c7:20:57:c0 pathcost 4 age 1 max 20 hello 2 fdelay 15
15:24:47.659300 IP 172.18.119.252.hsrp > 224.0.0.2.hsrp: HSRPv0-hello 20: state=standby group=1 addr=172.18.119.254
15:24:47.707599 IP 172.18.116.155.12865 > 172.18.116.160.33137: P 1:257(256) ack 257 win 54 <nop,nop,timestamp 9295 1967355840>
15:24:47.874419 IP 172.18.119.253.hsrp > 224.0.0.2.hsrp: HSRPv0-hello 20: state=active group=1 addr=172.18.119.254
15:24:47.952350 802.1d config 8000.00:18:74:5d:04:66.80ae root 0066.00:15:c7:20:57:c0 pathcost 4 age 1 max 20 hello 2 fdelay 15
15:24:48.037569 IP 172.24.0.7.domain > 172.18.117.18.46665: 59092 2/7/6 CNAME[|domain]


So I think we did a bit of TCP chatter then no UDP at all?


Looks that way, and on top of it, netperf got no results back from netserver 
on the control (TCP, port 12865) connection.  Adding some -d's to the 
global options will cause netperf to regurgitate what messages it is 
sending and such.


I'd have expected that even if no UDP traffic could flow between netperf 
and netserver the timer running in the netserver _should_ have gotten it 
out of the recv()/recvfrom() call in recv_udp_rr() (src/nettest_bsd.c) 
and that netperf would then report a normal result of just 0 
transactions per second.


Either that timer didn't get set, didn't fire, or was insufficient to 
get netserver out of that recv() on the UDP socket, or comms between the 
two systems got fubar for TCP too.
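
For reference, a minimal sketch of the sort of timer mechanism being
described - an interval alarm installed without SA_RESTART so a blocking
recv() returns -1/EINTR when it fires. This is illustrative only, not
netperf's actual code:

#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

static volatile sig_atomic_t times_up;

static void alarm_handler(int sig)
{
    (void)sig;
    times_up = 1;
}

/* Blocking recv() that should be interrupted after secs seconds. */
static ssize_t timed_recv(int fd, void *buf, size_t len, unsigned secs)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = alarm_handler;   /* note: no SA_RESTART */
    sigaction(SIGALRM, &sa, NULL);
    alarm(secs);
    return recv(fd, buf, len, 0);    /* -1/EINTR once the alarm fires */
}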


rick jones


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [1/2] 2.6.21-rc7: known regressions

2007-04-16 Thread Dave Jones
On Mon, Apr 16, 2007 at 05:14:40PM -0700, Brandeburg, Jesse wrote:
  Adrian Bunk wrote:
   Subject: laptops with e1000: lockups
   References :
   https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=229603 
   Submitter  : Dave Jones [EMAIL PROTECTED]
   Handled-By : Jesse Brandeburg [EMAIL PROTECTED]
   Status : problem is being debugged
  
  this is being actively debugged, here is what we have so far:
  o v2.6.20: crashes during boot, unless noacpi and nousb bootparams used
  o v2.6.21-rc6: some userspace issue, crashes just after root mount
  without init=/bin/bash
  o v2.6.2X: serial console in docking station spews goo at all speeds
  with console=ttyS0,n8 . work continues on this, as we don't know if
  there are kernel panic messages during the hard lock.
  o fedora 7 test kernel 2948: boots okay, have been using this as only
  truly working kernel on this machine.
  
  one reproduction of the problem was had with scp -l 5000 file remote
  when linked at 100Mb/Full.  Tried probably 20 other times same test with
  no repro, ugh.
  
  Otherwise, slogging through continues.  We are actively working on this
  in case it *is* an e1000 issue.  Right now the repro is so unlikely we
  could hardly tell if we fixed it.

FWIW, I can reproduce this pretty much on demand, on 100M through
the ethernet port on a netgear wireless AP.
A number of our Fedora 7 testers are also able to easily reproduce this.
To isolate e1000, for tomorrow's test build I've reverted e1000 to
the same code that was in 2.6.20.  If that works out without causing
hangs, I'll try and narrow down further which of the dozen csets
is responsible.

Dave

-- 
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


2.6.21rc7 e1000 media-detect oddness.

2007-04-15 Thread Dave Jones
I booted up 2.6.21rc7 without an ethernet cable plugged in,
and noticed this..

e1000: :02:00.0: e1000_probe: The EEPROM Checksum Is Not Valid
e1000: probe of :02:00.0 failed with error -5

I plugged a cable in, did rmmod e1000;modprobe e1000, and got this..

e1000: :02:00.0: e1000_probe: (PCI Express:2.5Gb/s:Width x1) 
00:16:d3:3a:62:d3
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex, Flow Control: 
RX
e1000: eth0: e1000_watchdog: 10/100 speed: disabling TSO

and it works fine..
Why would no cable make it think the EEPROM is invalid ?

I repeated this a few times, just to be sure it wasn't a fluke, and it
seems to happen 100% reproducibly.

Dave

-- 
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFT] proxy arp deadlock possible

2007-04-05 Thread Dave Jones
On Wed, Apr 04, 2007 at 06:10:42PM -0700, Arjan van de Ven wrote:
  On Thu, 2007-04-05 at 10:44 +1000, Herbert Xu wrote:
   Stephen Hemminger [EMAIL PROTECTED] wrote:
Thanks Dave, there is a classic AB BA deadlock here.
We should break the dependency like this.

Could someone who uses proxy ARP test this?
   
   Sorry Stephen, this isn't necessary.  The lockdep thing is
   simply confused here.  It's treating tbl->proxy_queue as the
   same thing as neigh->arp_queue when they're clearly different.
   
   I'm disappointed that after all this time lockdep is still
   producing bogus reports like this.  I'm sure we've been
   through this particular issue many times already.
  
  
  what's the exact lockdep output here?

http://www.mail-archive.com/netdev@vger.kernel.org/msg35266.html

Dave

-- 
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ethtool: additional 10Gig niceness

2007-04-05 Thread Rick Jones

applied


Thanks.

One thing I noticed while making the changes is that the reported speed 
is kept in a u16.  With 10G we are already 1/6 of the way to the 
maximum.  I've no idea when 100G will arrive, but euros to berliners it 
will probably arrive some day, which means something will have to give.


I've not thought it through completely, but my initial reaction would be 
to suggest just making the thing a 64 bit quantity reporting bits and 
not worry about it again.  And then one doesn't have to worry if ethtool 
starts being applied to links which do not run at integral multiples of 
a Mbit/s.
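
A quick back-of-the-envelope on the u16 concern (my arithmetic, not
anything in ethtool itself): speeds are reported in Mb/s, so the ceiling
is 65535.

#include <limits.h>
#include <stdio.h>

int main(void)
{
    unsigned short speed_10g = 10000;   /* 10G in Mb/s - fits in a u16 */
    unsigned long speed_100g = 100000;  /* 100G in Mb/s - does not fit */

    printf("u16 max: %u Mb/s\n", (unsigned)USHRT_MAX);   /* 65535 */
    printf("10G: %u, 100G: %lu\n", (unsigned)speed_10g, speed_100g);
    return 0;
}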


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


lockdep report from 2.6.20.5-rc1

2007-04-04 Thread Dave Jones
===
[ INFO: possible circular locking dependency detected ]
2.6.20-1.2933.fc6debug #1
---
swapper/0 is trying to acquire lock:
 (tbl->lock){-+-+}, at: [<c05d5664>] neigh_lookup+0x43/0xa2

but task is already holding lock:
 (list->lock#4){-+..}, at: [<c05d65c8>] neigh_proxy_process+0x20/0xc2

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (list->lock#4){-+..}:
   [<c043f4f2>] __lock_acquire+0x913/0xa43
   [<c043f933>] lock_acquire+0x56/0x6f
   [<c06325df>] _spin_lock_irqsave+0x34/0x44
   [<c05cbb87>] skb_dequeue+0x12/0x43
   [<c05cc9d4>] skb_queue_purge+0x14/0x1b
   [<c05d5b70>] neigh_update+0x349/0x3a5
   [<c060cd37>] arp_process+0x4d1/0x50a
   [<c060ce53>] arp_rcv+0xe3/0x100
   [<c05d0e43>] netif_receive_skb+0x2db/0x35a
   [<c05d2806>] process_backlog+0x95/0xf6
   [<c05d29ed>] net_rx_action+0xa1/0x1a8
   [<c042c1f6>] __do_softirq+0x6f/0xe2
   [<c04063c6>] do_softirq+0x61/0xd0
   [] 0x

-> #1 (n->lock){-+-+}:
   [<c043f4f2>] __lock_acquire+0x913/0xa43
   [<c043f933>] lock_acquire+0x56/0x6f
   [<c063231e>] _write_lock+0x2b/0x38
   [<c05d74b7>] neigh_periodic_timer+0x99/0x138
   [<c042f053>] run_timer_softirq+0x104/0x168
   [<c042c1f6>] __do_softirq+0x6f/0xe2
   [<c04063c6>] do_softirq+0x61/0xd0
   [] 0x

-> #0 (tbl->lock){-+-+}:
   [<c043f3f3>] __lock_acquire+0x814/0xa43
   [<c043f933>] lock_acquire+0x56/0x6f
   [<c06323d6>] _read_lock_bh+0x30/0x3d
   [<c05d5664>] neigh_lookup+0x43/0xa2
   [<c05d6317>] neigh_event_ns+0x2c/0x7a
   [<c060cbec>] arp_process+0x386/0x50a
   [<c060ce78>] parp_redo+0x8/0xa
   [<c05d660e>] neigh_proxy_process+0x66/0xc2
   [<c042f053>] run_timer_softirq+0x104/0x168
   [<c042c1f6>] __do_softirq+0x6f/0xe2
   [<c04063c6>] do_softirq+0x61/0xd0
   [] 0x

other info that might help us debug this:

1 lock held by swapper/0:
 #0:  (list->lock#4){-+..}, at: [<c05d65c8>] neigh_proxy_process+0x20/0xc2

stack backtrace:
 [<c04051dd>] show_trace_log_lvl+0x1a/0x2f
 [<c0405782>] show_trace+0x12/0x14
 [<c0405806>] dump_stack+0x16/0x18
 [<c043dcf5>] print_circular_bug_tail+0x5f/0x68
 [<c043f3f3>] __lock_acquire+0x814/0xa43
 [<c043f933>] lock_acquire+0x56/0x6f
 [<c06323d6>] _read_lock_bh+0x30/0x3d
 [<c05d5664>] neigh_lookup+0x43/0xa2
 [<c05d6317>] neigh_event_ns+0x2c/0x7a
 [<c060cbec>] arp_process+0x386/0x50a
 [<c060ce78>] parp_redo+0x8/0xa
 [<c05d660e>] neigh_proxy_process+0x66/0xc2
 [<c042f053>] run_timer_softirq+0x104/0x168
 [<c042c1f6>] __do_softirq+0x6f/0xe2
 [<c04063c6>] do_softirq+0x61/0xd0
 ===


-- 
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ethtool: additional 10Gig niceness

2007-04-04 Thread Rick Jones

teach ethtool to print 10000Mb/s for a 10G NIC and prepare
for 10G NICs where it is possible to run something other than 10G
update the ethtool.8 manpage with info re same and some grammar fixes

Signed-off-by: Rick Jones [EMAIL PROTECTED]

the likely required asbestos at the ready :)

rick jones
From c58b73af0744a3d5dbc4fbf23c7b4d6f9092d21a Mon Sep 17 00:00:00 2001
From: Rick Jones [EMAIL PROTECTED]
Date: Mon, 2 Apr 2007 13:45:53 -0700
Subject: [PATCH] ethtool: additional 10Gig niceness

teach ethtool to print 10000Mb/s for a 10G NIC and prepare
for 10G NICs where it is possible to run something other than 10G
update the ethtool.8 manpage with info re same and some grammar fixes

Signed-off-by: Rick Jones [EMAIL PROTECTED]
---
 ethtool.8 |  107 +++-
 ethtool.c |   20 ++-
 2 files changed, 73 insertions(+), 54 deletions(-)

diff --git a/ethtool.8 b/ethtool.8
index d247d51..d6561bf 100644
--- a/ethtool.8
+++ b/ethtool.8
@@ -175,7 +175,7 @@ ethtool \- Display or change ethernet card settings
 
 .B ethtool \-s
 .I ethX
-.B3 speed 10 100 1000
+.B4 speed 10 100 1000 10000
 .B2 duplex half full
 .B4 port tp aui bnc mii fibre
 .B2 autoneg on off
@@ -193,65 +193,65 @@ ethtool \- Display or change ethernet card settings
 is used for querying settings of an ethernet device and changing them.
 
 .I ethX
-is the name of the ethernet device to work on.
+is the name of the ethernet device on which ethtool should operate.
 
 .SH OPTIONS
 .B ethtool
 with a single argument specifying the device name prints current
-setting of the specified device.
+settings of the specified device.
 .TP
 .B \-h \-\-help
-shows a short help message.
+Shows a short help message.
 .TP
 .B \-a \-\-show\-pause
-queries the specified ethernet device for pause parameter information.
+Queries the specified ethernet device for pause parameter information.
 .TP
 .B \-A \-\-pause
-change the pause parameters of the specified ethernet device.
+Changes the pause parameters of the specified ethernet device.
 .TP
 .A2 autoneg on off
-Specify if pause autonegotiation is enabled.
+Specifies whether pause autonegotiation should be enabled.
 .TP
 .A2 rx on off
-Specify if RX pause is enabled.
+Specifies whether RX pause should be enabled.
 .TP
 .A2 tx on off
-Specify if TX pause is enabled.
+Specifies whether TX pause should be enabled.
 .TP
 .B \-c \-\-show\-coalesce
-queries the specified ethernet device for coalescing information.
+Queries the specified ethernet device for coalescing information.
 .TP
 .B \-C \-\-coalesce
-change the coalescing settings of the specified ethernet device.
+Changes the coalescing settings of the specified ethernet device.
 .TP
 .B \-g \-\-show\-ring
-queries the specified ethernet device for rx/tx ring parameter information.
+Queries the specified ethernet device for rx/tx ring parameter information.
 .TP
 .B \-G \-\-set\-ring
-change the rx/tx ring parameters of the specified ethernet device.
+Changes the rx/tx ring parameters of the specified ethernet device.
 .TP
 .BI rx \ N
-Change number of ring entries for the Rx ring.
+Changes the number of ring entries for the Rx ring.
 .TP
 .BI rx-mini \ N
-Change number of ring entries for the Rx Mini ring.
+Changes the number of ring entries for the Rx Mini ring.
 .TP
 .BI rx-jumbo \ N
-Change number of ring entries for the Rx Jumbo ring.
+Changes the number of ring entries for the Rx Jumbo ring.
 .TP
 .BI tx \ N
-Change number of ring entries for the Tx ring.
+Changes the number of ring entries for the Tx ring.
 .TP
 .B \-i \-\-driver
-queries the specified ethernet device for associated driver information.
+Queries the specified ethernet device for associated driver information.
 .TP
 .B \-d \-\-register\-dump
-retrieves and prints a register dump for the specified ethernet device.
+Retrieves and prints a register dump for the specified ethernet device.
 The register format for some devices is known and decoded others
 are printed in hex.
 When 
 .I raw 
-is enabled, then it dumps the raw register data to stdout.
+is enabled, then ethtool dumps the raw register data to stdout.
 If
 .I file
 is specified, then use contents of previous raw register dump, rather
@@ -259,7 +259,7 @@ than reading from the device.
 
 .TP
 .B \-e \-\-eeprom\-dump
-retrieves and prints an EEPROM dump for the specified ethernet device.
+Retrieves and prints an EEPROM dump for the specified ethernet device.
 When raw is enabled, then it dumps the raw EEPROM data to stdout. The
 length and offset parameters allow dumping certain portions of the EEPROM.
 Default is to dump the entire EEPROM.
@@ -271,31 +271,31 @@ of writing to the EEPROM, a device-specific magic key 
must be specified
 to prevent the accidental writing to the EEPROM.
 .TP
 .B \-k \-\-show\-offload
-queries the specified ethernet device for offload information.
+Queries the specified ethernet device for offload information.
 .TP
 .B \-K \-\-offload
-change the offload

NIC data corruption

2007-04-02 Thread Rick Jones
I changed the title to be more accurate, and culled the distribution to 
individuals and netdev.


The mention of trying to turn-off CKO and see if the data corruption 
goes away leads me to ask a possibly delicate question:


  Should Linux only enable CKO on those NICs certified to have
  ECC/parity throughout their _entire_ data path?

rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] NET: Add TCP connection abort IOCTL

2007-03-30 Thread Rick Jones
If the switchover from active to standby is commanded then there is 
the opportunity to tell the applications on the server to close their 
connections - either explicitly with some sort of defined interface, or 
implicitly by killing the processes.  Then the IP can be brought up on 
the standby, processes started/enabled/whatever, and the clients can 
establish their new connections.  The ioctl here (at least if it is like 
the tcp_discon options in HP-UX/Solaris) wouldn't be any better than 
just killing the process in so far as what happens on the network - in 
fact, it could be worse, since the RST will not be retransmitted if lost, 
but FINs would.  So, the ioctl could still leave clients twisting in the 
ether waiting for their application-level heartbeats to kick in anyway. 
 Heck, depending on their heartbeat lengths, even the FIN stuff, if lost, 
could leave them depending on their heartbeats.


If the switchover from active to standby is uncommanded it probably 
means the primary went belly-up which means you don't have the 
opportunity to make an ioctl call anyway, and you are back to the 
heartbeats.


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: L2 network namespace benchmarking

2007-03-28 Thread Rick Jones

If I read the results right it took a 32bit machine from AMD with
a gigabit interface before you could measure a throughput difference.
That isn't shabby for a non-optimized code path.


Just some paranoid ramblings - one needs to look beyond just whether or 
not the performance of a bulk transfer test (eg TCP_STREAM) remains able 
to hit link-rate.  One has to also consider the change in service demand 
(the normalization of CPU util and throughput).  Also, with 
functionality like TSO in place, the ability to pass very large things 
down the stack can help cover for a multitude of path-length sins.  And 
with either multiple 1G or 10G NICs becoming more and more prevalent, we 
have another one of those NIC speed vs CPU speed switch-overs, so 
maintaining single-NIC 1 gigabit throughput, while necessary, isn't 
(IMO) sufficient.
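
For concreteness, a hedged sketch of the normalization being described
(the numbers are made up; netperf reports service demand as usec of CPU
per KB, or per transaction for the _RR tests):

#include <stdio.h>

int main(void)
{
    double cpu_util  = 0.35;    /* fraction of the machine busy (assumed) */
    double num_cpus  = 2.0;
    double elapsed_s = 60.0;
    double kbytes    = 7.0e6;   /* KB transferred during the run (assumed) */

    /* CPU microseconds consumed, spread over the work done. */
    double cpu_usec = cpu_util * num_cpus * elapsed_s * 1e6;
    printf("service demand: %.3f usec/KB\n", cpu_usec / kbytes);
    return 0;
}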


So, it becomes very important to go beyond just TCP_STREAM tests when 
evaluating these sorts of things.  Another test to run would be the 
TCP_RR test.  TCP_RR with single-byte request/response sizes will 
bypass the TSO stuff, and the transaction rate will be more directly 
affected by the change in path length than a TCP_STREAM test.  It will 
also show up quite clearly in the service demand.  Now, with NICs doing 
interrupt coalescing, if the NIC is strapped poorly (IMO) then you may 
not see a change in transaction rate - it may be getting limited 
artificially by the NIC's interrupt coalescing.  So, one has to fall back 
on service demand, or better yet, disable the interrupt coalescing.
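
(For reference, a single-byte TCP_RR run looks something like
"netperf -H remotehost -t TCP_RR -- -r 1,1", with the test-specific -r
option setting the request and response sizes.)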


Otherwise, measuring peak aggregate request/response becomes necessary.


rick jones
don't be blinded by bit-rate
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: L2 network namespace benchmarking

2007-03-28 Thread Rick Jones
Do you have any pointers to help with benchmarking the network, perhaps a 
checklist or some scripts for netperf?


There are some scripts in doc/examples but they are probably a bit long 
in the tooth by now.


The main writeup _I_ have on netperf would be the manual, which was 
recently updated for the 2.4.3 release.


http://www.netperf.org/svn/netperf2/tags/netperf-2.4.3/doc/netperf.html

or the current top of trunk:

http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html

There is also a [EMAIL PROTECTED] mailing list which one can join 
and have discussions about netperf, and a [EMAIL PROTECTED] if one 
wants to discuss actual netperf (netperf2 or netperf4) development.


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


has ethtool been patched for 10 gig speed reporting?

2007-03-27 Thread Rick Jones
I have some 10gig nics and ethtool is reporting unknown for the speed. 
Is there already a patch, or have I found an opportunity? FWIW, the 
version five bits from sourceforge still report unknown - are they the 
latest or are there later bits somewhere?


thanks,

rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] NET: Add TCP connection abort IOCTL

2007-03-27 Thread Rick Jones

There is no reason for this ioctl at all.  Either existing
facilities provide what you need or what you want is a
protocol violation we can't do.


I agree that 99 times out of ten such a mechanism serves only as a 
massive KLUDGE to paper over application bugs.  I'll also sadly 
point out that such a mechanism exists in HP-UX 11.X, and I suspect 
Solaris !-(  I've spent probably the last decade or so attempting to 
discourage its use in the HP-UX space, but like some daemon from hell it 
just refuses to die.


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


fix up misplaced inlines.

2007-03-21 Thread Dave Jones
Turning up the warnings on gcc makes it emit warnings
about the placement of 'inline' in function declarations.
Here's everything that was under net/

Signed-off-by: Dave Jones [EMAIL PROTECTED]

diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c
index 4c914df..ecfe8da 100644
--- a/net/bluetooth/hidp/core.c
+++ b/net/bluetooth/hidp/core.c
@@ -319,7 +319,7 @@ static int __hidp_send_ctrl_message(struct hidp_session 
*session,
return 0;
 }
 
-static int inline hidp_send_ctrl_message(struct hidp_session *session,
+static inline int hidp_send_ctrl_message(struct hidp_session *session,
unsigned char hdr, unsigned char *data, int size)
 {
int err;
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 7712d76..5439a3c 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -61,7 +61,7 @@ static int brnf_filter_vlan_tagged __read_mostly = 1;
 #define brnf_filter_vlan_tagged 1
 #endif
 
-static __be16 inline vlan_proto(const struct sk_buff *skb)
+static inline __be16 vlan_proto(const struct sk_buff *skb)
 {
return vlan_eth_hdr(skb)->h_vlan_encapsulated_proto;
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 8d65d64..27c4f62 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -808,7 +808,7 @@ lenout:
  *
  * (We also register the sk_lock with the lock validator.)
  */
-static void inline sock_lock_init(struct sock *sk)
+static inline void sock_lock_init(struct sock *sk)
 {
sock_lock_init_class_and_name(sk,
af_family_slock_key_strings[sk->sk_family],
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index a7fee6b..1b61699 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -804,7 +804,7 @@ struct ipv6_saddr_score {
 #define IPV6_SADDR_SCORE_LABEL 0x0020
 #define IPV6_SADDR_SCORE_PRIVACY   0x0040
 
-static int inline ipv6_saddr_preferred(int type)
+static inline int ipv6_saddr_preferred(int type)
 {
if (type & (IPV6_ADDR_MAPPED|IPV6_ADDR_COMPATv4|
IPV6_ADDR_LOOPBACK|IPV6_ADDR_RESERVED))
@@ -813,7 +813,7 @@ static int inline ipv6_saddr_preferred(int type)
 }
 
 /* static matching label */
-static int inline ipv6_saddr_label(const struct in6_addr *addr, int type)
+static inline int ipv6_saddr_label(const struct in6_addr *addr, int type)
 {
  /*
   *prefix (longest match)  label
@@ -3318,7 +3318,7 @@ errout:
rtnl_set_sk_err(RTNLGRP_IPV6_IFADDR, err);
 }
 
-static void inline ipv6_store_devconf(struct ipv6_devconf *cnf,
+static inline void ipv6_store_devconf(struct ipv6_devconf *cnf,
__s32 *array, int bytes)
 {
BUG_ON(bytes < (DEVCONF_MAX * 4));
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0e1f4b2..a6b3117 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -308,7 +308,7 @@ static inline void rt6_probe(struct rt6_info *rt)
 /*
  * Default Router Selection (RFC 2461 6.3.6)
  */
-static int inline rt6_check_dev(struct rt6_info *rt, int oif)
+static inline int rt6_check_dev(struct rt6_info *rt, int oif)
 {
struct net_device *dev = rt->rt6i_dev;
int ret = 0;
@@ -328,7 +328,7 @@ static int inline rt6_check_dev(struct rt6_info *rt, int 
oif)
return ret;
 }
 
-static int inline rt6_check_neigh(struct rt6_info *rt)
+static inline int rt6_check_neigh(struct rt6_info *rt)
 {
struct neighbour *neigh = rt->rt6i_nexthop;
int m = 0;
diff --git a/net/ipv6/xfrm6_tunnel.c b/net/ipv6/xfrm6_tunnel.c
index ee4b84a..93c4223 100644
--- a/net/ipv6/xfrm6_tunnel.c
+++ b/net/ipv6/xfrm6_tunnel.c
@@ -58,7 +58,7 @@ static struct kmem_cache *xfrm6_tunnel_spi_kmem __read_mostly;
 static struct hlist_head 
xfrm6_tunnel_spi_byaddr[XFRM6_TUNNEL_SPI_BYADDR_HSIZE];
 static struct hlist_head xfrm6_tunnel_spi_byspi[XFRM6_TUNNEL_SPI_BYSPI_HSIZE];
 
-static unsigned inline xfrm6_tunnel_spi_hash_byaddr(xfrm_address_t *addr)
+static inline unsigned xfrm6_tunnel_spi_hash_byaddr(xfrm_address_t *addr)
 {
unsigned h;
 
@@ -70,7 +70,7 @@ static unsigned inline 
xfrm6_tunnel_spi_hash_byaddr(xfrm_address_t *addr)
return h;
 }
 
-static unsigned inline xfrm6_tunnel_spi_hash_byspi(u32 spi)
+static inline unsigned xfrm6_tunnel_spi_hash_byspi(u32 spi)
 {
return spi % XFRM6_TUNNEL_SPI_BYSPI_HSIZE;
 }
diff --git a/net/sched/cls_route.c b/net/sched/cls_route.c
index e85df07..abc47cc 100644
--- a/net/sched/cls_route.c
+++ b/net/sched/cls_route.c
@@ -93,7 +93,7 @@ void route4_reset_fastmap(struct net_device *dev, struct 
route4_head *head, u32
spin_unlock_bh(&dev->queue_lock);
 }
 
-static void __inline__
+static inline void
 route4_set_fastmap(struct route4_head *head, u32 id, int iif,
   struct route4_filter *f)
 {
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 9678995..e81e2fb 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -2025,7 +2025,7 @@ nlmsg_failure:
return -1;
 }
 
-static int inline

Re: ping DOS avoidance?

2007-03-15 Thread Rick Jones
I was just asked about something not too different, involving IIRC 
tnsping.  It got me looking at ip_sysctl.txt, which has:


icmp_ratelimit - INTEGER
Limit the maximal rates for sending ICMP packets whose type
matches icmp_ratemask (see below) to specific targets.
0 to disable any limiting, otherwise the maximal rate in
jiffies(1)
Default: 100

icmp_ratemask - INTEGER
Mask made of ICMP types for which rates are being limited.
Significant bits: IHGFEDCBA9876543210
Default mask: 0000001100000011000 (6168)

Bit definitions (see include/linux/icmp.h):
0 Echo Reply
3 Destination Unreachable *
4 Source Quench *
5 Redirect
8 Echo Request
B Time Exceeded *
C Parameter Problem *
D Timestamp Request
E Timestamp Reply
F Info Request
G Info Reply
H Address Mask Request
I Address Mask Reply

* These are rate limited by default (see default mask above)


(I've always been used to masks being specified as hex values)
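
For the hex-minded, a quick check of the documented default (my
arithmetic, not from the kernel source): the rate-limited types 3, 4,
B (11) and C (12) OR together to exactly 6168.

#include <stdio.h>

int main(void)
{
    unsigned mask = (1u << 3)    /* Destination Unreachable */
                  | (1u << 4)    /* Source Quench */
                  | (1u << 11)   /* Time Exceeded */
                  | (1u << 12);  /* Parameter Problem */

    printf("mask = %u = 0x%x\n", mask, mask);   /* 6168 = 0x1818 */
    return 0;
}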

rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bridge: faster compare for link local addresses

2007-03-12 Thread Rick Jones

Stephen Hemminger wrote:

Use logic operations rather than memcmp() to compare destination
address with link local multicast addresses.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

---
 net/bridge/br_input.c |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

--- netem-dev.orig/net/bridge/br_input.c
+++ netem-dev/net/bridge/br_input.c
@@ -112,7 +112,11 @@ static int br_handle_local_finish(struct
  */
 static inline int is_link_local(const unsigned char *dest)
 {
-   return memcmp(dest, br_group_address, 5) == 0 && (dest[5] & 0xf0) == 0;
+   const u16 *a = (const u16 *) dest;
+   static const u16 *const b = (const u16 *const ) br_group_address;
+   static const u16 m = __constant_cpu_to_be16(0xfff0);
+
+   return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | ((a[2] ^ b[2]) & m)) == 0;
 }


Being paranoid - are there no worries about the alignment of dest?

rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TCP 2MSL on loopback

2007-03-06 Thread Rick Jones
This is probably not something that happens in real world deployments. 
But it's not 60,000 concurrent connections, it's 60,000 within a 2 
minute span.


Sounds like a case of Doctor! Doctor! It hurts when I do this.



I'm not saying this is a high priority problem, I only encountered it in 
a test scenario where I was deliberately trying to max out the server.



Ideally the 2MSL parameter would be dynamically adjusted based on the
route to the destination and the weights associated with those routes.
In the simplest case, connections between machines on the same subnet
(i.e., no router hops involved) should have a much smaller default value
than connections that traverse any routers. I'd settle for a two-level
setting - with no router hops, use the small value; with any router hops
use the large value.


With transparent bridging, nobody knows how long the datagram may be out 
there.  Admittedly, the chances of a datagram living for a full two 
minutes these days are probably nil, but just being in the same IP subnet 
doesn't really mean anything when it comes to physical locality.


It's a combination of 2MSL and /proc/sys/net/ipv4/ip_local_port_range - 
on my system the default port range is 32768-61000. That means if I use 
up 28232 ports in less than 2MSL then everything stops. netstat will 
show that all the available port numbers are in TIME_WAIT state. And 
this is particularly bad because while waiting for the timeout, I can't 
initiate any new outbound connections of any kind at all - telnet, ssh, 
whatever, you have to wait for at least one port to free up. 
(Interesting denial of service there)
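
The arithmetic behind the quoted figures, for concreteness (the 2-minute
2MSL is taken from the discussion above):

#include <stdio.h>

int main(void)
{
    int ports   = 61000 - 32768;   /* default ip_local_port_range: 28232 */
    double msl2 = 120.0;           /* 2MSL in seconds, per the thread */

    /* Burn ports faster than this and new outbound connects stall. */
    printf("~%.0f connections/second sustainable\n", ports / msl2);
    return 0;
}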


SPECweb benchmarking has had to deal with the issue of attempted 
TIME_WAIT reuse going back to 1997.  It deals with it by not relying on 
the client's configured local/anonymous/ephemeral port number range and 
instead making explicit bind() calls in the (more or less) entire unpriv 
port range (actually it may just be from 5000 to 65535, but still).
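
A minimal sketch of that approach (not SPECweb source; the names and
error handling are mine): bind each new socket to an explicit local port
before connecting, and let a bind()/connect() failure tell you the port
is still in TIME_WAIT.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Try to connect to dst from the given local port; -1 on failure. */
static int connect_from_port(const struct sockaddr_in *dst,
                             unsigned short port)
{
    struct sockaddr_in src;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&src, 0, sizeof(src));
    src.sin_family = AF_INET;
    src.sin_addr.s_addr = htonl(INADDR_ANY);
    src.sin_port = htons(port);              /* explicit local port */
    if (bind(fd, (const struct sockaddr *)&src, sizeof(src)) < 0 ||
        connect(fd, (const struct sockaddr *)dst, sizeof(*dst)) < 0) {
        close(fd);
        return -1;   /* e.g. EADDRINUSE: port still in TIME_WAIT */
    }
    return fd;
}

The caller would walk the port range (say 5000 through 65535), skipping
ports that fail and wrapping around as ports age out of TIME_WAIT.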


Now, if it weren't necessary to fully randomize the ISNs, the chances of 
a successful transition from TIME_WAIT to ESTABLISHED might be greater, 
but going back to the good old days of more or less purely clock-driven 
ISNs isn't likely.


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NET]: Please revert disallowing zero listen queues

2007-03-06 Thread Rick Jones

So we're not disallowing a backlog argument of zero to
listen().  We'll accept that just fine, the only thing that
happens is that you'll get what you ask for, that being
no connections :-)


I'm not sure where HP-UX inherited the 0 = 1 bit - perhaps from BSD - nor 
am I sure there is official chapter and verse, but:


<excerpt>
backlog is limited to the range of 0 to SOMAXCONN, which is defined in 
<sys/socket.h>.  SOMAXCONN is currently set to 4096.  If any other 
value is specified, the system automatically assigns the closest value 
within the range.  A backlog of 0 specifies only 1 pending connection 
is allowed at any given time.
</excerpt>

I don't have a Solaris, BSD or AIX manpage for listen handy to check 
them but would not be surprised to see they are similar.


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TCP 2MSL on loopback

2007-03-06 Thread Rick Jones
With transparent bridging, nobody knows how long the datagram may be 
out there.  Admittedly, the chances of a datagram living for a full 
two minutes these days are probably nil, but just being in the same IP 
subnet doesn't really mean anything when it comes to physical locality.



Bridging isn't necessarily a problem though. The 2MSL timeout is 
designed to prevent problems from delayed packets that got sent through 
multiple paths. In a bridging setup you don't allow multiple paths, 
that's what STP is designed to prevent. If you want to configure a 
network that allows multiple paths, you need to use a router, not a bridge.


Well, there is trunking at the data link layer, and in theory there 
could be an active-standby where the standby took a somewhat different path.


The timeout is also to cover datagrams which just got stuck somewhere 
too (IIRC) and may not necessarily require a multiple path situation.




SPECweb benchmarking has had to deal with the issue of attempted 
TIME_WAIT reuse going back to 1997.  It deals with it by not relying 
on the client's configured local/anonymous/ephemeral port number range 
and instead making explicit bind() calls in the (more or less) entire 
unpriv port range (actually it may just be from 5000 to 65535 but still)



That still doesn't solve the problem, it only ~doubles the available 
port range. That means it takes 0.6 seconds to trigger the problem 
instead of only 0.3 seconds...


True.  Thankfully, the web learned to use persistent connections, and 
later versions of SPECweb benchmarking make use of them.


In an environment where connections are opened and closed very quickly 
with only a small amount of data carried per connection, it might make 
sense to remember the last sequence number used on a port and use that 
as the floor of the next randomly generated ISN. Monotonically 
increasing sequence numbers aren't a security risk if there's still a 
randomly determined gap from one connection to the next. But I don't 
think it's necessary to consider this at the moment.


I thought that all the security types started squawking if the ISN 
wasn't completely random?


I've not tried this, but if a client does want to cycle through 
thousands of connections per second, and if it is the one to initiate 
connection close, would it be sufficient to only use something like:


socket()
bind()
loop:
connect()
request()
response()
shutdown(SHUT_RDWR)
goto loop

i.e., not call close() on the FD, so there is still a direct link to the 
connection in TIME_WAIT and one could in theory initiate a new connection 
from TIME_WAIT?  Then in theory the randomness could be _almost_ the 
entire sequence space, less the previous connection's window (IIRC).


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TCP 2MSL on loopback

2007-03-06 Thread Rick Jones
On the other hand, being able to configure a small MSL for the loopback 
device is perfectly safe. Being able to configure a small MSL for other 
interfaces may be safe, depending on the rest of the network layout.


A peanut gallery question - I seem to recall prior discussions about how 
one cannot assume that a packet destined for a given IP address will 
remain destined for that given IP address, as it could go through a module 
that rewrites headers etc.


Is traffic destined for 127.0.0.1 immune from that?

rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: all syscalls initially taking 4usec on a P4? Re: nonblocking UDPv4 recvfrom() taking 4usec @ 3GHz?

2007-02-20 Thread Rick Jones

I measure a huge slope, however. Starting at 1usec for back-to-back system
calls, it rises to 2usec after interleaving calls with a count to 20
million.

4usec is hit after 110 million.

The graph, with semi-scientific error-bars is on
http://ds9a.nl/tmp/recvfrom-usec-vs-wait.png

The code to generate it is on:
http://ds9a.nl/tmp/recvtimings.c

I'm investigating this further for other system calls. It might be that my
measurements are off, but it appears even a slight delay between calls
incurs a large penalty.


The slope appears to be flattening out the farther to the right it 
goes.  Perhaps that is the length of time it takes to take all the 
requisite cache misses.


Some judicious use of HW perf counters might be in order via say papi or 
pfmon.  Otherwise, you could try a test where you don't delay, but do 
try to blow-out the cache(s) between recvfrom() calls.  If the delay 
there starts to match the delay as you go out to the right on the graph 
it would suggest that it is indeed cache effects.
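
A minimal sketch of the cache-blowing idea (the 8 MB figure is an
assumption about the cache size, not a measured value):

#include <stddef.h>
#include <stdlib.h>

#define CACHE_BLOW_BYTES (8u * 1024 * 1024)  /* > last-level cache (assumed) */

/* Touch one byte per 64-byte line so prior state is evicted. */
static void blow_cache(volatile unsigned char *buf)
{
    for (size_t i = 0; i < CACHE_BLOW_BYTES; i += 64)
        buf[i]++;
}

int main(void)
{
    volatile unsigned char *buf = malloc(CACHE_BLOW_BYTES);

    if (!buf)
        return 1;
    /* ... time recvfrom() back-to-back, then: */
    blow_cache(buf);
    /* ... time recvfrom() again; if it now matches the right-hand side
       of the graph, cache misses are the likely culprit. */
    return 0;
}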


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: degradation in bridging performance of 5% in 2.6.20 when compared to 2.6.19

2007-02-16 Thread Rick Jones

kalyan tejaswi wrote:

Hi all,
I have been comparing bridging performance for the 2.6.20 and 2.6.19
kernels. The kernel configurations are identical for both kernels.
I use D-Link cards (8139too driver) for the Malta 4Kc board.

The setup is:

netperf client <---> malta 4Kc <---> netperf server

The throughput statistics (in 10^6 bits/second) are:

           2.6.19  2.6.20
routing    30.2    30.16
bridging   32.35   30.81

I observe that there has been a degradation in bridging performance of
5% in 2.6.20 when compared to 2.6.19.

Has anyone observed similar behaviour?
Any inputs or suggestions are welcome.


In each case is the malta CPU bound?  If not, some idea of the change in 
CPU util might be helpful.


rick jones
btw, netperf 2.4.3 just released:
ftp://ftp.netperf.org/netperf
http://www.netperf.org/svn/netperf2/tags/netperf-2.4.3
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FC5 iptables-restore failure

2007-02-15 Thread Dave Jones
On Thu, Feb 15, 2007 at 02:45:07AM -0800, Andrew Morton wrote:
  
  I've recently been noticing nasty messages come out of FC5:
  
  sony:/home/akpm# service iptables stop
  Flushing firewall rules:   [  OK  ]
  Setting chains to policy ACCEPT: filter[  OK  ]
  Unloading iptables modules:[  OK  ]
  sony:/home/akpm# service iptables start
  Applying iptables firewall rules: iptables-restore: line 20 failed
 [FAILED]
  
  Dunno when it started happening, but it's in mainline now.
  
  It's a pretty stupid error message.  line 20 of what?

2.6.18 -> 2.6.19 changed a bunch of netfilter config option names.
Sure you weren't bitten by that?

Dave

-- 
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] apply cwnd rules to FIN packets with data

2007-02-05 Thread Rick Jones

John Heffner wrote:

David Miller wrote:


However, I can't think of any reason why the cwnd test should not apply.



Care to elaborate here?  You can view the FIN special case as an off
by one error in the CWND test, it's not going to melt the internet.
:-)



True, it's not going to melt the internet, but why stop at one when two 
would finish the connection even faster?  Not sure I buy this argument. 
 Was there some benchmarking data that was a justification for this in 
the first place?


Is the cwnd in the stack byte based, or packet based?

While all the RFCs tend to discuss things in terms of byte-based cwnds 
and assumptions based on MSSes and such, the underlying principle was/is 
a conservation of packets.  As David said, a packet is a packet, and if 
one were going to be sending a FIN segment, it might as well carry data. 
 And if one isn't comfortable sending that one last data segment with 
the FIN because cwnd wasn't large enough at the time, should the FIN be 
sent at that point, even if it is waffer thin?
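
To make the packets-vs-bytes point concrete, here's a minimal sketch
(hypothetical names, not the actual Linux code paths) of a
packet-counting cwnd test with the off-by-one FIN allowance being
debated:

struct tcp_state {
    unsigned int packets_out;   /* packets in flight, not bytes */
    unsigned int snd_cwnd;      /* congestion window, in packets */
};

/* may this (possibly data-bearing) segment be sent now? */
static int cwnd_allows_send(const struct tcp_state *tp, int has_fin)
{
    if (tp->packets_out < tp->snd_cwnd)
        return 1;
    /* a packet is a packet, so let the FIN - and any last sub-MSS
     * chunk of data riding along with it - slip one past cwnd */
    return has_fin;
}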


rick jones
2 cents tossed-in from the peanut gallery
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: meaningful spinlock contention when bound to non-intr CPU?

2007-02-02 Thread Rick Jones

SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT  SPIN RJECT  NAME

  7.4%  2.8%   0.1us( 143us)  3.3us( 147us)( 1.4%)  75262432  97.2%  2.8%    0%  lock_sock_nested+0x30
 29.5%  6.6%   0.5us( 148us)  0.9us( 143us)(0.49%)  37622512  93.4%  6.6%    0%  tcp_v4_rcv+0xb30
  3.0%  5.6%   0.1us( 142us)  0.9us( 143us)(0.14%)  13911325  94.4%  5.6%    0%  release_sock+0x120
  9.6% 0.75%   0.1us( 144us)  0.7us( 139us)(0.08%)  75262432  99.2% 0.75%    0%  release_sock+0x30
...
Still, does this look like something worth pursuing?  In a past life/OS
when one was able to eliminate one percentage point of spinlock
contention, two percentage points of improvement ensued.



Rick, this looks like good stuff, we're seeing more and more issues
like this as systems become more multi-core and have more interrupts
per NIC (think MSI-X)


MSI-X - haven't even gotten to that - discussion of that probably 
overlaps with some pci mailing list right?



Let me know if there is something I can do to help.


I suppose one good step would be to reproduce the results on some other 
platform.  After that, I need to understand what those routines are 
doing much better than I currently do, particularly from an 
architecture perspective - I think that it may involve all the 
prequeue/try to get the TCP processing on the user's stack stuff but I'm 
_far_ from certain.


rick jones

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: meaningful spinlock contention when bound to non-intr CPU?

2007-02-02 Thread Rick Jones

Andi Kleen wrote:

Rick Jones [EMAIL PROTECTED] writes:


Still, does this look like something worth persuing?  In a past
life/OS when one was able to eliminate one percentage point of
spinlock contention, two percentage points of improvement ensued.



The stack is really designed to go fast with per-CPU local RX processing 
of packets. This normally works because on waking up a task 
the scheduler tries to move it to that CPU. Since the wakeups are
on the CPU that processes the incoming packets, it should usually
end up correctly.

The trouble is when your NICs are so fast that a single
CPU can't keep up, or when you have programs that process many
different sockets from a single thread.

The fast NIC case will be eventually fixed by adding proper
support for MSI-X and connection hashing. Then the NIC can fan 
out to multiple interrupts and use multiple CPUs to process
the incoming packets. 


If that is implemented well (for some definition of "well") then it 
might address the "many sockets from a thread" issue too, but if not...


If it is a simple hash on the headers then you still have issues with a 
process/thread servicing multiple connections - the hashes of the different 
headers will take things up different CPUs and induce the 
scheduler to flip the process back and forth between them.


The meta question behind all that would seem to be whether the scheduler 
should be telling us where to perform the network processing, or should 
the network processing be telling the scheduler what to do? (eg all my 
old blathering about IPS vs TOPS in HP-UX...)


Then there is the case of a single process having many 
sockets from different NICs This will be of course somewhat slower
because there will be cross CPU traffic. 


The extreme case I see with the netperf test suggests it will be a 
pretty big hit.  Dragging cachelines from CPU to CPU is evil.  Sometimes 
a necessary evil of course, but still evil.



However there should
not be much socket lock contention, because a process handling
many sockets will hopefully be unlikely to bang on each of
its many sockets at exactly the same time as the stack
receives RX packets. This should also eliminate the spinlock
contention.

From that theory your test sounds somewhat unrealistic to me. 


Do you have any evidence you're modelling a real world scenario
here? I somehow doubt it.


Well, yes and no.  If I drop the burst and instead have N times more 
netperf's going, I see the same lock contention situation.  I wasn't 
expecting to - thinking that if there were N different processes on 
each CPU the likelihood of there being contention on any one socket 
was low, but it was there just the same.


That is part of what makes me wonder if there is a race between wakeup 
and release of a lock.



rick
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: meaningful spinlock contention when bound to non-intr CPU?

2007-02-02 Thread Rick Jones

Andi Kleen wrote:
The meta question behind all that would seem to be whether the scheduler 
should be telling us where to perform the network processing, or should 
the network processing be telling the scheduler what to do? (eg all my 
old blathering about IPS vs TOPS in HP-UX...)



That's an unsolved problem.  But past experiments suggest that giving
the scheduler more imperatives than just use CPUs well are often net-losses.


I wasn't thinking about giving the scheduler more imperatives really 
(?), just letting networking know more about where the threads 
accessing given connections executed. (eg TOPS)


I suspect it cannot be completely solved in the general case. 


Not unless the NIC can peer into the connection table and see where each 
connection was last accessed by user-space.


Well, yes and no.  If I drop the burst and instead have N times more 
netperf's going, I see the same lock contention situation.  I wasn't 
expecting to - thinking that if there were N different processes on 
each CPU the likelihood of there being contention on any one socket 
was low, but it was there just the same.


That is part of what makes me wonder if there is a race between wakeup 



A race?


Perhaps a poor choice of words on my part - something along the lines of:

hold_lock();
wake_up_someone();
release_lock();

where the someone being awoken can try to grab the lock before the path 
doing the waking manages to release it.






and release of a lock.



You could try with echo 1 > /proc/sys/net/ipv4/tcp_low_latency.
That should change RX locking behaviour significantly.


Running the same 8 netperf's with TCP_RR and burst bound to different 
CPU than the NIC interrupt, the lockmeter output looks virtually 
unchanged.  Still release_sock, tcp_v4_rcv, lock_sock_nested at their 
same offsets.


However, if I run the multiple-connection-per-thread code, and have each 
service 32 concurrent connections, and bind to a CPU other than the 
interrupt CPU, the lock contention in this case does appear to go away.


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: meaningful spinlock contention when bound to non-intr CPU?

2007-02-02 Thread Rick Jones

Yes the wakeup happens deep inside the critical section and if the process
is running on another CPU it could race to the lock.

Hmm, I suppose the wakeup could be moved out, but it would need some 
restructuring of the code. Also to be safe the code would still need

to at least hold a reference count of the sock during the wakeup, and
when that is released then you have another cache line to bounce,
which might not be any better than the lock. So it might not be
actually worth it.

I suppose the socket release could be at least partially protected with
RCU against this case so that could be done without a reference count, but 
it might be tricky to get this right.


Again still not sure it's worth handling this.


Based on my experiments thus far I'd have to agree/accept (I wasn't 
certain to begin with - hence the post in the first place :)  but I do 
need/want to see what happens with a single-stream through a 10G NIC - 
on the receive side at least with a 1500 byte MTU.


I was using the burst-mode aggregate RR over the 1G NICs to get the CPU 
util up without need for considerable bandwidth, since the system 
handled 8 TCP_STREAM tests across the 8 NICs without working-up a sweat. 
 I suppose I could instead chop the MTU on the 1G NICs and use that to 
increase the CPU util on the receive side.


rick
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


meaningful spinlock contention when bound to non-intr CPU?

2007-02-01 Thread Rick Jones
For various nefarious porpoises relating to comparing and contrasting a 
single 10G NIC with N 1G ports and hopefully finding interesting 
processor cache (mis)behaviour in the stack, I got my hands on a pair of 
8 core systems with plenty of RAM and I/O slots.  (rx6600 with 1.6 GHz 
dual-core Itanium2, aka Montecito)


A 2.6.10-rc5 kernel onto each system thanks to pointers from Dan Frazier.

Into each went a quartet of dual-port 1G NICs driven by e1000 
7.3.15-k2-NAPI and I connected them back to back.  I tweaked 
smp_affinity to have each port's interrupts go to a separate core.


Netperf2 configured with --enable-burst.

When I run eight concurrent netperf TCP_RR tests, each doing 24 
concurrent single-byte transactions (test-specific -b 24), TCP_NODELAY 
set (test-specific -D), and bind each netserver/netperf to the same CPU 
as is taking the interrupts of the NIC handling that connection (global 
-T) I see things looking pretty good.  Decent aggregate transactions per 
second, and nothing in the CPU profiles to suggest spinlock contention.
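
For reference, each of the eight instances amounted to something along
the lines of (remote hostname and CPU id illustrative):

netperf -t TCP_RR -H remotehost -T 3,3 -- -b 24 -D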


Happiness and joy.  An N CPU system behaving (at this level at least) 
like N, 1 CPU systems.


When I then decide to bind the netperf/netservers to CPU(s) other than 
the ones taking the interrupts from the NIC(s) the aggregate 
transactions per second drops by roughly 40/135 or ~30%.  I was indeed 
expecting a delta - no idea if that is in the realm of "to be expected" 
- but decided to go ahead and look at the profiles.


The profiles (either via q-syscollect or caliper) show upwards of 3% of 
the CPU consumed by spinlock contention (ie time spent in 
ia64_spinlock_contention). (I'm guessing some of the rest of the perf 
drop comes from those interesting cache behaviours still to be sought)


With some help from Lee Schermerhorn and Alan Brunelle I got a lockmeter 
kernel going, and it is suggesting that the greatest spinlock contention 
comes from the routines:


SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT  SPIN RJECT  NAME

  7.4%  2.8%   0.1us( 143us)  3.3us( 147us)( 1.4%)  75262432  97.2%  2.8%    0%  lock_sock_nested+0x30
 29.5%  6.6%   0.5us( 148us)  0.9us( 143us)(0.49%)  37622512  93.4%  6.6%    0%  tcp_v4_rcv+0xb30
  3.0%  5.6%   0.1us( 142us)  0.9us( 143us)(0.14%)  13911325  94.4%  5.6%    0%  release_sock+0x120
  9.6% 0.75%   0.1us( 144us)  0.7us( 139us)(0.08%)  75262432  99.2% 0.75%    0%  release_sock+0x30


I suppose it stands to some reason that there would be contention 
associated with the socket since there will be two things going for the 
socket (a netperf/netserver and an interrupt/upthestack) each running on 
separate CPUs.  Some of it looks like it _may_ be inevitable? - 
waking-up the user who will now  be racing to grab the socket before the 
stack releases it? (I may have been mis-interpreting some of the code I 
was checking)


Still, does this look like something worth pursuing?  In a past life/OS 
when one was able to eliminate one percentage point of spinlock 
contention, two percentage points of improvement ensued.


rick jones

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: meaningful spinlock contention when bound to non-intr CPU?

2007-02-01 Thread Rick Jones

Rick Jones wrote:

A 2.6.10-rc5 kernel onto each system thanks to pointers from Dan Frazier.


gaak - 2.6.20-rc5 that is.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: why would EPIPE cause socket port to change?

2007-01-23 Thread Rick Jones

Herbert Xu wrote:

dean gaudet [EMAIL PROTECTED] wrote:

in the test program below the getsockname result on a TCP socket changes 
across a write which produces EPIPE... here's a fragment of the strace:


getsockname(3, {sa_family=AF_INET, sin_port=htons(37636), 
sin_addr=inet_addr("127.0.0.1")}, [17863593746633850896]) = 0
...
write(3, "hi!\n", 4) = 4
write(3, "hi!\n", 4) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
getsockname(3, {sa_family=AF_INET, sin_port=htons(59882), 
sin_addr=inet_addr("127.0.0.1")}, [16927060683038654480]) = 0

why does the port# change?  this is on 2.6.19.1.



Prior to the last write, the socket entered the CLOSED state meaning
that the old port is no longer allocated to it.  As a result, the
last write operates on an unconnected socket which causes a new local
port to be allocated as an autobind.  It then fails because the socket
is still not connected.

So any attempt to run getsockname after an error on the socket is
simply buggy.


But it falls within the principle of least surprise, doesn't it?  Unless the 
application has called close() or bind(), it does seem like a reasonable 
expectation that the port assignment is not changed.


(fwiw this is one of two reasons i've found for libnss-ldap to leak 
sockets... causing nscd to crash.)


Of course, that seems rather odd too - why does libnss-ldap check the 
socket name on a socket after an EPIPE anyway?


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Two Dual Core processors and NICS (not handling interrupts on one CPU/assigning a Two Dual Core processors and NICS (not handling interrupts on one CPU / assigning a CPU to a NIC)

2007-01-16 Thread Rick Jones

Mark Ryden wrote:

Hello,


I have a machine with 2 dual core CPUs. This machine runs Fedora Core 6.
I have two Intel e1000 GigaBit network cards on this machine; I use 
bonding so that the machine assigns the same IP address to both NICs;
it seems to me that bonding is configured OK, because when running:
cat /proc/net/bonding/bond0
I get:
...
Permanent HW addr: 

(And the Permanent HW addr is different in these two entries).

I send a large amount of packets to this machine (more than 20,000 in
a second).


Well, 20K a second is large in some contexts, but not in others :)


cat /proc/interrupts shows something like this:
CPU0   CPU1 CPU2 CPU3
50:3359337  0  0  0 PCI-MSI  eth0
58: 493396136  0  0 PCI-MSI  eth1

CPU0 and CPU1 are of the first CPU as far as I understand; so this
means, as far as I understand, that the second CPU (which has CPU2 and
CPU3) does not handle interrupts of the arrived packets; can I
somehow change it so the second
CPU will also handle network interrupts of receiving packets on the
nic ?


Actually, those could be different chips - it depends on the CPUs I 
think, and I suppose the BIOS/OS.  On a Woodcrest system with which I've 
been playing, CPUs 0 and 2 appear to be on the same die, then 1 and 
3.  I ass-u-me-d the numbering was that way to get maximum 
processor cache when saying "numcpu=N" for something less than the 
number of cores in the system.


NUMA considerations might come into play if this is Opteron (well, any 
NUMA system really - larger IA64's, certain SPARC and Power systems 
etc...).  In broad handwaving terms, one is better-off with the NICs 
interrupts being handled by the topologically closest CPU.  (Not that 
some irqbalancer programs recognize that just yet :)


Now, if both CPU0 and CPU1 are saturated it might make sense to put some 
interrupts on 2 and/or 3.  One of those fun "it depends" situations.
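
For completeness, moving eth1's interrupts in the snippet above onto,
say, CPU2 would be something like (IRQ number taken from the
/proc/interrupts output, mask assuming CPU2 is bit 2):

echo 4 > /proc/irq/58/smp_affinity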


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network card IRQ balancing with Intel 5000 series chipsets

2007-01-02 Thread Rick Jones

The best way to achieve such balancing is to have the network card help
and essentially be able to select the CPU to notify while at the same
time considering:
a) avoiding any packet reordering - which restricts a flow to be
processed to a single CPU at least within a timeframe
b) be per-CPU-load-aware - which means to busy out only CPUs which are
less utilized

Various such schemes have been discussed here but no vendor is making
such nics today (search Daves Blog - he did discuss this at one point or
other).


I thought that Neterion were doing something along those lines with 
their Xframe II NICs - perhaps not CPU loading aware, but doing stuff to 
spread the work of different connections across the CPUs.


I would add a:

c) some knowledge of the CPU on which the thread accessing the socket 
for that connection will run.  This could be as simple as the CPU on 
which the socket was last accessed.  Having a _NIC_ know this sort of 
thing is somewhat difficult and expensive (perhaps too much so).  If a 
NIC simply hashes the connection idendifiers you then have the issue of 
different connections, each owned/accessed by one thread, taking 
different paths through the system.  No issues about reordering, but 
perhaps some on cache lines going hither and yon.


The question boils down to - Should the application (via the scheduler) 
dictate where its connections are processed, or should the connections 
dictate where the application runs?


rick jones

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network card IRQ balancing with Intel 5000 series chipsets

2007-01-02 Thread Rick Jones

With NAPI, if I have a few interrupts it likely implies I have a huge
network load (and therefore CPU use) and would be much happier if
you didn't start moving more interrupt load to that already loaded CPU



current irqbalance accounts for napi by using the number of packets as
indicator for load, not the number of interrupts. (for network
interrupts obviously)


And hopefully some knowledge of NUMA so it doesn't balance the 
interrupts of a NIC to some far-off (topology-wise) CPU...


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network drivers that don't suspend on interface down

2006-12-20 Thread Rick Jones

There are two different problems:

1) Behavior seems to be different depending on device driver
   author. We should document the expected semantics better.

   IMHO:
When device is down, it should:
 a) use as few resources as possible:
   - not grab memory for buffers
   - not assign IRQ unless it could get one
   - turn off all power consumption possible
 b) allow setting parameters like speed/duplex/autonegotiation,
ring buffers, ... with ethtool, and remember the state
 c) not accept data coming in, and drop packets queued


What implications does c have for something like tcpdump?

rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: network devices don't handle pci_dma_mapping_error()'s

2006-12-06 Thread Rick Jones

David Miller wrote:

From: Stephen Hemminger [EMAIL PROTECTED]
Date: Wed, 6 Dec 2006 16:58:35 -0800



The more robust way would be to stop the queue (like flow control)
and return busy. You would need a timer though to handle the case
where some disk i/o stole all the mappings and then network device flow
blocked.



You need some kind of fairness, yes, that's why I suggested a
callback.  When your DMA allocation fails, you get into the rear of
the FIFO, when a free occurs, we callback starting from the head of
the FIFO.  You don't get removed from the FIFO unless at least one of
your DMA allocation retries succeed.


While tossing a TCP|UDP|SCTP|etc packet could be plusungood, especially 
if the IOMMU fills frequently (for some suitable definition of 
frequently), is it really worth the effort to save say an ACK?
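
For reference, the sort of check being discussed would look roughly
like this in a driver's transmit path (a sketch only - the helper is
hypothetical, and the requeue/stop-the-queue variant Stephen describes
would replace the simple drop):

#include <linux/errno.h>
#include <linux/netdevice.h>
#include <linux/pci.h>
#include <linux/skbuff.h>

/* map a tx skb, tossing the packet (rather than handing the device a
 * bogus mapping) when the IOMMU is exhausted */
static int map_tx_skb(struct pci_dev *pdev, struct sk_buff *skb,
                      dma_addr_t *mapping)
{
        *mapping = pci_map_single(pdev, skb->data, skb->len,
                                  PCI_DMA_TODEVICE);
        if (pci_dma_mapping_error(*mapping)) {
                /* cheap policy: drop and let TCP et al recover */
                dev_kfree_skb_any(skb);
                return -ENOMEM;
        }
        return 0;
}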


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Max number of TCP sessions

2006-11-16 Thread Rick Jones
On Thu, 2006-11-16 at 20:23 +, James Courtier-Dutton wrote: 
 Hi,
 
 For a host using a Pentium 4 CPU at 2.8GHz, what is a sensible max value 
 for number of TCP sessions this host could run under Linux?
 Bandwidth per TCP session is likely to be about 10kbytes/second.

To a first order, and assuming that there is nearly no user-space
processing for those TCP connections (TCP is a transport not a session
protocol :) you could take a netperf TCP_RR test result - using the
service demand - usec of CPU/KB transferred you could then do some back
of the envelope calculations as to the number of 10 KByte/s connections
you could support.  It would be a bit of handwaving, but give yourself
say a 20% pad and you'll probably be OK.
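
Purely for illustration, suppose the measured service demand came out
at 5 usec of CPU per KB: each 10 KByte/s connection would then consume
~50 usec of CPU per second, so one fully-utilized CPU (1,000,000 usec/s)
could in principle carry ~20,000 such connections; the 20% pad brings
that back to roughly 16,000.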

rick jones

 
 Kind Regards
 
 James
 -
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.19-rc1: Volanomark slowdown

2006-11-08 Thread Rick Jones
On Wed, 2006-11-08 at 23:10 +0100, Olaf Kirch wrote:
 What I'm saying though is that it doesn't rhyme with what I've seen of
 Volanomark - we ran 2.6.16 on a 4p Intel box for instance and it didn't
 come close to saturating a Gigabit pipe before it maxed out on CPU load.

That actually supports the hypothesis, doesn't it?  The issue being the
increased number of ACKs causing additional CPU overhead, not saturating
a NIC, if any is involved.

One of these days I may have to try to look more closely at what volano
does relative to netperf - I remember that someone tried very hard (was
it you, Alexey?) to show a performance effect with netperf and it didn't do
it :(

rick jones

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] bcm43xx: Readd dropped assignment

2006-10-20 Thread Dave Jones
On Wed, Oct 18, 2006 at 04:40:00PM +0200, Michael Buesch wrote:
  On Wednesday 18 October 2006 01:12, Daniel Drake wrote:
   Larry Finger pointed out a problem with my ieee80211 IV/ICV stripping 
   patch,
   which I forgot about. Sorry about that.
   
   The patch re-adds the frame_ctl assignment which was accidentally dropped.
   
   Signed-off-by: Daniel Drake [EMAIL PROTECTED]
  
  Whoops. Please merge this as fast as possible, John.
  That's a real bug which prevents RX from working.

Is that one for -stable too? That file looks similar enough
between .18.1 and .19rc that it should be the case?

Dave

-- 
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] netpoll: rework skb transmit queue

2006-10-20 Thread Dave Jones
On Fri, Oct 20, 2006 at 01:25:32PM -0700, Stephen Hemminger wrote:
  On Fri, 20 Oct 2006 12:52:26 -0700 (PDT)
  David Miller [EMAIL PROTECTED] wrote:
  
   From: Stephen Hemminger [EMAIL PROTECTED]
   Date: Fri, 20 Oct 2006 12:25:27 -0700
   
Sorry, but why should we treat out-of-tree vendor code any
differently than out-of-tree other code.
   
   I think what netdump was trying to do, provide a way to
   requeue instead of fully drop the SKB, is quite reasonable.
   Don't you think?
  
  
  Netdump doesn't even exist in the current Fedora source rpm.
  I think Dave dropped it.

Indeed. Practically no-one cared about it, so it bit-rotted
really fast after we shipped RHEL4.  That, along with the focus
shifting to making kdump work seemed to kill it off over the last
12 months.

Dave

-- 
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BCM5461 phy issue in 10M/Full duplex

2006-10-18 Thread Rick Jones

Kumar Gala wrote:
I was wondering if anyone has had any issues when trying to force a  
BCM5461 phy into 10M/full duplex.  I seem to be having an issue with the 
two managed switches I've tried this on: they autoneg to 10/half.  This 
causes a problem in that I start seeing a large number of frame errors.


I believe, but need to double check, that if I leave the BCM5461 in  
autoneg, and force the switch to 10M/full, that the BCM5461 will autoneg 
at 10M/half duplex.


Indeed, if one side is hardcoded, autoneg will fail and the side trying to 
autoneg is required by the specs (not that I know chapter and verse to quote 
from the IEEE stuff :() to go into half-duplex.


Was 10M/full-duplex ever standardized?  If not, I could see where kit might not be 
willing/able to autoneg to that.
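
(If one does decide to hardcode both ends anyway, the usual incantation
would be something along the lines of:

ethtool -s eth0 speed 10 duplex full autoneg off

with the switch port forced to match, of course.)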



Just wondering if anyone else has seen similar behavior with this PHY.

thanks

- kumar
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Remove useless comment from sb1250

2006-10-17 Thread Dave Jones
Signed-off-by: Dave Jones [EMAIL PROTECTED]

diff --git a/drivers/net/sb1250-mac.c b/drivers/net/sb1250-mac.c
index db23249..1eae16b 100644
--- a/drivers/net/sb1250-mac.c
+++ b/drivers/net/sb1250-mac.c
@@ -2903,7 +2903,7 @@ #endif
 
dev = alloc_etherdev(sizeof(struct sbmac_softc));
if (!dev)
-   return -ENOMEM; /* return ENOMEM */
+   return -ENOMEM;
 
 printk(KERN_DEBUG "sbmac: configuring MAC at %lx\n", port);
 

-- 
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Suppress / delay SYN-ACK

2006-10-13 Thread Rick Jones

Eric Dumazet wrote:

Rick Jones a écrit :


More to the point, on what basis would the application be rejecting a
connection request based solely on the SYN?



True, it isn't like there would suddenly be any call user data as in 
XTI/TLI.



DATA payload could be included in the SYN packet. TCP specs allow this 
AFAIK.


Yes, but it isn't supposed to be delivered until the 3-way handshake is complete 
right?


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


getaddrinfo - should accept IPPROTO_SCTP no?

2006-10-13 Thread Rick Jones
I made some recent changes to netperf to work around what is IMO a bug in the 
Solaris getaddrinfo() where it will clear the ai_protocol field even when one 
gives it a protocol in the hints.


[If you happen to be trying to use the test-specific -D to set TCP_NODELAY in 
netperf on Solaris, you might want to grab netperf TOT to get this workaround as 
it relates to issues with setting TCP_NODELAY - modulo what it will do to being 
able to run the netperf SCTP tests on Linux...]


In the process though I have stumbled across what appears to be a bug (?) in 
Linux getaddrinfo() - returning a -7 EAI_SOCKTYPE if given as hints 
SOCK_STREAM and IPPROTO_SCTP - this on a system that ostensibly supports SCTP. 
I've seen this on RHAS4U4 as well as another less well known distro.


I'm about to see about concocting an additional workaround in netperf for this, 
but thought I'd ask if my assumption - that getaddrinfo() returning -7 when 
given IPPROTO_SCTP - is indeed a bug in getaddrinfo().  Or am I just woefully 
behind in patches or completely offbase on what is correct behaviour for 
getaddrinfo and hints?


FWIW, which may not be much, Solaris 10 06/06 seems content to accept 
IPPROTO_SCTP in the hints.
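
For concreteness, here is roughly the failing call and the shape of the
workaround - the hostname/port and the retry policy are illustrative,
not netperf's actual code:

#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
        struct addrinfo hints, *res = NULL;
        int err;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family   = AF_INET;
        hints.ai_socktype = SOCK_STREAM;
        hints.ai_protocol = IPPROTO_SCTP; /* -7/EAI_SOCKTYPE on affected systems */

        err = getaddrinfo("somehost", "12345", &hints, &res);
        if (err == EAI_SOCKTYPE) {
                /* workaround: drop the protocol hint, then put it back */
                hints.ai_protocol = 0;
                err = getaddrinfo("somehost", "12345", &hints, &res);
                if (!err)
                        res->ai_protocol = IPPROTO_SCTP;
        }
        if (err) {
                fprintf(stderr, "getaddrinfo: %s (%d)\n", gai_strerror(err), err);
                return 1;
        }
        freeaddrinfo(res);
        return 0;
}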


thanks,

rick jones
http://www.netperf.org/svn/netperf2/trunk/
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Suppress / delay SYN-ACK

2006-10-13 Thread Rick Jones
DATA payload could be included in the SYN packet. TCP specs allow 
this AFAIK.
Yes, but it isn't supposed to be delivered until the 3-way handshake 
is complete right?

Are you speaking of the 20-year-old BSD API? :)


Nope - the bits in the RFCs about data not being delivered until the ISN's are 
validated.  I may have some of the timing a bit wrong though.


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


sfuzz hanging on 2.6.18

2006-10-12 Thread Dave Jones
sfuzz.c (google for it if you don't have it already) used to
run forever (or until I got bored and ctrl-c'd it) as long
as it didn't trigger an oops or the like in 2.6.17

Running it against 2.6.18, I notice that it runs for a while,
and then gets totally wedged.  It doesn't respond to any signals,
can't be ptraced, and even strace subsequently gets wedged.
The machine responds, and is still interactive, but that process
is hosed.

sysrq-t shows it stuck here..

sfuzz D 724EF62A  2828 28717  28691 (NOTLB)
   cd69fe98 0082 012d 724ef62a 0001971a 0010 0007 df6d22b0 
   dfd81080 725bbc5e 0001971a 000cc634 0001 df6d23bc c140e260 0202 
   de1d5ba0 cd69fea0 de1d5ba0   de1d5b60 de1d5b8c de1d5ba0 
Call Trace:
 [c05b1708] lock_sock+0x75/0xa6
 [e0b0b604] dn_getname+0x18/0x5f [decnet]
 [c05b083b] sys_getsockname+0x5c/0xb0
 [c05b0b46] sys_socketcall+0xef/0x261
 [c0403f97] syscall_call+0x7/0xb
DWARF2 unwinder stuck at syscall_call+0x7/0xb

I wonder if the plethora of lockdep-related changes inadvertently broke 
something?

Dave

-- 
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Suppress / delay SYN-ACK

2006-10-12 Thread Rick Jones

Martin Schiller wrote:

Hi!

I'm searching for a solution to suppress / delay the SYN-ACK packet of a
listening server (-application) until he has decided (e.g. analysed the
requesting ip-address or checked if the corresponding other end of a
connection is available) if he wants to accept the connect request of the
client. If not, it should be possible to reject the connect request.


How often do you expect the incomming call to be rejected?  I suspect that would 
have a significant effect on whether the whole thing is worthwhile.


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Suppress / delay SYN-ACK

2006-10-12 Thread Rick Jones

More to the point, on what basis would the application be rejecting a
connection request based solely on the SYN?


True, it isn't like there would suddenly be any call user data as in XTI/TLI.


There are only two pieces of information available: the remote IP
address and port, and the total number of pending requests. The
latter is already addressed through the backlog size, and netfilter
rules can already be used to reject based on IP address.


It would though allow an application to have an even more restricted set of 
allowed IP's than was set in netfilter.  Rather like allowing the application to 
set socket buffer sizes rather than relying on the system's default.


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mii-tool gigabit support.

2006-09-29 Thread Rick Jones

2) develop some style of register description definition type of text
file, maybe XML, maybe INI style, or something stored in /etc/ethtool as
drivername.conf or something like that.  This way, ethtool doesn't have
to be changed/updated/patched/likely-bug-added for every single device
known to man.


Just a thought.


 
We could switch to shared libraries like 'tc' uses.


From a practical standpoint, is shipping a new config file or a new shared 
library all that much different from a new ethtool binary?


rick
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][BNX2]: Disable MSI on 5706 if AMD 8132 bridge is present

2006-09-29 Thread Rick Jones

It absolutely was not vague, it gave an explicit description of what
the problem was, down to the transaction type being used by 5706 and
what the stated rules are in the PCI spec, and it also gave a clear
indication that the 5706 was in the wrong and that this was believed
to be a unique situation.


I'm not disagreeing with a per-driver check at the moment, but I thought that 
Michael told us that the masking being attempted by the 5706 was legal:


Michael Chan wrote:

MSI is defined to be 32-bit write.  The 5706 does 64-bit MSI writes
with byte enables disabled on the unused 32-bit word.  This is legal
but causes problems on the AMD 8132 which will eventually stop
responding after a while.





rick jones

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mii-tool gigabit support.

2006-09-27 Thread Rick Jones
With mii-tool we can do the command below and work with a half duplex 
hub and a full duplex switch.

mii-tool -A 10baseT-FD,10baseT-HD eth0


Why, and how often, is that really necessary?

rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mii-tool gigabit support.

2006-09-27 Thread Rick Jones

Auke Kok wrote:

Rick Jones wrote:

With mii-tool we can do the command below and work with a half duplex 
hub and a full duplex switch.

mii-tool -A 10baseT-FD,10baseT-HD eth0



Why, and how often, is that really necessary?



This is a bit of a hypothetical discussion of course, but I can imagine 
a lot of users with 100mbit switches in their homes (imagine all the 
DSL/cable routers out there...) that want to stop their nic from 
attempting to negotiate 1000mbit.


That would be covered by autosense, right?  IIRC there haven't been issues with 
speed sensing, just duplex negotiation, right?


Another scenario: forcing the NIC to negotiate only full-duplex speeds. 
Not only fun if you try it against a hub, but possibly useful.


For us it's much more interesting because we try every damn impossible 
configuration anyway and see what gives (or breaks).


Anyway, a patch to make ethtool do this was merged as Jeff Kirsher 
pointed out, so you can do this now with ethool too.


I'm just worried (as in Fear Uncertainty and Doubt) that having people set the 
allowed things to negotiate isn't really any more robust than straight-up 
hardcodes and perpetuates the (IMO) myth that one shouldn't autoneg on general 
principle.


rick



Cheers,

Auke


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: tc related lockdep warning.

2006-09-26 Thread Dave Jones
On Tue, Sep 26, 2006 at 06:15:21PM +0200, Patrick McHardy wrote:
  Patrick McHardy wrote:
   jamal wrote:
   
  Yes, that looks plausible. Can you try making those changes and see if
  the warning is gone?
   
  
   I think this points to a bigger brokenness caused by the move of
   dev->qdisc to RCU. It means destruction of filters and actions doesn't
   necessarily happen in user context and thus isn't protected by the rtnl
   anymore.
  
  I looked into this and we indeed still have lots of problems from that
  broken RCU patch. Basically all locking (qdiscs, classifiers, actions,
  estimators) assumes that updates are only done in process context and
  thus read_lock doesn't need bottom half protection. Quite a few things
  also assume that updates only happen under the RTNL and don't need
  any further protection if not used during packet processing.
  
  Instead of fixing all this I suggest something like this (untested)
  patch instead. Since only the dev->qdisc pointer is protected by RCU,
  but enqueue and the qdisc tree are still protected by dev->qdisc_lock,
  we can perform destruction of the tree immediately and only do the
  final free in the rcu callback, as long as we make sure not to enqueue
  anything to a half-way destroyed qdisc.

With this patch, I get no lockdep warnings, but the machine locks up completely.
I hooked up a serial console, and found this..


u32 classifier
Performance counters on
input device check on 
Actions configured 
BUG: warning at net/sched/sch_htb.c:395/htb_safe_rb_erase()

Call Trace:
 [8026f79b] show_trace+0xae/0x336
 [8026fa38] dump_stack+0x15/0x17
 [8860a171] :sch_htb:htb_safe_rb_erase+0x3b/0x55
 [8860a4d5] :sch_htb:htb_deactivate_prios+0x173/0x1cd
 [8860b437] :sch_htb:htb_dequeue+0x4d0/0x856
 [8042dc0d] __qdisc_run+0x3f/0x1ca
 [802329a6] dev_queue_xmit+0x137/0x268
 [8025b4a2] neigh_resolve_output+0x249/0x27e
 [802353fd] ip_output+0x210/0x25a
 [8043ce28] ip_push_pending_frames+0x37c/0x45b
 [8044ffd7] icmp_push_reply+0x13b/0x148
 [80450900] icmp_send+0x366/0x3d3
 [802568a9] udp_rcv+0x53d/0x556
 [80237e73] ip_local_deliver+0x1a3/0x26b
 [80238ec8] ip_rcv+0x4b9/0x501
 [802218bb] netif_receive_skb+0x33d/0x3c9
 [881f6348] :e1000:e1000_clean_rx_irq+0x450/0x4fe
 [881f47eb] :e1000:e1000_clean+0x88/0x17d
 [8020cab3] net_rx_action+0xac/0x1d1
 [80212725] __do_softirq+0x68/0xf5
 [80262638] call_softirq+0x1c/0x28
DWARF2 unwinder stuck at call_softirq+0x1c/0x28
Leftover inexact backtrace:
 IRQ [80270aaa] do_softirq+0x39/0x9f
 [80296102] irq_exit+0x57/0x59
 [80270c0d] do_IRQ+0xfd/0x107
 [8025b51d] mwait_idle+0x0/0x54
 [802618c6] ret_from_intr+0x0/0xf
 EOI [80265e66] __sched_text_start+0xaa6/0xadd
 [8025b55c] mwait_idle+0x3f/0x54
 [8025b526] mwait_idle+0x9/0x54
 [8024c81c] cpu_idle+0xa2/0xc5
 [8026e519] rest_init+0x2b/0x2d
 [80a7f811] start_kernel+0x24a/0x24c
 [80a7f28b] _sinittext+0x28b/0x292

BUG: warning at net/sched/sch_htb.c:395/htb_safe_rb_erase()

Call Trace:
 [8026f79b] show_trace+0xae/0x336
 [8026fa38] dump_stack+0x15/0x17
 [8860a171] :sch_htb:htb_safe_rb_erase+0x3b/0x55
 [8860a4d5] :sch_htb:htb_deactivate_prios+0x173/0x1cd
 [8860b437] :sch_htb:htb_dequeue+0x4d0/0x856
 [8042dc0d] __qdisc_run+0x3f/0x1ca
 [802329a6] dev_queue_xmit+0x137/0x268
 [8025b4a2] neigh_resolve_output+0x249/0x27e
 [802353fd] ip_output+0x210/0x25a
 [8043ce28] ip_push_pending_frames+0x37c/0x45b
 [8044ffd7] icmp_push_reply+0x13b/0x148
 [80450900] icmp_send+0x366/0x3d3
 [802568a9] udp_rcv+0x53d/0x556
 [80237e73] ip_local_deliver+0x1a3/0x26b
 [80238ec8] ip_rcv+0x4b9/0x501
 [802218bb] netif_receive_skb+0x33d/0x3c9
 [881f6348] :e1000:e1000_clean_rx_irq+0x450/0x4fe
 [881f47eb] :e1000:e1000_clean+0x88/0x17d
 [8020cab3] net_rx_action+0xac/0x1d1
 [80212725] __do_softirq+0x68/0xf5
 [80262638] call_softirq+0x1c/0x28
DWARF2 unwinder stuck at call_softirq+0x1c/0x28
Leftover inexact backtrace:
 IRQ [80270aaa] do_softirq+0x39/0x9f
 [80296102] irq_exit+0x57/0x59
 [80270c0d] do_IRQ+0xfd/0x107
 [8025b51d] mwait_idle+0x0/0x54
 [802618c6] ret_from_intr+0x0/0xf
 EOI [80265e66] __sched_text_start+0xaa6/0xadd
 [8025b55c] mwait_idle+0x3f/0x54
 [8025b526] mwait_idle+0x9/0x54
 [8024c81c] cpu_idle+0xa2/0xc5
 [8026e519] rest_init+0x2b/0x2d
 [80a7f811] start_kernel+0x24a/0x24c
 [80a7f28b] _sinittext+0x28b/0x292

BUG: soft lockup detected on CPU#0!

Call Trace:
 [8026f79b] show_trace+0xae/0x336
 [8026fa38] dump_stack+0x15/0x17
 [802bfea7] softlockup_tick+0xd5/0xea
 

tc related lockdep warning.

2006-09-24 Thread Dave Jones

=
[ INFO: inconsistent lock state ]
-
inconsistent {softirq-on-R} -> {in-softirq-W} usage.
swapper/0 [HC0[0]:SC1[2]:HE1:SE0] takes:
 (police_lock){-+--}, at: [f8d304fd] tcf_police_destroy+0x24/0x8f [act_police]
{softirq-on-R} state was registered at:
  [c043bdd6] lock_acquire+0x4b/0x6d
  [c061495a] _read_lock+0x19/0x28
  [f8d3026a] tcf_act_police_locate+0x26a/0x363 [act_police]
  [c05cacc3] tcf_action_init_1+0x113/0x1a7
  [c05c97c9] tcf_exts_validate+0x3c/0x85
  [f8d4337c] u32_set_parms+0x26/0x131 [cls_u32]
  [f8d43dc7] u32_change+0x2fc/0x371 [cls_u32]
  [c05c9f44] tc_ctl_tfilter+0x417/0x487
  [c05c0d67] rtnetlink_rcv_msg+0x1b3/0x1d6
  [c05ccef3] netlink_run_queue+0x69/0xfe
  [c05c0b6a] rtnetlink_rcv+0x29/0x42
  [c05cd380] netlink_data_ready+0x12/0x50
  [c05cc3e8] netlink_sendskb+0x1f/0x37
  [c051] netlink_unicast+0x1a1/0x1bb
  [c05cd361] netlink_sendmsg+0x275/0x282
  [c05aff4a] sock_sendmsg+0xe8/0x103
  [c05b074d] sys_sendmsg+0x14d/0x1a8
  [c05b1937] sys_socketcall+0x16b/0x186
  [c0403fb7] syscall_call+0x7/0xb
irq event stamp: 278833666
hardirqs last  enabled at (278833666): [c04290b9] tasklet_action+0x30/0xca
hardirqs last disabled at (278833665): [c0429095] tasklet_action+0xc/0xca
softirqs last  enabled at (278833650): [c0429083] __do_softirq+0xec/0xf2
softirqs last disabled at (278833659): [c0406683] do_softirq+0x5a/0xbe

other info that might help us debug this:
1 lock held by swapper/0:
 #0:  (qdisc_tree_lock){-+-.}, at: [c05c7737] __qdisc_destroy+0x20/0x85

stack backtrace:
 [c04051ed] show_trace_log_lvl+0x58/0x16a
 [c04057fa] show_trace+0xd/0x10
 [c0405913] dump_stack+0x19/0x1b
 [c043a20b] print_usage_bug+0x1cf/0x1dc
 [c043a5e4] mark_lock+0x124/0x353
 [c043b2a0] __lock_acquire+0x3d7/0x99c
 [c043bdd6] lock_acquire+0x4b/0x6d
 [c06148a5] _write_lock_bh+0x1e/0x2d
 [f8d304fd] tcf_police_destroy+0x24/0x8f [act_police]
 [f8d30590] tcf_act_police_cleanup+0x28/0x33 [act_police]
 [c05ca1a1] tcf_action_destroy+0x20/0x84
 [c05c9784] tcf_exts_destroy+0x16/0x1f
 [f8d43114] u32_destroy_key+0x30/0x50 [cls_u32]
 [f8d4314f] u32_clear_hnode+0x1b/0x2e [cls_u32]
 [f8d4319a] u32_destroy_hnode+0x38/0x81 [cls_u32]
 [f8d4322d] u32_destroy+0x4a/0xc9 [cls_u32]
 [f8d340f9] ingress_destroy+0x1a/0x5c [sch_ingress]
 [c05c774d] __qdisc_destroy+0x36/0x85
 [c043438f] __rcu_process_callbacks+0xfe/0x169
 [c04346f0] rcu_process_callbacks+0x23/0x45
 [c04290ee] tasklet_action+0x65/0xca
 [c042900f] __do_softirq+0x78/0xf2
 [c0406683] do_softirq+0x5a/0xbe
 [c0428eb8] irq_exit+0x3d/0x3f
 [c04179df] smp_apic_timer_interrupt+0x73/0x78
 [c0404b12] apic_timer_interrupt+0x2a/0x30
DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30
Leftover inexact backtrace:
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] Re: high latency with TCP connections

2006-09-22 Thread Rick Jones

Alexey Kuznetsov wrote:

Hello!


transactions to data segments is fubar.  That issue is also why I wonder 
about the setting of tcp_abc.



Yes, switching ABC on/off has visible impact on amount of segments.
When ABC is off, amount of segments is almost the same as number of
transactions. When it is on, ~1.5% are merged. But this is invisible
in numbers of throughput/cpu usage.


Hmm, that would seem to suggest that for "new" the netperf/netserver 
were being fast enough that the code didn't perceive the receipt of 
back-to-back sub-MSS segments? (Is that even possible once -b is fairly 
large?)  Otherwise, with "new" I would have expected the segment count to 
be meaningfully greater than the transaction count?




That's the numbers:

1Gig link. The first column is b. "-" separates runs of netperf
in backward direction.

Run #1. One host is slower.

old,abc=0
 new,abc=0
  new,abc=1
   old,abc=1

2   23652.00  6.31   21.11  10.665  8.924
 23622.16  6.47   21.01  10.951  8.893
  23625.05  6.21   21.01  10.512  8.891
   23725.12  6.46   20.31  10.898  8.559
-
23594.87  21.90  6.44   9.283   10.912
 23631.52  20.30  6.36   8.592   10.766
  23609.55  21.00  6.26   8.896   10.599
   23633.75  21.10  5.44   8.929   9.206

4   36349.11  8.71   31.21  9.584   8.585
 36461.37  8.65   30.81  9.492   8.449
  36723.72  8.22   31.31  8.949   8.526
   35801.24  8.58   30.51  9.589   8.521
-
35127.34  33.80  8.43   9.621   9.605
 36165.50  30.90  8.48   8.545   9.381
  36201.45  31.10  8.31   8.592   9.185
   35269.76  30.00  8.58   8.507   9.732

8   41148.23  10.39  42.30  10.101  10.281
 41270.06  11.04  31.31  10.698  7.585
  41181.56  5.66   48.61  5.496   11.803
   40372.37  9.68   56.50  9.591   13.996
-
40392.14  47.00  11.89  11.637  11.775
 40613.80  36.90  9.16   9.086   9.019
  40504.66  53.60  7.73   13.234  7.639
   40388.99  48.70  11.93  12.058  11.814

16  67952.27  16.27  43.70  9.576   6.432
 68031.40  10.56  53.70  6.206   7.894
  6.95  12.81  46.90  7.559   6.920
   67814.41  16.13  46.50  9.517   6.857
-
68031.46  51.30  11.53  7.541   6.781
 68044.57  40.70  8.48   5.982   4.986
  67808.13  39.60  15.86  5.840   9.355
   67818.32  52.90  11.51  7.801   6.791

32  90445.09  15.41  99.90  6.817   11.045
 90210.34  16.11  100.00 7.143   11.085
  90221.84  17.31  98.90  7.676   10.962
   90712.78  18.41  99.40  8.120   10.958
-
89155.51  99.90  12.89  11.205  5.782
 90058.54  99.90  16.16  11.093  7.179
  90092.31  98.60  15.41  10.944  6.840
   88688.96  99.00  17.59  11.163  7.933

64  89983.76  13.66  100.00 6.071   11.113
 90504.24  17.54  100.00 7.750   11.049
  92043.36  17.44  99.70  7.580   10.832
   90979.29  16.01  99.90  7.038   10.981
-
88615.27  99.90  14.91  11.273  6.729
 89316.13  99.90  17.28  11.185  7.740
  90622.85  99.90  16.81  11.024  7.420
   89084.85  99.90  17.51  11.214  7.861

Run #2. Slower host is replaced with better one. ABC=0.
No runs in backward directions.

new
 old

2   24009.73  8.80   6.49   3.667   10.806
 24008.43  8.00   6.32   3.334   10.524
4   40012.53  18.30  8.79   4.574   8.783
 3.84  19.40  8.86   4.851   8.857
8   60500.29  26.30  12.78  4.348   8.452
 60397.79  26.30  11.73  4.355   7.769
16  69619.95  39.80  14.03  5.717   8.063
 70528.72  24.90  14.43  3.531   8.184
32  132522.01  53.20  21.28  4.015   6.424
 132602.93  57.70  22.59  4.351   6.813
64  145738.83  60.30  25.01  4.138   6.865
 143129.55  73.20  24.19  5.114   6.759
128 148184.21  69.70  24.96  4.704   6.739
 148143.47  71.00  25.01  4.793   6.753
256 144798.91  69.40  25.01  4.793   6.908
 144086.01  73.00  24.61  5.067   6.832

Frankly, I do not see any statistically valid correlations.


Does look like it jumps-around quite a bit - for example the run#2 with 
-b 16 had the CPU util all over the map on the netperf side.  That 
wasn't by any chance an SMP system?


that linux didn't seem to be doing the same thing. Hence my tweaking 
when seeing this patch come along...]



netperf does not catch this. :-)


Nope :(  One of these days I need to teach netperf how to extract 
TCP statistics from as many platforms as possible.  Meantime it relies 
as always on the kindness of benchmarkers :) (My apologies to Tennessee 
Williams :)



Even with this patch linux does not ack each second segment dumbly,
it waits for some conditions, mostly read() emptying receive queue.


Good.  HP-UX is indeed dumb about this, but I'm assured it will be 
changing.  I 

Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

2006-09-22 Thread Rick Jones

That came from named. It opens lots of sockets with SIOCGSTAMP.
No idea what it needs that many for.


IIRC ISC BIND named opens a socket for each IP it finds on the system. 
Presumably in this way it knows implicitly the destination IP without 
using platform-specific recvfrom/whatever extensions and gets some 
additional parallelism in the stack on SMP systems.


Why it needs/wants the timestamps I've no idea, I don't think it gets 
them that way on all platforms.  I suppose the next time I do some named 
benchmarking I can try to take a closer look in the source.


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: UDP Out 0f Sequence

2006-09-21 Thread Rick Jones

Majumder, Rajib wrote:

Let's say we have 2 uniprocessor hosts connected back to back. Is
there any possibility of an out-of-order scenario on recv? 


Your application should be written on the assumption that it is 
possible, regardless of the specifics of the hosts involved, however 
unlikely they may be to reorder traffic.


 Is this the same for all kernels (Linux/Solaris)?

Your application should be written on the assumption that it is possible, 
regardless of the specifics of the OSes involved, however unlikely they 
may be to reorder traffic.
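
In practice that means the application carries its own sequencing, e.g.
something like this on the receive side (a sketch, not a complete
protocol - it assumes the sender stamps the first four bytes of each
datagram with a sequence number in network order):

#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

static uint32_t expected_seq;

static void handle_datagram(int fd)
{
        char buf[2048];
        uint32_t seq;
        ssize_t n = recv(fd, buf, sizeof(buf), 0);

        if (n < (ssize_t)sizeof(seq))
                return;         /* runt or error - not handled here */
        memcpy(&seq, buf, sizeof(seq));
        seq = ntohl(seq);
        if (seq != expected_seq)
                printf("gap or reorder: got %u, expected %u\n",
                       seq, expected_seq);
        expected_seq = seq + 1;
}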


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Question about David's blog entry for NetCONF 2006, Day 1

2006-09-21 Thread Rick Jones
I was reading David's blog entries on the netdev meeting in Japan, and 
have a question about this bit:



Currently, things like Xen have to put the card into promiscuous
mode, accepting all packets, which is quite inefficient.


Is the inefficient bit meant for accepting all packets, or more broadly 
that the promiscuous path is quite inefficient compared to the 
non-promiscuous path?


I ask because I would have thought that if the system were connected to 
a switch (*), the number of packets received through a NIC in 
promiscuous mode would be nearly the same as when it was not in 
promiscuous mode - the delta being (perhaps) multicast frames.


rick jones

(*) Today, it seems 99 times out of 10 systems are connected to 
switches, not hubs.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: UDP Out 0f Sequence

2006-09-20 Thread Rick Jones

Majumder, Rajib wrote:

Hi,

If I write UDP datagrams 1,2 and 3 to network and if the receiver
receives in order 2,1, and 3, where can the sequence get changed? Is it
at the source stack, network transit or destination stack?


Yes. :)

Although network transit is by far the most likely case.  Destination 
stack is a distant second and source stack an even more distant third. 
Generally stack writers try to avoid having places in their stacks where 
things can reorder, but it isn't completely unknown.


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/23] e100: Add debugging code for cb cleaning and csum failures.

2006-09-19 Thread Dave Jones
On Tue, Sep 19, 2006 at 10:28:38AM -0700, Kok, Auke wrote:
  
  Refine cb cleaning debug printout and print out all cleaned cbs' status. Add
  debug flag for EEPROM csum failures that were overridden by the user.
  
  Signed-off-by: Jesse Brandeburg [EMAIL PROTECTED]
  Signed-off-by: Auke Kok [EMAIL PROTECTED]
  ---
  
   drivers/net/e100.c |9 ++---
   1 files changed, 6 insertions(+), 3 deletions(-)
  
  diff --git a/drivers/net/e100.c b/drivers/net/e100.c
  index ab0868c..ae93c62 100644
  --- a/drivers/net/e100.c
  +++ b/drivers/net/e100.c
  @@ -761,6 +761,8 @@ static int e100_eeprom_load(struct nic *
    DPRINTK(PROBE, ERR, "EEPROM corrupted\n");
   if (!eeprom_bad_csum_allow)
   return -EAGAIN;
  +else
  +add_taint(TAINT_MACHINE_CHECK);

I object to this flag being abused this way.
A corrupt EEPROM on a network card has _nothing_ to do with
a CPU machine check exception.

Dave

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/23] e100: Add debugging code for cb cleaning and csum failures.

2006-09-19 Thread Dave Jones
On Tue, Sep 19, 2006 at 05:40:34PM -0400, Jeff Garzik wrote:
  Dave Jones wrote:
   On Tue, Sep 19, 2006 at 10:28:38AM -0700, Kok, Auke wrote:
 +   add_taint(TAINT_MACHINE_CHECK);
   
   I object to this flag being abused this way.
   A corrupt EEPROM on a network card has _nothing_ to do with
   a CPU machine check exception.
  
  Fair enough.  Better suggestions?
  
  I think it's fair to set _some_ taint flag, perhaps a new one, on a 
  known corrupted firmware.  But if others disagree, I'll follow the 
  consensus here.

I don't object to a new flag, but overloading an existing flag that has
established meaning just seems wrong to me.

Question is how many more types of random hardware failures are there
that we'd like to do similar things for ?
Perhaps a catch-all Hardware failure flag for assorted brokenness
would be better than a proliferation of flags?

Dave

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] Re: high latency with TCP connections

2006-09-18 Thread Rick Jones

David Miller wrote:

From: Rick Jones [EMAIL PROTECTED]
Date: Tue, 05 Sep 2006 10:55:16 -0700


Is this really necessary?  I thought that the problems with ABC were in 
trying to apply byte-based heuristics from the RFC(s) to a 
packet-oritented cwnd in the stack?



This is receiver side, and helps a sender who does congestion
control based upon packet counting like Linux does.   It really
is less related to ABC than Alexey implies, we've always had
this kind of problem as I mentioned in previous talks in the
past on this issue.


For a connection receiving nothing but sub-MSS segments this is going to 
non-trivially increase the number of ACKs sent, no?  I would expect an 
unpleasant increase in service demands on something like a burst 
enabled (./configure --enable-burst) netperf TCP_RR test:


netperf -t TCP_RR -H foo -- -b N   # N > 1

to increase as a result.   Pipelined HTTP would be like that, some NFS 
over TCP stuff too, maybe X traffic, other transactional workloads as 
 well - maybe Tuxedo.


rick jones
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] Re: high latency with TCP connections

2006-09-18 Thread Rick Jones

Alexey Kuznetsov wrote:

Hello!

Of course, the number of ACKs increases. It is the goal. :-)

unpleasant increase in service demands on something like a burst 
enabled (./configure --enable-burst) netperf TCP_RR test:


netperf -t TCP_RR -H foo -- -b N   # N > 1


foo=localhost


There isn't any sort of clever short-circuiting in loopback is there?  I 
do like the convenience of testing things over loopback, but always fret 
about not including drivers and actual hardware interrupts etc.



b   patched orig
2   105874.83   105143.71
3   114208.53   114023.07
4   120493.99   120851.27
5   128087.48   128573.33
10  151328.48   151056.00



Probably, the test is done wrong. But I see no difference.


Regardless, kudos for running the test.  The only thing missing is the 
-c and -C options to enable the CPU utilization measurements which will 
then give the service demand on a CPU time per transaction basis.  Or 
was this a UP system that was taken to CPU saturation?


to increase as a result.   Pipelined HTTP would be like that, some NFS 
over TCP stuff too, maybe X traffic,



X will be excited about better latency.

What about protocols not interested in latency? They will be a little
happier, if transactions are processed asynchronously.


What i'm thinking about isn't so much about the latency as it is the 
aggregate throughput a system can do with lots of these 
protocols/connections going at the same time.  Hence the concern about 
increases in service demand.



But actually, it is not about increasing/decreasing number of ACKs.
It is about killing that pain in ass which we used to have because
we pretended to be too smart.


:)

rick jones


Re: [PATCH][RFC] Re: high latency with TCP connections

2006-09-18 Thread Rick Jones
Regardless, kudos for running the test.  The only thing missing is the 
-c and -C options to enable the CPU utilization measurements which will 
then give the service demand on a CPU time per transaction basis.  Or 
was this a UP system that was taken to CPU saturation?



It is my notebook. :-) Of course, cpu consumption is 100%.
(Actually, netperf shows 100.10 :-))


Gotta love the accuracy. :)



I will redo test on a real network. What range of -b should I test?



I suppose that depends on your patience :) In theory, as you increase 
(eg double) the -b setting you should reach a point of diminishing 
returns wrt transaction rate.  If you see that, and see the service 
demand flattening-out I'd say it is probably time to stop.


I'm also not quite sure if abc needs to be disabled or not.

I do know that I left-out one very important netperf option.  The 
command line should be:


netperf -t TCP_RR -H foo -- -b N -D

where -D is added to set TCP_NODELAY.  Otherwise, the ratio of 
transactions to data segments is fubar.  That issue is also why I wonder 
about the setting of tcp_abc.
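
(Putting those pieces together, a rough sketch of the runs I have in mind;
the host foo and the burst sizes are placeholders, and the tcp_abc sysctl
is an assumption about 2.6.18-era kernels:)

  # build netperf with burst-mode TCP_RR support
  ./configure --enable-burst && make

  # optionally take ABC out of the picture while testing
  sysctl -w net.ipv4.tcp_abc=0

  # double -b until the transaction rate and service demand flatten out;
  # -D sets TCP_NODELAY, -c/-C report CPU utilization and service demand
  for b in 1 2 4 8 16 32 64; do
      netperf -t TCP_RR -H foo -c -C -- -b $b -D
  done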


[I have this quixotic pipedream about being able to --enable-burst, set 
-D and say that the number of TCP segments exchanged on the network is 
2X the transaction count when request and response size are < MSS.  The 
raison d'etre for this pipe dream is maximizing PPS with TCP_RR tests 
without _having_ to have hundreds if not thousands of simultaneous 
netperfs/connections - say with just as many netperfs/connections as 
there are CPUs or threads/strands in the system. It was while trying to 
make this pipe dream a reality I first noticed that HP-UX 11i, which 
normally has a very nice ACK avoidance heuristic, would send an 
immediate ACK if it received back-to-back sub-MSS segments - thus 
ruining my pipe dream when it came to HP-UX testing.  Happily, I noticed 
that linux didn't seem to be doing the same thing. Hence my tweaking 
when seeing this patch come along...]



What i'm thinking about isn't so much about the latency



I understand.

Actually, I did those tests ages ago for a pure throughput case,
when nothing goes in the opposite direction. I did not find a difference
that time. And nobody even noticed that Linux sends ACKs _each_ small
segment for unidirectional connections for all those years. :-)


Not everyone looks very closely (alas, sometimes myself included).

If all anyone does is look at throughput, until they CPU saturate they 
wouldn't notice.  Heck, before netperf and TCP_RR tests, and sadly even 
still today, most people just look at how fast a single-connection, 
unidirectional data transfer goes and leave it at that :(


Thankfully, the set of most people and netdev aren't completely 
overlapping.


rick jones


NIC interrupt assignments under UltraSPARC-T1

2006-09-13 Thread Rick Jones
From time to time I play with netperf on different systems.  I happen 
to have occasion to play with a T2000.  Under Solaris 10 I am able to 
coerce the interrupts of the different core GbEs to be on different 
cores rather than strands of the same core.


Under a 2.6.15 kernel (Ubuntu Dapper) it would appear that the old 
standby of echo <affinity mask> to the IRQ doesn't work - no matter 
how I change the mask for a NIC, running a netperf TCP_RR test seems to 
show the interrupts happening on the same strand.
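
(The old standby in question, as a minimal sketch; the interface name, the
IRQ number 24 and the mask value are placeholders for whatever
/proc/interrupts reports on a given system:)

  grep eth0 /proc/interrupts           # find the NIC's IRQ number, say 24
  cat /proc/irq/24/smp_affinity        # current CPU affinity mask, in hex
  echo 2 > /proc/irq/24/smp_affinity   # ask for the IRQ to move to CPU 1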


Is it indeed not possible to alter the interrupt assignments or have I 
(as I'm wont to do) missed something quasi-obvious?


thanks,

rick jones


Re: [PATCH][RFC] Re: high latency with TCP connections

2006-09-05 Thread Rick Jones

Alexey Kuznetsov wrote:

Hello!



Some people reported that this program runs in 9.997 sec when run on
FreeBSD.



Try enclosed patch. I have no idea why 9.997 sec is so magic, but I
get exactly this number on my notebook. :-)

Alexey

=

This patch enables sending ACKs each 2d received segment.
It does not affect either mss-sized connections (obviously) or connections
controlled by Nagle (because there is only one small segment in flight).

The idea is to record the fact that a small segment arrives
on a connection, where one small segment has already been received
and still not-ACKed. In this case ACK is forced after tcp_recvmsg()
drains receive buffer.

In other words, it is a soft each-2d-segment ACK, which is enough
to preserve ACK clock even when ABC is enabled.


Is this really necessary?  I thought that the problems with ABC were in 
trying to apply byte-based heuristics from the RFC(s) to a 
packet-oriented cwnd in the stack?


rick jones


neigh_lookup lockdep warning

2006-09-02 Thread Dave Jones
Seen during boot of a 2.6.18rc5-git1 based kernel.

Dave

===
[ INFO: possible circular locking dependency detected ]
2.6.17-1.2608.fc6 #1
---
swapper/0 is trying to acquire lock:
 (tbl->lock){-+-+}, at: [c05bdf97] neigh_lookup+0x50/0xaf

but task is already holding lock:
 (list->lock#3){-+..}, at: [c05bf677] neigh_proxy_process+0x20/0xc2

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (list->lock#3){-+..}:
   [c043c09a] lock_acquire+0x4b/0x6d
   [c061411f] _spin_lock_irqsave+0x22/0x32
   [c05b451f] skb_dequeue+0x12/0x43
   [c05b523a] skb_queue_purge+0x14/0x1b
   [c05be990] neigh_update+0x34a/0x3a6
   [c05f0f6e] arp_process+0x4ad/0x4e7
   [c05f107c] arp_rcv+0xd4/0xf1
   [c05b942c] netif_receive_skb+0x205/0x274
   [c7bb0566] rhine_napipoll+0x28d/0x449 [via_rhine]
   [c05baf73] net_rx_action+0x9d/0x196
   [c04293a7] __do_softirq+0x78/0xf2
   [c0406673] do_softirq+0x5a/0xbe

-> #1 (n->lock){-+..}:
   [c043c09a] lock_acquire+0x4b/0x6d
   [c0613e48] _write_lock+0x19/0x28
   [c05bfc69] neigh_periodic_timer+0x98/0x13c
   [c042dc58] run_timer_softirq+0x108/0x167
   [c04293a7] __do_softirq+0x78/0xf2
   [c0406673] do_softirq+0x5a/0xbe

-> #0 (tbl->lock){-+-+}:
   [c043c09a] lock_acquire+0x4b/0x6d
   [c0613f02] _read_lock_bh+0x1e/0x2d
   [c05bdf97] neigh_lookup+0x50/0xaf
   [c05bf0b9] neigh_event_ns+0x2c/0x77
   [c05f0e2a] arp_process+0x369/0x4e7
   [c05f10a1] parp_redo+0x8/0xa
   [c05bf6bd] neigh_proxy_process+0x66/0xc2
   [c042dc58] run_timer_softirq+0x108/0x167
   [c04293a7] __do_softirq+0x78/0xf2
   [c0406673] do_softirq+0x5a/0xbe

other info that might help us debug this:

1 lock held by swapper/0:
 #0:  (list->lock#3){-+..}, at: [c05bf677] neigh_proxy_process+0x20/0xc2

stack backtrace:
 [c04051ee] show_trace_log_lvl+0x58/0x159
 [c04057ea] show_trace+0xd/0x10
 [c0405903] dump_stack+0x19/0x1b
 [c043b182] print_circular_bug_tail+0x59/0x64
 [c043b99a] __lock_acquire+0x80d/0x99c
 [c043c09a] lock_acquire+0x4b/0x6d
 [c0613f02] _read_lock_bh+0x1e/0x2d
 [c05bdf97] neigh_lookup+0x50/0xaf
 [c05bf0b9] neigh_event_ns+0x2c/0x77
 [c05f0e2a] arp_process+0x369/0x4e7
 [c05f10a1] parp_redo+0x8/0xa
 [c05bf6bd] neigh_proxy_process+0x66/0xc2
 [c042dc58] run_timer_softirq+0x108/0x167
 [c04293a7] __do_softirq+0x78/0xf2
 [c0406673] do_softirq+0x5a/0xbe
 [c0429250] irq_exit+0x3d/0x3f
 [c0417cbf] smp_apic_timer_interrupt+0x79/0x7e
 [c0404b0a] apic_timer_interrupt+0x2a/0x30
DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30
Leftover inexact backtrace:

-- 
http://www.codemonkey.org.uk



Re: high latency with TCP connections

2006-08-30 Thread Rick Jones

David Miller wrote:

From: Stephen Hemminger [EMAIL PROTECTED]
Date: Wed, 30 Aug 2006 10:27:27 -0700



Linux TCP implements Appropriate Byte Count (ABC) and this penalizes
applications that do small sends. The problem is that the other side
may be delaying acknowledgments.  If receiver doesn't acknowledge the
sender will limit itself to the congestion window. If the flow is light,
then you will be limited to 4 packets.



Right.

However it occured to me the other day that ABC could be made smarter.
If we sent small frames, ABC should account for that.


Is that part of the application of a byte-based RFC to packet-counting cwnd?

rick jones


Re: [PATCH?] tcp and delayed acks

2006-08-16 Thread Rick Jones

The point of delayed ack's was to merge the response and the ack on 
request/response
protocols like NFS or telnet. It does make sense to get it out sooner though.


Well, to a point at least - I wouldn't go so far as to suggest immediate 
ACKs.


However, I was always under the impression that ACKs were sent (in the 
mythical generic TCP stack) when:


a) there was data going the other way
b) there was a window update going the other way
c) the standalone ACK timer expired.

Does this patch then implement b?  Were there perhaps holes in the 
logic when things were smaller than the MTU/MSS?  (-v 2 on the netperf 
command line should show what the MSS was for the connection)
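
(For example, a minimal sketch; the host foo and the one-byte
request/response sizes are placeholders:)

  netperf -t TCP_RR -H foo -v 2 -- -r 1,1   # -v 2 reports the MSS in use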


rick jones

BTW, many points scored for including CPU utilization and service demand 
figures with the netperf output :)






[All tests run with maxcpus=1 on a 2.67GHz Woodcrest system.]

Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

Base (2.6.17-rc4):
default send buffer size
netperf -C -c
87380  16384  16384    10.02    14127.79    99.90    99.90    0.579   0.579
87380  16384  16384    10.02    13875.28    99.90    99.90    0.590   0.590
87380  16384  16384    10.01    13777.25    99.90    99.90    0.594   0.594
87380  16384  16384    10.02    13796.31    99.90    99.90    0.593   0.593
87380  16384  16384    10.01    13801.97    99.90    99.90    0.593   0.593


netperf -C -c -- -s 1024
87380   2048   2048    10.02    0.43        -0.04    -0.04    -7.105  -7.377
87380   2048   2048    10.02    0.43        -0.01    -0.01    -2.337  -2.620
87380   2048   2048    10.02    0.43        -0.03    -0.03    -5.683  -5.940
87380   2048   2048    10.02    0.43        -0.05    -0.05    -9.373  -9.625
87380   2048   2048    10.02    0.43        -0.05    -0.05    -9.373  -9.625


Hmm, those CPU numbers don't look right.  I guess there must still be 
some holes in the procstat CPU method code in netperf :(





Re: Packet Corruption Support

2006-08-16 Thread Rick Jones

Ritesh Taank wrote:

Hi there,

I am currently using netem that has been packaged with my linux kernel 
2.6.17 (as part of the Knoppix 5.0.1 Boot CD), and the 'corrupt' 
parameter is not being recognised as a valid argument.


Having read many posts online, it appears that the Packet Corruption 
feature should be supported from kernel versions 2.6.16 onwards.


I would think that if run on an end system at least, trying to corrupt 
packets with CKO enabled on a NIC might be, well, difficult.  Or does 
netem disable CKO?
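
(For reference, the intended usage once both the kernel and iproute2 are new
enough to support it; the interface name and percentage are placeholders:)

  tc qdisc add dev eth0 root netem corrupt 0.1%   # mangle 0.1% of packets
  tc qdisc show dev eth0                          # verify the option took
  tc qdisc del dev eth0 root                      # and remove it again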


rick jones


Re: [PATCH 1/2]: powerpc/cell spidernet bottom half

2006-08-16 Thread Rick Jones

Linas Vepstas wrote:

On Wed, Aug 16, 2006 at 11:24:46PM +0200, Arnd Bergmann wrote:


it only
seems to be hard to make it go fast using any of them. 



Last round of measurements seemed linear for packet sizes between
60 and 600 bytes, suggesting that the hardware can handle a 
maximum of 120K descriptors/second, independent of packet size.

I don't know why this is.


DMA overhead perhaps?  If it takes so many micro/nanoseconds to get a 
DMA going  That used to be a reason the Tigon2 had such low PPS 
rates and issues with multiple buffer packets and a 1500 byte MTU - it 
had rather high DMA setup latency, and then if you put it into a system 
with highish DMA read/write latency... well that didn't make it any 
better :)


rick jones


Re: [PATCH] Use __always_inline in orinoco_lock()/orinoco_unlock()

2006-08-15 Thread Dave Jones
On Tue, Aug 15, 2006 at 03:25:58PM -0400, Pavel Roskin wrote:

  diff --git a/drivers/net/wireless/orinoco.h b/drivers/net/wireless/orinoco.h
  index 16db3e1..8fd9b32 100644
  --- a/drivers/net/wireless/orinoco.h
  +++ b/drivers/net/wireless/orinoco.h
  @@ -135,11 +135,9 @@ extern irqreturn_t orinoco_interrupt(int
   //
   
   /* These functions *must* be inline or they will break horribly on
  - * SPARC, due to its weird semantics for save/restore flags. extern
  - * inline should prevent the kernel from linking or module from
  - * loading if they are not inlined. */
  -extern inline int orinoco_lock(struct orinoco_private *priv,
  -   unsigned long *flags)
  + * SPARC, due to its weird semantics for save/restore flags. */

Didn't that get fixed up for SPARC a year or so back?

Dave

-- 
http://www.codemonkey.org.uk


remove unnecessary config.h includes from drivers/net/

2006-08-10 Thread Dave Jones
On Wed, Aug 09, 2006 at 09:04:38PM -0700, David Miller wrote:
  From: Dave Jones [EMAIL PROTECTED]
  Date: Wed, 9 Aug 2006 22:21:16 -0400
  
   config.h is automatically included by kbuild these days.
   
   Signed-off-by: Dave Jones [EMAIL PROTECTED]
  
  Applied to net-2.6.19, thanks Dave.

Here's a similar patch that does the same removals for drivers/net/

Signed-off-by: Dave Jones [EMAIL PROTECTED]

--- linux-2.6.17.noarch/drivers/net/irda/mcs7780.c~ 2006-08-10 
21:35:23.0 -0400
+++ linux-2.6.17.noarch/drivers/net/irda/mcs7780.c  2006-08-10 
21:35:25.0 -0400
@@ -45,7 +45,6 @@
 
 #include <linux/module.h>
 #include <linux/moduleparam.h>
-#include <linux/config.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
 #include <linux/errno.h>
--- linux-2.6.17.noarch/drivers/net/irda/w83977af_ir.c~ 2006-08-10 
21:35:28.0 -0400
+++ linux-2.6.17.noarch/drivers/net/irda/w83977af_ir.c  2006-08-10 
21:35:30.0 -0400
@@ -40,7 +40,6 @@
  /
 
 #include <linux/module.h>
-#include <linux/config.h> 
 #include <linux/kernel.h>
 #include <linux/types.h>
 #include <linux/skbuff.h>
--- linux-2.6.17.noarch/drivers/net/smc911x.c~  2006-08-10 21:35:34.0 
-0400
+++ linux-2.6.17.noarch/drivers/net/smc911x.c   2006-08-10 21:35:37.0 
-0400
@@ -55,8 +55,6 @@ static const char version[] =
 )
 #endif
 
-
-#include <linux/config.h>
 #include <linux/init.h>
 #include <linux/module.h>
 #include <linux/kernel.h>
--- linux-2.6.17.noarch/drivers/net/netx-eth.c~ 2006-08-10 21:35:41.0 
-0400
+++ linux-2.6.17.noarch/drivers/net/netx-eth.c  2006-08-10 21:35:42.0 
-0400
@@ -17,7 +17,6 @@
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
  */
 
-#include <linux/config.h>
 #include <linux/init.h>
 #include <linux/module.h>
 #include <linux/kernel.h>
--- linux-2.6.17.noarch/drivers/net/wan/cycx_main.c~2006-08-10 
21:35:45.0 -0400
+++ linux-2.6.17.noarch/drivers/net/wan/cycx_main.c 2006-08-10 
21:35:48.0 -0400
@@ -40,7 +40,6 @@
 * 1998/08/08   acmeInitial version.
 */
 
-#include <linux/config.h>	/* OS configuration options */
 #include <linux/stddef.h>	/* offsetof(), etc. */
 #include <linux/errno.h>	/* return codes */
 #include <linux/string.h>	/* inline memset(), etc. */
--- linux-2.6.17.noarch/drivers/net/wan/sdla.c~ 2006-08-10 21:35:51.0 
-0400
+++ linux-2.6.17.noarch/drivers/net/wan/sdla.c  2006-08-10 21:35:53.0 
-0400
@@ -32,7 +32,6 @@
  * 2 of the License, or (at your option) any later version.
  */
 
-#include <linux/config.h> /* for CONFIG_DLCI_MAX */
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
--- linux-2.6.17.noarch/drivers/net/wan/dlci.c~ 2006-08-10 21:35:57.0 
-0400
+++ linux-2.6.17.noarch/drivers/net/wan/dlci.c  2006-08-10 21:35:59.0 
-0400
@@ -28,7 +28,6 @@
  * 2 of the License, or (at your option) any later version.
  */
 
-#include <linux/config.h> /* for CONFIG_DLCI_COUNT */
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
--- linux-2.6.17.noarch/drivers/net/phy/vitesse.c~  2006-08-10 
21:36:02.0 -0400
+++ linux-2.6.17.noarch/drivers/net/phy/vitesse.c   2006-08-10 
21:36:04.0 -0400
@@ -12,7 +12,6 @@
  *
  */
 
-#include <linux/config.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/mii.h>
--- linux-2.6.17.noarch/drivers/net/phy/smsc.c~ 2006-08-10 21:36:07.0 
-0400
+++ linux-2.6.17.noarch/drivers/net/phy/smsc.c  2006-08-10 21:36:08.0 
-0400
@@ -14,7 +14,6 @@
  *
  */
 
-#include <linux/config.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/mii.h>
--- linux-2.6.17.noarch/drivers/net/hp100.c~2006-08-10 21:36:12.0 
-0400
+++ linux-2.6.17.noarch/drivers/net/hp100.c 2006-08-10 21:36:14.0 
-0400
@@ -111,7 +111,6 @@
 #include <linux/etherdevice.h>
 #include <linux/skbuff.h>
 #include <linux/types.h>
-#include <linux/config.h>	/* for CONFIG_PCI */
 #include <linux/delay.h>
 #include <linux/init.h>
 #include <linux/bitops.h>
--- linux-2.6.17.noarch/drivers/net/3c501.c~2006-08-10 21:36:18.0 
-0400
+++ linux-2.6.17.noarch/drivers/net/3c501.c 2006-08-10 21:36:20.0 
-0400
@@ -120,7 +120,6 @@ static const char version[] =
 #include <linux/slab.h>
 #include <linux/string.h>
 #include <linux/errno.h>
-#include <linux/config.h>	/* for CONFIG_IP_MULTICAST */
 #include <linux/spinlock.h>
 #include <linux/ethtool.h>
 #include <linux/delay.h>

-- 
http://www.codemonkey.org.uk


IPX changes introduce warning.

2006-08-09 Thread Dave Jones
We've just added an implicit declaration in the latest tree..

net/ipx/af_ipx.c: In function 'ipx_rcv':
net/ipx/af_ipx.c:1648: error: implicit declaration of function 'ipxhdr'

(Yes, my builds fail on -Werror-implicit, so that things like this get caught 
early)

Probably something simple like a missing #include, but I'm heading out
the door right now :)  I'll poke at it later if no-one has beaten me to it.

Dave

-- 
http://www.codemonkey.org.uk


remove unnecessary config.h includes from net/

2006-08-09 Thread Dave Jones
config.h is automatically included by kbuild these days.

Signed-off-by: Dave Jones [EMAIL PROTECTED]

--- linux-2.6/net/ipv4/netfilter/ip_conntrack_sip.c~2006-08-09 
22:18:48.0 -0400
+++ linux-2.6/net/ipv4/netfilter/ip_conntrack_sip.c 2006-08-09 
22:18:53.0 -0400
@@ -8,7 +8,6 @@
  * published by the Free Software Foundation.
  */
 
-#include <linux/config.h>
 #include <linux/module.h>
 #include <linux/ctype.h>
 #include <linux/skbuff.h>
--- linux-2.6/net/ipv4/af_inet.c~   2006-08-09 22:18:58.0 -0400
+++ linux-2.6/net/ipv4/af_inet.c2006-08-09 22:19:03.0 -0400
@@ -67,7 +67,6 @@
  * 2 of the License, or (at your option) any later version.
  */
 
-#include <linux/config.h>
 #include <linux/err.h>
 #include <linux/errno.h>
 #include <linux/types.h>
--- linux-2.6/net/ipv4/ipconfig.c~  2006-08-09 22:19:07.0 -0400
+++ linux-2.6/net/ipv4/ipconfig.c   2006-08-09 22:19:10.0 -0400
@@ -31,7 +31,6 @@
  *  --  Josef Siemes [EMAIL PROTECTED], Aug 2002
  */
 
-#include <linux/config.h>
 #include <linux/types.h>
 #include <linux/string.h>
 #include <linux/kernel.h>
--- linux-2.6/net/ipv4/raw.c~   2006-08-09 22:19:14.0 -0400
+++ linux-2.6/net/ipv4/raw.c2006-08-09 22:19:18.0 -0400
@@ -38,8 +38,7 @@
  * as published by the Free Software Foundation; either version
  * 2 of the License, or (at your option) any later version.
  */
- 
-#include <linux/config.h> 
+
 #include <linux/types.h>
 #include <asm/atomic.h>
 #include <asm/byteorder.h>
--- linux-2.6/net/ipv4/tcp_veno.c~  2006-08-09 22:19:23.0 -0400
+++ linux-2.6/net/ipv4/tcp_veno.c   2006-08-09 22:19:26.0 -0400
@@ -9,7 +9,6 @@
  * See http://www.ntu.edu.sg/home5/ZHOU0022/papers/CPFu03a.pdf
  */
 
-#include <linux/config.h>
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/skbuff.h>
--- linux-2.6/net/ipv4/tcp_lp.c~2006-08-09 22:19:31.0 -0400
+++ linux-2.6/net/ipv4/tcp_lp.c 2006-08-09 22:19:34.0 -0400
@@ -31,7 +31,6 @@
  * Version: $Id: tcp_lp.c,v 1.22 2006-05-02 18:18:19 hswong3i Exp $
  */
 
-#include <linux/config.h>
 #include <linux/module.h>
 #include <net/tcp.h>
 
--- linux-2.6/net/atm/atm_sysfs.c~  2006-08-09 22:19:38.0 -0400
+++ linux-2.6/net/atm/atm_sysfs.c   2006-08-09 22:19:40.0 -0400
@@ -1,6 +1,5 @@
 /* ATM driver model support. */
 
-#include <linux/config.h>
 #include <linux/kernel.h>
 #include <linux/init.h>
 #include <linux/kobject.h>
--- linux-2.6/net/core/wireless.c~  2006-08-09 22:19:44.0 -0400
+++ linux-2.6/net/core/wireless.c   2006-08-09 22:19:47.0 -0400
@@ -72,7 +72,6 @@
 
 /* INCLUDES */
 
-#include <linux/config.h>		/* Not needed ??? */
 #include <linux/module.h>
 #include <linux/types.h>		/* off_t */
 #include <linux/netdevice.h>	/* struct ifreq, dev_get_by_name() */
--- linux-2.6/net/core/dev_mcast.c~ 2006-08-09 22:19:52.0 -0400
+++ linux-2.6/net/core/dev_mcast.c  2006-08-09 22:19:59.0 -0400
@@ -21,8 +21,7 @@
  * 2 of the License, or (at your option) any later version.
  */
 
-#include <linux/config.h> 
-#include <linux/module.h> 
+#include <linux/module.h>
 #include <asm/uaccess.h>
 #include <asm/system.h>
 #include <linux/bitops.h>

-- 
http://www.codemonkey.org.uk


another networking lockdep trace.

2006-08-08 Thread Dave Jones
From a recent rc3-git kernel.

Dave

-- 
http://www.codemonkey.org.uk
---BeginMessage---
Please do not reply directly to this email. All additional
comments should be made in the comments box of this bug report.




https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=201560

   Summary: INFO: inconsistent lock state - during boot .2528
   Product: Fedora Core
   Version: devel
  Platform: All
OS/Version: Linux
Status: NEW
  Severity: normal
  Priority: normal
 Component: kernel
AssignedTo: [EMAIL PROTECTED]
ReportedBy: [EMAIL PROTECTED]
 QAContact: [EMAIL PROTECTED]
CC: [EMAIL PROTECTED]


Description of problem:
Get this on boot of new kernel:

Aug  7 06:47:44 localhost kernel: [ INFO: inconsistent lock state ]
Aug  7 06:47:44 localhost pcscd: winscard.c:219:SCardConnect() Reader E-Gate 0 0
Not Found
Aug  7 06:47:44 localhost kernel: -
Aug  7 06:47:44 localhost kernel: inconsistent {in-softirq-W} -> {softirq-on-W}
usage.
Aug  7 06:47:44 localhost kernel: ip/2617 [HC0[0]:SC0[0]:HE1:SE1] takes:
Aug  7 06:47:44 localhost kernel:  (ifa->lock){-+..}, at: [f90a3836]
inet6_addr_add+0xf8/0x13e [ipv6]
Aug  7 06:47:44 localhost kernel: {in-softirq-W} state was registered at:
Aug  7 06:47:44 localhost kernel:   [c043bfb9] lock_acquire+0x4b/0x6a
Aug  7 06:47:44 localhost kernel:   [c060f428] _spin_lock_bh+0x1e/0x2d
Aug  7 06:47:44 localhost kernel:   [f90a4757] addrconf_dad_timer+0x3a/0xe2 
[ipv6]
Aug  7 06:47:44 localhost pcscd: winscard.c:219:SCardConnect() Reader E-Gate 0 0
Not Found
Aug  7 06:47:44 localhost kernel:   [c042dbc0] run_timer_softirq+0x108/0x167
Aug  7 06:47:44 localhost kernel:   [c04293ab] __do_softirq+0x78/0xf2
Aug  7 06:47:44 localhost kernel:   [c0406673] do_softirq+0x5a/0xbe
Aug  7 06:47:44 localhost kernel: irq event stamp: 3551
Aug  7 06:47:44 localhost kernel: hardirqs last  enabled at (3551): [c04291bf]
local_bh_enable_ip+0xc6/0xcf
Aug  7 06:47:45 localhost kernel: hardirqs last disabled at (3549): [c0429152]
local_bh_enable_ip+0x59/0xcf
Aug  7 06:47:45 localhost kernel: softirqs last  enabled at (3550): [f90a09ce]
ipv6_add_addr+0x210/0x254 [ipv6]
Aug  7 06:47:45 localhost kernel: softirqs last disabled at (3538): [c060f4f7]
_read_lock_bh+0xb/0x2d
Aug  7 06:47:45 localhost kernel:
Aug  7 06:47:45 localhost kernel: other info that might help us debug this:
Aug  7 06:47:45 localhost kernel: 1 lock held by ip/2617:
Aug  7 06:47:45 localhost kernel:  #0:  (rtnl_mutex){--..}, at: [c060e378]
mutex_lock+0x21/0x24
Aug  7 06:47:45 localhost kernel:
Aug  7 06:47:45 localhost kernel: stack backtrace:
Aug  7 06:47:45 localhost kernel:  [c04051ee] show_trace_log_lvl+0x58/0x159
Aug  7 06:47:45 localhost kernel:  [c04057ea] show_trace+0xd/0x10
Aug  7 06:47:45 localhost kernel:  [c0405903] dump_stack+0x19/0x1b
Aug  7 06:47:45 localhost kernel:  [c043a402] print_usage_bug+0x1ca/0x1d7
Aug  7 06:47:45 localhost kernel:  [c043a8eb] mark_lock+0x239/0x353
Aug  7 06:47:45 localhost kernel:  [c043b50a] __lock_acquire+0x459/0x997
Aug  7 06:47:45 localhost kernel:  [c043bfb9] lock_acquire+0x4b/0x6a
Aug  7 06:47:45 localhost kernel:  [c060f3fb] _spin_lock+0x19/0x28
Aug  7 06:47:45 localhost kernel:  [f90a3836] inet6_addr_add+0xf8/0x13e [ipv6]
Aug  7 06:47:45 localhost kernel:  [f90a3a39] inet6_rtm_newaddr+0x1bd/0x1d2 
[ipv6]
Aug  7 06:47:45 localhost kernel:  [c05bf5f3] rtnetlink_rcv_msg+0x1b3/0x1d6
Aug  7 06:47:45 localhost kernel:  [c05cae7b] netlink_run_queue+0x69/0xfe
Aug  7 06:47:45 localhost kernel:  [c05bf3f6] rtnetlink_rcv+0x29/0x42
Aug  7 06:47:45 localhost kernel:  [c05cb308] netlink_data_ready+0x12/0x50
Aug  7 06:47:45 localhost kernel:  [c05ca370] netlink_sendskb+0x1f/0x37
Aug  7 06:47:45 localhost kernel:  [c05cac49] netlink_unicast+0x1a1/0x1bb
Aug  7 06:47:45 localhost kernel:  [c05cb2e9] netlink_sendmsg+0x275/0x282
Aug  7 06:47:45 localhost kernel:  [c05ae91a] sock_sendmsg+0xe8/0x103
Aug  7 06:47:45 localhost kernel:  [c05af129] sys_sendmsg+0x14d/0x1a8
Aug  7 06:47:45 localhost kernel:  [c05b02fb] sys_socketcall+0x16b/0x186
Aug  7 06:47:45 localhost kernel:  [c0403faf] syscall_call+0x7/0xb
Aug  7 06:47:45 localhost kernel: DWARF2 unwinder stuck at syscall_call+0x7/0xb
Aug  7 06:47:45 localhost kernel: Leftover inexact backtrace:
Aug  7 06:47:45 localhost avahi-daemon[2392]: New relevant interface eth0.IPv6
for mDNS.
Aug  7 06:47:45 localhost kernel:  [c04057ea] show_trace+0xd/0x10
Aug  7 06:47:45 localhost avahi-daemon[2392]: Joining mDNS multicast group on
interface eth0.IPv6 with address fe80::20a:e4ff:fe3f:8bc4.
Aug  7 06:47:45 localhost kernel:  [c0405903] dump_stack+0x19/0x1b
Aug  7 06:47:45 localhost avahi-daemon[2392]: Registering new address record for
fe80::20a:e4ff:fe3f:8bc4 on eth0.
Aug  7 06:47:45 localhost kernel:  [c043a402] print_usage_bug+0x1ca/0x1d7
Aug  7 06:47:45 localhost kernel:  [c043a8eb] mark_lock+0x239/0x353
Aug  7 

Re: [RFC] driver adjusts qlen, increases CPU

2006-08-04 Thread Rick Jones

Jesse Brandeburg wrote:
So we've recently put a bit of code in our e1000 driver to decrease the 
qlen based on the speed of the link.


On the surface it seems like a great idea.  A driver knows when the link 
speed changed, and having a 1000 packet deep queue (the default for most 
kernels now) on top of a 100Mb/s link (or 10Mb/s worst case for us) makes 
for a *lot* of latency if many packets are queued up in the qdisc.


Problem we've seen is that setting this shorter queue causes a large spike 
in cpu when transmitting using UDP:


100Mb/s link
txqueuelen: 1000 Throughput: 92.44 CPU: 5.00
txqueuelen: 100 Throughput: 93.80 CPU: 61.59

Is this expected? any comments?


Triggering intra-stack flow-control perhaps?  Perhaps 10X more often 
than before if the queue is 1/10th what it was before?


Out of curiosity, how does the UDP socket's SO_SNDBUF compare to the 
queue depth?
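
(One way to line those two up when reproducing this, as a sketch; the
interface, host and sizes are placeholders.  netperf's test-specific -s
option sets the local socket buffer, so SO_SNDBUF can be made larger or
smaller than the txqueuelen under test:)

  ifconfig eth0 txqueuelen 100   # the shorter queue under test
  netperf -t UDP_STREAM -H foo -c -C -- -m 1472 -s 32768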


rick jones


Re: orinoco driver causes *lots* of lockdep spew

2006-08-03 Thread Dave Jones
On Thu, Aug 03, 2006 at 03:11:53PM +0100, Christoph Hellwig wrote:

  Could we please just get rid of the wireless extensions over netlink code
  again?  It doesn't help to solve anything and just creates a bigger mess
  to untangle when switching to a fully fledged wireless stack.

If we're going to do that, now is probably the best time to do it,
before any distro userland starts using it.

Dave

-- 
http://www.codemonkey.org.uk


Re: orinoco driver causes *lots* of lockdep spew

2006-08-03 Thread Dave Jones
On Thu, Aug 03, 2006 at 11:58:00AM -0700, Jean Tourrilhes wrote:
  On Thu, Aug 03, 2006 at 03:11:53PM +0100, Christoph Hellwig wrote:
   On Thu, Aug 03, 2006 at 11:54:41PM +1000, Herbert Xu wrote:
Arjan van de Ven [EMAIL PROTECTED] wrote:
 
 this is another one of those nasty buggers;

Good catch.  It's really time that we fix this properly rather than
adding more kludges to the core code.

Dave, once this goes in you can revert the previous netlink workaround
that added the _bh suffix.

[WIRELESS]: Send wireless netlink events with a clean slate
   
   Could we please just get rid of the wireless extensions over netlink code
   again?  It doesn't help to solve anything and just creates a bigger mess
   to untangle when switching to a fully fledged wireless stack.
  
   That's not going to happen any time soon, NetworkManager
  depends on Wireless Events, as well as many other apps. And there is
  not many mechanisms you can use in the kernel to generate events from
  driver to userspace.

It seemed to cope pretty well before we had this ?

Dave
-- 
http://www.codemonkey.org.uk


orinoco driver causes *lots* of lockdep spew

2006-08-02 Thread Dave Jones
Wow. Nearly 400 lines of debug spew, from a simple 'ifup eth1'.

Dave


ADDRCONF(NETDEV_UP): eth1: link is not ready
eth1: New link status: Disconnected (0002)

==
[ INFO: hard-safe -> hard-unsafe lock order detected ]
--
events/0/5 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 (af_callback_keys + sk->sk_family){-.--}, at: [802136b1] 
sock_def_readable+0x19/0x6f

and this task is already holding:
 (priv->lock){++..}, at: [8824f70e] orinoco_send_wevents+0x28/0x8b 
[orinoco]
which would create a new lock dependency:
 (priv->lock){++..} -> (af_callback_keys + sk->sk_family){-.--}

but this new dependency connects a hard-irq-safe lock:
 (priv->lock){++..}
... which became hard-irq-safe at:
  [802a8e62] lock_acquire+0x4a/0x69
  [80267ba2] _spin_lock_irqsave+0x2b/0x3c
  [8824f7be] orinoco_interrupt+0x4d/0xf49 [orinoco]
  [8021151f] handle_IRQ_event+0x2b/0x64
  [802c0987] __do_IRQ+0xae/0x114
  [8026fca8] do_IRQ+0xf7/0x107
  [802609c4] common_interrupt+0x64/0x65

to a hard-irq-unsafe lock:
 (af_callback_keys + sk->sk_family){-.--}
... which became hard-irq-unsafe at:
...  [802a8e62] lock_acquire+0x4a/0x69
  [80267867] _write_lock_bh+0x29/0x36
  [80433960] netlink_release+0x139/0x2ca
  [80257903] sock_release+0x19/0x9b
  [80257b13] sock_close+0x33/0x3a
  [802130ee] __fput+0xc6/0x1a8
  [8022effe] fput+0x13/0x16
  [80225383] filp_close+0x64/0x70
  [8021eecc] sys_close+0x93/0xb0
  [8026048d] system_call+0x7d/0x83

other info that might help us debug this:

1 lock held by events/0/5:
 #0:  (priv->lock){++..}, at: [8824f70e] 
orinoco_send_wevents+0x28/0x8b [orinoco]

the hard-irq-safe lock's dependencies:
-> (priv->lock){++..} ops: 0 {
   initial-use  at:
[802a8e62] lock_acquire+0x4a/0x69
[80267a3e] _spin_lock_irq+0x2a/0x38
[8824f102] orinoco_init+0x934/0x966 [orinoco]
[8041e762] register_netdevice+0xe6/0x375
[8041ea4b] register_netdev+0x5a/0x69
[8826155f] orinoco_cs_probe+0x3d7/0x475 
[orinoco_cs]
[803daa02] pcmcia_device_probe+0x7f/0x124
[803b5e74] driver_probe_device+0x5b/0xb1
[803b5fde] __driver_attach+0x88/0xdb
[803b5826] bus_for_each_dev+0x48/0x7a
[803b5d9e] driver_attach+0x1b/0x1e
[803b543e] bus_add_driver+0x88/0x138
[803b6289] driver_register+0x8e/0x93
[803da89b] pcmcia_register_driver+0xd0/0xda
[880a9024] 0x880a9024
[802af420] sys_init_module+0x16f2/0x18b7
[8026048d] system_call+0x7d/0x83
   in-hardirq-W at:
[802a8e62] lock_acquire+0x4a/0x69
[80267ba2] _spin_lock_irqsave+0x2b/0x3c
[8824f7be] orinoco_interrupt+0x4d/0xf49 
[orinoco]
[8021151f] handle_IRQ_event+0x2b/0x64
[802c0987] __do_IRQ+0xae/0x114
[8026fca8] do_IRQ+0xf7/0x107
[802609c4] common_interrupt+0x64/0x65
   in-softirq-W at:
[802a8e62] lock_acquire+0x4a/0x69
[80267ba2] _spin_lock_irqsave+0x2b/0x3c
[8824f7be] orinoco_interrupt+0x4d/0xf49 
[orinoco]
[8021151f] handle_IRQ_event+0x2b/0x64
[802c0987] __do_IRQ+0xae/0x114
[8026fca8] do_IRQ+0xf7/0x107
[802609c4] common_interrupt+0x64/0x65
[8028ebce] scheduler_tick+0xc1/0x362
[80261739] call_softirq+0x1d/0x28
[80295edb] irq_exit+0x56/0x59
[8027a67f] smp_apic_timer_interrupt+0x5c/0x62
[802610ad] apic_timer_interrupt+0x69/0x70
 }
 ... key  at: [8825fd80] __key.22351+0x0/0x27fa 
[orinoco]
 -> (cwq->lock){++..} ops: 0 {
initial-use  at:
  [802a8e62] lock_acquire+0x4a/0x69
  [80267ba2] _spin_lock_irqsave+0x2b/0x3c
  [802a0314] __queue_work+0x17/0x5e
  [802a03de] queue_work+0x4d/0x57
  [8029fdda] 

Re: e1000 speed/duplex error

2006-08-01 Thread Rick Jones

I thought the common behavior is that if one side force any particular
parameter, other side should sense that and go to that mode too.


Nope.  That is a common misconception and perhaps the source of many 
duplex mismatch problems today.  Here is some boilerplate I bring-out 
from time to time that may be of help:


$ cat usenet_replies/duplex
How 100Base-T Autoneg is supposed to work:

When both sides of the link are set to autoneg, they will negotiate
the duplex setting and select full-duplex if both sides can do
full-duplex.

If one side is hardcoded and not using autoneg, the autoneg process
will fail and the side trying to autoneg is required by spec to use
half-duplex mode.

If one side is using half-duplex, and the other is using full-duplex,
sorrow and woe is the usual result.

So, the following table shows what will happen given various settings
on each side:

         Auto        Half        Full

   Auto  Happiness   Lucky       Sorrow

   Half  Lucky       Happiness   Sorrow

   Full  Sorrow      Sorrow      Happiness

Happiness means that there is a good shot of everything going well.
Lucky means that things will likely go well, but not because you did
anything correctly :) Sorrow means that there _will_ be a duplex
mis-match.

When there is a duplex mismatch, on the side running half-duplex you
will see various errors and probably a number of _LATE_ collisions
(normal collisions don't count here).  On the side running
full-duplex you will see things like FCS errors.  Note that those
errors are not necessarily conclusive, they are simply indicators.

Further, it is important to keep in mind that a clean ping (or the
like - eg linkloop or default netperf TCP_RR) test result is
inconclusive here - a duplex mismatch causes lost traffic _only_ when
both sides of the link try to speak at the same time. A typical ping
test, being synchronous, one at a time request/response, never tries
to have both sides talking at the same time.

Finally, when/if you migrate to 1000Base-T, everything has to be set
to auto-neg anyway.
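
(To check for, and recover from, a mismatch, a rough sketch using ethtool;
the interface name is a placeholder and the statistics names vary by driver:)

  ethtool eth0                     # shows Speed, Duplex and Auto-negotiation
  ethtool -S eth0 | grep -i coll   # late collisions hint at the half-duplex side
  ethtool -s eth0 autoneg on       # put a hardcoded side back to autoneg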

rick jones


neigh_lookup lockdep bug.

2006-07-31 Thread Dave Jones
2.6.18rc2-gitSomething on my firewall box just triggered this..

Dave

[515613.791771] ===
[515613.841467] [ INFO: possible circular locking dependency detected ]
[515613.873284] ---
[515613.904945] swapper/0 is trying to acquire lock:
[515613.904945]  (tbl->lock){-+-+}, at: [c05b5d63] neigh_lookup+0x50/0xaf
[515613.964369] 
[515613.964373] but task is already holding lock:
[515614.006550]  (skb_queue_lock_key){-+..}, at: [c05b741c] 
neigh_proxy_process+0x20/0xc2
[515614.043225] 
[515614.043228] which lock already depends on the new lock.
[515614.043234] 
[515614.103456] 
[515614.103459] the existing dependency chain (in reverse order) is:
[515614.148752] 
[515614.148755] -> #2 (skb_queue_lock_key){-+..}:
[515614.10][c043bf43] lock_acquire+0x4b/0x6c
[515614.215554][c06089a7] _spin_lock_irqsave+0x22/0x32
[515614.243606][c05ac2e3] skb_dequeue+0x12/0x43
[515614.269657][c05acffe] skb_queue_purge+0x14/0x1b
[515614.296565][c05b673e] neigh_update+0x317/0x353
[515614.323004][c05e8a0b] arp_process+0x4aa/0x4e4
[515614.349004][c05e8b19] arp_rcv+0xd4/0xf1
[515614.373209][c05b1210] netif_receive_skb+0x204/0x271
[515614.400405][c05b2b73] process_backlog+0x99/0xfa
[515614.426351][c05b2d56] net_rx_action+0x9d/0x196
[515614.451856][c04293d5] __do_softirq+0x78/0xf2
[515614.476660][c040662f] do_softirq+0x5a/0xbe
[515614.500737] 
[515614.500741] -> #1 (n->lock){-+-+}:
[515614.532763][c043bf43] lock_acquire+0x4b/0x6c
[515614.556814][c06086d0] _write_lock+0x19/0x28
[515614.580398][c05b7a0e] neigh_periodic_timer+0x98/0x13c
[515614.606447][c042db48] run_timer_softirq+0x108/0x167
[515614.631798][c04293d5] __do_softirq+0x78/0xf2
[515614.655122][c040662f] do_softirq+0x5a/0xbe
[515614.677721] 
[515614.677724] -> #0 (tbl->lock){-+-+}:
[515614.707327][c043bf43] lock_acquire+0x4b/0x6c
[515614.729897][c060878a] _read_lock_bh+0x1e/0x2d
[515614.752546][c05b5d63] neigh_lookup+0x50/0xaf
[515614.774754][c05b6e5e] neigh_event_ns+0x2c/0x77
[515614.797271][c05e88c7] arp_process+0x366/0x4e4
[515614.819349][c05e8b3e] parp_redo+0x8/0xa
[515614.839660][c05b7462] neigh_proxy_process+0x66/0xc2
[515614.862931][c042db48] run_timer_softirq+0x108/0x167
[515614.886048][c04293d5] __do_softirq+0x78/0xf2
[515614.907136][c040662f] do_softirq+0x5a/0xbe
[515614.927553] 
[515614.927557] other info that might help us debug this:
[515614.927563] 
[515614.966774] 1 lock held by swapper/0:
[515614.982693]  #0:  (skb_queue_lock_key){-+..}, at: [c05b741c] 
neigh_proxy_process+0x20/0xc2
[515615.013575] 
[515615.013578] stack backtrace:
[515615.037414]  [c04051ea] show_trace_log_lvl+0x54/0xfd
[515615.057910]  [c04057a6] show_trace+0xd/0x10
[515615.075934]  [c04058bf] dump_stack+0x19/0x1b
[515615.094167]  [c043b030] print_circular_bug_tail+0x59/0x64
[515615.116172]  [c043b843] __lock_acquire+0x808/0x997
[515615.136514]  [c043bf43] lock_acquire+0x4b/0x6c
[515615.155699]  [c060878a] _read_lock_bh+0x1e/0x2d
[515615.175098]  [c05b5d63] neigh_lookup+0x50/0xaf
[515615.197276]  [c05b6e5e] neigh_event_ns+0x2c/0x77
[515615.220267]  [c05e88c7] arp_process+0x366/0x4e4
[515615.243248]  [c05e8b3e] parp_redo+0x8/0xa
[515615.264645]  [c05b7462] neigh_proxy_process+0x66/0xc2
[515615.288899]  [c042db48] run_timer_softirq+0x108/0x167
[515615.309972]  [c04293d5] __do_softirq+0x78/0xf2
[515615.328940]  [c040662f] do_softirq+0x5a/0xbe
[515615.347150]  [c042927e] irq_exit+0x3d/0x3f
[515615.365067]  [c0417cbb] smp_apic_timer_interrupt+0x79/0x7e
[515615.387057]  [c0404b0a] apic_timer_interrupt+0x2a/0x30


-- 
http://www.codemonkey.org.uk


Re: RDMA will be reverted

2006-07-25 Thread Rick Jones

David Miller wrote:

From: Rick Jones [EMAIL PROTECTED]
Date: Mon, 24 Jul 2006 17:55:24 -0700


Even enough bits for 1024 or 2048 CPUs in the single system image?  I have seen 
1024 touted by SGI, and with things going so multi-core, perhaps 16384 while 
sounding initially bizarre would be in the realm of the theoretically possible 
before too long.



Read the RSS NDIS documents from Microsoft. 


I'll see about hunting them down.


You aren't going to want
to demux to more than, say, 256 cpus for single network adapter even
on the largest machines.


I suppose, it just seems to tweak _small_ alarms in my intuition - maybe because 
it still sounds like networking telling the scheduler where to run threads of 
execution, and even though I'm a networking guy I seem to have the notion that 
it should be the other way 'round.



That would cover TCP, are there similarly fungible fields in SCTP or
other ULPs?  And if we were to want to get HW support for the thing,
getting it adopted in a de jure standards body would probably be in
order :)



Microsoft never does this, neither do we.  LRO came out of our own
design, the network folks found it reasonable and thus they have
started to implement it.  The same is true for Microsofts RSS stuff.

It's a hardware interpretation, therefore it belongs in a driver API
specification, nowhere else.


It may be a hardware interpretation, but doesn't it have non-trivial system 
implications - where one runs threads/processes etc?


rick jones


Re: What is RDMA

2006-07-24 Thread Rick Jones
That TOE/iWARP could end-up being precluded by NAT seems so ironic from a POE2E 
standpoint.


rick jones

Purity Of End To End


Re: RDMA will be reverted

2006-07-24 Thread Rick Jones
This all sounds like the discussions we had within HP-UX between 10.20 and 11.0 
concerning Inbound Packet Scheduling vs Thread Optimized Packet Scheduling.  IPS 
was done by the 10.20 stack at the handoff between the driver and netisr.  If 
the packet was not an IP datagram fragment, parts of the transport and IP 
headers would be hashed, and the result would be the netisr queue to which the 
packet would be queued for further processing.


It worked fine and dandy for stuff like aggregate netperf TCP_RR tests because 
there was a 1-1 correspondence between a connection and a process/thread.  It 
was OK for the networking to dictate where the process should run.  That feels 
rather like a NIC that would hash packets and pick the MSI-X based on that.


However, as Andi discusses, when there is a process/thread doing more than one 
connection, picking a CPU based on addressing hashing will be like TweedleDee 
and TweedleDum telling Alice to go in opposite directions.  Hence TOPS in 11.X. 
 This time, when there is a normal lookup location in the path, where the 
application last accessed the socket is determined, and things shift-over to 
that CPU.  This then is the process (well actually the scheduler) telling 
networking where it should do its work.


That addresses the multiple connections per thread/process and still works just 
as well for 1-1.  There are still issues if you have mutiple threads/processes 
concurrently accessing the same socket/connection, but that one is much more rare.


Nirvana I suppose would be the addition of a field in the header which could be 
used for the determination of where to process. A Transport Protocol option I 
suppose, maybe the IPv6 flow id, but knuth only knows if anyone would go for 
something along those lines.  It does though mean that the state is per-packet 
without it having to be based on addressing information.  Almost like RDMA 
arriving saying where the data goes, but this thing says where the processing 
should happen :)


rick jones


Re: RDMA will be reverted

2006-07-24 Thread Rick Jones

David Miller wrote:

From: Rick Jones [EMAIL PROTECTED]
Date: Mon, 24 Jul 2006 17:29:05 -0700



Nirvana I suppose would be the addition of a field in the header
which could be used for the determination of where to process. A
Transport Protocol option I suppose, maybe the IPv6 flow id, but
knuth only knows if anyone would go for something along those lines.
It does though mean that the state is per-packet without it having
to be based on addressing information.  Almost like RDMA arriving
saying where the data goes, but this thing says where the processing
should happen :)



Since the full interpretation of the TCP timestamp option field value
is largely local to the peer setting it, there is nothing wrong with
stealing a few bits for destination cpu information.


Even enough bits for 1024 or 2048 CPUs in the single system image?  I have seen 
1024 touted by SGI, and with things going so multi-core, perhaps 16384 while 
sounding initially bizarre would be in the realm of the theoretically possible 
before too long.



It would have to be done in such a way as to not make the PAWS
tests fail by accident.  But I think it's doable.


That would cover TCP, are there similarly fungible fields in SCTP or other ULPs?

And if we were to want to get HW support for the thing, getting it adopted in a 
de jure standards body would probably be in order :)


rick jones


Re: RDMA will be reverted

2006-07-24 Thread Rick Jones

It would have to be done in such a way as to not make the PAWS
tests fail by accident.  But I think it's doable.


CPU ID and higher-order generation number such that whenever the process 
migrates to a lower-numbered CPU, the generation number is bumped to make the 
timestamp larger than before?


rick jones


Re: [PATCH] mark sk98lin driver for removal

2006-07-22 Thread Dave Jones
On Sat, Jul 22, 2006 at 02:11:50PM -0700, Stephen Hemminger wrote:
  The sk98lin driver is now superseded by the skge driver. I wanted to just
  let the old driver wither and die from old age, but there are still bugs
  that are too painful to fix.
  
  See http://bugzilla.kernel.org/show_bug.cgi?id=6780
  The board crashes repeatedly after 2 weeks. It probably is something
  in the vendor MIB code. That code is a mess, and starting over was one
  of the motivations for creating the skge driver.
  
  So rather than add more bondo to the old beater to cover the rusty bits,
  throw it in the dustbin.
  
  Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

After a huge number of bug reports in Fedora 'went away' when we
switched our users to using skge instead, I wholeheartedly endorse this.
sk98lin is a disaster.  The last time I looked the vendor out-of-tree
driver had a huge delta vs mainline, and backed out numerous fixes
made to it in the mainline kernel. It's a huge effort to get the 'good bits'
out of that patch, and letting it die is the only sensible solution IMO.

ACKed-by: Dave Jones [EMAIL PROTECTED]


  +SK98LIN GIGABBIT ETHERNET DRIVER

typo :-)

Dave

-- 
http://www.codemonkey.org.uk


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-21 Thread Rick Jones

All this talk reminds me of one thing, how expensive tcp_ack() is.
And this expense has nothing to do with TCP really.  The main cost is
purging and freeing up the skbs which have been ACK'd in the
retransmit queue.

So tcp_ack() sort of inherits the cost of freeing a bunch of SKBs
which haven't been touched by the cpu in some time and are thus nearly
guaranteed to be cold in the cache.

This is the kind of work we could think about batching to user
sleeping on some socket call.


Ultimately isn't that just trying to squeeze the balloon?

rick jones

nice to see people seeing ACKs as expensive though :)


sch_htb compile fix.

2006-07-15 Thread Dave Jones
net/sched/sch_htb.c: In function 'htb_change_class':
net/sched/sch_htb.c:1605: error: expected ';' before 'do_gettimeofday'

Signed-off-by: Dave Jones [EMAIL PROTECTED]

--- linux-2.6.17.noarch/net/sched/sch_htb.c~2006-07-15 03:40:14.0 
-0400
+++ linux-2.6.17.noarch/net/sched/sch_htb.c 2006-07-15 03:40:21.0 
-0400
@@ -1601,7 +1601,7 @@ static int htb_change_class(struct Qdisc
 	/* set class to be in HTB_CAN_SEND state */
 	cl->tokens = hopt->buffer;
 	cl->ctokens = hopt->cbuffer;
-	cl->mbuffer = PSCHED_JIFFIE2US(HZ*60) /* 1min */
+	cl->mbuffer = PSCHED_JIFFIE2US(HZ*60); /* 1min */
 	PSCHED_GET_TIME(cl->t_c);
 	cl->cmode = HTB_CAN_SEND;
-- 
http://www.codemonkey.org.uk


Re: I/O Acceleration Technology Nics

2006-07-14 Thread Rick Jones

Ian Brown wrote:

Hello,
 I came across the e1000 download for linux in intel site.
I saw that in the readme they talk about Intel(R) I/O Acceleration 
Technology;

According to this readme , there is support for systems using the
Intel(R) 5000 Series Chipsets Integrated Device - 1A38.
see:
http://downloadmirror.intel.com/df-support/9180/ENG/README.txt

My question is: did anybody try using chipsets with this I/O
Acceleration Technology?  Did they get a significant performance
improvement over non I/O Accelerated NICs?


IIRC, there were some measurements made and discussed at least a little in 
netdev.  A search of the archives should find them.


I would also expect that Intel would have some glossy PDF's on their 
site touting the technology's performance boosts :)  They should at least 
somewhere have some links to actual measurements...



Ian
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


I suspect the URL above there will start one on the path to the email 
archive.


rick jones

