Re: [ofa-general] Re: IPoIB forwarding
Loic Prylli wrote:

On 4/30/2007 2:12 PM, Rick Jones wrote: Speaking of defaults, it would seem that the external 1.2.0 driver comes with 9000 bytes as the default MTU? At least I think that is what I am seeing now that I've started looking more closely. rick jones

That's the same for the in-kernel-tree code (9K MTU by default). Assuming this is not wanted, I will submit a patch for that.

While I like what that does for performance, and at the risk of putting words into the mouths of netdev, I suspect that 1500 bytes is indeed the desired default. It matches the IEEE specs, I've yet to see a switch which enables Jumbo Frames by default, and not everything out there even agrees that Jumbo Frames means a 9000-byte MTU, etc. etc. I think that 1500 bytes for an Ethernet device remains in line with the principle of least surprise.

rick jones

- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: IPoIB forwarding
What version of the myri10ge driver is this? With the 1.2.0 version that comes with the 2.6.20.7 kernel, there is no myri10ge_lro module parameter.

[EMAIL PROTECTED] ~]# modinfo myri10ge | grep -i lro
[EMAIL PROTECTED] ~]#

And I've been testing IP forwarding using two Myricom 10-GigE NICs without setting any special modprobe parameters. ethtool -i on the interface reports 1.2.0 as the driver version.
Re: [ofa-general] Re: IPoIB forwarding
David Miller wrote:

From: Rick Jones [EMAIL PROTECTED] Date: Fri, 27 Apr 2007 16:48:00 -0700

No problem - just to play what-if/devil's advocate for a bit though... is there any way to tie that in with the setting of net.ipv4.ip_forward (and/or its IPv6 counterpart)?

Even ignoring that, consider the potential issues this kind of problem could be causing netfilter.

OK, I'll show my ignorance and bite - what sort of issues with netfilter? Is it tied to link-local MTUs?

rick jones
Re: [ofa-general] Re: IPoIB forwarding
Only the 1.2.0 version of the external driver makes LRO incompatible with forwarding. The problem should be fixed in version 1.3.0, released a few weeks ago (forwarding with myri10ge_lro enabled should then work); let us know otherwise. Anyway, following David Miller's remark about netfilter, for the next version we might ask the user to explicitly enable LRO rather than making it the default.

Speaking of defaults, it would seem that the external 1.2.0 driver comes with 9000 bytes as the default MTU? At least I think that is what I am seeing now that I've started looking more closely.

rick jones
Re: [ofa-general] Re: IPoIB forwarding
Bryan Lawver wrote: You're right about the ipoib module not combining packets (I believed you without checking, but I did check nevertheless). The ipoib_start_xmit routine is definitely handed a double packet, which means that the IP NIC driver or the kernel is combining two packets into a single super jumbo packet. This issue is irrespective of the IP MTU setting, because I have set all interfaces to 9000 yet ipoib accepts and forwards this 17964-byte packet to the next IB node and onto the TCP stack, where it is never acknowledged. This may not have come up in prior testing because I am using some of the fastest IP NICs, which have no trouble keeping up with or exceeding the bandwidth of the IB side. This issue arises exactly every 8 packets... (ring buffer overrun??) I will be at Sonoma for the next few days, as many on this list will be.

Some NICs (especially 10G) support large receive offload - they coalesce TCP segments from the wire/fiber into larger ones they pass up the stack. Perhaps that is happening here? I'm going to go out a bit on a limb, cross the streams, and include netdev, because I suspect that if a system is acting as an IP router, one doesn't want large receive offload enabled. That may need some discussion in netdev - it may then require some changes to default settings or some documentation enhancements. That, or I'll learn that the stack is already dealing with the issue...

rick jones

bryan

At 11:06 AM 4/26/2007, Michael S. Tsirkin wrote: Quoting Bryan Lawver [EMAIL PROTECTED]: Subject: Re: IPoIB forwarding Here's a tcpdump of the same sequence. The TCP MSS is 8960 and it appears that two payloads are queued at ipoib, which combines them into a single 17920-byte payload with presumably correct IP header (40) and IB header (4). The application or TCP stack does not acknowledge this double packet, i.e. it does not ACK until each of the 8960-byte packets is resent individually. Being an IB newbie, I am guessing this combining is allowable but may violate TCP protocol.

IPoIB does nothing like this - it's just a network device, so it sends all packets out as-is. -- MST

___ general mailing list [EMAIL PROTECTED] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
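The numbers in the trace line up with the LRO hypothesis exactly: two MSS-sized payloads coalesced into one, plus the headers described above. A quick back-of-the-envelope check (values taken straight from the trace descriptions, arithmetic only):

```python
MSS = 8960           # TCP MSS reported in the tcpdump
IP_TCP_HDR = 40      # the "IP header (40)" from the trace description
IB_HDR = 4           # the "IB header (4)" encapsulation overhead

coalesced_payload = 2 * MSS                          # two segments merged by LRO
super_jumbo = coalesced_payload + IP_TCP_HDR + IB_HDR
print(coalesced_payload, super_jumbo)                # 17920 17964
```

Both results match the sizes observed independently in the two reports (the 17920-byte payload in the tcpdump and the 17964-byte packet seen by ipoib_start_xmit), which is what makes segment coalescing the prime suspect.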
Re: [ofa-general] Re: IPoIB forwarding
Bryan Lawver wrote: I hit the IP NIC over the head with a hammer and turned off all offload features, and I no longer get the super jumbo packet and I have symmetric performance. This NIC supported ethtool -K ethx tso/tx/rx/sg on/off and I am not sure at this time which one I needed to whack, but all off solved the problem.

Yeah, that does seem like a rather broad remedy, but I guess if it works... :) And I suppose most of those offloads don't matter for a NIC being used in a router. The only problem is we don't know whether it worked because it slowed down the 10G side or because it had disabling LRO as a side effect. If I were to guess, of the things listed, I'd guess that receive cko would have that as a side effect. Just what sort of 10G NIC was this anyway? With that knowledge we could probably narrow things down to a more specific modprobe setting, or maybe even an ethtool command, for some suitable revision of ethtool.

rick jones

Thanks for listening and reinforcing my search process. bryan

At 01:32 PM 4/27/2007, Rick Jones wrote: [snip - quoted text of the earlier message in this thread]
Re: [ofa-general] Re: IPoIB forwarding
Bryan Lawver wrote: I had so much debugging turned on that it was not the slowing of the traffic but the non-coalescing that was the remedy. The NIC is a Myricom NIC and these are easy options to set.

As chance would have it, I've played with some Myricom myri10ge NICs recently, and even disabled large receive offload during some netperf tests :) It is a modprobe option. Going back now to the driver source and the README I see :-)

excerpt:

Troubleshooting
===
Large Receive Offload (LRO) is enabled by default. This will interfere with forwarding TCP traffic. If you plan to forward TCP traffic (using the host with the Myri10GE NIC as a router or bridge), you must disable LRO. To disable LRO, load the myri10ge driver with myri10ge_lro set to 0:
# modprobe myri10ge myri10ge_lro=0
Alternatively, you can disable LRO at runtime by disabling receive checksum offloading via ethtool:
# ethtool -K eth2 rx off

/excerpt

rick jones
Re: [ofa-general] Re: IPoIB forwarding
David Miller wrote:

From: Rick Jones [EMAIL PROTECTED] Date: Fri, 27 Apr 2007 16:37:49 -0700

Large Receive Offload (LRO) is enabled by default. This will interfere with forwarding TCP traffic. If you plan to forward TCP traffic (using the host with the Myri10GE NIC as a router or bridge), you must disable LRO. To disable LRO, load the myri10ge driver with myri10ge_lro set to 0:

LRO should be disabled by default if the driver does this. This is a major and unacceptable bug. Thanks for pointing this out Rick.

No problem - just to play what-if/devil's advocate for a bit though... is there any way to tie that in with the setting of net.ipv4.ip_forward (and/or its IPv6 counterpart)?

rick jones
Re: Unexpected latency (order of 100-200 ms) with TCP (packet receive)
Ilpo Järvinen wrote: Hi, ... Some time ago I noticed that with 2.6.18 I occasionally get latency spikes as long as 100-200 ms in the TCP transfers between components (I describe later how TCP was tuned during these tests to avoid problems that occur with small segments). I started to investigate the spikes, and here are the findings so far: ...

- I placed a hub to get exact timings on the wire without potential interference from tcpdump on the emulator host (test done with 2.6.18), but to my great surprise, the problem vanished completely

Sounds like tcpdump getting in the way? How many CPUs do you have in the system, and have you tried some explicit binding of processes to different CPUs? (taskset etc...) When running tcpdump are you simply sending raw traces to a file, or are you having the ASCII redirected to a file? What about name resolution (i.e. are you using -n)?

- Due to the hub test result, I tested 10/100/duplex settings and found out that if the emulator host has 10fd, the problem does not occur with 2.6 either?!? This could be due to luck, but I cannot say for sure; yet the couple of tests I've run with 10fd did not show this...
- Tried to change the cable/switch that connects the hosts together, no effect

To prove this with 100Mbps, I set up routing so that with a host with 10/FD configuration (known, based on the earlier results, to be unlikely to cause errors) I collected all traffic between the emulator host and one of the packet capture hosts. Here is one example point where a long delay occurs (EMU is the emulator host; in the real log each packet is shown twice, I only leave the later one here):

1177577267.364596 IP CAP.35305 > EMU.52246: . 17231434:17232894(1460) ack 383357 win 16293
1177577267.364688 IP CAP.35305 > EMU.52246: P 17232894:17232946(52) ack 383357 win 16293
1177577267.366093 IP EMU.52246 > CAP.35305: . ack 17232894 win 32718
1177577267.493815 IP EMU.52246 > CAP.35305: P 383357:383379(22) ack 17232894 win 32718
1177577267.534252 IP CAP.35305 > EMU.52246: . ack 383379 win 16293

What is the length of the standalone ACK timer these days?

1177577267.534517 IP EMU.59050 > CAP.58452: P 624496:624528(32) ack 328 win 365
1177577267.534730 IP CAP.58452 > EMU.59050: . ack 624528 win 16293
1177577267.536267 IP CAP.35305 > EMU.52246: . 17232946:17234406(1460) ack 383379 win 16293
1177577267.536360 IP CAP.35305 > EMU.52246: P 17234406:17234458(52) ack 383379 win 16293
1177577267.537764 IP EMU.52246 > CAP.35305: . ack 17234406 win 32718
...

All things use TCP_NODELAY. The network is isolated so that no other traffic can cause unexpected effects. The emulator collects its log only to a memory buffer that is flushed through TCP only between tests (and thus does not cause timing problems).

Might tcp_abc have crept back in?

rick jones
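The roughly 40 ms gap before the standalone ACK at 1177577267.534252 is at least consistent with a delayed-ACK timer firing. The connections themselves disable Nagle with TCP_NODELAY, which can be set and verified on any TCP socket like this (a generic sketch, not code from the thread):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle so small writes (like the 22- and 52-byte pushes in the
# trace) are sent immediately instead of waiting for prior data to be ACKed.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = bool(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
print(nodelay)  # True
s.close()
```

Note that TCP_NODELAY only affects the sender's decision to transmit; it does nothing about the receiver's delayed-ACK behavior, which is why the standalone ACK timer question above still matters.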
Re: net-2.6.22 UDP stalls/hangs
Oh well, one thing at a time. The good news is that I can reproduce the problem with netperf.

kpm:/usr/src/netperf-2.4.3 netperf -H akpm2 -t UDP_RR
UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to akpm2 (172.18.116.155) port 0 AF_INET
netperf: receive_response: no response received. errno 0 counter 0

That's running netserver on the test machine. The machine running netperf is 172.18.116.160 and the test machine running netserver is 172.18.116.155. tcpdump from the test machine:

15:24:37.924210 802.1d config 8000.00:18:74:5d:04:66.80ae root 0066.00:15:c7:20:57:c0 pathcost 4 age 1 max 20 hello 2 fdelay 15
15:24:38.859309 IP 172.18.119.252.hsrp > 224.0.0.2.hsrp: HSRPv0-hello 20: state=standby group=1 addr=172.18.119.254
15:24:39.078273 IP 172.18.119.253.hsrp > 224.0.0.2.hsrp: HSRPv0-hello 20: state=active group=1 addr=172.18.119.254
15:24:39.924074 802.1d config 8000.00:18:74:5d:04:66.80ae root 0066.00:15:c7:20:57:c0 pathcost 4 age 1 max 20 hello 2 fdelay 15
15:24:40.017081 IP 172.24.0.7.domain > 172.18.116.57.37456: 59635 4/7/6 CNAME[|domain]
15:24:41.383433 IP 172.18.116.160.33137 > 172.18.116.155.12865: S 2760291763:2760291763(0) win 5840 <mss 1460,sackOK,timestamp 1967355840 0,nop,wscale 8>
15:24:41.383479 IP 172.18.116.155.12865 > 172.18.116.160.33137: S 1640262480:1640262480(0) ack 2760291764 win 5792 <mss 1460,sackOK,timestamp 7714 1967355840,nop,wscale 7>
15:24:41.383683 IP 172.18.116.160.33137 > 172.18.116.155.12865: . ack 1 win 23 <nop,nop,timestamp 1967355840 7714>
15:24:41.383883 IP 172.18.116.160.33137 > 172.18.116.155.12865: P 1:257(256) ack 1 win 23 <nop,nop,timestamp 1967355840 7714>
15:24:41.383902 IP 172.18.116.155.12865 > 172.18.116.160.33137: . ack 257 win 54 <nop,nop,timestamp 7714 1967355840>
15:24:41.384065 IP 172.18.116.155.12865 > 172.18.116.160.33137: P 1:257(256) ack 257 win 54 <nop,nop,timestamp 7714 1967355840>
15:24:41.587266 IP 172.18.116.155.12865 > 172.18.116.160.33137: P 1:257(256) ack 257 win 54 <nop,nop,timestamp 7765 1967355840>
15:24:41.839234 IP 172.18.119.252.hsrp > 224.0.0.2.hsrp: HSRPv0-hello 20: state=standby group=1 addr=172.18.119.254
15:24:41.924303 802.1d config 8000.00:18:74:5d:04:66.80ae root 0066.00:15:c7:20:57:c0 pathcost 4 age 1 max 20 hello 2 fdelay 15
15:24:41.995285 IP 172.18.116.155.12865 > 172.18.116.160.33137: P 1:257(256) ack 257 win 54 <nop,nop,timestamp 7867 1967355840>
15:24:42.030341 IP 172.18.119.253.hsrp > 224.0.0.2.hsrp: HSRPv0-hello 20: state=active group=1 addr=172.18.119.254
15:24:42.811330 IP 172.18.116.155.12865 > 172.18.116.160.33137: P 1:257(256) ack 257 win 54 <nop,nop,timestamp 8071 1967355840>
15:24:43.924183 802.1d config 8000.00:18:74:5d:04:66.80ae root 0066.00:15:c7:20:57:c0 pathcost 4 age 1 max 20 hello 2 fdelay 15
15:24:44.121880 IP 172.24.0.7.domain > 172.18.116.22.46700: 52073* 1/4/4 A[|domain]
15:24:44.443419 IP 172.18.116.155.12865 > 172.18.116.160.33137: P 1:257(256) ack 257 win 54 <nop,nop,timestamp 8479 1967355840>
15:24:44.723257 IP 172.18.119.252.hsrp > 224.0.0.2.hsrp: HSRPv0-hello 20: state=standby group=1 addr=172.18.119.254
15:24:44.886356 IP 172.18.119.253.hsrp > 224.0.0.2.hsrp: HSRPv0-hello 20: state=active group=1 addr=172.18.119.254
15:24:45.924263 802.1d config 8000.00:18:74:5d:04:66.80ae root 0066.00:15:c7:20:57:c0 pathcost 4 age 1 max 20 hello 2 fdelay 15
15:24:47.659300 IP 172.18.119.252.hsrp > 224.0.0.2.hsrp: HSRPv0-hello 20: state=standby group=1 addr=172.18.119.254
15:24:47.707599 IP 172.18.116.155.12865 > 172.18.116.160.33137: P 1:257(256) ack 257 win 54 <nop,nop,timestamp 9295 1967355840>
15:24:47.874419 IP 172.18.119.253.hsrp > 224.0.0.2.hsrp: HSRPv0-hello 20: state=active group=1 addr=172.18.119.254
15:24:47.952350 802.1d config 8000.00:18:74:5d:04:66.80ae root 0066.00:15:c7:20:57:c0 pathcost 4 age 1 max 20 hello 2 fdelay 15
15:24:48.037569 IP 172.24.0.7.domain > 172.18.117.18.46665: 59092 2/7/6 CNAME[|domain]

So I think we did a bit of TCP chatter then no UDP at all?

Looks that way, and on top of it, got no results back from netserver on the control (TCP, port 12865) connection. Adding some -d's to the global options will cause netperf to regurgitate what messages it is sending and such. I'd have expected that even if no UDP traffic could flow between netperf and netserver, the timer running in the netserver _should_ have gotten it out of the recv()/recvfrom() call in recv_udp_rr() (src/nettest_bsd.c), and that netperf would then report a normal result of just 0 transactions per second. Either that timer didn't get set, didn't fire, or was insufficient to get netserver out of that recv() on the UDP socket, or comms between the two systems got fubar for TCP too.

rick jones
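The watchdog described above is meant to break netserver out of the blocking receive so a failed test can still end with a zero-transaction result instead of hanging. The effect can be sketched with a plain socket timeout (illustrative only; netperf itself uses its own timer mechanism, not a per-socket timeout):

```python
import socket

# UDP socket bound to loopback; nothing will ever send to it, so a
# recvfrom() without a timeout would block forever, like the hung test.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.bind(("127.0.0.1", 0))
s.settimeout(0.2)  # the watchdog: bound how long we wait for a datagram

try:
    s.recvfrom(2048)
    result = "datagram"
except socket.timeout:
    # Control returns here, so the caller can report 0 transactions
    # per second rather than hanging in the receive call.
    result = "timed out"
print(result)
s.close()
```

If the equivalent timer in netserver never fires (or fails to interrupt the recv()), you get exactly the symptom in this report: netperf waiting forever for a response that will never come.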
Re: [1/2] 2.6.21-rc7: known regressions
On Mon, Apr 16, 2007 at 05:14:40PM -0700, Brandeburg, Jesse wrote: Adrian Bunk wrote: Subject: laptops with e1000: lockups References: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=229603 Submitter: Dave Jones [EMAIL PROTECTED] Handled-By: Jesse Brandeburg [EMAIL PROTECTED] Status: problem is being debugged

this is being actively debugged, here is what we have so far:
o v2.6.20: crashes during boot, unless noacpi and nousb bootparams used
o v2.6.21-rc6: some userspace issue, crashes just after root mount without init=/bin/bash
o v2.6.2X: serial console in docking station spews goo at all speeds with console=ttyS0,n8. work continues on this, as we don't know if there are kernel panic messages during the hard lock.
o fedora 7 test kernel 2948: boots okay, have been using this as the only truly working kernel on this machine. one reproduction of the problem was had with scp -l 5000 file remote when linked at 100Mb/Full. Tried probably 20 other times the same test with no repro, ugh.

Otherwise, slogging through continues. We are actively working on this in case it *is* an e1000 issue. Right now the repro is so unlikely we could hardly tell if we fixed it.

FWIW, I can reproduce this pretty much on demand, on 100M through the ethernet port on a netgear wireless AP. A number of our Fedora 7 testers are also able to easily reproduce this. To isolate e1000, for tomorrow's test build I've reverted e1000 to the same code that was in 2.6.20. If that works out without causing hangs, I'll try and narrow down further which of the dozen csets is responsible.

Dave
--
http://www.codemonkey.org.uk
2.6.21rc7 e1000 media-detect oddness.
I booted up 2.6.21rc7 without an ethernet cable plugged in, and noticed this..

e1000: :02:00.0: e1000_probe: The EEPROM Checksum Is Not Valid
e1000: probe of :02:00.0 failed with error -5

I plugged a cable in, did rmmod e1000; modprobe e1000, and got this..

e1000: :02:00.0: e1000_probe: (PCI Express:2.5Gb/s:Width x1) 00:16:d3:3a:62:d3
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX
e1000: eth0: e1000_watchdog: 10/100 speed: disabling TSO

and it works fine.. Why would no cable make it think the EEPROM is invalid? I repeated this a few times, just to be sure it wasn't a fluke, and it seems to happen 100% reproducibly.

Dave
--
http://www.codemonkey.org.uk
Re: [RFT] proxy arp deadlock possible
On Wed, Apr 04, 2007 at 06:10:42PM -0700, Arjan van de Ven wrote: On Thu, 2007-04-05 at 10:44 +1000, Herbert Xu wrote: Stephen Hemminger [EMAIL PROTECTED] wrote: Thanks Dave, there is a classic AB BA deadlock here. We should break the dependency like this. Could someone who uses proxy ARP test this?

Sorry Stephen, this isn't necessary. The lockdep thing is simply confused here. It's treating tbl->proxy_queue as the same thing as neigh->arp_queue when they're clearly different. I'm disappointed that after all this time lockdep is still producing bogus reports like this. I'm sure we've been through this particular issue many times already.

what's the exact lockdep output here?

http://www.mail-archive.com/netdev@vger.kernel.org/msg35266.html

Dave
--
http://www.codemonkey.org.uk
Re: [PATCH] ethtool: additional 10Gig niceness
applied

Thanks. One thing I noticed while making the changes is that the reported speed is kept in a u16. With 10G we are already 1/6 of the way to the maximum. I've no idea when 100G will arrive, but euros to berliners it will probably arrive some day, which means something will have to give. I've not thought it through completely, but my initial reaction would be to suggest just making the thing a 64-bit quantity reporting bits and not worrying about it again. And then one doesn't have to worry if ethtool starts being applied to links which do not run at integral multiples of a Mbit/s.

rick jones
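The arithmetic behind that worry, with speeds in Mb/s as ethtool reports them (a sketch of the headroom problem, not ethtool source): 10000 Mb/s still fits a u16, but 100000 Mb/s would silently wrap.

```python
U16_MAX = 0xFFFF  # largest speed, in Mb/s, a 16-bit field can report

for speed_mbps in (10, 100, 1000, 10_000, 100_000):
    fits = speed_mbps <= U16_MAX
    stored = speed_mbps & 0xFFFF  # what a u16 would actually end up holding
    print(f"{speed_mbps:>6} Mb/s  fits={fits}  stored={stored}")
```

A hypothetical 100G link would be reported as 34464 Mb/s after truncation, which is why widening the field (or reporting bits in a 64-bit quantity, as suggested above) has to happen before such speeds exist.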
lockdep report from 2.6.20.5-rc1
===
[ INFO: possible circular locking dependency detected ]
2.6.20-1.2933.fc6debug #1
---
swapper/0 is trying to acquire lock:
 (tbl->lock){-+-+}, at: [c05d5664] neigh_lookup+0x43/0xa2

but task is already holding lock:
 (list->lock#4){-+..}, at: [c05d65c8] neigh_proxy_process+0x20/0xc2

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (list->lock#4){-+..}:
 [c043f4f2] __lock_acquire+0x913/0xa43
 [c043f933] lock_acquire+0x56/0x6f
 [c06325df] _spin_lock_irqsave+0x34/0x44
 [c05cbb87] skb_dequeue+0x12/0x43
 [c05cc9d4] skb_queue_purge+0x14/0x1b
 [c05d5b70] neigh_update+0x349/0x3a5
 [c060cd37] arp_process+0x4d1/0x50a
 [c060ce53] arp_rcv+0xe3/0x100
 [c05d0e43] netif_receive_skb+0x2db/0x35a
 [c05d2806] process_backlog+0x95/0xf6
 [c05d29ed] net_rx_action+0xa1/0x1a8
 [c042c1f6] __do_softirq+0x6f/0xe2
 [c04063c6] do_softirq+0x61/0xd0
 [] 0x

-> #1 (n->lock){-+-+}:
 [c043f4f2] __lock_acquire+0x913/0xa43
 [c043f933] lock_acquire+0x56/0x6f
 [c063231e] _write_lock+0x2b/0x38
 [c05d74b7] neigh_periodic_timer+0x99/0x138
 [c042f053] run_timer_softirq+0x104/0x168
 [c042c1f6] __do_softirq+0x6f/0xe2
 [c04063c6] do_softirq+0x61/0xd0
 [] 0x

-> #0 (tbl->lock){-+-+}:
 [c043f3f3] __lock_acquire+0x814/0xa43
 [c043f933] lock_acquire+0x56/0x6f
 [c06323d6] _read_lock_bh+0x30/0x3d
 [c05d5664] neigh_lookup+0x43/0xa2
 [c05d6317] neigh_event_ns+0x2c/0x7a
 [c060cbec] arp_process+0x386/0x50a
 [c060ce78] parp_redo+0x8/0xa
 [c05d660e] neigh_proxy_process+0x66/0xc2
 [c042f053] run_timer_softirq+0x104/0x168
 [c042c1f6] __do_softirq+0x6f/0xe2
 [c04063c6] do_softirq+0x61/0xd0
 [] 0x

other info that might help us debug this:

1 lock held by swapper/0:
 #0: (list->lock#4){-+..}, at: [c05d65c8] neigh_proxy_process+0x20/0xc2

stack backtrace:
 [c04051dd] show_trace_log_lvl+0x1a/0x2f
 [c0405782] show_trace+0x12/0x14
 [c0405806] dump_stack+0x16/0x18
 [c043dcf5] print_circular_bug_tail+0x5f/0x68
 [c043f3f3] __lock_acquire+0x814/0xa43
 [c043f933] lock_acquire+0x56/0x6f
 [c06323d6] _read_lock_bh+0x30/0x3d
 [c05d5664] neigh_lookup+0x43/0xa2
 [c05d6317] neigh_event_ns+0x2c/0x7a
 [c060cbec] arp_process+0x386/0x50a
 [c060ce78] parp_redo+0x8/0xa
 [c05d660e] neigh_proxy_process+0x66/0xc2
 [c042f053] run_timer_softirq+0x104/0x168
 [c042c1f6] __do_softirq+0x6f/0xe2
 [c04063c6] do_softirq+0x61/0xd0
===

--
http://www.codemonkey.org.uk
[PATCH] ethtool: additional 10Gig niceness
teach ethtool to print 10000Mb/s for a 10G NIC and prepare for 10G NICs where it is possible to run something other than 10G

update the ethtool.8 manpage with info re same and some grammar fixes

Signed-off-by: Rick Jones [EMAIL PROTECTED]

the likely required asbestos at the ready :)

rick jones

From c58b73af0744a3d5dbc4fbf23c7b4d6f9092d21a Mon Sep 17 00:00:00 2001
From: Rick Jones [EMAIL PROTECTED]
Date: Mon, 2 Apr 2007 13:45:53 -0700
Subject: [PATCH] ethtool: additional 10Gig niceness

teach ethtool to print 10000Mb/s for a 10G NIC and prepare for 10G
NICs where it is possible to run something other than 10G

update the ethtool.8 manpage with info re same and some grammar fixes

Signed-off-by: Rick Jones [EMAIL PROTECTED]
---
 ethtool.8 | 107 +++-
 ethtool.c | 20 ++-
 2 files changed, 73 insertions(+), 54 deletions(-)

diff --git a/ethtool.8 b/ethtool.8
index d247d51..d6561bf 100644
--- a/ethtool.8
+++ b/ethtool.8
@@ -175,7 +175,7 @@ ethtool \- Display or change ethernet card settings
 .B ethtool \-s
 .I ethX
-.B3 speed 10 100 1000
+.B4 speed 10 100 1000 10000
 .B2 duplex half full
 .B4 port tp aui bnc mii fibre
 .B2 autoneg on off
@@ -193,65 +193,65 @@ ethtool \- Display or change ethernet card settings
 is used for querying settings of an ethernet device and changing them.
 .I ethX
-is the name of the ethernet device to work on.
+is the name of the ethernet device on which ethtool should operate.
 .SH OPTIONS
 .B ethtool
 with a single argument specifying the device name prints current
-setting of the specified device.
+settings of the specified device.
 .TP
 .B \-h \-\-help
-shows a short help message.
+Shows a short help message.
 .TP
 .B \-a \-\-show\-pause
-queries the specified ethernet device for pause parameter information.
+Queries the specified ethernet device for pause parameter information.
 .TP
 .B \-A \-\-pause
-change the pause parameters of the specified ethernet device.
+Changes the pause parameters of the specified ethernet device.
 .TP
 .A2 autoneg on off
-Specify if pause autonegotiation is enabled.
+Specifies whether pause autonegotiation should be enabled.
 .TP
 .A2 rx on off
-Specify if RX pause is enabled.
+Specifies whether RX pause should be enabled.
 .TP
 .A2 tx on off
-Specify if TX pause is enabled.
+Specifies whether TX pause should be enabled.
 .TP
 .B \-c \-\-show\-coalesce
-queries the specified ethernet device for coalescing information.
+Queries the specified ethernet device for coalescing information.
 .TP
 .B \-C \-\-coalesce
-change the coalescing settings of the specified ethernet device.
+Changes the coalescing settings of the specified ethernet device.
 .TP
 .B \-g \-\-show\-ring
-queries the specified ethernet device for rx/tx ring parameter information.
+Queries the specified ethernet device for rx/tx ring parameter information.
 .TP
 .B \-G \-\-set\-ring
-change the rx/tx ring parameters of the specified ethernet device.
+Changes the rx/tx ring parameters of the specified ethernet device.
 .TP
 .BI rx \ N
-Change number of ring entries for the Rx ring.
+Changes the number of ring entries for the Rx ring.
 .TP
 .BI rx-mini \ N
-Change number of ring entries for the Rx Mini ring.
+Changes the number of ring entries for the Rx Mini ring.
 .TP
 .BI rx-jumbo \ N
-Change number of ring entries for the Rx Jumbo ring.
+Changes the number of ring entries for the Rx Jumbo ring.
 .TP
 .BI tx \ N
-Change number of ring entries for the Tx ring.
+Changes the number of ring entries for the Tx ring.
 .TP
 .B \-i \-\-driver
-queries the specified ethernet device for associated driver information.
+Queries the specified ethernet device for associated driver information.
 .TP
 .B \-d \-\-register\-dump
-retrieves and prints a register dump for the specified ethernet device.
+Retrieves and prints a register dump for the specified ethernet device.
 The register format for some devices is known and decoded others
 are printed in hex.
 When
 .I raw
-is enabled, then it dumps the raw register data to stdout.
+is enabled, then ethtool dumps the raw register data to stdout.
 If
 .I file
 is specified, then use contents of previous raw register dump, rather
@@ -259,7 +259,7 @@ than reading from the device.
 .TP
 .B \-e \-\-eeprom\-dump
-retrieves and prints an EEPROM dump for the specified ethernet device.
+Retrieves and prints an EEPROM dump for the specified ethernet device.
 When raw is enabled, then it dumps the raw EEPROM data to stdout. The
 length and offset parameters allow dumping certain portions of the EEPROM.
 Default is to dump the entire EEPROM.
@@ -271,31 +271,31 @@ of writing to the EEPROM, a device-specific
 magic key must be specified to prevent the accidental writing to the
 EEPROM.
 .TP
 .B \-k \-\-show\-offload
-queries the specified ethernet device for offload information.
+Queries the specified ethernet device for offload information.
 .TP
 .B \-K \-\-offload
-change the offload
NIC data corruption
I changed the title to be more accurate, and culled the distribution to individuals and netdev.

The mention of trying to turn off CKO and see if the data corruption goes away leads me to ask a possibly delicate question: should Linux only enable CKO on those NICs certified to have ECC/parity throughout their _entire_ data path?

rick jones
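For context, CKO here is checksum offload: the NIC computes and verifies the TCP/IP checksum, so corruption inside a NIC's own data path after the checksum point can go undetected. When offload is distrusted, the checksum is computed in software instead; the core of that is the RFC 1071 Internet checksum, a minimal version of which looks like this (a sketch, not the kernel's implementation):

```python
def inet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: one's-complement sum of 16-bit words,
    carries folded back in, final result complemented."""
    if len(data) % 2:               # pad odd-length input with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry
    return ~total & 0xFFFF

# Worked example from RFC 1071 section 3: checksum is 0x220d
print(hex(inet_checksum(bytes.fromhex("0001f203f4f5f6f7"))))  # 0x220d
```

The point of the question above is that this end-to-end check is only as trustworthy as the hardware it runs on: if the NIC corrupts data after validating (or before generating) the checksum, nothing downstream notices.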
Re: [PATCH] NET: Add TCP connection abort IOCTL
If the switchover from active to standby is commanded, then there is the opportunity to tell the applications on the server to close their connections - either explicitly with some sort of defined interface, or implicitly by killing the processes. Then the IP can be brought up on the standby, processes started/enabled/whatever, and the clients can establish their new connections.

The ioctl here (at least if it is like the tcp_discon options in HP-UX/Solaris) wouldn't be any better than just killing the process in so far as what happens on the network - in fact, it could be worse, since the RST will not be retransmitted if lost, but FINs would be. So, the ioctl could still leave clients twisting in the ether waiting for their application-level heartbeats to kick in anyway. Heck, depending on their heartbeat lengths, even the FIN stuff, if lost, could leave them depending on their heartbeats.

If the switchover from active to standby is uncommanded, it probably means the primary went belly-up, which means you don't have the opportunity to make an ioctl call anyway, and you are back to the heartbeats.

rick jones
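The RST-versus-FIN distinction above can be observed from ordinary sockets without any special ioctl: an abortive close via SO_LINGER with a zero linger time emits an RST instead of a FIN. A generic sketch over loopback (illustrative only, not the proposed interface):

```python
import socket
import struct

# Loopback listener; the OS picks a free port.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())
conn, _ = srv.accept()

# l_onoff=1, l_linger=0: close() discards pending data and sends RST, not FIN.
conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
conn.close()

try:
    cli.recv(100)
    result = "orderly close"     # a FIN would show up here as EOF
except ConnectionResetError:
    result = "connection reset"  # the RST; were it lost, nothing resends it
print(result)
cli.close()
srv.close()
```

This is exactly why the abort is no kinder to the remote peer than killing the process: the RST is fire-and-forget, while a FIN sits in the retransmission machinery until acknowledged.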
Re: L2 network namespace benchmarking
If I read the results right it took a 32-bit machine from AMD with a gigabit interface before you could measure a throughput difference. That isn't shabby for a non-optimized code path. Just some paranoid ramblings - one needs to look beyond just whether or not the performance of a bulk transfer test (eg TCP_STREAM) remains able to hit link-rate. One has to also consider the change in service demand (the normalization of CPU util and throughput). Also, with functionality like TSO in place, the ability to pass very large things down the stack can help cover for a multitude of path-length sins. And with either multiple 1G or 10G NICs becoming more and more prevalent, we have another one of those NIC speed vs CPU speed switch-overs, so maintaining single-NIC 1 gigabit throughput, while necessary, isn't (IMO) sufficient. So, it becomes very important to go beyond just TCP_STREAM tests when evaluating these sorts of things. Another test to run would be the TCP_RR test. TCP_RR with single-byte request/response sizes will bypass the TSO stuff, and the transaction rate will be more directly affected by the change in path length than a TCP_STREAM test. It will also show-up quite clearly in the service demand. Now, with NICs doing interrupt coalescing, if the NIC is strapped poorly (IMO) then you may not see a change in transaction rate - it may be getting limited artificially by the NIC's interrupt coalescing. So, one has to fall-back on service demand, or better yet, disable the interrupt coalescing. Otherwise, measuring peak aggregate request/response becomes necessary. rick jones don't be blinded by bit-rate
Re: L2 network namespace benchmarking
Do you have any pointer to help on benchmarking the network, perhaps a checklist or some scripts for netperf ? There are some scripts in doc/examples but they are probably a bit long in the tooth by now. The main writeup _I_ have on netperf would be the manual, which was recently updated for the 2.4.3 release. http://www.netperf.org/svn/netperf2/tags/netperf-2.4.3/doc/netperf.html or the current top of trunk: http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html There is also a [EMAIL PROTECTED] mailing list which one can join and have discussions about netperf, and a [EMAIL PROTECTED] if one wants to discuss actual netperf (netperf2 or netperf4) development. rick jones
has ethtool been patched for 10 gig speed reporting?
I have some 10gig nics and ethtool is reporting unknown for the speed. Is there already a patch, or have I found an opportunity? FWIW, the version five bits from sourceforge still report unknown - are they the latest or are there later bits somewhere? thanks, rick jones
Re: [PATCH] NET: Add TCP connection abort IOCTL
There is no reason for this ioctl at all. Either existing facilities provide what you need or what you want is a protocol violation we can't do. I agree that 99 times out of ten such a mechanism serves only as a massive KLUDGE to paper-over application bugs. I'll also sadly point-out that such a mechanism exists in HP-UX 11.X and I suspect Solaris !-( I've spent probably the last decade or so attempting to discourage its use in the HP-UX space, but like some daemon from hell it just refuses to die. rick jones
fix up misplaced inlines.
Turning up the warnings on gcc makes it emit warnings about the placement of 'inline' in function declarations. Here's everything that was under net/

Signed-off-by: Dave Jones [EMAIL PROTECTED]

diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c
index 4c914df..ecfe8da 100644
--- a/net/bluetooth/hidp/core.c
+++ b/net/bluetooth/hidp/core.c
@@ -319,7 +319,7 @@ static int __hidp_send_ctrl_message(struct hidp_session *session,
 	return 0;
 }

-static int inline hidp_send_ctrl_message(struct hidp_session *session,
+static inline int hidp_send_ctrl_message(struct hidp_session *session,
 			unsigned char hdr, unsigned char *data, int size)
 {
 	int err;
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 7712d76..5439a3c 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -61,7 +61,7 @@ static int brnf_filter_vlan_tagged __read_mostly = 1;
 #define brnf_filter_vlan_tagged 1
 #endif

-static __be16 inline vlan_proto(const struct sk_buff *skb)
+static inline __be16 vlan_proto(const struct sk_buff *skb)
 {
 	return vlan_eth_hdr(skb)->h_vlan_encapsulated_proto;
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 8d65d64..27c4f62 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -808,7 +808,7 @@ lenout:
  *
  * (We also register the sk_lock with the lock validator.)
  */
-static void inline sock_lock_init(struct sock *sk)
+static inline void sock_lock_init(struct sock *sk)
 {
 	sock_lock_init_class_and_name(sk,
 			af_family_slock_key_strings[sk->sk_family],
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index a7fee6b..1b61699 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -804,7 +804,7 @@ struct ipv6_saddr_score {
 #define IPV6_SADDR_SCORE_LABEL	0x0020
 #define IPV6_SADDR_SCORE_PRIVACY	0x0040

-static int inline ipv6_saddr_preferred(int type)
+static inline int ipv6_saddr_preferred(int type)
 {
 	if (type & (IPV6_ADDR_MAPPED|IPV6_ADDR_COMPATv4|
 		    IPV6_ADDR_LOOPBACK|IPV6_ADDR_RESERVED))
@@ -813,7 +813,7 @@ static int inline ipv6_saddr_preferred(int type)
 }

 /* static matching label */
-static int inline ipv6_saddr_label(const struct in6_addr *addr, int type)
+static inline int ipv6_saddr_label(const struct in6_addr *addr, int type)
 {
  /*
   * 	prefix (longest match)	label
@@ -3318,7 +3318,7 @@ errout:
 	rtnl_set_sk_err(RTNLGRP_IPV6_IFADDR, err);
 }

-static void inline ipv6_store_devconf(struct ipv6_devconf *cnf,
+static inline void ipv6_store_devconf(struct ipv6_devconf *cnf,
 				      __s32 *array, int bytes)
 {
 	BUG_ON(bytes < (DEVCONF_MAX * 4));
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0e1f4b2..a6b3117 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -308,7 +308,7 @@ static inline void rt6_probe(struct rt6_info *rt)
 /*
  * Default Router Selection (RFC 2461 6.3.6)
  */
-static int inline rt6_check_dev(struct rt6_info *rt, int oif)
+static inline int rt6_check_dev(struct rt6_info *rt, int oif)
 {
 	struct net_device *dev = rt->rt6i_dev;
 	int ret = 0;
@@ -328,7 +328,7 @@ static int inline rt6_check_dev(struct rt6_info *rt, int oif)
 	return ret;
 }

-static int inline rt6_check_neigh(struct rt6_info *rt)
+static inline int rt6_check_neigh(struct rt6_info *rt)
 {
 	struct neighbour *neigh = rt->rt6i_nexthop;
 	int m = 0;
diff --git a/net/ipv6/xfrm6_tunnel.c b/net/ipv6/xfrm6_tunnel.c
index ee4b84a..93c4223 100644
--- a/net/ipv6/xfrm6_tunnel.c
+++ b/net/ipv6/xfrm6_tunnel.c
@@ -58,7 +58,7 @@ static struct kmem_cache *xfrm6_tunnel_spi_kmem __read_mostly;
 static struct hlist_head xfrm6_tunnel_spi_byaddr[XFRM6_TUNNEL_SPI_BYADDR_HSIZE];
 static struct hlist_head xfrm6_tunnel_spi_byspi[XFRM6_TUNNEL_SPI_BYSPI_HSIZE];

-static unsigned inline xfrm6_tunnel_spi_hash_byaddr(xfrm_address_t *addr)
+static inline unsigned xfrm6_tunnel_spi_hash_byaddr(xfrm_address_t *addr)
 {
 	unsigned h;
@@ -70,7 +70,7 @@ static unsigned inline xfrm6_tunnel_spi_hash_byaddr(xfrm_address_t *addr)
 	return h;
 }

-static unsigned inline xfrm6_tunnel_spi_hash_byspi(u32 spi)
+static inline unsigned xfrm6_tunnel_spi_hash_byspi(u32 spi)
 {
 	return spi % XFRM6_TUNNEL_SPI_BYSPI_HSIZE;
 }
diff --git a/net/sched/cls_route.c b/net/sched/cls_route.c
index e85df07..abc47cc 100644
--- a/net/sched/cls_route.c
+++ b/net/sched/cls_route.c
@@ -93,7 +93,7 @@ void route4_reset_fastmap(struct net_device *dev, struct route4_head *head, u32
 	spin_unlock_bh(&dev->queue_lock);
 }

-static void __inline__
+static inline void
 route4_set_fastmap(struct route4_head *head, u32 id, int iif,
 		   struct route4_filter *f)
 {
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 9678995..e81e2fb 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -2025,7 +2025,7 @@ nlmsg_failure:
 	return -1;
 }

-static int inline
Re: ping DOS avoidance?
I was just asked about something not too different, involving IIRC tnsping. It got me to looking at ip_sysctl.txt which has:

icmp_ratelimit - INTEGER
	Limit the maximal rates for sending ICMP packets whose type matches
	icmp_ratemask (see below) to specific targets.
	0 to disable any limiting, otherwise the maximal rate in jiffies(1)
	Default: 100

icmp_ratemask - INTEGER
	Mask made of ICMP types for which rates are being limited.
	Significant bits: IHGFEDCBA9876543210
	Default mask:     0000001100000011000 (6168)

	Bit definitions (see include/linux/icmp.h):
		0 Echo Reply
		3 Destination Unreachable *
		4 Source Quench *
		5 Redirect
		8 Echo Request
		B Time Exceeded *
		C Parameter Problem *
		D Timestamp Request
		E Timestamp Reply
		F Info Request
		G Info Reply
		H Address Mask Request
		I Address Mask Reply

	* These are rate limited by default (see default mask above)

(I've always been used to masks being specified as hex values)

rick jones
Re: bridge: faster compare for link local addresses
Stephen Hemminger wrote:

Use logic operations rather than memcmp() to compare destination address with link local multicast addresses.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

---
 net/bridge/br_input.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

--- netem-dev.orig/net/bridge/br_input.c
+++ netem-dev/net/bridge/br_input.c
@@ -112,7 +112,11 @@ static int br_handle_local_finish(struct
  */
 static inline int is_link_local(const unsigned char *dest)
 {
-	return memcmp(dest, br_group_address, 5) == 0 && (dest[5] & 0xf0) == 0;
+	const u16 *a = (const u16 *) dest;
+	static const u16 *const b = (const u16 *const ) br_group_address;
+	static const u16 m = __constant_cpu_to_be16(0xfff0);
+
+	return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | ((a[2] ^ b[2]) & m)) == 0;
 }

Being paranoid - are there no worries about the alignment of dest?

rick jones
Re: TCP 2MSL on loopback
This is probably not something that happens in real world deployments. But it's not 60,000 concurrent connections, it's 60,000 within a 2 minute span. Sounds like a case of "Doctor! Doctor! It hurts when I do this." I'm not saying this is a high priority problem, I only encountered it in a test scenario where I was deliberately trying to max out the server. Ideally the 2MSL parameter would be dynamically adjusted based on the route to the destination and the weights associated with those routes. In the simplest case, connections between machines on the same subnet (i.e., no router hops involved) should have a much smaller default value than connections that traverse any routers. I'd settle for a two-level setting - with no router hops, use the small value; with any router hops use the large value. With transparent bridging, nobody knows how long the datagram may be out there. Admittedly, the chances of a datagram living for a full two minutes these days is probably nil, but just being in the same IP subnet doesn't really mean anything when it comes to physical locality. It's a combination of 2MSL and /proc/sys/net/ipv4/ip_local_port_range - on my system the default port range is 32768-61000. That means if I use up 28232 ports in less than 2MSL then everything stops. netstat will show that all the available port numbers are in TIME_WAIT state. And this is particularly bad because while waiting for the timeout, I can't initiate any new outbound connections of any kind at all - telnet, ssh, whatever, you have to wait for at least one port to free up. (Interesting denial of service there) SPECweb benchmarking has had to deal with the issue of attempted TIME_WAIT reuse going back to 1997.
It deals with it by not relying on the client's configured local/anonymous/ephemeral port number range and instead making explicit bind() calls in the (more or less) entire unpriv port range (actually it may just be from 5000 to 65535 but still) Now, if it weren't necessary to fully randomize the ISNs, the chances of a successful transition from TIME_WAIT to ESTABLISHED might be greater, but going back to the good old days of more or less purely clock driven ISN's isn't likely. rick jones
Re: [NET]: Please revert disallowing zero listen queues
So we're not disallowing a backlog argument of zero to listen(). We'll accept that just fine, the only thing that happens is that you'll get what you ask for, that being no connections :-) I'm not sure where HP-UX inherited the 0 = 1 bit - perhaps from BSD, nor am I sure there is official chapter and verse, but:

<excerpt>
backlog is limited to the range of 0 to SOMAXCONN, which is defined in
sys/socket.h.  SOMAXCONN is currently set to 4096.  If any other value
is specified, the system automatically assigns the closest value within
the range.  A backlog of 0 specifies only 1 pending connection is
allowed at any given time.
</excerpt>

I don't have a Solaris, BSD or AIX manpage for listen handy to check them but would not be surprised to see they are similar.

rick jones
Re: TCP 2MSL on loopback
With transparent bridging, nobody knows how long the datagram may be out there. Admittedly, the chances of a datagram living for a full two minutes these days is probably nil, but just being in the same IP subnet doesn't really mean anything when it comes to physical locality. Bridging isn't necessarily a problem though. The 2MSL timeout is designed to prevent problems from delayed packets that got sent through multiple paths. In a bridging setup you don't allow multiple paths, that's what STP is designed to prevent. If you want to configure a network that allows multiple paths, you need to use a router, not a bridge. Well, there is trunking at the data link layer, and in theory there could be an active-standby where the standby took a somewhat different path. The timeout is also to cover datagrams which just got stuck somewhere too (IIRC) and may not necessarily require a multiple path situation. SPECweb benchmarking has had to deal with the issue of attempted TIME_WAIT reuse going back to 1997. It deals with it by not relying on the client's configured local/anonymous/ephemeral port number range and instead making explicit bind() calls in the (more or less) entire unpriv port range (actually it may just be from 5000 to 65535 but still) That still doesn't solve the problem, it only ~doubles the available port range. That means it takes 0.6 seconds to trigger the problem instead of only 0.3 seconds... True. Thankfully, the web learned to use persistent connections so later versions of SPECweb benchmarking make use of persistent connections. In an environment where connections are opened and closed very quickly with only a small amount of data carried per connection, it might make sense to remember the last sequence number used on a port and use that as the floor of the next randomly generated ISN. Monotonically increasing sequence numbers aren't a security risk if there's still a randomly determined gap from one connection to the next.
But I don't think it's necessary to consider this at the moment. I thought that all the security types started squawking if the ISN wasn't completely random? I've not tried this, but if a client does want to cycle through thousands of connections per second, and if it is the one to initiate connection close, would it be sufficient to only use something like:

socket()
bind()
loop:
  connect()
  request()
  response()
  shutdown(SHUT_RDWR)
  goto loop

ie not call close on the FD so there is still a direct link to the connection in TIME_WAIT so one could in theory initiate a new connection from TIME_WAIT? Then in theory the randomness could be _almost_ the entire sequence space, less the previous connection's window (IIRC). rick jones
Re: TCP 2MSL on loopback
On the other hand, being able to configure a small MSL for the loopback device is perfectly safe. Being able to configure a small MSL for other interfaces may be safe, depending on the rest of the network layout. A peanut gallery question - I seem to recall prior discussions about how one cannot assume that a packet destined for a given IP address will remain destined for that given IP address as it could go through a module that will rewrite headers etc. Is traffic destined for 127.0.0.1 immune from that? rick jones
Re: all syscalls initially taking 4usec on a P4? Re: nonblocking UDPv4 recvfrom() taking 4usec @ 3GHz?
I measure a huge slope, however. Starting at 1usec for back-to-back system calls, it rises to 2usec after interleaving calls with a count to 20 million. 4usec is hit after 110 million. The graph, with semi-scientific error-bars is on http://ds9a.nl/tmp/recvfrom-usec-vs-wait.png The code to generate it is on: http://ds9a.nl/tmp/recvtimings.c I'm investigating this further for other system calls. It might be that my measurements are off, but it appears even a slight delay between calls incurs a large penalty. The slope appears to be flattening-out the farther out to the right it goes. Perhaps that is the length of time it takes to take all the requisite cache misses. Some judicious use of HW perf counters might be in order via say papi or pfmon. Otherwise, you could try a test where you don't delay, but do try to blow-out the cache(s) between recvfrom() calls. If the delay there starts to match the delay as you go out to the right on the graph it would suggest that it is indeed cache effects. rick jones
Re: degradation in bridging performance of 5% in 2.6.20 when compared to 2.6.19
kalyan tejaswi wrote: Hi all, I have been comparing bridging performance for 2.6.20 and 2.6.19 kernels. The kernel configurations are identical for both the kernels. I use D-Link cards (8139too driver) for the Malta 4Kc board. The setup is: netperf client --- malta 4Kc --- netperf server. The throughput statistics (in 10^6 bits/second) are:

           2.6.19   2.6.20
routing     30.2    30.16
bridging    32.35   30.81

I observe that there has been a degradation in bridging performance of 5% in 2.6.20 when compared to 2.6.19. Has anyone observed similar behaviour? Any inputs or suggestions are welcome. In each case is the malta CPU bound? If not, some idea of the change in CPU util might be helpful. rick jones btw, netperf 2.4.3 just released: ftp://ftp.netperf.org/netperf http://www.netperf.org/svn/netperf2/tags/netperf-2.4.3
Re: FC5 iptables-restore failure
On Thu, Feb 15, 2007 at 02:45:07AM -0800, Andrew Morton wrote: I've recently been noticing nasty messages come out of FC5:

sony:/home/akpm# service iptables stop
Flushing firewall rules:                 [ OK ]
Setting chains to policy ACCEPT: filter  [ OK ]
Unloading iptables modules:              [ OK ]
sony:/home/akpm# service iptables start
Applying iptables firewall rules: iptables-restore: line 20 failed  [FAILED]

Dunno when it started happening, but it's in mainline now. It's a pretty stupid error message. line 20 of what? 2.6.18 -> 2.6.19 changed a bunch of netfilter config option names. Sure you weren't bitten by that ? Dave -- http://www.codemonkey.org.uk
Re: [PATCH] apply cwnd rules to FIN packets with data
John Heffner wrote: David Miller wrote: However, I can't think of any reason why the cwnd test should not apply. Care to elaborate here? You can view the FIN special case as an off by one error in the CWND test, it's not going to melt the internet. :-) True, it's not going to melt the internet, but why stop at one when two would finish the connection even faster? Not sure I buy this argument. Was there some benchmarking data that was a justification for this in the first place? Is the cwnd in the stack byte based, or packet based? While all the RFCs tend to discuss things in terms of byte-based cwnds and assumptions based on MSSes and such, the underlying principle was/is a conservation of packets. As David said, a packet is a packet, and if one were going to be sending a FIN segment, it might as well carry data. And if one isn't comfortable sending that one last data segment with the FIN because cwnd wasn't large enough at the time, should the FIN be sent at that point, even if it is waffer thin? rick jones 2 cents tossed-in from the peanut gallery
Re: meaningful spinlock contention when bound to non-intr CPU?
SPINLOCKS       HOLD            WAIT
 UTIL   CON    MEAN( MAX )    MEAN( MAX )(% CPU)     TOTAL NOWAIT  SPIN RJECT  NAME
 7.4%  2.8%   0.1us( 143us)  3.3us( 147us)( 1.4%) 75262432  97.2%  2.8%    0%  lock_sock_nested+0x30
29.5%  6.6%   0.5us( 148us)  0.9us( 143us)(0.49%) 37622512  93.4%  6.6%    0%  tcp_v4_rcv+0xb30
 3.0%  5.6%   0.1us( 142us)  0.9us( 143us)(0.14%) 13911325  94.4%  5.6%    0%  release_sock+0x120
 9.6% 0.75%   0.1us( 144us)  0.7us( 139us)(0.08%) 75262432  99.2% 0.75%    0%  release_sock+0x30
...

Still, does this look like something worth pursuing? In a past life/OS when one was able to eliminate one percentage point of spinlock contention, two percentage points of improvement ensued. Rick, this looks like good stuff, we're seeing more and more issues like this as systems become more multi-core and have more interrupts per NIC (think MSI-X) MSI-X - haven't even gotten to that - discussion of that probably overlaps with some pci mailing list right? Let me know if there is something I can do to help. I suppose one good step would be to reproduce the results on some other platform. After that, I need to understand what those routines are doing much better than I currently do, particularly from an architecture perspective - I think that it may involve all the prequeue/try to get the TCP processing on the user's stack stuff but I'm _far_ from certain. rick jones
Re: meaningful spinlock contention when bound to non-intr CPU?
Andi Kleen wrote: Rick Jones [EMAIL PROTECTED] writes: Still, does this look like something worth pursuing? In a past life/OS when one was able to eliminate one percentage point of spinlock contention, two percentage points of improvement ensued. The stack is really designed to go fast with per CPU local RX processing of packets. This normally works because when waking up a task the scheduler tries to move it to that CPU. Since the wakeups are on the CPU that processes the incoming packets it should usually end up correctly. The trouble is when your NICs are so fast that a single CPU can't keep up, or when you have programs that process many different sockets from a single thread. The fast NIC case will be eventually fixed by adding proper support for MSI-X and connection hashing. Then the NIC can fan out to multiple interrupts and use multiple CPUs to process the incoming packets. If that is implemented well (for some definition of well) then it might address the many sockets from a thread issue too, but if not... If it is a simple hash on the headers then you still have issues with a process/thread servicing multiple connections - the hash of the different headers will take things up different CPUs and you induce the scheduler to flip the process back and forth between them. The meta question behind all that would seem to be whether the scheduler should be telling us where to perform the network processing, or should the network processing be telling the scheduler what to do? (eg all my old blathering about IPS vs TOPS in HP-UX...) Then there is the case of a single process having many sockets from different NICs. This will be of course somewhat slower because there will be cross CPU traffic. The extreme case I see with the netperf test suggests it will be a pretty big hit. Dragging cachelines from CPU to CPU is evil. Sometimes a necessary evil of course, but still evil.
However there should be not much socket lock contention because a process handling many sockets will be hopefully unlikely to bang on each of its many sockets at exactly the same time as the stack receives RX packets. This should also eliminate the spinlock contention. From that theory your test sounds somewhat unrealistic to me. Do you have any evidence you're modelling a real world scenario here? I somehow doubt it. Well, yes and no. If I drop the burst and instead have N times more netperf's going, I see the same lock contention situation. I wasn't expecting to - thinking that if there were then N different processes on each CPU the likelihood of there being a contention on any one socket was low, but it was there just the same. That is part of what makes me wonder if there is a race between wakeup and release of a lock. rick
Re: meaningful spinlock contention when bound to non-intr CPU?
Andi Kleen wrote: The meta question behind all that would seem to be whether the scheduler should be telling us where to perform the network processing, or should the network processing be telling the scheduler what to do? (eg all my old blathering about IPS vs TOPS in HP-UX...) That's an unsolved problem. But past experiments suggest that giving the scheduler more imperatives than just use CPUs well are often net-losses. I wasn't thinking about giving the scheduler more imperatives really (?), just letting networking know more about where threads executed accessing given connections. (eg TOPS) I suspect it cannot be completely solved in the general case. Not unless the NIC can peer into the connection table and see where each connection was last accessed by user-space. Well, yes and no. If I drop the burst and instead have N times more netperf's going, I see the same lock contention situation. I wasn't expecting to - thinking that if there were then N different processes on each CPU the likelihood of there being a contention on any one socket was low, but it was there just the same. That is part of what makes me wonder if there is a race between wakeup A race? Perhaps a poor choice of words on my part - something along the lines of:

hold_lock();
wake_up_someone();
release_lock();

where the someone being awoken can try to grab the lock before the path doing the waking manages to release it. and release of a lock. You could try with echo 1 > /proc/sys/net/ipv4/tcp_low_latency. That should change RX locking behaviour significantly. Running the same 8 netperf's with TCP_RR and burst bound to different CPU than the NIC interrupt, the lockmeter output looks virtually unchanged. Still release_sock, tcp_v4_rcv, lock_sock_nested at their same offsets. However, if I run the multiple-connection-per-thread code, and have each service 32 concurrent connections, and bind to a CPU other than the interrupt CPU, the lock contention in this case does appear to go away.
rick jones
Re: meaningful spinlock contention when bound to non-intr CPU?
Yes the wakeup happens deep inside the critical section and if the process is running on another CPU it could race to the lock. Hmm, i suppose the wakeup could be moved out, but it would need some restructuring of the code. Also to be safe the code would still need to at least hold a reference count of the sock during the wakeup, and when that is released then you have another cache line to bounce, which might not be any better than the lock. So it might not be actually worth it. I suppose the socket release could be at least partially protected with RCU against this case so that could be done without a reference count, but it might be tricky to get this right. Again still not sure it's worth handling this. Based on my experiments thusfar I'd have to agree/accept (I wasn't certain to begin with - hence the post in the first place :) but I do need/want to see what happens with a single-stream through a 10G NIC - on the receive side at least with a 1500 byte MTU. I was using the burst-mode aggregate RR over the 1G NICs to get the CPU util up without need for considerable bandwidth, since the system handled 8 TCP_STREAM tests across the 8 NICs without working-up a sweat. I suppose I could instead chop the MTU on the 1G NICs and use that to increase the CPU util on the receive side. rick
meaningful spinlock contention when bound to non-intr CPU?
For various nefarious porpoises relating to comparing and contrasting a single 10G NIC with N 1G ports and hopefully finding interesting processor cache (mis)behaviour in the stack, I got my hands on a pair of 8 core systems with plenty of RAM and I/O slots. (rx6600 with 1.6 GHz dual-core Itanium2, aka Montecito) A 2.6.10-rc5 kernel onto each system thanks to pointers from Dan Frazier. Into each went a quartet of dual-port 1G NICs driven by e1000 7.3.15-k2-NAPI and I connected them back to back. I tweaked smp_affinity to have each port's interrupts go to a separate core. Netperf2 configured with --enable-burst. When I run eight concurrent netperf TCP_RR tests, each doing 24 concurrent single-byte transactions (test-specific -b 24), TCP_NODELAY set, (test-specific -D) and bind each netserver/netperf to the same CPU as is taking the interrupts of the NIC handling that connection (global -T) I see things looking pretty good. Decent aggregate transactions per second, and nothing in the CPU profiles to suggest spinlock contention. Happiness and joy. An N CPU system behaving (at this level at least) like N, 1 CPU systems. When I then decide to bind the netperf/netservers to CPU(s) other than the ones taking the interrupts from the NIC(s) the aggregate transactions per second drops by roughly 40/135 or ~30%. I was indeed expecting a delta - no idea if that is in the realm of to be expected - but decided to go ahead and look at the profiles. The profiles (either via q-syscollect or caliper) show upwards of 3% of the CPU consumed by spinlock contention (ie time spent in ia64_spinlock_contention). 
(I'm guessing some of the rest of the perf drop comes from those interesting cache behaviours still to be sought.) With some help from Lee Schermerhorn and Alan Brunelle I got a lockmeter kernel going, and it is suggesting that the greatest spinlock contention comes from these routines:

SPINLOCKS          HOLD             WAIT
  UTIL   CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT  SPIN RJECT  NAME
  7.4%  2.8%   0.1us( 143us)  3.3us( 147us)( 1.4%)  75262432  97.2%  2.8%    0%  lock_sock_nested+0x30
 29.5%  6.6%   0.5us( 148us)  0.9us( 143us)(0.49%)  37622512  93.4%  6.6%    0%  tcp_v4_rcv+0xb30
  3.0%  5.6%   0.1us( 142us)  0.9us( 143us)(0.14%)  13911325  94.4%  5.6%    0%  release_sock+0x120
  9.6%  0.75%  0.1us( 144us)  0.7us( 139us)(0.08%)  75262432  99.2%  0.75%   0%  release_sock+0x30

I suppose it stands to some reason that there would be contention associated with the socket, since there will be two things going for the socket (a netperf/netserver and an interrupt/up-the-stack) each running on separate CPUs. Some of it looks like it _may_ be inevitable? - waking up the user, who will now be racing to grab the socket before the stack releases it? (I may have been mis-interpreting some of the code I was checking.) Still, does this look like something worth pursuing? In a past life/OS, when one was able to eliminate one percentage point of spinlock contention, two percentage points of improvement ensued. rick jones
Re: meaningful spinlock contention when bound to non-intr CPU?
Rick Jones wrote: A 2.6.10-rc5 kernel onto each system thanks to pointers from Dan Frazier. gaak - 2.6.20-rc5 that is.
Re: why would EPIPE cause socket port to change?
Herbert Xu wrote: dean gaudet [EMAIL PROTECTED] wrote: in the test program below the getsockname result on a TCP socket changes across a write which produces EPIPE... here's a fragment of the strace:

getsockname(3, {sa_family=AF_INET, sin_port=htons(37636), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
...
write(3, "hi!\n", 4) = 4
write(3, "hi!\n", 4) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
getsockname(3, {sa_family=AF_INET, sin_port=htons(59882), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0

why does the port# change? this is on 2.6.19.1. Prior to the last write, the socket entered the CLOSED state, meaning that the old port is no longer allocated to it. As a result, the last write operates on an unconnected socket, which causes a new local port to be allocated as an autobind. It then fails because the socket is still not connected. So any attempt to run getsockname after an error on the socket is simply buggy. But it falls within the principle of least surprise, doesn't it? Unless the application has called close() or bind(), it does seem like a reasonable expectation that the port assignments are not changed. (fwiw this is one of two reasons i've found for libnss-ldap to leak sockets... causing nscd to crash.) Of course, that seems rather odd too - why does libnss-ldap check the socket name on a socket after an EPIPE anyway? rick jones
Re: Two Dual Core processors and NICS (not handling interrupts on one CPU / assigning a CPU to a NIC)
Mark Ryden wrote: Hello, I have a machine with 2 dual-core CPUs. This machine runs Fedora Core 6. I have two Intel e1000 GigaBit network cards on this machine; I use bonding so that the machine assigns the same IP address to both NICs. It seems to me that bonding is configured OK, because when running: cat /proc/net/bonding/bond0 I get: ... Permanent HW addr: (and the Permanent HW addr is different in these two entries). I send a large amount of packets to this machine (more than 20,000 in a second). Well, 20K a second is large in some contexts, but not in others :) cat /proc/interrupts shows something like this:

      CPU0       CPU1  CPU2  CPU3
50:   3359337       0     0     0  PCI-MSI  eth0
58: 493396136       0     0        PCI-MSI  eth1

CPU0 and CPU1 are of the first CPU as far as I understand; so this means, as far as I understand, that the second CPU (which has CPU2 and CPU3) does not handle interrupts of the arrived packets. Can I somehow change it so the second CPU will also handle network interrupts of receiving packets on the NIC? Actually, those could be different chips - it depends on the CPUs I think, and I suppose the BIOS/OS. On a Woodcrest system with which I've been playing, CPUs 0 and 2 appear to be on the same die, then 1 and 3. I ass-u-me-d the numbering was that way to get maximum processor cache when saying numcpu=N for something less than the number of cores in the system. NUMA considerations might come into play if this is Opteron (well, any NUMA system really - larger IA64s, certain SPARC and Power systems etc...). In broad handwaving terms, one is better off with the NIC's interrupts being handled by the topologically closest CPU. (Not that some irqbalancer programs recognize that just yet :) Now, if both CPU0 and CPU1 are saturated, it might make sense to put some interrupts on 2 and/or 3. One of those fun it depends situations.
rick jones
Re: Network card IRQ balancing with Intel 5000 series chipsets
The best way to achieve such balancing is to have the network card help and essentially be able to select the CPU to notify, while at the same time considering: a) avoiding any packet reordering - which restricts a flow to being processed by a single CPU, at least within a timeframe; b) being per-CPU-load-aware - which means busying out only CPUs which are less utilized. Various such schemes have been discussed here but no vendor is making such NICs today (search Dave's blog - he did discuss this at one point or other). I thought that Neterion were doing something along those lines with their Xframe II NICs - perhaps not CPU-load aware, but doing stuff to spread the work of different connections across the CPUs. I would add a: c) some knowledge of the CPU on which the thread accessing the socket for that connection will run. This could be as simple as the CPU on which the socket was last accessed. Having a _NIC_ know this sort of thing is somewhat difficult and expensive (perhaps too much so). If a NIC simply hashes the connection identifiers, you then have the issue of different connections, each owned/accessed by one thread, taking different paths through the system. No issues about reordering, but perhaps some on cache lines going hither and yon. The question boils down to - should the application (via the scheduler) dictate where its connections are processed, or should the connections dictate where the application runs? rick jones
Re: Network card IRQ balancing with Intel 5000 series chipsets
With NAPI, if I have a few interrupts it likely implies I have a huge network load (and therefore CPU use) and I would be much happier if you didn't start moving more interrupt load to that already loaded CPU. The current irqbalance accounts for NAPI by using the number of packets as the indicator for load, not the number of interrupts (for network interrupts, obviously). And hopefully some knowledge of NUMA so it doesn't balance the interrupts of a NIC to some far-off (topology-wise) CPU... rick jones
Re: Network drivers that don't suspend on interface down
There are two different problems: 1) Behavior seems to be different depending on device driver author. We should document the expected semantics better. IMHO, when a device is down, it should: a) use as few resources as possible: - not grab memory for buffers - not hold an IRQ unless it must - turn off all power consumption possible; b) allow setting parameters like speed/duplex/autonegotiation, ring buffers, ... with ethtool, and remember the state; c) not accept data coming in, and drop packets queued. What implications does (c) have for something like tcpdump? rick jones
Re: network devices don't handle pci_dma_mapping_error()'s
David Miller wrote: From: Stephen Hemminger [EMAIL PROTECTED] Date: Wed, 6 Dec 2006 16:58:35 -0800 The more robust way would be to stop the queue (like flow control) and return busy. You would need a timer though, to handle the case where some disk i/o stole all the mappings and then the network device flow blocked. You need some kind of fairness, yes, that's why I suggested a callback. When your DMA allocation fails, you get into the rear of the FIFO; when a free occurs, we callback starting from the head of the FIFO. You don't get removed from the FIFO unless at least one of your DMA allocation retries succeeds. While tossing a TCP|UDP|SCTP|etc packet could be plusungood, especially if the IOMMU fills frequently (for some suitable definition of frequently), is it really worth the effort to save, say, an ACK? rick jones
Re: Max number of TCP sessions
On Thu, 2006-11-16 at 20:23 +0000, James Courtier-Dutton wrote: Hi, For a host using a Pentium 4 CPU at 2.8GHz, what is a sensible max value for the number of TCP sessions this host could run under Linux? Bandwidth per TCP session is likely to be about 10 kbytes/second. To a first order, and assuming that there is nearly no user-space processing for those TCP connections (TCP is a transport, not a session protocol :) you could take a netperf TCP_RR test result - using the service demand, i.e. usec of CPU per KB transferred - and then do some back-of-the-envelope calculations as to the number of 10 KByte/s connections you could support. It would be a bit of handwaving, but give yourself say a 20% pad and you'll probably be OK. rick jones Kind Regards James
Re: 2.6.19-rc1: Volanomark slowdown
On Wed, 2006-11-08 at 23:10 +0100, Olaf Kirch wrote: What I'm saying though is that it doesn't rhyme with what I've seen of Volanomark - we ran 2.6.16 on a 4p Intel box for instance and it didn't come close to saturating a Gigabit pipe before it maxed out on CPU load. That actually supports the hypothesis, doesn't it? The issue being the increased number of ACKs causing additional CPU overhead, not saturating a NIC, if any is involved. One of these days I may have to try to look more closely at what volano does relative to netperf - I remember that someone tried very hard (was it you, Alexey?) to show a performance effect with netperf and it didn't do it :( rick jones
Re: [PATCH] bcm43xx: Readd dropped assignment
On Wed, Oct 18, 2006 at 04:40:00PM +0200, Michael Buesch wrote: On Wednesday 18 October 2006 01:12, Daniel Drake wrote: Larry Finger pointed out a problem with my ieee80211 IV/ICV stripping patch, which I forgot about. Sorry about that. The patch readds the frame_ctl assignment which was accidentally dropped. Signed-off-by: Daniel Drake [EMAIL PROTECTED] Whoops. Please merge this as fast as possible, John. That's a real bug which prevents RX from working. Is that one for -stable too? That file looks similar enough between .18.1 and .19rc that it should be the case? Dave -- http://www.codemonkey.org.uk
Re: [PATCH 2/3] netpoll: rework skb transmit queue
On Fri, Oct 20, 2006 at 01:25:32PM -0700, Stephen Hemminger wrote: On Fri, 20 Oct 2006 12:52:26 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote: From: Stephen Hemminger [EMAIL PROTECTED] Date: Fri, 20 Oct 2006 12:25:27 -0700 Sorry, but why should we treat out-of-tree vendor code any differently than out-of-tree other code. I think what netdump was trying to do, provide a way to requeue instead of fully drop the SKB, is quite reasonable. Don't you think? Netdump doesn't even exist in the current Fedora source rpm. I think Dave dropped it. Indeed. Practically no-one cared about it, so it bit-rotted really fast after we shipped RHEL4. That, along with the focus shifting to making kdump work, seemed to kill it off over the last 12 months. Dave -- http://www.codemonkey.org.uk
Re: BCM5461 phy issue in 10M/Full duplex
Kumar Gala wrote: I was wondering if anyone has had any issues when trying to force a BCM5461 phy into 10M/full duplex. I seem to be having an issue in that the two managed switches I've tried this on both autoneg to 10/half. This causes a problem in that I start seeing a large number of frame errors. I believe, but need to double check, that if I leave the BCM5461 in autoneg and force the switch to 10M/full, the BCM5461 will autoneg to 10M/half duplex. Indeed, if one side is hardcoded, autoneg will fail and the side trying to autoneg is required by the specs (not that I know chapter and verse to quote from the IEEE stuff :( to go into half-duplex. Was 10M/full-duplex ever standardized? If not, I could see where kit might not be willing/able to autoneg to that. Just wondering if anyone else has seen similar behavior with this PHY. thanks - kumar
Remove useless comment from sb1250
Signed-off-by: Dave Jones [EMAIL PROTECTED]

diff --git a/drivers/net/sb1250-mac.c b/drivers/net/sb1250-mac.c
index db23249..1eae16b 100644
--- a/drivers/net/sb1250-mac.c
+++ b/drivers/net/sb1250-mac.c
@@ -2903,7 +2903,7 @@
 #endif
 	dev = alloc_etherdev(sizeof(struct sbmac_softc));
 	if (!dev)
-		return -ENOMEM;	/* return ENOMEM */
+		return -ENOMEM;
 	printk(KERN_DEBUG "sbmac: configuring MAC at %lx\n", port);

-- 
http://www.codemonkey.org.uk
Re: Suppress / delay SYN-ACK
Eric Dumazet wrote: Rick Jones a écrit : More to the point, on what basis would the application be rejecting a connection request based solely on the SYN? True, it isn't like there would suddenly be any call user data as in XTI/TLI. DATA payload could be included in the SYN packet. TCP specs allow this AFAIK. Yes, but it isn't supposed to be delivered until the 3-way handshake is complete, right? rick jones
getaddrinfo - should accept IPPROTO_SCTP no?
I made some recent changes to netperf to work around what is IMO a bug in the Solaris getaddrinfo(), where it will clear the ai_protocol field even when one gives it a protocol in the hints. [If you happen to be trying to use the test-specific -D to set TCP_NODELAY in netperf on Solaris, you might want to grab netperf TOT to get this workaround as it relates to issues with setting TCP_NODELAY - modulo what it will do to being able to run the netperf SCTP tests on Linux...] In the process, though, I have stumbled across what appears to be a bug (?) in the Linux getaddrinfo() - returning a -7 EAI_SOCKTYPE if given as hints SOCK_STREAM and IPPROTO_SCTP - this on a system that ostensibly supports SCTP. I've seen this on RHAS4U4 as well as another less well known distro. I'm about to see about concocting an additional workaround in netperf for this, but thought I'd ask whether my assumption - that getaddrinfo() returning -7 when given IPPROTO_SCTP is indeed a bug in getaddrinfo() - is correct. Or am I just woefully behind in patches or completely off base on what is correct behaviour for getaddrinfo and hints? FWIW, which may not be much, Solaris 10 06/06 seems content to accept IPPROTO_SCTP in the hints. thanks, rick jones http://www.netperf.org/svn/netperf2/trunk/
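A quick probe of the behaviour in question: hand getaddrinfo() SOCK_STREAM plus a protocol in the hints and see whether it objects. On the systems described above the SCTP case came back -7 (EAI_SOCKTYPE); whether it succeeds depends on the libc version. The port number is arbitrary, and AI_PASSIVE with a NULL node avoids any name resolution:

```c
/* Does getaddrinfo() tolerate SOCK_STREAM + IPPROTO_SCTP in the
 * hints, or return EAI_SOCKTYPE as described above? */
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

static int probe_protocol(int protocol)
{
    struct addrinfo hints, *res = NULL;
    int err;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_protocol = protocol;
    hints.ai_flags    = AI_PASSIVE;

    err = getaddrinfo(NULL, "12345", &hints, &res);
    if (res)
        freeaddrinfo(res);
    return err;
}

int main(void)
{
    int err = probe_protocol(IPPROTO_SCTP);

    if (err)
        printf("SOCK_STREAM + IPPROTO_SCTP rejected: %s (%d)\n",
               gai_strerror(err), err);
    else
        printf("SOCK_STREAM + IPPROTO_SCTP accepted\n");
    return 0;
}
```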
Re: Suppress / delay SYN-ACK
DATA payload could be included in the SYN packet. TCP specs allow this AFAIK. Yes, but it isn't supposed to be delivered until the 3-way handshake is complete, right? Are you speaking of the 20-year-old BSD API? :) Nope - the bits in the RFCs about data not being delivered until the ISNs are validated. I may have some of the timing a bit wrong though. rick jones
sfuzz hanging on 2.6.18
sfuzz.c (google for it if you don't have it already) used to run forever (or until I got bored and ctrl-c'd it) as long as it didn't trigger an oops or the like in 2.6.17. Running it against 2.6.18, I notice that it runs for a while, and then gets totally wedged. It doesn't respond to any signals, can't be ptraced, and even strace subsequently gets wedged. The machine responds, and is still interactive, but that process is hosed. sysrq-t shows it stuck here..

sfuzz D 724EF62A 2828 28717 28691 (NOTLB) cd69fe98 0082 012d 724ef62a 0001971a 0010 0007 df6d22b0 dfd81080 725bbc5e 0001971a 000cc634 0001 df6d23bc c140e260 0202 de1d5ba0 cd69fea0 de1d5ba0 de1d5b60 de1d5b8c de1d5ba0
Call Trace:
[c05b1708] lock_sock+0x75/0xa6
[e0b0b604] dn_getname+0x18/0x5f [decnet]
[c05b083b] sys_getsockname+0x5c/0xb0
[c05b0b46] sys_socketcall+0xef/0x261
[c0403f97] syscall_call+0x7/0xb
DWARF2 unwinder stuck at syscall_call+0x7/0xb

I wonder if the plethora of lockdep related changes inadvertently broke something? Dave -- http://www.codemonkey.org.uk
Re: Suppress / delay SYN-ACK
Martin Schiller wrote: Hi! I'm searching for a solution to suppress / delay the SYN-ACK packet of a listening server (-application) until it has decided (e.g. analysed the requesting IP address or checked if the corresponding other end of a connection is available) whether it wants to accept the connect request of the client. If not, it should be possible to reject the connect request. How often do you expect the incoming call to be rejected? I suspect that would have a significant effect on whether the whole thing is worthwhile. rick jones
Re: Suppress / delay SYN-ACK
More to the point, on what basis would the application be rejecting a connection request based solely on the SYN? True, it isn't like there would suddenly be any call user data as in XTI/TLI. There are only two pieces of information available: the remote IP address and port, and the total number of pending requests. The latter is already addressed through the backlog size, and netfilter rules can already be used to reject based on IP address. It would, though, allow an application to have an even more restricted set of allowed IPs than was set in netfilter. Rather like allowing the application to set socket buffer sizes rather than relying on the system's default. rick jones
Re: mii-tool gigabit support.
2) develop some style of register description definition type of text file, maybe XML, maybe INI style or something stored in /etc/ethtool as drivername.conf or something like that. This way, ethtool doesn't have to be changed/updated/patched/likely-bug-added for every single device known to man. Just a thought. We could switch to shared libraries like 'tc' uses. From a practical standpoint is shipping a new config file or a new shared library all that much different from a new ethtool binary? rick
Re: [PATCH][BNX2]: Disable MSI on 5706 if AMD 8132 bridge is present
It absolutely was not vague, it gave an explicit description of what the problem was, down to the transaction type being used by 5706 and what the stated rules are in the PCI spec, and it also gave a clear indication that the 5706 was in the wrong and that this was believed to be a unique situation. I'm not disagreeing with a per-driver check at the moment, but I thought that Michael told us that the masking being attempted by the 5706 was legal: Michael Chan wrote: MSI is defined to be 32-bit write. The 5706 does 64-bit MSI writes with byte enables disabled on the unused 32-bit word. This is legal but causes problems on the AMD 8132 which will eventually stop responding after a while. rick jones
Re: mii-tool gigabit support.
With mii-tool we can do the command below and work with a half duplex hub and a full duplex switch. mii-tool -A 10baseT-FD,10baseT-HD eth0 Why, and how often, is that really necessary? rick jones
Re: mii-tool gigabit support.
Auke Kok wrote: Rick Jones wrote: With mii-tool we can do the command below and work with a half duplex hub and a full duplex switch. mii-tool -A 10baseT-FD,10baseT-HD eth0 Why, and how often, is that really necessary? This is a bit of a hypothetical discussion of course, but I can imagine a lot of users with 100mbit switches in their homes (imagine all the DSL/cable routers out there...) that want to stop their NIC from attempting to negotiate 1000mbit. That would be covered by autosense, right? IIRC there haven't been issues with speed sensing, just duplex negotiation, right? Another scenario: forcing the NIC to negotiate only full-duplex speeds. Not only fun if you try it against a hub, but possibly useful. For us it's much more interesting because we try every damn impossible configuration anyway and see what gives (or breaks). Anyway, a patch to make ethtool do this was merged as Jeff Kirsher pointed out, so you can do this now with ethtool too. I'm just worried (as in Fear, Uncertainty and Doubt) that having people set the allowed things to negotiate isn't really any more robust than straight-up hardcodes, and that it perpetuates the (IMO) myth that one shouldn't autoneg on general principle. rick Cheers, Auke
Re: tc related lockdep warning.
On Tue, Sep 26, 2006 at 06:15:21PM +0200, Patrick McHardy wrote: Patrick McHardy wrote: jamal wrote: Yes, that looks plausible. Can you try making those changes and see if the warning is gone? I think this points to a bigger brokenness caused by the move of dev->qdisc to RCU. It means destruction of filters and actions doesn't necessarily happen in user context and is thus not protected by the rtnl anymore. I looked into this and we indeed still have lots of problems from that broken RCU patch. Basically all locking (qdiscs, classifiers, actions, estimators) assumes that updates are only done in process context and thus read_lock doesn't need bottom half protection. Quite a few things also assume that updates only happen under the RTNL and don't need any further protection if not used during packet processing. Instead of fixing all this I suggest something like this (untested) patch instead. Since only the dev->qdisc pointer is protected by RCU, but enqueue and the qdisc tree are still protected by dev->qdisc_lock, we can perform destruction of the tree immediately and only do the final free in the rcu callback, as long as we make sure not to enqueue anything to a half-way destroyed qdisc. With this patch, I get no lockdep warnings, but the machine locks up completely. I hooked up a serial console, and found this..
u32 classifier Performance counters on input device check on Actions configured BUG: warning at net/sched/sch_htb.c:395/htb_safe_rb_erase() Call Trace: [8026f79b] show_trace+0xae/0x336 [8026fa38] dump_stack+0x15/0x17 [8860a171] :sch_htb:htb_safe_rb_erase+0x3b/0x55 [8860a4d5] :sch_htb:htb_deactivate_prios+0x173/0x1cd [8860b437] :sch_htb:htb_dequeue+0x4d0/0x856 [8042dc0d] __qdisc_run+0x3f/0x1ca [802329a6] dev_queue_xmit+0x137/0x268 [8025b4a2] neigh_resolve_output+0x249/0x27e [802353fd] ip_output+0x210/0x25a [8043ce28] ip_push_pending_frames+0x37c/0x45b [8044ffd7] icmp_push_reply+0x13b/0x148 [80450900] icmp_send+0x366/0x3d3 [802568a9] udp_rcv+0x53d/0x556 [80237e73] ip_local_deliver+0x1a3/0x26b [80238ec8] ip_rcv+0x4b9/0x501 [802218bb] netif_receive_skb+0x33d/0x3c9 [881f6348] :e1000:e1000_clean_rx_irq+0x450/0x4fe [881f47eb] :e1000:e1000_clean+0x88/0x17d [8020cab3] net_rx_action+0xac/0x1d1 [80212725] __do_softirq+0x68/0xf5 [80262638] call_softirq+0x1c/0x28 DWARF2 unwinder stuck at call_softirq+0x1c/0x28 Leftover inexact backtrace: IRQ [80270aaa] do_softirq+0x39/0x9f [80296102] irq_exit+0x57/0x59 [80270c0d] do_IRQ+0xfd/0x107 [8025b51d] mwait_idle+0x0/0x54 [802618c6] ret_from_intr+0x0/0xf EOI [80265e66] __sched_text_start+0xaa6/0xadd [8025b55c] mwait_idle+0x3f/0x54 [8025b526] mwait_idle+0x9/0x54 [8024c81c] cpu_idle+0xa2/0xc5 [8026e519] rest_init+0x2b/0x2d [80a7f811] start_kernel+0x24a/0x24c [80a7f28b] _sinittext+0x28b/0x292 BUG: warning at net/sched/sch_htb.c:395/htb_safe_rb_erase() Call Trace: [8026f79b] show_trace+0xae/0x336 [8026fa38] dump_stack+0x15/0x17 [8860a171] :sch_htb:htb_safe_rb_erase+0x3b/0x55 [8860a4d5] :sch_htb:htb_deactivate_prios+0x173/0x1cd [8860b437] :sch_htb:htb_dequeue+0x4d0/0x856 [8042dc0d] __qdisc_run+0x3f/0x1ca [802329a6] dev_queue_xmit+0x137/0x268 [8025b4a2] neigh_resolve_output+0x249/0x27e [802353fd] ip_output+0x210/0x25a [8043ce28] ip_push_pending_frames+0x37c/0x45b [8044ffd7] icmp_push_reply+0x13b/0x148 [80450900] icmp_send+0x366/0x3d3 [802568a9] 
udp_rcv+0x53d/0x556 [80237e73] ip_local_deliver+0x1a3/0x26b [80238ec8] ip_rcv+0x4b9/0x501 [802218bb] netif_receive_skb+0x33d/0x3c9 [881f6348] :e1000:e1000_clean_rx_irq+0x450/0x4fe [881f47eb] :e1000:e1000_clean+0x88/0x17d [8020cab3] net_rx_action+0xac/0x1d1 [80212725] __do_softirq+0x68/0xf5 [80262638] call_softirq+0x1c/0x28 DWARF2 unwinder stuck at call_softirq+0x1c/0x28 Leftover inexact backtrace: IRQ [80270aaa] do_softirq+0x39/0x9f [80296102] irq_exit+0x57/0x59 [80270c0d] do_IRQ+0xfd/0x107 [8025b51d] mwait_idle+0x0/0x54 [802618c6] ret_from_intr+0x0/0xf EOI [80265e66] __sched_text_start+0xaa6/0xadd [8025b55c] mwait_idle+0x3f/0x54 [8025b526] mwait_idle+0x9/0x54 [8024c81c] cpu_idle+0xa2/0xc5 [8026e519] rest_init+0x2b/0x2d [80a7f811] start_kernel+0x24a/0x24c [80a7f28b] _sinittext+0x28b/0x292 BUG: soft lockup detected on CPU#0! Call Trace: [8026f79b] show_trace+0xae/0x336 [8026fa38] dump_stack+0x15/0x17 [802bfea7] softlockup_tick+0xd5/0xea
tc related lockdep warning.
= [ INFO: inconsistent lock state ] - inconsistent {softirq-on-R} - {in-softirq-W} usage. swapper/0 [HC0[0]:SC1[2]:HE1:SE0] takes: (police_lock){-+--}, at: [f8d304fd] tcf_police_destroy+0x24/0x8f [act_police] {softirq-on-R} state was registered at: [c043bdd6] lock_acquire+0x4b/0x6d [c061495a] _read_lock+0x19/0x28 [f8d3026a] tcf_act_police_locate+0x26a/0x363 [act_police] [c05cacc3] tcf_action_init_1+0x113/0x1a7 [c05c97c9] tcf_exts_validate+0x3c/0x85 [f8d4337c] u32_set_parms+0x26/0x131 [cls_u32] [f8d43dc7] u32_change+0x2fc/0x371 [cls_u32] [c05c9f44] tc_ctl_tfilter+0x417/0x487 [c05c0d67] rtnetlink_rcv_msg+0x1b3/0x1d6 [c05ccef3] netlink_run_queue+0x69/0xfe [c05c0b6a] rtnetlink_rcv+0x29/0x42 [c05cd380] netlink_data_ready+0x12/0x50 [c05cc3e8] netlink_sendskb+0x1f/0x37 [c051] netlink_unicast+0x1a1/0x1bb [c05cd361] netlink_sendmsg+0x275/0x282 [c05aff4a] sock_sendmsg+0xe8/0x103 [c05b074d] sys_sendmsg+0x14d/0x1a8 [c05b1937] sys_socketcall+0x16b/0x186 [c0403fb7] syscall_call+0x7/0xb irq event stamp: 278833666 hardirqs last enabled at (278833666): [c04290b9] tasklet_action+0x30/0xca hardirqs last disabled at (278833665): [c0429095] tasklet_action+0xc/0xca softirqs last enabled at (278833650): [c0429083] __do_softirq+0xec/0xf2 softirqs last disabled at (278833659): [c0406683] do_softirq+0x5a/0xbe other info that might help us debug this: 1 lock held by swapper/0: #0: (qdisc_tree_lock){-+-.}, at: [c05c7737] __qdisc_destroy+0x20/0x85 stack backtrace: [c04051ed] show_trace_log_lvl+0x58/0x16a [c04057fa] show_trace+0xd/0x10 [c0405913] dump_stack+0x19/0x1b [c043a20b] print_usage_bug+0x1cf/0x1dc [c043a5e4] mark_lock+0x124/0x353 [c043b2a0] __lock_acquire+0x3d7/0x99c [c043bdd6] lock_acquire+0x4b/0x6d [c06148a5] _write_lock_bh+0x1e/0x2d [f8d304fd] tcf_police_destroy+0x24/0x8f [act_police] [f8d30590] tcf_act_police_cleanup+0x28/0x33 [act_police] [c05ca1a1] tcf_action_destroy+0x20/0x84 [c05c9784] tcf_exts_destroy+0x16/0x1f [f8d43114] u32_destroy_key+0x30/0x50 [cls_u32] [f8d4314f] 
u32_clear_hnode+0x1b/0x2e [cls_u32] [f8d4319a] u32_destroy_hnode+0x38/0x81 [cls_u32] [f8d4322d] u32_destroy+0x4a/0xc9 [cls_u32] [f8d340f9] ingress_destroy+0x1a/0x5c [sch_ingress] [c05c774d] __qdisc_destroy+0x36/0x85 [c043438f] __rcu_process_callbacks+0xfe/0x169 [c04346f0] rcu_process_callbacks+0x23/0x45 [c04290ee] tasklet_action+0x65/0xca [c042900f] __do_softirq+0x78/0xf2 [c0406683] do_softirq+0x5a/0xbe [c0428eb8] irq_exit+0x3d/0x3f [c04179df] smp_apic_timer_interrupt+0x73/0x78 [c0404b12] apic_timer_interrupt+0x2a/0x30 DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30 Leftover inexact backtrace:
Re: [PATCH][RFC] Re: high latency with TCP connections
Alexey Kuznetsov wrote: Hello! transactions to data segments is fubar. That issue is also why I wonder about the setting of tcp_abc. Yes, switching ABC on/off has visible impact on amount of segments. When ABC is off, amount of segments is almost the same as number of transactions. When it is on, ~1.5% are merged. But this is invisible in numbers of throughput/cpu usage. Hmm, that would seem to suggest that for new the netperf/netserver were being fast enough that the code didn't perceive the receipt of back-to-back sub-MSS segments? (Is that even possible once -b is fairly large?) Otherwise, with new I would have expected the segment count to be meaningfully than the transaction count? That' numbers: 1Gig link. The first column is b. - separates runs of netperf in backward direction. Run #1. One host is slower. old,abc=0 new,abc=0 new,abc=1 old,abc=1 2 23652.00 6.31 21.11 10.665 8.924 23622.16 6.47 21.01 10.951 8.893 23625.05 6.21 21.01 10.512 8.891 23725.12 6.46 20.31 10.898 8.559 - 23594.87 21.90 6.44 9.283 10.912 23631.52 20.30 6.36 8.592 10.766 23609.55 21.00 6.26 8.896 10.599 23633.75 21.10 5.44 8.929 9.206 4 36349.11 8.71 31.21 9.584 8.585 36461.37 8.65 30.81 9.492 8.449 36723.72 8.22 31.31 8.949 8.526 35801.24 8.58 30.51 9.589 8.521 - 35127.34 33.80 8.43 9.621 9.605 36165.50 30.90 8.48 8.545 9.381 36201.45 31.10 8.31 8.592 9.185 35269.76 30.00 8.58 8.507 9.732 8 41148.23 10.39 42.30 10.101 10.281 41270.06 11.04 31.31 10.698 7.585 41181.56 5.66 48.61 5.496 11.803 40372.37 9.68 56.50 9.591 13.996 - 40392.14 47.00 11.89 11.637 11.775 40613.80 36.90 9.16 9.086 9.019 40504.66 53.60 7.73 13.234 7.639 40388.99 48.70 11.93 12.058 11.814 16 67952.27 16.27 43.70 9.576 6.432 68031.40 10.56 53.70 6.206 7.894 6.95 12.81 46.90 7.559 6.920 67814.41 16.13 46.50 9.517 6.857 - 68031.46 51.30 11.53 7.541 6.781 68044.57 40.70 8.48 5.982 4.986 67808.13 39.60 15.86 5.840 9.355 67818.32 52.90 11.51 7.801 6.791 32 90445.09 15.41 99.90 6.817 11.045 90210.34 16.11 100.00 7.143 
11.085 90221.84 17.31 98.90 7.676 10.962 90712.78 18.41 99.40 8.120 10.958 - 89155.51 99.90 12.89 11.205 5.782 90058.54 99.90 16.16 11.093 7.179 90092.31 98.60 15.41 10.944 6.840 88688.96 99.00 17.59 11.163 7.933 64 89983.76 13.66 100.00 6.071 11.113 90504.24 17.54 100.00 7.750 11.049 92043.36 17.44 99.70 7.580 10.832 90979.29 16.01 99.90 7.038 10.981 - 88615.27 99.90 14.91 11.273 6.729 89316.13 99.90 17.28 11.185 7.740 90622.85 99.90 16.81 11.024 7.420 89084.85 99.90 17.51 11.214 7.861 Run #2. Slower host is replaced with better one. ABC=0. No runs in backward directions. new old 2 24009.73 8.80 6.49 3.667 10.806 24008.43 8.00 6.32 3.334 10.524 4 40012.53 18.30 8.79 4.574 8.783 3.84 19.40 8.86 4.851 8.857 8 60500.29 26.30 12.78 4.348 8.452 60397.79 26.30 11.73 4.355 7.769 16 69619.95 39.80 14.03 5.717 8.063 70528.72 24.90 14.43 3.531 8.184 32 132522.01 53.20 21.28 4.015 6.424 132602.93 57.70 22.59 4.351 6.813 64 145738.83 60.30 25.01 4.138 6.865 143129.55 73.20 24.19 5.114 6.759 128 148184.21 69.70 24.96 4.704 6.739 148143.47 71.00 25.01 4.793 6.753 256 144798.91 69.40 25.01 4.793 6.908 144086.01 73.00 24.61 5.067 6.832 Frankly, I do not see any statistically valid correlations. Does look like it jumps-around quite a bit - for example the run#2 with -b 16 had the CPU util all over the map on the netperf side. That wasn't by any chance an SMP system? that linux didn't seem to be doing the same thing. Hence my tweaking when seeing this patch come along...] netperf does not catch this. :-) Nope :( One of these days I need to teach netperf how to extract TCP statistics from as many platforms as possible. Meantime it relies as always on the kindness of benchmarkers :) (My appologies to Tennesee Williams :) Even with this patch linux does not ack each second segment dumbly, it waits for some conditions, mostly read() emptying receive queue. Good. HP-UX is indeed dumb about this, but I'm assured it will be changing. I
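The service-demand columns in the tables above are derivable from the throughput and CPU-utilization columns; a quick sketch of netperf's per-transaction arithmetic (assuming a single CPU, as in these runs):

```python
# Sketch of netperf-style service demand: CPU time consumed per transaction,
# in microseconds. Assumes ncpus reflects the CPUs netperf is accounting for.
def service_demand_us(cpu_util_pct, ncpus, trans_per_sec):
    # CPU-seconds burned per wall-clock second, spread over the transactions
    return cpu_util_pct / 100.0 * ncpus * 1e6 / trans_per_sec

# First -b 2 row of Run #1: 23652.00 trans/s at 21.11% remote CPU
print(round(service_demand_us(21.11, 1, 23652.00), 3))  # ~8.925, vs 8.924 in the table
```

The small mismatch with the table is just rounding of the reported CPU percentage; the point is that throughput alone hides what the two service-demand columns expose.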
Re: Network performance degradation from 2.6.11.12 to 2.6.16.20
That came from named. It opens lots of sockets with SIOCGSTAMP. No idea what it needs that many for. IIRC ISC BIND named opens a socket for each IP it finds on the system. Presumably in this way it knows implicitly the destination IP without using platform-specific recvfrom/whatever extensions, and gets some additional parallelism in the stack on SMP systems. Why it needs/wants the timestamps I've no idea; I don't think it gets them that way on all platforms. I suppose the next time I do some named benchmarking I can try to take a closer look in the source. rick jones
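For what it's worth, the one-socket-per-IP trick is not the only way for a UDP server to learn each datagram's destination address; on Linux a single socket can request it via the IP_PKTINFO ancillary message. A hedged sketch (Linux-specific; assumes the 12-byte struct in_pktinfo layout of ifindex plus two addresses):

```python
import socket
import struct

# Fall back to the Linux constant if this Python predates socket.IP_PKTINFO.
IP_PKTINFO = getattr(socket, "IP_PKTINFO", 8)

# Receiver: one socket, asking the stack to report each datagram's dest IP.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.setsockopt(socket.IPPROTO_IP, IP_PKTINFO, 1)

# Sender: loop a datagram back to ourselves.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"query", rx.getsockname())

data, ancdata, _, peer = rx.recvmsg(512, socket.CMSG_SPACE(12))
dest_ip = None
for level, ctype, cdata in ancdata:
    if level == socket.IPPROTO_IP and ctype == IP_PKTINFO:
        # struct in_pktinfo { int ipi_ifindex; in_addr ipi_spec_dst, ipi_addr; }
        _, _, addr = struct.unpack("=I4s4s", cdata[:12])
        dest_ip = socket.inet_ntoa(addr)
print(dest_ip)  # the address this datagram was sent to, e.g. 127.0.0.1
```

This is a sketch of the mechanism, not a claim about why named actually does what it does; the timestamps question is separate (SO_TIMESTAMP/SIOCGSTAMP).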
Re: UDP Out 0f Sequence
Majumder, Rajib wrote: Let's say we have 2 uniprocessor hosts connected back to back. Is there any possibility of an out-of-order scenario on recv? Your application should be written on the assumption that it is possible, regardless of the specifics of the hosts involved, however unlikely they may be to reorder traffic. Is this the same for all kernels (Linux/Solaris)? Your application should be written on the assumption that it is possible, regardless of the specifics of the OSes involved, however unlikely they may be to reorder traffic. rick jones
Question about David's blog entry for NetCONF 2006, Day 1
I was reading David's blog entries on the netdev meeting in Japan, and have a question about this bit: Currently, things like Xen have to put the card into promiscuous mode, accepting all packets, which is quite inefficient. Is the inefficient bit meant for accepting all packets, or more broadly that the promiscuous path is quite inefficient compared to the non-promiscuous path? I ask because I would have thought that if the system were connected to a switch (*), the number of packets received through a NIC in promiscuous mode would be nearly the same as when it was not in promiscuous mode - the delta being (perhaps) multicast frames. rick jones (*) Today, it seems 99 times out of 100 that systems are connected to switches, not hubs.
Re: UDP Out 0f Sequence
Majumder, Rajib wrote: Hi, If I write UDP datagrams 1, 2 and 3 to the network and the receiver receives them in the order 2, 1, 3, where can the sequence get changed? Is it at the source stack, in network transit, or at the destination stack? Yes. :) Although network transit is by far the most likely case. The destination stack is a distant second, and the source stack an even more distant third. Generally, stack writers try to avoid having places in their stacks where things can reorder, but it isn't completely unknown. rick jones
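Since UDP itself will not restore order, applications that care usually carry their own sequence numbers. A minimal sketch (hypothetical helper names, not from any of the stacks discussed) of a receiver that detects and repairs reordering:

```python
import struct

def make_datagram(seq, payload):
    # Prefix each datagram with a 4-byte network-order sequence number.
    return struct.pack("!I", seq) + payload

def deliver_in_order(datagrams):
    """Buffer out-of-order datagrams until the gap fills, then deliver."""
    expected, pending, out = 0, {}, []
    for dgram in datagrams:
        seq = struct.unpack("!I", dgram[:4])[0]
        pending[seq] = dgram[4:]
        while expected in pending:
            out.append(pending.pop(expected))
            expected += 1
    return out

# Datagrams 1, 2, 3 written in order but arriving as 2, 1, 3:
wire = [make_datagram(i, b"msg%d" % i) for i in (1, 0, 2)]
print(deliver_in_order(wire))  # [b'msg0', b'msg1', b'msg2']
```

Note this repairs reordering but not loss; a real protocol would also need a timeout or retransmit policy for missing sequence numbers.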
Re: [PATCH 03/23] e100: Add debugging code for cb cleaning and csum failures.
On Tue, Sep 19, 2006 at 10:28:38AM -0700, Kok, Auke wrote: Refine cb cleaning debug printout and print out all cleaned cbs' status. Add debug flag for EEPROM csum failures that were overridden by the user. Signed-off-by: Jesse Brandeburg [EMAIL PROTECTED] Signed-off-by: Auke Kok [EMAIL PROTECTED] ---

 drivers/net/e100.c | 9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/e100.c b/drivers/net/e100.c
index ab0868c..ae93c62 100644
--- a/drivers/net/e100.c
+++ b/drivers/net/e100.c
@@ -761,6 +761,8 @@ static int e100_eeprom_load(struct nic *
 		DPRINTK(PROBE, ERR, "EEPROM corrupted\n");
 		if (!eeprom_bad_csum_allow)
 			return -EAGAIN;
+		else
+			add_taint(TAINT_MACHINE_CHECK);

I object to this flag being abused this way. A corrupt EEPROM on a network card has _nothing_ to do with a CPU machine check exception. Dave
Re: [PATCH 03/23] e100: Add debugging code for cb cleaning and csum failures.
On Tue, Sep 19, 2006 at 05:40:34PM -0400, Jeff Garzik wrote: Dave Jones wrote: On Tue, Sep 19, 2006 at 10:28:38AM -0700, Kok, Auke wrote: + add_taint(TAINT_MACHINE_CHECK); I object to this flag being abused this way. A corrupt EEPROM on a network card has _nothing_ to do with a CPU machine check exception. Fair enough. Better suggestions? I think it's fair to set _some_ taint flag, perhaps a new one, on known corrupted firmware. But if others disagree, I'll follow the consensus here. I don't object to a new flag, but overloading an existing flag that has an established meaning just seems wrong to me. The question is how many more types of random hardware failures are there that we'd like to do similar things for? Perhaps a catch-all "hardware failure" flag for assorted brokenness would be better than a proliferation of flags? Dave
Re: [PATCH][RFC] Re: high latency with TCP connections
David Miller wrote: From: Rick Jones [EMAIL PROTECTED] Date: Tue, 05 Sep 2006 10:55:16 -0700 Is this really necessary? I thought that the problems with ABC were in trying to apply byte-based heuristics from the RFC(s) to a packet-oriented cwnd in the stack? This is receiver side, and helps a sender who does congestion control based upon packet counting like Linux does. It really is less related to ABC than Alexey implies; we've always had this kind of problem, as I mentioned in previous talks on this issue. For a connection receiving nothing but sub-MSS segments, isn't this going to non-trivially increase the number of ACKs sent? I would expect an unpleasant increase in service demands on something like a burst-enabled (./configure --enable-burst) netperf TCP_RR test: netperf -t TCP_RR -H foo -- -b N # N > 1 to increase as a result. Pipelined HTTP would be like that, some NFS over TCP stuff too, maybe X traffic, other transactional workloads as well - maybe Tuxedo. rick jones
Re: [PATCH][RFC] Re: high latency with TCP connections
Alexey Kuznetsov wrote: Hello! Of course, the number of ACKs increases. It is the goal. :-) unpleasant increase in service demands on something like a burst enabled (./configure --enable-burst) netperf TCP_RR test: netperf -t TCP_RR -H foo -- -b N # N > 1 foo=localhost There isn't any sort of clever short-circuiting in loopback, is there? I do like the convenience of testing things over loopback, but always fret about not including drivers and actual hardware interrupts etc.

b   patched    orig
2   105874.83  105143.71
3   114208.53  114023.07
4   120493.99  120851.27
5   128087.48  128573.33
10  151328.48  151056.00

Probably, the test is done wrong. But I see no difference. Regardless, kudos for running the test. The only thing missing is the -c and -C options to enable the CPU utilization measurements, which will then give the service demand on a CPU time per transaction basis. Or was this a UP system that was taken to CPU saturation? to increase as a result. Pipelined HTTP would be like that, some NFS over TCP stuff too, maybe X traffic, X will be excited about better latency. As for protocols not interested in latency, they will be a little happier if transactions are processed asynchronously. What I'm thinking about isn't so much the latency as it is the aggregate throughput a system can do with lots of these protocols/connections going at the same time. Hence the concern about increases in service demand. But actually, it is not about increasing/decreasing the number of ACKs. It is about killing that pain in the ass which we used to have because we pretended to be too smart. :) rick jones
Re: [PATCH][RFC] Re: high latency with TCP connections
Regardless, kudos for running the test. The only thing missing is the -c and -C options to enable the CPU utilization measurements which will then give the service demand on a CPU time per transaction basis. Or was this a UP system that was taken to CPU saturation? It is my notebook. :-) Of course, cpu consumption is 100%. (Actually, netperf shows 100.10 :-)) Gotta love the accuracy. :) I will redo the test on a real network. What range of -b should I test? I suppose that depends on your patience :) In theory, as you increase (e.g. double) the -b setting you should reach a point of diminishing returns wrt transaction rate. If you see that, and see the service demand flattening out, I'd say it is probably time to stop. I'm also not quite sure if abc needs to be disabled or not. I do know that I left out one very important netperf option. The command line should be: netperf -t TCP_RR -H foo -- -b N -D where -D is added to set TCP_NODELAY. Otherwise, the ratio of transactions to data segments is fubar. That issue is also why I wonder about the setting of tcp_abc. [I have this quixotic pipedream about being able to --enable-burst, set -D and say that the number of TCP segments exchanged on the network is 2X the transaction count when request and response size are < MSS. The raison d'etre for this pipe dream is maximizing PPS with TCP_RR tests without _having_ to have hundreds if not thousands of simultaneous netperfs/connections - say with just as many netperfs/connections as there are CPUs or threads/strands in the system. It was while trying to make this pipe dream a reality that I first noticed that HP-UX 11i, which normally has a very nice ACK avoidance heuristic, would send an immediate ACK if it received back-to-back sub-MSS segments - thus ruining my pipe dream when it came to HP-UX testing. Happily, I noticed that linux didn't seem to be doing the same thing. Hence my tweaking when seeing this patch come along...]
What I'm thinking about isn't so much the latency I understand. Actually, I did those tests ages ago for a pure throughput case, when nothing goes in the opposite direction. I did not find a difference that time. And nobody even noticed that Linux sends ACKs for _each_ small segment on unidirectional connections, for all those years. :-) Not everyone looks very closely (alas, sometimes myself included). If all anyone does is look at throughput, until they saturate the CPU they wouldn't notice. Heck, before netperf and TCP_RR tests, and sadly even still today, most people just look at how fast a single-connection, unidirectional data transfer goes and leave it at that :( Thankfully, the set of most people and netdev aren't completely overlapping. rick jones
NIC interrupt assignments under UltraSPARC-T1
From time to time I play with netperf on different systems. I happen to have occasion to play with a T2000. Under Solaris 10 I am able to coerce the interrupts of the different core GbEs to be on different cores rather than strands of the same core. Under a 2.6.15 kernel (Ubuntu Dapper) it would appear that the old standby of echoing an affinity mask to the IRQ's /proc entry doesn't work - no matter how I change the mask for a NIC, running a netperf TCP_RR test seems to show the interrupts happening on the same strand. Is it indeed not possible to alter the interrupt assignments, or have I (as I'm wont to do) missed something quasi-obvious? thanks, rick jones
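For reference, the value written to /proc/irq/&lt;N&gt;/smp_affinity is just a hex bitmap of CPUs; whether the write actually takes effect is exactly the platform-dependent question raised above. A small sketch of computing the mask (the IRQ number 27 below is made up for illustration):

```python
def affinity_mask(cpus):
    """Hex CPU bitmap as written to /proc/irq/<N>/smp_affinity."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

print(affinity_mask({0}))     # 1
print(affinity_mask({2}))     # 4
print(affinity_mask({0, 2}))  # 5

# Applying it requires root, e.g. (hypothetical IRQ number):
# with open("/proc/irq/27/smp_affinity", "w") as f:
#     f.write(affinity_mask({2}))
```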
Re: [PATCH][RFC] Re: high latency with TCP connections
Alexey Kuznetsov wrote: Hello! Some people reported that this program runs in 9.997 sec when run on FreeBSD. Try the enclosed patch. I have no idea why 9.997 sec is so magic, but I get exactly this number on my notebook. :-) Alexey = This patch enables sending an ACK for every 2nd received segment. It does not affect either MSS-sized connections (obviously) or connections controlled by Nagle (because there is only one small segment in flight). The idea is to record the fact that a small segment arrives on a connection where one small segment has already been received and still not ACKed. In this case an ACK is forced after tcp_recvmsg() drains the receive buffer. In other words, it is a soft every-2nd-segment ACK, which is enough to preserve the ACK clock even when ABC is enabled. Is this really necessary? I thought that the problems with ABC were in trying to apply byte-based heuristics from the RFC(s) to a packet-oriented cwnd in the stack? rick jones
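A toy model (emphatically not the kernel's code) of the interaction being debated: with packet-counting growth every ACK opens cwnd, while RFC 3465 Appropriate Byte Counting only credits full MSS's worth of acked data, so a flow of sub-MSS segments grows cwnd far more slowly:

```python
# Toy slow-start model: packet-counting vs byte-counting (ABC) cwnd growth
# when every received segment is ACKed and segments are sub-MSS.
MSS = 1448  # typical Ethernet MSS with timestamps; an assumption here

def grow(acks, bytes_per_ack, abc):
    cwnd = 4      # initial window, in packets
    acked = 0     # byte credit used by the ABC variant
    for _ in range(acks):
        if abc:
            acked += bytes_per_ack
            if acked >= MSS:   # only a full MSS of acked bytes earns growth
                cwnd += 1
                acked -= MSS
        else:
            cwnd += 1          # packet counting: every ACK earns growth
    return cwnd

# 100 ACKs, each covering a 100-byte segment:
print(grow(100, 100, abc=False))  # 104: packet counting opens cwnd fully
print(grow(100, 100, abc=True))   # 10: only 10000/1448 ~= 6 increments
```

This is only meant to illustrate why preserving the ACK clock (the patch's goal) matters more once ABC is enabled; the real sender-side logic lives in tcp_input.c and is considerably more involved.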
neigh_lookup lockdep warning
Seen during boot of a 2.6.18rc5-git1 based kernel. Dave === [ INFO: possible circular locking dependency detected ] 2.6.17-1.2608.fc6 #1 --- swapper/0 is trying to acquire lock: (tbl-lock){-+-+}, at: [c05bdf97] neigh_lookup+0x50/0xaf but task is already holding lock: (list-lock#3){-+..}, at: [c05bf677] neigh_proxy_process+0x20/0xc2 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: - #2 (list-lock#3){-+..}: [c043c09a] lock_acquire+0x4b/0x6d [c061411f] _spin_lock_irqsave+0x22/0x32 [c05b451f] skb_dequeue+0x12/0x43 [c05b523a] skb_queue_purge+0x14/0x1b [c05be990] neigh_update+0x34a/0x3a6 [c05f0f6e] arp_process+0x4ad/0x4e7 [c05f107c] arp_rcv+0xd4/0xf1 [c05b942c] netif_receive_skb+0x205/0x274 [c7bb0566] rhine_napipoll+0x28d/0x449 [via_rhine] [c05baf73] net_rx_action+0x9d/0x196 [c04293a7] __do_softirq+0x78/0xf2 [c0406673] do_softirq+0x5a/0xbe - #1 (n-lock){-+..}: [c043c09a] lock_acquire+0x4b/0x6d [c0613e48] _write_lock+0x19/0x28 [c05bfc69] neigh_periodic_timer+0x98/0x13c [c042dc58] run_timer_softirq+0x108/0x167 [c04293a7] __do_softirq+0x78/0xf2 [c0406673] do_softirq+0x5a/0xbe - #0 (tbl-lock){-+-+}: [c043c09a] lock_acquire+0x4b/0x6d [c0613f02] _read_lock_bh+0x1e/0x2d [c05bdf97] neigh_lookup+0x50/0xaf [c05bf0b9] neigh_event_ns+0x2c/0x77 [c05f0e2a] arp_process+0x369/0x4e7 [c05f10a1] parp_redo+0x8/0xa [c05bf6bd] neigh_proxy_process+0x66/0xc2 [c042dc58] run_timer_softirq+0x108/0x167 [c04293a7] __do_softirq+0x78/0xf2 [c0406673] do_softirq+0x5a/0xbe other info that might help us debug this: 1 lock held by swapper/0: #0: (list-lock#3){-+..}, at: [c05bf677] neigh_proxy_process+0x20/0xc2 stack backtrace: [c04051ee] show_trace_log_lvl+0x58/0x159 [c04057ea] show_trace+0xd/0x10 [c0405903] dump_stack+0x19/0x1b [c043b182] print_circular_bug_tail+0x59/0x64 [c043b99a] __lock_acquire+0x80d/0x99c [c043c09a] lock_acquire+0x4b/0x6d [c0613f02] _read_lock_bh+0x1e/0x2d [c05bdf97] neigh_lookup+0x50/0xaf [c05bf0b9] neigh_event_ns+0x2c/0x77 [c05f0e2a] 
arp_process+0x369/0x4e7 [c05f10a1] parp_redo+0x8/0xa [c05bf6bd] neigh_proxy_process+0x66/0xc2 [c042dc58] run_timer_softirq+0x108/0x167 [c04293a7] __do_softirq+0x78/0xf2 [c0406673] do_softirq+0x5a/0xbe [c0429250] irq_exit+0x3d/0x3f [c0417cbf] smp_apic_timer_interrupt+0x79/0x7e [c0404b0a] apic_timer_interrupt+0x2a/0x30 DWARF2 unwinder stuck at apic_timer_interrupt+0x2a/0x30 Leftover inexact backtrace: -- http://www.codemonkey.org.uk -- VGER BF report: U 0.489161 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: high latency with TCP connections
David Miller wrote: From: Stephen Hemminger [EMAIL PROTECTED] Date: Wed, 30 Aug 2006 10:27:27 -0700 Linux TCP implements Appropriate Byte Counting (ABC) and this penalizes applications that do small sends. The problem is that the other side may be delaying acknowledgments. If the receiver doesn't acknowledge, the sender will limit itself to the congestion window. If the flow is light, then you will be limited to 4 packets. Right. However it occurred to me the other day that ABC could be made smarter. If we sent small frames, ABC should account for that. Is that part of the application of a byte-based RFC to a packet-counting cwnd? rick jones
Re: [PATCH?] tcp and delayed acks
The point of delayed ACKs was to merge the response and the ack on request/response protocols like NFS or telnet. It does make sense to get it out sooner though. Well, to a point at least - I wouldn't go so far as to suggest immediate ACKs. However, I was always under the impression that ACKs were sent (in the mythical generic TCP stack) when: a) there was data going the other way; b) there was a window update going the other way; c) the standalone ACK timer expired. Does this patch then implement b? Were there perhaps holes in the logic when things were smaller than the MTU/MSS? (-v 2 on the netperf command line should show what the MSS was for the connection) rick jones BTW, many points scored for including CPU utilization and service demand figures with the netperf output :) [All tests run with maxcpus=1 on a 2.67GHz Woodcrest system.]

Recv   Send   Send    Elapsed             Utilization     Service Demand
Socket Socket Message Time    Throughput  Send    Recv    Send    Recv
Size   Size   Size                        local   remote  local   remote
bytes  bytes  bytes   secs.   10^6bits/s  % S     % S     us/KB   us/KB

Base (2.6.17-rc4): default send buffer size, netperf -C -c

87380  16384  16384   10.02   14127.79    99.90   99.90   0.579   0.579
87380  16384  16384   10.02   13875.28    99.90   99.90   0.590   0.590
87380  16384  16384   10.01   13777.25    99.90   99.90   0.594   0.594
87380  16384  16384   10.02   13796.31    99.90   99.90   0.593   0.593
87380  16384  16384   10.01   13801.97    99.90   99.90   0.593   0.593

netperf -C -c -- -s 1024

87380  2048   2048    10.02   0.43        -0.04   -0.04   -7.105  -7.377
87380  2048   2048    10.02   0.43        -0.01   -0.01   -2.337  -2.620
87380  2048   2048    10.02   0.43        -0.03   -0.03   -5.683  -5.940
87380  2048   2048    10.02   0.43        -0.05   -0.05   -9.373  -9.625
87380  2048   2048    10.02   0.43        -0.05   -0.05   -9.373  -9.625

Hmm, those CPU numbers don't look right. I guess there must still be some holes in the procstat CPU method code in netperf :(
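The a/b/c conditions above are easy to state as a predicate; a trivial sketch of that "mythical generic" decision, with (c) being the delayed-ACK timer this thread is about:

```python
# Toy version of the classic ACK policy: ack immediately when (a) data or
# (b) a window update is going the other way, else wait for (c) the
# standalone delayed-ACK timer. Not any particular stack's actual logic.
def should_ack_now(data_to_send, window_update_pending, ack_timer_expired):
    return data_to_send or window_update_pending or ack_timer_expired

print(should_ack_now(True, False, False))   # True: piggyback on outgoing data
print(should_ack_now(False, True, False))   # True: window update going back
print(should_ack_now(False, False, False))  # False: hold for the timer
```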
Re: Packet Corruption Support
Ritesh Taank wrote: Hi there, I am currently using netem as packaged with my linux kernel 2.6.17 (as part of the Knoppix 5.0.1 Boot CD), and the 'corrupt' parameter is not being recognised as a valid argument. Having read many posts online, it appears that the Packet Corruption feature should be supported from kernel versions 2.6.16 onwards. I would think that if run on an end system at least, trying to corrupt packets with CKO (checksum offload) enabled on a NIC might be, well, difficult. Or does netem disable CKO? rick jones
Re: [PATCH 1/2]: powerpc/cell spidernet bottom half
Linas Vepstas wrote: On Wed, Aug 16, 2006 at 11:24:46PM +0200, Arnd Bergmann wrote: it only seems to be hard to make it go fast using any of them. Last round of measurements seemed linear for packet sizes between 60 and 600 bytes, suggesting that the hardware can handle a maximum of 120K descriptors/second, independent of packet size. I don't know why this is. DMA overhead perhaps? If it takes so many micro/nanoseconds to get a DMA going... That used to be a reason the Tigon2 had such low PPS rates and issues with multiple buffer packets and a 1500 byte MTU - it had rather high DMA setup latency, and then if you put it into a system with highish DMA read/write latency... well that didn't make it any better :) rick jones
Re: [PATCH] Use __always_inline in orinoco_lock()/orinoco_unlock()
On Tue, Aug 15, 2006 at 03:25:58PM -0400, Pavel Roskin wrote: diff --git a/drivers/net/wireless/orinoco.h b/drivers/net/wireless/orinoco.h index 16db3e1..8fd9b32 100644 --- a/drivers/net/wireless/orinoco.h +++ b/drivers/net/wireless/orinoco.h @@ -135,11 +135,9 @@ extern irqreturn_t orinoco_interrupt(int // /* These functions *must* be inline or they will break horribly on - * SPARC, due to its weird semantics for save/restore flags. extern - * inline should prevent the kernel from linking or module from - * loading if they are not inlined. */ -extern inline int orinoco_lock(struct orinoco_private *priv, - unsigned long *flags) + * SPARC, due to its weird semantics for save/restore flags. */ Didn't that get fixed up for SPARC a year or so back? Dave -- http://www.codemonkey.org.uk - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
remove unnecessary config.h includes from drivers/net/
On Wed, Aug 09, 2006 at 09:04:38PM -0700, David Miller wrote: From: Dave Jones [EMAIL PROTECTED] Date: Wed, 9 Aug 2006 22:21:16 -0400 config.h is automatically included by kbuild these days. Signed-off-by: Dave Jones [EMAIL PROTECTED] Applied to net-2.6.19, thanks Dave. Here's a similar patch that does the same removals for drivers/net/ Signed-off-by: Dave Jones [EMAIL PROTECTED] --- linux-2.6.17.noarch/drivers/net/irda/mcs7780.c~ 2006-08-10 21:35:23.0 -0400 +++ linux-2.6.17.noarch/drivers/net/irda/mcs7780.c 2006-08-10 21:35:25.0 -0400 @@ -45,7 +45,6 @@ #include linux/module.h #include linux/moduleparam.h -#include linux/config.h #include linux/kernel.h #include linux/types.h #include linux/errno.h --- linux-2.6.17.noarch/drivers/net/irda/w83977af_ir.c~ 2006-08-10 21:35:28.0 -0400 +++ linux-2.6.17.noarch/drivers/net/irda/w83977af_ir.c 2006-08-10 21:35:30.0 -0400 @@ -40,7 +40,6 @@ / #include linux/module.h -#include linux/config.h #include linux/kernel.h #include linux/types.h #include linux/skbuff.h --- linux-2.6.17.noarch/drivers/net/smc911x.c~ 2006-08-10 21:35:34.0 -0400 +++ linux-2.6.17.noarch/drivers/net/smc911x.c 2006-08-10 21:35:37.0 -0400 @@ -55,8 +55,6 @@ static const char version[] = ) #endif - -#include linux/config.h #include linux/init.h #include linux/module.h #include linux/kernel.h --- linux-2.6.17.noarch/drivers/net/netx-eth.c~ 2006-08-10 21:35:41.0 -0400 +++ linux-2.6.17.noarch/drivers/net/netx-eth.c 2006-08-10 21:35:42.0 -0400 @@ -17,7 +17,6 @@ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ -#include linux/config.h #include linux/init.h #include linux/module.h #include linux/kernel.h --- linux-2.6.17.noarch/drivers/net/wan/cycx_main.c~2006-08-10 21:35:45.0 -0400 +++ linux-2.6.17.noarch/drivers/net/wan/cycx_main.c 2006-08-10 21:35:48.0 -0400 @@ -40,7 +40,6 @@ * 1998/08/08 acmeInitial version. */ -#include linux/config.h /* OS configuration options */ #include linux/stddef.h /* offsetof(), etc. 
*/ #include linux/errno.h /* return codes */ #include linux/string.h /* inline memset(), etc. */ --- linux-2.6.17.noarch/drivers/net/wan/sdla.c~ 2006-08-10 21:35:51.0 -0400 +++ linux-2.6.17.noarch/drivers/net/wan/sdla.c 2006-08-10 21:35:53.0 -0400 @@ -32,7 +32,6 @@ * 2 of the License, or (at your option) any later version. */ -#include linux/config.h /* for CONFIG_DLCI_MAX */ #include linux/module.h #include linux/kernel.h #include linux/types.h --- linux-2.6.17.noarch/drivers/net/wan/dlci.c~ 2006-08-10 21:35:57.0 -0400 +++ linux-2.6.17.noarch/drivers/net/wan/dlci.c 2006-08-10 21:35:59.0 -0400 @@ -28,7 +28,6 @@ * 2 of the License, or (at your option) any later version. */ -#include linux/config.h /* for CONFIG_DLCI_COUNT */ #include linux/module.h #include linux/kernel.h #include linux/types.h --- linux-2.6.17.noarch/drivers/net/phy/vitesse.c~ 2006-08-10 21:36:02.0 -0400 +++ linux-2.6.17.noarch/drivers/net/phy/vitesse.c 2006-08-10 21:36:04.0 -0400 @@ -12,7 +12,6 @@ * */ -#include linux/config.h #include linux/kernel.h #include linux/module.h #include linux/mii.h --- linux-2.6.17.noarch/drivers/net/phy/smsc.c~ 2006-08-10 21:36:07.0 -0400 +++ linux-2.6.17.noarch/drivers/net/phy/smsc.c 2006-08-10 21:36:08.0 -0400 @@ -14,7 +14,6 @@ * */ -#include linux/config.h #include linux/kernel.h #include linux/module.h #include linux/mii.h --- linux-2.6.17.noarch/drivers/net/hp100.c~2006-08-10 21:36:12.0 -0400 +++ linux-2.6.17.noarch/drivers/net/hp100.c 2006-08-10 21:36:14.0 -0400 @@ -111,7 +111,6 @@ #include linux/etherdevice.h #include linux/skbuff.h #include linux/types.h -#include linux/config.h /* for CONFIG_PCI */ #include linux/delay.h #include linux/init.h #include linux/bitops.h --- linux-2.6.17.noarch/drivers/net/3c501.c~2006-08-10 21:36:18.0 -0400 +++ linux-2.6.17.noarch/drivers/net/3c501.c 2006-08-10 21:36:20.0 -0400 @@ -120,7 +120,6 @@ static const char version[] = #include linux/slab.h #include linux/string.h #include linux/errno.h -#include linux/config.h /* for 
CONFIG_IP_MULTICAST */ #include linux/spinlock.h #include linux/ethtool.h #include linux/delay.h -- http://www.codemonkey.org.uk - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
IPX changes introduce warning.
We've just added an implicit declaration in the latest tree...

net/ipx/af_ipx.c: In function 'ipx_rcv':
net/ipx/af_ipx.c:1648: error: implicit declaration of function 'ipxhdr'

(Yes, my builds fail on -Werror-implicit, so that things like this get caught early) Probably something simple like a missing #include, but I'm heading out the door right now :) I'll poke at it later if no-one has beaten me to it. Dave -- http://www.codemonkey.org.uk
remove unnecessary config.h includes from net/
config.h is automatically included by kbuild these days.

Signed-off-by: Dave Jones [EMAIL PROTECTED]

--- linux-2.6/net/ipv4/netfilter/ip_conntrack_sip.c~	2006-08-09 22:18:48.0 -0400
+++ linux-2.6/net/ipv4/netfilter/ip_conntrack_sip.c	2006-08-09 22:18:53.0 -0400
@@ -8,7 +8,6 @@
  * published by the Free Software Foundation.
  */
-#include <linux/config.h>
 #include <linux/module.h>
 #include <linux/ctype.h>
 #include <linux/skbuff.h>
--- linux-2.6/net/ipv4/af_inet.c~	2006-08-09 22:18:58.0 -0400
+++ linux-2.6/net/ipv4/af_inet.c	2006-08-09 22:19:03.0 -0400
@@ -67,7 +67,6 @@
  * 2 of the License, or (at your option) any later version.
  */
-#include <linux/config.h>
 #include <linux/err.h>
 #include <linux/errno.h>
 #include <linux/types.h>
--- linux-2.6/net/ipv4/ipconfig.c~	2006-08-09 22:19:07.0 -0400
+++ linux-2.6/net/ipv4/ipconfig.c	2006-08-09 22:19:10.0 -0400
@@ -31,7 +31,6 @@
  * -- Josef Siemes [EMAIL PROTECTED], Aug 2002
  */
-#include <linux/config.h>
 #include <linux/types.h>
 #include <linux/string.h>
 #include <linux/kernel.h>
--- linux-2.6/net/ipv4/raw.c~	2006-08-09 22:19:14.0 -0400
+++ linux-2.6/net/ipv4/raw.c	2006-08-09 22:19:18.0 -0400
@@ -38,8 +38,7 @@
  * as published by the Free Software Foundation; either version
  * 2 of the License, or (at your option) any later version.
  */
-
-#include <linux/config.h>
+
 #include <linux/types.h>
 #include <asm/atomic.h>
 #include <asm/byteorder.h>
--- linux-2.6/net/ipv4/tcp_veno.c~	2006-08-09 22:19:23.0 -0400
+++ linux-2.6/net/ipv4/tcp_veno.c	2006-08-09 22:19:26.0 -0400
@@ -9,7 +9,6 @@
  * See http://www.ntu.edu.sg/home5/ZHOU0022/papers/CPFu03a.pdf
  */
-#include <linux/config.h>
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/skbuff.h>
--- linux-2.6/net/ipv4/tcp_lp.c~	2006-08-09 22:19:31.0 -0400
+++ linux-2.6/net/ipv4/tcp_lp.c	2006-08-09 22:19:34.0 -0400
@@ -31,7 +31,6 @@
  * Version: $Id: tcp_lp.c,v 1.22 2006-05-02 18:18:19 hswong3i Exp $
  */
-#include <linux/config.h>
 #include <linux/module.h>
 #include <net/tcp.h>
--- linux-2.6/net/atm/atm_sysfs.c~	2006-08-09 22:19:38.0 -0400
+++ linux-2.6/net/atm/atm_sysfs.c	2006-08-09 22:19:40.0 -0400
@@ -1,6 +1,5 @@
 /* ATM driver model support. */
-#include <linux/config.h>
 #include <linux/kernel.h>
 #include <linux/init.h>
 #include <linux/kobject.h>
--- linux-2.6/net/core/wireless.c~	2006-08-09 22:19:44.0 -0400
+++ linux-2.6/net/core/wireless.c	2006-08-09 22:19:47.0 -0400
@@ -72,7 +72,6 @@
 /* INCLUDES */
-#include <linux/config.h>	/* Not needed ??? */
 #include <linux/module.h>
 #include <linux/types.h>	/* off_t */
 #include <linux/netdevice.h>	/* struct ifreq, dev_get_by_name() */
--- linux-2.6/net/core/dev_mcast.c~	2006-08-09 22:19:52.0 -0400
+++ linux-2.6/net/core/dev_mcast.c	2006-08-09 22:19:59.0 -0400
@@ -21,8 +21,7 @@
  * 2 of the License, or (at your option) any later version.
  */
-#include <linux/config.h>
-#include <linux/module.h>
+#include <linux/module.h>
 #include <asm/uaccess.h>
 #include <asm/system.h>
 #include <linux/bitops.h>
--
http://www.codemonkey.org.uk
another networking lockdep trace.
From a recent rc3-git kernel. Dave -- http://www.codemonkey.org.uk ---BeginMessage--- Please do not reply directly to this email. All additional comments should be made in the comments box of this bug report. https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=201560 Summary: INFO: inconsistent lock state - during boot .2528 Product: Fedora Core Version: devel Platform: All OS/Version: Linux Status: NEW Severity: normal Priority: normal Component: kernel AssignedTo: [EMAIL PROTECTED] ReportedBy: [EMAIL PROTECTED] QAContact: [EMAIL PROTECTED] CC: [EMAIL PROTECTED] Description of problem: Get this on boot of new kernel: Aug 7 06:47:44 localhost kernel: [ INFO: inconsistent lock state ] Aug 7 06:47:44 localhost pcscd: winscard.c:219:SCardConnect() Reader E-Gate 0 0 Not Found Aug 7 06:47:44 localhost kernel: - Aug 7 06:47:44 localhost kernel: inconsistent {in-softirq-W} - {softirq-on-W} usage. Aug 7 06:47:44 localhost kernel: ip/2617 [HC0[0]:SC0[0]:HE1:SE1] takes: Aug 7 06:47:44 localhost kernel: (ifa-lock){-+..}, at: [f90a3836] inet6_addr_add+0xf8/0x13e [ipv6] Aug 7 06:47:44 localhost kernel: {in-softirq-W} state was registered at: Aug 7 06:47:44 localhost kernel: [c043bfb9] lock_acquire+0x4b/0x6a Aug 7 06:47:44 localhost kernel: [c060f428] _spin_lock_bh+0x1e/0x2d Aug 7 06:47:44 localhost kernel: [f90a4757] addrconf_dad_timer+0x3a/0xe2 [ipv6] Aug 7 06:47:44 localhost pcscd: winscard.c:219:SCardConnect() Reader E-Gate 0 0 Not Found Aug 7 06:47:44 localhost kernel: [c042dbc0] run_timer_softirq+0x108/0x167 Aug 7 06:47:44 localhost kernel: [c04293ab] __do_softirq+0x78/0xf2 Aug 7 06:47:44 localhost kernel: [c0406673] do_softirq+0x5a/0xbe Aug 7 06:47:44 localhost kernel: irq event stamp: 3551 Aug 7 06:47:44 localhost kernel: hardirqs last enabled at (3551): [c04291bf] local_bh_enable_ip+0xc6/0xcf Aug 7 06:47:45 localhost kernel: hardirqs last disabled at (3549): [c0429152] local_bh_enable_ip+0x59/0xcf Aug 7 06:47:45 localhost kernel: softirqs last enabled at (3550): 
[f90a09ce] ipv6_add_addr+0x210/0x254 [ipv6] Aug 7 06:47:45 localhost kernel: softirqs last disabled at (3538): [c060f4f7] _read_lock_bh+0xb/0x2d Aug 7 06:47:45 localhost kernel: Aug 7 06:47:45 localhost kernel: other info that might help us debug this: Aug 7 06:47:45 localhost kernel: 1 lock held by ip/2617: Aug 7 06:47:45 localhost kernel: #0: (rtnl_mutex){--..}, at: [c060e378] mutex_lock+0x21/0x24 Aug 7 06:47:45 localhost kernel: Aug 7 06:47:45 localhost kernel: stack backtrace: Aug 7 06:47:45 localhost kernel: [c04051ee] show_trace_log_lvl+0x58/0x159 Aug 7 06:47:45 localhost kernel: [c04057ea] show_trace+0xd/0x10 Aug 7 06:47:45 localhost kernel: [c0405903] dump_stack+0x19/0x1b Aug 7 06:47:45 localhost kernel: [c043a402] print_usage_bug+0x1ca/0x1d7 Aug 7 06:47:45 localhost kernel: [c043a8eb] mark_lock+0x239/0x353 Aug 7 06:47:45 localhost kernel: [c043b50a] __lock_acquire+0x459/0x997 Aug 7 06:47:45 localhost kernel: [c043bfb9] lock_acquire+0x4b/0x6a Aug 7 06:47:45 localhost kernel: [c060f3fb] _spin_lock+0x19/0x28 Aug 7 06:47:45 localhost kernel: [f90a3836] inet6_addr_add+0xf8/0x13e [ipv6] Aug 7 06:47:45 localhost kernel: [f90a3a39] inet6_rtm_newaddr+0x1bd/0x1d2 [ipv6] Aug 7 06:47:45 localhost kernel: [c05bf5f3] rtnetlink_rcv_msg+0x1b3/0x1d6 Aug 7 06:47:45 localhost kernel: [c05cae7b] netlink_run_queue+0x69/0xfe Aug 7 06:47:45 localhost kernel: [c05bf3f6] rtnetlink_rcv+0x29/0x42 Aug 7 06:47:45 localhost kernel: [c05cb308] netlink_data_ready+0x12/0x50 Aug 7 06:47:45 localhost kernel: [c05ca370] netlink_sendskb+0x1f/0x37 Aug 7 06:47:45 localhost kernel: [c05cac49] netlink_unicast+0x1a1/0x1bb Aug 7 06:47:45 localhost kernel: [c05cb2e9] netlink_sendmsg+0x275/0x282 Aug 7 06:47:45 localhost kernel: [c05ae91a] sock_sendmsg+0xe8/0x103 Aug 7 06:47:45 localhost kernel: [c05af129] sys_sendmsg+0x14d/0x1a8 Aug 7 06:47:45 localhost kernel: [c05b02fb] sys_socketcall+0x16b/0x186 Aug 7 06:47:45 localhost kernel: [c0403faf] syscall_call+0x7/0xb Aug 7 06:47:45 localhost kernel: 
DWARF2 unwinder stuck at syscall_call+0x7/0xb Aug 7 06:47:45 localhost kernel: Leftover inexact backtrace: Aug 7 06:47:45 localhost avahi-daemon[2392]: New relevant interface eth0.IPv6 for mDNS. Aug 7 06:47:45 localhost kernel: [c04057ea] show_trace+0xd/0x10 Aug 7 06:47:45 localhost avahi-daemon[2392]: Joining mDNS multicast group on interface eth0.IPv6 with address fe80::20a:e4ff:fe3f:8bc4. Aug 7 06:47:45 localhost kernel: [c0405903] dump_stack+0x19/0x1b Aug 7 06:47:45 localhost avahi-daemon[2392]: Registering new address record for fe80::20a:e4ff:fe3f:8bc4 on eth0. Aug 7 06:47:45 localhost kernel: [c043a402] print_usage_bug+0x1ca/0x1d7 Aug 7 06:47:45 localhost kernel: [c043a8eb] mark_lock+0x239/0x353 Aug 7
Re: [RFC] driver adjusts qlen, increases CPU
Jesse Brandeburg wrote: So we've recently put a bit of code in our e1000 driver to decrease the qlen based on the speed of the link. On the surface it seems like a great idea. A driver knows when the link speed changed, and having a 1000 packet deep queue (the default for most kernels now) on top of a 100Mb/s link (or 10Mb/s worst case for us) makes for a *lot* of latency if many packets are queued up in the qdisc. Problem we've seen is that setting this shorter queue causes a large spike in cpu when transmitting using UDP:

100Mb/s link:
  txqueuelen: 1000  Throughput: 92.44  CPU: 5.00
  txqueuelen: 100   Throughput: 93.80  CPU: 61.59

Is this expected? any comments?

Triggering intra-stack flow-control perhaps? Perhaps 10X more often than before if the queue is 1/10th what it was before? Out of curiosity, how does the UDP socket's SO_SNDBUF compare to the queue depth?

rick jones
Re: orinoco driver causes *lots* of lockdep spew
On Thu, Aug 03, 2006 at 03:11:53PM +0100, Christoph Hellwig wrote: Could we please just get rid of the wireless extensions over netlink code again? It doesn't help to solve anything and just creates a bigger mess to untangle when switching to a fully fledged wireless stack. If we're going to do that, now is probably the best time to do it, before any distro userland starts using it. Dave -- http://www.codemonkey.org.uk
Re: orinoco driver causes *lots* of lockdep spew
On Thu, Aug 03, 2006 at 11:58:00AM -0700, Jean Tourrilhes wrote: On Thu, Aug 03, 2006 at 03:11:53PM +0100, Christoph Hellwig wrote: On Thu, Aug 03, 2006 at 11:54:41PM +1000, Herbert Xu wrote: Arjan van de Ven [EMAIL PROTECTED] wrote: this is another one of those nasty buggers; Good catch. It's really time that we fix this properly rather than adding more kludges to the core code. Dave, once this goes in you can revert the previous netlink workaround that added the _bh suffix. [WIRELESS]: Send wireless netlink events with a clean slate Could we please just get rid of the wireless extensions over netlink code again? It doesn't help to solve anything and just creates a bigger mess to untangle when switching to a fully fledged wireless stack. That's not going to happen any time soon, NetworkManager depends on Wireless Events, as well as many other apps. And there are not many mechanisms you can use in the kernel to generate events from driver to userspace. It seemed to cope pretty well before we had this? Dave -- http://www.codemonkey.org.uk
orinoco driver causes *lots* of lockdep spew
Wow. Nearly 400 lines of debug spew, from a simple 'ifup eth1'. Dave ADDRCONF(NETDEV_UP): eth1: link is not ready eth1: New link status: Disconnected (0002) == [ INFO: hard-safe - hard-unsafe lock order detected ] -- events/0/5 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: (af_callback_keys + sk-sk_family){-.--}, at: [802136b1] sock_def_readable+0x19/0x6f and this task is already holding: (priv-lock){++..}, at: [8824f70e] orinoco_send_wevents+0x28/0x8b [orinoco] which would create a new lock dependency: (priv-lock){++..} - (af_callback_keys + sk-sk_family){-.--} but this new dependency connects a hard-irq-safe lock: (priv-lock){++..} ... which became hard-irq-safe at: [802a8e62] lock_acquire+0x4a/0x69 [80267ba2] _spin_lock_irqsave+0x2b/0x3c [8824f7be] orinoco_interrupt+0x4d/0xf49 [orinoco] [8021151f] handle_IRQ_event+0x2b/0x64 [802c0987] __do_IRQ+0xae/0x114 [8026fca8] do_IRQ+0xf7/0x107 [802609c4] common_interrupt+0x64/0x65 to a hard-irq-unsafe lock: (af_callback_keys + sk-sk_family){-.--} ... which became hard-irq-unsafe at: ... 
[802a8e62] lock_acquire+0x4a/0x69 [80267867] _write_lock_bh+0x29/0x36 [80433960] netlink_release+0x139/0x2ca [80257903] sock_release+0x19/0x9b [80257b13] sock_close+0x33/0x3a [802130ee] __fput+0xc6/0x1a8 [8022effe] fput+0x13/0x16 [80225383] filp_close+0x64/0x70 [8021eecc] sys_close+0x93/0xb0 [8026048d] system_call+0x7d/0x83 other info that might help us debug this: 1 lock held by events/0/5: #0: (priv-lock){++..}, at: [8824f70e] orinoco_send_wevents+0x28/0x8b [orinoco] the hard-irq-safe lock's dependencies: - (priv-lock){++..} ops: 0 { initial-use at: [802a8e62] lock_acquire+0x4a/0x69 [80267a3e] _spin_lock_irq+0x2a/0x38 [8824f102] orinoco_init+0x934/0x966 [orinoco] [8041e762] register_netdevice+0xe6/0x375 [8041ea4b] register_netdev+0x5a/0x69 [8826155f] orinoco_cs_probe+0x3d7/0x475 [orinoco_cs] [803daa02] pcmcia_device_probe+0x7f/0x124 [803b5e74] driver_probe_device+0x5b/0xb1 [803b5fde] __driver_attach+0x88/0xdb [803b5826] bus_for_each_dev+0x48/0x7a [803b5d9e] driver_attach+0x1b/0x1e [803b543e] bus_add_driver+0x88/0x138 [803b6289] driver_register+0x8e/0x93 [803da89b] pcmcia_register_driver+0xd0/0xda [880a9024] 0x880a9024 [802af420] sys_init_module+0x16f2/0x18b7 [8026048d] system_call+0x7d/0x83 in-hardirq-W at: [802a8e62] lock_acquire+0x4a/0x69 [80267ba2] _spin_lock_irqsave+0x2b/0x3c [8824f7be] orinoco_interrupt+0x4d/0xf49 [orinoco] [8021151f] handle_IRQ_event+0x2b/0x64 [802c0987] __do_IRQ+0xae/0x114 [8026fca8] do_IRQ+0xf7/0x107 [802609c4] common_interrupt+0x64/0x65 in-softirq-W at: [802a8e62] lock_acquire+0x4a/0x69 [80267ba2] _spin_lock_irqsave+0x2b/0x3c [8824f7be] orinoco_interrupt+0x4d/0xf49 [orinoco] [8021151f] handle_IRQ_event+0x2b/0x64 [802c0987] __do_IRQ+0xae/0x114 [8026fca8] do_IRQ+0xf7/0x107 [802609c4] common_interrupt+0x64/0x65 [8028ebce] scheduler_tick+0xc1/0x362 [80261739] call_softirq+0x1d/0x28 [80295edb] irq_exit+0x56/0x59 [8027a67f] smp_apic_timer_interrupt+0x5c/0x62 [802610ad] apic_timer_interrupt+0x69/0x70 } ... 
key at: [8825fd80] __key.22351+0x0/0x27fa [orinoco] - (cwq-lock){++..} ops: 0 { initial-use at: [802a8e62] lock_acquire+0x4a/0x69 [80267ba2] _spin_lock_irqsave+0x2b/0x3c [802a0314] __queue_work+0x17/0x5e [802a03de] queue_work+0x4d/0x57 [8029fdda]
Re: e1000 speed/duplex error
I thought the common behavior is that if one side forces any particular parameter, the other side should sense that and go to that mode too. Nope. That is a common misconception and perhaps the source of many duplex mismatch problems today. Here is some boilerplate I bring out from time to time that may be of help:

$ cat usenet_replies/duplex

How 100Base-T Autoneg is supposed to work:

When both sides of the link are set to autoneg, they will negotiate the duplex setting and select full-duplex if both sides can do full-duplex. If one side is hardcoded and not using autoneg, the autoneg process will fail and the side trying to autoneg is required by spec to use half-duplex mode. If one side is using half-duplex, and the other is using full-duplex, sorrow and woe is the usual result. So, the following table shows what will happen given various settings on each side:

           Auto        Half        Full
  Auto     Happiness   Lucky       Sorrow
  Half     Lucky       Happiness   Sorrow
  Full     Sorrow      Sorrow      Happiness

Happiness means that there is a good shot of everything going well. Lucky means that things will likely go well, but not because you did anything correctly :) Sorrow means that there _will_ be a duplex mis-match. When there is a duplex mismatch, on the side running half-duplex you will see various errors and probably a number of _LATE_ collisions (normal collisions don't count here). On the side running full-duplex you will see things like FCS errors. Note that those errors are not necessarily conclusive, they are simply indicators.

Further, it is important to keep in mind that a clean ping (or the like - eg linkloop or default netperf TCP_RR) test result is inconclusive here - a duplex mismatch causes lost traffic _only_ when both sides of the link try to speak at the same time. A typical ping test, being synchronous, one at a time request/response, never tries to have both sides talking at the same time.

Finally, when/if you migrate to 1000Base-T, everything has to be set to auto-neg anyway.
rick jones
neigh_lookup lockdep bug.
2.6.18rc2-gitSomething on my firewall box just triggered this.. Dave [515613.791771] === [515613.841467] [ INFO: possible circular locking dependency detected ] [515613.873284] --- [515613.904945] swapper/0 is trying to acquire lock: [515613.931489] (tbl-lock){-+-+}, at: [c05b5d63] neigh_lookup+0x50/0xaf [515613.964369] [515613.964373] but task is already holding lock: [515614.006550] (skb_queue_lock_key){-+..}, at: [c05b741c] neigh_proxy_process+0x20/0xc2 [515614.043225] [515614.043228] which lock already depends on the new lock. [515614.043234] [515614.103456] [515614.103459] the existing dependency chain (in reverse order) is: [515614.148752] [515614.148755] - #2 (skb_queue_lock_key){-+..}: [515614.10][c043bf43] lock_acquire+0x4b/0x6c [515614.215554][c06089a7] _spin_lock_irqsave+0x22/0x32 [515614.243606][c05ac2e3] skb_dequeue+0x12/0x43 [515614.269657][c05acffe] skb_queue_purge+0x14/0x1b [515614.296565][c05b673e] neigh_update+0x317/0x353 [515614.323004][c05e8a0b] arp_process+0x4aa/0x4e4 [515614.349004][c05e8b19] arp_rcv+0xd4/0xf1 [515614.373209][c05b1210] netif_receive_skb+0x204/0x271 [515614.400405][c05b2b73] process_backlog+0x99/0xfa [515614.426351][c05b2d56] net_rx_action+0x9d/0x196 [515614.451856][c04293d5] __do_softirq+0x78/0xf2 [515614.476660][c040662f] do_softirq+0x5a/0xbe [515614.500737] [515614.500741] - #1 (n-lock){-+-+}: [515614.532763][c043bf43] lock_acquire+0x4b/0x6c [515614.556814][c06086d0] _write_lock+0x19/0x28 [515614.580398][c05b7a0e] neigh_periodic_timer+0x98/0x13c [515614.606447][c042db48] run_timer_softirq+0x108/0x167 [515614.631798][c04293d5] __do_softirq+0x78/0xf2 [515614.655122][c040662f] do_softirq+0x5a/0xbe [515614.677721] [515614.677724] - #0 (tbl-lock){-+-+}: [515614.707327][c043bf43] lock_acquire+0x4b/0x6c [515614.729897][c060878a] _read_lock_bh+0x1e/0x2d [515614.752546][c05b5d63] neigh_lookup+0x50/0xaf [515614.774754][c05b6e5e] neigh_event_ns+0x2c/0x77 [515614.797271][c05e88c7] arp_process+0x366/0x4e4 [515614.819349][c05e8b3e] 
parp_redo+0x8/0xa [515614.839660][c05b7462] neigh_proxy_process+0x66/0xc2 [515614.862931][c042db48] run_timer_softirq+0x108/0x167 [515614.886048][c04293d5] __do_softirq+0x78/0xf2 [515614.907136][c040662f] do_softirq+0x5a/0xbe [515614.927553] [515614.927557] other info that might help us debug this: [515614.927563] [515614.966774] 1 lock held by swapper/0: [515614.982693] #0: (skb_queue_lock_key){-+..}, at: [c05b741c] neigh_proxy_process+0x20/0xc2 [515615.013575] [515615.013578] stack backtrace: [515615.037414] [c04051ea] show_trace_log_lvl+0x54/0xfd [515615.057910] [c04057a6] show_trace+0xd/0x10 [515615.075934] [c04058bf] dump_stack+0x19/0x1b [515615.094167] [c043b030] print_circular_bug_tail+0x59/0x64 [515615.116172] [c043b843] __lock_acquire+0x808/0x997 [515615.136514] [c043bf43] lock_acquire+0x4b/0x6c [515615.155699] [c060878a] _read_lock_bh+0x1e/0x2d [515615.175098] [c05b5d63] neigh_lookup+0x50/0xaf [515615.197276] [c05b6e5e] neigh_event_ns+0x2c/0x77 [515615.220267] [c05e88c7] arp_process+0x366/0x4e4 [515615.243248] [c05e8b3e] parp_redo+0x8/0xa [515615.264645] [c05b7462] neigh_proxy_process+0x66/0xc2 [515615.288899] [c042db48] run_timer_softirq+0x108/0x167 [515615.309972] [c04293d5] __do_softirq+0x78/0xf2 [515615.328940] [c040662f] do_softirq+0x5a/0xbe [515615.347150] [c042927e] irq_exit+0x3d/0x3f [515615.365067] [c0417cbb] smp_apic_timer_interrupt+0x79/0x7e [515615.387057] [c0404b0a] apic_timer_interrupt+0x2a/0x30 -- http://www.codemonkey.org.uk - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RDMA will be reverted
David Miller wrote: From: Rick Jones [EMAIL PROTECTED] Date: Mon, 24 Jul 2006 17:55:24 -0700 Even enough bits for 1024 or 2048 CPUs in the single system image? I have seen 1024 touted by SGI, and with things going so multi-core, perhaps 16384 while sounding initially bizarre would be in the realm of theoretically possible before too long. Read the RSS NDIS documents from Microsoft. I'll see about hunting them down. You aren't going to want to demux to more than, say, 256 cpus for a single network adapter even on the largest machines. I suppose, it just seems to tweak _small_ alarms in my intuition - maybe because it still sounds like networking telling the scheduler where to run threads of execution, and even though I'm a networking guy I seem to have the notion that it should be the other way 'round. That would cover TCP, are there similarly fungible fields in SCTP or other ULPs? And if we were to want to get HW support for the thing, getting it adopted in a de jure standards body would probably be in order :) Microsoft never does this, neither do we. LRO came out of our own design, the network folks found it reasonable and thus they have started to implement it. The same is true for Microsoft's RSS stuff. It's a hardware interpretation, therefore it belongs in a driver API specification, nowhere else. It may be a hardware interpretation, but doesn't it have non-trivial system implications - where one runs threads/processes etc? rick jones
Re: What is RDMA
That TOE/iWARP could end up being precluded by NAT seems so ironic from a POE2E standpoint.

rick jones

POE2E: Purity Of End To End
Re: RDMA will be reverted
This all sounds like the discussions we had within HP-UX between 10.20 and 11.0 concerning Inbound Packet Scheduling vs Thread Optimized Packet Scheduling. IPS was done by the 10.20 stack at the handoff between the driver and netisr. If the packet was not an IP datagram fragment, parts of the transport and IP headers would be hashed, and the result would be the netisr queue to which the packet would be queued for further processing. It worked fine and dandy for stuff like aggregate netperf TCP_RR tests because there was a 1-1 correspondence between a connection and a process/thread. It was OK for the networking to dictate where the process should run. That feels rather like a NIC that would hash packets and pick the MSI-X based on that. However, as Andi discusses, when there is a process/thread doing more than one connection, picking a CPU based on address hashing will be like TweedleDee and TweedleDum telling Alice to go in opposite directions. Hence TOPS in 11.X. This time, when there is a normal lookup location in the path, where the application last accessed the socket is determined, and things shift over to that CPU. This then is the process (well actually the scheduler) telling networking where it should do its work. That addresses the multiple connections per thread/process and still works just as well for 1-1. There are still issues if you have multiple threads/processes concurrently accessing the same socket/connection, but that one is much more rare. Nirvana I suppose would be the addition of a field in the header which could be used for the determination of where to process. A Transport Protocol option I suppose, maybe the IPv6 flow id, but Knuth only knows if anyone would go for something along those lines. It does though mean that the state is per-packet without it having to be based on addressing information.
Almost like RDMA arriving saying where the data goes, but this thing says where the processing should happen :) rick jones
Re: RDMA will be reverted
David Miller wrote: From: Rick Jones [EMAIL PROTECTED] Date: Mon, 24 Jul 2006 17:29:05 -0700 Nirvana I suppose would be the addition of a field in the header which could be used for the determination of where to process. A Transport Protocol option I suppose, maybe the IPv6 flow id, but Knuth only knows if anyone would go for something along those lines. It does though mean that the state is per-packet without it having to be based on addressing information. Almost like RDMA arriving saying where the data goes, but this thing says where the processing should happen :) Since the full interpretation of the TCP timestamp option field value is largely local to the peer setting it, there is nothing wrong with stealing a few bits for destination cpu information. Even enough bits for 1024 or 2048 CPUs in the single system image? I have seen 1024 touted by SGI, and with things going so multi-core, perhaps 16384 while sounding initially bizarre would be in the realm of theoretically possible before too long. It would have to be done in such a way as to not make the PAWS tests fail by accident. But I think it's doable. That would cover TCP, are there similarly fungible fields in SCTP or other ULPs? And if we were to want to get HW support for the thing, getting it adopted in a de jure standards body would probably be in order :) rick jones
Re: RDMA will be reverted
It would have to be done in such a way as to not make the PAWS tests fail by accident. But I think it's doable. CPU ID and higher-order generation number such that whenever the process migrates to a lower-numbered CPU, the generation number is bumped to make the timestamp larger than before? rick jones
Re: [PATCH] mark sk98lin driver for removal
On Sat, Jul 22, 2006 at 02:11:50PM -0700, Stephen Hemminger wrote: The sk98lin driver is now superseded by the skge driver. I wanted to just let the old driver wither and die from old age, but there are still bugs that are too painful to fix. See http://bugzilla.kernel.org/show_bug.cgi?id=6780 The board crashes repeatedly after 2 weeks. It probably is something in the vendor MIB code. That code is a mess, and starting over was one of the motivations for creating the skge driver. So rather than add more bondo to the old beater to cover the rusty bits, throw it in the dustbin. Signed-off-by: Stephen Hemminger [EMAIL PROTECTED] After a huge number of bug reports in Fedora 'went away' when we switched our users to using skge instead, I wholeheartedly endorse this. sk98lin is a disaster. The last time I looked the vendor out-of-tree driver had a huge delta vs mainline, and backed out numerous fixes made to it in the mainline kernel. It's a huge effort to get the 'good bits' out of that patch, and letting it die is the only sensible solution IMO. ACKed-by: Dave Jones [EMAIL PROTECTED] +SK98LIN GIGABBIT ETHERNET DRIVER typo :-) Dave -- http://www.codemonkey.org.uk
Re: Netchannles: first stage has been completed. Further ideas.
All this talk reminds me of one thing, how expensive tcp_ack() is. And this expense has nothing to do with TCP really. The main cost is purging and freeing up the skbs which have been ACK'd in the retransmit queue. So tcp_ack() sort of inherits the cost of freeing a bunch of SKBs which haven't been touched by the cpu in some time and are thus nearly guaranteed to be cold in the cache. This is the kind of work we could think about batching to a user sleeping on some socket call. Ultimately isn't that just trying to squeeze the balloon? rick jones nice to see people seeing ACKs as expensive though :)
sch_htb compile fix.
net/sched/sch_htb.c: In function 'htb_change_class':
net/sched/sch_htb.c:1605: error: expected ';' before 'do_gettimeofday'

Signed-off-by: Dave Jones [EMAIL PROTECTED]

--- linux-2.6.17.noarch/net/sched/sch_htb.c~	2006-07-15 03:40:14.0 -0400
+++ linux-2.6.17.noarch/net/sched/sch_htb.c	2006-07-15 03:40:21.0 -0400
@@ -1601,7 +1601,7 @@ static int htb_change_class(struct Qdisc
 	/* set class to be in HTB_CAN_SEND state */
 	cl->tokens = hopt->buffer;
 	cl->ctokens = hopt->cbuffer;
-	cl->mbuffer = PSCHED_JIFFIE2US(HZ*60)	/* 1min */
+	cl->mbuffer = PSCHED_JIFFIE2US(HZ*60);	/* 1min */
 	PSCHED_GET_TIME(cl->t_c);
 	cl->cmode = HTB_CAN_SEND;
--
http://www.codemonkey.org.uk
Re: I/O Acceleration Technology Nics
Ian Brown wrote: Hello, I came across the e1000 download for linux on the intel site. I saw that in the readme they talk about Intel(R) I/O Acceleration Technology; According to this readme, there is support for systems using the Intel(R) 5000 Series Chipsets Integrated Device - 1A38. see: http://downloadmirror.intel.com/df-support/9180/ENG/README.txt My question is: has anybody tried using chipsets with this I/O Acceleration Technology? Did they get a significant performance improvement over non I/O Accelerated nics? IIRC, there were some measurements made and discussed at least a little in netdev. A search of the archives should find them. I would also expect that Intel would have some glossy PDFs on their site touting the performance boosts of the technology :) They should at least somewhere have some links to actual measurements... Ian - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html I suspect the URL above there will start one on the path to the email archive. rick jones