Re: tcp bw in 2.6

2007-10-03 Thread Larry McVoy
> A few notes to the discussion.  I've seen one e1000 bug that ended up being
> a crappy AMD pre-opteron SMP chipset with a totally useless PCI bus
> implementation, which limited performance quite a bit - totally depending on
> what you plugged in and in which slot.  10-euro milk-and-bread-store
> 32/33 gige nics actually were better than server-class e1000's
> in those, but weren't that great either.

That could well be my problem, this is a dual processor (not core) athlon
(not opteron) tyan motherboard if I recall correctly.

> Check your interrupt rates for the interface.  You shouldn't be getting
> anywhere near 1 interrupt/packet.  If you are, something is badly wrong :).

The acks (because I'm sending) are about 1.5 packets/interrupt.
When this box is receiving it's moving about 3x as much data
and has a _lower_ (absolute, not per packet) interrupt load.
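
(For anyone who wants to eyeball the same numbers: per-device interrupt
counts live in /proc/interrupts.  Assuming the e1000 shows up as eth0,
something like this gives interrupts/sec to compare against the packet
rate:)

grep eth0 /proc/interrupts; sleep 1; grep eth0 /proc/interrupts
# subtract the two samples to get the NIC's interrupts/sec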
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-02 Thread Larry McVoy
On Tue, Oct 02, 2007 at 06:52:54PM +0800, Herbert Xu wrote:
> > One of my clients also has gigabit so I played around with just that
> > one and it (itanium running hpux w/ broadcom gigabit) can push the load
> > as well.  One weird thing is that it is dependent on the direction the
> > data is flowing.  If the hp is sending then I get 46MB/sec, if linux is
> > sending then I get 18MB/sec.  Weird.  Linux is debian, running
>
> First of all check the CPU load on both sides to see if either
> of them is saturating.  If the CPU's fine then look at the tcpdump
> output to see if both receivers are using the same window settings.

tcpdump is a good idea; take a look at this.  The window starts out
at 46 and never opens up in my test case, but in the rsh case it
starts out the same but does open up.  Ideas?

08:08:06.033305 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: S 2756874880:2756874880(0) win 32768 <mss 1460,wscale 0,nop>
08:08:06.05 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: S 3360532803:3360532803(0) ack 2756874881 win 5840 <mss 1460,nop,wscale 7>
08:08:06.047924 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 1 win 32768
08:08:06.048218 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 1:2921(2920) ack 1 win 46
08:08:06.048426 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 1461 win 32768
08:08:06.048446 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 2921:5841(2920) ack 1 win 46
08:08:06.048673 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 4381 win 32768
08:08:06.048684 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 5841:10221(4380) ack 1 win 46
08:08:06.049047 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 8761 win 32768
08:08:06.049057 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 10221:16061(5840) ack 1 win 46
08:08:06.049422 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 14601 win 32768
08:08:06.049429 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 16061:18981(2920) ack 1 win 46
08:08:06.049462 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 18981:20441(1460) ack 1 win 46
08:08:06.049484 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 20441:23361(2920) ack 1 win 46
08:08:06.049924 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 21901 win 32768
08:08:06.049943 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 23361:32121(8760) ack 1 win 46
08:08:06.050549 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 30661 win 32768
08:08:06.050559 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 32121:39421(7300) ack 1 win 46
08:08:06.050592 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 39421:40881(1460) ack 1 win 46
08:08:06.050614 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 40881:42341(1460) ack 1 win 46
08:08:06.051170 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 40881 win 32768
08:08:06.051188 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 42341:54021(11680) ack 1 win 46
08:08:06.051923 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 52561 win 32768
08:08:06.051932 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 54021:58401(4380) ack 1 win 46
08:08:06.051942 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 58401:67161(8760) ack 1 win 46
08:08:06.052671 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 65701 win 32768
08:08:06.052680 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 67161:74461(7300) ack 1 win 46
08:08:06.052719 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 74461:77381(2920) ack 1 win 46
08:08:06.052752 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 77381:81761(4380) ack 1 win 46
08:08:06.053549 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 80301 win 32768
08:08:06.053566 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 81761:97821(16060) ack 1 win 46
08:08:06.054423 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 96361 win 32768
08:08:06.054433 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 97821:113881(16060) ack 1 win 46
08:08:06.054476 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: . 113881:115341(1460) ack 1 win 46
08:08:06.055422 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 113881 win 32768
08:08:06.055438 IP work-cluster.bitmover.com.31235 > hp-ia64.bitmover.com.49614: P 115341:131401(16060) ack 1 win 46
08:08:06.056421 IP hp-ia64.bitmover.com.49614 > work-cluster.bitmover.com.31235: . ack 131401 win 32768
08:08:06.056432 IP work-cluster.bitmover.com.31235

Re: tcp bw in 2.6

2007-10-02 Thread Larry McVoy
Interesting data point.  My test case is like this:

server:
	bind
	listen
	while (newsock = accept(...))
		transfer()

client:
	connect
	transfer

If the server side is the source of the data, i.e., its transfer is a
write loop, then I get the bad behaviour.  If I switch them so the data
flows in the other direction, then it works; I go from about 14K pkt/sec
to 43K pkt/sec.

Can anyone else reproduce this?  I can extract the test case from lmbench
so it is standalone but I suspect that any test case will do it.  I'll
try with the one that John sent.  Yup, s/read/write/ and s/write/read/
in his two files at the appropriate places and I get exactly the same
behaviour.

So is this a bug or intentional?
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-02 Thread Larry McVoy
> If the server side is the source of the data, i.e., its transfer is a
> write loop, then I get the bad behaviour.
> ...
> So is this a bug or intentional?

For whatever it is worth, I believe that we used to get better performance
from the same hardware.  My guess is that it changed somewhere between
2.6.15-1-k7 and 2.6.18-5-k7.
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-02 Thread Larry McVoy
Isn't this something so straightforward that you would have tests for it?
This is the basic FTP server loop; doesn't someone have a big machine with
10gig cards that tests that sending/receiving data doesn't regress?

> Sounds like a bug to me, modulo the above caveat of making sure that it's
> not some hw/driver/switch kind of difference.

Pretty unlikely given that we've changed the switch, the card works fine
in the other direction, and I'm 95% sure that we used to get better perf
before we switched to a more recent kernel.

I'll try and find some other gig ether cards and try them.
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-02 Thread Larry McVoy
On Tue, Oct 02, 2007 at 09:47:26AM -0700, Stephen Hemminger wrote:
> On Tue, 2 Oct 2007 09:25:34 -0700
> [EMAIL PROTECTED] (Larry McVoy) wrote:
>
> > > If the server side is the source of the data, i.e., its transfer is a
> > > write loop, then I get the bad behaviour.
> > > ...
> > > So is this a bug or intentional?
> >
> > For whatever it is worth, I believe that we used to get better performance
> > from the same hardware.  My guess is that it changed somewhere between
> > 2.6.15-1-k7 and 2.6.18-5-k7.
>
> For the period from 2.6.15 to 2.6.18, the kernel by default enabled TCP
> Appropriate Byte Counting.  This caused bad performance on applications that
> did small writes.

It's doing 1MB writes.

Is there a sockopt to turn that off?  Or /proc or something?
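
(If I'm remembering the knob right, ABC is a global /proc tunable rather
than a per-socket option, so something like this -- untested here --
should turn it off:)

cat /proc/sys/net/ipv4/tcp_abc           # 0 = off, 1 or 2 = ABC enabled
echo 0 > /proc/sys/net/ipv4/tcp_abc      # or: sysctl -w net.ipv4.tcp_abc=0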
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-02 Thread Larry McVoy
> I have a more complex configuration & application, but I don't see this
> problem in my testing.  Using e1000 nics and modern hardware

I'm using a similar setup, what kernel are you using?

> I am purposefully setting the socket send/rx buffers, as well as
> twiddling with the tcp and netdev related tunables.

Ben sent those to me, see below, they didn't make any difference.
I tried diddling the socket send/recv buffers to 10MB, that didn't
help.  The defaults didn't help.  1MB didn't help and 64K didn't
help.
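
(The sort of knobs I mean, with the kind of values I tried -- illustrative
only, not Ben's exact list:)

sysctl -w net.core.rmem_max=10485760
sysctl -w net.core.wmem_max=10485760
sysctl -w net.ipv4.tcp_rmem="4096 87380 10485760"
sysctl -w net.ipv4.tcp_wmem="4096 65536 10485760"
sysctl -w net.core.netdev_max_backlog=2500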
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-02 Thread Larry McVoy
On Tue, Oct 02, 2007 at 10:14:11AM -0700, Rick Jones wrote:
> Larry McVoy wrote:
> > A short summary is "can someone please post a test program that sources
> > and sinks data at the wire speed?" because apparently I'm too old and
> > clueless to write such a thing.
>
> WRT the different speeds in each direction talking with HP-UX, perhaps
> there is an interaction between the Linux TCP stack (TSO perhaps) and
> HP-UX's ACK avoidance heuristics.  If that is the case, tweaking
> tcp_deferred_ack_max with ndd on the HP-UX system might yield different
> results.

I doubt it because I see the same sort of behaviour when I have a group
of Linux clients talking to the server.  The HP box is in the mix
simply because it has a gigabit card and that makes driving the load
simpler.  But if I do several loads from 100Mbit clients I get the same
packet throughput.

> WRT the small program making a setsockopt(SO_*BUF) call going slower than
> the rsh, does rsh make the setsockopt() call, or does it bend itself to the
> will of the linux stack's autotuning?  What happens if your small program
> does not make setsockopt(SO_*BUF) calls?

I haven't tracked down if rsh does that but I've tried doing it with 
values of default, 64K, 1MB, and 10MB with no difference.

> *) depending on the quantity of CPU around, and the type of test one is

These are fast CPUs and they are running at 93% idle while running the test.
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-02 Thread Larry McVoy
> I'm currently on 2.6.20, and have also tried 10gbe nics on 2.6.23 with

My guess is that it is a bug in the debian 2.6.18 kernel.

> Have you tried something like ttcp, iperf, or even regular ftp?

Yeah, I've factored out the code since BitKeeper, my test program,
and John's test program all exhibit the same behaviour.  Also switched
switches.

> Checked your nics to make sure they have no errors and are negotiated
> to full duplex?

Yup and yup.
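
(For anyone who wants to repeat the checks -- assuming the interface is
eth0:)

ethtool eth0 | grep -i duplex        # should say "Duplex: Full"
ethtool -S eth0 | grep -i err        # e1000 keeps per-cause error counters
ifconfig eth0 | grep -i errors       # RX/TX error totals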
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-02 Thread Larry McVoy
> Make sure you don't have slab debugging turned on.  It kills performance.

It's a stock debian kernel, so unless they turn it on it's off.
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-02 Thread Larry McVoy
On Tue, Oct 02, 2007 at 11:01:47AM -0700, Rick Jones wrote:
> has anyone already asked whether link-layer flow-control is enabled?

I doubt it, the same test works fine in one direction and poorly in the other.
Wouldn't the flow control squelch either way?
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-02 Thread Larry McVoy
> Looks like you have TSO enabled.  Does it behave differently if it's
> disabled?

It cranks the interrupts/sec up to 8K instead of 5K.  No difference in
performance other than that.
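
(In case anyone wants to repeat that experiment: TSO can be flipped with
ethtool, assuming the interface is eth0:)

ethtool -k eth0               # shows "tcp segmentation offload: on/off"
ethtool -K eth0 tso off       # disable; "tso on" puts it back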

> I think Rick Jones is on to something with the HP ack avoidance.

I sincerely doubt it.  I'm only using the HP box because it has gigabit
so it's a single connection.  I can produce almost identical results by
doing the same sorts of tests with several linux clients.  One direction
goes fast and the other goes slow.

3x performance difference depending on the direction of data flow:

# Server is receiving, goes fast
$ for i in 22 24 25 26; do rsh -n glibc$i dd if=/dev/zero|dd of=/dev/null & done
load free cach swap pgin pgou dk0 dk1 dk2 dk3 ipkt opkt  int  ctx usr sys idl
0.98    0    0    0    0    0   0   0   0   0  30K  15K 8.1K  68K  12  66  22
0.98    0    0    0    0    0   0   0   0   0  29K  15K 8.2K  67K  11  64  25
0.98    0    0    0    0    0   0   0   0   0  29K  15K 8.2K  67K  12  66  22

# Server is sending, goes slow
$ for i in 22 24 25 26; do dd if=/dev/zero|rsh glibc$i dd of=/dev/null & done
load free cach swap pgin pgou dk0 dk1 dk2 dk3 ipkt opkt  int  ctx usr sys idl
1.06    0    0    0    0    0   0   0   0   0 5.0K  10K 4.4K 8.4K  21  17  62
0.97    0    0    0    0    0   0   0   0   0 5.1K  10K 4.4K 8.9K   2  15  83
0.97    0    0    0    0    0   0   0   0   0 5.0K  10K 4.4K 8.6K  21  26  53

$ for i in 22 24 25 26; do rsh glibc$i cat /etc/motd; done | grep Welcome
Welcome to redhat71.bitmover.com, a 2Ghz Athlon running Red Hat 7.1.
Welcome to glibc24.bitmover.com, a 1.2Ghz Athlon running SUSE 10.1.
Welcome to glibc25.bitmover.com, a 2Ghz Athlon running Fedora Core 6
Welcome to glibc26.bitmover.com, a 2Ghz Athlon running Fedora Core 7

$ for i in 22 24 25 26; do rsh glibc$i uname -r; done
2.4.2-2
2.6.16.13-4-default
2.6.18-1.2798.fc6
2.6.22.4-65.fc7

No HP in the mix.  It's got nothing to do with hp, nor with rsh; it
has everything to do with the direction the data is flowing.
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-02 Thread Larry McVoy
More data, we've conclusively eliminated the card / cpu from the mix.
We've got 2 ia64 boxes with e1000 interfaces.  One box is running
linux 2.6.12 and the other is running hpux 11.

I made sure the linux one was running at gigabit and reran the tests
from the linux/ia64 <=> hp/ia64.  Same results: when linux sends
it is slow, when it receives it is fast.

And note carefully: we've removed hpux from the equation.  We can do
the same tests from linux to multiple linux clients and see the same
thing: sending from the server is slow, receiving on the server is
fast.
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-02 Thread Larry McVoy
> I think I'm still missing some basic data here (probably because this
> thread did not originate on netdev).  Let me try to nail down some of
> the basics.  You have a linux ia64 box (running 2.6.12 or 2.6.18?) that
> sends slowly, and receives faster, but not quite at 1 Gbps?  And this is
> true regardless of which peer it sends or receives from?  And the
> behavior is different depending on which kernel?  How, and which kernel
> versions?  Do you have other hardware running the same kernel that
> behaves the same or differently?

Just got off the phone with Linus and he thinks the side that does
the accept is the problem side, i.e., if you are the server, you do the
accept, and you send the data, you'll go slow.  But as I'm writing this
I realize he's wrong, because it is the combination of accept & send;
accept & recv goes fast.

A trivial way to see the problem is to take two linux boxes, on each
apt-get install rsh-client rsh-server
set up your .rhosts,
and then do

dd if=/dev/zero count=10 | rsh OTHER_BOX dd of=/dev/null
rsh OTHER_BOX dd if=/dev/zero count=10 | dd of=/dev/null

See if you get balanced results.  For me, I get 45MB/sec one way, and
15-19MB/sec the other way.

I've tried the same test linux <-> linux and linux <-> hpux.  Same results.
The test setup I have is

work:    2ghz x 2 Athlons, e1000, 2.6.18
ia64:    900mhz Itanium, e1000, 2.6.12
hp-ia64: 900mhz Itanium, e1000, hpux 11
glibc*:  1-2ghz athlons running various linux releases

all connected through a netgear 724T 10/100/1000 switch (a linksys showed
identical results).

I tested 

work <-> hp-ia64
work <-> ia64
ia64 <-> hp-ia64

and in all cases, one direction worked fast and the other didn't.

It would be good if people tried the same simple test.  You have to
use rsh; ssh will slow things down way too much.

Alternatively, take your favorite test programs, such as John's,
and make a second pair that reverses the direction the data is
sent.  So one pair is server sends, the other is server receives;
try both.  That's where we started: BitKeeper, my stripped down test,
and John's test all exhibit the same behavior.  And the rsh test
is just a really simple way to demonstrate it.

Wayne, Linus asked for tcpdumps from just one side: the first 100
packets, then wait 10 seconds or so for the window to open up, and then
a snapshot of another 100 packets.  Do that for both directions
and send them to the list.  Can you do that?  I want to get lunch, I'm
starving.
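
(Something along these lines ought to do it -- interface and host names
are per-site, obviously:)

tcpdump -i eth0 -c 100 -w start.pcap host OTHER_BOX
sleep 10
tcpdump -i eth0 -c 100 -w later.pcap host OTHER_BOX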
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-02 Thread Larry McVoy
> We fixed a lot of bugs in TSO last year.
>
> It would be really great to see numbers with a more recent kernel
> than 2.6.18

More data: sky2 works fine (really really fine, like 79MB/sec) between
Linux dylan.bitmover.com 2.6.18.1 #5 SMP Mon Oct 23 17:36:00 PDT 2006 i686
and
Linux steele 2.6.20-16-generic #2 SMP Sun Sep 23 18:31:23 UTC 2007 x86_64

So this is looking like an e1000 bug.  I'll try to upgrade the kernel on
the ia64 box and see what happens.
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-02 Thread Larry McVoy
On Tue, Oct 02, 2007 at 02:16:56PM -0700, David Miller wrote:
> We absolutely depend upon people like you to report when there are
> anomalies like this.  It's the only thing that scales.

Well cool, finally doing something useful :)

Is the issue that there's no test setup?  Because this does seem like
something we'd want to have work well.

> FWIW I have a t1000 Niagara box and an Ultra45 going through a netgear
> gigabit switch.  I'm getting 85MB/sec in one direction and 10MB/sec in
> the other (using bw_tcp from lmbench3).

Note that bw_tcp mucks with SND/RCVBUF.  It probably shouldn't, it's been
12 years since that code went in there and I dunno if it is still needed.

> Both are using identical
> broadcom tigon3 gigabit chips and identical current kernels so that is
> a truly strange result.
>
> I'll investigate, it may be the same thing you're seeing.

Wow, sounds very similar.  In my case I was seeing pretty close to 3x
consistently.  You're more like 8x, but I was all e1000 not broadcom.

And note that sky2 doesn't have this problem.  Does the broadcom do TSO?
And sky2 not?  I noticed a much higher CPU load for sky2.
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-02 Thread Larry McVoy
On Tue, Oct 02, 2007 at 03:32:16PM -0700, David Miller wrote:
> I'm starting to have a theory about what the bad case might
> be.
>
> A strong sender going to an even stronger receiver which can
> pull out packets into the process as fast as they arrive.
> This might be part of what keeps the receive window from
> growing.

I can back you up on that.  When I straced the receiving side of the
case that goes slowly, all the reads were short, like 1-2K.  In the
direction that works, the reads were a lot larger, as I recall.
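
(The strace I mean is nothing fancy -- just attach to whatever process is
doing the receiving and watch the read return values:)

strace -p `pidof bytes_tcp` -e trace=read 2>&1 | head -20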
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-01 Thread Larry McVoy
On Sat, Sep 29, 2007 at 11:02:32AM -0700, Linus Torvalds wrote:
> On Sat, 29 Sep 2007, Larry McVoy wrote:
> > I haven't kept up on switch technology but in the past they were much
> > better than you are thinking.  The Kalpana switch that I had modified
> > to support vlans (invented by yours truly), did not store and forward,
> > it was cut through and could handle any load that was theoretically
> > possible within about 1%.
>
> Hey, you may well be right.  Maybe my assumptions about cutting corners are
> just cynical and pessimistic.

So I got a netgear switch and it works fine.  But my tests are busted.  
Catching netdev up, I'm trying to optimize traffic to a server that has
a gbit interface; I moved to a 24 port netgear that is all 10/100/1000
and I have a pile of clients to act as load generators.

I can do this on each of the clients 

dd if=/dev/zero bs=1024000 | rsh work dd of=/dev/null

and that cranks up to about 47K packets/second which is about 70MB/sec.

One of my clients also has gigabit so I played around with just that
one and it (itanium running hpux w/ broadcom gigabit) can push the load
as well.  One weird thing is that it is dependent on the direction the
data is flowing.  If the hp is sending then I get 46MB/sec, if linux is
sending then I get 18MB/sec.  Weird.  Linux is debian, running 

Linux work 2.6.18-5-k7 #1 SMP Thu Aug 30 02:52:31 UTC 2007 i686 

and dual e1000 cards:

e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection

I wrote a tiny little program to try and emulate this and I can't get
it to do as well.  I've tracked it down, I think, to the read side.
The server sources, the client sinks, the server looks like:

11689 accept(3, {sa_family=AF_INET, sin_port=htons(49376),
sin_addr=inet_addr("10.3.1.38")}, [16]) = 4
11689 setsockopt(4, SOL_SOCKET, SO_RCVBUF, [1048576], 4) = 0
11689 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1048576], 4) = 0
11689 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7ddf708)
= 11694
11689 close(4)  = 0
11689 accept(3, <unfinished ...>
11694 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
11694 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
11694 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
11694 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
...

but the client looks like

connect(3, {sa_family=AF_INET, sin_port=htons(31235),
sin_addr=inet_addr("10.3.9.1")}, 16) = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1448
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1448
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896

which I suspect may be the problem.

I played around with SO_RCVBUF/SO_SNDBUF and that didn't help.  So any ideas why
a simple dd piped through rsh is kicking my ass?  It must be something simple
but my test program is tiny and does nothing weird that I can see.
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-01 Thread Larry McVoy
On Mon, Oct 01, 2007 at 07:14:37PM -0700, Linus Torvalds wrote:
 
 
> On Mon, 1 Oct 2007, Larry McVoy wrote:
> >
> > but the client looks like
> >
> > connect(3, {sa_family=AF_INET, sin_port=htons(31235),
> > sin_addr=inet_addr("10.3.9.1")}, 16) = 0
> > read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
> > read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1448
> > read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
> ..
>
> This is exactly what I'd expect if the machine is *not* under excessive
> load.

That's fine, but why is it that my trivial program can't do as well as 
dd | rsh dd?

A short summary is "can someone please post a test program that sources
and sinks data at the wire speed?" because apparently I'm too old and
clueless to write such a thing.
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-01 Thread Larry McVoy
On Mon, Oct 01, 2007 at 08:50:50PM -0700, David Miller wrote:
> From: [EMAIL PROTECTED] (Larry McVoy)
> Date: Mon, 1 Oct 2007 19:20:59 -0700
>
> > A short summary is "can someone please post a test program that sources
> > and sinks data at the wire speed?" because apparently I'm too old and
> > clueless to write such a thing.
>
> You're not showing us your test program so there is no way we
> can help you out.

Attached.  Drop it into an lmbench tree and build it.

> My initial inclination, even without that critical information,
> is to ask whether you are setting any socket options in any way?

The only one I was playing with was SO_RCVBUF/SO_SNDBUF and I tried
disabling that and I tried playing with the read/write size.  Didn't
help.

> In particular, SO_RCVLOWAT can have a large effect here, if you're
> setting it to something, that would explain why dd is doing better.  A
> lot of people link to helper libraries with interfaces to setup
> sockets with all sorts of socket option settings by default, try not
> using such things if possible.

Agreed.  That was my first thought as well, I must have been doing 
something that messed up the defaults.  But you did get the strace
output, there wasn't anything weird there.

> You also shouldn't dork at all with the receive and send buffer sizes.
> They are adjusted dynamically by the kernel as the window grows.  But
> if you set them to specific values, this dynamic logic is turned off.

Yeah, dorking with those is left over from the bad old days of '95
when lmbench was first shipped.  But I turned that all off and no
difference.
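
(For what it's worth, I believe the receive-side autotuning David is
describing is gated by this sysctl, which defaults to on in 2.6:)

sysctl net.ipv4.tcp_moderate_rcvbuf    # 1 = kernel auto-grows the receive buffer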

So feel free to show me where I'm an idiot in the code, but if you
can't, then what would rock would be a little send.c / recv.c that
demonstrated filling the pipe.
-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com
/*
 * bytes_tcp.c - simple TCP bandwidth source/sink
 *
 *	server usage:	bytes_tcp -s
 *	client usage:	bytes_tcp hostname [msgsize]
 *
 * Copyright (c) 1994 Larry McVoy.  
 * Copyright (c) 2002 Carl Staelin.  Distributed under the FSF GPL with
 * additional restriction that results may be published only if
 * (1) the benchmark is unmodified, and
 * (2) the version in the sccsid below is included in the report.
 * Support for this development by Sun Microsystems is gratefully acknowledged.
 */
char	*id = "$Id$\n";
#include "bench.h"
#define	XFER	(1024*1024)

int	server_main(int ac, char **av);
int	client_main(int ac, char **av);
void	source(int data);

void
transfer(int get, int server, char *buf)
{
	int	c = 0;		/* stays 0 if we never read */

	while ((get > 0) && ((c = read(server, buf, XFER)) > 0)) {
		get -= c;
	}
	if (c < 0) {
		perror("bytes_tcp: transfer: read failed");
		exit(4);
	}
}

/* ARGSUSED */
int
client_main(int ac, char **av)
{
	int	server;
	int	get = 256 << 20;
	char	buf[XFER];
	char*	usage = "usage: %s -remotehost OR %s remotehost [msgsize]\n";

	if (ac != 2 && ac != 3) {
		(void)fprintf(stderr, usage, av[0], av[0]);
		exit(0);
	}
	if (ac == 3) get = bytes(av[2]);
	server = tcp_connect(av[1], TCP_DATA+1, SOCKOPT_READ|SOCKOPT_REUSE);
	if (server < 0) {
		perror("bytes_tcp: could not open socket to server");
		exit(2);
	}
	transfer(get, server, buf);
	close(server);
	exit(0);
	/*NOTREACHED*/
}

void
child()
{
	wait(0);
	signal(SIGCHLD, child);
}

/* ARGSUSED */
int
server_main(int ac, char **av)
{
	int	data, newdata;

	signal(SIGCHLD, child);
	data = tcp_server(TCP_DATA+1, SOCKOPT_READ|SOCKOPT_WRITE|SOCKOPT_REUSE);
	for ( ;; ) {
		newdata = tcp_accept(data, SOCKOPT_WRITE|SOCKOPT_READ);
		switch (fork()) {
		case -1:
			perror(fork);
			break;
		case 0:
			source(newdata);
			exit(0);
		default:
			close(newdata);
			break;
		}
	}
}

void
source(int data)
{
	char	buf[XFER];

	while (write(data, buf, sizeof(buf)) > 0);
}


int
main(int ac, char **av)
{
	char*	usage = "Usage: %s -s OR %s -serverhost OR %s serverhost [msgsize]\n";
	if (ac < 2 || 3 < ac) {
		fprintf(stderr, usage, av[0], av[0], av[0]);
		exit(1);
	}
	if (ac == 2 && !strcmp(av[1], "-s")) {
		if (fork() == 0) server_main(ac, av);
		exit(0);
	} else {
		client_main(ac, av);
	}
	return(0);
}
/*
 * tcp_lib.c - routines for managing TCP connections.
 *
 * Positive port/program numbers are RPC ports, negative ones are TCP ports.
 *
 * Copyright (c) 1994-1996 Larry McVoy.
 */
#define		_LIB /* bench.h needs this */
#include	"bench.h"

/*
 * Get a TCP socket, bind it, figure out the port,
 * and advertise the port as program prog.
 *
 * XXX - it would be nice if you could advertise ascii strings.
 */
int
tcp_server(int prog, int rdwr)
{
	int	sock;
	struct	sockaddr_in s;

#ifdef	LIBTCP_VERBOSE
	fprintf(stderr, "tcp_server(%u, %u)\n", prog, rdwr);
#endif
	if ((sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) {
		perror("socket");
		exit(1);
	}
	sock_optimize(sock, rdwr);
	bzero((void*)&s, sizeof(s));
	s.sin_family = AF_INET;
	if (prog < 0) {
		s.sin_port = htons(-prog);
	}
	if (bind(sock, (struct sockaddr*)&s, sizeof(s)) < 0) {
		perror(bind);
		exit(2);
	}
	if (listen(sock, 100) < 0) {
		perror("listen");

on a different note

2007-10-01 Thread Larry McVoy
I do have a pretty nice cluster of linux boxes if you need lmbench results
or something like that.  Linux and the rest of the unix stuff for whatever
that is worth...

work ~/LMbench2/bin> ls -1
alpha-glibc22-linux # need to upgrade to debian 4
hppa-glibc23-linux
hppa-hpux11
ia64-glibc23-linux
ia64-hpux11
mips-glibc23-linux
mips-irix
powerpc-aix
powerpc-glibc23-linux
powerpc-macosx
sparc-glibc23-linux
sparc-solaris
x86-darwin8.10.1
x86-freebsd2
x86-freebsd3
x86-freebsd4
x86-freebsd5
x86-freebsd6
x86-glibc20-linux   # redhat 5.2
x86-glibc21-linux   # redhat 6.1
x86-glibc22-linux   # redhat 7.2
x86-glibc23-linux   # redhat 9
x86-glibc24-linux   # debian from here on down
x86-glibc25-linux
x86-glibc26-linux
x86-netbsd
x86-openbsd
x86-sco3.2v5.0.7    # ha!
x86-solaris
x86_64-glibc23-linux

-- 
Larry McVoy         lm at bitmover.com         http://www.bitkeeper.com