Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?

2011-12-08 Thread Luigi Rizzo
On Fri, Dec 09, 2011 at 01:33:04AM +0100, Andre Oppermann wrote:
> On 08.12.2011 16:34, Luigi Rizzo wrote:
> >On Fri, Dec 09, 2011 at 12:11:50AM +1100, Lawrence Stewart wrote:
...
> >>Jeff tested the WIP patch and it *doesn't* fix the issue. I don't have
> >>LRO capable hardware setup locally to figure out what I've missed. Most
> >>of the machines in my lab are running em(4) NICs which don't support
> >>LRO, but I'll see if I can find something which does and perhaps
> >>resurrect this patch.
> 
> LRO can always be done in software.  You can do it at driver, ether_input
> or ip_input level.

storing LRO state at the driver (as it is done now) is very convenient,
because it is trivial to flush the pending segments at the end of
an rx interrupt. If you want to do LRO in ether_input() or ip_input(),
you need to add another call to flush the LRO state stored there.
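
For reference, the usual driver-side pattern looks roughly like the sketch
below. This is a from-memory approximation of how drivers used the tcp_lro.c
API of that era (two-argument tcp_lro_flush(), SLIST-based lro_active list),
not the actual ixgbe code: the rx-mbuf helper is invented, and locking, error
handling and the checksum argument are left out.

#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/queue.h>
#include <net/if.h>
#include <net/if_var.h>
#include <netinet/tcp_lro.h>

/* invented stand-in for the driver's rx-ring draining loop */
extern struct mbuf *hypothetical_next_rx_mbuf(void);

static void
rxeof_sketch(struct ifnet *ifp, struct lro_ctrl *lro)
{
        struct lro_entry *queued;
        struct mbuf *m;

        while ((m = hypothetical_next_rx_mbuf()) != NULL) {
                /* try to merge; if LRO refuses the packet, pass it up as is */
                if (tcp_lro_rx(lro, m, 0) != 0)
                        (*ifp->if_input)(ifp, m);
        }

        /* end of the rx interrupt: flush every pending super-segment */
        /* (list and field names per the tcp_lro.h of that era; approximate) */
        while ((queued = SLIST_FIRST(&lro->lro_active)) != NULL) {
                SLIST_REMOVE_HEAD(&lro->lro_active, next);
                tcp_lro_flush(lro, queued);
        }
}

Doing the same thing from ether_input() or ip_input() would need an
equivalent end-of-burst hook, which is exactly the extra flush call
mentioned above.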

> >a few comments:
> >1. i don't think it makes sense to send multiple acks on
> >coalesced segments (and the 82599 does not seem to do that).
> >First of all, the acks would get out with minimal spacing (ideally
> >less than 100ns) so chances are that the remote end will see
> >them in a single burst anyways. Secondly, the remote end can
> >easily tell that a single ACK is reporting multiple MSS and
> >behave as if an equivalent number of acks had arrived.
> 
> ABC (appropriate byte counting) gets in the way though.

right, during slow start the current ABC specification (RFC3465)
sets a pretty low limit on how much the window can be expanded
on each ACK. On the other hand...
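
To put a rough number on that, here is a toy userland comparison (not the
kernel code; the constants and function are made up, and it applies the
simplified RFC3465 slow-start rule of at most L*SMSS of growth per ACK,
using L=2) of one ACK covering five coalesced segments versus the same five
segments acked individually:

#include <stdio.h>

#define SMSS    1448u           /* MSS seen in these tests */
#define ABC_L   2u              /* RFC3465 per-ACK cap: at most L*SMSS, L=2 */

static unsigned
abc_slow_start(unsigned cwnd, unsigned bytes_acked)
{
        unsigned incr = bytes_acked;

        if (incr > ABC_L * SMSS)        /* apply the ABC cap */
                incr = ABC_L * SMSS;
        return (cwnd + incr);
}

int
main(void)
{
        unsigned cwnd_coalesced, cwnd_individual;
        int i;

        /* one ACK that covers 5 LRO-merged segments */
        cwnd_coalesced = abc_slow_start(10 * SMSS, 5 * SMSS);

        /* the same 5 segments acked one at a time */
        cwnd_individual = 10 * SMSS;
        for (i = 0; i < 5; i++)
                cwnd_individual = abc_slow_start(cwnd_individual, SMSS);

        printf("cwnd after one coalesced ACK:  %u bytes\n", cwnd_coalesced);
        printf("cwnd after five separate ACKs: %u bytes\n", cwnd_individual);
        return (0);
}

With SMSS=1448 this gives 2896 bytes of growth for the coalesced ACK against
7240 bytes for the five separate ACKs, which is the gap ABC creates.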

> >2. i am a big fan of LRO (and similar solutions), because it can save
> >a lot of repeated work when passing packets up the stack, and the
> >mechanism becomes more and more effective as the system load increases,
> >which is a wonderful property in terms of system stability.
> >
> >For this reason, i think it would be useful to add support for software
> >LRO in the generic code (sys/net/if.c) so that drivers can directly use
> >the software implementation even without hardware support.
> 
> It hurts on higher RTT links in the general case.  For LAN RTT's
> it's good.

... on the other hand remember that LRO coalescing is limited to
the number of segments that arrive during a mitigation interval,
so even on a 10G interface it's only a handful of packets.
I'd better run some simulations to see how long it takes to
get full rate on a 10..50ms path when using LRO.
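
(Back-of-the-envelope, before any real simulation: assuming 1500-byte frames,
the 16us default mitigation interval, ideal per-RTT doubling and ignoring
losses and the ABC cap, the numbers come out roughly as computed below.)

#include <math.h>
#include <stdio.h>

/* rough numbers behind the two statements above; compile with -lm */
int
main(void)
{
        const double rate_bps = 10e9;
        const double frame_bytes = 1514.0;      /* 1500 MTU + Ethernet header */
        const double mitigation_s = 16e-6;      /* default ixgbe latency */
        const double rtt_s[] = { 0.010, 0.050 };
        double pps, bdp_segs, n_rtt;
        int i;

        pps = rate_bps / (frame_bytes * 8.0);
        printf("segments arriving per 16us interval: %.1f\n",
            pps * mitigation_s);

        for (i = 0; i < 2; i++) {
                bdp_segs = rate_bps * rtt_s[i] / (1448.0 * 8.0);
                n_rtt = ceil(log2(bdp_segs));   /* ideal slow-start doubling */
                printf("RTT %2.0f ms: BDP ~%.0f segments, ~%.0f RTTs "
                    "(~%.2f s) to reach full rate\n",
                    rtt_s[i] * 1e3, bdp_segs, n_rtt, n_rtt * rtt_s[i]);
        }
        return (0);
}

That is roughly 13 segments per mitigation interval, and on the order of
0.15 to 0.8 seconds of idealized slow start for 10 to 50 ms RTTs.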

cheers
luigi


Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?

2011-12-08 Thread Andre Oppermann

On 08.12.2011 16:34, Luigi Rizzo wrote:

On Fri, Dec 09, 2011 at 12:11:50AM +1100, Lawrence Stewart wrote:

On 12/08/11 05:08, Luigi Rizzo wrote:

...

I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which
seems slightly faster than HEAD) using MTU=1500 and various
combinations of card capabilities (hwcsum,tso,lro), different window
sizes and interrupt mitigation configurations.

default latency is 16us, l=0 means no interrupt mitigation.
lro is the software implementation of lro (tcp_lro.c)
hwlro is the hardware one (on 82599). Using a window of 100 Kbytes
seems to give the best results.

Summary:


[snip]


- enabling software lro on the transmit side actually slows
   down the throughput (4-5Gbit/s instead of 8.0).
   I am not sure why (perhaps acks are delayed too much) ?
   Adding a couple of lines in tcp_lro to reject
   pure acks seems to have much better effect.

The tcp_lro patch below might actually be useful also for
other cards.

--- tcp_lro.c   (revision 228284)
+++ tcp_lro.c   (working copy)
@@ -245,6 +250,8 @@

 ip_len = ntohs(ip->ip_len);
 tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
+   if (tcp_data_len == 0)
+   return -1;  /* not on ack */


 /*


There is a bug with our LRO implementation (first noticed by Jeff
Roberson) that I started fixing some time back but dropped the ball on.
The crux of the problem is that we currently only send an ACK for the
entire LRO chunk instead of all the segments contained therein. Given
that most stacks rely on the ACK clock to keep things ticking over, the
current behaviour kills performance. It may well be the cause of the
performance loss you have observed.


I should clarify better.
First of all, i tested two different LRO implementations: our
"Software LRO" (tcp_lro.c), and the "Hardware LRO" which is implemented
by the 82599 (called RSC or receive-side-coalescing in the 82599
data sheets). Jack Vogel and Navdeep Parhar (both in Cc) can
probably comment on the logic of both.

In my tests, either SW or HW LRO on the receive side HELPED A LOT,
not just in terms of raw throughput but also in terms of system
load on the receiver. On the receive side, LRO packs multiple data
segments into one that is passed up the stack.

As you mentioned, this also reduces the number of acks generated,
but not dramatically (remember that LRO is bounded by the number
of segments received in the mitigation interval).
In my tests the number of read() calls on the receiver was reduced
by approximately a factor of 3 compared to the !LRO case, meaning
4-5 segments merged by LRO. Navdeep reported similar figures for
cxgbe.

Using Hardware LRO on the transmit side had no ill effect.
Since it is done in hardware i have no idea how it is implemented.

Using Software LRO on the transmit side did give a significant
throughput reduction. I can't explain the exact cause, though it
is possible that, between reducing the number of segments delivered
to the receiver and collapsing the ACKs it generates, the sender
starves. But it could well be that it is the extra delay in passing
up the ACKs that limits performance.
Either way, since the HW LRO did a fine job, i was trying to figure
out whether avoiding LRO on pure acks could help, and the two-line
patch above did help.

Note, my patch was just a proof of concept, and may cause
reordering if a data segment is followed by a pure ack.
But this can be fixed easily by handling a pure ack as
an out-of-sequence packet in tcp_lro_rx().


 WIP patch is at:
http://people.freebsd.org/~lstewart/patches/misctcp/tcplro_multiack_9.x.r219723.patch

Jeff tested the WIP patch and it *doesn't* fix the issue. I don't have
LRO capable hardware setup locally to figure out what I've missed. Most
of the machines in my lab are running em(4) NICs which don't support
LRO, but I'll see if I can find something which does and perhaps
resurrect this patch.


LRO can always be done in software.  You can do it at driver, ether_input
or ip_input level.


a few comments:
1. i don't think it makes sense to send multiple acks on
coalesced segments (and the 82599 does not seem to do that).
First of all, the acks would get out with minimal spacing (ideally
less than 100ns) so chances are that the remote end will see
them in a single burst anyways. Secondly, the remote end can
easily tell that a single ACK is reporting multiple MSS and
behave as if an equivalent number of acks had arrived.


ABC (appropriate byte counting) gets in the way though.


2. i am a big fan of LRO (and similar solutions), because it can save
a lot of repeated work when passing packets up the stack, and the
mechanism becomes more and more effective as the system load increases,
which is a wonderful property in terms of system stability.

For this reason, i think it would be useful to add support for software
LRO in the generic code (sys/net/if.c) so that drivers can directly use
the software implementation even without hardware support.

Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?

2011-12-08 Thread Andre Oppermann

On 08.12.2011 14:11, Lawrence Stewart wrote:

On 12/08/11 05:08, Luigi Rizzo wrote:

On Wed, Dec 07, 2011 at 11:59:43AM +0100, Andre Oppermann wrote:

On 06.12.2011 22:06, Luigi Rizzo wrote:

...

Even in my experiments there is a lot of instability in the results.
I don't know exactly where the problem is, but the high number of
read syscalls, and the huge impact of setting interrupt_rate=0
(defaults at 16us on the ixgbe) makes me think that there is something
that needs investigation in the protocol stack.

Of course we don't want to optimize specifically for the one-flow-at-10G
case, but devising something that makes the system less affected
by short timing variations, and can pass upstream interrupt mitigation
delays would help.


I'm not sure the variance is only coming from the network card and
driver side of things. The TCP processing and interactions with
scheduler and locking probably play a big role as well. There have
been many changes to TCP recently and maybe an inefficiency that
affects high-speed single sessions throughput has crept in. That's
difficult to debug though.


I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which
seems slightly faster than HEAD) using MTU=1500 and various
combinations of card capabilities (hwcsum,tso,lro), different window
sizes and interrupt mitigation configurations.

default latency is 16us, l=0 means no interrupt mitigation.
lro is the software implementation of lro (tcp_lro.c)
hwlro is the hardware one (on 82599). Using a window of 100 Kbytes
seems to give the best results.

Summary:


[snip]


- enabling software lro on the transmit side actually slows
down the throughput (4-5Gbit/s instead of 8.0).
I am not sure why (perhaps acks are delayed too much) ?
Adding a couple of lines in tcp_lro to reject
pure acks seems to have much better effect.

The tcp_lro patch below might actually be useful also for
other cards.

--- tcp_lro.c (revision 228284)
+++ tcp_lro.c (working copy)
@@ -245,6 +250,8 @@

ip_len = ntohs(ip->ip_len);
tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
+ if (tcp_data_len == 0)
+ return -1; /* not on ack */


/*


There is a bug with our LRO implementation (first noticed by Jeff Roberson) 
that I started fixing
some time back but dropped the ball on. The crux of the problem is that we 
currently only send an
ACK for the entire LRO chunk instead of all the segments contained therein. 
Given that most stacks
rely on the ACK clock to keep things ticking over, the current behaviour kills 
performance. It may
well be the cause of the performance loss you have observed. WIP patch is at:

http://people.freebsd.org/~lstewart/patches/misctcp/tcplro_multiack_9.x.r219723.patch

Jeff tested the WIP patch and it *doesn't* fix the issue. I don't have LRO 
capable hardware setup
locally to figure out what I've missed. Most of the machines in my lab are 
running em(4) NICs which
don't support LRO, but I'll see if I can find something which does and perhaps 
resurrect this patch.

If anyone has any ideas what I'm missing in the patch to make it work, please 
let me know.


On low RTT's the accumulated ACKing probably doesn't make any difference.
The congestion window will grow very fast anyway.  On longer RTT's it sure
will make a difference.  Unless you have a 10Gig path with > 50ms or so it's
difficult to empirically test though.

--
Andre


Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?

2011-12-08 Thread Luigi Rizzo
On Fri, Dec 09, 2011 at 12:11:50AM +1100, Lawrence Stewart wrote:
> On 12/08/11 05:08, Luigi Rizzo wrote:
...
> >I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which
> >seems slightly faster than HEAD) using MTU=1500 and various
> >combinations of card capabilities (hwcsum,tso,lro), different window
> >sizes and interrupt mitigation configurations.
> >
> >default latency is 16us, l=0 means no interrupt mitigation.
> >lro is the software implementation of lro (tcp_lro.c)
> >hwlro is the hardware one (on 82599). Using a window of 100 Kbytes
> >seems to give the best results.
> >
> >Summary:
> 
> [snip]
> 
> >- enabling software lro on the transmit side actually slows
> >   down the throughput (4-5Gbit/s instead of 8.0).
> >   I am not sure why (perhaps acks are delayed too much) ?
> >   Adding a couple of lines in tcp_lro to reject
> >   pure acks seems to have much better effect.
> >
> >The tcp_lro patch below might actually be useful also for
> >other cards.
> >
> >--- tcp_lro.c   (revision 228284)
> >+++ tcp_lro.c   (working copy)
> >@@ -245,6 +250,8 @@
> >
> > ip_len = ntohs(ip->ip_len);
> > tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
> >+   if (tcp_data_len == 0)
> >+   return -1;  /* not on ack */
> >
> >
> > /*
> 
> There is a bug with our LRO implementation (first noticed by Jeff 
> Roberson) that I started fixing some time back but dropped the ball on. 
> The crux of the problem is that we currently only send an ACK for the 
> entire LRO chunk instead of all the segments contained therein. Given 
> that most stacks rely on the ACK clock to keep things ticking over, the 
> current behaviour kills performance. It may well be the cause of the 
> performance loss you have observed.

I should clarify better.
First of all, i tested two different LRO implementations: our
"Software LRO" (tcp_lro.c), and the "Hardware LRO" which is implemented
by the 82599 (called RSC or receive-side-coalescing in the 82599
data sheets). Jack Vogel and Navdeep Parhar (both in Cc) can
probably comment on the logic of both.

In my tests, either SW or HW LRO on the receive side HELPED A LOT,
not just in terms of raw throughput but also in terms of system
load on the receiver. On the receive side, LRO packs multiple data
segments into one that is passed up the stack.

As you mentioned, this also reduces the number of acks generated,
but not dramatically (remember that LRO is bounded by the number
of segments received in the mitigation interval).
In my tests the number of read() calls on the receiver was reduced
by approximately a factor of 3 compared to the !LRO case, meaning
4-5 segments merged by LRO. Navdeep reported similar figures for
cxgbe.

Using Hardware LRO on the transmit side had no ill effect.
Since it is done in hardware i have no idea how it is implemented.

Using Software LRO on the transmit side did give a significant
throughput reduction. I can't explain the exact cause, though it
is possible that, between reducing the number of segments delivered
to the receiver and collapsing the ACKs it generates, the sender
starves. But it could well be that it is the extra delay in passing
up the ACKs that limits performance.
Either way, since the HW LRO did a fine job, i was trying to figure
out whether avoiding LRO on pure acks could help, and the two-line
patch above did help.

Note, my patch was just a proof of concept, and may cause
reordering if a data segment is followed by a pure ack.
But this can be fixed easily by handling a pure ack as
an out-of-sequence packet in tcp_lro_rx().
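
A possible shape for that fix, purely as an illustration
(tcp_lro_find_active() is an invented helper, "lc" stands for the lro_ctrl
argument, and the variable and field names may not match the real tcp_lro.c),
would be to flush any entry already accumulated for the same connection
before letting the ack through:

        /* inside tcp_lro_rx(), replacing the two-line pure-ack rejection */
        if (tcp_data_len == 0) {
                struct lro_entry *le;

                /* invented helper: find an active entry for this 4-tuple */
                le = tcp_lro_find_active(lc, ip, tcp);
                if (le != NULL) {
                        SLIST_REMOVE(&lc->lro_active, le, lro_entry, next);
                        tcp_lro_flush(lc, le);  /* merged data goes up first */
                }
                return (-1);            /* then the pure ack, still in order */
        }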

> WIP patch is at:
> http://people.freebsd.org/~lstewart/patches/misctcp/tcplro_multiack_9.x.r219723.patch
> 
> Jeff tested the WIP patch and it *doesn't* fix the issue. I don't have 
> LRO capable hardware setup locally to figure out what I've missed. Most 
> of the machines in my lab are running em(4) NICs which don't support 
> LRO, but I'll see if I can find something which does and perhaps 
> resurrect this patch.

a few comments:
1. i don't think it makes sense to send multiple acks on
   coalesced segments (and the 82599 does not seem to do that).
   First of all, the acks would get out with minimal spacing (ideally
   less than 100ns) so chances are that the remote end will see
   them in a single burst anyways. Secondly, the remote end can
   easily tell that a single ACK is reporting multiple MSS and
   behave as if an equivalent number of acks had arrived.

2. i am a big fan of LRO (and similar solutions), because it can save
   a lot of repeated work when passing packets up the stack, and the
   mechanism becomes more and more effective as the system load increases,
   which is a wonderful property in terms of system stability.

   For this reason, i think it would be useful to add support for software
   LRO in the generic code (sys/net/if.c) so that drivers can directly use
   the software implementation even without hardware support.

Re: datapoints on 10G throughput with TCP ?

2011-12-08 Thread Slawa Olhovchenkov
On Mon, Dec 05, 2011 at 08:27:03PM +0100, Luigi Rizzo wrote:

> Hi,
> I am trying to establish the baseline performance for 10G throughput
> over TCP, and would like to collect some data points.  As a testing
> program i am using nuttcp from ports (as good as anything, i
> guess -- it is reasonably flexible, and if you use it in
> TCP with relatively large writes, the overhead for syscalls
> and gettimeofday() shouldn't kill you).
> 
> I'd be very grateful if you could do the following test:
> 
> - have two machines connected by a 10G link
> - on one run "nuttcp -S"
> - on the other one run "nuttcp -t -T 5 -w 128 -v the.other.ip"
> 
> and send me a dump of the output, such as the one(s) at the end of
> the message.
> 
> I am mostly interested in two configurations:
> - one over loopback, which should tell how fast the CPU+memory is.
>   As an example, one of my machines does about 15 Gbit/s, and
>   one of the faster ones does about 44 Gbit/s
> 
> - one over the wire using 1500 byte mss. Here it really matters
>   how good the handling of small MTUs is.
> 
> As a data point, on my machines i get 2..3.5 Gbit/s on the
> "slow" machine with a 1500 byte mtu and default card setting.
> Clearing the interrupt mitigation register (so no mitigation)
> brings the rate to 5-6 Gbit/s. Same hardware with linux does
> about 8 Gbit/s. HEAD seems 10-20% slower than RELENG_8 though i
> am not sure who is at fault.
> 
> The receive side is particularly critical - on FreeBSD
> the receiver is woken up every two packets (do the math
> below, between the number of rx calls and throughput and mss),
> resulting in almost 200K activations per second, despite
> the fact that interrupt mitigation is set to a much lower
> value (so incoming packets should be batched).
> On Linux, i see far fewer reads; presumably the process is
> woken up only at the end of a burst.
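
Spelling out that "do the math" step with the figures from the first example
dump at the end of this message (5192 Mbit/s, mss=1460, 215801 receiver I/O
calls per second), a minimal check:

#include <stdio.h>

/* bytes delivered per receiver wakeup, from the quoted nuttcp numbers */
int
main(void)
{
        double rate_mbps = 5191.97;       /* nuttcp-t reported throughput */
        double mss = 1460.0;              /* mss reported for the same run */
        double rx_calls_per_s = 215801.0; /* nuttcp-r I/O calls per second */
        double bytes_per_call;

        bytes_per_call = rate_mbps * 1e6 / 8.0 / rx_calls_per_s;
        printf("%.0f bytes per read, about %.1f segments per wakeup\n",
            bytes_per_call, bytes_per_call / mss);
        return (0);
}

which works out to roughly 3000 bytes, i.e. about two mss-sized segments
per wakeup.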

I wrote about the relative performance of FreeBSD and Linux on -performance@
in Jan '11 ("Interrupt performance").

> 
>  EXAMPLES OF OUTPUT --
> 
> > nuttcp -t -T 5 -w 128 -v  10.0.1.2
> nuttcp-t: v6.1.2: socket
> nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.0.1.2
> nuttcp-t: time limit = 5.00 seconds
> nuttcp-t: connect to 10.0.1.2 with mss=1460, RTT=0.103 ms
> nuttcp-t: send window size = 131400, receive window size = 65700
> nuttcp-t: 3095.0982 MB in 5.00 real seconds = 633785.85 KB/sec = 5191.9737 
> Mbps
> nuttcp-t: host-retrans = 0
> nuttcp-t: 49522 I/O calls, msec/call = 0.10, calls/sec = 9902.99
> nuttcp-t: 0.0user 2.7sys 0:05real 54% 100i+2639d 752maxrss 0+3pf 258876+6csw
> 
> nuttcp-r: v6.1.2: socket
> nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
> nuttcp-r: accept from 10.0.1.4
> nuttcp-r: send window size = 33580, receive window size = 131400
> nuttcp-r: 3095.0982 MB in 5.17 real seconds = 613526.42 KB/sec = 5026.0084 
> Mbps
> nuttcp-r: 1114794 I/O calls, msec/call = 0.00, calls/sec = 215801.03
> nuttcp-r: 0.1user 3.5sys 0:05real 69% 112i+1104d 626maxrss 0+15pf 
> 507653+188csw
> >
> 
> > nuttcp -t -T 5 -w 128 -v localhost
> nuttcp-t: v6.1.2: socket
> nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> localhost
> nuttcp-t: time limit = 5.00 seconds
> nuttcp-t: connect to 127.0.0.1 with mss=14336, RTT=0.051 ms
> nuttcp-t: send window size = 143360, receive window size = 71680
> nuttcp-t: 26963.4375 MB in 5.00 real seconds = 5521440.59 KB/sec = 45231.6413 
> Mbps
> nuttcp-t: host-retrans = 0
> nuttcp-t: 431415 I/O calls, msec/call = 0.01, calls/sec = 86272.51
> nuttcp-t: 0.0user 4.6sys 0:05real 93% 102i+2681d 774maxrss 0+3pf 2510+1csw
> 
> nuttcp-r: v6.1.2: socket
> nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
> nuttcp-r: accept from 127.0.0.1
> nuttcp-r: send window size = 43008, receive window size = 143360
> nuttcp-r: 26963.4375 MB in 5.20 real seconds = 5313135.74 KB/sec = 43525.2080 
> Mbps
> nuttcp-r: 767807 I/O calls, msec/call = 0.01, calls/sec = 147750.09
> nuttcp-r: 0.1user 3.9sys 0:05real 79% 98i+2570d 772maxrss 0+16pf 311014+8csw
> 
> 
> on the server, run  "


Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?

2011-12-08 Thread Lawrence Stewart

On 12/08/11 05:08, Luigi Rizzo wrote:

On Wed, Dec 07, 2011 at 11:59:43AM +0100, Andre Oppermann wrote:

On 06.12.2011 22:06, Luigi Rizzo wrote:

...

Even in my experiments there is a lot of instability in the results.
I don't know exactly where the problem is, but the high number of
read syscalls, and the huge impact of setting interrupt_rate=0
(defaults at 16us on the ixgbe) makes me think that there is something
that needs investigation in the protocol stack.

Of course we don't want to optimize specifically for the one-flow-at-10G
case, but devising something that makes the system less affected
by short timing variations, and can pass upstream interrupt mitigation
delays would help.


I'm not sure the variance is only coming from the network card and
driver side of things.  The TCP processing and interactions with
scheduler and locking probably play a big role as well.  There have
been many changes to TCP recently and maybe an inefficiency that
affects high-speed single sessions throughput has crept in.  That's
difficult to debug though.


I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which
seems slightly faster than HEAD) using MTU=1500 and various
combinations of card capabilities (hwcsum,tso,lro), different window
sizes and interrupt mitigation configurations.

default latency is 16us, l=0 means no interrupt mitigation.
lro is the software implementation of lro (tcp_lro.c)
hwlro is the hardware one (on 82599). Using a window of 100 Kbytes
seems to give the best results.

Summary:


[snip]


- enabling software lro on the transmit side actually slows
   down the throughput (4-5Gbit/s instead of 8.0).
   I am not sure why (perhaps acks are delayed too much) ?
   Adding a couple of lines in tcp_lro to reject
   pure acks seems to have much better effect.

The tcp_lro patch below might actually be useful also for
other cards.

--- tcp_lro.c   (revision 228284)
+++ tcp_lro.c   (working copy)
@@ -245,6 +250,8 @@

 ip_len = ntohs(ip->ip_len);
 tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
+   if (tcp_data_len == 0)
+   return -1;  /* not on ack */


 /*


There is a bug with our LRO implementation (first noticed by Jeff 
Roberson) that I started fixing some time back but dropped the ball on. 
The crux of the problem is that we currently only send an ACK for the 
entire LRO chunk instead of all the segments contained therein. Given 
that most stacks rely on the ACK clock to keep things ticking over, the 
current behaviour kills performance. It may well be the cause of the 
performance loss you have observed. WIP patch is at:


http://people.freebsd.org/~lstewart/patches/misctcp/tcplro_multiack_9.x.r219723.patch

Jeff tested the WIP patch and it *doesn't* fix the issue. I don't have 
LRO capable hardware setup locally to figure out what I've missed. Most 
of the machines in my lab are running em(4) NICs which don't support 
LRO, but I'll see if I can find something which does and perhaps 
resurrect this patch.


If anyone has any ideas what I'm missing in the patch to make it work, 
please let me know.


Cheers,
Lawrence


Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?

2011-12-08 Thread Luigi Rizzo
On Thu, Dec 08, 2011 at 12:06:26PM +0200, Daniel Kalchev wrote:
> 
> 
> On 07.12.11 22:23, Luigi Rizzo wrote:
> >
> >Sorry, forgot to mention that the above is with TSO DISABLED
> >(which is not the default). TSO seems to have a very bad
> >interaction with HWCSUM and non-zero mitigation.
> 
> I have this on both sender and receiver
> 
> # ifconfig ix1
> ix1: flags=8843 metric 0 mtu 1500
> 
> options=4bb
> ether 00:25:90:35:22:f1
> inet 10.2.101.11 netmask 0xff00 broadcast 10.2.101.255
> media: Ethernet autoselect (autoselect )
> status: active
> 
> without LRO on either end
> 
> # nuttcp -t -T 5 -w 128 -v 10.2.101.11
> nuttcp-t: v6.1.2: socket
> nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
> nuttcp-t: time limit = 5.00 seconds
> nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.051 ms
> nuttcp-t: send window size = 131768, receive window size = 66608
> nuttcp-t: 1802.4049 MB in 5.06 real seconds = 365077.76 KB/sec = 
> 2990.7170 Mbps
> nuttcp-t: host-retrans = 0
> nuttcp-t: 28839 I/O calls, msec/call = 0.18, calls/sec = 5704.44
> nuttcp-t: 0.0user 4.5sys 0:05real 90% 108i+1459d 630maxrss 0+2pf 87706+1csw
> 
> nuttcp-r: v6.1.2: socket
> nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
> nuttcp-r: accept from 10.2.101.12
> nuttcp-r: send window size = 33304, receive window size = 131768
> nuttcp-r: 1802.4049 MB in 5.18 real seconds = 356247.49 KB/sec = 
> 2918.3794 Mbps
> nuttcp-r: 529295 I/O calls, msec/call = 0.01, calls/sec = 102163.86
> nuttcp-r: 0.1user 3.7sys 0:05real 73% 116i+1567d 618maxrss 0+15pf 
> 230404+0csw
> 
> with LRO on receiver
> 
> # nuttcp -t -T 5 -w 128 -v 10.2.101.11
> nuttcp-t: v6.1.2: socket
> nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
> nuttcp-t: time limit = 5.00 seconds
> nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.067 ms
> nuttcp-t: send window size = 131768, receive window size = 66608
> nuttcp-t: 2420.5000 MB in 5.02 real seconds = 493701.04 KB/sec = 
> 4044.3989 Mbps
> nuttcp-t: host-retrans = 2
> nuttcp-t: 38728 I/O calls, msec/call = 0.13, calls/sec = 7714.08
> nuttcp-t: 0.0user 4.1sys 0:05real 83% 107i+1436d 630maxrss 0+2pf 4896+0csw
> 
> nuttcp-r: v6.1.2: socket
> nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
> nuttcp-r: accept from 10.2.101.12
> nuttcp-r: send window size = 33304, receive window size = 131768
> nuttcp-r: 2420.5000 MB in 5.15 real seconds = 481679.37 KB/sec = 
> 3945.9174 Mbps
> nuttcp-r: 242266 I/O calls, msec/call = 0.02, calls/sec = 47080.98
> nuttcp-r: 0.0user 2.4sys 0:05real 49% 112i+1502d 618maxrss 0+15pf 
> 156333+0csw
> 
> About 1/4 improvement...
> 
> With LRO on both sender and receiver
> 
> # nuttcp -t -T 5 -w 128 -v 10.2.101.11
> nuttcp-t: v6.1.2: socket
> nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
> nuttcp-t: time limit = 5.00 seconds
> nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.049 ms
> nuttcp-t: send window size = 131768, receive window size = 66608
> nuttcp-t: 2585.7500 MB in 5.02 real seconds = 527402.83 KB/sec = 
> 4320.4840 Mbps
> nuttcp-t: host-retrans = 1
> nuttcp-t: 41372 I/O calls, msec/call = 0.12, calls/sec = 8240.67
> nuttcp-t: 0.0user 4.6sys 0:05real 93% 106i+1421d 630maxrss 0+2pf 4286+0csw
> 
> nuttcp-r: v6.1.2: socket
> nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
> nuttcp-r: accept from 10.2.101.12
> nuttcp-r: send window size = 33304, receive window size = 131768
> nuttcp-r: 2585.7500 MB in 5.15 real seconds = 514585.31 KB/sec = 
> 4215.4829 Mbps
> nuttcp-r: 282820 I/O calls, msec/call = 0.02, calls/sec = 54964.34
> nuttcp-r: 0.0user 2.7sys 0:05real 55% 114i+1540d 618maxrss 0+15pf 
> 188794+147csw
> 
> Even better...
> 
> With LRO on sender only:
> 
> # nuttcp -t -T 5 -w 128 -v 10.2.101.11
> nuttcp-t: v6.1.2: socket
> nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
> nuttcp-t: time limit = 5.00 seconds
> nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.054 ms
> nuttcp-t: send window size = 131768, receive window size = 66608
> nuttcp-t: 2077.5437 MB in 5.02 real seconds = 423740.81 KB/sec = 
> 3471.2847 Mbps
> nuttcp-t: host-retrans = 0
> nuttcp-t: 33241 I/O calls, msec/call = 0.15, calls/sec = 6621.01
> nuttcp-t: 0.0user 4.5sys 0:05real 92% 109i+1468d 630maxrss 0+2pf 49532+25csw
> 
> nuttcp-r: v6.1.2: socket
> nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
> nuttcp-r: accept from 10.2.101.12
> nuttcp-r: send window size = 33304, receive window size = 131768
> nuttcp-r: 2077.5437 MB in 5.15 real seconds = 413415.33 KB/sec = 
> 3386.6984 Mbps
> nuttcp-r: 531979 I/O calls, msec/call = 0.01, calls/sec = 103378.67
> nuttcp-r: 0.0user 4.5sys 0:05real 88% 110i+1474d 618maxrss 0+15pf 
> 117367+0csw
> 
> 
> >also remember that hw.ixgbe.max_interrupt_rate has only
> >effect at module load -- i.e. you set it with the bootloader,
> >or with kenv before loading the module.
> 
> I have this in /boot/loader.conf
> 
> kern.ipc.nmbclusters=512000
> hw.ixgbe.max_interrupt_rate=0
> 
> on both sender and receiver.

Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?

2011-12-08 Thread Daniel Kalchev



On 07.12.11 22:23, Luigi Rizzo wrote:


Sorry, forgot to mention that the above is with TSO DISABLED
(which is not the default). TSO seems to have a very bad
interaction with HWCSUM and non-zero mitigation.


I have this on both sender and receiver

# ifconfig ix1
ix1: flags=8843 metric 0 mtu 1500

options=4bb

ether 00:25:90:35:22:f1
inet 10.2.101.11 netmask 0xff00 broadcast 10.2.101.255
media: Ethernet autoselect (autoselect )
status: active

without LRO on either end

# nuttcp -t -T 5 -w 128 -v 10.2.101.11
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.051 ms
nuttcp-t: send window size = 131768, receive window size = 66608
nuttcp-t: 1802.4049 MB in 5.06 real seconds = 365077.76 KB/sec = 
2990.7170 Mbps

nuttcp-t: host-retrans = 0
nuttcp-t: 28839 I/O calls, msec/call = 0.18, calls/sec = 5704.44
nuttcp-t: 0.0user 4.5sys 0:05real 90% 108i+1459d 630maxrss 0+2pf 87706+1csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 10.2.101.12
nuttcp-r: send window size = 33304, receive window size = 131768
nuttcp-r: 1802.4049 MB in 5.18 real seconds = 356247.49 KB/sec = 
2918.3794 Mbps

nuttcp-r: 529295 I/O calls, msec/call = 0.01, calls/sec = 102163.86
nuttcp-r: 0.1user 3.7sys 0:05real 73% 116i+1567d 618maxrss 0+15pf 
230404+0csw


with LRO on receiver

# nuttcp -t -T 5 -w 128 -v 10.2.101.11
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.067 ms
nuttcp-t: send window size = 131768, receive window size = 66608
nuttcp-t: 2420.5000 MB in 5.02 real seconds = 493701.04 KB/sec = 
4044.3989 Mbps

nuttcp-t: host-retrans = 2
nuttcp-t: 38728 I/O calls, msec/call = 0.13, calls/sec = 7714.08
nuttcp-t: 0.0user 4.1sys 0:05real 83% 107i+1436d 630maxrss 0+2pf 4896+0csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 10.2.101.12
nuttcp-r: send window size = 33304, receive window size = 131768
nuttcp-r: 2420.5000 MB in 5.15 real seconds = 481679.37 KB/sec = 
3945.9174 Mbps

nuttcp-r: 242266 I/O calls, msec/call = 0.02, calls/sec = 47080.98
nuttcp-r: 0.0user 2.4sys 0:05real 49% 112i+1502d 618maxrss 0+15pf 
156333+0csw


About 1/4 improvement...

With LRO on both sender and receiver

# nuttcp -t -T 5 -w 128 -v 10.2.101.11
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.049 ms
nuttcp-t: send window size = 131768, receive window size = 66608
nuttcp-t: 2585.7500 MB in 5.02 real seconds = 527402.83 KB/sec = 
4320.4840 Mbps

nuttcp-t: host-retrans = 1
nuttcp-t: 41372 I/O calls, msec/call = 0.12, calls/sec = 8240.67
nuttcp-t: 0.0user 4.6sys 0:05real 93% 106i+1421d 630maxrss 0+2pf 4286+0csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 10.2.101.12
nuttcp-r: send window size = 33304, receive window size = 131768
nuttcp-r: 2585.7500 MB in 5.15 real seconds = 514585.31 KB/sec = 
4215.4829 Mbps

nuttcp-r: 282820 I/O calls, msec/call = 0.02, calls/sec = 54964.34
nuttcp-r: 0.0user 2.7sys 0:05real 55% 114i+1540d 618maxrss 0+15pf 
188794+147csw


Even better...

With LRO on sender only:

# nuttcp -t -T 5 -w 128 -v 10.2.101.11
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.054 ms
nuttcp-t: send window size = 131768, receive window size = 66608
nuttcp-t: 2077.5437 MB in 5.02 real seconds = 423740.81 KB/sec = 
3471.2847 Mbps

nuttcp-t: host-retrans = 0
nuttcp-t: 33241 I/O calls, msec/call = 0.15, calls/sec = 6621.01
nuttcp-t: 0.0user 4.5sys 0:05real 92% 109i+1468d 630maxrss 0+2pf 49532+25csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 10.2.101.12
nuttcp-r: send window size = 33304, receive window size = 131768
nuttcp-r: 2077.5437 MB in 5.15 real seconds = 413415.33 KB/sec = 
3386.6984 Mbps

nuttcp-r: 531979 I/O calls, msec/call = 0.01, calls/sec = 103378.67
nuttcp-r: 0.0user 4.5sys 0:05real 88% 110i+1474d 618maxrss 0+15pf 
117367+0csw




also remember that hw.ixgbe.max_interrupt_rate has only
effect at module load -- i.e. you set it with the bootloader,
or with kenv before loading the module.


I have this in /boot/loader.conf

kern.ipc.nmbclusters=512000
hw.ixgbe.max_interrupt_rate=0

on both sender and receiver.


Please retry the measurements disabling tso (on both sides, but
it really matters only on the sender). Also, LRO requires HWCSUM.


How do I set HWCSUM? Is this different from RXCSUM/TXCSUM?

Still I get nowhere near what you get on my hardware... Here is what 
pciconf -vlbc has to 

Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?

2011-12-07 Thread Luigi Rizzo
On Wed, Dec 07, 2011 at 09:58:31PM +0200, Daniel Kalchev wrote:
> 
> On Dec 7, 2011, at 8:08 PM, Luigi Rizzo wrote:
> 
> > Summary:
> > 
> > - with default interrupt mitigation, the fastest configuration
> >  is with checksums enabled on both sender and receiver, lro
> >  enabled on the receiver. This gets about 8.0 Gbit/s
>
> I do not observe this. With defaults:
> ...

Sorry, forgot to mention that the above is with TSO DISABLED
(which is not the default). TSO seems to have a very bad
interaction with HWCSUM and non-zero mitigation.

also remember that hw.ixgbe.max_interrupt_rate has only
effect at module load -- i.e. you set it with the bootloader,
or with kenv before loading the module.

Please retry the measurements disabling tso (on both sides, but
it really matters only on the sender). Also, LRO requires HWCSUM.

cheers
luigi

> # nuttcp -t -T 5 -w 128 -v 10.2.101.11
> nuttcp-t: v6.1.2: socket
> nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
> nuttcp-t: time limit = 5.00 seconds
> nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.053 ms
> nuttcp-t: send window size = 131768, receive window size = 66608
> nuttcp-t: 1857.4978 MB in 5.02 real seconds = 378856.02 KB/sec = 3103.5885 
> Mbps
> nuttcp-t: host-retrans = 0
> nuttcp-t: 29720 I/O calls, msec/call = 0.17, calls/sec = 5919.63
> nuttcp-t: 0.0user 2.5sys 0:05real 52% 115i+1544d 630maxrss 0+2pf 107264+1csw
> 
> nuttcp-r: v6.1.2: socket
> nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
> nuttcp-r: accept from 10.2.101.12
> nuttcp-r: send window size = 33304, receive window size = 131768
> nuttcp-r: 1857.4978 MB in 5.15 real seconds = 369617.39 KB/sec = 3027.9057 
> Mbps
> nuttcp-r: 543991 I/O calls, msec/call = 0.01, calls/sec = 105709.95
> nuttcp-r: 0.1user 4.1sys 0:05real 83% 110i+1482d 618maxrss 0+15pf 158432+0csw
> 
> On receiver:
> 
> ifconfig ix1 lro
> 
> # nuttcp -t -T 5 -w 128 -v 10.2.101.11
> nuttcp-t: v6.1.2: socket
> nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
> nuttcp-t: time limit = 5.00 seconds
> nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.068 ms
> nuttcp-t: send window size = 131768, receive window size = 66608
> nuttcp-t: 1673.3125 MB in 5.02 real seconds = 341312.36 KB/sec = 2796.0308 
> Mbps
> nuttcp-t: host-retrans = 1
> nuttcp-t: 26773 I/O calls, msec/call = 0.19, calls/sec = 5333.01
> nuttcp-t: 0.0user 1.0sys 0:05real 21% 113i+1518d 630maxrss 0+2pf 12772+1csw
> 
> nuttcp-r: v6.1.2: socket
> nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
> nuttcp-r: accept from 10.2.101.12
> nuttcp-r: send window size = 33304, receive window size = 131768
> nuttcp-r: 1673.3125 MB in 5.15 real seconds = 332975.19 KB/sec = 2727.7327 
> Mbps
> nuttcp-r: 106268 I/O calls, msec/call = 0.05, calls/sec = 20650.82
> nuttcp-r: 0.0user 1.3sys 0:05real 28% 101i+1354d 618maxrss 0+15pf 64567+0csw
> 
> On sender:
> 
> ifconfig ix1 lro
> 
> (now both receiver and sender have LRO enabled)
> 
> # nuttcp -t -T 5 -w 128 -v 10.2.101.11
> nuttcp-t: v6.1.2: socket
> nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
> nuttcp-t: time limit = 5.00 seconds
> nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.063 ms
> nuttcp-t: send window size = 131768, receive window size = 66608
> nuttcp-t: 1611.7805 MB in 5.02 real seconds = 328716.18 KB/sec = 2692.8430 
> Mbps
> nuttcp-t: host-retrans = 1
> nuttcp-t: 25789 I/O calls, msec/call = 0.20, calls/sec = 5136.29
> nuttcp-t: 0.0user 1.0sys 0:05real 21% 109i+1465d 630maxrss 0+2pf 12697+0csw
> 
> nuttcp-r: v6.1.2: socket
> nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
> nuttcp-r: accept from 10.2.101.12
> nuttcp-r: send window size = 33304, receive window size = 131768
> nuttcp-r: 1611.7805 MB in 5.15 real seconds = 320694.82 KB/sec = 2627.1319 
> Mbps
> nuttcp-r: 104346 I/O calls, msec/call = 0.05, calls/sec = 20275.05
> nuttcp-r: 0.0user 1.3sys 0:05real 27% 113i+1516d 618maxrss 0+15pf 63510+0csw
> 
> remove LRO from receiver (only sender has LRO):
> 
> # nuttcp -t -T 5 -w 128 -v 10.2.101.11
> nuttcp-t: v6.1.2: socket
> nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
> nuttcp-t: time limit = 5.00 seconds
> nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.065 ms
> nuttcp-t: send window size = 131768, receive window size = 66608
> nuttcp-t: 1884.8702 MB in 5.02 real seconds = 384464.57 KB/sec = 3149.5338 
> Mbps
> nuttcp-t: host-retrans = 0
> nuttcp-t: 30158 I/O calls, msec/call = 0.17, calls/sec = 6007.27
> nuttcp-t: 0.0user 2.7sys 0:05real 55% 104i+1403d 630maxrss 0+2pf 106046+0csw
> 
> nuttcp-r: v6.1.2: socket
> nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
> nuttcp-r: accept from 10.2.101.12
> nuttcp-r: send window size = 33304, receive window size = 131768
> nuttcp-r: 1884.8702 MB in 5.15 real seconds = 375093.52 KB/sec = 3072.7661 
> Mbps
> nuttcp-r: 540237 I/O calls, msec/call = 0.01, calls/sec = 104988.68
> nuttcp-r: 0.1user 4.2sys 0:05real 84% 110i+1483d 618maxrss 0+15pf 156340+0csw
> 
> Strangely enough, setting hw.ixgbe.max_interrupt_rate=0 does not have any 
> effect.

Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?

2011-12-07 Thread Daniel Kalchev

On Dec 7, 2011, at 8:08 PM, Luigi Rizzo wrote:

> Summary:
> 
> - with default interrupt mitigation, the fastest configuration
>  is with checksums enabled on both sender and receiver, lro
>  enabled on the receiver. This gets about 8.0 Gbit/s

I do not observe this. With defaults:

# nuttcp -t -T 5 -w 128 -v 10.2.101.11
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.053 ms
nuttcp-t: send window size = 131768, receive window size = 66608
nuttcp-t: 1857.4978 MB in 5.02 real seconds = 378856.02 KB/sec = 3103.5885 Mbps
nuttcp-t: host-retrans = 0
nuttcp-t: 29720 I/O calls, msec/call = 0.17, calls/sec = 5919.63
nuttcp-t: 0.0user 2.5sys 0:05real 52% 115i+1544d 630maxrss 0+2pf 107264+1csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 10.2.101.12
nuttcp-r: send window size = 33304, receive window size = 131768
nuttcp-r: 1857.4978 MB in 5.15 real seconds = 369617.39 KB/sec = 3027.9057 Mbps
nuttcp-r: 543991 I/O calls, msec/call = 0.01, calls/sec = 105709.95
nuttcp-r: 0.1user 4.1sys 0:05real 83% 110i+1482d 618maxrss 0+15pf 158432+0csw

On receiver:

ifconfig ix1 lro

# nuttcp -t -T 5 -w 128 -v 10.2.101.11
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.068 ms
nuttcp-t: send window size = 131768, receive window size = 66608
nuttcp-t: 1673.3125 MB in 5.02 real seconds = 341312.36 KB/sec = 2796.0308 Mbps
nuttcp-t: host-retrans = 1
nuttcp-t: 26773 I/O calls, msec/call = 0.19, calls/sec = 5333.01
nuttcp-t: 0.0user 1.0sys 0:05real 21% 113i+1518d 630maxrss 0+2pf 12772+1csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 10.2.101.12
nuttcp-r: send window size = 33304, receive window size = 131768
nuttcp-r: 1673.3125 MB in 5.15 real seconds = 332975.19 KB/sec = 2727.7327 Mbps
nuttcp-r: 106268 I/O calls, msec/call = 0.05, calls/sec = 20650.82
nuttcp-r: 0.0user 1.3sys 0:05real 28% 101i+1354d 618maxrss 0+15pf 64567+0csw

On sender:

ifconfig ix1 lro

(now both receiver and sender have LRO enabled)

# nuttcp -t -T 5 -w 128 -v 10.2.101.11
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.063 ms
nuttcp-t: send window size = 131768, receive window size = 66608
nuttcp-t: 1611.7805 MB in 5.02 real seconds = 328716.18 KB/sec = 2692.8430 Mbps
nuttcp-t: host-retrans = 1
nuttcp-t: 25789 I/O calls, msec/call = 0.20, calls/sec = 5136.29
nuttcp-t: 0.0user 1.0sys 0:05real 21% 109i+1465d 630maxrss 0+2pf 12697+0csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 10.2.101.12
nuttcp-r: send window size = 33304, receive window size = 131768
nuttcp-r: 1611.7805 MB in 5.15 real seconds = 320694.82 KB/sec = 2627.1319 Mbps
nuttcp-r: 104346 I/O calls, msec/call = 0.05, calls/sec = 20275.05
nuttcp-r: 0.0user 1.3sys 0:05real 27% 113i+1516d 618maxrss 0+15pf 63510+0csw

remove LRO from receiver (only sender has LRO):

# nuttcp -t -T 5 -w 128 -v 10.2.101.11
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.065 ms
nuttcp-t: send window size = 131768, receive window size = 66608
nuttcp-t: 1884.8702 MB in 5.02 real seconds = 384464.57 KB/sec = 3149.5338 Mbps
nuttcp-t: host-retrans = 0
nuttcp-t: 30158 I/O calls, msec/call = 0.17, calls/sec = 6007.27
nuttcp-t: 0.0user 2.7sys 0:05real 55% 104i+1403d 630maxrss 0+2pf 106046+0csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 10.2.101.12
nuttcp-r: send window size = 33304, receive window size = 131768
nuttcp-r: 1884.8702 MB in 5.15 real seconds = 375093.52 KB/sec = 3072.7661 Mbps
nuttcp-r: 540237 I/O calls, msec/call = 0.01, calls/sec = 104988.68
nuttcp-r: 0.1user 4.2sys 0:05real 84% 110i+1483d 618maxrss 0+15pf 156340+0csw

Strangely enough, setting hw.ixgbe.max_interrupt_rate=0 does not have any effect.

Daniel



quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?

2011-12-07 Thread Luigi Rizzo
On Wed, Dec 07, 2011 at 11:59:43AM +0100, Andre Oppermann wrote:
> On 06.12.2011 22:06, Luigi Rizzo wrote:
...
> >Even in my experiments there is a lot of instability in the results.
> >I don't know exactly where the problem is, but the high number of
> >read syscalls, and the huge impact of setting interrupt_rate=0
> >(defaults at 16us on the ixgbe) makes me think that there is something
> >that needs investigation in the protocol stack.
> >
> >Of course we don't want to optimize specifically for the one-flow-at-10G
> >case, but devising something that makes the system less affected
> >by short timing variations, and can pass upstream interrupt mitigation
> >delays would help.
> 
> I'm not sure the variance is only coming from the network card and
> driver side of things.  The TCP processing and interactions with
> scheduler and locking probably play a big role as well.  There have
> been many changes to TCP recently and maybe an inefficiency that
> affects high-speed single sessions throughput has crept in.  That's
> difficult to debug though.

I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which
seems slightly faster than HEAD) using MTU=1500 and various
combinations of card capabilities (hwcsum,tso,lro), different window
sizes and interrupt mitigation configurations.

default latency is 16us, l=0 means no interrupt mitigation.
lro is the software implementation of lro (tcp_lro.c)
hwlro is the hardware one (on 82599). Using a window of 100 Kbytes
seems to give the best results.

Summary:

- with default interrupt mitigation, the fastest configuration
  is with checksums enabled on both sender and receiver, lro
  enabled on the receiver. This gets about 8.0 Gbit/s

- lro is especially good because it packs data packets together,
  passing mitigation upstream and removing duplicate work in
  the ip and tcp stack.

- disabling LRO on the receiver brings performance to 6.5 Gbit/s.
  Also it increases the CPU load (also in userspace).

- disabling checksums on the sender reduces transmit speed to 5.5 Gbit/s

- checksums disabled on both sides (and no LRO on the receiver) go
  down to 4.8 Gbit/s

- I could not try the receive side without checksum but with lro

- with default interrupt mitigation, setting both
  HWCSUM and TSO on the sender is really disruptive.
  Depending on lro settings on the receiver i get 1.5 to 3.2 Gbit/s
  and huge variance

- Using both hwcsum and tso seems to work fine if you
  disable interrupt mitigation (reaching a peak of 9.4 Gbit/s).

- enabling software lro on the transmit side actually slows
  down the throughput (4-5 Gbit/s instead of 8.0).
  I am not sure why (perhaps acks are delayed too much?).
  Adding a couple of lines in tcp_lro to reject
  pure acks seems to have a much better effect.

The tcp_lro patch below might actually be useful also for
other cards.

--- tcp_lro.c   (revision 228284)
+++ tcp_lro.c   (working copy)
@@ -245,6 +250,8 @@
 
ip_len = ntohs(ip->ip_len);
tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
+   if (tcp_data_len == 0)
+   return -1;  /* not on ack */

 
/* 


cheers
luigi


Re: datapoints on 10G throughput with TCP ?

2011-12-07 Thread Andre Oppermann

On 06.12.2011 22:06, Luigi Rizzo wrote:

On Tue, Dec 06, 2011 at 07:40:21PM +0200, Daniel Kalchev wrote:

I see significant difference between number of interrupts on the Intel and the 
AMD blades. When performing a test between the Intel and AMD blades, the Intel 
blade generates 20,000-35,000 interrupts, while the AMD blade generates under 
1,000 interrupts.



Even in my experiments there is a lot of instability in the results.
I don't know exactly where the problem is, but the high number of
read syscalls, and the huge impact of setting interrupt_rate=0
(defaults at 16us on the ixgbe) makes me think that there is something
that needs investigation in the protocol stack.

Of course we don't want to optimize specifically for the one-flow-at-10G
case, but devising something that makes the system less affected
by short timing variations, and can pass upstream interrupt mitigation
delays would help.


I'm not sure the variance is only coming from the network card and
driver side of things.  The TCP processing and interactions with
scheduler and locking probably play a big role as well.  There have
been many changes to TCP recently and maybe an inefficiency that
affects high-speed single sessions throughput has crept in.  That's
difficult to debug though.

--
Andre


Re: datapoints on 10G throughput with TCP ?

2011-12-06 Thread Daniel O'Connor

On 07/12/2011, at 24:54, Daniel Kalchev wrote:
> It seems performance measurements are more dependent on the server (nuttcp 
> -S) machine.
> We will have to rule out the interrupt storms first of course, any advice?

You can control the storm threshold by setting the hw.intr_storm_threshold 
sysctl.

--
Daniel O'Connor software and network engineer
for Genesis Software - http://www.gsoft.com.au
"The nice thing about standards is that there
are so many of them to choose from."
  -- Andrew Tanenbaum
GPG Fingerprint - 5596 B766 97C0 0E94 4347 295E E593 DC20 7B3F CE8C








Re: datapoints on 10G throughput with TCP ?

2011-12-06 Thread Luigi Rizzo
On Tue, Dec 06, 2011 at 07:40:21PM +0200, Daniel Kalchev wrote:
> I see significant difference between number of interrupts on the Intel and 
> the AMD blades. When performing a test between the Intel and AMD blades, the 
> Intel blade generates 20,000-35,000 interrupts, while the AMD blade generates 
> under 1,000 interrupts.
> 

Even in my experiments there is a lot of instability in the results.
I don't know exactly where the problem is, but the high number of
read syscalls, and the huge impact of setting interrupt_rate=0
(defaults at 16us on the ixgbe) makes me think that there is something
that needs investigation in the protocol stack.

Of course we don't want to optimize specifically for the one-flow-at-10G
case, but devising something that makes the system less affected
by short timing variations, and can pass upstream interrupt mitigation
delays would help.

I don't have a solution yet..

cheers
luigi


Re: datapoints on 10G throughput with TCP ?

2011-12-06 Thread Daniel Kalchev
I see a significant difference in the number of interrupts between the Intel
and the AMD blades. When performing a test between the Intel and AMD blades,
the Intel blade generates 20,000-35,000 interrupts, while the AMD blade
generates under 1,000 interrupts.

There is no longer any throttling, but the performance does not improve.

I set it via

sysctl hw.intr_storm_threshold=0

Should this go in /boot/loader.conf instead?

Daniel

On Dec 6, 2011, at 7:21 PM, Jack Vogel wrote:

> Set the storm threshold to 0; that will disable it. It's going to throttle
> your performance when it happens.
> 
> Jack
> 



Re: datapoints on 10G throughput with TCP ?

2011-12-06 Thread Jack Vogel
Set the storm threshold to 0; that will disable it. It's going to throttle
your performance when it happens.

Jack


On Tue, Dec 6, 2011 at 6:24 AM, Daniel Kalchev  wrote:

> Some tests with updated FreeBSD to 8-stable as of today, compared with the
> previous run
>
>
>
> On 06.12.11 13:18, Daniel Kalchev wrote:
>
>>
>> FreeBSD 8.2-STABLE #0: Wed Sep 28 11:23:59 EEST 2011
>> CPU: Intel(R) Xeon(R) CPU   E5620  @ 2.40GHz (2403.58-MHz
>> K8-class CPU)
>> real memory  = 51539607552 (49152 MB)
>> blade 1:
>>
>> # nuttcp -S
>> # nuttcp -t -T 5 -w 128 -v localhost
>> nuttcp-t: v6.1.2: socket
>> nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> localhost
>> nuttcp-t: time limit = 5.00 seconds
>> nuttcp-t: connect to 127.0.0.1 with mss=14336, RTT=0.044 ms
>> nuttcp-t: send window size = 143360, receive window size = 71680
>> nuttcp-t: 8959.8750 MB in 5.02 real seconds = 1827635.67 KB/sec =
>> 14971.9914 Mbps
>> nuttcp-t: host-retrans = 0
>> nuttcp-t: 143358 I/O calls, msec/call = 0.04, calls/sec = 28556.81
>> nuttcp-t: 0.0user 4.9sys 0:05real 99% 106i+1428d 602maxrss 0+5pf 16+46csw
>>
>> nuttcp-r: v6.1.2: socket
>> nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
>> nuttcp-r: accept from 127.0.0.1
>> nuttcp-r: send window size = 43008, receive window size = 143360
>> nuttcp-r: 8959.8750 MB in 5.17 real seconds = 1773171.07 KB/sec =
>> 14525.8174 Mbps
>> nuttcp-r: 219708 I/O calls, msec/call = 0.02, calls/sec = 42461.43
>> nuttcp-r: 0.0user 3.8sys 0:05real 76% 105i+1407d 614maxrss 1+17pf
>> 95059+22csw
>>
>
> New results:
>
> FreeBSD 8.2-STABLE #1: Tue Dec  6 13:51:01 EET 2011
>
>
>
> # nuttcp -t -T 5 -w 128 -v localhost
> nuttcp-t: v6.1.2: socket
> nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> localhost
> nuttcp-t: time limit = 5.00 seconds
> nuttcp-t: connect to 127.0.0.1 with mss=14336, RTT=0.030 ms
>
> nuttcp-t: send window size = 143360, receive window size = 71680
> nuttcp-t: 12748.0625 MB in 5.02 real seconds = 2599947.38 KB/sec =
> 21298.7689 Mbps
> nuttcp-t: host-retrans = 0
> nuttcp-t: 203969 I/O calls, msec/call = 0.03, calls/sec = 40624.18
> nuttcp-t: 0.0user 4.9sys 0:05real 99% 106i+1428d 620maxrss 0+2pf 1+82csw
>
>
> nuttcp-r: v6.1.2: socket
> nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
> nuttcp-r: accept from 127.0.0.1
> nuttcp-r: send window size = 43008, receive window size = 143360
> nuttcp-r: 12748.0625 MB in 5.15 real seconds = 2536511.81 KB/sec =
> 20779.1048 Mbps
> nuttcp-r: 297000 I/O calls, msec/call = 0.02, calls/sec = 57709.75
> nuttcp-r: 0.1user 4.0sys 0:05real 81% 109i+1469d 626maxrss 0+15pf
> 121136+34csw
>
> Noticeable improvement.
>
>
>
>
>> blade 2:
>>
>> # nuttcp -t -T 5 -w 128 -v 10.2.101.12
>> nuttcp-t: v6.1.2: socket
>> nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.12
>> nuttcp-t: time limit = 5.00 seconds
>> nuttcp-t: connect to 10.2.101.12 with mss=1448, RTT=0.059 ms
>> nuttcp-t: send window size = 131768, receive window size = 66608
>> nuttcp-t: 1340.6469 MB in 5.02 real seconds = 273449.90 KB/sec =
>> 2240.1016 Mbps
>> nuttcp-t: host-retrans = 171
>> nuttcp-t: 21451 I/O calls, msec/call = 0.24, calls/sec = 4272.78
>> nuttcp-t: 0.0user 1.9sys 0:05real 39% 120i+1610d 600maxrss 2+3pf
>> 75658+0csw
>>
>> nuttcp-r: v6.1.2: socket
>> nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
>> nuttcp-r: accept from 10.2.101.11
>> nuttcp-r: send window size = 33304, receive window size = 131768
>> nuttcp-r: 1340.6469 MB in 5.17 real seconds = 265292.92 KB/sec =
>> 2173.2796 Mbps
>> nuttcp-r: 408764 I/O calls, msec/call = 0.01, calls/sec = 78992.15
>> nuttcp-r: 0.0user 3.3sys 0:05real 64% 105i+1413d 620maxrss 0+15pf
>> 105104+102csw
>>
>
> # nuttcp -t -T 5 -w 128 -v 10.2.101.11
> nuttcp-t: v6.1.2: socket
> nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
>
> nuttcp-t: time limit = 5.00 seconds
> nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.055 ms
>
> nuttcp-t: send window size = 131768, receive window size = 66608
> nuttcp-t: 1964.8640 MB in 5.02 real seconds = 400757.59 KB/sec = 3283.0062
> Mbps
> nuttcp-t: host-retrans = 0
> nuttcp-t: 31438 I/O calls, msec/call = 0.16, calls/sec = 6261.87
> nuttcp-t: 0.0user 2.7sys 0:05real 55% 112i+1501d 1124maxrss 1+2pf
> 65+112csw
>
>
> nuttcp-r: v6.1.2: socket
> nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
> nuttcp-r: accept from 10.2.101.12
>
> nuttcp-r: send window size = 33304, receive window size = 131768
> nuttcp-r: 1964.8640 MB in 5.15 real seconds = 390972.20 KB/sec = 3202.8442
> Mbps
> nuttcp-r: 560718 I/O calls, msec/call = 0.01, calls/sec = 108957.70
> nuttcp-r: 0.1user 4.2sys 0:05real 84% 111i+1494d 626maxrss 0+15pf
> 151930+16csw
>
> Again, improvement.
>
>
>
>>
>> Another pari of blades:
>>
>> FreeBSD 8.2-STABLE #0: Tue Aug  9 12:37:55 EEST 2011
>> CPU: AMD Opteron(tm) Processor 6134 (2300.04-MHz K8-class CPU)
>> real memory  = 68719476736 (65536 MB)
>>
>> first blade:
>>
>> # nuttcp -S
>> # nuttcp -t -T 5 -w 128 -v localhost
>> nuttcp-t: v6.1.2: socket
>> nuttcp-t: buflen=65

Re: datapoints on 10G throughput with TCP ?

2011-12-06 Thread Daniel Kalchev
Some tests with FreeBSD updated to 8-stable as of today, compared with
the previous run:



On 06.12.11 13:18, Daniel Kalchev wrote:


FreeBSD 8.2-STABLE #0: Wed Sep 28 11:23:59 EEST 2011
CPU: Intel(R) Xeon(R) CPU   E5620  @ 2.40GHz (2403.58-MHz 
K8-class CPU)

real memory  = 51539607552 (49152 MB)
blade 1:

# nuttcp -S
# nuttcp -t -T 5 -w 128 -v localhost
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> localhost
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 127.0.0.1 with mss=14336, RTT=0.044 ms
nuttcp-t: send window size = 143360, receive window size = 71680
nuttcp-t: 8959.8750 MB in 5.02 real seconds = 1827635.67 KB/sec = 
14971.9914 Mbps

nuttcp-t: host-retrans = 0
nuttcp-t: 143358 I/O calls, msec/call = 0.04, calls/sec = 28556.81
nuttcp-t: 0.0user 4.9sys 0:05real 99% 106i+1428d 602maxrss 0+5pf 16+46csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 127.0.0.1
nuttcp-r: send window size = 43008, receive window size = 143360
nuttcp-r: 8959.8750 MB in 5.17 real seconds = 1773171.07 KB/sec = 
14525.8174 Mbps

nuttcp-r: 219708 I/O calls, msec/call = 0.02, calls/sec = 42461.43
nuttcp-r: 0.0user 3.8sys 0:05real 76% 105i+1407d 614maxrss 1+17pf 
95059+22csw


New results:

FreeBSD 8.2-STABLE #1: Tue Dec  6 13:51:01 EET 2011


# nuttcp -t -T 5 -w 128 -v localhost
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> localhost
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 127.0.0.1 with mss=14336, RTT=0.030 ms
nuttcp-t: send window size = 143360, receive window size = 71680
nuttcp-t: 12748.0625 MB in 5.02 real seconds = 2599947.38 KB/sec = 
21298.7689 Mbps

nuttcp-t: host-retrans = 0
nuttcp-t: 203969 I/O calls, msec/call = 0.03, calls/sec = 40624.18
nuttcp-t: 0.0user 4.9sys 0:05real 99% 106i+1428d 620maxrss 0+2pf 1+82csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 127.0.0.1
nuttcp-r: send window size = 43008, receive window size = 143360
nuttcp-r: 12748.0625 MB in 5.15 real seconds = 2536511.81 KB/sec = 
20779.1048 Mbps

nuttcp-r: 297000 I/O calls, msec/call = 0.02, calls/sec = 57709.75
nuttcp-r: 0.1user 4.0sys 0:05real 81% 109i+1469d 626maxrss 0+15pf 
121136+34csw


Noticeable improvement.




blade 2:

# nuttcp -t -T 5 -w 128 -v 10.2.101.12
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.12
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 10.2.101.12 with mss=1448, RTT=0.059 ms
nuttcp-t: send window size = 131768, receive window size = 66608
nuttcp-t: 1340.6469 MB in 5.02 real seconds = 273449.90 KB/sec = 
2240.1016 Mbps

nuttcp-t: host-retrans = 171
nuttcp-t: 21451 I/O calls, msec/call = 0.24, calls/sec = 4272.78
nuttcp-t: 0.0user 1.9sys 0:05real 39% 120i+1610d 600maxrss 2+3pf 
75658+0csw


nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 10.2.101.11
nuttcp-r: send window size = 33304, receive window size = 131768
nuttcp-r: 1340.6469 MB in 5.17 real seconds = 265292.92 KB/sec = 
2173.2796 Mbps

nuttcp-r: 408764 I/O calls, msec/call = 0.01, calls/sec = 78992.15
nuttcp-r: 0.0user 3.3sys 0:05real 64% 105i+1413d 620maxrss 0+15pf 
105104+102csw


# nuttcp -t -T 5 -w 128 -v 10.2.101.11
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.055 ms
nuttcp-t: send window size = 131768, receive window size = 66608
nuttcp-t: 1964.8640 MB in 5.02 real seconds = 400757.59 KB/sec = 
3283.0062 Mbps

nuttcp-t: host-retrans = 0
nuttcp-t: 31438 I/O calls, msec/call = 0.16, calls/sec = 6261.87
nuttcp-t: 0.0user 2.7sys 0:05real 55% 112i+1501d 1124maxrss 1+2pf 
65+112csw


nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 10.2.101.12
nuttcp-r: send window size = 33304, receive window size = 131768
nuttcp-r: 1964.8640 MB in 5.15 real seconds = 390972.20 KB/sec = 
3202.8442 Mbps

nuttcp-r: 560718 I/O calls, msec/call = 0.01, calls/sec = 108957.70
nuttcp-r: 0.1user 4.2sys 0:05real 84% 111i+1494d 626maxrss 0+15pf 
151930+16csw


Again, improvement.





Another pair of blades:

FreeBSD 8.2-STABLE #0: Tue Aug  9 12:37:55 EEST 2011
CPU: AMD Opteron(tm) Processor 6134 (2300.04-MHz K8-class CPU)
real memory  = 68719476736 (65536 MB)

first blade:

# nuttcp -S
# nuttcp -t -T 5 -w 128 -v localhost
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> localhost
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 127.0.0.1 with mss=14336, RTT=0.090 ms
nuttcp-t: send window size = 143360, receive window size = 71680
nuttcp-t: 2695.0625 MB in 5.00 real seconds = 551756.90 KB/sec = 
4519.9925 Mbps

nuttcp-t: host-retrans = 0
nuttcp-t: 43121 I/O calls, msec/call = 0.12, calls/sec = 8621.20
nuttcp-t: 0.0user 4.9sys 0:05real 99% 106i+1428d 620maxrss 0+4pf 2+71csw

nuttcp-r: v6.

Re: datapoints on 10G throughput with TCP ?

2011-12-06 Thread Daniel Kalchev



On 06.12.11 13:18, Daniel Kalchev wrote:

[...]
second blade:

# nuttcp -t -T 5 -w 128 -v 10.2.101.13
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.13
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 10.2.101.13 with mss=1448, RTT=0.164 ms
nuttcp-t: send window size = 131768, receive window size = 66608
nuttcp-t: 1290.3750 MB in 5.00 real seconds = 264173.96 KB/sec = 
2164.1131 Mbps

nuttcp-t: host-retrans = 0
nuttcp-t: 20646 I/O calls, msec/call = 0.25, calls/sec = 4127.72
nuttcp-t: 0.0user 3.8sys 0:05real 77% 96i+1299d 616maxrss 0+3pf 
27389+0csw


nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 10.2.101.14
nuttcp-r: send window size = 33304, receive window size = 131768
nuttcp-r: 1290.3750 MB in 5.14 real seconds = 256835.92 KB/sec = 
2103.9998 Mbps

nuttcp-r: 85668 I/O calls, msec/call = 0.06, calls/sec = 16651.70
nuttcp-r: 0.0user 4.8sys 0:05real 94% 107i+1437d 624maxrss 0+15pf 
11848+0csw



Not impressive... I am rebuilding now to -stable.

Daniel


I also noticed interrupt storms happening while this was running on the 
second pair of blades:


interrupt storm detected on "irq272:"; throttling interrupt source
interrupt storm detected on "irq272:"; throttling interrupt source
interrupt storm detected on "irq272:"; throttling interrupt source
interrupt storm detected on "irq270:"; throttling interrupt source
interrupt storm detected on "irq270:"; throttling interrupt source
interrupt storm detected on "irq270:"; throttling interrupt source
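
A quick way to see which ix queue sits behind irq270/irq272, and to
relax the throttling while testing, is the stock vmstat/sysctl knobs.
This is only a suggestion, not something tried on these blades;
hw.intr_storm_threshold is the standard FreeBSD storm limit and 0
disables the check:

# vmstat -i | grep -E 'irq27[02]'    # map the irq number to its ix queue
# sysctl hw.intr_storm_threshold     # default is 1000
# sysctl hw.intr_storm_threshold=0   # disable storm throttling (testing only)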

Some stats:

# sysctl -a dev.ix.1
dev.ix.1.%desc: Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 
2.3.10

dev.ix.1.%driver: ix
dev.ix.1.%location: slot=0 function=1
dev.ix.1.%pnpinfo: vendor=0x8086 device=0x10fc subvendor=0x 
subdevice=0x class=0x02

dev.ix.1.%parent: pci3
dev.ix.1.flow_control: 3
dev.ix.1.advertise_gig: 0
dev.ix.1.enable_aim: 1
dev.ix.1.rx_processing_limit: 128
dev.ix.1.dropped: 0
dev.ix.1.mbuf_defrag_failed: 0
dev.ix.1.no_tx_dma_setup: 0
dev.ix.1.watchdog_events: 0
dev.ix.1.tso_tx: 1193460
dev.ix.1.link_irq: 1
dev.ix.1.queue0.interrupt_rate: 100
dev.ix.1.queue0.txd_head: 45
dev.ix.1.queue0.txd_tail: 45
dev.ix.1.queue0.no_desc_avail: 0
dev.ix.1.queue0.tx_packets: 23
dev.ix.1.queue0.rxd_head: 16
dev.ix.1.queue0.rxd_tail: 15
dev.ix.1.queue0.rx_packets: 16
dev.ix.1.queue0.rx_bytes: 2029
dev.ix.1.queue0.lro_queued: 0
dev.ix.1.queue0.lro_flushed: 0
dev.ix.1.queue1.interrupt_rate: 62500
dev.ix.1.queue1.txd_head: 0
dev.ix.1.queue1.txd_tail: 0
dev.ix.1.queue1.no_desc_avail: 0
dev.ix.1.queue1.tx_packets: 0
dev.ix.1.queue1.rxd_head: 0
dev.ix.1.queue1.rxd_tail: 2047
dev.ix.1.queue1.rx_packets: 0
dev.ix.1.queue1.rx_bytes: 0
dev.ix.1.queue1.lro_queued: 0
dev.ix.1.queue1.lro_flushed: 0
dev.ix.1.queue2.interrupt_rate: 20
dev.ix.1.queue2.txd_head: 545
dev.ix.1.queue2.txd_tail: 545
dev.ix.1.queue2.no_desc_avail: 0
dev.ix.1.queue2.tx_packets: 331690
dev.ix.1.queue2.rxd_head: 1099
dev.ix.1.queue2.rxd_tail: 1098
dev.ix.1.queue2.rx_packets: 498763
dev.ix.1.queue2.rx_bytes: 32954702
dev.ix.1.queue2.lro_queued: 0
dev.ix.1.queue2.lro_flushed: 0
dev.ix.1.queue3.interrupt_rate: 62500
dev.ix.1.queue3.txd_head: 0
dev.ix.1.queue3.txd_tail: 0
dev.ix.1.queue3.no_desc_avail: 0
dev.ix.1.queue3.tx_packets: 0
dev.ix.1.queue3.rxd_head: 0
dev.ix.1.queue3.rxd_tail: 2047
dev.ix.1.queue3.rx_packets: 0
dev.ix.1.queue3.rx_bytes: 0
dev.ix.1.queue3.lro_queued: 0
dev.ix.1.queue3.lro_flushed: 0
dev.ix.1.queue4.interrupt_rate: 100
dev.ix.1.queue4.txd_head: 13
dev.ix.1.queue4.txd_tail: 13
dev.ix.1.queue4.no_desc_avail: 0
dev.ix.1.queue4.tx_packets: 6
dev.ix.1.queue4.rxd_head: 6
dev.ix.1.queue4.rxd_tail: 5
dev.ix.1.queue4.rx_packets: 6
dev.ix.1.queue4.rx_bytes: 899
dev.ix.1.queue4.lro_queued: 0
dev.ix.1.queue4.lro_flushed: 0
dev.ix.1.queue5.interrupt_rate: 20
dev.ix.1.queue5.txd_head: 982
dev.ix.1.queue5.txd_tail: 982
dev.ix.1.queue5.no_desc_avail: 0
dev.ix.1.queue5.tx_packets: 302592
dev.ix.1.queue5.rxd_head: 956
dev.ix.1.queue5.rxd_tail: 955
dev.ix.1.queue5.rx_packets: 474044
dev.ix.1.queue5.rx_bytes: 31319840
dev.ix.1.queue5.lro_queued: 0
dev.ix.1.queue5.lro_flushed: 0
dev.ix.1.queue6.interrupt_rate: 20
dev.ix.1.queue6.txd_head: 1902
dev.ix.1.queue6.txd_tail: 1902
dev.ix.1.queue6.no_desc_avail: 0
dev.ix.1.queue6.tx_packets: 184922
dev.ix.1.queue6.rxd_head: 1410
dev.ix.1.queue6.rxd_tail: 1409
dev.ix.1.queue6.rx_packets: 402818
dev.ix.1.queue6.rx_bytes: 27759640
dev.ix.1.queue6.lro_queued: 0
dev.ix.1.queue6.lro_flushed: 0
dev.ix.1.queue7.interrupt_rate: 20
dev.ix.1.queue7.txd_head: 660
dev.ix.1.queue7.txd_tail: 660
dev.ix.1.queue7.no_desc_avail: 0
dev.ix.1.queue7.tx_packets: 378078
dev.ix.1.queue7.rxd_head: 885
dev.ix.1.queue7.rxd_tail: 884
dev.ix.1.queue7.rx_packets: 705397
dev.ix.1.queue7.rx_bytes: 46572290
dev.ix.1.queue7.lro_queued: 0
dev.ix.1.queue7.lro_flushed: 0
dev.ix.1.mac_stats.crc_errs: 0
dev.ix.1.mac_stats.ill_errs: 0
dev.ix.1.mac_stats.byt
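
To watch just the interesting bits of this dev.ix.1 tree while a test
is running, a one-liner such as the following is enough (note that
lro_queued/lro_flushed staying at 0 above suggests LRO is not merging
anything on this interface):

# sysctl dev.ix.1 | grep -E 'interrupt_rate|lro_'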

Re: datapoints on 10G throughput with TCP ?

2011-12-06 Thread Daniel Kalchev
Here is what I get, with an existing install, no tuning other than 
kern.ipc.nmbclusters=512000
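
For reference, a minimal sketch of how that single knob is applied and
then sanity-checked; where exactly it was set on these machines is not
stated, the loader.conf path below is just the usual place:

# echo 'kern.ipc.nmbclusters=512000' >> /boot/loader.conf   # picked up at boot
# netstat -m    # after a run: cluster usage and any "denied" counters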


Pair of Supermicro blades:

FreeBSD 8.2-STABLE #0: Wed Sep 28 11:23:59 EEST 2011
CPU: Intel(R) Xeon(R) CPU   E5620  @ 2.40GHz (2403.58-MHz 
K8-class CPU)

real memory  = 51539607552 (49152 MB)
[...]
ix0:  
port 0xdc00-0xdc1f mem 0xfbc0-0xfbdf,0xfbbfc000-0xfbbf irq 
16 at device 0.0 on pci3

ix0: Using MSIX interrupts with 9 vectors
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: Ethernet address: xx:xx:xx:xx:xx:xx
ix0: PCI Express Bus: Speed 5.0Gb/s Width x8
ix1:  
port 0xd880-0xd89f mem 0xfb80-0xfb9f,0xfbbf8000-0xfbbfbfff irq 
17 at device 0.1 on pci3

ix1: Using MSIX interrupts with 9 vectors
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: Ethernet address: xx:xx:xx:xx:xx:xx
ix1: PCI Express Bus: Speed 5.0Gb/s Width x8


blade 1:

# nuttcp -S
# nuttcp -t -T 5 -w 128 -v localhost
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> localhost
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 127.0.0.1 with mss=14336, RTT=0.044 ms
nuttcp-t: send window size = 143360, receive window size = 71680
nuttcp-t: 8959.8750 MB in 5.02 real seconds = 1827635.67 KB/sec = 
14971.9914 Mbps

nuttcp-t: host-retrans = 0
nuttcp-t: 143358 I/O calls, msec/call = 0.04, calls/sec = 28556.81
nuttcp-t: 0.0user 4.9sys 0:05real 99% 106i+1428d 602maxrss 0+5pf 16+46csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 127.0.0.1
nuttcp-r: send window size = 43008, receive window size = 143360
nuttcp-r: 8959.8750 MB in 5.17 real seconds = 1773171.07 KB/sec = 
14525.8174 Mbps

nuttcp-r: 219708 I/O calls, msec/call = 0.02, calls/sec = 42461.43
nuttcp-r: 0.0user 3.8sys 0:05real 76% 105i+1407d 614maxrss 1+17pf 
95059+22csw


blade 2:

# nuttcp -t -T 5 -w 128 -v 10.2.101.12
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.12
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 10.2.101.12 with mss=1448, RTT=0.059 ms
nuttcp-t: send window size = 131768, receive window size = 66608
nuttcp-t: 1340.6469 MB in 5.02 real seconds = 273449.90 KB/sec = 
2240.1016 Mbps

nuttcp-t: host-retrans = 171
nuttcp-t: 21451 I/O calls, msec/call = 0.24, calls/sec = 4272.78
nuttcp-t: 0.0user 1.9sys 0:05real 39% 120i+1610d 600maxrss 2+3pf 75658+0csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 10.2.101.11
nuttcp-r: send window size = 33304, receive window size = 131768
nuttcp-r: 1340.6469 MB in 5.17 real seconds = 265292.92 KB/sec = 
2173.2796 Mbps

nuttcp-r: 408764 I/O calls, msec/call = 0.01, calls/sec = 78992.15
nuttcp-r: 0.0user 3.3sys 0:05real 64% 105i+1413d 620maxrss 0+15pf 
105104+102csw



Another pair of blades:

FreeBSD 8.2-STABLE #0: Tue Aug  9 12:37:55 EEST 2011
CPU: AMD Opteron(tm) Processor 6134 (2300.04-MHz K8-class CPU)
real memory  = 68719476736 (65536 MB)
[...]
ix0:  
port 0xe400-0xe41f mem 0xfe60-0xfe7f,0xfe4fc000-0xfe4f irq 
19 at device 0.0 on pci3

ix0: Using MSIX interrupts with 9 vectors
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: Ethernet address: xx:xx:xx:xx:xx:xx
ix0: PCI Express Bus: Speed 5.0Gb/s Width x8
ix1:  
port 0xe800-0xe81f mem 0xfea0-0xfebf,0xfe8fc000-0xfe8f irq 
16 at device 0.1 on pci3

ix1: Using MSIX interrupts with 9 vectors
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: Ethernet address: xx:xx:xx:xx:xx:xx
ix1: PCI Express Bus: Speed 5.0Gb/s Width x8

first blade:

# nuttcp -S
# nuttcp -t -T 5 -w 128 -v localhost
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> localhost
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 127.0.0.1 with mss=14336, RTT=0.090 ms
nuttcp-t: send window size = 143360, receive window size = 71680
nuttcp-t: 2695.0625 MB in 5.00 real seconds = 551756.90 KB/sec = 
4519.9925 Mbps

nuttcp-t: host-retrans = 0
nuttcp-t: 43121 I/O calls, msec/call = 0.12, calls/sec = 8621.20
nuttcp-t: 0.0user 4.9sys 0:05real 99% 106i+1428d 620maxrss 0+4pf 2+71csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 127.0.0.1
nuttcp-r: send window size = 43008, receive window size = 143360
nuttcp-r: 2695.0625 MB in 5.14 real seconds = 536509.66 KB/sec = 
4395.0871 Mbps

nuttcp-r: 43126 I/O calls, msec/call = 0.12, calls/sec = 8383.94
nuttcp-r: 0.0user 3.1sys 0:05real 61% 94i+1264d 624maxrss 1+16pf 43019+0csw

second blade:

# nuttcp -t -T 5 -w 128 -v 10.2.101.13
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.13
nuttcp-t: time 

Re: datapoints on 10G throughput with TCP ?

2011-12-05 Thread Jack Vogel
You can't get line rate with ixgbe? In what configuration/hardware?
We surely do get line rate in validation here, but it's sensitive to
your hardware and config.

Jack


On Mon, Dec 5, 2011 at 2:28 PM, Luigi Rizzo  wrote:

> On Mon, Dec 05, 2011 at 11:15:09PM +0200, Daniel Kalchev wrote:
> >
> > On Dec 5, 2011, at 9:27 PM, Luigi Rizzo wrote:
> >
> > > - have two machines connected by a 10G link
> > > - on one run "nuttcp -S"
> > > - on the other one run "nuttcp -t -T 5 -w 128 -v the.other.ip"
> > >
> >
> > Any particular tuning of FreeBSD?
>
> actually my point is first to see how good or bad the defaults are.
>
> I have noticed that setting hw.ixgbe.max_interrupt_rate=0
> (it is a tunable, you need to do it before loading the module)
> improves the throughput by a fair amount (but still way below
> line rate with 1500 byte packets).
>
> other things (larger windows) don't seem to help much.
>
> cheers
> luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: datapoints on 10G throughput with TCP ?

2011-12-05 Thread Luigi Rizzo
On Mon, Dec 05, 2011 at 03:08:54PM -0800, Jack Vogel wrote:
> You can't get line rate with ixgbe? In what configuration/hardware?
> We surely do get line rate in validation here, but it's sensitive to
> your hardware and config.

sources from HEAD as of a week or so ago, default parameter settings,
82599 on an Intel dual-port 10G card, Intel i7-870 CPU (4 cores)
at 2.93 GHz, ASUS MB with the card in a PCIe x16 slot, MTU=1500 bytes.
The same hardware with the same defaults and nuttcp on linux does 8.5 Gbit/s.

I can do line rate with a single flow if i use MTU=9000 and set
max_interrupt_rate=0 (even reducing the CPU speed to 1.2 GHz).

I can saturate the link with multiple flows (say nuttcp -N 8).
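
For completeness, a sketch of those two line-rate setups; ix0 and the
peer address are placeholders, and the mitigation tunable is assumed to
have been cleared before the driver loaded, as described further down
in the thread:

# ifconfig ix0 mtu 9000 up                  # jumbo frames, on both ends
# nuttcp -t -T 5 -w 128 -v 10.0.1.2         # single flow at MTU 9000
# nuttcp -N 8 -t -T 5 -w 128 -v 10.0.1.2    # or 8 parallel flows at MTU 1500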

cheers
luigi

> Jack
> 
> 
> On Mon, Dec 5, 2011 at 2:28 PM, Luigi Rizzo  wrote:
> 
> > On Mon, Dec 05, 2011 at 11:15:09PM +0200, Daniel Kalchev wrote:
> > >
> > > On Dec 5, 2011, at 9:27 PM, Luigi Rizzo wrote:
> > >
> > > > - have two machines connected by a 10G link
> > > > - on one run "nuttcp -S"
> > > > - on the other one run "nuttcp -t -T 5 -w 128 -v the.other.ip"
> > > >
> > >
> > > Any particular tuning of FreeBSD?
> >
> > actually my point is first to see how good or bad the defaults are.
> >
> > I have noticed that setting hw.ixgbe.max_interrupt_rate=0
> > (it is a tunable, you need to do it before loading the module)
> > improves the throughput by a fair amount (but still way below
> > line rate with 1500 byte packets).
> >
> > other things (larger windows) don't seem to help much.
> >
> > cheers
> > luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: datapoints on 10G throughput with TCP ?

2011-12-05 Thread Luigi Rizzo
On Mon, Dec 05, 2011 at 11:15:09PM +0200, Daniel Kalchev wrote:
> 
> On Dec 5, 2011, at 9:27 PM, Luigi Rizzo wrote:
> 
> > - have two machines connected by a 10G link
> > - on one run "nuttcp -S"
> > - on the other one run "nuttcp -t -T 5 -w 128 -v the.other.ip"
> > 
> 
> Any particular tuning of FreeBSD?

actually my point is first to see how good or bad the defaults are.

I have noticed that setting hw.ixgbe.max_interrupt_rate=0
(it is a tunable, you need to do it before loading the module)
improves the throughput by a fair amount (but still way below
line rate with 1500 byte packets).
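
A minimal sketch of setting it; the kenv variant is only an option if
if_ixgbe is loaded as a module rather than compiled into the kernel:

# echo 'hw.ixgbe.max_interrupt_rate=0' >> /boot/loader.conf   # then reboot
# kenv hw.ixgbe.max_interrupt_rate=0 && kldload if_ixgbe      # module case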

other things (larger windows) don't seem to help much.

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: datapoints on 10G throughput with TCP ?

2011-12-05 Thread Daniel Kalchev

On Dec 5, 2011, at 9:27 PM, Luigi Rizzo wrote:

> - have two machines connected by a 10G link
> - on one run "nuttcp -S"
> - on the other one run "nuttcp -t -T 5 -w 128 -v the.other.ip"
> 

Any particular tuning of FreeBSD?

Daniel

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


datapoints on 10G throughput with TCP ?

2011-12-05 Thread Luigi Rizzo
Hi,
I am trying to establish the baseline performance for 10G throughput
over TCP, and would like to collect some data points.  As a testing
program i am using nuttcp from ports (as good as anything, i
guess -- it is reasonably flexible, and if you use it in
TCP with relatively large writes, the overhead for syscalls
and gettimeofday() shouldn't kill you).

I'd be very grateful if you could do the following test:

- have two machines connected by a 10G link
- on one run "nuttcp -S"
- on the other one run "nuttcp -t -T 5 -w 128 -v the.other.ip"

and send me a dump of the output, such as the one(s) at the end of
the message.
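
To make collecting that dump easier, here is a small wrapper sketch
that runs both configurations discussed below; the peer address is a
placeholder (its side must already be running "nuttcp -S") and a local
"nuttcp -S" is assumed to be running for the loopback pass:

#!/bin/sh
# usage: ./run-nuttcp.sh <peer-10G-address>
PEER=${1:-10.0.1.2}
{
        echo "=== loopback ==="
        nuttcp -t -T 5 -w 128 -v localhost
        echo "=== 10G wire, ${PEER} ==="
        nuttcp -t -T 5 -w 128 -v "${PEER}"
} 2>&1 | tee nuttcp-$(hostname -s).txt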

I am mostly interested in two configurations:
- one over loopback, which should tell how fast the CPU and memory are.
  As an example, one of my machines does about 15 Gbit/s, and
  one of the faster ones does about 44 Gbit/s.

- one over the wire using a 1500 byte mss. Here it really matters
  how good the handling of small MTUs is.

As a data point, i get 2..3.5 Gbit/s on my "slow" machine with a
1500 byte mtu and default card settings. Clearing the interrupt
mitigation register (so no mitigation) brings the rate to 5-6 Gbit/s.
The same hardware with linux does about 8 Gbit/s. HEAD seems 10-20%
slower than RELENG_8, though i am not sure who is at fault.

The receive side is particularly critical - on FreeBSD
the receiver is woken up every two packets (do the math
below, from the number of rx calls, the throughput and the mss),
resulting in almost 200K activations per second, despite
interrupt mitigation being set to a much lower value
(so incoming packets should be batched).
On linux, i see far fewer reads, presumably because the process
is woken up only at the end of a burst.
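
To make the "every two packets" arithmetic explicit, using the receiver
numbers from the 5 Gbit/s wire example at the end of this message
(613526.42 KB/sec, 215801 reads/sec, mss=1460; nuttcp's KB is 1024
bytes, as the Mbps figure confirms):

# echo 'scale=2; 613526.42 * 1024 / 215801 / 1460' | bc   # ~1.99 segments/read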

cheers
luigi


-- EXAMPLES OF OUTPUT --

> nuttcp -t -T 5 -w 128 -v  10.0.1.2
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.0.1.2
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 10.0.1.2 with mss=1460, RTT=0.103 ms
nuttcp-t: send window size = 131400, receive window size = 65700
nuttcp-t: 3095.0982 MB in 5.00 real seconds = 633785.85 KB/sec = 5191.9737 Mbps
nuttcp-t: host-retrans = 0
nuttcp-t: 49522 I/O calls, msec/call = 0.10, calls/sec = 9902.99
nuttcp-t: 0.0user 2.7sys 0:05real 54% 100i+2639d 752maxrss 0+3pf 258876+6csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 10.0.1.4
nuttcp-r: send window size = 33580, receive window size = 131400
nuttcp-r: 3095.0982 MB in 5.17 real seconds = 613526.42 KB/sec = 5026.0084 Mbps
nuttcp-r: 1114794 I/O calls, msec/call = 0.00, calls/sec = 215801.03
nuttcp-r: 0.1user 3.5sys 0:05real 69% 112i+1104d 626maxrss 0+15pf 507653+188csw
>

> nuttcp -t -T 5 -w 128 -v localhost
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> localhost
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 127.0.0.1 with mss=14336, RTT=0.051 ms
nuttcp-t: send window size = 143360, receive window size = 71680
nuttcp-t: 26963.4375 MB in 5.00 real seconds = 5521440.59 KB/sec = 45231.6413 
Mbps
nuttcp-t: host-retrans = 0
nuttcp-t: 431415 I/O calls, msec/call = 0.01, calls/sec = 86272.51
nuttcp-t: 0.0user 4.6sys 0:05real 93% 102i+2681d 774maxrss 0+3pf 2510+1csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 127.0.0.1
nuttcp-r: send window size = 43008, receive window size = 143360
nuttcp-r: 26963.4375 MB in 5.20 real seconds = 5313135.74 KB/sec = 43525.2080 
Mbps
nuttcp-r: 767807 I/O calls, msec/call = 0.01, calls/sec = 147750.09
nuttcp-r: 0.1user 3.9sys 0:05real 79% 98i+2570d 772maxrss 0+16pf 311014+8csw


on the server, run  "
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"