Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?)
On Fri, Dec 09, 2011 at 01:33:04AM +0100, Andre Oppermann wrote:
> On 08.12.2011 16:34, Luigi Rizzo wrote:
> >On Fri, Dec 09, 2011 at 12:11:50AM +1100, Lawrence Stewart wrote:
...
> >>Jeff tested the WIP patch and it *doesn't* fix the issue. I don't have
> >>LRO capable hardware setup locally to figure out what I've missed. Most
> >>of the machines in my lab are running em(4) NICs which don't support
> >>LRO, but I'll see if I can find something which does and perhaps
> >>resurrect this patch.
>
> LRO can always be done in software. You can do it at driver, ether_input
> or ip_input level.

Storing LRO state at the driver (as it is done now) is very convenient,
because it is trivial to flush the pending segments at the end of an rx
interrupt. If you want to do LRO in ether_input() or ip_input(), you
need to add another call to flush the LRO state stored there.

> >a few comments:
> >1. i don't think it makes sense to send multiple acks on
> >coalesced segments (and the 82599 does not seem to do that).
> >First of all, the acks would get out with minimal spacing (ideally
> >less than 100ns) so chances are that the remote end will see
> >them in a single burst anyways. Secondly, the remote end can
> >easily tell that a single ACK is reporting multiple MSS and
> >behave as if an equivalent number of acks had arrived.
>
> ABC (appropriate byte counting) gets in the way though.

Right, during slow start the current ABC specification (RFC 3465) sets a
pretty low limit on how much the window can be expanded on each ACK.
On the other hand...

> >2. i am a big fan of LRO (and similar solutions), because it can save
> >a lot of repeated work when passing packets up the stack, and the
> >mechanism becomes more and more effective as the system load increases,
> >which is a wonderful property in terms of system stability.
> >
> >For this reason, i think it would be useful to add support for software
> >LRO in the generic code (sys/net/if.c) so that drivers can directly use
> >the software implementation even without hardware support.
>
> It hurts on higher RTT links in the general case. For LAN RTT's
> it's good.

... on the other hand, remember that LRO coalescing is limited to the
number of segments that arrive during a mitigation interval, so even on
a 10G interface it's only a handful of packets. I better run some
simulations to see how long it takes to get full rate on a 10..50ms path
when using LRO.

cheers
luigi
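As an aside, here is a minimal sketch of what "flush the pending segments at the end of an rx interrupt" looks like when a driver feeds the software LRO code from tcp_lro.c. It is illustrative only: the rx_ring layout and foo_next_frame() helper are made up, and the lro_ctrl / tcp_lro_rx() / tcp_lro_flush() calls follow the tcp_lro API of that era, so exact signatures may differ between FreeBSD versions.

/*
 * Illustrative sketch (not taken from a real driver): merge received
 * frames into the software LRO state and flush whatever is still
 * pending once the rx interrupt batch is done.  rx_ring and
 * foo_next_frame() are hypothetical; lro_ctrl, tcp_lro_rx() and
 * tcp_lro_flush() are meant to match sys/netinet/tcp_lro.h of that era.
 */
static void
foo_rxeof(struct rx_ring *rxr)
{
        struct lro_ctrl *lro = &rxr->lro;
        struct lro_entry *queued;
        struct mbuf *m;

        while ((m = foo_next_frame(rxr)) != NULL) {
                /* Try to merge the segment into a pending LRO flow. */
                if ((rxr->ifp->if_capenable & IFCAP_LRO) &&
                    tcp_lro_rx(lro, m, 0) == 0)
                        continue;
                /* Not coalesced: hand it to the stack directly. */
                (*rxr->ifp->if_input)(rxr->ifp, m);
        }

        /*
         * End of the interrupt: push everything still queued up the
         * stack, so nothing lingers across mitigation intervals.
         */
        while ((queued = SLIST_FIRST(&lro->lro_active)) != NULL) {
                SLIST_REMOVE_HEAD(&lro->lro_active, next);
                tcp_lro_flush(lro, queued);
        }
}

Doing the coalescing in ether_input() or ip_input() instead, as suggested above, would need an equivalent flush hook, since those functions see one packet at a time and have no natural notion of "end of the rx batch".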
Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?)
On 08.12.2011 16:34, Luigi Rizzo wrote: On Fri, Dec 09, 2011 at 12:11:50AM +1100, Lawrence Stewart wrote: On 12/08/11 05:08, Luigi Rizzo wrote: ... I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which seems slightly faster than HEAD) using MTU=1500 and various combinations of card capabilities (hwcsum,tso,lro), different window sizes and interrupt mitigation configurations. default latency is 16us, l=0 means no interrupt mitigation. lro is the software implementation of lro (tcp_lro.c) hwlro is the hardware one (on 82599). Using a window of 100 Kbytes seems to give the best results. Summary: [snip] - enabling software lro on the transmit side actually slows down the throughput (4-5Gbit/s instead of 8.0). I am not sure why (perhaps acks are delayed too much) ? Adding a couple of lines in tcp_lro to reject pure acks seems to have much better effect. The tcp_lro patch below might actually be useful also for other cards. --- tcp_lro.c (revision 228284) +++ tcp_lro.c (working copy) @@ -245,6 +250,8 @@ ip_len = ntohs(ip->ip_len); tcp_data_len = ip_len - (tcp->th_off<< 2) - sizeof (*ip); + if (tcp_data_len == 0) + return -1; /* not on ack */ /* There is a bug with our LRO implementation (first noticed by Jeff Roberson) that I started fixing some time back but dropped the ball on. The crux of the problem is that we currently only send an ACK for the entire LRO chunk instead of all the segments contained therein. Given that most stacks rely on the ACK clock to keep things ticking over, the current behaviour kills performance. It may well be the cause of the performance loss you have observed. I should clarify better. First of all, i tested two different LRO implementations: our "Software LRO" (tcp_lro.c), and the "Hardware LRO" which is implemented by the 82599 (called RSC or receive-side-coalescing in the 82599 data sheets). Jack Vogel and Navdeep Parhar (both in Cc) can probably comment on the logic of both. In my tests, either SW or HW LRO on the receive side HELPED A LOT, not just in terms of raw throughput but also in terms of system load on the receiver. On the receive side, LRO packs multiple data segments into one that is passed up the stack. As you mentioned this also reduces the number of acks generated, but not dramatically (consider, the LRO is bounded by the number of segments received in the mitigation interval). In my tests the number of reads() on the receiver was reduced by approx a factor of 3 compared to the !LRO case, meaning 4-5 segment merged by LRO. Navdeep reported some numbers for cxgbe with similar numbers. Using Hardware LRO on the transmit side had no ill effect. Being done in hardware i have no idea how it is implemented. Using Software LRO on the transmit side did give a significant throughput reduction. I can't explain the exact cause, though it is possible that between reducing the number of segments to the receiver and collapsing ACKs that it generates, the sender starves. But it could well be that it is the extra delay on passing up the ACKs that limits performance. Either way, since the HW LRO did a fine job, i was trying to figure out whether avoiding LRO on pure acks could help, and the two-line patch above did help. Note, my patch was just a proof-of-concept, and may cause reordering if a data segment is followed by a pure ack. But this can be fixed easily, handling a pure ack as an out-of-sequence packet in tcp_lro_rx(). 
WIP patch is at: http://people.freebsd.org/~lstewart/patches/misctcp/tcplro_multiack_9.x.r219723.patch Jeff tested the WIP patch and it *doesn't* fix the issue. I don't have LRO capable hardware setup locally to figure out what I've missed. Most of the machines in my lab are running em(4) NICs which don't support LRO, but I'll see if I can find something which does and perhaps resurrect this patch. LRO can always be done in software. You can do it at driver, ether_input or ip_input level. a few comments: 1. i don't think it makes sense to send multiple acks on coalesced segments (and the 82599 does not seem to do that). First of all, the acks would get out with minimal spacing (ideally less than 100ns) so chances are that the remote end will see them in a single burst anyways. Secondly, the remote end can easily tell that a single ACK is reporting multiple MSS and behave as if an equivalent number of acks had arrived. ABC (appropriate byte counting) gets in the way though. 2. i am a big fan of LRO (and similar solutions), because it can save a lot of repeated work when passing packets up the stack, and the mechanism becomes more and more effective as the system load increases, which is a wonderful property in terms of system stability. For this reason, i think it would be useful to add support for software LRO in the generic code (sys/net/if.c) so that drivers c
Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?)
On 08.12.2011 14:11, Lawrence Stewart wrote: On 12/08/11 05:08, Luigi Rizzo wrote: On Wed, Dec 07, 2011 at 11:59:43AM +0100, Andre Oppermann wrote: On 06.12.2011 22:06, Luigi Rizzo wrote: ... Even in my experiments there is a lot of instability in the results. I don't know exactly where the problem is, but the high number of read syscalls, and the huge impact of setting interrupt_rate=0 (defaults at 16us on the ixgbe) makes me think that there is something that needs investigation in the protocol stack. Of course we don't want to optimize specifically for the one-flow-at-10G case, but devising something that makes the system less affected by short timing variations, and can pass upstream interrupt mitigation delays would help. I'm not sure the variance is only coming from the network card and driver side of things. The TCP processing and interactions with scheduler and locking probably play a big role as well. There have been many changes to TCP recently and maybe an inefficiency that affects high-speed single sessions throughput has crept in. That's difficult to debug though. I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which seems slightly faster than HEAD) using MTU=1500 and various combinations of card capabilities (hwcsum,tso,lro), different window sizes and interrupt mitigation configurations. default latency is 16us, l=0 means no interrupt mitigation. lro is the software implementation of lro (tcp_lro.c) hwlro is the hardware one (on 82599). Using a window of 100 Kbytes seems to give the best results. Summary: [snip] - enabling software lro on the transmit side actually slows down the throughput (4-5Gbit/s instead of 8.0). I am not sure why (perhaps acks are delayed too much) ? Adding a couple of lines in tcp_lro to reject pure acks seems to have much better effect. The tcp_lro patch below might actually be useful also for other cards. --- tcp_lro.c (revision 228284) +++ tcp_lro.c (working copy) @@ -245,6 +250,8 @@ ip_len = ntohs(ip->ip_len); tcp_data_len = ip_len - (tcp->th_off<< 2) - sizeof (*ip); + if (tcp_data_len == 0) + return -1; /* not on ack */ /* There is a bug with our LRO implementation (first noticed by Jeff Roberson) that I started fixing some time back but dropped the ball on. The crux of the problem is that we currently only send an ACK for the entire LRO chunk instead of all the segments contained therein. Given that most stacks rely on the ACK clock to keep things ticking over, the current behaviour kills performance. It may well be the cause of the performance loss you have observed. WIP patch is at: http://people.freebsd.org/~lstewart/patches/misctcp/tcplro_multiack_9.x.r219723.patch Jeff tested the WIP patch and it *doesn't* fix the issue. I don't have LRO capable hardware setup locally to figure out what I've missed. Most of the machines in my lab are running em(4) NICs which don't support LRO, but I'll see if I can find something which does and perhaps resurrect this patch. If anyone has any ideas what I'm missing in the patch to make it work, please let me know. On low RTT's the accumulated ACKing probably doesn't make any difference. The congestion window will grow very fast anyway. On longer RTT's it sure will make a difference. Unless you have a 10Gig path with > 50ms or so it's difficult to empirically test though. -- Andre ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
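To make the ABC point in this subthread concrete: RFC 3465 grows cwnd by the number of bytes newly acknowledged, but during slow start the per-ACK increase is capped at a limit L (commonly 2*SMSS). A rough sketch of the rule, purely illustrative and not the FreeBSD congestion control code:

/*
 * Illustrative sketch of RFC 3465 Appropriate Byte Counting; this is
 * not the FreeBSD TCP code.  'acked' is the number of bytes newly
 * acknowledged by one ACK.  A stretch ACK covering several MSS of
 * LRO-coalesced data is credited in full during congestion avoidance,
 * but during slow start the growth is capped at L per ACK.
 */
static void
abc_cwnd_update(u_long *cwnd, u_long *bytes_acked, u_long ssthresh,
    u_long acked, u_long smss)
{
        u_long limit = 2 * smss;        /* "L" from RFC 3465 */

        if (*cwnd <= ssthresh) {
                /* Slow start: grow by min(acked, L) per ACK. */
                *cwnd += (acked < limit) ? acked : limit;
        } else {
                /* Congestion avoidance: one SMSS per cwnd of data ACKed. */
                *bytes_acked += acked;
                if (*bytes_acked >= *cwnd) {
                        *bytes_acked -= *cwnd;
                        *cwnd += smss;
                }
        }
}

So during slow start a single stretch ACK covering 4-5 coalesced MSS only buys 2*SMSS of growth, which matters mostly on longer-RTT paths where slow start dominates the transfer, in line with Andre's remark above.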
Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?)
On Fri, Dec 09, 2011 at 12:11:50AM +1100, Lawrence Stewart wrote:
> On 12/08/11 05:08, Luigi Rizzo wrote:
...
> >I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which
> >seems slightly faster than HEAD) using MTU=1500 and various
> >combinations of card capabilities (hwcsum,tso,lro), different window
> >sizes and interrupt mitigation configurations.
> >
> >default latency is 16us, l=0 means no interrupt mitigation.
> >lro is the software implementation of lro (tcp_lro.c)
> >hwlro is the hardware one (on 82599). Using a window of 100 Kbytes
> >seems to give the best results.
> >
> >Summary:
>
> [snip]
>
> >- enabling software lro on the transmit side actually slows
> >  down the throughput (4-5 Gbit/s instead of 8.0).
> >  I am not sure why (perhaps acks are delayed too much) ?
> >  Adding a couple of lines in tcp_lro to reject
> >  pure acks seems to have much better effect.
> >
> >The tcp_lro patch below might actually be useful also for
> >other cards.
> >
> >--- tcp_lro.c   (revision 228284)
> >+++ tcp_lro.c   (working copy)
> >@@ -245,6 +250,8 @@
> >
> >        ip_len = ntohs(ip->ip_len);
> >        tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
> >+       if (tcp_data_len == 0)
> >+               return -1;      /* not on ack */
> >
> >
> >        /*
>
> There is a bug with our LRO implementation (first noticed by Jeff
> Roberson) that I started fixing some time back but dropped the ball on.
> The crux of the problem is that we currently only send an ACK for the
> entire LRO chunk instead of all the segments contained therein. Given
> that most stacks rely on the ACK clock to keep things ticking over, the
> current behaviour kills performance. It may well be the cause of the
> performance loss you have observed.

I should clarify better. First of all, i tested two different LRO
implementations: our "Software LRO" (tcp_lro.c), and the "Hardware LRO"
which is implemented by the 82599 (called RSC or receive-side-coalescing
in the 82599 data sheets). Jack Vogel and Navdeep Parhar (both in Cc)
can probably comment on the logic of both.

In my tests, either SW or HW LRO on the receive side HELPED A LOT, not
just in terms of raw throughput but also in terms of system load on the
receiver. On the receive side, LRO packs multiple data segments into one
that is passed up the stack. As you mentioned, this also reduces the
number of acks generated, but not dramatically (consider that the LRO is
bounded by the number of segments received in the mitigation interval).
In my tests the number of reads() on the receiver was reduced by approx
a factor of 3 compared to the !LRO case, meaning 4-5 segments merged by
LRO. Navdeep reported similar numbers for cxgbe.

Using Hardware LRO on the transmit side had no ill effect. Being done in
hardware, i have no idea how it is implemented.

Using Software LRO on the transmit side did give a significant
throughput reduction. I can't explain the exact cause, though it is
possible that between reducing the number of segments to the receiver
and collapsing the ACKs that it generates, the sender starves. But it
could well be that it is the extra delay on passing up the ACKs that
limits performance.

Either way, since the HW LRO did a fine job, i was trying to figure out
whether avoiding LRO on pure acks could help, and the two-line patch
above did help. Note, my patch was just a proof-of-concept, and may
cause reordering if a data segment is followed by a pure ack. But this
can be fixed easily, handling a pure ack as an out-of-sequence packet in
tcp_lro_rx().
> WIP patch is at: > http://people.freebsd.org/~lstewart/patches/misctcp/tcplro_multiack_9.x.r219723.patch > > Jeff tested the WIP patch and it *doesn't* fix the issue. I don't have > LRO capable hardware setup locally to figure out what I've missed. Most > of the machines in my lab are running em(4) NICs which don't support > LRO, but I'll see if I can find something which does and perhaps > resurrect this patch. a few comments: 1. i don't think it makes sense to send multiple acks on coalesced segments (and the 82599 does not seem to do that). First of all, the acks would get out with minimal spacing (ideally less than 100ns) so chances are that the remote end will see them in a single burst anyways. Secondly, the remote end can easily tell that a single ACK is reporting multiple MSS and behave as if an equivalent number of acks had arrived. 2. i am a big fan of LRO (and similar solutions), because it can save a lot of repeated work when passing packets up the stack, and the mechanism becomes more and more effective as the system load increases, which is a wonderful property in terms of system stability. For this reason, i think it would be useful to add support for software LRO in the generic code (sys/net/if.c) so that drivers can directly use the software implementation even without hardware suppor
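Regarding the reordering caveat with the proof-of-concept patch (a pure ack passed through while earlier data for the same flow is still queued in the LRO state), a sketch of what "handle a pure ack as an out-of-sequence packet in tcp_lro_rx()" might look like is below. This is a guess at the shape of the fix, not a tested change: lro_matching_entry() is a hypothetical stand-in for the flow lookup that tcp_lro_rx() already performs internally, and lc stands for its lro_ctrl argument.

/*
 * Sketch only (untested, not the actual tcp_lro.c code): when a pure
 * ack arrives for a flow that has data pending in the LRO state, flush
 * that flow first so the ack cannot overtake the data, then decline to
 * coalesce the ack.  lro_matching_entry() is hypothetical.
 */
tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof(*ip);
if (tcp_data_len == 0) {
        struct lro_entry *le;

        le = lro_matching_entry(lc, ip, tcp);   /* hypothetical lookup */
        if (le != NULL) {
                /* Same flow has queued data: push it up first. */
                SLIST_REMOVE(&lc->lro_active, le, lro_entry, next);
                tcp_lro_flush(lc, le);
        }
        return (-1);            /* let the caller input the ack as-is */
}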
Re: datapoints on 10G throughput with TCP ?
On Mon, Dec 05, 2011 at 08:27:03PM +0100, Luigi Rizzo wrote: > Hi, > I am trying to establish the baseline performance for 10G throughput > over TCP, and would like to collect some data points. As a testing > program i am using nuttcp from ports (as good as anything, i > guess -- it is reasonably flexible, and if you use it in > TCP with relatively large writes, the overhead for syscalls > and gettimeofday() shouldn't kill you). > > I'd be very grateful if you could do the following test: > > - have two machines connected by a 10G link > - on one run "nuttcp -S" > - on the other one run "nuttcp -t -T 5 -w 128 -v the.other.ip" > > and send me a dump of the output, such as the one(s) at the end of > the message. > > I am mostly interested in two configurations: > - one over loopback, which should tell how fast is the CPU+memory > As an example, one of my machines does about 15 Gbit/s, and > one of the faster ones does about 44 Gbit/s > > - one over the wire using 1500 byte mss. Here it really matters > how good is the handling of small MTUs. > > As a data point, on my machines i get 2..3.5 Gbit/s on the > "slow" machine with a 1500 byte mtu and default card setting. > Clearing the interrupt mitigation register (so no mitigation) > brings the rate to 5-6 Gbit/s. Same hardware with linux does > about 8 Gbit/s. HEAD seems 10-20% slower than RELENG_8 though i > am not sure who is at fault. > > The receive side is particularly critical - on FreeBSD > the receiver is woken up every two packets (do the math > below, between the number of rx calls and throughput and mss), > resulting in almost 200K activations per second, and despite > the fact that interrupt mitigation is set to a much lower > value (so incoming packets should be batched). > On linux, i see much fewer reads, presumably the process is > woken up only at the end of a burst. 
About relative performance FreeBSD and Linux I wrote in -performance@ at Jan'11 (Interrupt performance) > > EXAMPLES OF OUTPUT -- > > > nuttcp -t -T 5 -w 128 -v 10.0.1.2 > nuttcp-t: v6.1.2: socket > nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.0.1.2 > nuttcp-t: time limit = 5.00 seconds > nuttcp-t: connect to 10.0.1.2 with mss=1460, RTT=0.103 ms > nuttcp-t: send window size = 131400, receive window size = 65700 > nuttcp-t: 3095.0982 MB in 5.00 real seconds = 633785.85 KB/sec = 5191.9737 > Mbps > nuttcp-t: host-retrans = 0 > nuttcp-t: 49522 I/O calls, msec/call = 0.10, calls/sec = 9902.99 > nuttcp-t: 0.0user 2.7sys 0:05real 54% 100i+2639d 752maxrss 0+3pf 258876+6csw > > nuttcp-r: v6.1.2: socket > nuttcp-r: buflen=65536, nstream=1, port=5001 tcp > nuttcp-r: accept from 10.0.1.4 > nuttcp-r: send window size = 33580, receive window size = 131400 > nuttcp-r: 3095.0982 MB in 5.17 real seconds = 613526.42 KB/sec = 5026.0084 > Mbps > nuttcp-r: 1114794 I/O calls, msec/call = 0.00, calls/sec = 215801.03 > nuttcp-r: 0.1user 3.5sys 0:05real 69% 112i+1104d 626maxrss 0+15pf > 507653+188csw > > > > > nuttcp -t -T 5 -w 128 -v localhost > nuttcp-t: v6.1.2: socket > nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> localhost > nuttcp-t: time limit = 5.00 seconds > nuttcp-t: connect to 127.0.0.1 with mss=14336, RTT=0.051 ms > nuttcp-t: send window size = 143360, receive window size = 71680 > nuttcp-t: 26963.4375 MB in 5.00 real seconds = 5521440.59 KB/sec = 45231.6413 > Mbps > nuttcp-t: host-retrans = 0 > nuttcp-t: 431415 I/O calls, msec/call = 0.01, calls/sec = 86272.51 > nuttcp-t: 0.0user 4.6sys 0:05real 93% 102i+2681d 774maxrss 0+3pf 2510+1csw > > nuttcp-r: v6.1.2: socket > nuttcp-r: buflen=65536, nstream=1, port=5001 tcp > nuttcp-r: accept from 127.0.0.1 > nuttcp-r: send window size = 43008, receive window size = 143360 > nuttcp-r: 26963.4375 MB in 5.20 real seconds = 5313135.74 KB/sec = 43525.2080 > Mbps > nuttcp-r: 767807 I/O calls, msec/call = 0.01, calls/sec = 147750.09 > nuttcp-r: 0.1user 3.9sys 0:05real 79% 98i+2570d 772maxrss 0+16pf 311014+8csw > > > on the server, run " > ___ > freebsd-current@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-current > To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org" ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
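For the wire example above, the "woken up every two packets" remark can be checked directly from the numbers nuttcp reports: 5026 Mbit/s is about 628 Mbyte/s of payload, and the receiver makes roughly 215800 read() calls per second, so each read returns about 628e6 / 215800 ~= 2900 bytes, i.e. almost exactly two 1460-byte segments.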
Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?)
On 12/08/11 05:08, Luigi Rizzo wrote: On Wed, Dec 07, 2011 at 11:59:43AM +0100, Andre Oppermann wrote: On 06.12.2011 22:06, Luigi Rizzo wrote: ... Even in my experiments there is a lot of instability in the results. I don't know exactly where the problem is, but the high number of read syscalls, and the huge impact of setting interrupt_rate=0 (defaults at 16us on the ixgbe) makes me think that there is something that needs investigation in the protocol stack. Of course we don't want to optimize specifically for the one-flow-at-10G case, but devising something that makes the system less affected by short timing variations, and can pass upstream interrupt mitigation delays would help. I'm not sure the variance is only coming from the network card and driver side of things. The TCP processing and interactions with scheduler and locking probably play a big role as well. There have been many changes to TCP recently and maybe an inefficiency that affects high-speed single sessions throughput has crept in. That's difficult to debug though. I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which seems slightly faster than HEAD) using MTU=1500 and various combinations of card capabilities (hwcsum,tso,lro), different window sizes and interrupt mitigation configurations. default latency is 16us, l=0 means no interrupt mitigation. lro is the software implementation of lro (tcp_lro.c) hwlro is the hardware one (on 82599). Using a window of 100 Kbytes seems to give the best results. Summary: [snip] - enabling software lro on the transmit side actually slows down the throughput (4-5Gbit/s instead of 8.0). I am not sure why (perhaps acks are delayed too much) ? Adding a couple of lines in tcp_lro to reject pure acks seems to have much better effect. The tcp_lro patch below might actually be useful also for other cards. --- tcp_lro.c (revision 228284) +++ tcp_lro.c (working copy) @@ -245,6 +250,8 @@ ip_len = ntohs(ip->ip_len); tcp_data_len = ip_len - (tcp->th_off<< 2) - sizeof (*ip); + if (tcp_data_len == 0) + return -1; /* not on ack */ /* There is a bug with our LRO implementation (first noticed by Jeff Roberson) that I started fixing some time back but dropped the ball on. The crux of the problem is that we currently only send an ACK for the entire LRO chunk instead of all the segments contained therein. Given that most stacks rely on the ACK clock to keep things ticking over, the current behaviour kills performance. It may well be the cause of the performance loss you have observed. WIP patch is at: http://people.freebsd.org/~lstewart/patches/misctcp/tcplro_multiack_9.x.r219723.patch Jeff tested the WIP patch and it *doesn't* fix the issue. I don't have LRO capable hardware setup locally to figure out what I've missed. Most of the machines in my lab are running em(4) NICs which don't support LRO, but I'll see if I can find something which does and perhaps resurrect this patch. If anyone has any ideas what I'm missing in the patch to make it work, please let me know. Cheers, Lawrence ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
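As a back-of-the-envelope illustration of the crux described above (one ACK for the entire LRO chunk instead of one per original segment), the number of ACK opportunities the receiver gives up per chunk is simply the chunk's segment count. The snippet below is only that arithmetic; it is not related to the linked WIP patch.

/*
 * Illustration only, not the WIP patch linked above: a coalesced chunk
 * of tlen payload bytes built from mss-sized segments stands for about
 * ceil(tlen / mss) original segments, so an "ACK per segment" policy
 * would emit that many ACKs where the current code emits one.
 */
static int
acks_for_chunk(int tlen, int mss)
{
        return ((tlen + mss - 1) / mss);
}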
Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?)
On Thu, Dec 08, 2011 at 12:06:26PM +0200, Daniel Kalchev wrote: > > > On 07.12.11 22:23, Luigi Rizzo wrote: > > > >Sorry, forgot to mention that the above is with TSO DISABLED > >(which is not the default). TSO seems to have a very bad > >interaction with HWCSUM and non-zero mitigation. > > I have this on both sender and receiver > > # ifconfig ix1 > ix1: flags=8843 metric 0 mtu 1500 > > options=4bb > ether 00:25:90:35:22:f1 > inet 10.2.101.11 netmask 0xff00 broadcast 10.2.101.255 > media: Ethernet autoselect (autoselect ) > status: active > > without LRO on either end > > # nuttcp -t -T 5 -w 128 -v 10.2.101.11 > nuttcp-t: v6.1.2: socket > nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11 > nuttcp-t: time limit = 5.00 seconds > nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.051 ms > nuttcp-t: send window size = 131768, receive window size = 66608 > nuttcp-t: 1802.4049 MB in 5.06 real seconds = 365077.76 KB/sec = > 2990.7170 Mbps > nuttcp-t: host-retrans = 0 > nuttcp-t: 28839 I/O calls, msec/call = 0.18, calls/sec = 5704.44 > nuttcp-t: 0.0user 4.5sys 0:05real 90% 108i+1459d 630maxrss 0+2pf 87706+1csw > > nuttcp-r: v6.1.2: socket > nuttcp-r: buflen=65536, nstream=1, port=5001 tcp > nuttcp-r: accept from 10.2.101.12 > nuttcp-r: send window size = 33304, receive window size = 131768 > nuttcp-r: 1802.4049 MB in 5.18 real seconds = 356247.49 KB/sec = > 2918.3794 Mbps > nuttcp-r: 529295 I/O calls, msec/call = 0.01, calls/sec = 102163.86 > nuttcp-r: 0.1user 3.7sys 0:05real 73% 116i+1567d 618maxrss 0+15pf > 230404+0csw > > with LRO on receiver > > # nuttcp -t -T 5 -w 128 -v 10.2.101.11 > nuttcp-t: v6.1.2: socket > nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11 > nuttcp-t: time limit = 5.00 seconds > nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.067 ms > nuttcp-t: send window size = 131768, receive window size = 66608 > nuttcp-t: 2420.5000 MB in 5.02 real seconds = 493701.04 KB/sec = > 4044.3989 Mbps > nuttcp-t: host-retrans = 2 > nuttcp-t: 38728 I/O calls, msec/call = 0.13, calls/sec = 7714.08 > nuttcp-t: 0.0user 4.1sys 0:05real 83% 107i+1436d 630maxrss 0+2pf 4896+0csw > > nuttcp-r: v6.1.2: socket > nuttcp-r: buflen=65536, nstream=1, port=5001 tcp > nuttcp-r: accept from 10.2.101.12 > nuttcp-r: send window size = 33304, receive window size = 131768 > nuttcp-r: 2420.5000 MB in 5.15 real seconds = 481679.37 KB/sec = > 3945.9174 Mbps > nuttcp-r: 242266 I/O calls, msec/call = 0.02, calls/sec = 47080.98 > nuttcp-r: 0.0user 2.4sys 0:05real 49% 112i+1502d 618maxrss 0+15pf > 156333+0csw > > About 1/4 improvement... 
> > With LRO on both sender and receiver > > # nuttcp -t -T 5 -w 128 -v 10.2.101.11 > nuttcp-t: v6.1.2: socket > nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11 > nuttcp-t: time limit = 5.00 seconds > nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.049 ms > nuttcp-t: send window size = 131768, receive window size = 66608 > nuttcp-t: 2585.7500 MB in 5.02 real seconds = 527402.83 KB/sec = > 4320.4840 Mbps > nuttcp-t: host-retrans = 1 > nuttcp-t: 41372 I/O calls, msec/call = 0.12, calls/sec = 8240.67 > nuttcp-t: 0.0user 4.6sys 0:05real 93% 106i+1421d 630maxrss 0+2pf 4286+0csw > > nuttcp-r: v6.1.2: socket > nuttcp-r: buflen=65536, nstream=1, port=5001 tcp > nuttcp-r: accept from 10.2.101.12 > nuttcp-r: send window size = 33304, receive window size = 131768 > nuttcp-r: 2585.7500 MB in 5.15 real seconds = 514585.31 KB/sec = > 4215.4829 Mbps > nuttcp-r: 282820 I/O calls, msec/call = 0.02, calls/sec = 54964.34 > nuttcp-r: 0.0user 2.7sys 0:05real 55% 114i+1540d 618maxrss 0+15pf > 188794+147csw > > Even better... > > With LRO on sender only: > > # nuttcp -t -T 5 -w 128 -v 10.2.101.11 > nuttcp-t: v6.1.2: socket > nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11 > nuttcp-t: time limit = 5.00 seconds > nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.054 ms > nuttcp-t: send window size = 131768, receive window size = 66608 > nuttcp-t: 2077.5437 MB in 5.02 real seconds = 423740.81 KB/sec = > 3471.2847 Mbps > nuttcp-t: host-retrans = 0 > nuttcp-t: 33241 I/O calls, msec/call = 0.15, calls/sec = 6621.01 > nuttcp-t: 0.0user 4.5sys 0:05real 92% 109i+1468d 630maxrss 0+2pf 49532+25csw > > nuttcp-r: v6.1.2: socket > nuttcp-r: buflen=65536, nstream=1, port=5001 tcp > nuttcp-r: accept from 10.2.101.12 > nuttcp-r: send window size = 33304, receive window size = 131768 > nuttcp-r: 2077.5437 MB in 5.15 real seconds = 413415.33 KB/sec = > 3386.6984 Mbps > nuttcp-r: 531979 I/O calls, msec/call = 0.01, calls/sec = 103378.67 > nuttcp-r: 0.0user 4.5sys 0:05real 88% 110i+1474d 618maxrss 0+15pf > 117367+0csw > > > >also remember that hw.ixgbe.max_interrupt_rate has only > >effect at module load -- i.e. you set it with the bootloader, > >or with kenv before loading the module. > > I have this in /boot/loader.conf > > kern.ipc.nmbclusters=512000 > hw.ixgbe.max_interrupt_rate=0 > > o
Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?)
On 07.12.11 22:23, Luigi Rizzo wrote: Sorry, forgot to mention that the above is with TSO DISABLED (which is not the default). TSO seems to have a very bad interaction with HWCSUM and non-zero mitigation. I have this on both sender and receiver # ifconfig ix1 ix1: flags=8843 metric 0 mtu 1500 options=4bb ether 00:25:90:35:22:f1 inet 10.2.101.11 netmask 0xff00 broadcast 10.2.101.255 media: Ethernet autoselect (autoselect ) status: active without LRO on either end # nuttcp -t -T 5 -w 128 -v 10.2.101.11 nuttcp-t: v6.1.2: socket nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11 nuttcp-t: time limit = 5.00 seconds nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.051 ms nuttcp-t: send window size = 131768, receive window size = 66608 nuttcp-t: 1802.4049 MB in 5.06 real seconds = 365077.76 KB/sec = 2990.7170 Mbps nuttcp-t: host-retrans = 0 nuttcp-t: 28839 I/O calls, msec/call = 0.18, calls/sec = 5704.44 nuttcp-t: 0.0user 4.5sys 0:05real 90% 108i+1459d 630maxrss 0+2pf 87706+1csw nuttcp-r: v6.1.2: socket nuttcp-r: buflen=65536, nstream=1, port=5001 tcp nuttcp-r: accept from 10.2.101.12 nuttcp-r: send window size = 33304, receive window size = 131768 nuttcp-r: 1802.4049 MB in 5.18 real seconds = 356247.49 KB/sec = 2918.3794 Mbps nuttcp-r: 529295 I/O calls, msec/call = 0.01, calls/sec = 102163.86 nuttcp-r: 0.1user 3.7sys 0:05real 73% 116i+1567d 618maxrss 0+15pf 230404+0csw with LRO on receiver # nuttcp -t -T 5 -w 128 -v 10.2.101.11 nuttcp-t: v6.1.2: socket nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11 nuttcp-t: time limit = 5.00 seconds nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.067 ms nuttcp-t: send window size = 131768, receive window size = 66608 nuttcp-t: 2420.5000 MB in 5.02 real seconds = 493701.04 KB/sec = 4044.3989 Mbps nuttcp-t: host-retrans = 2 nuttcp-t: 38728 I/O calls, msec/call = 0.13, calls/sec = 7714.08 nuttcp-t: 0.0user 4.1sys 0:05real 83% 107i+1436d 630maxrss 0+2pf 4896+0csw nuttcp-r: v6.1.2: socket nuttcp-r: buflen=65536, nstream=1, port=5001 tcp nuttcp-r: accept from 10.2.101.12 nuttcp-r: send window size = 33304, receive window size = 131768 nuttcp-r: 2420.5000 MB in 5.15 real seconds = 481679.37 KB/sec = 3945.9174 Mbps nuttcp-r: 242266 I/O calls, msec/call = 0.02, calls/sec = 47080.98 nuttcp-r: 0.0user 2.4sys 0:05real 49% 112i+1502d 618maxrss 0+15pf 156333+0csw About 1/4 improvement... With LRO on both sender and receiver # nuttcp -t -T 5 -w 128 -v 10.2.101.11 nuttcp-t: v6.1.2: socket nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11 nuttcp-t: time limit = 5.00 seconds nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.049 ms nuttcp-t: send window size = 131768, receive window size = 66608 nuttcp-t: 2585.7500 MB in 5.02 real seconds = 527402.83 KB/sec = 4320.4840 Mbps nuttcp-t: host-retrans = 1 nuttcp-t: 41372 I/O calls, msec/call = 0.12, calls/sec = 8240.67 nuttcp-t: 0.0user 4.6sys 0:05real 93% 106i+1421d 630maxrss 0+2pf 4286+0csw nuttcp-r: v6.1.2: socket nuttcp-r: buflen=65536, nstream=1, port=5001 tcp nuttcp-r: accept from 10.2.101.12 nuttcp-r: send window size = 33304, receive window size = 131768 nuttcp-r: 2585.7500 MB in 5.15 real seconds = 514585.31 KB/sec = 4215.4829 Mbps nuttcp-r: 282820 I/O calls, msec/call = 0.02, calls/sec = 54964.34 nuttcp-r: 0.0user 2.7sys 0:05real 55% 114i+1540d 618maxrss 0+15pf 188794+147csw Even better... 
With LRO on sender only: # nuttcp -t -T 5 -w 128 -v 10.2.101.11 nuttcp-t: v6.1.2: socket nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11 nuttcp-t: time limit = 5.00 seconds nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.054 ms nuttcp-t: send window size = 131768, receive window size = 66608 nuttcp-t: 2077.5437 MB in 5.02 real seconds = 423740.81 KB/sec = 3471.2847 Mbps nuttcp-t: host-retrans = 0 nuttcp-t: 33241 I/O calls, msec/call = 0.15, calls/sec = 6621.01 nuttcp-t: 0.0user 4.5sys 0:05real 92% 109i+1468d 630maxrss 0+2pf 49532+25csw nuttcp-r: v6.1.2: socket nuttcp-r: buflen=65536, nstream=1, port=5001 tcp nuttcp-r: accept from 10.2.101.12 nuttcp-r: send window size = 33304, receive window size = 131768 nuttcp-r: 2077.5437 MB in 5.15 real seconds = 413415.33 KB/sec = 3386.6984 Mbps nuttcp-r: 531979 I/O calls, msec/call = 0.01, calls/sec = 103378.67 nuttcp-r: 0.0user 4.5sys 0:05real 88% 110i+1474d 618maxrss 0+15pf 117367+0csw also remember that hw.ixgbe.max_interrupt_rate has only effect at module load -- i.e. you set it with the bootloader, or with kenv before loading the module. I have this in /boot/loader.conf kern.ipc.nmbclusters=512000 hw.ixgbe.max_interrupt_rate=0 on both sender and receiver. Please retry the measurements disabling tso (on both sides, but it really matters only on the sender). Also, LRO requires HWCSUM. How do I set HWCSUM? Is this different from RXCSUM/TXCSUM? Still I get nowhere near what you get on my hardware... Here is what pciconf -vlbc has to
Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?)
On Wed, Dec 07, 2011 at 09:58:31PM +0200, Daniel Kalchev wrote: > > On Dec 7, 2011, at 8:08 PM, Luigi Rizzo wrote: > > > Summary: > > > > - with default interrupt mitigation, the fastest configuration > > is with checksums enabled on both sender and receiver, lro > > enabled on the receiver. This gets about 8.0 Gbit/s > > I do not observe this. With defaults: > ... Sorry, forgot to mention that the above is with TSO DISABLED (which is not the default). TSO seems to have a very bad interaction with HWCSUM and non-zero mitigation. also remember that hw.ixgbe.max_interrupt_rate has only effect at module load -- i.e. you set it with the bootloader, or with kenv before loading the module. Please retry the measurements disabling tso (on both sides, but it really matters only on the sender). Also, LRO requires HWCSUM. cheers luigi > # nuttcp -t -T 5 -w 128 -v 10.2.101.11 > nuttcp-t: v6.1.2: socket > nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11 > nuttcp-t: time limit = 5.00 seconds > nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.053 ms > nuttcp-t: send window size = 131768, receive window size = 66608 > nuttcp-t: 1857.4978 MB in 5.02 real seconds = 378856.02 KB/sec = 3103.5885 > Mbps > nuttcp-t: host-retrans = 0 > nuttcp-t: 29720 I/O calls, msec/call = 0.17, calls/sec = 5919.63 > nuttcp-t: 0.0user 2.5sys 0:05real 52% 115i+1544d 630maxrss 0+2pf 107264+1csw > > nuttcp-r: v6.1.2: socket > nuttcp-r: buflen=65536, nstream=1, port=5001 tcp > nuttcp-r: accept from 10.2.101.12 > nuttcp-r: send window size = 33304, receive window size = 131768 > nuttcp-r: 1857.4978 MB in 5.15 real seconds = 369617.39 KB/sec = 3027.9057 > Mbps > nuttcp-r: 543991 I/O calls, msec/call = 0.01, calls/sec = 105709.95 > nuttcp-r: 0.1user 4.1sys 0:05real 83% 110i+1482d 618maxrss 0+15pf 158432+0csw > > On receiver: > > ifconfig ix1 lro > > # nuttcp -t -T 5 -w 128 -v 10.2.101.11 > nuttcp-t: v6.1.2: socket > nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11 > nuttcp-t: time limit = 5.00 seconds > nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.068 ms > nuttcp-t: send window size = 131768, receive window size = 66608 > nuttcp-t: 1673.3125 MB in 5.02 real seconds = 341312.36 KB/sec = 2796.0308 > Mbps > nuttcp-t: host-retrans = 1 > nuttcp-t: 26773 I/O calls, msec/call = 0.19, calls/sec = 5333.01 > nuttcp-t: 0.0user 1.0sys 0:05real 21% 113i+1518d 630maxrss 0+2pf 12772+1csw > > nuttcp-r: v6.1.2: socket > nuttcp-r: buflen=65536, nstream=1, port=5001 tcp > nuttcp-r: accept from 10.2.101.12 > nuttcp-r: send window size = 33304, receive window size = 131768 > nuttcp-r: 1673.3125 MB in 5.15 real seconds = 332975.19 KB/sec = 2727.7327 > Mbps > nuttcp-r: 106268 I/O calls, msec/call = 0.05, calls/sec = 20650.82 > nuttcp-r: 0.0user 1.3sys 0:05real 28% 101i+1354d 618maxrss 0+15pf 64567+0csw > > On sender: > > ifconfig ix1 lro > > (now both receiver and sender have LRO enabled) > > # nuttcp -t -T 5 -w 128 -v 10.2.101.11 > nuttcp-t: v6.1.2: socket > nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11 > nuttcp-t: time limit = 5.00 seconds > nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.063 ms > nuttcp-t: send window size = 131768, receive window size = 66608 > nuttcp-t: 1611.7805 MB in 5.02 real seconds = 328716.18 KB/sec = 2692.8430 > Mbps > nuttcp-t: host-retrans = 1 > nuttcp-t: 25789 I/O calls, msec/call = 0.20, calls/sec = 5136.29 > nuttcp-t: 0.0user 1.0sys 0:05real 21% 109i+1465d 630maxrss 0+2pf 12697+0csw > > nuttcp-r: v6.1.2: socket > nuttcp-r: buflen=65536, nstream=1, port=5001 
tcp > nuttcp-r: accept from 10.2.101.12 > nuttcp-r: send window size = 33304, receive window size = 131768 > nuttcp-r: 1611.7805 MB in 5.15 real seconds = 320694.82 KB/sec = 2627.1319 > Mbps > nuttcp-r: 104346 I/O calls, msec/call = 0.05, calls/sec = 20275.05 > nuttcp-r: 0.0user 1.3sys 0:05real 27% 113i+1516d 618maxrss 0+15pf 63510+0csw > > remove LRO from receiver (only sender has LRO): > > # nuttcp -t -T 5 -w 128 -v 10.2.101.11 > nuttcp-t: v6.1.2: socket > nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11 > nuttcp-t: time limit = 5.00 seconds > nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.065 ms > nuttcp-t: send window size = 131768, receive window size = 66608 > nuttcp-t: 1884.8702 MB in 5.02 real seconds = 384464.57 KB/sec = 3149.5338 > Mbps > nuttcp-t: host-retrans = 0 > nuttcp-t: 30158 I/O calls, msec/call = 0.17, calls/sec = 6007.27 > nuttcp-t: 0.0user 2.7sys 0:05real 55% 104i+1403d 630maxrss 0+2pf 106046+0csw > > nuttcp-r: v6.1.2: socket > nuttcp-r: buflen=65536, nstream=1, port=5001 tcp > nuttcp-r: accept from 10.2.101.12 > nuttcp-r: send window size = 33304, receive window size = 131768 > nuttcp-r: 1884.8702 MB in 5.15 real seconds = 375093.52 KB/sec = 3072.7661 > Mbps > nuttcp-r: 540237 I/O calls, msec/call = 0.01, calls/sec = 104988.68 > nuttcp-r: 0.1user 4.2sys 0:05real 84% 110i+1483d 618maxrss 0+15pf 156340+0csw > > Strange enough, setting hw.ixgbe.max_
Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?)
On Dec 7, 2011, at 8:08 PM, Luigi Rizzo wrote: > Summary: > > - with default interrupt mitigation, the fastest configuration > is with checksums enabled on both sender and receiver, lro > enabled on the receiver. This gets about 8.0 Gbit/s I do not observe this. With defaults: # nuttcp -t -T 5 -w 128 -v 10.2.101.11 nuttcp-t: v6.1.2: socket nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11 nuttcp-t: time limit = 5.00 seconds nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.053 ms nuttcp-t: send window size = 131768, receive window size = 66608 nuttcp-t: 1857.4978 MB in 5.02 real seconds = 378856.02 KB/sec = 3103.5885 Mbps nuttcp-t: host-retrans = 0 nuttcp-t: 29720 I/O calls, msec/call = 0.17, calls/sec = 5919.63 nuttcp-t: 0.0user 2.5sys 0:05real 52% 115i+1544d 630maxrss 0+2pf 107264+1csw nuttcp-r: v6.1.2: socket nuttcp-r: buflen=65536, nstream=1, port=5001 tcp nuttcp-r: accept from 10.2.101.12 nuttcp-r: send window size = 33304, receive window size = 131768 nuttcp-r: 1857.4978 MB in 5.15 real seconds = 369617.39 KB/sec = 3027.9057 Mbps nuttcp-r: 543991 I/O calls, msec/call = 0.01, calls/sec = 105709.95 nuttcp-r: 0.1user 4.1sys 0:05real 83% 110i+1482d 618maxrss 0+15pf 158432+0csw On receiver: ifconfig ix1 lro # nuttcp -t -T 5 -w 128 -v 10.2.101.11 nuttcp-t: v6.1.2: socket nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11 nuttcp-t: time limit = 5.00 seconds nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.068 ms nuttcp-t: send window size = 131768, receive window size = 66608 nuttcp-t: 1673.3125 MB in 5.02 real seconds = 341312.36 KB/sec = 2796.0308 Mbps nuttcp-t: host-retrans = 1 nuttcp-t: 26773 I/O calls, msec/call = 0.19, calls/sec = 5333.01 nuttcp-t: 0.0user 1.0sys 0:05real 21% 113i+1518d 630maxrss 0+2pf 12772+1csw nuttcp-r: v6.1.2: socket nuttcp-r: buflen=65536, nstream=1, port=5001 tcp nuttcp-r: accept from 10.2.101.12 nuttcp-r: send window size = 33304, receive window size = 131768 nuttcp-r: 1673.3125 MB in 5.15 real seconds = 332975.19 KB/sec = 2727.7327 Mbps nuttcp-r: 106268 I/O calls, msec/call = 0.05, calls/sec = 20650.82 nuttcp-r: 0.0user 1.3sys 0:05real 28% 101i+1354d 618maxrss 0+15pf 64567+0csw On sender: ifconfig ix1 lro (now both receiver and sender have LRO enabled) # nuttcp -t -T 5 -w 128 -v 10.2.101.11 nuttcp-t: v6.1.2: socket nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11 nuttcp-t: time limit = 5.00 seconds nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.063 ms nuttcp-t: send window size = 131768, receive window size = 66608 nuttcp-t: 1611.7805 MB in 5.02 real seconds = 328716.18 KB/sec = 2692.8430 Mbps nuttcp-t: host-retrans = 1 nuttcp-t: 25789 I/O calls, msec/call = 0.20, calls/sec = 5136.29 nuttcp-t: 0.0user 1.0sys 0:05real 21% 109i+1465d 630maxrss 0+2pf 12697+0csw nuttcp-r: v6.1.2: socket nuttcp-r: buflen=65536, nstream=1, port=5001 tcp nuttcp-r: accept from 10.2.101.12 nuttcp-r: send window size = 33304, receive window size = 131768 nuttcp-r: 1611.7805 MB in 5.15 real seconds = 320694.82 KB/sec = 2627.1319 Mbps nuttcp-r: 104346 I/O calls, msec/call = 0.05, calls/sec = 20275.05 nuttcp-r: 0.0user 1.3sys 0:05real 27% 113i+1516d 618maxrss 0+15pf 63510+0csw remove LRO from receiver (only sender has LRO): # nuttcp -t -T 5 -w 128 -v 10.2.101.11 nuttcp-t: v6.1.2: socket nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11 nuttcp-t: time limit = 5.00 seconds nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.065 ms nuttcp-t: send window size = 131768, receive window size = 66608 nuttcp-t: 1884.8702 MB 
in 5.02 real seconds = 384464.57 KB/sec = 3149.5338 Mbps nuttcp-t: host-retrans = 0 nuttcp-t: 30158 I/O calls, msec/call = 0.17, calls/sec = 6007.27 nuttcp-t: 0.0user 2.7sys 0:05real 55% 104i+1403d 630maxrss 0+2pf 106046+0csw nuttcp-r: v6.1.2: socket nuttcp-r: buflen=65536, nstream=1, port=5001 tcp nuttcp-r: accept from 10.2.101.12 nuttcp-r: send window size = 33304, receive window size = 131768 nuttcp-r: 1884.8702 MB in 5.15 real seconds = 375093.52 KB/sec = 3072.7661 Mbps nuttcp-r: 540237 I/O calls, msec/call = 0.01, calls/sec = 104988.68 nuttcp-r: 0.1user 4.2sys 0:05real 84% 110i+1483d 618maxrss 0+15pf 156340+0csw Strange enough, setting hw.ixgbe.max_interrupt_rate=0 does not have any effect.. Daniel ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?)
On Wed, Dec 07, 2011 at 11:59:43AM +0100, Andre Oppermann wrote:
> On 06.12.2011 22:06, Luigi Rizzo wrote:
...
> >Even in my experiments there is a lot of instability in the results.
> >I don't know exactly where the problem is, but the high number of
> >read syscalls, and the huge impact of setting interrupt_rate=0
> >(defaults at 16us on the ixgbe) makes me think that there is something
> >that needs investigation in the protocol stack.
> >
> >Of course we don't want to optimize specifically for the one-flow-at-10G
> >case, but devising something that makes the system less affected
> >by short timing variations, and can pass upstream interrupt mitigation
> >delays would help.
>
> I'm not sure the variance is only coming from the network card and
> driver side of things. The TCP processing and interactions with
> scheduler and locking probably play a big role as well. There have
> been many changes to TCP recently and maybe an inefficiency that
> affects high-speed single sessions throughput has crept in. That's
> difficult to debug though.

I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which seems
slightly faster than HEAD) using MTU=1500 and various combinations of
card capabilities (hwcsum, tso, lro), different window sizes and
interrupt mitigation configurations.

default latency is 16us, l=0 means no interrupt mitigation.
lro is the software implementation of lro (tcp_lro.c)
hwlro is the hardware one (on 82599). Using a window of 100 Kbytes
seems to give the best results.

Summary:

- with default interrupt mitigation, the fastest configuration
  is with checksums enabled on both sender and receiver, lro
  enabled on the receiver. This gets about 8.0 Gbit/s

- lro is especially good because it packs data packets together,
  passing mitigation upstream and removing duplicate work in the
  ip and tcp stack.

- disabling LRO on the receiver brings performance to 6.5 Gbit/s.
  Also it increases the CPU load (also in userspace).

- disabling checksums on the sender reduces transmit speed to 5.5 Gbit/s

- checksums disabled on both sides (and no LRO on the receiver)
  go down to 4.8 Gbit/s

- I could not try the receive side without checksum but with lro

- with default interrupt mitigation, setting both HWCSUM and TSO
  on the sender is really disruptive. Depending on lro settings
  on the receiver i get 1.5 to 3.2 Gbit/s and huge variance

- Using both hwcsum and tso seems to work fine if you disable
  interrupt mitigation (reaching a peak of 9.4 Gbit/s).

- enabling software lro on the transmit side actually slows down
  the throughput (4-5 Gbit/s instead of 8.0). I am not sure why
  (perhaps acks are delayed too much) ? Adding a couple of lines
  in tcp_lro to reject pure acks seems to have much better effect.

The tcp_lro patch below might actually be useful also for other cards.

--- tcp_lro.c   (revision 228284)
+++ tcp_lro.c   (working copy)
@@ -245,6 +250,8 @@

        ip_len = ntohs(ip->ip_len);
        tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
+       if (tcp_data_len == 0)
+               return -1;      /* not on ack */


        /*

cheers
luigi
Re: datapoints on 10G throughput with TCP ?
On 06.12.2011 22:06, Luigi Rizzo wrote:

On Tue, Dec 06, 2011 at 07:40:21PM +0200, Daniel Kalchev wrote:

I see significant difference between number of interrupts on the Intel
and the AMD blades. When performing a test between the Intel and AMD
blades, the Intel blade generates 20,000-35,000 interrupts, while the
AMD blade generates under 1,000 interrupts.

Even in my experiments there is a lot of instability in the results.
I don't know exactly where the problem is, but the high number of
read syscalls, and the huge impact of setting interrupt_rate=0
(defaults at 16us on the ixgbe) makes me think that there is something
that needs investigation in the protocol stack.

Of course we don't want to optimize specifically for the one-flow-at-10G
case, but devising something that makes the system less affected
by short timing variations, and can pass upstream interrupt mitigation
delays would help.

I'm not sure the variance is only coming from the network card and
driver side of things. The TCP processing and interactions with
scheduler and locking probably play a big role as well. There have
been many changes to TCP recently and maybe an inefficiency that
affects high-speed single-session throughput has crept in. That's
difficult to debug though.

--
Andre
Re: datapoints on 10G throughput with TCP ?
On 07/12/2011, at 24:54, Daniel Kalchev wrote:
> It seems performance measurements are more dependent on the server (nuttcp
> -S) machine.
> We will have to rule out the interrupt storms first of course, any advice?

You can control the storm threshold by setting the hw.intr_storm_threshold sysctl.

--
Daniel O'Connor software and network engineer
for Genesis Software - http://www.gsoft.com.au
"The nice thing about standards is that there
are so many of them to choose from."
  -- Andrew Tanenbaum
GPG Fingerprint - 5596 B766 97C0 0E94 4347 295E E593 DC20 7B3F CE8C
Re: datapoints on 10G throughput with TCP ?
On Tue, Dec 06, 2011 at 07:40:21PM +0200, Daniel Kalchev wrote:
> I see significant difference between number of interrupts on the Intel and
> the AMD blades. When performing a test between the Intel and AMD blades, the
> Intel blade generates 20,000-35,000 interrupts, while the AMD blade generates
> under 1,000 interrupts.

Even in my experiments there is a lot of instability in the results.
I don't know exactly where the problem is, but the high number of
read syscalls, and the huge impact of setting interrupt_rate=0
(defaults at 16us on the ixgbe) makes me think that there is something
that needs investigation in the protocol stack.

Of course we don't want to optimize specifically for the one-flow-at-10G
case, but devising something that makes the system less affected
by short timing variations, and can pass upstream interrupt mitigation
delays would help.

I don't have a solution yet.

cheers
luigi
Re: datapoints on 10G throughput with TCP ?
I see significant difference between number of interrupts on the Intel
and the AMD blades. When performing a test between the Intel and AMD
blades, the Intel blade generates 20,000-35,000 interrupts, while the
AMD blade generates under 1,000 interrupts.

There is no longer throttling, but the performance does not improve.
I set it via

  sysctl hw.intr_storm_threshold=0

Should this go to /boot/loader.conf instead?

Daniel

On Dec 6, 2011, at 7:21 PM, Jack Vogel wrote:
> Set the storm threshold to 0, that will disable it, its going to throttle
> your performance
> when it happens.
>
> Jack
Re: datapoints on 10G throughput with TCP ?
Set the storm threshold to 0, that will disable it, its going to throttle your performance when it happens. Jack On Tue, Dec 6, 2011 at 6:24 AM, Daniel Kalchev wrote: > Some tests with updated FreeBSD to 8-stable as of today, compared with the > previous run > > > > On 06.12.11 13:18, Daniel Kalchev wrote: > >> >> FreeBSD 8.2-STABLE #0: Wed Sep 28 11:23:59 EEST 2011 >> CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz (2403.58-MHz >> K8-class CPU) >> real memory = 51539607552 (49152 MB) >> blade 1: >> >> # nuttcp -S >> # nuttcp -t -T 5 -w 128 -v localhost >> nuttcp-t: v6.1.2: socket >> nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> localhost >> nuttcp-t: time limit = 5.00 seconds >> nuttcp-t: connect to 127.0.0.1 with mss=14336, RTT=0.044 ms >> nuttcp-t: send window size = 143360, receive window size = 71680 >> nuttcp-t: 8959.8750 MB in 5.02 real seconds = 1827635.67 KB/sec = >> 14971.9914 Mbps >> nuttcp-t: host-retrans = 0 >> nuttcp-t: 143358 I/O calls, msec/call = 0.04, calls/sec = 28556.81 >> nuttcp-t: 0.0user 4.9sys 0:05real 99% 106i+1428d 602maxrss 0+5pf 16+46csw >> >> nuttcp-r: v6.1.2: socket >> nuttcp-r: buflen=65536, nstream=1, port=5001 tcp >> nuttcp-r: accept from 127.0.0.1 >> nuttcp-r: send window size = 43008, receive window size = 143360 >> nuttcp-r: 8959.8750 MB in 5.17 real seconds = 1773171.07 KB/sec = >> 14525.8174 Mbps >> nuttcp-r: 219708 I/O calls, msec/call = 0.02, calls/sec = 42461.43 >> nuttcp-r: 0.0user 3.8sys 0:05real 76% 105i+1407d 614maxrss 1+17pf >> 95059+22csw >> > > New results: > > FreeBSD 8.2-STABLE #1: Tue Dec 6 13:51:01 EET 2011 > > > > # nuttcp -t -T 5 -w 128 -v localhost > nuttcp-t: v6.1.2: socket > nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> localhost > nuttcp-t: time limit = 5.00 seconds > nuttcp-t: connect to 127.0.0.1 with mss=14336, RTT=0.030 ms > > nuttcp-t: send window size = 143360, receive window size = 71680 > nuttcp-t: 12748.0625 MB in 5.02 real seconds = 2599947.38 KB/sec = > 21298.7689 Mbps > nuttcp-t: host-retrans = 0 > nuttcp-t: 203969 I/O calls, msec/call = 0.03, calls/sec = 40624.18 > nuttcp-t: 0.0user 4.9sys 0:05real 99% 106i+1428d 620maxrss 0+2pf 1+82csw > > > nuttcp-r: v6.1.2: socket > nuttcp-r: buflen=65536, nstream=1, port=5001 tcp > nuttcp-r: accept from 127.0.0.1 > nuttcp-r: send window size = 43008, receive window size = 143360 > nuttcp-r: 12748.0625 MB in 5.15 real seconds = 2536511.81 KB/sec = > 20779.1048 Mbps > nuttcp-r: 297000 I/O calls, msec/call = 0.02, calls/sec = 57709.75 > nuttcp-r: 0.1user 4.0sys 0:05real 81% 109i+1469d 626maxrss 0+15pf > 121136+34csw > > Noticeable improvement. 
> > > > >> blade 2: >> >> # nuttcp -t -T 5 -w 128 -v 10.2.101.12 >> nuttcp-t: v6.1.2: socket >> nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.12 >> nuttcp-t: time limit = 5.00 seconds >> nuttcp-t: connect to 10.2.101.12 with mss=1448, RTT=0.059 ms >> nuttcp-t: send window size = 131768, receive window size = 66608 >> nuttcp-t: 1340.6469 MB in 5.02 real seconds = 273449.90 KB/sec = >> 2240.1016 Mbps >> nuttcp-t: host-retrans = 171 >> nuttcp-t: 21451 I/O calls, msec/call = 0.24, calls/sec = 4272.78 >> nuttcp-t: 0.0user 1.9sys 0:05real 39% 120i+1610d 600maxrss 2+3pf >> 75658+0csw >> >> nuttcp-r: v6.1.2: socket >> nuttcp-r: buflen=65536, nstream=1, port=5001 tcp >> nuttcp-r: accept from 10.2.101.11 >> nuttcp-r: send window size = 33304, receive window size = 131768 >> nuttcp-r: 1340.6469 MB in 5.17 real seconds = 265292.92 KB/sec = >> 2173.2796 Mbps >> nuttcp-r: 408764 I/O calls, msec/call = 0.01, calls/sec = 78992.15 >> nuttcp-r: 0.0user 3.3sys 0:05real 64% 105i+1413d 620maxrss 0+15pf >> 105104+102csw >> > > # nuttcp -t -T 5 -w 128 -v 10.2.101.11 > nuttcp-t: v6.1.2: socket > nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11 > > nuttcp-t: time limit = 5.00 seconds > nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.055 ms > > nuttcp-t: send window size = 131768, receive window size = 66608 > nuttcp-t: 1964.8640 MB in 5.02 real seconds = 400757.59 KB/sec = 3283.0062 > Mbps > nuttcp-t: host-retrans = 0 > nuttcp-t: 31438 I/O calls, msec/call = 0.16, calls/sec = 6261.87 > nuttcp-t: 0.0user 2.7sys 0:05real 55% 112i+1501d 1124maxrss 1+2pf > 65+112csw > > > nuttcp-r: v6.1.2: socket > nuttcp-r: buflen=65536, nstream=1, port=5001 tcp > nuttcp-r: accept from 10.2.101.12 > > nuttcp-r: send window size = 33304, receive window size = 131768 > nuttcp-r: 1964.8640 MB in 5.15 real seconds = 390972.20 KB/sec = 3202.8442 > Mbps > nuttcp-r: 560718 I/O calls, msec/call = 0.01, calls/sec = 108957.70 > nuttcp-r: 0.1user 4.2sys 0:05real 84% 111i+1494d 626maxrss 0+15pf > 151930+16csw > > Again, improvement. > > > >> >> Another pari of blades: >> >> FreeBSD 8.2-STABLE #0: Tue Aug 9 12:37:55 EEST 2011 >> CPU: AMD Opteron(tm) Processor 6134 (2300.04-MHz K8-class CPU) >> real memory = 68719476736 (65536 MB) >> >> first blade: >> >> # nuttcp -S >> # nuttcp -t -T 5 -w 128 -v localhost >> nuttcp-t: v6.1.2: socket >> nuttcp-t: buflen=65
Re: datapoints on 10G throughput with TCP ?
Some tests with updated FreeBSD to 8-stable as of today, compared with the previous run On 06.12.11 13:18, Daniel Kalchev wrote: FreeBSD 8.2-STABLE #0: Wed Sep 28 11:23:59 EEST 2011 CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz (2403.58-MHz K8-class CPU) real memory = 51539607552 (49152 MB) blade 1: # nuttcp -S # nuttcp -t -T 5 -w 128 -v localhost nuttcp-t: v6.1.2: socket nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> localhost nuttcp-t: time limit = 5.00 seconds nuttcp-t: connect to 127.0.0.1 with mss=14336, RTT=0.044 ms nuttcp-t: send window size = 143360, receive window size = 71680 nuttcp-t: 8959.8750 MB in 5.02 real seconds = 1827635.67 KB/sec = 14971.9914 Mbps nuttcp-t: host-retrans = 0 nuttcp-t: 143358 I/O calls, msec/call = 0.04, calls/sec = 28556.81 nuttcp-t: 0.0user 4.9sys 0:05real 99% 106i+1428d 602maxrss 0+5pf 16+46csw nuttcp-r: v6.1.2: socket nuttcp-r: buflen=65536, nstream=1, port=5001 tcp nuttcp-r: accept from 127.0.0.1 nuttcp-r: send window size = 43008, receive window size = 143360 nuttcp-r: 8959.8750 MB in 5.17 real seconds = 1773171.07 KB/sec = 14525.8174 Mbps nuttcp-r: 219708 I/O calls, msec/call = 0.02, calls/sec = 42461.43 nuttcp-r: 0.0user 3.8sys 0:05real 76% 105i+1407d 614maxrss 1+17pf 95059+22csw New results: FreeBSD 8.2-STABLE #1: Tue Dec 6 13:51:01 EET 2011 # nuttcp -t -T 5 -w 128 -v localhost nuttcp-t: v6.1.2: socket nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> localhost nuttcp-t: time limit = 5.00 seconds nuttcp-t: connect to 127.0.0.1 with mss=14336, RTT=0.030 ms nuttcp-t: send window size = 143360, receive window size = 71680 nuttcp-t: 12748.0625 MB in 5.02 real seconds = 2599947.38 KB/sec = 21298.7689 Mbps nuttcp-t: host-retrans = 0 nuttcp-t: 203969 I/O calls, msec/call = 0.03, calls/sec = 40624.18 nuttcp-t: 0.0user 4.9sys 0:05real 99% 106i+1428d 620maxrss 0+2pf 1+82csw nuttcp-r: v6.1.2: socket nuttcp-r: buflen=65536, nstream=1, port=5001 tcp nuttcp-r: accept from 127.0.0.1 nuttcp-r: send window size = 43008, receive window size = 143360 nuttcp-r: 12748.0625 MB in 5.15 real seconds = 2536511.81 KB/sec = 20779.1048 Mbps nuttcp-r: 297000 I/O calls, msec/call = 0.02, calls/sec = 57709.75 nuttcp-r: 0.1user 4.0sys 0:05real 81% 109i+1469d 626maxrss 0+15pf 121136+34csw Noticeable improvement. 
blade 2:

# nuttcp -t -T 5 -w 128 -v 10.2.101.12
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.12
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 10.2.101.12 with mss=1448, RTT=0.059 ms
nuttcp-t: send window size = 131768, receive window size = 66608
nuttcp-t: 1340.6469 MB in 5.02 real seconds = 273449.90 KB/sec = 2240.1016 Mbps
nuttcp-t: host-retrans = 171
nuttcp-t: 21451 I/O calls, msec/call = 0.24, calls/sec = 4272.78
nuttcp-t: 0.0user 1.9sys 0:05real 39% 120i+1610d 600maxrss 2+3pf 75658+0csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 10.2.101.11
nuttcp-r: send window size = 33304, receive window size = 131768
nuttcp-r: 1340.6469 MB in 5.17 real seconds = 265292.92 KB/sec = 2173.2796 Mbps
nuttcp-r: 408764 I/O calls, msec/call = 0.01, calls/sec = 78992.15
nuttcp-r: 0.0user 3.3sys 0:05real 64% 105i+1413d 620maxrss 0+15pf 105104+102csw

# nuttcp -t -T 5 -w 128 -v 10.2.101.11
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.11
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 10.2.101.11 with mss=1448, RTT=0.055 ms
nuttcp-t: send window size = 131768, receive window size = 66608
nuttcp-t: 1964.8640 MB in 5.02 real seconds = 400757.59 KB/sec = 3283.0062 Mbps
nuttcp-t: host-retrans = 0
nuttcp-t: 31438 I/O calls, msec/call = 0.16, calls/sec = 6261.87
nuttcp-t: 0.0user 2.7sys 0:05real 55% 112i+1501d 1124maxrss 1+2pf 65+112csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 10.2.101.12
nuttcp-r: send window size = 33304, receive window size = 131768
nuttcp-r: 1964.8640 MB in 5.15 real seconds = 390972.20 KB/sec = 3202.8442 Mbps
nuttcp-r: 560718 I/O calls, msec/call = 0.01, calls/sec = 108957.70
nuttcp-r: 0.1user 4.2sys 0:05real 84% 111i+1494d 626maxrss 0+15pf 151930+16csw

Again, improvement.

Another pair of blades:

FreeBSD 8.2-STABLE #0: Tue Aug 9 12:37:55 EEST 2011
CPU: AMD Opteron(tm) Processor 6134 (2300.04-MHz K8-class CPU)
real memory = 68719476736 (65536 MB)

first blade:

# nuttcp -S
# nuttcp -t -T 5 -w 128 -v localhost
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65
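For anyone who wants to repeat the before/after comparison, the relevant figure can be pulled out of a run with a trivial wrapper (only a sketch; it assumes nuttcp from ports and the same -w 128 window used above):

  #!/bin/sh
  # run one 5 second test against $1 (default localhost) and print only
  # the transmitter throughput figure from the nuttcp output
  host=${1:-localhost}
  nuttcp -t -T 5 -w 128 -v "$host" | awk '/^nuttcp-t:.*Mbps/ { print $(NF-1), $NF }'

Run it once on the old kernel and once after the update and compare the two numbers.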
Re: datapoints on 10G throughput with TCP ?
On 06.12.11 13:18, Daniel Kalchev wrote:
[...]

second blade:

# nuttcp -t -T 5 -w 128 -v 10.2.101.13
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.13
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 10.2.101.13 with mss=1448, RTT=0.164 ms
nuttcp-t: send window size = 131768, receive window size = 66608
nuttcp-t: 1290.3750 MB in 5.00 real seconds = 264173.96 KB/sec = 2164.1131 Mbps
nuttcp-t: host-retrans = 0
nuttcp-t: 20646 I/O calls, msec/call = 0.25, calls/sec = 4127.72
nuttcp-t: 0.0user 3.8sys 0:05real 77% 96i+1299d 616maxrss 0+3pf 27389+0csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 10.2.101.14
nuttcp-r: send window size = 33304, receive window size = 131768
nuttcp-r: 1290.3750 MB in 5.14 real seconds = 256835.92 KB/sec = 2103.9998 Mbps
nuttcp-r: 85668 I/O calls, msec/call = 0.06, calls/sec = 16651.70
nuttcp-r: 0.0user 4.8sys 0:05real 94% 107i+1437d 624maxrss 0+15pf 11848+0csw

Not impressive... I am rebuilding now to -stable.

Daniel

I also noticed interrupt storms happening while this was running on the second pair of blades:

interrupt storm detected on "irq272:"; throttling interrupt source
interrupt storm detected on "irq272:"; throttling interrupt source
interrupt storm detected on "irq272:"; throttling interrupt source
interrupt storm detected on "irq270:"; throttling interrupt source
interrupt storm detected on "irq270:"; throttling interrupt source
interrupt storm detected on "irq270:"; throttling interrupt source

some stats:

# sysctl -a dev.ix.1
dev.ix.1.%desc: Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.3.10
dev.ix.1.%driver: ix
dev.ix.1.%location: slot=0 function=1
dev.ix.1.%pnpinfo: vendor=0x8086 device=0x10fc subvendor=0x subdevice=0x class=0x02
dev.ix.1.%parent: pci3
dev.ix.1.flow_control: 3
dev.ix.1.advertise_gig: 0
dev.ix.1.enable_aim: 1
dev.ix.1.rx_processing_limit: 128
dev.ix.1.dropped: 0
dev.ix.1.mbuf_defrag_failed: 0
dev.ix.1.no_tx_dma_setup: 0
dev.ix.1.watchdog_events: 0
dev.ix.1.tso_tx: 1193460
dev.ix.1.link_irq: 1
dev.ix.1.queue0.interrupt_rate: 100
dev.ix.1.queue0.txd_head: 45
dev.ix.1.queue0.txd_tail: 45
dev.ix.1.queue0.no_desc_avail: 0
dev.ix.1.queue0.tx_packets: 23
dev.ix.1.queue0.rxd_head: 16
dev.ix.1.queue0.rxd_tail: 15
dev.ix.1.queue0.rx_packets: 16
dev.ix.1.queue0.rx_bytes: 2029
dev.ix.1.queue0.lro_queued: 0
dev.ix.1.queue0.lro_flushed: 0
dev.ix.1.queue1.interrupt_rate: 62500
dev.ix.1.queue1.txd_head: 0
dev.ix.1.queue1.txd_tail: 0
dev.ix.1.queue1.no_desc_avail: 0
dev.ix.1.queue1.tx_packets: 0
dev.ix.1.queue1.rxd_head: 0
dev.ix.1.queue1.rxd_tail: 2047
dev.ix.1.queue1.rx_packets: 0
dev.ix.1.queue1.rx_bytes: 0
dev.ix.1.queue1.lro_queued: 0
dev.ix.1.queue1.lro_flushed: 0
dev.ix.1.queue2.interrupt_rate: 20
dev.ix.1.queue2.txd_head: 545
dev.ix.1.queue2.txd_tail: 545
dev.ix.1.queue2.no_desc_avail: 0
dev.ix.1.queue2.tx_packets: 331690
dev.ix.1.queue2.rxd_head: 1099
dev.ix.1.queue2.rxd_tail: 1098
dev.ix.1.queue2.rx_packets: 498763
dev.ix.1.queue2.rx_bytes: 32954702
dev.ix.1.queue2.lro_queued: 0
dev.ix.1.queue2.lro_flushed: 0
dev.ix.1.queue3.interrupt_rate: 62500
dev.ix.1.queue3.txd_head: 0
dev.ix.1.queue3.txd_tail: 0
dev.ix.1.queue3.no_desc_avail: 0
dev.ix.1.queue3.tx_packets: 0
dev.ix.1.queue3.rxd_head: 0
dev.ix.1.queue3.rxd_tail: 2047
dev.ix.1.queue3.rx_packets: 0
dev.ix.1.queue3.rx_bytes: 0
dev.ix.1.queue3.lro_queued: 0
dev.ix.1.queue3.lro_flushed: 0
dev.ix.1.queue4.interrupt_rate: 100
dev.ix.1.queue4.txd_head: 13
dev.ix.1.queue4.txd_tail: 13
dev.ix.1.queue4.no_desc_avail: 0
dev.ix.1.queue4.tx_packets: 6
dev.ix.1.queue4.rxd_head: 6
dev.ix.1.queue4.rxd_tail: 5
dev.ix.1.queue4.rx_packets: 6
dev.ix.1.queue4.rx_bytes: 899
dev.ix.1.queue4.lro_queued: 0
dev.ix.1.queue4.lro_flushed: 0
dev.ix.1.queue5.interrupt_rate: 20
dev.ix.1.queue5.txd_head: 982
dev.ix.1.queue5.txd_tail: 982
dev.ix.1.queue5.no_desc_avail: 0
dev.ix.1.queue5.tx_packets: 302592
dev.ix.1.queue5.rxd_head: 956
dev.ix.1.queue5.rxd_tail: 955
dev.ix.1.queue5.rx_packets: 474044
dev.ix.1.queue5.rx_bytes: 31319840
dev.ix.1.queue5.lro_queued: 0
dev.ix.1.queue5.lro_flushed: 0
dev.ix.1.queue6.interrupt_rate: 20
dev.ix.1.queue6.txd_head: 1902
dev.ix.1.queue6.txd_tail: 1902
dev.ix.1.queue6.no_desc_avail: 0
dev.ix.1.queue6.tx_packets: 184922
dev.ix.1.queue6.rxd_head: 1410
dev.ix.1.queue6.rxd_tail: 1409
dev.ix.1.queue6.rx_packets: 402818
dev.ix.1.queue6.rx_bytes: 27759640
dev.ix.1.queue6.lro_queued: 0
dev.ix.1.queue6.lro_flushed: 0
dev.ix.1.queue7.interrupt_rate: 20
dev.ix.1.queue7.txd_head: 660
dev.ix.1.queue7.txd_tail: 660
dev.ix.1.queue7.no_desc_avail: 0
dev.ix.1.queue7.tx_packets: 378078
dev.ix.1.queue7.rxd_head: 885
dev.ix.1.queue7.rxd_tail: 884
dev.ix.1.queue7.rx_packets: 705397
dev.ix.1.queue7.rx_bytes: 46572290
dev.ix.1.queue7.lro_queued: 0
dev.ix.1.queue7.lro_flushed: 0
dev.ix.1.mac_stats.crc_errs: 0
dev.ix.1.mac_stats.ill_errs: 0
dev.ix.1.mac_stats.byt
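To see which queue those storms map to while a test is running, it helps to sample the per-queue counters a few times during the transfer; a rough sketch (device dev.ix.1 as in the dump above, the one second interval is arbitrary):

  #!/bin/sh
  # print the per-IRQ totals and the per-queue interrupt rate and packet
  # counters once per second, so a storming queue stands out
  while :; do
      date
      vmstat -i | grep ix
      sysctl dev.ix.1 | grep -E 'queue[0-7]\.(interrupt_rate|rx_packets|tx_packets)'
      sleep 1
  done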
Re: datapoints on 10G throughput with TCP ?
Here is what I get, with an existing install, no tuning other than kern.ipc.nmbclusters=512000.

Pair of Supermicro blades:

FreeBSD 8.2-STABLE #0: Wed Sep 28 11:23:59 EEST 2011
CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz (2403.58-MHz K8-class CPU)
real memory = 51539607552 (49152 MB)
[...]
ix0: port 0xdc00-0xdc1f mem 0xfbc0-0xfbdf,0xfbbfc000-0xfbbf irq 16 at device 0.0 on pci3
ix0: Using MSIX interrupts with 9 vectors
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: Ethernet address: xx:xx:xx:xx:xx:xx
ix0: PCI Express Bus: Speed 5.0Gb/s Width x8
ix1: port 0xd880-0xd89f mem 0xfb80-0xfb9f,0xfbbf8000-0xfbbfbfff irq 17 at device 0.1 on pci3
ix1: Using MSIX interrupts with 9 vectors
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: Ethernet address: xx:xx:xx:xx:xx:xx
ix1: PCI Express Bus: Speed 5.0Gb/s Width x8

blade 1:

# nuttcp -S
# nuttcp -t -T 5 -w 128 -v localhost
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> localhost
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 127.0.0.1 with mss=14336, RTT=0.044 ms
nuttcp-t: send window size = 143360, receive window size = 71680
nuttcp-t: 8959.8750 MB in 5.02 real seconds = 1827635.67 KB/sec = 14971.9914 Mbps
nuttcp-t: host-retrans = 0
nuttcp-t: 143358 I/O calls, msec/call = 0.04, calls/sec = 28556.81
nuttcp-t: 0.0user 4.9sys 0:05real 99% 106i+1428d 602maxrss 0+5pf 16+46csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 127.0.0.1
nuttcp-r: send window size = 43008, receive window size = 143360
nuttcp-r: 8959.8750 MB in 5.17 real seconds = 1773171.07 KB/sec = 14525.8174 Mbps
nuttcp-r: 219708 I/O calls, msec/call = 0.02, calls/sec = 42461.43
nuttcp-r: 0.0user 3.8sys 0:05real 76% 105i+1407d 614maxrss 1+17pf 95059+22csw

blade 2:

# nuttcp -t -T 5 -w 128 -v 10.2.101.12
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.12
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 10.2.101.12 with mss=1448, RTT=0.059 ms
nuttcp-t: send window size = 131768, receive window size = 66608
nuttcp-t: 1340.6469 MB in 5.02 real seconds = 273449.90 KB/sec = 2240.1016 Mbps
nuttcp-t: host-retrans = 171
nuttcp-t: 21451 I/O calls, msec/call = 0.24, calls/sec = 4272.78
nuttcp-t: 0.0user 1.9sys 0:05real 39% 120i+1610d 600maxrss 2+3pf 75658+0csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 10.2.101.11
nuttcp-r: send window size = 33304, receive window size = 131768
nuttcp-r: 1340.6469 MB in 5.17 real seconds = 265292.92 KB/sec = 2173.2796 Mbps
nuttcp-r: 408764 I/O calls, msec/call = 0.01, calls/sec = 78992.15
nuttcp-r: 0.0user 3.3sys 0:05real 64% 105i+1413d 620maxrss 0+15pf 105104+102csw

Another pair of blades:

FreeBSD 8.2-STABLE #0: Tue Aug 9 12:37:55 EEST 2011
CPU: AMD Opteron(tm) Processor 6134 (2300.04-MHz K8-class CPU)
real memory = 68719476736 (65536 MB)
[...]
ix0: port 0xe400-0xe41f mem 0xfe60-0xfe7f,0xfe4fc000-0xfe4f irq 19 at device 0.0 on pci3
ix0: Using MSIX interrupts with 9 vectors
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: [ITHREAD]
ix0: Ethernet address: xx:xx:xx:xx:xx:xx
ix0: PCI Express Bus: Speed 5.0Gb/s Width x8
ix1: port 0xe800-0xe81f mem 0xfea0-0xfebf,0xfe8fc000-0xfe8f irq 16 at device 0.1 on pci3
ix1: Using MSIX interrupts with 9 vectors
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: [ITHREAD]
ix1: Ethernet address: xx:xx:xx:xx:xx:xx
ix1: PCI Express Bus: Speed 5.0Gb/s Width x8

first blade:

# nuttcp -S
# nuttcp -t -T 5 -w 128 -v localhost
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> localhost
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 127.0.0.1 with mss=14336, RTT=0.090 ms
nuttcp-t: send window size = 143360, receive window size = 71680
nuttcp-t: 2695.0625 MB in 5.00 real seconds = 551756.90 KB/sec = 4519.9925 Mbps
nuttcp-t: host-retrans = 0
nuttcp-t: 43121 I/O calls, msec/call = 0.12, calls/sec = 8621.20
nuttcp-t: 0.0user 4.9sys 0:05real 99% 106i+1428d 620maxrss 0+4pf 2+71csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 127.0.0.1
nuttcp-r: send window size = 43008, receive window size = 143360
nuttcp-r: 2695.0625 MB in 5.14 real seconds = 536509.66 KB/sec = 4395.0871 Mbps
nuttcp-r: 43126 I/O calls, msec/call = 0.12, calls/sec = 8383.94
nuttcp-r: 0.0user 3.1sys 0:05real 61% 94i+1264d 624maxrss 1+16pf 43019+0csw

second blade:

# nuttcp -t -T 5 -w 128 -v 10.2.101.13
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.2.101.13
nuttcp-t: time
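For the record, a knob like the one mentioned above normally goes into /boot/loader.conf so it is in effect from boot (value as used here; whether it matters at these rates is a separate question):

  # /boot/loader.conf
  kern.ipc.nmbclusters="512000"

Whether mbuf clusters are anywhere near exhaustion during a run can be checked with "netstat -m" on both ends while the test is in progress.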
Re: datapoints on 10G throughput with TCP ?
You can't get line rate with ixgbe? In what configuration/hardware?
We surely do get line rate in validation here, but it's sensitive to your hardware and config.

Jack

On Mon, Dec 5, 2011 at 2:28 PM, Luigi Rizzo wrote:
> On Mon, Dec 05, 2011 at 11:15:09PM +0200, Daniel Kalchev wrote:
> >
> > On Dec 5, 2011, at 9:27 PM, Luigi Rizzo wrote:
> >
> > > - have two machines connected by a 10G link
> > > - on one run "nuttcp -S"
> > > - on the other one run "nuttcp -t -T 5 -w 128 -v the.other.ip"
> > >
> >
> > Any particular tuning of FreeBSD?
>
> actually my point is first to see how good or bad are the defaults.
>
> I have noticed that setting hw.ixgbe.max_interrupt_rate=0
> (it is a tunable, you need to do it before loading the module)
> improves the throughput by a fair amount (but still way below
> line rate with 1500 byte packets).
>
> other things (larger windows) don't seem to help much.
>
> cheers
> luigi
Re: datapoints on 10G throughput with TCP ?
On Mon, Dec 05, 2011 at 03:08:54PM -0800, Jack Vogel wrote:
> You can't get line rate with ixgbe? In what configuration/hardware?
> We surely do get line rate in validation here, but it's sensitive to
> your hardware and config.

sources from HEAD as of a week or so, default parameter settings, 82599 on an Intel dual port 10G card, Intel i7-870 CPU (4 cores) at 2.93 GHz, on an Asus MB with the card in a PCIe-x16 slot, MTU=1500 bytes. Same hardware, same defaults and nuttcp on linux does 8.5 Gbit/s.

I can do line rate with a single flow if i use MTU=9000 and set max_interrupt_rate=0 (even reducing the CPU speed to 1.2 GHz). I can saturate the link with multiple flows (say nuttcp -N 8).

cheers
luigi

> Jack
>
> On Mon, Dec 5, 2011 at 2:28 PM, Luigi Rizzo wrote:
>
> > On Mon, Dec 05, 2011 at 11:15:09PM +0200, Daniel Kalchev wrote:
> > >
> > > On Dec 5, 2011, at 9:27 PM, Luigi Rizzo wrote:
> > >
> > > > - have two machines connected by a 10G link
> > > > - on one run "nuttcp -S"
> > > > - on the other one run "nuttcp -t -T 5 -w 128 -v the.other.ip"
> > > >
> > >
> > > Any particular tuning of FreeBSD?
> >
> > actually my point is first to see how good or bad are the defaults.
> >
> > I have noticed that setting hw.ixgbe.max_interrupt_rate=0
> > (it is a tunable, you need to do it before loading the module)
> > improves the throughput by a fair amount (but still way below
> > line rate with 1500 byte packets).
> >
> > other things (larger windows) don't seem to help much.
> >
> > cheers
> > luigi
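For completeness, the combination that gives line rate with a single flow here is roughly the following (interface name ix0 and the 10.0.1.2 address are just examples, and the tunable has to be in place before the driver attaches):

  # /boot/loader.conf
  hw.ixgbe.max_interrupt_rate="0"

  # after boot, enable jumbo frames on both ends
  ifconfig ix0 mtu 9000

  # single flow at MTU 9000:
  nuttcp -t -T 5 -w 128 -v 10.0.1.2
  # or several parallel streams at MTU 1500:
  nuttcp -N 8 -t -T 5 -w 128 -v 10.0.1.2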
Re: datapoints on 10G throughput with TCP ?
On Mon, Dec 05, 2011 at 11:15:09PM +0200, Daniel Kalchev wrote:
>
> On Dec 5, 2011, at 9:27 PM, Luigi Rizzo wrote:
>
> > - have two machines connected by a 10G link
> > - on one run "nuttcp -S"
> > - on the other one run "nuttcp -t -T 5 -w 128 -v the.other.ip"
> >
>
> Any particular tuning of FreeBSD?

actually my point is first to see how good or bad are the defaults.

I have noticed that setting hw.ixgbe.max_interrupt_rate=0 (it is a tunable, you need to do it before loading the module) improves the throughput by a fair amount (but still way below line rate with 1500 byte packets).

other things (larger windows) don't seem to help much.

cheers
luigi
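Concretely, since the value is read when the driver initializes, it has to go into the loader environment, e.g. (module name ixgbe as on 8.x/9.x; drop the _load line if the driver is compiled into the kernel):

  # /boot/loader.conf
  ixgbe_load="YES"
  hw.ixgbe.max_interrupt_rate="0"

or, without a reboot and assuming the driver is not already loaded, set it in the kernel environment before loading the module by hand:

  kenv hw.ixgbe.max_interrupt_rate=0
  kldload ixgbe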
Re: datapoints on 10G throughput with TCP ?
On Dec 5, 2011, at 9:27 PM, Luigi Rizzo wrote:

> - have two machines connected by a 10G link
> - on one run "nuttcp -S"
> - on the other one run "nuttcp -t -T 5 -w 128 -v the.other.ip"
>

Any particular tuning of FreeBSD?

Daniel
datapoints on 10G throughput with TCP ?
Hi,
I am trying to establish the baseline performance for 10G throughput over TCP, and would like to collect some data points. As a testing program i am using nuttcp from ports (as good as anything, i guess -- it is reasonably flexible, and if you use it in TCP with relatively large writes, the overhead for syscalls and gettimeofday() shouldn't kill you).

I'd be very grateful if you could do the following test:
- have two machines connected by a 10G link
- on one run "nuttcp -S"
- on the other one run "nuttcp -t -T 5 -w 128 -v the.other.ip"
and send me a dump of the output, such as the one(s) at the end of the message.

I am mostly interested in two configurations:

- one over loopback, which should tell how fast the CPU+memory is. As an example, one of my machines does about 15 Gbit/s, and one of the faster ones does about 44 Gbit/s.

- one over the wire using 1500 byte mss. Here it really matters how good the handling of small MTUs is. As a data point, on my machines i get 2..3.5 Gbit/s on the "slow" machine with a 1500 byte mtu and default card setting. Clearing the interrupt mitigation register (so no mitigation) brings the rate to 5-6 Gbit/s. Same hardware with linux does about 8 Gbit/s. HEAD seems 10-20% slower than RELENG_8 though i am not sure who is at fault.

The receive side is particularly critical - on FreeBSD the receiver is woken up every two packets (do the math below, between the number of rx calls and throughput and mss), resulting in almost 200K activations per second, and this despite the fact that interrupt mitigation is set to a much lower value (so incoming packets should be batched). On linux, i see much fewer reads, presumably the process is woken up only at the end of a burst.

cheers
luigi

EXAMPLES OF OUTPUT --

> nuttcp -t -T 5 -w 128 -v 10.0.1.2
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 10.0.1.2
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 10.0.1.2 with mss=1460, RTT=0.103 ms
nuttcp-t: send window size = 131400, receive window size = 65700
nuttcp-t: 3095.0982 MB in 5.00 real seconds = 633785.85 KB/sec = 5191.9737 Mbps
nuttcp-t: host-retrans = 0
nuttcp-t: 49522 I/O calls, msec/call = 0.10, calls/sec = 9902.99
nuttcp-t: 0.0user 2.7sys 0:05real 54% 100i+2639d 752maxrss 0+3pf 258876+6csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 10.0.1.4
nuttcp-r: send window size = 33580, receive window size = 131400
nuttcp-r: 3095.0982 MB in 5.17 real seconds = 613526.42 KB/sec = 5026.0084 Mbps
nuttcp-r: 1114794 I/O calls, msec/call = 0.00, calls/sec = 215801.03
nuttcp-r: 0.1user 3.5sys 0:05real 69% 112i+1104d 626maxrss 0+15pf 507653+188csw

> nuttcp -t -T 5 -w 128 -v localhost
nuttcp-t: v6.1.2: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> localhost
nuttcp-t: time limit = 5.00 seconds
nuttcp-t: connect to 127.0.0.1 with mss=14336, RTT=0.051 ms
nuttcp-t: send window size = 143360, receive window size = 71680
nuttcp-t: 26963.4375 MB in 5.00 real seconds = 5521440.59 KB/sec = 45231.6413 Mbps
nuttcp-t: host-retrans = 0
nuttcp-t: 431415 I/O calls, msec/call = 0.01, calls/sec = 86272.51
nuttcp-t: 0.0user 4.6sys 0:05real 93% 102i+2681d 774maxrss 0+3pf 2510+1csw

nuttcp-r: v6.1.2: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 127.0.0.1
nuttcp-r: send window size = 43008, receive window size = 143360
nuttcp-r: 26963.4375 MB in 5.20 real seconds = 5313135.74 KB/sec = 43525.2080 Mbps
nuttcp-r: 767807 I/O calls, msec/call = 0.01, calls/sec = 147750.09
nuttcp-r: 0.1user 3.9sys 0:05real 79% 98i+2570d 772maxrss 0+16pf 311014+8csw
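To spell out the "do the math" remark with the 10.0.1.2 numbers above: 5026 Mbit/s at mss 1460 is about 430K segments per second, while the receiver did 215801 reads per second, i.e. roughly two segments per wakeup. The same arithmetic as a one-liner:

  awk 'BEGIN { mbps=5026.0084; mss=1460; calls=215801.03;
               segs = mbps * 1e6 / 8 / mss;
               printf "%.0f segments/s, %.2f segments per read\n", segs, segs/calls }'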